Installing a glite 3.1 WMS on Centos4 (64 bit)

This install log refers to 3.1.30-0.slc4.

1. Installing the software
cd /etc/yum.repos.d
wget http://grid-deployment.web.cern.ch/grid-deployment/glite/repos/3.1/glite-WMS.repo
In glite-WMS.repo, change $basearch to i386.
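This can be done by hand or scripted; a one-liner sketch (assuming the repo file still contains the literal $basearch string):
sed -i 's/\$basearch/i386/g' glite-WMS.repo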
wget http://grid-deployment.web.cern.ch/grid-deployment/glite/repos/3.1/lcg-CA.repo
I also need the DAG repo for log4cpp.
yum install lcg-CA
To avoid having the whole thing fail due to unsigned packages etc., set gpgcheck=0 in /etc/yum.conf.
yum install glite-WMS
This will install without error; unfortunately, when running yaim, this error appears:
Syntax error on line 17 of /opt/glite/etc/glite_wms_wmproxy_httpd.conf:
Cannot load /usr/lib/httpd/modules/mod_log_config.so into server: /usr/lib/httpd/modules/mod_log_config.so: cannot open shared object file: No such file or directory
This is because it needs httpd.i386 and not httpd.x86_64.
By the time I discovered this, yum had already installed httpd.x86_64 and would not let me install httpd.i386 alongside it, even after I added the 32-bit repository.
In the end, this worked:
yum remove httpd.x86_64
yum install httpd.i386
yum install mod_ssl.i386
yum install glite-WMS (to re-add all the packages that 'yum remove httpd.x86_64' had removed)

2. Host certificate
cd /etc/grid-security
openssl pkcs12 -clcerts -nokeys -out hostcert.pem -in wms01.grid.p12
openssl pkcs12 -nocerts -nodes -out hostkey.pem -in wms01.grid.p12
chmod 600 hostkey.pem
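As a quick sanity check (not part of the original recipe), the subject and validity of the converted certificate can be inspected with:
openssl x509 -in hostcert.pem -noout -subject -dates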

3. Configuration
(a) The WMS users are distinct from the grid users on the rest of the system - they just follow the same naming convention for convenience. It pays to run getent passwd at the beginning to make sure that only local users exist.
As listed under 'known issues', the WMS cannot use static accounts but must use pool accounts instead. And although sgm and prd users have no special privileges on the WMS, they must still have their own Unix group.
Unfortunately yaim is not very good at cleaning up after itself, so in case of 'user confusion', to get back to a clean slate, do:
for user in `getent passwd | grep -E "lt2-" | awk 'BEGIN{ FS=":"}{ print $1 }'`; do /usr/sbin/userdel $user; done
for group in `cat /etc/group | grep -E "lt2-" | awk 'BEGIN{ FS=":"}{print $1 }'`; do /usr/sbin/groupdel $group; done
cd /etc/grid-security/gridmapdir
rm -rf *
cd /home
rm -rf lt2* (do not delete glite, edguser and edginfo!)
before re-running yaim.
Here are the links to the current users.conf, groups.conf and the program (and config file) that made them. The config file format is: [VO name] [unix group] [gid] [first user id].
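For illustration, a line in that config file might look like this (hypothetical values, not our production IDs):
atlas lt2-atlas 61000 61001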

(b)
mkdir /opt/glite/yaim/siteinfo
chmod 600 /opt/glite/yaim/siteinfo
Here's the siteinfo.def, plus the standard vo.d directory (which must sit in the same directory as siteinfo.def).
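For orientation, the WMS-relevant entries of a siteinfo.def look roughly like this (placeholder values only; PX_HOST is RAL's myproxy server mentioned further down):
SITE_NAME=my-site
WMS_HOST=wms01.example.com
LB_HOST=lb01.example.com:9000
PX_HOST=lcgrbp01.gridpp.rl.ac.uk
MYSQL_PASSWORD=unique_per_machine
GRIDFTP_CONNECTIONS_MAX=500
VOS="atlas cms"
USERS_CONF=/opt/glite/yaim/siteinfo/users.conf
GROUPS_CONF=/opt/glite/yaim/siteinfo/groups.conf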

Done? Then run yaim:
/opt/glite/yaim/bin/yaim -c -s /opt/glite/yaim/siteinfo/siteinfo-wms01.def -n glite-WMS

4. Open ports
# wms
-A RH-Firewall-1-INPUT -p tcp --dport 20000:25000 -j ACCEPT
-A RH-Firewall-1-INPUT -p tcp --dport 7443 -j ACCEPT
-A RH-Firewall-1-INPUT -p tcp --dport 2811 -j ACCEPT
-A RH-Firewall-1-INPUT -p tcp --dport 2170 -j ACCEPT
-A RH-Firewall-1-INPUT -p tcp --dport 9000:9003 -j ACCEPT
-A RH-Firewall-1-INPUT -p tcp --dport 8443 -j ACCEPT
-A RH-Firewall-1-INPUT -p tcp --dport 5120 -j ACCEPT
-A RH-Firewall-1-INPUT -p tcp --dport 9618 -j ACCEPT
-A RH-Firewall-1-INPUT -p udp --dport 9618 -j ACCEPT
# mysql
-A RH-Firewall-1-INPUT -p tcp --dport 3306 -j ACCEPT

5. Hacks
By default only the input, but not the output, sandbox size is limited, so nothing stops a user from dragging several GB home in their output sandbox. It pays to set MaxOutputSandboxSize = 30000000; in /opt/glite/etc/glite_wms.conf (i.e. 3x the input limit).
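From memory, the parameter belongs in the WorkloadManagerProxy section; a sketch of the edit (verify against your own glite_wms.conf):
WorkloadManagerProxy = [
    MaxOutputSandboxSize = 30000000;
];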

MyProxy
To be able to use the myproxy server at RAL, you need to email support [insert obvious character here] gridpp.rl.ac.uk with the DN of the WMS.

** Maintenance **

Jobs getting stuck
To check what's actually going on:
condor_q
To forcefully delete a job: condor_rm [job id], followed by condor_rm -forcex [job id].
The job id can be retrieved e.g. from condor_q -hold:
-- Submitter: wms01.hep.ph.ic.ac.uk:
ID OWNER HELD_SINCE HOLD_REASON
51178.0 glite 7/7 15:57 Failed to get expiration time of proxy
51226.0 glite 7/7 15:57 Failed to get expiration time of proxy
51254.0 glite 7/7 15:57 Failed to get expiration time of proxy
51255.0 glite 7/7 15:57 Failed to get expiration time of proxy
52946.0 glite 7/13 11:32 Globus error 12: the connection to the serv
61774.0 glite 8/7 20:51 Globus error 131: the user proxy expired (j
62339.0 glite 8/10 06:00 Globus error 131: the user proxy expired (j

Other useful combinations: condor_q -hold -long | grep -i usersubject

My script for cleaning up held jobs:
[root@wms01 ~]# cat clean_held_jobs.sh
#!/bin/bash

# Collect the Condor job IDs of all held jobs owned by the glite user.
CONDOR_CRAP=`condor_q -hold | grep glite | awk '{print $1}'`

for JOB_ID in $CONDOR_CRAP
do
    echo "Removing job: $JOB_ID"
    condor_rm $JOB_ID
    # sleep 2
    # -forcex forces the job out of the queue if the plain remove gets stuck.
    condor_rm -forcex $JOB_ID
done

How to drain a WMS
My WMS tends to be self-draining: by the time I realize there's a problem, all the jobs have gone :-(
Otherwise, create a /var/glite/.drain file that denies job submission to everyone (remove the file again to undrain):
cat /var/glite/.drain
<file>
<gacl>
  <entry>
    <any-user/>
    <deny><exec/></deny>
  </entry>
</gacl>
</file>


How to deal with a mysql database that got too big
(Not the most elegant way, but this is the grid we are talking about here):
To check the size:
ls -l /var/lib/mysql/
Stop gLite and mysqld, then:
rm -rf /var/lib/mysql/*
rerun yaim
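Spelled out, the sequence is roughly this (a sketch; the exact set of gLite init scripts may differ on your node):
service gLite stop
service mysqld stop
rm -rf /var/lib/mysql/*
service mysqld start
/opt/glite/yaim/bin/yaim -c -s /opt/glite/yaim/siteinfo/siteinfo-wms01.def -n glite-WMS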

Glasgow's WMS tips.
How to find out which user a job belongs to
If it's not on hold in the Condor queue, go to the LB.
Find your way through the database:
mysql -p
use lbserver20;
show tables;
(To explore the structure of any table:)
SHOW INDEX FROM jobs;
Find a user:
select userid from jobs where jobid='UoTC9SF01bakVz1qJWMltA';
and use result:
select cert_subj from users where userid='058aa5d770a13ec524c3f78fa2108e5e';
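The two lookups can also be combined into a single query (same tables as above, just joined):
select u.cert_subj from jobs j, users u where j.userid = u.userid and j.jobid='UoTC9SF01bakVz1qJWMltA';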

select * from states where jobid='UoTC9SF01bakVz1qJWMltA';
is not ideal but a step in the right direction.
Get some stats
/opt/glite/bin/queryStats -f "2011-02-08 00:00:00"


Previous version:

Installing a 3.1 WMS

This refers to glite-WMS-3.1.27-0.

1. Operating System
This machine is special in that it has Scientific Linux SL release 4.8 (Beryllium) hand-installed. We ain't doing that again. It uses the following repos:
adobe.repo, dag.repo, dries.repo, jpackage.repo, slc-base.repo, slc-update.repo, sl-fastbugs.repo, sl-testing.repo, atrpms.repo, glite-WMS.repo, lcg-CA.repo, sl-contrib.repo, sl-errata.repo, sl-errata.repo.exclude, sl.repo
which might be overkill and come back to haunt us, but we'll see. httpd* and mod_ssl are currently excluded in sl-errata.repo, due to this bug.
The (not very elegant) downgrade procedure was:
yum erase httpd
yum erase mod_ssl
cd downgrade/
[root@wms01 downgrade]# ls
httpd-2.0.52-41.sl4.6.i386.rpm httpd-suexec-2.0.52-41.sl4.6.i386.rpm mod_ssl-2.0.52-41.sl4.6.i386.rpm
rpm -ivh *.rpm
A more modern version of yum should handle this more gracefully.


2. Installing the software
(a) Install the host certificate.
(b) Get the repos:
cd /etc/yum.repos.d
wget http://grid-deployment.web.cern.ch/grid-deployment/glite/repos/3.1/glite-WMS.repo
wget http://grid-deployment.web.cern.ch/grid-deployment/glite/repos/3.1/lcg-CA.repo
(c) Get the software:
yum install lcg-CA
yum install glite-WMS

3. Users
The WMS users are distinct from the grid users on the rest of the system - they just follow the same naming convention for convenience. It pays to run getent passwd at the beginning to make sure that only local users exist.
As listed under 'known issues', the WMS cannot use static accounts but must use pool accounts instead. And although sgm and prd users have no special privileges on the WMS, they must still have their own Unix group.
Unfortunately yaim is not very good at cleaning up after itself, so in case of 'user confusion', to get back to a clean slate, do:
for user in `getent passwd | grep -E "lt2-" | awk 'BEGIN{ FS=":"}{ print $1 }'`; do /usr/sbin/userdel $user; done
for group in `cat /etc/group | grep -E "lt2-" | awk 'BEGIN{ FS=":"}{print $1 }'`; do /usr/sbin/groupdel $group; done
cd /etc/grid-security/gridmapdir
rm -rf *
cd /home
rm -rf lt2* (do not delete glite, edguser and edginfo!)
before re-running yaim.
Here are the links to the current users.conf, groups.conf and the program (and config file) that made them. The config file format is: [VO name] [unix group] [gid] [first user id].

4. Configuration
The necessary variables are listed in the documentation. Additionally it currently needs GRIDFTP_CONNECTIONS_MAX=500. The MYSQL_PASSWORD should be unique to the machine, i.e. it does not need to match the LB.
Here is a link to the siteinfo.def. It also needs the VOs in the vo.d directory. The fact that it needs a SW_DIR for each VO is a bug; these variables can be filled with dummy values.
/opt/glite/yaim/bin/yaim -c -s /opt/glite/yaim/siteinfo/siteinfo-wms01.def -n glite-WMS

5. Open ports
(No guarantees.)
-A RH-Firewall-1-INPUT -p tcp --dport 20000:25000 -j ACCEPT
-A RH-Firewall-1-INPUT -p tcp --dport 7443 -j ACCEPT
-A RH-Firewall-1-INPUT -p tcp --dport 2811 -j ACCEPT
-A RH-Firewall-1-INPUT -p tcp --dport 2170 -j ACCEPT
-A RH-Firewall-1-INPUT -p tcp --dport 9000:9003 -j ACCEPT
-A RH-Firewall-1-INPUT -p tcp --dport 8443 -j ACCEPT
-A RH-Firewall-1-INPUT -p tcp --dport 5120 -j ACCEPT
-A RH-Firewall-1-INPUT -p tcp --dport 9618 -j ACCEPT
-A RH-Firewall-1-INPUT -p udp --dport 9618 -j ACCEPT
# mysql
-A RH-Firewall-1-INPUT -p tcp --dport 3306 -j ACCEPT


6. Hacks
(a) In glite_wms.conf: change ftpconn to 300 and the WorkloadManagerProxy LogLevel from 5 to 6 (known issues).
(b) As we run the WMS and the LB on two different machines, we regularly (~every 5 min) saw the error "glite-lb-bkserverd: Database call failed (Access denied for user 'lbserver'@'localhost' to database 'lbserver20')". I traced this down to /opt/glite/etc/init.d/glite-lb-bkserverd returning 1 with the error "glite-lb-notif-interlogd not running". (At some point the package containing the software was split in two, and the complete set is only included in glite-LB.) Apparently this can be ignored, so I changed status() in /opt/glite/etc/init.d/glite-lb-bkserverd to return 0 despite the "glite-lb-notif-interlogd not running" message.
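Schematically, the change to status() was just this (a sketch of the idea, not the literal patch):
status() {
    # ...original checks unchanged...
    # glite-lb-notif-interlogd only ships with glite-LB, so its absence
    # on this node is harmless; don't let it fail the status check.
    return 0
}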

7. And finally (myproxy) ...
RAL runs the myproxy server for the UK (lcgrbp01.gridpp.rl.ac.uk). It can't be queried directly (as any other myproxy server out there can be), but only via the RAL bdii:
ldapsearch -x -H ldap://site-bdii.gridpp.rl.ac.uk:2170 -b mds-vo-name=RAL-LCG2,o=grid | perl -00pe 's/\r*\n //g' | grep -i myproxy | grep ic.ac.uk
The host DN (here /C=UK/O=eScience/OU=Imperial/L=Physics/CN=wms01.hep.ph.ic.ac.uk) needs to be explicitly listed, so if you change the name of the WMS, RAL needs to be told: support [insert obvious character here] gridpp.rl.ac.uk