Cream CE with SGE and shared home directories


How to deal with glite update 30 (3.2)
(1) On the CREAM CEs: yum install gridengine-qmaster
This provides /usr/share/gridengine/bin/lx26-amd64/qacct
tomcat now needs to be able to run qstat, hence it needs a certificate.
(2) The new APEL parser needs up-to-date accounting logs, not the filtered ones for APEL. These are in /usr/share/gridengine/default/common/. I make a new variable in siteinfo.def, BATCH_LOG_DIR_APEL=/usr/share/gridengine/apel, and then change BATCH_LOG_DIR (but not BATCH_LOGS_DIR!) in config_apel_sge. Then set the default BATCH_LOG_DIR to /usr/share/gridengine/default/common/.
While I am at it, also correct the time the apel cron job is run to 10 10 instead of the default 35 01 - I don't like my middleware doing stuff in the night, it needs watching ;-)
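A minimal sketch of the siteinfo.def edit in step (2), assuming both variables are plain KEY=VALUE lines in that file. set_siteinfo_var is a hypothetical helper name, not something yaim ships. The match is anchored on "KEY=", so setting BATCH_LOG_DIR cannot touch BATCH_LOGS_DIR or BATCH_LOG_DIR_APEL by accident.

```shell
# set_siteinfo_var FILE KEY VALUE  (hypothetical helper, not part of yaim)
# Replaces an existing "KEY=..." line in FILE, or appends one if there is none.
# The body runs in a subshell, so its variables do not leak into the caller.
set_siteinfo_var() (
    file=$1; key=$2; value=$3
    if grep -q "^${key}=" "$file"; then
        tmp=$(mktemp) || exit 1
        # '|' is the sed delimiter because the values are paths full of '/'.
        sed "s|^${key}=.*|${key}=${value}|" "$file" > "$tmp" && cat "$tmp" > "$file"
        rc=$?
        rm -f "$tmp"
        exit $rc
    else
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
)
```

Possible usage, with the path to siteinfo.def adjusted to your site: set_siteinfo_var siteinfo.def BATCH_LOG_DIR_APEL /usr/share/gridengine/apel, followed by set_siteinfo_var siteinfo.def BATCH_LOG_DIR /usr/share/gridengine/default/common/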

(3) This install now looks for /usr/share/java/jakarta-commons-logging-api-1.1.jar, linked from /usr/share/tomcat5/common/lib. Apparently we only have 1.0.4 installed so far. This doesn't really seem to be a problem, however.

(4) Something in the glite init scripts is screwed up (bug is logged in Savannah), so I end up with lots of tomcat owned processes that refuse to go away and prevent cream from starting properly. The prosaic solution seems to be to su - tomcat and kill all its processes, providing a clean start *before* running yaim.
(5) Unrelated: In config_gip_sge change Production to $CREAM_CE_STATE in the section labelled "Build info-reporter.conf file". That way it uses the CREAM_CE_STATE set in the siteinfo.def file instead of setting it to Production every time I rerun yaim. On closer look this seems to set the state in /etc/sge-jobmanager/cluster.state which is then ignored by the bdii. Sigh. A closer look seems to suggest that it only gets used for the following three states: Production, Draining and Closed.
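Since only Production, Draining and Closed appear to be used, a small guard can catch a typo in CREAM_CE_STATE before yaim is rerun. This is only a sketch; valid_ce_state is a hypothetical helper name and the list of states is taken from the note in (5), not from any official documentation.

```shell
# valid_ce_state STATE  (hypothetical helper)
# Succeeds only for the three states that seem to be honoured.
valid_ce_state() {
    case "$1" in
        Production|Draining|Closed)
            return 0 ;;
        *)
            echo "unexpected CREAM_CE_STATE: '$1'" >&2
            return 1 ;;
    esac
}
```

Possible usage: valid_ce_state "$CREAM_CE_STATE" || exit 1, placed before the yaim call in whatever wrapper script you use.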

Nagios monitoring of ceprod05.grid.hep.ph.ic.ac.uk.

(0) Documentation
1a) The new glite CREAM page.
1b) The old glite CREAM page.
2) The cream homepage.

(1) Get the repositories
The underlying operating system is CentOS 5.6.
cd /etc/yum.repos.d/
Certificates, current version:
wget http://repository.egi.eu/sw/production/cas/1/current/repo-files/egi-trustanchors.repo
wget http://grid-deployment.web.cern.ch/grid-deployment/glite/repos/3.2/glite-CREAM.repo
Certificates, old version:
wget http://grid-deployment.web.cern.ch/grid-deployment/glite/repos/3.2/lcg-CA.repo
wget http://grid-deployment.web.cern.ch/grid-deployment/glite/repos/3.2/glite-CREAM.repo
wget http://grid-deployment.web.cern.ch/grid-deployment/glite/repos/3.2/glite-SGE_utils.repo


(2) Install the software
yum install lcg-CA (still works)
yum install glite-CREAM
yum install glite-SGE_utils
Put the hostcert.pem and hostkey.pem files in /etc/grid-security:
openssl pkcs12 -clcerts -nokeys -out hostcert.pem -in cetest00.p12
openssl pkcs12 -nocerts -nodes -out hostkey.pem -in cetest00.p12
chmod 600 hostkey.pem
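A quick way to confirm the chmod took effect, sketched under the assumption that "private" means no permission bits at all for group and others. key_is_private is a hypothetical helper name; it reads the mode string from ls so it does not depend on GNU stat.

```shell
# key_is_private FILE  (hypothetical helper)
# Succeeds if FILE exists and has no group/other permission bits set.
key_is_private() (
    # Characters 5-10 of the "ls -l" mode string are the group and other bits.
    perms=$(ls -ld "$1" 2>/dev/null | cut -c5-10)
    [ "$perms" = "------" ]
)
```

Possible usage: key_is_private /etc/grid-security/hostkey.pem || echo "hostkey.pem is too open"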


(3) Prepare for configuration

(a) Users: As we make our own users ("CONFIG_USERS=no" in siteinfo.def), make some key users first:
groupadd -g 200 glexec
useradd -m -g glexec glexec
groupadd -g 155 glite
useradd -m -u 155 -g glite glite
Mainly to stop yaim from whinging:
groupadd -g 151 infosys

(b) mkdir /opt/glite/yaim/siteinfo/
Set the permissions to read and write for user only: chmod 600 siteinfo
Then there's siteinfo.def, users.conf, groups.conf and the vo.d directory. (I use this C++ program to make my users.conf and groups.conf files.)

(c) After running yaim for the first time: Mount the sandbox dir across all the worker nodes:
/opt/glite/var/cream_sandbox/

(d) Hyphens in user names: See here. [This has been resolved in newer versions of cream.]
Edit config_cream_sudoers:
upper_group_name=`echo ${a_group} | sed 's/-//g' |tr '[:lower:]' '[:upper:]'`
(2 occurrences)
and copy the new version (wget http://www.pd.infn.it/~andreett/pub/patches/glite-ce-common-java.jar) over /var/lib/tomcat5/webapps/ce-cream/WEB-INF/lib/glite-ce-common-java.jar after you have run yaim, then restart tomcat.
(e) Give slapd a chance (this issue persists):
semanage fcontext -a -t slapd_db_t "/var/bdii(/.*)?"; restorecon -vR /var/bdii/
(f) Give the bdii a chance, part two: add "slapd: ALL" in hosts.allow
(g) Give the bdii a chance, part three: semanage port -a -t ldap_port_t -p tcp 2170
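For step (f), a sketch of adding the hosts.allow entry so that rerunning it does not pile up duplicate lines. ensure_line is a hypothetical helper name; it simply appends the line if an identical one is not already present.

```shell
# ensure_line FILE LINE  (hypothetical helper)
# Appends LINE to FILE unless an identical line is already there.
ensure_line() (
    file=$1; line=$2
    # -x matches whole lines only, -F treats LINE as a fixed string, not a regex.
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
)
```

Possible usage: ensure_line /etc/hosts.allow 'slapd: ALL'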

(4) SGE
(a) In /var/sgeCA, make port6444 a symlink to sge_qmaster:
[root@ceprod01 sgeCA]# pwd
/var/sgeCA
[root@ceprod01 sgeCA]# ls -l
total 12
lrwxrwxrwx 1 root root 11 Jun 29 11:10 port6444 -> sge_qmaster
drwxr-xr-x 3 root root 4096 Jun 22 22:49 sge_qmaster
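The listing above shows port6444 as a relative symlink to sge_qmaster. One way to create it is sketched here; link_qmaster_port is a hypothetical helper name, and the default port of 6444 is taken from the listing.

```shell
# link_qmaster_port CA_DIR [PORT]  (hypothetical helper)
# Creates CA_DIR/portPORT as a relative symlink to sge_qmaster.
link_qmaster_port() (
    ca_dir=$1; port=${2:-6444}
    cd "$ca_dir" || exit 1
    if [ ! -d sge_qmaster ]; then
        echo "no sge_qmaster directory in $ca_dir" >&2
        exit 1
    fi
    # -n: if the link already exists, replace it instead of following it
    # and creating a second link inside sge_qmaster.
    ln -sfn sge_qmaster "port${port}"
)
```

Possible usage: link_qmaster_port /var/sgeCA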

(b) To source the grid environment script for the WN tarball install, edit /var/lib/tomcat5/webapps/ce-cream/WEB-INF/jobwrapper.tpl (yaim will override this) - it's the section marked 'hack'.

(c) All jobs need to run a script cream-sge.sh located in /usr/bin on the worker nodes.
[root@ceprod01 ~]# qconf -sq grid.q | egrep '(prolog|epilog)'
prolog /usr/bin/cream-sge.sh prolog
epilog /usr/bin/cream-sge.sh epilog
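The same check can be scripted, for example to run after a queue has been edited with qconf -mq grid.q. This is a sketch that only parses the text printed by qconf -sq; check_prolog_epilog is a hypothetical helper name, and the script path defaults to the one shown above.

```shell
# check_prolog_epilog [SCRIPT]  (hypothetical helper)
# Reads "qconf -sq QUEUE" output on stdin and succeeds only if both the
# prolog and the epilog call SCRIPT with the matching argument.
check_prolog_epilog() (
    script=${1:-/usr/bin/cream-sge.sh}
    awk -v s="$script" '
        $1 == "prolog" && $2 == s && $3 == "prolog" { p = 1 }
        $1 == "epilog" && $2 == s && $3 == "epilog" { e = 1 }
        END { exit !(p && e) }
    '
)
```

Possible usage: qconf -sq grid.q | check_prolog_epilog || echo "grid.q is missing the cream-sge.sh prolog/epilog"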

(d) Shared home directories: Edit /opt/glite/bin/sge_filestaging. Here is my version.
(e) edguser needs to be able to run qstat:
chown -R edguser:sgeadmin /var/sgeCA/sge_qmaster/default/userkeys/edguser

(5) Run Yaim
/opt/glite/yaim/bin/yaim -c -s /opt/glite/yaim/siteinfo/site-info-ceprod01.def -n creamCE -n SGE_utils
After running yaim, note points 4(b) and 3(d) and make sure the accounting hasn't gone haywire.

(6) Firewall
The official webpage: Open Ports for CREAM
Here is the relevant section of the iptables.

(7) Accounting
(a) Set the accounting string to ceprod05 (i.e. the machine name) in /opt/glite/bin/sge_submit.sh
(jobID=`qsub -A "ceprod05.grid.hep.ph.ic.ac.uk" $bls_tmp_file 2> /dev/null | perl -ne 'print $1 if /^Your job (\d+) /;'` # actual submission)
(b) Use a filter script to select the jobs relevant to ceprod01, and make sure the directory APEL looks in is set to the filtered output, not the original: Filter script, parser-config-yaim.xml file.
(c) On the mon box update the iptables rules to give the CE access:
-A RH-Firewall-1-INPUT -s [ip address] -p tcp -m tcp --dport 3306 -j ACCEPT
/etc/init.d/iptables restart
(d) Update the mysql tables: [this refers to the old MON box - for glite-APEL, check here. Unfortunately they don't seem to be quite current right now (May 2011).]
mysql -u root -p
grant all privileges on accounting.* to accounting@ceprod05.grid.hep.ph.ic.ac.uk identified by 'put_the_apel_password_here';
flush privileges;
(e) Check that the CEs have actually written something to the mon box:
mysql -u accounting -p
use accounting;
select Max(EventDate), SubmitHost from EventRecords group by SubmitHost;
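Two small sketches for steps (a) and (b). The first is a sed equivalent of the perl one-liner that pulls the job ID out of qsub's reply. The second is NOT the filter script linked in (b); it is a guess at what such a filter could look like, assuming it keys on the account string set with qsub -A and that this string is the seventh colon-separated field of the SGE accounting file (as described in the accounting(5) man page). Both function names are hypothetical.

```shell
# extract_job_id  (hypothetical helper)
# Reads qsub's output on stdin and prints the numeric job ID, if any.
extract_job_id() {
    sed -n 's/^Your job \([0-9][0-9]*\) .*/\1/p'
}

# filter_accounting ACCOUNT  (hypothetical helper)
# Reads an SGE accounting file on stdin and prints only the records whose
# account field (assumed to be field 7) equals ACCOUNT. Comment lines
# starting with '#' are dropped.
filter_accounting() {
    awk -F: -v acct="$1" '!/^#/ && $7 == acct'
}
```

Possible usage, with the input path from the update notes at the top and an output directory of your choosing: filter_accounting ceprod05.grid.hep.ph.ic.ac.uk < /usr/share/gridengine/default/common/accounting > /usr/share/gridengine/apel/accounting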

(8) Odds and ends
(a) Shared software area: /opt/edg/var/info needs to be mounted from the central storage
(b) Stick the machine in the GOCDB.
(c) Update the site bdii.
(d) The default mysql settings are no good. Here's the tweaked one.

(9) Debugging
(a) See if it advertises its services: Webpage.
(b) Edit /opt/glite/etc/glexec.conf:
log_file = /var/log/glexec/glexec.log (this will stop the messages going to /var/log/messages)
Increase all the log_levels from 1 to 2 (up to 5)
(c) More logs in /opt/glite/var/log/
(d) In /opt/glite/etc/glite-ce-cream/log4j.properties set log4j.logger.org.glite.security=debug, fileout