How to install an lcg-CE on CentOS 5 (you know you want to!!!)


(1) Get the repositories - we are using SGE with shared home dirs
wget -O lcg-CA.repo "http://grid-deployment.web.cern.ch/grid-deployment/glite/repos/3.1/lcg-CA.repo"
wget -O lcg-CE.repo "http://grid-deployment.web.cern.ch/grid-deployment/glite/repos/3.1/lcg-CE.repo"
wget -O glite-SGE_utils.repo "http://grid-deployment.web.cern.ch/grid-deployment/glite/repos/3.1/glite-SGE_utils.repo"
In every repo file apart from lcg-CA.repo, change $basearch to i386.
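The basearch edit can be scripted rather than done by hand. The following is only a sketch: it assumes the repo files contain the literal string $basearch in their URLs, and the function name pin_basearch_i386 is made up for illustration.

```shell
# Hypothetical helper: replace the literal string $basearch with i386
# in each repo file given as an argument. lcg-CA.repo is skipped even
# if it is passed in, since that repo is to be left alone.
pin_basearch_i386() {
    for repo in "$@"; do
        case "${repo##*/}" in
            lcg-CA.repo) continue ;;
        esac
        # Write to a temporary file first, then move it into place.
        sed 's/\$basearch/i386/g' "$repo" > "$repo.tmp" && mv "$repo.tmp" "$repo"
    done
}

# Example, run in the directory holding the downloaded repo files:
# pin_basearch_i386 lcg-CE.repo glite-SGE_utils.repo
```

Passing the files explicitly keeps the CentOS base repos out of harm's way.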


(2) Install the software
Get some missing packages:
yum -y install openssl097a log4cxx jdk
yum -y --nogpgcheck install bouncycastle log4j
yum -y install perl-XML-Twig
yum -y install lcg-CA lcg-CE glite-yaim-sge-utils glite-info-dynamic-sge glite-apel-sge
At this point it might be wise to disable the glite repos, apart from lcg-CA which should be left enabled.
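Disabling the glite repos can be done by flipping the enabled switch in the repo files. A rough sketch, assuming each file carries a line reading enabled=1 (not checked against the actual repo files); disable_repos is a hypothetical name.

```shell
# Hypothetical helper: set enabled=0 in each repo file given as an
# argument. lcg-CA.repo is skipped so that it stays enabled.
# Assumption: the files contain a line of the form "enabled=1".
disable_repos() {
    for repo in "$@"; do
        case "${repo##*/}" in
            lcg-CA.repo) continue ;;
        esac
        sed 's/^enabled[ ]*=[ ]*1[ ]*$/enabled=0/' "$repo" > "$repo.tmp" && mv "$repo.tmp" "$repo"
    done
}

# Example, run in the directory holding the repo files:
# disable_repos lcg-CE.repo glite-SGE_utils.repo
```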

(3) Install the hostcert in /etc/grid-security/
openssl pkcs12 -clcerts -nokeys -out hostcert.pem -in ceprodNN.p12
openssl pkcs12 -nocerts -nodes -out hostkey.pem -in ceprodNN.p12
chmod 600 hostkey.pem
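Before moving on it may be worth checking that the extracted key is really private and that certificate and key belong together. A sketch under the assumption that the host key is an RSA key; both function names are invented for this example.

```shell
# Succeeds only if the file's permission bits are exactly 600.
key_is_private() {
    [ -n "$(find "$1" -prune -perm 600 -print)" ]
}

# Succeeds if certificate ($1) and key ($2) share the same modulus,
# i.e. they belong together. Assumes an RSA key.
cert_matches_key() {
    cert_mod=$(openssl x509 -noout -modulus -in "$1") || return 1
    key_mod=$(openssl rsa -noout -modulus -in "$2") || return 1
    [ "$cert_mod" = "$key_mod" ]
}

# Example, run in /etc/grid-security/:
# key_is_private hostkey.pem || echo "hostkey.pem is not mode 600"
# cert_matches_key hostcert.pem hostkey.pem || echo "cert and key differ"
```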


(4) Configure the node
Compulsory Yaim Variables for a lcg-CE.
Here is a link to my siteinfo.def, vo.d, users.conf and groups.conf. I use this programme to generate them. (Unfortunately I have not yet tested whether my hack to map both cms production and cms hiproduction users to prd accounts works, as I've put them in by hand as described in section (6), 'Special requests'.)
/opt/glite/yaim/bin/yaim -d6 -c -s site-info-prodNN.def -n lcg-CE
/opt/glite/yaim/bin/yaim -d6 -r -s site-info-prodNN.def -n SGE_utils -f config_gip_sge
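Since the two yaim calls always go together, they can be wrapped in one function. This is just a convenience sketch; configure_ce and the YAIM override variable are hypothetical and not part of yaim.

```shell
# Path to yaim; can be overridden (e.g. YAIM=echo) for a dry run.
YAIM="${YAIM:-/opt/glite/yaim/bin/yaim}"

# Hypothetical wrapper: configure the lcg-CE, then re-run
# config_gip_sge for SGE_utils, using the given site-info file.
configure_ce() {
    siteinfo="$1"
    # Stop here if the lcg-CE step fails, so that the SGE step is not
    # run on top of a half-configured node.
    "$YAIM" -d6 -c -s "$siteinfo" -n lcg-CE || return 1
    "$YAIM" -d6 -r -s "$siteinfo" -n SGE_utils -f config_gip_sge
}

# Example:
# configure_ce site-info-prod00.def
```

Running the SGE step second matters, see the note on config_gip_ce and config_gip_sge in the install log.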


(5) Hacks

(5.1) bdii doesn't start
Remove all flags from DB_CONFIG, the remaining file should look similar to this:
set_lk_max_locks 10000
set_tas_spins 100
set_cachesize 0 50000000 1
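The edit can be sketched as below, assuming that 'flags' refers to the set_flags lines of the Berkeley DB configuration. The location of DB_CONFIG is not spelled out here, so it is passed in as an argument; strip_db_flags is a made-up name.

```shell
# Hypothetical helper: drop every set_flags line from a DB_CONFIG file
# and keep a copy of the original next to it as DB_CONFIG.orig.
strip_db_flags() {
    db_config="$1"
    cp "$db_config" "$db_config.orig" || return 1
    sed '/^[[:space:]]*set_flags/d' "$db_config.orig" > "$db_config"
}

# Example (path is a placeholder):
# strip_db_flags /path/to/DB_CONFIG
```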


(5.2) bdii reports gatekeeper is stopped, even when it's running
Problem:
/etc/init.d/functions differs from the SL4 version, hence e.g. globus-gatekeeper and globus-gridftp show up as 'stopped' when the status command is issued as edguser rather than as root.
Solution:
At the beginning of /etc/init.d/globus-gatekeeper and globus-gridftp add
__pids_pidof() {
/sbin/pidof -o $$ -o $PPID -o %PPID -x "$1" || \
/sbin/pidof -o $$ -o $PPID -o %PPID -x "${1##*/}"
}
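As the override has to survive reconfiguration, a quick check that it is still present in both init scripts can save some head scratching. A minimal sketch; the function names are invented, and it only looks for the function definition at the start of a line.

```shell
# Succeeds if the init script defines its own __pids_pidof.
has_pidof_override() {
    grep -q '^__pids_pidof()' "$1"
}

# Print the name of every script given that lacks the override.
missing_pidof_override() {
    for script in "$@"; do
        has_pidof_override "$script" || echo "$script"
    done
}

# Example:
# missing_pidof_override /etc/init.d/globus-gatekeeper /etc/init.d/globus-gridftp
```

No output means both scripts still carry the hack.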


(5.3) After a reboot bdii reports status OK, when it's clearly not
The bdii needs to be restarted by hand after everything has finished booting; it's quite determined not to let you do that automatically.

(6) Special requests
So CMS wants some priorityusers, hiproduction users and other users that just say: "Actually, can we have a cms only cluster, please?" To avoid rerunning yaim, edit gridmapdir, groupmapfile, voms-grid-mapfile and edg-mkgridmap.conf by hand, then restart globus-gatekeeper.




Kept for historical reasons:

ceprod00 is now working.
To reconfigure it, after updating the siteinfo.def file run
/opt/glite/yaim/bin/yaim -d6 -c -s site-info-prod00.def -n lcg-CE
/opt/glite/yaim/bin/yaim -d6 -r -s site-info-prod00.def -n SGE_utils -f config_gip_sge
and check: /opt/lcg/etc/cleanup-grid-accounts.conf
Also make sure the hack for edguser described in (c7) is still in place.



Install log:
After successfully installing a test CE, setting up a production CE should be (relatively) straightforward, no?

yaim
The functions run by yaim to configure a CE can be found in:
/opt/glite/yaim/node-info.d/lcg-ce
Compulsory Yaim Variables for a CE.

(User) Accounts
Here is the documentation on pool accounts: Pool accounts. Note that even if I set CONFIG_USERS to NO in the siteinfo.def file, I still need to provide a users.conf file to create the gridmapfile.
On a CE, users (except sgm, prd) are mapped in /etc/grid-security/gridmapdir.
Even if the user exists elsewhere (test e.g. with getent passwd), the grid will ignore him/her/it as long as the account does not appear in gridmapdir. If a user already exists in the gridmapdir, YAIM will not remove it, even if it's not in users.conf anymore.
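Because YAIM leaves old entries behind, it can be handy to list gridmapdir accounts that are no longer in users.conf. The sketch below rests on two assumptions: users.conf lines are colon-separated with the login name in the second field, and gridmapdir entries starting with % are encoded DN links rather than accounts. stale_gridmap_accounts is a hypothetical name.

```shell
# Hypothetical helper: print every account found in the gridmapdir ($2)
# that has no matching login (second field) in users.conf ($1).
stale_gridmap_accounts() {
    users_conf="$1"
    gridmapdir="$2"
    for entry in "$gridmapdir"/*; do
        [ -e "$entry" ] || continue
        name="${entry##*/}"
        case "$name" in
            %*) continue ;;   # assumed to be an encoded DN link, not an account
        esac
        if ! cut -d: -f2 "$users_conf" | grep -qxF "$name"; then
            echo "$name"
        fi
    done
}

# Example:
# stale_gridmap_accounts users.conf /etc/grid-security/gridmapdir
```

The function only reports; removing an entry is left as a manual decision.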


*** Here we go. ***

(a) yum
No 'protectbase' on CentOS-Base.repo and ICHEP.repo, some of the grid software needs very new versions of the rpms.
wget http://grid-deployment.web.cern.ch/grid-deployment/glite/repos/3.1/lcg-CA.repo
wget http://grid-deployment.web.cern.ch/grid-deployment/glite/repos/3.1/lcg-CE.repo

No 64 bit architecture, so change $basearch to i386

yum install lcg-CA
yum install lcg-CE

Kostas installs some rpms....
wget ftp://ftp.mirrorservice.org/sites/mirror.centos.org/5/updates/i386/RPMS/openssl097a-0.9.7a-9.el5_2.1.i386.rpm
rpm -ivh --test openssl097a-0.9.7a-9.el5_2.1.i386.rpm
yum install log4cxx.i686

Now try yum install lcg-CE again

(b) siteinfo.def, users.conf, groups.conf
These seemed to produce a somewhat working CE (no tests for prod yet). I use this script to make my groups.conf and users.conf. (I am really not proud of it, but after 10 years of C++ it's hard to let go.) This script makes the corresponding input files. And here's the list of VOs I am currently contemplating.

/opt/glite/yaim/bin/yaim -d6 -c -s site-info-prod00.def -n lcg-CE

(c) SGE
The master node is sgemaster01.

* The information system *
yum install glite-info-dynamic-sge (from glite-SGE_utils)
yum install glite-apel-sge
yum install perl-XML-Twig

Note: config_gip_ce tries to set /opt/glite/etc/gip/plugin/glite-info-dynamic-ce, but has no proper entry for 'sge'; config_gip_sge will reset it. To avoid having to mess with the information system every time I run yaim lcg-CE, comment out the line
cat << EOF > ${INSTALL_ROOT}/glite/etc/gip/plugin/glite-info-dynamic-ce
in config_gip_ce. (A yum update will override this, so check!! And it doesn't seem to work anymore???)
[root@ceprod00 functions]# more /opt/glite/etc/gip/plugin/glite-info-dynamic-ce
#!/bin/sh
/opt/lcg/libexec/lcg-info-dynamic-scheduler -c /opt/glite/etc/lcg-info-dynamic-scheduler.conf
This should get rid of "Spec for lrms backend cmd missing in config file" in /var/log/messages.
/opt/glite/yaim/bin/yaim -d6 -r -s site-info-prod00.def -n SGE_utils -f config_gip_sge

(c1) edguser needs to be able to run qstat.
(c2) /opt/glite/libexec/sge_helper -- other issues:
/opt/glite/libexec/sge_helper --vomaxjobs -c /opt/glite/etc/lcg-info-dynamic-scheduler.conf
error: commlib error: can't set CA chain file
Solution (Kostas):
Fixed by linking /var/sgeCA/port6444 to /var/sgeCA/sge_qmaster. I'll have to think about what the best solution for this is, but it's caused by sge_helper sourcing /usr/share/gridengine/default/common/settings.sh.
(c3) Remove all $twig->dispose; statements from /opt/glite/libexec/sge_helper.
(c4) in /opt/bdii/sbin/bdii-update add '-s' option to '$slapadd -c -d'
(c5) Garbage reported in software tags:
GlueHostApplicationSoftwareRunTimeEnvironment:: Vk8tcGhlbm8taGVyd2lnLTItMC0wCQ
It turns out the new CE cannot handle whitespace in the [experiment].list file (ce00 can).
The fix is to remove all whitespace from /srv/grid/vos/infotags/experiment/experiment.list (the files reside on sedsk09 and are mounted on the CEs as /opt/edg/var/info/).
(c6) The ce needs to be added on the bdii.
On bdii01, add it as and re-run yaim, on bdii00 which has a more dodgy setup, add it to /opt/glite/etc/gip/site-urls.conf (and nowhere else ?). If it's not in the bdii, the 'js' sam test will fail.
(c7) globus-gatekeeper and globus-gridftp-server report being stopped when queried by edguser, i.e. by the bdii, but report running when queried as root. This is due to a difference in /etc/rc.d/init.d/functions between SL4 (which the CE is meant to be installed on) and SL5. The workaround is described in (5.2).
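On (c5): the double colon means the value is base64 encoded, and the string shown decodes to VO-pheno-herwig-2-0-0 followed by a tab character, so trailing whitespace in the list file is the culprit. Stripping it can be sketched as below, assuming one tag per line with no whitespace wanted anywhere in a tag; strip_list_whitespace is a made-up name.

```shell
# Hypothetical helper: delete all spaces and tabs from a tag list file.
# Newlines are kept, so the one-tag-per-line layout is preserved.
strip_list_whitespace() {
    list="$1"
    tr -d ' \t' < "$list" > "$list.tmp" && mv "$list.tmp" "$list"
}

# Example (run on the machine that holds the files):
# strip_list_whitespace /srv/grid/vos/infotags/experiment/experiment.list
```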

* Reconciling grid and sge *
on ceprod00: /etc/sge-jobmanager
cluster.state, info-reporter.conf, jobmanager.conf, job.postamble, job.preamble, vqueues.conf

* Accounting *
The 'new' style accounting uses the following files:
on the CE:
/opt/edg/var/gatekeeper/grid-jobmap_[date]
from sgemaster01:
/usr/share/gridengine/default/common/accounting
The grid-jobmap files are made by lcg-dgas-tools and need to be called in sge.pm.
The accounting cron job is:
/etc/cron.d/edg-apel-sge-parser, with the config file /opt/glite/etc/glite-apel-sge/parser-config-yaim.xml
On lcgmon00, a port needs to be opened to allow ceprod00 to write to it, with a rule similar to
-A RH-Firewall-1-INPUT -s 155.198.216.206 -p tcp -m tcp --dport 3306 -j ACCEPT
(this is ce00)


* The jobmanager *
/opt/globus/lib/perl/Globus/GRAM/JobManager/sge.pm
This is used by globus; its log files are in /opt/globus/var/log/

(d) Open ports
A very basic test to run is:
lx07: uberftp ceprod00.hep.ph.ic.ac.uk
globus_xio: Unable to connect to ceprod00.hep.ph.ic.ac.uk:2811
I guess I need to open port 2811: In /etc/sysconfig/iptables add
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 2811 -j ACCEPT
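To see whether a port already has an ACCEPT rule before editing the file, something along these lines may help. It is a plain text match on the rules file, not a query of the running firewall, and port_is_open is a hypothetical name.

```shell
# Succeeds if the rules file ($1) contains a non-comment line that
# accepts traffic to the given destination port ($2).
port_is_open() {
    rules_file="$1"
    port="$2"
    grep -v '^[[:space:]]*#' "$rules_file" | grep -q -- "--dport $port -j ACCEPT"
}

# Example:
# port_is_open /etc/sysconfig/iptables 2811 || echo "no rule for port 2811 yet"
```

After adding a rule to /etc/sysconfig/iptables the firewall has to reload it, which on CentOS 5 is normally done with 'service iptables restart'.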


(e) List of rpms on ceprod00
Here.

(f) Testing

ldapsearch -x -H ldap://ceprod00.hep.ph.ic.ac.uk:2170 -b mds-vo-name=resource,o=grid

glite-wms-job-submit -a -r ceprod00.hep.ph.ic.ac.uk:2119/jobmanager-sge-long -o [some_logfile_name] glite-submit.jdl
glite-wms-job-status -i [some_logfile_name]
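The contents of glite-submit.jdl are not shown here. For a first smoke test a trivial job is usually enough; the sketch below writes a hypothetical minimal JDL that just runs /bin/hostname on the worker node. It is an illustration, not the file used for the tests above.

```shell
# Hypothetical helper: write a minimal test JDL to the given path.
# The job runs /bin/hostname and returns its stdout and stderr.
write_test_jdl() {
    cat > "$1" << 'EOF'
Executable = "/bin/hostname";
StdOutput = "std.out";
StdError = "std.err";
OutputSandbox = {"std.out", "std.err"};
EOF
}

# Example:
# write_test_jdl glite-submit.jdl
```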