EMI 2 CREAMCE

This refers to version 1.14.2 (Update 7) on CentOS 6.3.
The batch system used is SGE.

(0) Documentation
(a) Link to EMI2 CREAMCE documentation.
(b) Known issues.
(c) Sysadmin guide to cream.

(1) Issues relevant to Update 7
(a) SGE bdii scripts (still) need fixing: GGUS 83352
(b) Keep an eye on the inodes (df -i): GGUS 87264
(c) Bug in the post install script resets sandbox permissions: GGUS 91688
(d) yaim changes the SELinux context of /etc/my.cnf, which then gets (silently) ignored. After running yaim do: restorecon -v /etc/my.cnf
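For item (b), a minimal sketch of an inode check that could go into a cron job or a monitoring probe. The function name inode_alert and the 90% threshold are my own choices, not part of any EMI tool. It reads "df -iP" style output on stdin, so it can be tried out without filling up a disk.

```shell
#!/bin/bash
# Print "<mount point> <use>%" for every filesystem whose inode usage is at
# or above the threshold (in percent). Input is `df -iP` output on stdin.
# Hypothetical helper; the default threshold of 90 is an arbitrary choice.
inode_alert() {
    local threshold="${1:-90}"
    awk -v t="$threshold" 'NR > 1 {
        use = $5                      # IUse% column, e.g. "95%" or "-"
        sub(/%/, "", use)
        # Skip filesystems that report no inode usage ("-")
        if (use ~ /^[0-9]+$/ && use + 0 >= t) print $6 " " use "%"
    }'
}

# Typical use on the CE:
#   df -iP | inode_alert 90
```

Mount points containing spaces are not handled; that should not matter on a CE.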

(2) Preliminaries
yum install yum-protectbase
yum install yum-priorities

(3) Repos and Software
(a) CAs
wget http://repository.egi.eu/sw/production/cas/1/current/repo-files/EGI-trustanchors.repo -O /etc/yum.repos.d/EGI-trustanchors.repo
yum install ca-policy-egi-core
(b) EPEL
wget http://www.nic.funet.fi/pub/mirrors/fedora.redhat.com/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
rpm -i epel-release-6-8.noarch.rpm
(c) EMI2
rpm --import http://emisoft.web.cern.ch/emisoft/dist/EMI/2/RPM-GPG-KEY-emi
wget http://emisoft.web.cern.ch/emisoft/dist/EMI/2/sl6/x86_64/base/emi-release-2.0.0-1.sl6.noarch.rpm
yum localinstall emi-release-2.0.0-1.sl6.noarch.rpm
Here is the list of repos.
yum clean all
yum install emi-cream-ce
yum install emi-ge-utils
yum install gridengine-qmaster

(4) Certificates
Note the openssl issue in SL6: the hostcert/key cannot be generated on the CE; it needs an SL5 machine (or any other machine using OpenSSL 0.9.8e-fips-rhel5 01 Jul 2008).
On SL5:
openssl pkcs12 -clcerts -nokeys -out hostcert.pem -in ceprod08.p12
openssl pkcs12 -nocerts -nodes -out hostkey.pem -in ceprod08.p12
On the CE:
chmod 600 /etc/grid-security/hostcert.pem
chmod 400 /etc/grid-security/hostkey.pem
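A quick sanity check for the permissions above, as a sketch only. The function names has_mode and check_host_credentials are made up for illustration, and "stat -c" assumes GNU coreutils (true on SL/CentOS).

```shell
#!/bin/bash
# True if file $1 has exactly the octal mode $2 (GNU stat).
has_mode() {
    [ "$(stat -c '%a' "$1" 2>/dev/null)" = "$2" ]
}

# Check hostcert.pem (600) and hostkey.pem (400) in the given directory.
# Reports every problem on stderr and returns non-zero if any was found.
check_host_credentials() {
    local dir="${1:-/etc/grid-security}" rc=0
    has_mode "$dir/hostcert.pem" 600 || { echo "hostcert.pem: mode is not 600" >&2; rc=1; }
    has_mode "$dir/hostkey.pem" 400 || { echo "hostkey.pem: mode is not 400" >&2; rc=1; }
    return "$rc"
}

# Typical use on the CE:
#   check_host_credentials /etc/grid-security
```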

(5) Special users
In EMI2 installing the rpms generates the glite and glexec users:
[root@ceprod07 ~]# getent passwd glite
glite:x:495:495:gLite user:/var/glite:/bin/bash
[root@ceprod07 ~]# getent passwd glexec
glexec:x:496:496:gLExec user account to be used with /usr/sbin/glexec:/:/sbin/nologin
It still complains about 'edguser', especially as the default values used in /opt/glite/yaim/examples/edgusers.conf (which is used as an input file in the configuration as far as I can tell) are already used on our system.
I make sure the following user exists before running yaim (uid matching edgusers.conf):
groupadd -g 252 edguser
useradd -m -u 252 -g edguser edguser
It seems to ignore the user glite that is also defined in edgusers.conf.
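Since the default ids clash with existing accounts here, the two commands above can be wrapped so they are safe to re-run and refuse to proceed on a clash. This is only a sketch: ensure_user is a hypothetical helper and it uses the same number for uid and gid, as in the commands above.

```shell
#!/bin/bash
# Create group and user "$1" with uid = gid = "$2" unless they already exist.
# Refuses to do anything if the uid is already taken by a different account.
# Hypothetical helper; needs root when something actually has to be created.
ensure_user() {
    local name="$1" id="$2" owner
    owner=$(getent passwd "$id" | cut -d: -f1)
    if [ -n "$owner" ] && [ "$owner" != "$name" ]; then
        echo "uid $id already belongs to $owner" >&2
        return 1
    fi
    getent group "$name" >/dev/null || groupadd -g "$id" "$name" || return 1
    getent passwd "$name" >/dev/null || useradd -m -u "$id" -g "$name" "$name"
}

# Typical use before running yaim:
#   ensure_user edguser 252
```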

(6) SGE specifics
Link port6444 to sge_qmaster in /var/sgeCA/. ls -l then shows:
lrwxrwxrwx 1 root root 11 Nov 30 16:21 port6444 -> sge_qmaster
drwxr-xr-x 3 root root 4096 Oct 13 14:09 sge_qmaster
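The link can be created as sketched below. link_sge_port is a made-up name; the directory and port defaults are taken from the listing above.

```shell
#!/bin/bash
# Create (or refresh) the relative symlink port<N> -> sge_qmaster inside the
# SGE CA directory. Hypothetical helper; defaults match the listing above.
link_sge_port() {
    local ca_dir="${1:-/var/sgeCA}" port="${2:-6444}"
    if [ ! -d "$ca_dir/sge_qmaster" ]; then
        echo "$ca_dir/sge_qmaster does not exist" >&2
        return 1
    fi
    # -n stops ln from descending into an existing port<N> link
    ln -sfn sge_qmaster "$ca_dir/port$port"
}

# Typical use on the CE (as root):
#   link_sge_port /var/sgeCA 6444
```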

Everybody and their grandmother needs to be able to run qstat:
chown -R ldap:sgeadmin /var/sgeCA/sge_qmaster/default/userkeys/ldap
chown -R edguser:sgeadmin /var/sgeCA/sge_qmaster/default/userkeys/edguser (not sure this is still true on emi2?)
chown -R tomcat:sgeadmin /var/sgeCA/sge_qmaster/default/userkeys/tomcat

On the worker nodes: edit the cream-sge.sh script located in /usr/bin to recognise the new CE as a CREAM CE.

(7) bdii
This is apparently not needed anymore (though it seems to be needed on SL5 EMI2):
semanage fcontext -a -t slapd_db_t "/var/run/bdii(/.*)?"; restorecon -vR /var/run/bdii/
Still needed: add "slapd:ALL" to /etc/hosts.allow
To test the bdii run:
ldapsearch -LLL -x -H ldap://ceprod07.grid.hep.ph.ic.ac.uk:2170 -b mds-vo-name=resource,o=grid
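To turn the ldapsearch test into a yes/no check, the number of returned entries can be counted. A sketch, with bdii_entry_count being a made-up name; it relies on every LDIF entry starting with a "dn:" line.

```shell
#!/bin/bash
# Count the entries in LDIF read from stdin (one "dn:" line per entry).
# Prints 0 rather than failing when the input holds no entries.
bdii_entry_count() {
    grep -c '^dn:' || true
}

# Typical use (host name as in the test above):
#   ldapsearch -LLL -x -H ldap://ceprod07.grid.hep.ph.ic.ac.uk:2170 \
#       -b mds-vo-name=resource,o=grid | bdii_entry_count
```

A count of 0 means the bdii is up but publishing nothing, which is worth distinguishing from a refused connection.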

(8) Run yaim
mkdir /opt/glite/yaim/siteinfo
chmod 700 /opt/glite/yaim/siteinfo
/opt/glite/yaim/bin/yaim -c -s /opt/glite/yaim/siteinfo/siteinfo_cetest00.def -n creamCE -n SGE_utils
Note: When first running yaim, it fails because /etc/lrms/scheduler.conf does not exist. (I filed this bug a year ago!)
Also, it complains that $SGE_ROOT is not set (even though it is present in the siteinfo.def, sigh):
INFO: SGE instalation will not be configured since we assume everything is already working via the SHARED INSTALLATION!
critical error: Please set the environment variable SGE_ROOT.
(standard_in) 1: syntax error
Running "export SGE_ROOT=/usr/share/gridengine/" before running yaim does the trick.
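To avoid forgetting the export, it can be put in a small wrapper that also checks the directory exists. A sketch only: prepare_sge_root is a hypothetical name, and the default path is the one used above.

```shell
#!/bin/bash
# Export SGE_ROOT (keeping any value already set) and verify that it points
# at a directory. Hypothetical helper; the default is the path used on our CE.
prepare_sge_root() {
    export SGE_ROOT="${SGE_ROOT:-/usr/share/gridengine/}"
    if [ ! -d "$SGE_ROOT" ]; then
        echo "SGE_ROOT=$SGE_ROOT is not a directory" >&2
        return 1
    fi
}

# Typical use:
#   prepare_sge_root && /opt/glite/yaim/bin/yaim -c \
#       -s /opt/glite/yaim/siteinfo/siteinfo_cetest00.def -n creamCE -n SGE_utils
```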
To make users.conf and groups.conf I use a python script. It assumes all users already exist on the system.
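For illustration only, the same idea in shell (this is not the python script mentioned above). It assumes the yaim users.conf layout UID:LOGIN:GID:GROUP:VO:FLAG: and one primary group per account; users_conf_lines and its arguments are made up for this sketch.

```shell
#!/bin/bash
# Turn passwd-format lines on stdin into users.conf lines for all accounts
# whose login starts with PREFIX.
# Assumed output layout: UID:LOGIN:GID:GROUP:VO:FLAG:
# usage: getent passwd | users_conf_lines PREFIX GROUP VO [FLAG]
users_conf_lines() {
    local prefix="$1" group="$2" vo="$3" flag="${4:-}"
    awk -F: -v p="$prefix" -v g="$group" -v vo="$vo" -v f="$flag" '
        index($1, p) == 1 {
            # $3 = uid, $1 = login, $4 = primary gid of the existing account
            printf "%s:%s:%s:%s:%s:%s:\n", $3, $1, $4, g, vo, f
        }'
}

# Typical use for a pool of existing cms accounts:
#   getent passwd | users_conf_lines cms0 cms cms >> users.conf
```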

(9) Post yaim hacks (local configuration part I)
List of modified files and script to be run after running yaim.
At first install, do:
mkdir /usr/share/gridengine/apel/
This directory should contain filter_accounting_files.sh
sge_filestaging: Shared home dirs.
sge_submit.sh: Add ce name for accounting purposes.
jobwrapper.tpl: Add script to access tarball worker nodes
parser-config-yaim.xml: Ensure accounting only sees filtered/gzipped accounting files (different to qstat)
filter_accounting_files.sh/edg-apel-sge-parser: bzip -> gzip
cleanup-grid-accounts: remove cron job, cleanup is done centrally
glite_cream_load_monitor.conf: gridftp limits are too low
glite-info-dynamic-ge: To avoid taking the short queue's CPU limit as the overall limit, change minval to maxval in: $QUEUE_minlimits{$q}->{'cpu'} = &minval( $QUEUE_minlimits{$q}->{'cpu'}, $cputime );

(10) Hacks (local configuration part II)
(a) By default cream keeps too few and too small log files: change this in /etc/glite-ce-cream/log4j.properties
(b) The default /etc/pam.d/crond will not allow any users with uid < 500 (apart from root) to run cron jobs. Currently this is not an issue on the CE, but it pays to double check an update hasn't added a new cron job. To be able to run cron jobs as non-root user, /etc/pam.d/crond needs to be updated to this version.
(c) Response times are calculated in /usr/libexec/lcg-info-dynamic-scheduler. For CMS we need to round them a bit, otherwise they will send all jobs to the CE with the lowest response time, even if the difference is negligible. The CMS hack can be found here.
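To show what the rounding in (c) buys us, here is the arithmetic as a sketch. This is not the actual CMS hack: round_response_time and the 600 second bucket are assumptions for illustration only.

```shell
#!/bin/bash
# Round a response time (in seconds) to the nearest multiple of a bucket size,
# so that CEs with nearly identical response times publish the same value.
# Hypothetical helper; the default bucket of 600 s is an arbitrary choice.
round_response_time() {
    local seconds="$1" bucket="${2:-600}"
    echo $(( (seconds + bucket / 2) / bucket * bucket ))
}

# Example: 580 s and 610 s both end up as 600 s
#   round_response_time 580
#   round_response_time 610
```

With both CEs publishing 600, neither looks "fastest" and the jobs get spread out instead of all landing on one CE.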