Setting up an HTCondorCE 'from scratch' at UKI-LT2-IC-HEP
1. Preliminaries
- Start with a node that is already configured and has an existing schedd:
# condor_q
...
-- Schedd: mynewce.grid.hep.ph.ic.ac.uk : <146.179.232.123:5678> @ 01/01/70 00:11:22
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE HOLD TOTAL JOB_IDS
...
- Create the users (the script is managed in puppet; you may want to run it in screen, as it takes an hour plus if no accounts exist yet):
# /root/ceusers.py
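For illustration only (this is NOT the real /root/ceusers.py, which lives in puppet and covers all supported VOs), creating a block of pool accounts for one VO boils down to a loop of this shape; the prefix and count here are assumptions:
# for i in {001..049}; do id cc-dteam$i >/dev/null 2>&1 || useradd -m -c "dteam pool account" cc-dteam$i; done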
2. Basic grid stuff
-
Ensure the node has hostcert/key with the correct permissions:
# ls -l /etc/grid-security/host*
> -rw-r--r--. 1 root root 1631 Feb 17 13:33 /etc/grid-security/hostcert.pem
> -rw-------. 1 root root 1675 Feb 17 13:33 /etc/grid-security/hostkey.pem
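If the permissions or ownership differ from the above, fix them before going any further:
# chown root:root /etc/grid-security/hostcert.pem /etc/grid-security/hostkey.pem
# chmod 644 /etc/grid-security/hostcert.pem
# chmod 600 /etc/grid-security/hostkey.pem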
-
We need the UMD release for the APEL packages; this will also pull in the EGI-trustanchors repository.
yum install yum-priorities
yum install http://repository.egi.eu/sw/production/umd/4/centos7/x86_64/updates/umd-release-4.1.3-1.el7.centos.noarch.rpm
yum install ca-policy-egi-core fetch-crl
systemctl enable fetch-crl-cron; systemctl start fetch-crl-cron
fetch-crl -v
-
Note:
/etc/grid-security/vomsdir
is setup by puppet.
-
Stop the condor warfare in its tracks: the UMD repo contains a lot of condor packages too, and we should stop those
from breaking the base batch system in the future:
# echo 'exclude=*condor*' >> /etc/yum.repos.d/UMD-4-updates.repo
# echo 'exclude=*condor*' >> /etc/yum.repos.d/UMD-4-base.repo
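A quick check that the excludes are in place (both files should show the line we just appended):
# grep -H '^exclude' /etc/yum.repos.d/UMD-4-base.repo /etc/yum.repos.d/UMD-4-updates.repo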
3. Install the HTCondorCE
-
Most things should be available in the "htcondor-stable" repo.
yum install htcondor-ce htcondor-ce-apel htcondor-ce-bdii htcondor-ce-condor
-
Re-load the main condor configuration after the install so the local schedd picks up the new parameters:
condor_reconfig
4. Quarantine the spool directory
# cd /var/lib/condor-ce
# mv spool spool.old
# mkdir spool
# lvcreate -L 20g -n spoolvol ceprod01vg (adjust ce-name as appropriate)
# mkfs.ext4 /dev/mapper/ceprod01vg-spoolvol
# tune2fs -c 0 -i 0 /dev/mapper/ceprod01vg-spoolvol
Add the following entry to /etc/fstab:
/dev/mapper/ceprod01vg-spoolvol /var/lib/condor-ce/spool ext4 defaults 1 2
# mount -a
# chown condor:condor spool
The next two moves are only needed if condor-ce has been started previously:
# mv spool.old/* spool/
# mv spool.old/.schedd_* spool/
otherwise continue here:
# rmdir spool.old
# restorecon -r spool
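As a sanity check, confirm the new volume is mounted and the ownership/SELinux context look right:
# df -h /var/lib/condor-ce/spool
# ls -ldZ /var/lib/condor-ce/spool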
5. Authentication and user mapping
First install all the necessary packages. We are implementing LCMAPS with a gridmapdir (just like on the CREAM CEs).
yum install lcmaps lcas-lcmaps-gt4-interface lcmaps-plugins-basic lcmaps-plugins-voms
yum install lcas lcas-plugins-basic lcas-plugins-voms
yum install fetch-ban (from our local repo, to retrieve the list of banned users)
-
The host
The HTCondorCE used in the example is called ceprod01.grid.hep.ph.ic.ac.uk and has a matching DN.
Add a mapping line to /etc/condor-ce/condor_mapfile for the DN of the host being
configured; this must be the first entry in the file:
GSI "/C=UK/O=eScience/OU=Imperial/L=Physics/CN=ceprod01.grid.hep.ph.ic.ac.uk" ceprod01.grid.hep.ph.ic.ac.uk@daemon.htcondor.org
- Configure fetchban
The default fetch-ban config is about right; edit /etc/fetch-ban.conf and
uncomment the [ce_ban] section. Then create/update the output files ready for
the next steps:
# fetch-ban -fn
# ls -l /etc/lcas/ban_users.db
-rw-r--r--. 1 root root 371 Feb 21 09:43 /etc/lcas/ban_users.db
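The file is essentially a list of banned DNs, one per line; an illustrative (made-up) entry would look roughly like the line below, but check the LCAS userban plugin documentation for the exact syntax:
"/C=UK/O=eScience/OU=SomeSite/L=SomeDept/CN=some banned user"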
-
The users
-
Add the following line to the /etc/condor-ce/condor_mapfile user mapping file,
so that the GSI plugin system is called to map users. This should replace the
GSI line for "unmapped.htcondor.org" (order is important in this file):
GSI (.*) GSS_ASSIST_GRIDMAP
-
We now need to set up the user mappings. This can be configured to work just
like the CREAM-CE used to: using LCMAPS with a gridmapdir. Without YAIM,
we'll need to do a lot of the steps manually.
echo "globus_mapping liblcas_lcmaps_gt4_mapping.so lcmaps_callout" > /etc/grid-security/gsi-authz.conf
Create the LCMAPS and LCAS config files. These are based on the old CREAM
ones, but simplified for the new cluster. They should look like this:
# cat /etc/lcas/lcas.db
pluginname=/usr/lib64/lcas/lcas_userban.mod,pluginargs=ban_users.db
pluginname=/usr/lib64/lcas/lcas_voms.mod,pluginargs="-vomsdir /etc/grid-security/vomsdir/ -certdir /etc/grid-security/certificates/ -authfile /etc/grid-security/grid-mapfile -authformat simple -use_user_dn"
# cat /etc/lcmaps.db
path = /usr/lib64/lcmaps
vomslocalgroup = "lcmaps_voms_localgroup.mod"
" -groupmapfile /etc/grid-security/groupmapfile"
" -mapmin 0"
vomslocalaccount = "lcmaps_voms_localaccount.mod"
" -gridmapfile /etc/grid-security/grid-mapfile"
" -use_voms_gid"
vomspoolaccount = "lcmaps_voms_poolaccount.mod"
" -gridmapfile /etc/grid-security/grid-mapfile"
" -gridmapdir /etc/grid-security/gridmapdir"
" -do_not_use_secondary_gids"
good = "lcmaps_dummy_good.mod"
bad = "lcmaps_dummy_bad.mod"
authorize_only:
vomslocalgroup -> vomslocalaccount
vomslocalaccount -> good | vomspoolaccount
vomspoolaccount -> good | bad
-
/etc/grid-security/grid-mapfile and groupmapfile are now made by our user script
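For reference, the formats are the usual LCMAPS ones; illustrative dteam entries follow (the local account prefix and group name are assumptions):
grid-mapfile (FQAN -> pool account prefix, note the leading dot):
"/dteam" .cc-dteam
groupmapfile (FQAN -> local group):
"/dteam" dteam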
-
Populate the gridmapdir. Note that this directory has to be owned by
condor:condor, as the CE has already dropped root privileges by the time the
user mapping runs. Here is a brute-force example for dteam; adjust the user prefix as necessary.
# mkdir -p /etc/grid-security/gridmapdir
# cd /etc/grid-security/gridmapdir
# for i in {001..049}; do touch cc-dteam$i; done
# chown -R condor:condor /etc/grid-security/gridmapdir
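Once jobs arrive, a leased pool account shows up in the gridmapdir as a pair of hard-linked files (the account file and one named after the URL-encoded DN), so the accounts currently in use can be listed with:
# find /etc/grid-security/gridmapdir -type f -links +1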
-
Debugging:
If you need to debug LCAS/LCMAPS, set GSS_ASSIST_GRIDMAP_CACHE_EXPIRATION=0 in
a file under /etc/condor-ce/config.d to disable caching of the mapping results.
Remember to remove this again for production use! LCAS/LCMAPS mainly log to
/var/log/messages when the user connects. In production we set the expiration
to 10 minutes so it has a specific value; it can't be too high or banning would
take effect too slowly.
echo "GSS_ASSIST_GRIDMAP_CACHE_EXPIRATION=600" > /etc/condor-ce/config.d/99-ichep-gsscache.conf
6. Tell the world about it: BDII
-
Make a suitable config file like the one below. Yes, this
really does have to be in "/etc/condor" rather than /etc/condor-ce. It should list all the supported VOs,
not just the special ones used for testing.
# cat /etc/condor/config.d/99-ichep-bdii.conf
HTCONDORCE_VONames = dteam, solidexperiment.org
HTCONDORCE_SiteName = UKI-LT2-IC-HEP
HTCONDORCE_HEPSPEC_INFO = 10.00-HEP-SPEC06
HTCONDORCE_CORES = 16
GLUE2DomainID = UKI-LT2-IC-HEP
-
Add the following line to
/etc/sysconfig/bdii
to set the condor security settings (these are referenced in our condor_config.local; they're not generic condor settings):
export CONDOR_USER_SEC=OPTIONAL
- Edit
/etc/hosts.allow
and add “slapd: ALL” to the end, just like on the old CREAM-CE.
- Then simply enable and start the BDII:
chkconfig bdii on; service bdii start
- Don't forget to add it to the site BDII! (It doesn't need anything special: even though it's GLUE2 only, just add the standard o=grid URL to site-urls.conf like any other service -- all the CEs have now been added.)
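- To check the resource BDII is publishing something sensible (assuming the standard BDII port 2170 and GLUE2 attribute names):
ldapsearch -x -LLL -h ceprod01.grid.hep.ph.ic.ac.uk -p 2170 -b o=glue '(objectClass=GLUE2ComputingEndpoint)' GLUE2EndpointURL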
7. Accounting: APEL Config
Based loosely on HtCondorCeAccounting at CERN.
- enable CE in database on lcgmon01 (done for all 4 CEs)
- open firewall for CE on lcgmon01 (done for all 4 CEs)
- edit /etc/apel/parser.cfg (copy from existing node, contains passwords)
scp -3 root@ceprod01:/etc/apel/parser.cfg root@ceprod00:/etc/apel/parser.cfg
-
create cron job on ceprod01:
[root@ceprod01 cron.daily]# cat apel.cron
#!/bin/bash
export PATH=/bin:/sbin:/usr/bin:/usr/sbin
/usr/share/condor-ce/condor_blah.sh
/usr/share/condor-ce/condor_batch.sh
/usr/bin/apelparser
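Anything in cron.daily only runs if it is executable, so (assuming the file sits in /etc/cron.daily as the prompt above suggests):
# chmod 755 /etc/cron.daily/apel.cron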
- on lcgmon01: edit /etc/apel/client.cfg
and in the spec update section add manual entries for the HTCondorCEs
8. Some more subtle tweaks
Currently there are two specific tweaks to the router config:
/etc/condor-ce/config.d/99-ichep-router.conf (this overrides 01-ce-router.conf)
# edited to avoid jobs that have been removed being put back in the held state (3rd condition). This may well be an upstream bug.
SYSTEM_PERIODIC_HOLD = (x509userproxysubject =?= UNDEFINED) || (x509UserProxyExpiration =?= UNDEFINED) \
   || (time() > x509UserProxyExpiration && (JobStatus =!= 2 && JobStatus =!= 3)) \
   || (RoutedBy is null && JobUniverse =!= 1 && JobUniverse =!= 5 && JobUniverse =!= 7 && JobUniverse =!= 12) \
   || ((JobStatus =?= 1 && CurrentTime - EnteredCurrentStatus > 1800) && RoutedToJobId is null && RoutedJob =!= true)
# We also need more than 2000 queued jobs
JOB_ROUTER_DEFAULTS = $(JOB_ROUTER_DEFAULTS) [maxIdleJobs = 25000;]
And one to keep the log files at a manageable size, in /etc/condor-ce/config.d/99-ichep-misc.conf:
# Keep fewer audit log files as these are too big for /var/log
MAX_NUM_COLLECTOR_AUDIT_LOG = 45
MAX_NUM_SCHEDD_AUDIT_LOG = 45
Note: after changing the CE configuration, reload it: condor_ce_reconfig -all
9. All done.
Once the user mapping is done, we can plumb the condor-ce instance to the
backend condor system (this is needed because our central collector isn't on
the CE node):
# echo "JOB_ROUTER_SCHEDD2_POOL=htcmaster02.grid.hep.ph.ic.ac.uk:9618" > /etc/condor-ce/config.d/99-ichep-condor.conf
Now start the condor-ce: systemctl enable condor-ce; systemctl start condor-ce
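A quick sanity check that the pool setting has been picked up:
# condor_ce_config_val JOB_ROUTER_SCHEDD2_POOL
htcmaster02.grid.hep.ph.ic.ac.uk:9618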
10. Some notes on testing
There are a number of ways to test an HTCondorCE, all of which require a UI with both the grid tools and the
condor_ce client commands installed. This isn't readily available on CVMFS yet. We currently have a condor UI on a dedicated cloud node (htcui).
You will need a valid proxy:
source /cvmfs/grid.cern.ch/umd-c7ui-latest/etc/profile.d/setup-c7-ui-example.sh
voms-proxy-init --voms dteam
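Check the proxy and its VOMS attributes with:
voms-proxy-info --all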
Test 1: Ping the CE
You can ping the CE, which tells you the mapped username (great for debugging
authentication problems). The main field is "Remote Mapping", which should have
a real username. A value of "gsi@unmapped" indicates a problem.
$ condor_ce_ping -pool ceprod01.grid.hep.ph.ic.ac.uk:9619 -name ceprod01.grid.hep.ph.ic.ac.uk -verbose
> WARNING: Missing argument, defaulting to DC_NOP
> Remote Version: $CondorVersion: 8.8.7 Dec 24 2019 BuildID: 493225 PackageID: 8.8.7-1 $
> Local Version: $CondorVersion: 8.8.7 Dec 24 2019 BuildID: 493225 PackageID: 8.8.7-1 $
> Session ID: ceprod01:123456:1582280460:3
> Instruction: DC_NOP
> Command: 60011
> Encryption: none
> Integrity: MD5
> Authenticated using: GSI
> All authentication methods: GSI
> Remote Mapping: lt2-dteam000@users.htcondor.org
> Authorized: TRUE
Test 2: Trace the CE
If ping works, you can try using the trace command, which runs a series of
tests all the way through to submitting a simple "env" job, waiting and getting
the output:
$ condor_ce_trace ceprod01.grid.hep.ph.ic.ac.uk
> Testing HTCondor-CE authorization...
...
> - Job was successful
Test 3: A basic test job
A very basic job: condor_ce_run -r ceprod01.grid.hep.ph.ic.ac.uk:9619 /bin/hostname
(should print the name of the worker node it ran on)
A slightly less basic job
Make a sub file:
[centos@htcui ~]$ cat ce_test.sub
# Required for local HTCondor-CE submission
universe = vanilla
use_x509userproxy = true
+Owner = undefined
# grid_resource = condor ceprod01.grid.hep.ph.ic.ac.uk ceprod01.grid.hep.ph.ic.ac.uk:9619
# Files
executable = ce_test.sh
output = ce_test.out
error = ce_test.err
log = ce_test.log
# File transfer behavior
ShouldTransferFiles = YES
WhenToTransferOutput = ON_EXIT
# Run job once
queue
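The executable referenced above isn't shown in these notes; anything simple will do, for example a minimal ce_test.sh along these lines:
[centos@htcui ~]$ cat ce_test.sh
#!/bin/bash
# Minimal payload: report where the job landed and as whom it ran
hostname
id
env | sort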
To submit use:
condor_submit -remote ceprod01.grid.hep.ph.ic.ac.uk -pool ceprod01.grid.hep.ph.ic.ac.uk:9619 ce_test.sub
Submitting job(s).
1 job(s) submitted to cluster 389081.
11. My favourite condor CE commands
condor_ce_q
condor_ce_q -l [jobid]
condor_ce_history [jobid]
condor_ce_rm
systemctl status condor-ce
condor_ce_reconfig
condor_ce_config_val -dump (OK, that's desperation)