Setting up an HTCondorCE 'from scratch' at UKI-LT2-IC-HEP
1. Preliminaries
- Start with a node that is already configured and has an existing schedd:
# condor_q
...
-- Schedd: mynewce.grid.hep.ph.ic.ac.uk : <146.179.232.123:5678> @ 01/01/70 00:11:22
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE HOLD TOTAL JOB_IDS
...
- Create the users (the script is managed in puppet; you may want to run it in screen, as it takes an hour plus if no accounts exist yet):
# /root/ceusers.py
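For illustration only (this is NOT the real /root/ceusers.py, which lives in puppet and covers all supported VOs), creating a block of pool accounts for one VO boils down to a loop of this shape; the prefix and count here are assumptions:
# for i in {001..049}; do id cc-dteam$i >/dev/null 2>&1 || useradd -m -c "dteam pool account" cc-dteam$i; done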
2. Basic grid stuff
-
Ensure the node has hostcert/key with the correct permissions:
# ls -l /etc/grid-security/host*
> -rw-r--r--. 1 root root 1631 Feb 17 13:33 /etc/grid-security/hostcert.pem
> -rw-------. 1 root root 1675 Feb 17 13:33 /etc/grid-security/hostkey.pem
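If the permissions or ownership differ from the above, fix them before going any further:
# chown root:root /etc/grid-security/hostcert.pem /etc/grid-security/hostkey.pem
# chmod 644 /etc/grid-security/hostcert.pem
# chmod 600 /etc/grid-security/hostkey.pem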
-
We need the UMD release for the APEL packages; this will also pull in the EGI-trustanchors repository.
yum install yum-priorities
yum install http://repository.egi.eu/sw/production/umd/4/centos7/x86_64/updates/umd-release-4.1.3-1.el7.centos.noarch.rpm
yum install ca-policy-egi-core fetch-crl
systemctl enable fetch-crl-cron; systemctl start fetch-crl-cron
fetch-crl -v
-
Note:
/etc/grid-security/vomsdir
is setup by puppet.
-
Stop the condor warfare in its tracks: the UMD repo contains a lot of condor packages too, and we should stop those
from breaking the base batch system in the future:
# echo 'exclude=*condor*' >> /etc/yum.repos.d/UMD-4-updates.repo
# echo 'exclude=*condor*' >> /etc/yum.repos.d/UMD-4-base.repo
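A quick check that the excludes are in place (both files should show the line we just appended):
# grep -H '^exclude' /etc/yum.repos.d/UMD-4-base.repo /etc/yum.repos.d/UMD-4-updates.repo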
3. Install the HTCondorCE
-
Most things should be available in the "htcondor-stable" repo.
yum install htcondor-ce htcondor-ce-apel htcondor-ce-bdii htcondor-ce-condor
-
Re-load the main condor configuration after the install so the local schedd picks up the new parameters:
condor_reconfig
4. Quarantine the spool directory
# cd /var/lib/condor-ce
# mv spool spool.old
# mkdir spool
# lvcreate -L 20g -n spoolvol ceprod01vg (adjust ce-name as appropriate)
# mkfs.ext4 /dev/mapper/ceprod01vg-spoolvol
# tune2fs -c 0 -i 0 /dev/mapper/ceprod01vg-spoolvol
Add the following entry to /etc/fstab:
/dev/mapper/ceprod01vg-spoolvol /var/lib/condor-ce/spool ext4 defaults 1 2
# mount -a
# chown condor:condor spool
The next two moves are only needed if condor-ce has been started previously:
# mv spool.old/* spool/
# mv spool.old/.schedd_* spool/
otherwise continue here:
# rmdir spool.old
# restorecon -r spool
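As a sanity check, confirm the new volume is mounted and the ownership/SELinux context look right:
# df -h /var/lib/condor-ce/spool
# ls -ldZ /var/lib/condor-ce/spool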
5. Authentication and user mapping
First install all the necessary packages. We are implementing LCMAPS with a gridmapdir (just like on the CREAM CEs).
yum install lcmaps lcas-lcmaps-gt4-interface lcmaps-plugins-basic lcmaps-plugins-voms
yum install lcas lcas-plugins-basic lcas-plugins-voms
yum install fetch-ban (from our local repo, to retrieve the list of banned users)
-
The host
The HTCondorCE used in the example is called ceprod01.grid.hep.ph.ic.ac.uk and has a matching DN.
Add a mapping line to /etc/condor-ce/condor_mapfile for the DN of the host being
configured; this must be the first entry in the file:
GSI "/C=UK/O=eScience/OU=Imperial/L=Physics/CN=ceprod01.grid.hep.ph.ic.ac.uk" ceprod01.grid.hep.ph.ic.ac.uk@daemon.htcondor.org
- Configure fetchban
The default fetch-ban config is about right; edit /etc/fetch-ban.conf and
uncomment the [ce_ban] section. Then create/update the output files ready for
the next steps:
# fetch-ban -fn
# ls -l /etc/lcas/ban_users.db
-rw-r--r--. 1 root root 371 Feb 21 09:43 /etc/lcas/ban_users.db
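The file is essentially a list of banned DNs, one per line; an illustrative (made-up) entry would look roughly like the line below, but check the LCAS userban plugin documentation for the exact syntax:
"/C=UK/O=eScience/OU=SomeSite/L=SomeDept/CN=some banned user"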
-
The users
-
Add the following line to the /etc/condor-ce/condor_mapfile user mapping file,
so that the GSI plugin system is called to map users. This should replace the
GSI line for "unmapped.htcondor.org" (order is important in this file):
GSI (.*) GSS_ASSIST_GRIDMAP
-
We now need to set up the user mappings. This can be configured to work just
like the CREAM-CE used to: using LCMAPS with a gridmapdir. Without YAIM,
we'll need to do a lot of the steps manually.
echo "globus_mapping liblcas_lcmaps_gt4_mapping.so lcmaps_callout" > /etc/grid-security/gsi-authz.conf
Create the LCMAPS and LCAS config files. These are based on the old CREAM
ones, but simplified for the new cluster. They should look like this:
# cat /etc/lcas/lcas.db
pluginname=/usr/lib64/lcas/lcas_userban.mod,pluginargs=ban_users.db
pluginname=/usr/lib64/lcas/lcas_voms.mod,pluginargs="-vomsdir /etc/grid-security/vomsdir/ -certdir /etc/grid-security/certificates/ -authfile /etc/grid-security/grid-mapfile -authformat simple -use_user_dn"
# cat /etc/lcmaps.db
path = /usr/lib64/lcmaps
vomslocalgroup = "lcmaps_voms_localgroup.mod"
" -groupmapfile /etc/grid-security/groupmapfile"
" -mapmin 0"
vomslocalaccount = "lcmaps_voms_localaccount.mod"
" -gridmapfile /etc/grid-security/grid-mapfile"
" -use_voms_gid"
vomspoolaccount = "lcmaps_voms_poolaccount.mod"
" -gridmapfile /etc/grid-security/grid-mapfile"
" -gridmapdir /etc/grid-security/gridmapdir"
" -do_not_use_secondary_gids"
good = "lcmaps_dummy_good.mod"
bad = "lcmaps_dummy_bad.mod"
authorize_only:
vomslocalgroup -> vomslocalaccount
vomslocalaccount -> good | vomspoolaccount
vomspoolaccount -> good | bad
-
/etc/grid-security/grid-mapfile and groupmapfile are now made by our user script
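For reference, the formats are the usual LCMAPS ones; illustrative dteam entries follow (the local account prefix and group name are assumptions):
grid-mapfile (FQAN -> pool account prefix, note the leading dot):
"/dteam" .cc-dteam
groupmapfile (FQAN -> local group):
"/dteam" dteam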
-
Populate the gridmapdir. Note that this directory has to be owned by
condor:condor, as the CE has already dropped root privileges by the time the
user mapping runs. Here is a brute-force example for dteam; adjust the user prefix as necessary.
# mkdir -p /etc/grid-security/gridmapdir
# cd /etc/grid-security/gridmapdir
# for i in {001..049}; do touch cc-dteam$i; done
# chown -R condor:condor /etc/grid-security/gridmapdir
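Once jobs arrive, a leased pool account shows up in the gridmapdir as a pair of hard-linked files (the account file and one named after the URL-encoded DN), so the accounts currently in use can be listed with:
# find /etc/grid-security/gridmapdir -type f -links +1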
-
Debugging:
If you need to debug LCAS/LCMAPS, set GSS_ASSIST_GRIDMAP_CACHE_EXPIRATION=0 in
a file under /etc/condor-ce/config.d to disable caching of the mapping results.
Remember to remove this again for production use! LCAS/LCMAPS mainly log to
/var/log/messages when the user connects. In production we set the expiration
to 10 minutes so it has a specific value; it can't be too high or banning would
take effect too slowly.
echo "GSS_ASSIST_GRIDMAP_CACHE_EXPIRATION=600" > /etc/condor-ce/config.d/99-ichep-gsscache.conf
6. Tell the world about it: BDII
-
Make a suitable config file like the one below. Yes, this
really does have to be in "/etc/condor" rather than /etc/condor-ce. It should list all the supported VOs,
not just the special ones used for testing.
# cat /etc/condor/config.d/99-ichep-bdii.conf
HTCONDORCE_VONames = dteam, solidexperiment.org
HTCONDORCE_SiteName = UKI-LT2-IC-HEP
HTCONDORCE_HEPSPEC_INFO = 10.00-HEP-SPEC06
HTCONDORCE_CORES = 16
GLUE2DomainID = UKI-LT2-IC-HEP
-
Add the following line to
/etc/sysconfig/bdii
to set the condor security settings (these are referenced in our condor_config.local; they're not generic condor settings):
export CONDOR_USER_SEC=OPTIONAL
- Edit
/etc/hosts.allow
and add “slapd: ALL” to the end, just like on the old CREAM-CE.
- Then simply enable and start the BDII:
chkconfig bdii on; service bdii start
- Don't forget to add it to the site BDII! (It doesn't need anything special: even though it's GLUE2 only, just add the standard o=grid URL to site-urls.conf like any other service -- all the CEs have now been added.)
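- To check the resource BDII is publishing something sensible (assuming the standard BDII port 2170 and GLUE2 attribute names):
ldapsearch -x -LLL -h ceprod01.grid.hep.ph.ic.ac.uk -p 2170 -b o=glue '(objectClass=GLUE2ComputingEndpoint)' GLUE2EndpointURL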
7. Accounting: APEL Config
Based loosely on HtCondorCeAccounting at CERN.
- enable CE in database on lcgmon01 (done for all 4 CEs)
- open firewall for CE on lcgmon01 (done for all 4 CEs)
- edit /etc/apel/parser.cfg (copy from existing node, contains passwords)
scp -3 root@ceprod01:/etc/apel/parser.cfg root@ceprod00:/etc/apel/parser.cfg
-
create cron job on ceprod01:
[root@ceprod01 cron.daily]# cat apel.cron
#!/bin/bash
export PATH=/bin:/sbin:/usr/bin:/usr/sbin
/usr/share/condor-ce/condor_blah.sh
/usr/share/condor-ce/condor_batch.sh
/usr/bin/apelparser
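Anything in cron.daily only runs if it is executable, so (assuming the file sits in /etc/cron.daily as the prompt above suggests):
# chmod 755 /etc/cron.daily/apel.cron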
- on lcgmon01: edit /etc/apel/client.cfg
and in the spec update section add manual entries for the HTCondorCEs
8. Some more subtle tweaks
Currently there are two specific tweaks to the router config:
/etc/condor-ce/config.d/99-ichep-router.conf (this overrides 01-ce-router.conf)
# edited to avoid jobs that have been removed being put back in the held state (3rd condition). This may well be an upstream bug.
SYSTEM_PERIODIC_HOLD = (x509userproxysubject =?= UNDEFINED) || (x509UserProxyExpiration =?= UNDEFINED) \
   || (time() > x509UserProxyExpiration && (JobStatus =!= 2 && JobStatus =!= 3)) \
   || (RoutedBy is null && JobUniverse =!= 1 && JobUniverse =!= 5 && JobUniverse =!= 7 && JobUniverse =!= 12) \
   || ((JobStatus =?= 1 && CurrentTime - EnteredCurrentStatus > 1800) && RoutedToJobId is null && RoutedJob =!= true)
# We also need more than 2000 queued jobs
JOB_ROUTER_DEFAULTS = $(JOB_ROUTER_DEFAULTS) [maxIdleJobs = 25000;]
And one to keep the log files at a manageable size, in /etc/condor-ce/config.d/99-ichep-misc.conf:
# Keep fewer audit log files as these are too big for /var/log
MAX_NUM_COLLECTOR_AUDIT_LOG = 45
MAX_NUM_SCHEDD_AUDIT_LOG = 45
Note: after changing the CE configuration, reload it: condor_ce_reconfig -all
9. All done.
Once the user mapping is done, we can plumb the condor-ce instance to the
backend condor system (this is needed because our central collector isn't on
the CE node):
# echo "JOB_ROUTER_SCHEDD2_POOL=htcmaster02.grid.hep.ph.ic.ac.uk:9618" > /etc/condor-ce/config.d/99-ichep-condor.conf
Now start the condor-ce: systemctl enable condor-ce; systemctl start condor-ce
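A quick sanity check that the pool setting has been picked up:
# condor_ce_config_val JOB_ROUTER_SCHEDD2_POOL
htcmaster02.grid.hep.ph.ic.ac.uk:9618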
10. Some notes on testing
There are a number of ways to test an HTCondorCE, all of which require a UI with both the grid tools and the
condor_ce client commands installed. This isn't readily available on CVMFS yet. We currently have a condor UI on a dedicated cloud node (htcui).
You will need a valid proxy:
source /cvmfs/grid.cern.ch/umd-c7ui-latest/etc/profile.d/setup-c7-ui-example.sh
voms-proxy-init --voms dteam
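Check the proxy and its VOMS attributes with:
voms-proxy-info --all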
Test 1: Ping the CE
You can ping the CE, which tells you the mapped username (great for debugging
authentication problems). The main field is "Remote Mapping", which should have
a real username. A value of "gsi@unmapped" indicates a problem.
$ condor_ce_ping -pool ceprod01.grid.hep.ph.ic.ac.uk:9619 -name ceprod01.grid.hep.ph.ic.ac.uk -verbose
> WARNING: Missing argument, defaulting to DC_NOP
> Remote Version: $CondorVersion: 8.8.7 Dec 24 2019 BuildID: 493225 PackageID: 8.8.7-1 $
> Local Version: $CondorVersion: 8.8.7 Dec 24 2019 BuildID: 493225 PackageID: 8.8.7-1 $
> Session ID: ceprod01:123456:1582280460:3
> Instruction: DC_NOP
> Command: 60011
> Encryption: none
> Integrity: MD5
> Authenticated using: GSI
> All authentication methods: GSI
> Remote Mapping: lt2-dteam000@users.htcondor.org
> Authorized: TRUE
Test 2: Trace the CE
If ping works, you can try using the trace command, which runs a series of
tests all the way through to submitting a simple "env" job, waiting and getting
the output:
$ condor_ce_trace ceprod01.grid.hep.ph.ic.ac.uk
> Testing HTCondor-CE authorization...
...
> - Job was successful
Test 3: A basic test job
A very basic job: condor_ce_run -r ceprod01.grid.hep.ph.ic.ac.uk:9619 /bin/hostname
(should print the name of the worker node it ran on)
A slightly less basic job
Make a sub file:
[centos@htcui ~]$ cat ce_test.sub
# Required for local HTCondor-CE submission
universe = vanilla
use_x509userproxy = true
+Owner = undefined
# grid_resource = condor ceprod01.grid.hep.ph.ic.ac.uk ceprod01.grid.hep.ph.ic.ac.uk:9619
# Files
executable = ce_test.sh
output = ce_test.out
error = ce_test.err
log = ce_test.log
# File transfer behavior
ShouldTransferFiles = YES
WhenToTransferOutput = ON_EXIT
# Run job once
queue
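The executable referenced above isn't shown in these notes; anything simple will do, for example a minimal ce_test.sh along these lines:
[centos@htcui ~]$ cat ce_test.sh
#!/bin/bash
# Minimal payload: report where the job landed and as whom it ran
hostname
id
env | sort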
To submit use:
condor_submit -remote ceprod01.grid.hep.ph.ic.ac.uk -pool ceprod01.grid.hep.ph.ic.ac.uk:9619 ce_test.sub
Submitting job(s).
1 job(s) submitted to cluster 389081.
11. My favourite condor CE commands
condor_ce_q
condor_ce_q -l [jobid]
condor_ce_history [jobid]
condor_ce_rm
systemctl status condor-ce
condor_ce_reconfig
condor_ce_config_val -dump (OK, that's desperation)