Setting up an HTCondorCE 'from scratch' at UKI-LT2-IC-HEP

1. Preliminaries

2. Basic grid stuff

3. Install the HTCondorCE

4. Quarantine the spool directory

# cd /var/lib/condor-ce
# mv spool spool.old
# mkdir spool
# lvcreate -L 20g -n spoolvol ceprod01vg (adjust the CE name as appropriate)
# mkfs.ext4 /dev/mapper/ceprod01vg-spoolvol
# tune2fs -c 0 -i 0 /dev/mapper/ceprod01vg-spoolvol
# Add mapping to fstab:
/dev/mapper/ceprod01vg-spoolvol /var/lib/condor-ce/spool ext4 defaults 1 2
# mount -a
# chown condor:condor spool
The next two commands are only needed if condor has been started previously:
# mv spool.old/* spool/
# mv spool.old/.schedd_* spool/
In either case, continue here:
# rmdir spool.old
# restorecon -r spool
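As a quick optional sanity check (commands are just illustrative), confirm the new volume is mounted and has the expected ownership and SELinux context:
# df -h /var/lib/condor-ce/spool
# ls -ldZ /var/lib/condor-ce/spool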


5. Authentication and user mapping

First install all the necessary packages. We are implementing LCMAPS with a gridmapdir (just like on the CREAMCEs).
yum install lcmaps lcas-lcmaps-gt4-interface lcmaps-plugins-basic lcmaps-plugins-voms
yum install lcas lcas-plugins-basic lcas-plugins-voms
yum install fetch-ban (from our local repo, to retrieve the list of banned users)
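For reference, a rough sketch of how these pieces fit together (the policy below is only illustrative; module option names and paths should be checked against the lcmaps plugin documentation). The lcas-lcmaps-gt4-interface package provides the Globus callout that HTCondor-CE uses for GSI mapping, enabled via /etc/grid-security/gsi-authz.conf:
	globus_mapping liblcas_lcmaps_gt4_mapping.so lcmaps_callout
The mapping policy itself lives in /etc/lcmaps/lcmaps.db, e.g. verify the proxy and then map VOMS attributes to a pool account through the gridmapdir:
	# /etc/lcmaps/lcmaps.db - illustrative only, adjust module options for your site
	path = /usr/lib64/lcmaps

	verify_proxy = "lcmaps_verify_proxy.mod"
	               "-certdir /etc/grid-security/certificates"

	voms_poolaccount = "lcmaps_voms_poolaccount.mod"
	                   "-gridmap /etc/grid-security/grid-mapfile"
	                   "-gridmapdir /etc/grid-security/gridmapdir"

	# policy: verify first, then map
	standard:
	verify_proxy -> voms_poolaccount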

6. Tell the world about it: BDII


7. Accounting: APEL Config

Based loosely on HtCondorCeAccounting at CERN.

8. Some more subtle tweaks
Currently there are two specific tweaks to the router config:
/etc/condor-ce/config.d/99-ichep-router.conf (this overrides 01-ce-router.conf)
	# Edited so that jobs which have been removed are not put back into the held state (3rd condition). This might simply be an upstream bug.
	SYSTEM_PERIODIC_HOLD = (x509userproxysubject =?= UNDEFINED) || (x509UserProxyExpiration =?= UNDEFINED) \
	|| (time() > x509UserProxyExpiration && (JobStatus =!= 2 && JobStatus =!= 3)) \
	|| (RoutedBy is null && JobUniverse =!= 1 && JobUniverse =!= 5 && JobUniverse =!= 7 && JobUniverse =!= 12) \
	|| ((JobStatus =?= 1 && CurrentTime - EnteredCurrentStatus > 1800) && RoutedToJobId is null && RoutedJob =!= true)
    
	# We also need to allow more than the default 2000 queued (idle) jobs
	JOB_ROUTER_DEFAULTS = $(JOB_ROUTER_DEFAULTS) [maxIdleJobs = 25000;]
    
And one to keep the log files at a manageable size, in /etc/condor-ce/config.d/99-ichep-misc.conf:
# Keep fewer audit log files as these are too big for /var/log
MAX_NUM_COLLECTOR_AUDIT_LOG = 45
MAX_NUM_SCHEDD_AUDIT_LOG = 45

Note: after changing the CE configuration, reload it with: condor_ce_reconfig -all.
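To confirm the overrides were picked up after a reconfig, condor_ce_config_val can show the effective value and the file that set it (just a sanity check on the knobs defined above):
# condor_ce_config_val -v SYSTEM_PERIODIC_HOLD
# condor_ce_config_val -v JOB_ROUTER_DEFAULTS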

9. All done.

Once the user mapping is done, we can plumb the condor-ce instance to the backend condor system (this is needed because our central collector isn't on the CE node):
# echo "JOB_ROUTER_SCHEDD2_POOL=htcmaster02.grid.hep.ph.ic.ac.uk:9618" > /etc/condor-ce/config.d/99-ichep-condor.conf
Now enable and start the condor-ce: systemctl enable condor-ce; systemctl start condor-ce
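As a quick optional check that the CE can talk to the backend pool and that the routes parse, something along these lines should work (hostnames as above):
# condor_ce_job_router_info -config
# condor_status -pool htcmaster02.grid.hep.ph.ic.ac.uk:9618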

10. Some notes on testing

There are a number of ways to test HTCondorCE, all of which require the grid + condor_ce UI. This isn't readily available on CVMFS yet. We currently have a condor UI on a dedicated cloud node (htcui).
You will need a valid proxy:
	$ source /cvmfs/grid.cern.ch/umd-c7ui-latest/etc/profile.d/setup-c7-ui-example.sh
	$ voms-proxy-init --voms dteam
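You can check that the proxy was created with the expected VOMS attributes:
	$ voms-proxy-info --all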
Test 1: Ping the CE
You can ping the CE, which tells you the mapped username (great for debugging authentication problems). The main field is "Remote Mapping", which should have a real username. A value of "gsi@unmapped" indicates a problem.
	$ condor_ce_ping -pool ceprod01.grid.hep.ph.ic.ac.uk:9619 -name ceprod01.grid.hep.ph.ic.ac.uk -verbose
	> WARNING: Missing  argument, defaulting to DC_NOP
	  > Remote Version:              $CondorVersion: 8.8.7 Dec 24 2019 BuildID: 493225 PackageID: 8.8.7-1 $
	  > Local  Version:              $CondorVersion: 8.8.7 Dec 24 2019 BuildID: 493225 PackageID: 8.8.7-1 $
	  > Session ID:                  ceprod01:123456:1582280460:3
	  > Instruction:                 DC_NOP
	  > Command:                     60011
	  > Encryption:                  none
	  > Integrity:                   MD5
	  > Authenticated using:         GSI
	  > All authentication methods:  GSI
	  > Remote Mapping:              lt2-dteam000@users.htcondor.org
	  > Authorized:                  TRUE
      

Test 2: Trace the CE
If ping works, you can try using the trace command, which runs a series of tests all the way through to submitting a simple "env" job, waiting and getting the output:
	$ condor_ce_trace ceprod01.grid.hep.ph.ic.ac.uk                            
	> Testing HTCondor-CE authorization...
	...
	> - Job was successful
      

Test 3: A basic test job
A very basic job: condor_ce_run -r ceprod01.grid.hep.ph.ic.ac.uk:9619 /bin/hostname
(should print the name of the worker node it ran on)

A slightly less basic job
Make a sub file:
	[centos@htcui ~]$ cat ce_test.sub 
	# Required for local HTCondor-CE submission
	universe = vanilla
	use_x509userproxy = true
	+Owner = undefined
	
	# grid_resource = condor ceprod01.grid.hep.ph.ic.ac.uk ceprod01.grid.hep.ph.ic.ac.uk:9619

	# Files
	executable = ce_test.sh
	output = ce_test.out
	error = ce_test.err
	log = ce_test.log
	
	# File transfer behavior
	ShouldTransferFiles = YES
	WhenToTransferOutput = ON_EXIT

	# Run job once
	queue
    
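The submit file refers to ce_test.sh, which isn't shown above; any small script will do. A trivial example:
	[centos@htcui ~]$ cat ce_test.sh
	#!/bin/bash
	# print where the job landed and what environment it was given
	hostname
	date
	env | sort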
To submit use:
condor_submit -remote ceprod01.grid.hep.ph.ic.ac.uk -pool ceprod01.grid.hep.ph.ic.ac.uk:9619 ce_test.sub
Submitting job(s).
1 job(s) submitted to cluster 389081.
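Once submitted, the job can be followed from the UI by pointing the queue commands at the CE, e.g.:
	$ condor_ce_q -pool ceprod01.grid.hep.ph.ic.ac.uk:9619 -name ceprod01.grid.hep.ph.ic.ac.uk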

11. My favourite condor CE commands

condor_ce_q
condor_ce_q -l [jobid]
condor_ce_history [jobid]
condor_ce_rm
systemctl status condor-ce
condor_ce_reconfig
condor_ce_config_val -dump (OK, that's desperation)