Online Operations Manual
HTCondor
HTCondor is a job scheduler for High Throughput Computing.
It finds available machines for executing jobs submitted by a user from a submit/head node, e.g. headnode01 or headnode02 on the ICDS cluster at PSU.
Jobs are matched with idle machines based on their memory, disk, and other requirements, which can be specified in a submit file, named job.sub by convention.
For GstLAL analyses, we use DAGMan workflows.
DAGMan is a powerful tool because an analysis requires different kinds of jobs to run in sequence before it can complete.
The input file for a DAGMan workflow, called *.dag, describes the jobs and sets the PARENT and CHILD dependencies for each job, which tells the scheduler the order in which jobs must be submitted to execute nodes.
Each class of job in the input file has a submit file which specifies the job arguments and requirements.
For GstLAL workflows, the input *.dag has a corresponding *.sh file which lists the bash commands for each job. It is not used by DAGMan, but is a good reference for the user.
An example workflow:
The input file, example.dag, lists two classes of jobs with their dependencies.
JOB gstlal_inspiral_svd_bank.00000 gstlal_inspiral_svd_bank.sub
JOB gstlal_inspiral_svd_bank.00001 gstlal_inspiral_svd_bank.sub
JOB gstlal_svd_bank_checkerboard.00000 gstlal_svd_bank_checkerboard.sub
JOB gstlal_svd_bank_checkerboard.00001 gstlal_svd_bank_checkerboard.sub
PARENT gstlal_inspiral_svd_bank.00000 CHILD gstlal_svd_bank_checkerboard.00000
PARENT gstlal_inspiral_svd_bank.00001 CHILD gstlal_svd_bank_checkerboard.00001
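Each job class above points to a submit file that specifies its executable, arguments, and resource requirements. A minimal sketch of what such a submit file might contain is shown below; the executable path, arguments, and resource requests are hypothetical and will differ in a real analysis.
# gstlal_inspiral_svd_bank.sub -- hypothetical sketch of a submit file
# (paths, arguments, and resource requests below are examples only)
universe = vanilla
executable = /path/to/gstlal_inspiral_svd_bank
arguments = "--example-option value"
request_cpus = 2
request_memory = 4GB
request_disk = 2GB
log = logs/gstlal_inspiral_svd_bank-$(cluster).log
error = logs/gstlal_inspiral_svd_bank-$(cluster).err
output = logs/gstlal_inspiral_svd_bank-$(cluster).out
queue 1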
Submitting and monitoring a DAG
To submit a DAG, the user runs condor_submit_dag example.dag.
This submits one job to the scheduler. This DAGMan job then starts to match and submit jobs to available machines on the cluster.
To see if the DAGMan job is running, try condor_q -dag. The following information is displayed for the user:
[albert.einstein@comp-hd-001 ]$ condor_q -dag
-- Schedd: comp-hd-001.gwave.ics.psu.edu : <10.136.28.201:9618?... @ 01/20/22 23:27:25
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS
gstlalcbc example.dag+17641477 1/19 14:59 0 2 2 4 17641483.0 ... 17664392.0
To display all the jobs in the queue for a specific DAG, run condor_q -dag $JOBID -nobatch.
Replace -nobatch with -run, -idle, or -hold to display the jobs according to their status.
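For example, using the DAGMan job ID from the queue listing above:
condor_q -dag 17641477 -nobatch   # every job in the DAG, one line per job
condor_q -dag 17641477 -run       # only running jobs
condor_q -dag 17641477 -hold      # only held jobs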
HTCondor creates the following files when a user submits a DAGMan job:
example.dag.condor.sub example.dag.dagman.out example.dag.lib.out example.dag.nodes.log
example.dag.dagman.log example.dag.lib.err example.dag.lock
The user can query these files to monitor the workflow’s progress in real time.
example.dag.dagman.out
Running tail -f example.dag.dagman.out shows the latest entries in the file.
To query the status of jobs, especially to check if jobs are failing, try:
$ cat example.dag.dagman.out | grep Failed -A2
01/20/22 23:41:49 Done Pre Queued Post Ready Un-Ready Failed
01/20/22 23:41:49 === === === === === === ===
01/20/22 23:41:49 2 0 2 0 0 0 0
The user can also check the status of jobs and find failed jobs using a text editor, e.g. vi example.dag.dagman.out.
example.dag.nodes.log
This file contains information about job submission, execution and termination.
The user can query the following information using the $JOBID:
$ cat example.dag.nodes.log | grep 17641584
000 (17641584.000.000) 2022-01-19 14:59:38 Job submitted from host: <10.136.28.201:9618?CCBID=10.136.28.213:9618%3fPrivNet%3dGWAVE%26addrs%3d10.136.28.213-9618%26alias%3dcomp-ad-001.gwave.ics.psu.edu%26noUDP%26sock%3dcollector#2154&PrivAddr=%3c10.136.28.201:9618%3fsock%3dschedd_3718687_8da8%3e&PrivNet=GWAVE&addrs=10.136.28.201-9618&alias=comp-hd-001.gwave.ics.psu.edu&noUDP&sock=schedd_3718687_8da8>
001 (17641584.000.000) 2022-01-19 14:59:46 Job executing on host: <10.136.29.65:9618?sock=startd_13997_5b33>
005 (17641584.000.000) 2022-01-19 15:48:30 Job terminated.
It also contains information about the conditions under which the job terminated, and its disk, memory, and CPU usage.
005 (17641584.000.000) 2022-01-19 15:48:30 Job terminated.
(1) Normal termination (return value 0)
Usr 0 03:21:21, Sys 0 00:12:07 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 03:21:21, Sys 0 00:12:07 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
0 - Total Bytes Received By Job
Partitionable Resources : Usage Request Allocated
Cpus : 6.42 2 2
Disk (KB) : 47 47 402424
Memory (MB) : 9348 9000 9088
Job terminated of its own accord at 2022-01-19T20:48:30Z.
Held jobs
To find held jobs and their HOLD_REASON, run:
$ condor_q -dag $JOBID -hold
-- Schedd: comp-hd-001.gwave.ics.psu.edu : <10.136.28.201:9618?... @ 01/21/22 00:06:47
ID OWNER HELD_SINCE HOLD_REASON
17654621.0 albert.einstein 1/21 00:06 The job attribute PeriodicHold expression '(MemoryUsage >= ((RequestMemory) * 3 / 2))' evaluated to TRUE
Total for query: 1 jobs; 0 completed, 0 removed, 0 idle, 0 running, 1 held, 0 suspended
Total for all users: 8750 jobs; 0 completed, 0 removed, 4096 idle, 4647 running, 7 held, 0 suspended
The user can edit the job requirements for the held job using condor_qedit and subsequently release the job using condor_release -all.
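For example, for the held job shown above (held because it exceeded its memory request), one reasonable fix is to raise its RequestMemory and release it. The new value of 12000 MB is only illustrative:
condor_qedit 17654621.0 RequestMemory 12000
condor_release 17654621.0
# or release every held job owned by the user:
condor_release -all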
Querying allocations
This section provides commands to quickly get information about the Cpus in an allocation. Please note that this does not take memory or disk availability into account. The subsections below cover each cluster.
CIT has three allocations: Gstlal_Edward, Gstlal_Renee, and Gstlal_Early. The text $ALLOCATION in the following commands can be replaced with one of these names to get information about that allocation.
Cpus total
condor_status -constraint "$ALLOCATION" -af Cpus | xargs | tr ' ' '+' | bc
Cpus in use
condor_status -constraint "$ALLOCATION && SlotType==\"Dynamic\"" -af Cpus | xargs | tr ' ' '+' | bc
Cpus free
condor_status -constraint "$ALLOCATION && SlotType==\"Partitionable\"" -af Cpus | xargs | tr ' ' '+' | bc
On ICDS, the “allocation” is all the slots with SlotID==2.
Cpus total
condor_status -constraint 'SlotID==2' -af Cpus | xargs | tr ' ' '+' | bc
Cpus in use
condor_status -constraint 'SlotID==2 && SlotType=="Dynamic"' -af Cpus | xargs | tr ' ' '+' | bc
Cpus free
condor_status -constraint 'SlotID==2 && SlotType=="Partitionable"' -af Cpus | xargs | tr ' ' '+' | bc
On Nemo, the “allocation” is all the slots with SlotID==1.
Cpus total
condor_status -constraint 'SlotID==1' -af Cpus | xargs | tr ' ' '+' | xargs -I INPUT python3 -c "print(INPUT)"
Cpus in use
condor_status -constraint 'SlotID==1 && SlotType=="Dynamic"' -af Cpus | xargs | tr ' ' '+' | xargs -I INPUT python3 -c "print(INPUT)"
Cpus free
condor_status -constraint 'SlotID==1 && SlotType=="Partitionable"' -af Cpus | xargs | tr ' ' '+' | xargs -I INPUT python3 -c "print(INPUT)"
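If you check these numbers often, a small shell sketch like the following prints all three at once. It uses the ICDS constraint as an example; substitute the appropriate constraint (or allocation name) for your cluster:
CONSTRAINT='SlotID==2'
total=$(condor_status -constraint "$CONSTRAINT" -af Cpus | xargs | tr ' ' '+' | bc)
used=$(condor_status -constraint "$CONSTRAINT && SlotType==\"Dynamic\"" -af Cpus | xargs | tr ' ' '+' | bc)
free=$(condor_status -constraint "$CONSTRAINT && SlotType==\"Partitionable\"" -af Cpus | xargs | tr ' ' '+' | bc)
echo "total=$total in-use=$used free=$free"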
SciTokens
Introduction
There are two types of tokens involved:
- A “bearer token” is used to authenticate to services (gwdatafind, GraceDB, etc.), and is only valid for up to three hours. This is sometimes called an access token, or just a SciToken.
- A “vault token” is used to get a bearer token from the vault server. This has a default lifetime of one week.
HTCondor jobs can get bearer tokens from two types of SciToken issuers:
- The Vault/IGWN issuer is the vault server. This should be thought of as the “default” issuer, and it’s always used for command line operations.
- A Local/AP issuer allows a head node to issue its own bearer tokens for HTCondor jobs (and only for HTCondor jobs). AP issuers are easier to use with HTCondor, but they give everyone the same bearer token, so they can’t give you any special permissions such as uploading to GraceDB as GstLAL.
The type of issuer supported for HTCondor jobs on a head node is determined by admin configuration. Right now, all normal head nodes at LIGO Lab sites (CIT, LHO, LLO) are AP issuers, and all head nodes at other sites use the IGWN/Vault issuer. The gstlal node at CIT uses the Vault issuer to support online analyses. The login message at LIGO Lab sites shows the issuer for each node.
How it works
Getting tokens
The htgettoken command tries to use an existing vault token to get a bearer token from vault. If there is no valid vault token, htgettoken must authenticate using either Kerberos or the “OIDC workflow” (browser authentication) to get a vault token. The vault token can then be reused to get new bearer tokens with no other authentication until the vault token expires.
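For example, the same command shown later in “Getting a bearer token for your terminal” can be run at any time; the first run may prompt for Kerberos or browser (OIDC) authentication, and subsequent runs within the vault token’s lifetime reuse the stored vault token without prompting:
htgettoken -a vault.ligo.org -i igwn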
HTCondor
HTCondor runs htgettoken periodically to give a bearer token to a job and keep it valid for as long as the job runs. To do this, HTCondor stores its own vault token internally. That internal vault token needs to be renewed periodically so that HTCondor can keep requesting bearer tokens for jobs.
The condor_vault_storer command (soon to be replaced by igwn-robot-get) tells HTCondor to run htgettoken. It also copies HTCondor’s internal tokens to their default locations.
Files
Default file locations:
- Vault token: /tmp/vt_u$(id -u)
- Bearer token: /run/user/$(id -u)/bt_u$(id -u) (deleted when you log out)
  - If this directory doesn’t exist, the bearer token is put here instead: /tmp/bt_u$(id -u)
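A quick way to check whether these tokens currently exist on the machine you are logged in to (a simple one-liner, not part of the token tooling):
ls -l /tmp/vt_u$(id -u) /run/user/$(id -u)/bt_u$(id -u) /tmp/bt_u$(id -u) 2>/dev/null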
Note that if a token needs to be accessible from inside a container, you must bind its directory when entering the container. For example:
singularity shell -B /run my-container.sif
GstLAL
The GstLAL config parser supports a use-scitokens option in the condor section, which formats the appropriate lines in submit files based on its value and the account being used.
Getting a bearer token for your terminal
Personal accounts
htgettoken -a vault.ligo.org -i igwn
Shared accounts
~/get_gstlal_scitoken.sh
Using SciTokens with HTCondor
With a Local/AP issuer
Put this in your GstLAL configuration file:
condor:
use-scitokens: true
With an IGWN/Vault issuer
First, have HTCondor store a vault token:
condor_vault_storer "igwn&options=--vaulttokenttl 1000000s --vaulttokenminttl 999999s"
Then put this in your GstLAL configuration file:
condor:
use-scitokens: <credkey>
For a personal account, your credkey is your albert.einstein username. For a shared account, use one of our robot credkeys.
Account | Cluster | Credkey |
---|---|---|
gstlalcbc.online | CIT | gstlalcbc_online_cit-scitoken/robot/gstlal.ligo.caltech.edu |
gstlalcbc | ICDS | gstlalcbc_icds/robot/ligo-hd-01.gwave.ics.psu.edu |
gstlalcbc.online | UWM | gstlalcbc_online_uwm-scitoken/robot/submit.nemo.uwm.edu |
gstlalcbc.offline | CIT | gstlalcbc_offline/robot/gstlal.ligo.caltech.edu |
gstlalcbc.offline | ICDS | gstlalcbc_offline_psu-scitoken/robot/ligo-hd-01.gwave.ics.psu.edu |
gstlalcbc.offline | UWM | gstlalcbc_offline_uwm-scitoken/robot/submit.nemo.uwm.edu |
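For example, for the shared gstlalcbc account on the ICDS cluster, the configuration section would use the credkey from the table above:
condor:
  use-scitokens: gstlalcbc_icds/robot/ligo-hd-01.gwave.ics.psu.edu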
Some useful commands
- condor_ssh_to_job $JOBID: ssh into the execute node where the job is running.
- condor_watch_q: interactive mode for condor_q.
- condor_qedit albert.einstein -constraint "jobstatus==5" RequestMemory 5000: edits the requested memory for held jobs submitted by user albert.einstein.
- condor_release -all: releases all held jobs.
For more information, see the HTCondor Users' Manual.
Grafana
Grafana is a web app that we use to visualize data and monitor the status of our online analyses. The Grafana dashboard gets data from Influx databases, which are specified in the configuration of the online analyses before launching. Data is continually updated so we can see the evolving status of the analysis at any time. The dashboards are loaded from JSON files that specify everything you see - all the panels and their position within the dashboard, as well as the queries they use to get data and the visual configurations.
Below are some basic tips on how to use Grafana.
How to
General usage tips
- You can change how often the page automatically refreshes by choosing a time value from the dropdown menu next to the refresh icon in the top right corner.
- To view a larger version of a full panel, select the dropdown menu next to the panel name and select “view”.
- To zoom in on a certain time range of a panel, just click and drag over the section of the panel you want to see.
- You can toggle on/off different data series on a panel by clicking their name in the legend.
- To view a panel description, mouse over the info icon in the top left corner of the panel.
Set up a new dashboard
To set up a new dashboard from scratch, go to https://gstlal.ligo.caltech.edu/grafana/ and select the “plus” icon on the left-hand side, then select “Dashboard”. You can then give your dashboard a name (upper left) and proceed to the next section for tips on adding panels.
It is more likely that you will want to open a new dashboard by importing an existing JSON template. For example, we keep JSON dashboard templates checked in to the online-analysis repo for each of the typical analysis configurations. Instructions for importing and setting up these dashboards are found in the Deployment Section. FIXME Fix this link.
Add or edit panels
- To add a new panel, you can either:
  - start from scratch by selecting the “add panel” icon (top right, next to the “save” button), then select “add empty panel” and configure the panel type, queries, and panel options; or
  - duplicate an existing panel by selecting the dropdown menu next to the panel name and choosing “More->Duplicate”, then from the dropdown menu of the new panel select “Edit” and configure the panel queries, options, etc.
- Panel types: There are many built-in panel types available to use. You can find these on the Edit page of any panel, at the top of the right-hand pane. The most common panel type we use is the timeseries panel. If you want a kind of panel that doesn’t exist in this list, you can either:
  - write a new custom Plotly panel yourself, or
  - find an existing Grafana panel and ask Ron Tapia to install it.
- Editing panels:
  - Adding new queries: there is a “+ Query” button on the bottom left. The queries are written in an SQL-like language; if you’re not familiar with writing queries like this, it is easiest to use the visual editor mode and make edits to an existing similar panel until you get what you want.
  - Adding a title and description: you can fill these in on the top right of the edit page. Descriptions are viewed by mousing over the “i” icon in the top left corner of a panel.
  - Changing legend names and display colors: these options are configured under “Panel Options” on the right side of the edit page.
  - Display section: toggle lines, points, and area fill of timeseries.
  - Series overrides: here you can change the display properties of a single time series. Select “add series override”, select the series name from the menu next to “Alias or regex”, then select the plus button and choose the overrides you want from the dropdown.
  - Axes section: you can toggle axes on/off and select units, scale, min/max values, and axis labels.
Debugging
- How to find the console output
Save and export dashboard
If you are developing a Grafana dashboard for the online analyses, you should save and export the JSON template and keep it up to date in the online-analysis repo. To do this:
- Click the “save dashboard” icon in the upper right corner once you have made a change.
- Click the “share dashboard” icon in the upper left corner.
- Then choose “export” and “save to file”.
- This will download a JSON file to your computer. Check this file into one of the online analysis branches as web/online_dashboard.json, then commit and push the file.
Git
Git is a tool we use for version control and collaborative work. Our philosophy is that everything we do (code written, configurations used, project organization, and to-do tasks) should be tracked in git.
Please see notes on the contributing workflow for GstLAL. This process should be followed for any GstLAL or GstLAL related development work.
Other documentation
GraceDb
GraceDb is extensively documented. See the reference manual. Specifically, see information on querying for events and superevents via the web interface, and information on using the GraceDb client via Python.
EM Follow Docs
WRITE ME
How to decide your accounting group
Use this link: https://ldas-gridmon.ligo.caltech.edu/ldg_accounting/user
Contacts
See the list of LVC mailing lists at https://sympa.ligo.org/. Members of the GstLAL group should at least subscribe to:
gstlal-discuss@sympa.ligo.org
- reminders about regular GstLAL emails
- discussions about GstLAL development, proposed projects, and getting help with errors/blockers
cbc@sympa.ligo.org
- wider cbc group discussions and meeting reminders
For questions about the clusters you can always email ldas_admin_all@ligo.caltech.edu, ldas_admin_cit@ligo.caltech.edu, ldas_admin_llo@ligo.caltech.edu, and ldas_admin_lho@ligo.caltech.edu.