Online Operations Manual


HTCondor

HTCondor is a job scheduler for High Throughput Computing. It finds available machines for executing jobs submitted by a user from a submit/head node, e.g. headnode01 or headnode02 on the ICDS cluster at PSU. Jobs are matched with idle machines based on their memory, disk, and other requirements, which can be specified in a submit file, conventionally named job.sub.
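
As an illustration, a minimal submit file for a single job might look like the following; the executable name, arguments, and resource values here are hypothetical, not taken from an actual GstLAL workflow:

executable     = run_analysis.sh
arguments      = --segment 0
request_cpus   = 2
request_memory = 4GB
request_disk   = 1GB
log            = job.log
output         = job.out
error          = job.err
queue

The job would then be submitted with condor_submit job.sub.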

For GstLAL analyses, we use DAGMan workflows. Because the analysis requires different kinds of jobs to run in sequence, DAGMan is a powerful tool. The input file for a DAGMan workflow, called *.dag, describes the jobs and sets the PARENT and CHILD dependencies between them, which tells the scheduler the order in which jobs must be submitted to execute nodes. Each class of job in the input file has a submit file which specifies the job arguments and requirements.
For GstLAL workflows, the input *.dag has a corresponding *.sh file which lists the bash commands for each job. It is not used by DAGMan, but it is a useful reference for the user.

An example workflow:

The input file, example.dag, lists two classes of jobs and their dependencies.

JOB  gstlal_inspiral_svd_bank.00000  gstlal_inspiral_svd_bank.sub
JOB  gstlal_inspiral_svd_bank.00001  gstlal_inspiral_svd_bank.sub
JOB  gstlal_svd_bank_checkerboard.00000  gstlal_svd_bank_checkerboard.sub
JOB  gstlal_svd_bank_checkerboard.00001  gstlal_svd_bank_checkerboard.sub
PARENT gstlal_inspiral_svd_bank.00000 CHILD gstlal_svd_bank_checkerboard.00000
PARENT gstlal_inspiral_svd_bank.00001 CHILD gstlal_svd_bank_checkerboard.00001

Submitting and monitoring a DAG

To submit a DAG, the user runs condor_submit_dag example.dag.

This submits a single DAGMan job to the scheduler. The DAGMan job then submits the workflow’s jobs, and the scheduler matches them to available machines on the cluster.

To see if the DAGMan job is running, try condor_q -dag. The following information is displayed for the user.

[albert.einstein@comp-hd-001 ]$ condor_q -dag


-- Schedd: comp-hd-001.gwave.ics.psu.edu : <10.136.28.201:9618?... @ 01/20/22 23:27:25
OWNER     BATCH_NAME             SUBMITTED   DONE   RUN    IDLE  TOTAL   JOB_IDS
gstlalcbc example.dag+17641477   1/19 14:59   0      2      2      4    17641483.0 ... 17664392.0

To display all the jobs in the queue for a specific DAG, run condor_q -dag $JOBID -nobatch. Replace -nobatch with -run, -idle, or -hold to display the jobs according to their status.
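
For example:

condor_q -dag $JOBID -nobatch   # all jobs in the DAG
condor_q -dag $JOBID -run       # only running jobs
condor_q -dag $JOBID -idle      # only idle jobs
condor_q -dag $JOBID -hold      # only held jobs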

HTCondor creates the following files when a user submits a DAGMan job:

example.dag.condor.sub  example.dag.dagman.out  example.dag.lib.out  example.dag.nodes.log
example.dag.dagman.log  example.dag.lib.err     example.dag.lock

The user can query these files to monitor the workflow’s progress in real time.

example.dag.dagman.out

tail -f example.dag.dagman.out shows the latest entries in the file.

To query the status of jobs, especially to check if jobs are failing, try:

$ cat example.dag.dagman.out | grep Failed -A2

01/20/22 23:41:49  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
01/20/22 23:41:49   ===     ===      ===     ===     ===        ===      ===
01/20/22 23:41:49     2       0        2       0       0          0        0

The user can also check the status of jobs and find failed jobs using a text editor, e.g. vi example.dag.dagman.out.

example.dag.nodes.log

This file contains information about job submission, execution, and termination. The user can query it for a particular job using the $JOBID:

$ cat example.dag.nodes.log | grep 17641584 

000 (17641584.000.000) 2022-01-19 14:59:38 Job submitted from host: <10.136.28.201:9618?CCBID=10.136.28.213:9618%3fPrivNet%3dGWAVE%26addrs%3d10.136.28.213-9618%26alias%3dcomp-ad-001.gwave.ics.psu.edu%26noUDP%26sock%3dcollector#2154&PrivAddr=%3c10.136.28.201:9618%3fsock%3dschedd_3718687_8da8%3e&PrivNet=GWAVE&addrs=10.136.28.201-9618&alias=comp-hd-001.gwave.ics.psu.edu&noUDP&sock=schedd_3718687_8da8>
001 (17641584.000.000) 2022-01-19 14:59:46 Job executing on host: <10.136.29.65:9618?sock=startd_13997_5b33>
005 (17641584.000.000) 2022-01-19 15:48:30 Job terminated.

It also contains information about the conditions under which the job terminated, and its disk, memory, and CPU usage.

005 (17641584.000.000) 2022-01-19 15:48:30 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 03:21:21, Sys 0 00:12:07  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 03:21:21, Sys 0 00:12:07  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
        0  -  Total Bytes Sent By Job
        0  -  Total Bytes Received By Job
        Partitionable Resources :    Usage  Request Allocated
           Cpus                 :     6.42        2         2
           Disk (KB)            :    47          47    402424
           Memory (MB)          :  9348        9000      9088

        Job terminated of its own accord at 2022-01-19T20:48:30Z.

Held jobs

To find held jobs, and their HOLD_REASON run:

$ condor_q -dag $JOBID -hold

-- Schedd: comp-hd-001.gwave.ics.psu.edu : <10.136.28.201:9618?... @ 01/21/22 00:06:47
 ID          OWNER          HELD_SINCE  HOLD_REASON
17654621.0   albert.einstein       1/21 00:06 The job attribute PeriodicHold expression '(MemoryUsage >= ((RequestMemory) * 3 / 2))' evaluated to TRUE

Total for query: 1 jobs; 0 completed, 0 removed, 0 idle, 0 running, 1 held, 0 suspended 
Total for all users: 8750 jobs; 0 completed, 0 removed, 4096 idle, 4647 running, 7 held, 0 suspended

The user can edit the requirements of a held job using condor_qedit and subsequently release it with condor_release (or release all held jobs at once with condor_release -all).
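
For example, to raise the memory request of the held job shown above and then release it (the new memory value, in MB, is illustrative):

$ condor_qedit 17654621.0 RequestMemory 12000
$ condor_release 17654621.0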

Querying allocations

This section provides commands to quickly get information about the Cpus in an allocation. Please note that this does not take into account memory or disk availability. Select a cluster below:


CIT has three allocations: Gstlal_Edward, Gstlal_Renee, and Gstlal_Early. The text $ALLOCATION in the following commands can be replaced with one of these names to get information about that allocation.

Cpus total

condor_status -constraint "$ALLOCATION" -af Cpus | xargs | tr ' ' '+' | bc

Cpus in use

condor_status -constraint "$ALLOCATION && SlotType==\"Dynamic\"" -af Cpus | xargs | tr ' ' '+' | bc

Cpus free

condor_status -constraint "$ALLOCATION && SlotType==\"Partitionable\"" -af Cpus | xargs | tr ' ' '+' | bc


On ICDS, the “allocation” is all the slots with SlotID==2.

Cpus total

condor_status -constraint 'SlotID==2' -af Cpus | xargs | tr ' ' '+' | bc

Cpus in use

condor_status -constraint 'SlotID==2 && SlotType=="Dynamic"' -af Cpus | xargs | tr ' ' '+' | bc

Cpus free

condor_status -constraint 'SlotID==2 && SlotType=="Partitionable"' -af Cpus | xargs | tr ' ' '+' | bc


On Nemo, the “allocation” is all the slots with SlotID==1.

Cpus total

condor_status -constraint 'SlotID==1' -af Cpus | xargs | tr ' ' '+' | xargs -I INPUT python3 -c "print(INPUT)"

Cpus in use

condor_status -constraint 'SlotID==1 && SlotType=="Dynamic"' -af Cpus | xargs | tr ' ' '+' | xargs -I INPUT python3 -c "print(INPUT)"

Cpus free

condor_status -constraint 'SlotID==1 && SlotType=="Partitionable"' -af Cpus | xargs | tr ' ' '+' | xargs -I INPUT python3 -c "print(INPUT)"
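
The same pattern can be wrapped in a small helper function. This is a sketch of our own, not part of HTCondor, and it assumes bc is available on the head node (which is why the Nemo commands above use python3 instead):

# Sum the Cpus attribute over all slots matching a ClassAd constraint.
sum_cpus() {
    condor_status -constraint "$1" -af Cpus | paste -sd+ - | bc
}

sum_cpus 'SlotID==2'                                # total (ICDS example)
sum_cpus 'SlotID==2 && SlotType=="Dynamic"'         # in use
sum_cpus 'SlotID==2 && SlotType=="Partitionable"'   # free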

SciTokens

Introduction

There are two types of tokens involved:

  • A “bearer token” is used to authenticate to services (gwdatafind, GraceDB, etc.), and is only valid for up to three hours. This is sometimes called an access token, or just a SciToken.
  • A “vault token” is used to get a bearer token from the vault server. This has a default lifetime of one week.

HTCondor jobs can get bearer tokens from two types of SciToken issuers:

  • The Vault/IGWN issuer is the vault server. This should be thought of as the “default” issuer, and it’s always used for command line operations.
  • A Local/AP issuer allows a head node to issue its own bearer tokens for HTCondor jobs (and only for HTCondor jobs). AP issuers are easier to use with HTCondor, but they give everyone the same bearer token, so they can’t give you any special permissions such as uploading to GraceDB as GstLAL.

The type of issuer supported for HTCondor jobs on a head node is determined by admin configuration. Right now, all normal head nodes at LIGO Lab sites (CIT, LHO, LLO) are AP issuers, and all head nodes at other sites use the IGWN/Vault issuer. The gstlal node at CIT uses the Vault issuer to support online analyses. The login message at LIGO Lab sites shows the issuer for each node.

How it works

Getting tokens

The htgettoken command tries to use an existing vault token to get a bearer token from vault. If there is no valid vault token, htgettoken must authenticate using either Kerberos or the “OIDC workflow” (browser authentication) to get a vault token. The vault token can then be reused to get new bearer tokens with no other authentication until the vault token expires.

HTCondor

HTCondor runs htgettoken periodically to give a bearer token to a job and keep it valid for as long as the job runs. To do this, HTCondor stores its own vault token internally. That internal vault token needs to be renewed periodically so that HTCondor can keep requesting bearer tokens for jobs.

The condor_vault_storer command (soon to be replaced by igwn-robot-get) tells HTCondor to run htgettoken. It also copies HTCondor’s internal tokens to their default locations.

Files

Default file locations:

  • Vault token: /tmp/vt_u$(id -u)
  • Bearer token: /run/user/$(id -u)/bt_u$(id -u) (deleted when logged out)
    • If this directory doesn’t exist, the bearer token is put here: /tmp/bt_u$(id -u)
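
To check whether you currently have tokens at these default locations, for example:

ls -l /tmp/vt_u$(id -u) /run/user/$(id -u)/bt_u$(id -u)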

Note that if a token needs to be accessible from inside a container, you must bind its directory when entering the container. For example:

singularity shell -B /run my-container.sif

GstLAL

The GstLAL config parser supports a use-scitokens option in the condor section which formats appropriate lines in submit files based on its value and the account being used.

Getting a bearer token for your terminal

Personal accounts

htgettoken -a vault.ligo.org -i igwn

Shared accounts

~/get_gstlal_scitoken.sh

Using SciTokens with HTCondor

With a Local/AP issuer

Put this in your GstLAL configuration file:

condor:
  use-scitokens: true

With an IGWN/Vault issuer

First, have HTCondor store a vault token:

condor_vault_storer "igwn&options=--vaulttokenttl 1000000s --vaulttokenminttl 999999s"

Then put this in your GstLAL configuration file:

condor:
  use-scitokens: <credkey>

For a personal account, your credkey is your albert.einstein username. For a shared account, use one of our robot credkeys.

Account             Cluster  Credkey
gstlalcbc.online    CIT      gstlalcbc_online_cit-scitoken/robot/gstlal.ligo.caltech.edu
gstlalcbc           ICDS     gstlalcbc_icds/robot/ligo-hd-01.gwave.ics.psu.edu
gstlalcbc.online    UWM      gstlalcbc_online_uwm-scitoken/robot/submit.nemo.uwm.edu
gstlalcbc.offline   CIT      gstlalcbc_offline/robot/gstlal.ligo.caltech.edu
gstlalcbc.offline   ICDS     gstlalcbc_offline_psu-scitoken/robot/ligo-hd-01.gwave.ics.psu.edu
gstlalcbc.offline   UWM      gstlalcbc_offline_uwm-scitoken/robot/submit.nemo.uwm.edu
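
For example, for the gstlalcbc account on ICDS the configuration would read:

condor:
  use-scitokens: gstlalcbc_icds/robot/ligo-hd-01.gwave.ics.psu.edu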

Some useful commands

  • condor_ssh_to_job $JOBID: ssh into the execute node where the job is running.

  • condor_watch_q: Interactive mode for condor_q

  • condor_qedit albert.einstein -constraint "jobstatus==5" RequestMemory 5000 edits the requested memory for held jobs submitted by user albert.einstein.

  • condor_release -all releases all held jobs.

For more information, see the HTCondor Users’ Manual.

Grafana

Grafana is a web app that we use to visualize data and monitor the status of our online analyses. The Grafana dashboard gets data from Influx databases, which are specified in the configuration of the online analyses before launching. Data is continually updated so we can see the evolving status of the analysis at any time. The dashboards are loaded from JSON files that specify everything you see - all the panels and their position within the dashboard, as well as the queries they use to get data and the visual configurations.

Below are some basic tips on how to use Grafana.

How to

General usage tips

  • You can change how often the page automatically refreshes by choosing a time value from the dropdown menu next to the refresh icon in the top right corner.
  • To view a larger version of a full panel, select the dropdown menu next to the panel name and select “view”.
  • To zoom in on a certain time range of a panel, just click and drag over the section of the panel you want to see.
  • You can toggle on/off different data series on a panel by clicking their name in the legend.
  • To view a panel description, mouse over the info icon in the top left corner of the panel.

Set up a new dashboard

To set up a new dashboard from scratch, go to https://gstlal.ligo.caltech.edu/grafana/, select the “plus” icon on the left-hand side, then select “Dashboard”. You can then give your dashboard a name (upper left) and proceed to the next section for tips on adding panels.

It is more likely that you will want to open a new dashboard by importing an existing JSON template. For example, we keep JSON dashboard templates checked in to the online-analysis repo for each of the typical analysis configurations. Instructions for importing and setting up these dashboards are found in the Deployment Section. FIXME Fix this link.

Add or edit panels

  1. To add a new panel, you can either:
  • start from scratch by selecting the “add panel” icon (top right, next to the “save” button)
    • then select “add empty panel” and configure the panel type, queries, and panel options.
  • duplicate an existing panel by selecting the dropdown menu next to the panel name, then selecting “More->Duplicate”
    • then from the dropdown menu of the new panel, select “Edit” and configure the panel queries, options, etc.
  2. Panel types: There are many built-in panel types available to use. You can find these on the Edit page of any panel, at the top of the right-hand pane. The most common panel type we use is the timeseries panel. If you want a kind of panel that doesn’t exist in this list, you can either:
  • write a new custom Plotly panel yourself
  • find an existing Grafana panel and ask Ron Tapia to install it.
  3. Editing panels:
  • adding new queries: There is a “+ Query” button on the bottom left. The queries are written in an SQL-like query language; if you’re not familiar with writing queries like this, it is easiest to use the visual editor mode and make edits to an existing similar panel until you get what you want.
  • add title and description: you can fill these in on the top right of the edit page. Descriptions are viewed by mousing over the “i” icon in the top left corner of a panel.
  • change legend names, display colors: these options are configured under “Panel Options” on the right side of the edit page
    • Display section: toggle lines, points, and area fill of timeseries
    • Series overrides: here you can change the display properties of a single time series. Select “add series override”, select the series name from the menu next to “Alias or regex”, then select the plus button and choose the overrides you want from the dropdown.
    • Axes section: you can toggle axes on/off and select units, scale, min/max values, and axes labels.

Debugging

  • How to find the console output

Save and export dashboard

If you are developing a Grafana dashboard for the online analyses, you should save and export the JSON template and keep it up to date in the online-analysis repo. To do this you will:

  1. Click the “save dashboard” icon in the upper right corner once you have made a change.
  2. Click the “share dashboard” icon in the upper left corner.
  3. Then choose “export” and “save to file”.
  4. This will download a JSON file to your computer; check this file into one of the online analysis branches as web/online_dashboard.json, then commit and push it (see the example below).
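
A typical sequence (the commit message is illustrative):

git add web/online_dashboard.json
git commit -m "Update online analysis dashboard template"
git push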

Git

Git is a version control system used for collaborative work. Our philosophy is that everything we do (code written, configurations used, project organization, and to-do tasks) should be tracked in git.

Please see the notes on the contributing workflow for GstLAL. This process should be followed for any GstLAL or GstLAL-related development work.

Other documentation

GraceDb

GraceDb is extensively documented. See the reference manual. Specifically, see information on querying for events and superevents via the web interface, and information on using the GraceDb client via Python.

EM Follow Docs

WRITE ME

How to decide your accounting group

Use this link: https://ldas-gridmon.ligo.caltech.edu/ldg_accounting/user

Contacts

See the list of LVC mailing lists at https://sympa.ligo.org/. Members of the GstLAL group should at least subscribe to:

  • gstlal-discuss@sympa.ligo.org
    • reminders about regular GstLAL meetings
    • discussions about GstLAL development, proposed projects, and getting help with errors/blockers
  • cbc@sympa.ligo.org
    • wider cbc group discussions and meeting reminders

For questions about the clusters you can always email: ldas_admin_all@ligo.caltech.edu, ldas_admin_cit@ligo.caltech.edu, ldas_admin_llo@ligo.caltech.edu, and ldas_admin_lho@ligo.caltech.edu.