Online Operations Manual
Working with ligo-scald
ligo-scald is a software package used for interacting with Kafka and Influx DB.
In GstLAL, instead of reading or writing data to Kafka directly with the confluent_kafka Python package, we use ligo-scald as a convenience wrapper.
Similarly, rather than duplicating the code needed to write data directly to an Influx database with line protocol, we use functions in ligo-scald to do this for us.
configuration
We use configuration yaml files to define the schemas that will get written to an Influx database. Additionally, the configuration file includes the database name, hostname, and port along with some other default properties.
The schemas define the structure of our data.
Each schema includes a measurement name. The column field defines the name of the column where data values are stored in the measurement.
There are two general structures of measurement that we can use: time series and triggers.
Time series are measurements in which each entry consists of a single time and data field.
In ligo-scald syntax, we always use a column called “data” for time series data.
Trigger measurements are structured like dictionaries, where each entry has a time field in addition to an arbitrary number of additional data fields.
Tags are strings of metadata associated with a measurement.
Tags are used for grouping data in queries, and for each tag key there should be a finite number of possible tag values.
In ligo-scald, we specify an aggregate field which defines a function to aggregate data by (for example min or max).
There is also a field in the schema called tag_key.
This field specifies which tag key to group data by when aggregating.
For example,
range_history:
  measurement: range_history
  column: data
  tag: ifo
  tag_key: ifo
  aggregate: max
  ifo:
    - H1
    - L1
    - V1
In the above schema, the numerical values are stored in the “data” column of the “range_history” measurement. These data are aggregated by maximum value in each IFO. In other words, the range history data in H1, L1, and V1 are isolated from each other and each are aggregated individually.
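For instance, using the query syntax described in the Working with Influx DB section below, the per-second maxima of the H1 range history could be retrieved with a command along these lines (a sketch only; the database name foo_online and the 1s retention policy here are assumptions):
curl -k -u $INFLUX_USERNAME:$INFLUX_PASSWORD 'https://influxdb.ligo.caltech.edu:8086/query?pretty=true' --data-urlencode "q=SELECT data FROM foo_online.\"1s\".range_history WHERE aggregate='max' AND ifo='H1'"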
Working with Kafka
The kafka server at CIT is: kafkagstlal.ldas.cit:9196
Note: The kafka server at ICDS is: rtdb-01.gwave.ics.psu.edu:9196
A useful tool for connecting to kafka is kcat. It’s not installed at CIT, but you can run it using the scimma/client container. To build and start the container:
singularity build -s scimma_client.sif docker://scimma/client:latest
singularity run scimma_client.sif /bin/bash
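You can also run kcat through the container non-interactively, for example (a sketch):
singularity exec scimma_client.sif kcat -L -b kafkagstlal.ldas.cit:9196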
Here are some examples of what you can do with kcat:
List topics:
kcat -L -b kafkagstlal.ldas.cit:9196
This shows all topics which have ever been written to on the kafka server and will be way too much output to be useful. It is better to grep for the analysis tag that you are interested in to see just the topics that your analysis is writing to:
kcat -L -b kafkagstlal.ldas.cit:9196 | grep my_tag
Consume the first 5 messages on the gstlal.foo_online.L1_snr_history topic:
kcat -C -b kafkagstlal.ldas.cit:9196 -t 'gstlal.foo_online.L1_snr_history' -c 5
Consume the last 5 messages on the gstlal.foo_online.L1_snr_history topic:
kcat -C -b kafkagstlal.ldas.cit:9196 -t 'gstlal.foo_online.L1_snr_history' -o -5 -c 5
Consume the last message on the ‘gstlal.foo_online.L1_snr_history’ topic (on each partition) and then continue reading new messages:
kcat -C -b kafkagstlal.ldas.cit:9196 -t 'gstlal.foo_online.L1_snr_history' -o -1
If there is “science” data, you should get a constant stream of messages. If you don’t get a stream of messages, either there is really no data coming into the analysis or there is a problem with the analysis.
Consume all messages on the ‘gstlal.foo_online.L1_snr_history’ topic and exit when you hit the end instead of waiting (you might want to redirect this command to a file):
kcat -C -b kafkagstlal.ldas.cit:9196 -t 'gstlal.foo_online.L1_snr_history' -e
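For example, to dump everything currently on the topic to a file (the output filename here is arbitrary):
kcat -C -b kafkagstlal.ldas.cit:9196 -t 'gstlal.foo_online.L1_snr_history' -e > L1_snr_history_dump.txt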
Consume the last message on the ‘gstlal.foo_online.L1_snr_history’ topic and print the key:
kcat -C -b kafkagstlal.ldas.cit:9196 -t 'gstlal.foo_online.L1_snr_history' -o -1 -K ::
Use keys and grep to retrieve data from a single job:
kcat -C -b kafkagstlal.ldas.cit:9196 -t 'gstlal.foo_online.L1_snr_history' -o -1 -K :: | grep 0000.noninj
Working with Influx DB
This section contains examples of commands that can be run at CIT or ICDS once you have sourced your influx_creds.sh file.
Note: To query an influx database on CIT use the url:
https://influxdb.ligo.caltech.edu:8086
on ICDS use:
https://influxdb.gwave.ics.psu.edu:8086
The example commands in this section will use the CIT url.
accessible databases
To see what databases your credentials can access:
curl -k -u $INFLUX_USERNAME:$INFLUX_PASSWORD 'https://influxdb.ligo.caltech.edu:8086/query?pretty=true' --data-urlencode 'q=SHOW DATABASES'
Sample output:
{
    "results": [
        {
            "statement_id": 0,
            "series": [
                {
                    "name": "databases",
                    "columns": [
                        "name"
                    ],
                    "values": [
                        [
                            "foo_online"
                        ]
                    ]
                }
            ]
        }
    ]
}
If you get authorization failed, then your username or your password is wrong (or at least doesn’t match what is in the influx configuration).
available measurements
You can see the measurements that are available in a database with a command like:
curl -k -u $INFLUX_USERNAME:$INFLUX_PASSWORD 'https://influxdb.ligo.caltech.edu:8086/query?pretty=true' --data-urlencode 'q=SHOW MEASUREMENTS ON my_database'
where my_database is replaced with one of your InfluxDB databases.
The data in these measurements is what appears on the Grafana monitoring dashboards for each GstLAL analysis. Only measurements that actually contain data will appear in this list. However, data stays in an influx database essentially forever, so the above command does not tell you whether a certain measurement is currently being written to.
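If you do want to check whether a measurement is currently being written to, one option is to ask for its most recent point and compare the returned timestamp to the current time. A sketch using the query syntax described below (the database and measurement names are examples):
curl -k -u $INFLUX_USERNAME:$INFLUX_PASSWORD 'https://influxdb.ligo.caltech.edu:8086/query?pretty=true' --data-urlencode "q=SELECT last(data) FROM foo_online.\"1s\".L1_snr_history WHERE aggregate='max'"
The returned time is in nanoseconds since the Unix epoch; if it is recent, the measurement is actively receiving data.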
retention policies
Normally, retention policies define the duration over which data in an Influx DB measurement is kept. In ligo-scald, retention policies are used a bit differently. Note that the syntax for retrieving data from Influx DB is:
{retention_policy}.{measurement}
The retention policies are used to define durations over which to aggregate data. For example, a retention policy of “1s” with an aggregate function of “max” would keep the single maximum data point over each one second window.
You can see the retention policies on a database named foo_online with the command:
curl -k -u $INFLUX_USERNAME:$INFLUX_PASSWORD 'https://influxdb.ligo.caltech.edu:8086/query?pretty=true' --data-urlencode "q=show retention policies on foo_online"
This will return a list of retention policies with the following columns:
"columns": [
    "name",
    "duration",
    "shardGroupDuration",
    "replicaN",
    "default"
],
query over a time range
Here is an example of how you can get data out of InfluxDB. Note that the timestamps in InfluxDB are nanoseconds since the Unix epoch (not GPS time). To query the last 24 hours of data for:
DB = foo_online
measurement = L1_snr_history
retention policy = 1s
you can use the command:
NOW=`date +%s`
NOW=`echo $NOW ' * 10^9' | bc`
THEN=`echo $NOW ' - 86400*10^9' | bc`
curl -k -u $INFLUX_USERNAME:$INFLUX_PASSWORD 'https://influxdb.ligo.caltech.edu:8086/query?pretty=true' --data-urlencode "q=SELECT data from foo_online.\"1s\".L1_snr_history WHERE aggregate='max' AND time >= $THEN AND time <= $NOW"
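The example above anchors the time range to the current Unix time. If you want to start from a GPS time instead (as GstLAL usually reports), convert it to Unix nanoseconds first; a sketch, assuming the current GPS-UTC leap-second offset of 18 s:
# GPS epoch (1980-01-06 00:00:00 UTC) corresponds to Unix time 315964800
GPSTIME=1400000000
GPS_NS=`echo "($GPSTIME + 315964800 - 18) * 10^9" | bc`
The resulting value can then be used in place of $THEN or $NOW in the query above.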
data transformation
You can also perform transformations on the data, for example count how many L1 snr history points have been stored in the last 24 hours:
curl -k -u $INFLUX_USERNAME:$INFLUX_PASSWORD 'https://influxdb.ligo.caltech.edu:8086/query?pretty=true' --data-urlencode "q=SELECT count(data) from foo_online.\"1s\".L1_snr_history WHERE aggregate='max' AND time >= $THEN AND time <= $NOW"
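Other InfluxQL functions follow the same pattern; for example, the mean of the same points (a sketch):
curl -k -u $INFLUX_USERNAME:$INFLUX_PASSWORD 'https://influxdb.ligo.caltech.edu:8086/query?pretty=true' --data-urlencode "q=SELECT mean(data) from foo_online.\"1s\".L1_snr_history WHERE aggregate='max' AND time >= $THEN AND time <= $NOW"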
measurement tags
Tags are strings that hold metadata associated with a measurement.
In Influx syntax, tags are key value pairs.
For the GstLAL analysis, many of the measurements are stored with tags indicating the SVD bin number of the job (svdbin) which produced the data, the job type (job) which is either noninj or inj, and the upper frequency cutoff (fmax) for filtering used by the job that produced the data.
For example, if you wanted data only coming from the SVD bin 0123, you would specify the tag key “svdbin” and value “0123”.
You can view the tags that are available for a given measurement in the scald configuration file. For L1_snr_history in the GstLAL online analysis:
L1_snr_history:
  measurement: L1_snr_history
  column: data
  tags:
    - svdbin
    - job
    - fmax
  tag_key: job
  aggregate: max
Here is an example of using tags to query data from a specific inspiral job and threshold on the value:
curl -k -u $INFLUX_USERNAME:$INFLUX_PASSWORD 'https://influxdb.ligo.caltech.edu:8086/query?pretty=true' --data-urlencode "q=SELECT data from foo_online.\"1s\".L1_snr_history WHERE aggregate='max' AND time >= $THEN AND time <= $NOW AND svdbin = '0000' AND data > 8.0"
If you just want to see which tag keys and values are available for a particular measurement, you can use the following commands:
curl -k -u $INFLUX_USERNAME:$INFLUX_PASSWORD 'https://influxdb.ligo.caltech.edu:8086/query?pretty=true' --data-urlencode 'q=SHOW TAG KEYS ON "gstlal_inspiral_edward_MDC11" FROM "L1_snr_history"'
curl -k -u $INFLUX_USERNAME:$INFLUX_PASSWORD 'https://influxdb.ligo.caltech.edu:8086/query?pretty=true' --data-urlencode 'q=SHOW TAG VALUES ON "gstlal_inspiral_edward_MDC11" FROM "L1_snr_history" WITH KEY = "svdbin"'
Checking likelihood history and ram history
Here are instructions on how to check the likelihood history and ram history of a given job.
In the working dir of the analyses, there are registry.txt files.
As an example, we will use a job with svd bin id 0000.
This command:
cat 0000_inj_mdc_registry.txt
will print out the URL of the node the job is running on, for example http://node978:43873/. Then run
wget http://node978:43873/likelihood_history.txt
then
head likelihood_history.txt
to get the likelihood history.
If one is looking for the ram history, change likelihood_history to ram_history in the commands above.
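The two steps can also be combined; a sketch, assuming the registry file contains only the job's URL:
# read the job's URL from its registry file and fetch the likelihood history
url=`cat 0000_inj_mdc_registry.txt`
wget -q -O - ${url}likelihood_history.txt | head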
miscellaneous
To query an influx database with periods in the name, for example “foo.bar.online” you should enclose the database name with double quotes in the commands above. For example, 'q=SHOW MEASUREMENTS ON "foo.bar.online"' or "q=SELECT data from \"foo.bar.online\".\"1s\".L1_injsnr_missed..."
Icinga Monitoring
Icinga is an open source monitoring and alert infrastructure. It is used across the LVK collaboration to monitor various critical low-latency services, for example data distribution, GWCelery, and the search analyses. The gstlal monitoring statuses are shown on the dashboard here. This page includes status outputs of:
- The GstLAL Grafana server
- The GstLAL low-latency analyses:
  - gstlal-<search>: infers the status of gstlal jobs by querying data from InfluxDB.
  - gstlal-<search>_http: infers the status of gstlal jobs by querying data from bottle.
- The three GstLAL kafka servers on CIT
- The GstLAL influx server on CIT
Each Icinga service group is monitored by a script which outputs a JSON status object. You can find more information about the format of the JSON output in the IGWN monitoring docs. The JSON output is then periodically read by a shibboleth scraper. The configurations for these monitoring services are controlled in the dashboard2 git repo, e.g. gstlal.conf.
See query_influx for error message details.
GstLAL analysis influx test
The status of the GstLAL inspiral jobs is tracked via the gstlal-<search> service groups on the Icinga dashboard.
This check is powered by the query_influx executable, a Python script that queries the most recent RAM-history datapoint from each gstlal-inspiral job without contacting the jobs directly.
A job is marked down if:
- no RAM-history datapoint is found, or
- the most recent datapoint is older than 3 minutes.
Status logic:
- OK: all jobs reporting normally
- WARNING: some jobs are down, but no more than 1%
- CRITICAL: more than 1% of jobs are down
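A purely illustrative sketch of this threshold logic (not the actual query_influx implementation; DOWN and TOTAL are hypothetical job counts):
# map the fraction of down jobs to an Icinga status
DOWN=3
TOTAL=400
if [ "$DOWN" -eq 0 ]; then
    echo "OK"
elif [ `echo "$DOWN * 100 <= $TOTAL" | bc` -eq 1 ]; then
    echo "WARNING"
else
    echo "CRITICAL"
fi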
Setting Up Icinga Monitoring
To enable Icinga monitoring through Influx, copy the inspiral_o4_analysis.sh script or use an existing one such as ~/cron/inspiral_o4a_<analysis_name>.sh.
These scripts provide the analysis-specific configuration for the query_influx executable.
Uncomment and edit the relevant configuration lines and variables.
The output JSON files are written to ~/public_html/nagios.
These JSON files are read by the shib scraper every 4 minutes to populate the Icinga dashboard.
If a JSON file is more than 10 minutes old, the status is marked UNKNOWN.
Example Cron Job
This monitor is intended to run every 3 minutes.
A typical cron configuration is:
# Example: run every 3 minutes
*/3 * * * * tmpfile=/tmp/$(cat /dev/urandom | tr -dc 'a-zA-Z0-9' | fold -w 32 | head -n 1) ; <path to cron dir>/inspiral_o4_<analysis_name>.sh > $tmpfile ; mv $tmpfile <path to nagios output>/inspiral_o4_<analysis_name>.json
This script runs under the gstlalcbc.online account at CIT and NEMO, and the gstlalcbc account at ICDS.
It does not need to run on the same node as the analyses, only on the same cluster.
GstLAL analysis HTTP test
The status of the GstLAL inspiral jobs' http connections is tracked via the gstlal-<search>_http service groups on the Icinga dashboard.
Each GstLAL inspiral job spins up its own http server, which lets users communicate with the job directly through bottle. More importantly, the marginalize_likelihood job contacts each of the GstLAL inspiral jobs via http, so every inspiral job's http server must be reachable.
When the wget command fails (times out), it means that the GstLAL inspiral jobs' http servers are not responding properly.
This issue is usually caused by overloaded CIT hardware that cannot manage all the traffic.
The executable for ICDS is cgi-bin/checkAnalysisHttp, which is written in Perl.
The script in ~/public_html/cgi-bin/http_o4_<analysis_name> provides the analysis-specific information for the executable.
The script counts the number of jobs and checks whether each job responds over http.
Jobs that cannot be reached over http are marked, and the script prints them out in a json file.
When Icinga alerts show that the http test is failing, check the following:
- Confirm that the GstLAL inspiral jobs are not responding to http requests with a wget command such as:
cat 0000_noninj_registry.txt
wget <url>/ram_history.txt
- Check the job-specific dashboard to see if the jobs are still producing data. If so, the jobs are likely outputting data in a healthy manner, but the http connection is failing. If the jobs are not producing data, they might have become zombies.
- Run condor_ssh_to_job <job-id> and then top -u gstlalcbc for Alice/Jacob or top -u gstlalcbc.online for Edward/Charlie. Check that the CPU usage of the GstLAL inspiral job is non-zero. (The state of the job can be “S”.) If the CPU usage is 0, the job is not running properly. If the CPU usage is non-zero, the jobs could be running but the http connection is failing.
- If the GstLAL inspiral jobs are running but the http connection is failing, find the job-id of the GstLAL inspiral job with condor_q -dag <dag-id> --nobatch | grep "job-tag <bin number>_noninj" and run condor_rm <job-id> so that the GstLAL inspiral job gets relaunched on a different node and starts up the http connection again.
Automated Online Operations Bot (Sherlog)
Sherlog is a customizable automation bot designed to support operations related to online gravitational-wave analyses. It automates many of the routine monitoring, diagnostic, and operational tasks typically performed by the ROTA. Sherlog posts updates directly to the GstLAL: Icinga Mattermost channel via an incoming webhook.
All code for Sherlog resides in the o4b-containers repository on the main branch.
To use the bot, clone the repository into your builds directory.
Deployment is handled via cron jobs, which must be configured on the host machine.
It is recommended to stagger all installed cron jobs so as to not overwhelm the system.
NOTE: Make a backup of the current cron file before editing it.
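A simple way to do that (a sketch):
# save a dated copy of the current crontab before opening it for editing
crontab -l > ~/crontab_backup_`date +%Y%m%d`.txt
crontab -e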
Daily Error Check
A daily automated error check helps track the stability of the online analysis and identify issues that Icinga may not catch. Once configured, Sherlog posts a summary message containing the time, job ID, job name, signal number, and error type for each detected issue.
To enable this feature, copy the daily_errors_analysis.sh script, edit the commented-out sections of the script, and install the recommended cron job:
# Example: run once per day at 07:00
00 07 * * * <home directory path>/cron/daily_errors_analysis.sh
Diagnose Failures
Sherlog can diagnose failures in real time, including identifying whether a node is down. This allows the ROTA and analysis operators to understand the current state of the online analysis without logging into the cluster or manually probing logs. This check is intended to run regularly throughout the day.
To enable this feature, copy the diagnose_failures.sh script, edit the commented-out sections of the script, and install the recommended cron job:
# Example: run every 5 minutes
*/5 * * * * <home directory path>/cron/diagnose_failures.sh
NOTE: Diagnose Failures relies on the GstLAL Analysis Influx Test being configured and writing JSON status files to ~/public_html/nagios.
Weekly Takedown
Sherlog can automatically take down the analysis during scheduled weekly maintenance windows. It checks the detector state, brings down the analysis if required (or forcefully), and posts a message explaining the action. A follow-up message confirms whether the takedown was successful. This is especially useful for the ROTA team, who no longer need to manually monitor detector states or initiate the takedown themselves.
To enable this feature, copy the weekly_takedown_analysis.sh script, edit the commented-out sections of the script, and install the recommended cron job:
# Example: run every Tuesday at 10:30
30 10 * * 2 <home directory path>/cron/weekly_takedown_analysis.sh
Weekly Re-whitening
Sherlog can set up and run the weekly re-whitening workflow. This should be configured at the location where re-whitening is normally performed. Through operational experience, the optimal time to run the workflow without interruption was found to be Sundays at 17:00 PT. Automating this step removes the need for the ROTA team to manually prepare and launch the workflow.
To enable this feature, copy the weekly_rewhiten.sh script, edit the commented-out sections of the script, and install the recommended cron job. This task should run once per week; an example cron configuration is:
# Example: run every Sunday at 17:00
0 17 * * 0 <home directory path>/cron/weekly_rewhiten.sh 2>/dev/null 1>/dev/null