Online Operations Manual
Working with ligo-scald
ligo-scald is a software package used for interacting with Kafka and InfluxDB.
In GstLAL, instead of reading or writing data to Kafka directly with the confluent_kafka Python package, we use ligo-scald as a convenience wrapper.
Similarly, rather than duplicating the code to write data directly to an InfluxDB database using line protocol, we use functions in ligo-scald to do this for us.
configuration
We use YAML configuration files to define the schemas that will get written to an Influx database. Additionally, the configuration file includes the database name, hostname, and port along with some other default properties.
The schemas define the structure of our data.
Each schema includes a measurement name. The column field defines the name of the column where data values are stored in the measurement.
There are two general structures of measurement that we can use: time series and triggers.
Time series are measurements in which each entry consists of a single time and data field.
In ligo-scald syntax, we always use a column called “data” for time series data.
Trigger measurements are structured like dictionaries, where each entry has a time field in addition to an arbitrary number of additional data fields.
Tags are strings of metadata associated with a measurement.
Tags are used for grouping data in queries, and for each tag key there should be a finite number of possible tag values.
In ligo-scald, we specify an aggregate field, which defines a function used to aggregate data (for example min or max).
There is also a field in the schema called tag_key, which specifies the tag key to group data by when aggregating.
For example,
range_history:
  measurement: range_history
  column: data
  tag: ifo
  tag_key: ifo
  aggregate: max
  ifo:
    - H1
    - L1
    - V1
In the above schema, the numerical values are stored in the “data” column of the “range_history” measurement. These data are aggregated by maximum value in each IFO. In other words, the range history data in H1, L1, and V1 are isolated from each other and each are aggregated individually.
Working with Kafka
The kafka server at CIT is: kafkagstlal.ldas.cit:9196
Note: The kafka server at ICDS is: rtdb-01.gwave.ics.psu.edu:9196
A useful tool for connecting to kafka is kcat. It's not installed at CIT, but you can run it using the scimma/client container. To build and start the container:
singularity build -s scimma_client.sif docker://scimma/client:latest
singularity run scimma_client.sif /bin/bash
Here are some examples of what you can do with kcat:
List topics:
kcat -L -b kafkagstlal.ldas.cit:9196
This shows all topics which have ever been written to on the kafka server and will be way too much output to be useful. It is better to grep for the analysis tag that you are interested in to see just the topics that your analysis is writing to:
kcat -L -b kafkagstlal.ldas.cit:9196 | grep my_tag
Consume the first 5 messages on the gstlal.foo_online.L1_snr_history topic:
kcat -C -b kafkagstlal.ldas.cit:9196 -t 'gstlal.foo_online.L1_snr_history' -c 5
Consume the last 5 messages on the gstlal.foo_online.L1_snr_history topic:
kcat -C -b kafkagstlal.ldas.cit:9196 -t 'gstlal.foo_online.L1_snr_history' -o -5 -c 5
Consume the last message on the ‘gstlal.foo_online.L1_snr_history’ topic (on each partition) and then continue reading new messages:
kcat -C -b kafkagstlal.ldas.cit:9196 -t 'gstlal.foo_online.L1_snr_history' -o -1
If there is “science” data, you should get a constant stream of messages. If you don’t get a stream of messages, either there is really no data coming into the analysis or there is a problem with the analysis.
Consume all messages on the 'gstlal.foo_online.L1_snr_history' topic and exit when you hit the end (instead of waiting for new messages). You might want to redirect the output of this command to a file:
kcat -C -b kafkagstlal.ldas.cit:9196 -t 'gstlal.foo_online.L1_snr_history' -e
Consume the last message on the ‘gstlal.foo_online.L1_snr_history’ topic and print the key:
kcat -C -b kafkagstlal.ldas.cit:9196 -t 'gstlal.foo_online.L1_snr_history' -o -1 -K ::
Use keys and grep to retrieve data from a single job:
kcat -C -b kafkagstlal.ldas.cit:9196 -t 'gstlal.foo_online.L1_snr_history' -o -1 -K :: | grep 0000.noninj
Working with Influx DB
This section contains examples of commands that can be run at CIT or ICDS once you have sourced your influx_creds.sh file.
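The exact contents of this file depend on your account, but it is assumed to simply export the credentials used by the curl commands below, e.g.:
# Hypothetical contents of influx_creds.sh; the variable names match
# those used by the curl commands in this section.
export INFLUX_USERNAME='my_username'
export INFLUX_PASSWORD='my_password'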
Note: To query an influx database on CIT use the url https://influxdb.ligo.caltech.edu:8086; on ICDS use https://influxdb.gwave.ics.psu.edu:8086.
The example commands in this section will use the CIT url.
accessible databases
To see what databases your credentials can access:
curl -k -u $INFLUX_USERNAME:$INFLUX_PASSWORD 'https://influxdb.ligo.caltech.edu:8086/query?pretty=true' --data-urlencode 'q=SHOW DATABASES'
Sample output:
{
    "results": [
        {
            "statement_id": 0,
            "series": [
                {
                    "name": "databases",
                    "columns": [
                        "name"
                    ],
                    "values": [
                        [
                            "foo_online"
                        ]
                    ]
                }
            ]
        }
    ]
}
If you get authorization failed, then your username or your password is wrong (or at least doesn't match what is in the influx configuration).
available measurements
You can see the measurements that are available in a database with a command like:
curl -k -u $INFLUX_USERNAME:$INFLUX_PASSWORD 'https://influxdb.ligo.caltech.edu:8086/query?pretty=true' --data-urlencode 'q=SHOW MEASUREMENTS ON my_database'
where my_database is replaced with one of your InfluxDB databases.
The data in these measurements is what appears on the Grafana monitoring dashboards for each GstLAL analysis. Only measurements that actually contain data will appear in this list. However, data stays in an influx database essentially forever, so the above command does not tell you whether a certain measurement is currently being written to.
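One way to check whether a measurement is still receiving data is to query its most recent point and compare the returned timestamp against the current time. For example, assuming the foo_online database, the 1s retention policy, and the L1_snr_history measurement used later in this section (the retention policy syntax is explained below):
curl -k -u $INFLUX_USERNAME:$INFLUX_PASSWORD 'https://influxdb.ligo.caltech.edu:8086/query?pretty=true' --data-urlencode "q=SELECT last(data) FROM foo_online.\"1s\".L1_snr_history WHERE aggregate='max'"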
retention policies
Normally, retention policies define the duration over which data in an Influx DB measurement is kept. In ligo-scald, retention policies are used a bit differently. Note that the syntax for retrieving data from Influx DB is:
{retention_policy}.{measurement}
The retention policies are used to define durations over which to aggregate data. For example, a retention policy of “1s” with an aggregate function of “max” would keep the single maximum data point over each one second window.
You can see the retention policies on a database named foo_online with the command:
curl -k -u $INFLUX_USERNAME:$INFLUX_PASSWORD 'https://influxdb.ligo.caltech.edu:8086/query?pretty=true' --data-urlencode "q=show retention policies on foo_online"
This will return a list of retention policies with the following columns:
"columns": [
"name",
"duration",
"shardGroupDuration",
"replicaN",
"default"
],
query over a time range
Here is an example of how you can get data out of InfluxDB. Note that the timestamps in InfluxDB are nanoseconds since the epoch (not GPS time). To query the last 24 hours of data for:
DB = foo_online
measurement = L1_snr_history
retention policy = 1s
you can use the command:
NOW=`date +%s`
NOW=`echo $NOW ' * 10^9' | bc`
THEN=`echo $NOW ' - 86400*10^9' | bc`
curl -k -u $INFLUX_USERNAME:$INFLUX_PASSWORD 'https://influxdb.ligo.caltech.edu:8086/query?pretty=true' --data-urlencode "q=SELECT data from foo_online.\"1s\".L1_snr_history WHERE aggregate='max' AND time >= $THEN AND time <= $NOW"
data transformation
You can also perform transformations on the data, for example count how many L1 snr history points have been stored in the last 24 hours:
curl -k -u $INFLUX_USERNAME:$INFLUX_PASSWORD 'https://influxdb.ligo.caltech.edu:8086/query?pretty=true' --data-urlencode "q=SELECT count(data) from foo_online.\"1s\".L1_snr_history WHERE aggregate='max' AND time >= $THEN AND time <= $NOW"
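Other InfluxQL aggregation functions work the same way; for example, to compute the mean over the same 24-hour window (reusing $THEN and $NOW from above):
curl -k -u $INFLUX_USERNAME:$INFLUX_PASSWORD 'https://influxdb.ligo.caltech.edu:8086/query?pretty=true' --data-urlencode "q=SELECT mean(data) from foo_online.\"1s\".L1_snr_history WHERE aggregate='max' AND time >= $THEN AND time <= $NOW"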
measurement tags
Tags are strings that hold metadata associated with a measurement.
In Influx syntax, tags are key value pairs.
For the GstLAL analysis, many of the measurements are stored with tags indicating the SVD bin number (svdbin) of the job that produced the data, the job type (job), which is either noninj or inj, and the upper frequency cutoff (fmax) used for filtering by that job.
For example, if you wanted data only coming from the SVD bin 0123, you would specify the tag key “svdbin” and value “0123”.
You can view the tags that are available for a given measurement in the scald configuration file. For L1_snr_history in the GstLAL online analysis:
L1_snr_history:
  measurement: L1_snr_history
  column: data
  tags:
    - svdbin
    - job
    - fmax
  tag_key: job
  aggregate: max
Here is an example of using tags to query data from a specific inspiral job and threshold on the value:
curl -k -u $INFLUX_USERNAME:$INFLUX_PASSWORD 'https://influxdb.ligo.caltech.edu:8086/query?pretty=true' --data-urlencode "q=SELECT data from foo_online.\"1s\".L1_snr_history WHERE aggregate='max' AND time >= $THEN AND time <= $NOW AND svdbin = '0000' AND data > 8.0"
If you just want to see which tag keys and values are available for a particular measurement, you can use the following commands:
curl -k -u $INFLUX_USERNAME:$INFLUX_PASSWORD 'https://influxdb.ligo.caltech.edu:8086/query?pretty=true' --data-urlencode 'q=SHOW TAG KEYS ON "gstlal_inspiral_edward_MDC11" FROM "L1_snr_history"'
curl -k -u $INFLUX_USERNAME:$INFLUX_PASSWORD 'https://influxdb.ligo.caltech.edu:8086/query?pretty=true' --data-urlencode 'q=SHOW TAG VALUES ON "gstlal_inspiral_edward_MDC11" FROM "L1_snr_history" WITH KEY = "svdbin"'
Checking likelihood history and ram history
Here are instructions on how to check the likelihood history and ram history of a given job.
In the working directory of each analysis, there are registry.txt files.
As an example, we will use a job with SVD bin ID 0000.
This command:
cat 0000_inj_mdc_registry.txt
will print out the job's URL, for example http://node978:43873/. Then run:
wget http://node978:43873/likelihood_history.txt
followed by:
head likelihood_history.txt
to view the likelihood history.
If you are looking for the RAM history instead, change likelihood_history to ram_history in the commands above.
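To spot-check every job at once, you can loop over all of the registry files. This is just a sketch: it assumes the *_registry.txt naming seen above and that each registry file contains only the job's URL.
# Print the first few lines of ram_history from every job in the working directory.
for reg in *_registry.txt; do
    url=$(cat "$reg")
    echo "== ${reg}: ${url}"
    wget -q -O - "${url}ram_history.txt" | head -n 3
done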
miscellaneous
To query an influx database with periods in the name, for example "foo.bar.online", you should enclose the database name in double quotes in the commands above. For example, 'q=SHOW MEASUREMENTS ON "foo.bar.online"'
or "q=SELECT data from \"foo.bar.online\".\"1s\".L1_injsnr_missed..."
Icinga Monitoring
Icinga is an open source monitoring and alert infrastructure. It is used across the LVK collaboration to monitor various critical low-latency services, for example data distribution, GWCelery, and the search analyses. The gstlal monitoring statuses are shown on the dashboard here. This page includes status outputs of:
- The GstLAL Grafana server
- The GstLAL low-latency analyses:
  - gstlal-<search>: infers the status of gstlal jobs by querying data from InfluxDB.
  - gstlal-<search>_http: infers the status of gstlal jobs by querying data from bottle.
- The three GstLAL kafka servers on CIT
- The GstLAL influx server on CIT
Each Icinga service group is monitored by a script which outputs a JSON status object. You can find more information about the format of the JSON output in the IGWN monitoring docs. The JSON output is then periodically read by a shibboleth scraper. The configurations for these monitoring services are controlled in the dashboard2 git repo, e.g. see gstlal.conf.
See query_influx for error message details.
GstLAL analysis influx test
The status of the GstLAL inspiral jobs is tracked via the gstlal-<search> service groups on the Icinga dashboard.
The executable is query_influx, a Python script that queries the latest RAM history data point from each gstlal-inspiral job without contacting the jobs directly.
If the script cannot find a RAM history data point from a job or if the latest output is older than 3 minutes, the job will be marked as “down”.
If all jobs are OK, the status is OK. If up to 1% of jobs are down, the status is WARNING, and if more than 1% of jobs are down, the status is CRITICAL.
This script is run every 3 minutes by a cron job on the gstlalcbc.online account at CIT and the gstlalcbc account at ICDS.
The script does not need to run on the same node as the analyses, but it must run on the same cluster.
The scripts are ~/cron/inspiral_o4a_<analysis_name>.sh, which provide analysis-specific information to the executable.
The output JSON files are written to ~/public_html/nagios for each production low-latency analysis.
The JSON output is then read by the shib scraper every 4 minutes and the status on the dashboard page is populated.
If the JSON output is more than 10 minutes old, the status will be marked as UNKNOWN.
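The schedule itself is an ordinary crontab entry; a hypothetical example for one analysis (the actual crontab lines on the accounts may differ) is:
# Run the influx status check every 3 minutes (hypothetical crontab entry).
*/3 * * * * ~/cron/inspiral_o4a_<analysis_name>.sh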
GstLAL analysis HTTP test
The status of the GstLAL inspiral jobs' HTTP connections is tracked via the gstlal-<search>_http service groups on the Icinga dashboard.
Each GstLAL inspiral job spins up its own HTTP server (using bottle) so that users can communicate with the job directly over HTTP. More importantly, the marginalize_likelihood jobs contact each of the GstLAL inspiral jobs via HTTP, so each inspiral job must have a working HTTP connection.
When the wget command fails (times out), it means that the GstLAL inspiral jobs' HTTP servers are not responding properly.
This issue is usually caused by the hardware at CIT being overloaded and unable to keep up.
The executable for ICDS is cgi-bin/checkAnalysisHttp, which is written in Perl.
The script is in ~/public_html/cgi-bin/http_o4_<analysis_name>, which provides analysis-specific information for the executable.
The script checks the number of jobs and whether each job's HTTP server can be reached. Jobs that cannot be reached are marked, and the script writes them out in a JSON file.
When Icinga alerts show that the http test is failing, check the following:
- Confirm that the GstLAL inspiral jobs are not responding to http requests with a wget command such as the following (a consolidated check is sketched after this list):
cat 0000_noninj_registry.txt
wget <url>/ram_history.txt
- Check the job-specific dashboard to see if the jobs are still producing data. If so, the jobs are likely outputting data in a healthy manner but the HTTP connection is failing. If the jobs are not producing data, they may have become zombies.
- Run condor_ssh_to_job <job-id> and then run top -u gstlalcbc for Alice/Jacob or top -u gstlalcbc.online for Edward/Charlie. Check that the CPU usage of the GstLAL inspiral job is non-zero. (The state of the job can be "S".) If the CPU usage is 0, the job is not running properly. If the CPU usage is non-zero, the job could be running but the HTTP connection is failing.
- If the GstLAL inspiral jobs are running but the HTTP connection is failing, find the job-id of the GstLAL inspiral job with:
condor_q -dag <dag-id> --nobatch | grep "job-tag <bin number>_noninj"
and run:
condor_rm <job-id>
so that the GstLAL inspiral job gets relaunched on a different node and starts up the HTTP connection again.
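As referenced in the first step above, here is a minimal consolidated HTTP check for a single job. This is a sketch: it assumes the registry file contains only the job's URL, and the 10-second timeout is an arbitrary choice.
# Quick HTTP check for one job (example: SVD bin 0000, noninj).
URL=$(cat 0000_noninj_registry.txt)
if wget -q --timeout=10 --tries=1 -O /dev/null "${URL}ram_history.txt"; then
    echo "HTTP OK: ${URL}"
else
    echo "HTTP FAILED: ${URL}"
fi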