SQR-009: The SQuaSH metrics dashboard

  • Angelo Fausti

Latest Revision: 2018-02-28

Note

Work in progress.

Introduction

We present the current design for the Science Quality Analysis Harness (SQuaSH) metrics dashboard being developed by LSST/DM SQuaRE as part of the verification effort.

The verification of the LSST software stack performance on simulated and precursor data sets is an important activity during the project construction phase. It gives LSST/DM the opportunity to develop Quality Control (QC) infrastructure and Quality Analysis (QA) code to support the development of the stack and evaluate its results.

The DM Science Data Quality Assurance (SDQA) System Conceptual Design [LDM-522] describes the QC services and QA activities necessary to implement the capabilities listed in the LSST Data Quality Assurance plan [LDM-63]. In particular, it describes the tasks to be performed in each QC tier, from QC Tier 0, which aims to test and verify the DM sub-system during software development, to QC Tier 3, which will enable the science collaborations to evaluate the data quality for their own science analyses.

SQuaSH is a QC service; its current implementation focuses on QC Tier 0 tasks. To that end, we compute Key Performance Metrics (KPMs) on fixed data sets from daily builds in the Jenkins Continuous Integration (CI) system to monitor the stability of the LSST software stack. As part of the CI pipeline, the KPMs are computed by afterburner packages such as validate_drp [DMTN-008] and the results are sent to the SQuaSH metrics dashboard for monitoring.

In addition, we propose a workflow that enables developers to instrument their specific Science Pipeline Tasks, perform the verification measurements on development branches through the CI system or even in their local execution environment, and send the results to SQuaSH. We also discuss the SQuaSH integration with the LSST Science Platform (LSP) and how users of the notebook aspect of the LSP can benefit from pre-built SQuaSH apps embedded in Jupyter notebooks. Finally, in the Appendix we provide more information about the system architecture and its deployment on the Google Kubernetes Engine (GKE).

The current production version of the SQuaSH metrics dashboard is available at https://squash.lsst.codes/ and the current development version at https://squash-demo.lsst.codes/.

We expect to get feedback from users and iterate with the Science Pipelines group to incorporate their metrics and visualizations in SQuaSH. The main goal is to anticipate the needs for commissioning and to leverage this experience for the production SDQA system.

Design guidelines

Some of the design guidelines and features of SQuaSH are summarized below:

  • implement the concepts developed in the LSST Verification Framework
  • implement a QC database to support the tasks performed in QC Tiers 0, 1, 2 and 3, and to preserve and visualize their results
  • support multiple test datasets and multiple verification packages
  • support automated verification runs, initially through the Jenkins CI environment and eventually other execution environments
  • enable interactive visualization for the metric measurements
  • correlate deviations of the metric measurements with code changes
  • provide notifications/alerts upon metric measurement deviations
  • provide a “drill down” capability, i.e., in addition to the visualization of the scalar metrics, enable further interactive visualization of the same data used to perform the metric measurements
  • extensible: make it easy for users to add new visualizations (a.k.a. SQuaSH apps)
  • embeddable: reuse SQuaSH apps for analysis in the LSP notebook aspect
  • replaceable: make it easy to replace SQuaSH components to keep up with standard technologies
  • deployable on Kubernetes

Other desired capabilities:

  • support verification runs from the developer “local” environment
  • support multiple users, keeping results from user “local” verification runs; use GitHub OAuth for the SQuaSH API.
  • record measurements of a given metric per patch, visit, or CCD, especially for QC Tiers 2 and 3, where we process larger data sets. If we aggregate metric measurements from the CCD level to the visit level, we can also implement some kind of drill down visualization.

SQuaSH in the context of the DM software development

The LSST Verification framework

lsst.verify is the framework for making verification measurements in the LSST Science Pipelines. It aims to generalize the process of defining metrics, measuring those metrics, and tracking those measurements in SQuaSH. The design is described in SQR-017, and SQR-019 demonstrates its use.

lsst.verify is currently being used by specialized afterburner packages such as validate_drp [DMTN-008], but it can also be used to track performance metrics of specific pipeline Tasks such as jointcal.

See also DMTN-057 for a discussion on how to instrument the Science Pipelines tasks with verification metrics based on the ap_verify experience.

The SQuaSH API implements the concepts developed in lsst.verify. A verification Job is the result of a verification run and contains at least one measurement of a particular metric. A Job also contains metadata, with information about the execution environment, and the actual data used to make the measurements. The data used to make the measurements is called a data blob, which includes tabular data and additional metadata. Blobs are used for the drill down visualization in SQuaSH.

A Job object produced by lsst.verify can be persisted as a JSON file and sent to SQuaSH using the dispatch_verify utility.
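
For illustration, here is a minimal sketch of producing and persisting a verification Job with lsst.verify; the metric name, value, and metadata are placeholders, and the exact API is described in SQR-017 and demonstrated in SQR-019:

import astropy.units as u
import lsst.verify

# measure a metric; the name and value here are placeholders
measurement = lsst.verify.Measurement('validate_drp.AM1', 5.0 * u.marcsec)

# package the measurement into a verification Job along with some
# illustrative metadata, and persist it as JSON
job = lsst.verify.Job(measurements=[measurement])
job.meta.update({'filter_name': 'r'})
job.write('job.json')

The resulting job.json file can then be uploaded with the dispatch_verify utility.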

Next we discuss use cases for DM developers involving SQuaSH.

Use case 1: Afterburner packages in CI

SQuaSH in the context of CI

SQuaSH supports verification packages that run as afterburners in Jenkins CI. In the current implementation, the Jenkins release/nightly-release pipeline builds the stack and triggers validate_drp, which measures metrics from the ProcessCcdTask outputs.

By measuring metrics on fixed data sets in CI, we monitor the stability of the LSST software stack and can correlate deviations of the metric measurements with code changes.

The results are sent to the SQuaSH metrics dashboard.

Use case 2: Instrumenting science pipeline Tasks

Developers can implement verification metrics on their own pipeline Tasks.

jointcal is an example of a science pipeline Task that uses lsst.verify to define its performance metrics and make verification measurements.

A Task should produce one verification Job containing the set of measurements performed by the Task. Usually it would be part of a larger pipeline involving several Tasks, and the verification measurements would be sent to SQuaSH after the execution of each Task.

In order to illustrate this use case, we run jointcal on HSC test data and use dispatch_verify to send the results to SQuaSH.

$ setup obs_subaru
$ setup jointcal
$ setup testdata_jointcal
$ export ASTROMETRY_NET_DATA_DIR=${TESTDATA_JOINTCAL_DIR}/hsc_and_index
$ export SETUP_ASTROMETRY_NET_DATA="astrometry_net_data hsc_and_index"
$ jointcal.py ${TESTDATA_JOINTCAL_DIR}/hsc/ --output output --id visit=0903334^0903336^0903338^0903342^0903344^0903346 --clobber-versions --clobber-config  --configfile $JOINTCAL_DIR/tests/config/hsc-config.py --configfile photometry-config.py

This produces, among other outputs, the output/verify/job.json file with the verification measurements.

Here we explicitly set some variables that would be required in the Jenkins CI environment, or equivalently in other execution environments:

$ export BUILD_ID=1  # ID in the CI system
$ export PRODUCT=jointcal # Name of the package
$ export dataset=hsc # Name of the test dataset used
$ dispatch_verify.py --url https://squash-restful-api-demo.lsst.codes --user <squash user> --password <squash passwd> --env jenkins --lsstsw lsstsw/ output/verify/job.json

Use case 3: Supporting development branches

Given that specific Tasks can make verification measurements on test data sets (see e.g. the jointcal tests) and send results to SQuaSH, it might be interesting for the developer to do so before merging the development branch to master. That would enable developers to compare their performance metrics with previous results on master and avoid regressions in the first place. The results would be sent to SQuaSH when the Jenkins stack-os-matrix job is triggered by the developer. On Jenkins, we could implement a mechanism similar to the one used to run the stack demo pipeline, e.g. a “Send verification measurements to SQuaSH” step. SQuaSH can keep track of the branch being tested, and the dashboard should make it easy to identify results from development branches and compare them with results for the same verification metrics on master.

Note

If the execution time of the stack-os-matrix job becomes prohibitive due to new tests executing the verification measurements, we could consider doing the verification measurements in specialized verification packages that execute the Science Pipeline Tasks of interest.

Use case 4: Supporting the developer local environment

Another use case is to support the developer's local execution environment. This implies supporting multiple users in SQuaSH, so that each user_id can be associated with the corresponding job_id and the results can be displayed per user in the SQuaSH dashboard.

Note

Multi-user support is not fully implemented yet. While user registration and token authentication are already present in the SQuaSH API, we need a third-party OAuth provider for authentication purposes.

In this scenario, dispatch_verify would support the user's local environment, collecting the appropriate metadata such as an incremental user run ID, the data set used, and the path to the user's local lsstsw installation in order to register the package metadata associated with the run.

Supporting multiple execution environments

In order to be useful for the verification activities, SQuaSH must support multiple execution environments such as Jenkins CI, the user “local” environment, the verification cluster environment, and potentially others.

In the verification framework, a Job packages several measurements, metadata and data blobs. The metadata contains information about the execution environment.

Examples of verification Job metadata for different execution environments are listed below; a sketch of the Jenkins case follows the list:

  • Jenkins CI
    • Look up key: ID of the CI run
    • Environment metadata: ci_name, ci_dataset, ci_label, ci_url, lsstsw and extra packages
  • User local environment (implies support for multiple users)
    • Look up key: ID of the user run
    • Environment metadata: lsstsw and extra packages
  • Verification Cluster
    • Look up key: ID of the verification cluster run
    • Environment metadata: LSST stack build (assuming we are using stable versions of the stack only)
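
As a concrete illustration, a hypothetical metadata dictionary for a Jenkins CI run, using the keys listed above; all values are placeholders:

# illustrative environment metadata attached to a Job from a Jenkins CI run
env_metadata = {
    'env_name': 'jenkins',
    'ci_id': '1',               # look up key: ID of the CI run
    'ci_name': 'validate_drp',
    'ci_dataset': 'cfht',
    'ci_label': 'centos-7',
    'ci_url': 'https://ci.example.com/job/1/',
}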

Note

For QC Tier 1 the metadata may include things like the stack configuration used in each run.

The SQuaSH RESTful API provides a generic resource to interact with jobs and specific resources to interact with runs in different execution environments that ultimately map to a job. In this way, a request to /jenkins/<ci_id> or /local/<username>/<run_id> will look up the corresponding /job/<job_id> to retrieve the associated measurements, metadata and data blobs.
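
For example, a minimal sketch of retrieving a job through the Jenkins-specific resource with the requests library (the ci_id value is a placeholder):

import requests

API_URL = 'https://squash-restful-api-demo.lsst.codes'

# /jenkins/<ci_id> looks up the corresponding /job/<job_id>
r = requests.get(API_URL + '/jenkins/1')
job = r.json()  # measurements, metadata and data blobs of the mapped job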

SQuaSH in the context of the LSST Science Platform

SQuaSH in the context of the LSP

Topics to be covered in this section:

  • using the LSP environment for QC Tier 3 analysis
  • using SQuaSH to submit verification runs in the verification cluster
  • embedding SQuaSH apps in the JupyterLab environment

Appendix

Deployment

SQuaSH is currently deployed to a commodity cloud, the Google Cloud Platform, on the Google Kubernetes Engine (GKE), and is architected as independent microservices. The figure below shows the various “layers” of the Kubernetes deployment: the service, which provides an external IP for each microservice, and the pod, which groups containers running on the same GKE node. Other Kubernetes objects, like secrets and customized configurations stored as configmaps, are also indicated in the figure. The microservices squash-restful-api, squash-bokeh and squash-dash are connected through HTTPS, and TLS termination is implemented through the nginx container, which works as a reverse proxy to secure the traffic outside the pods.

SQuaSH Kubernetes deployment

The general instructions to deploy SQuaSH can be found at squash-deployment, with links to the individual microservices:

  • squash-restful-api: the API used to manage the SQuaSH metrics dashboard. The SQuaSH RESTful API was developed initially using Django DRF and then reimplemented in Flask with several extensions. It also uses Celery to enable the execution of tasks in the background.
  • squash-bokeh: serves the SQuaSH Bokeh apps; we use the Bokeh plotting library for rich interactive visualizations.
  • squash-dash: the dashboard that embeds the Bokeh apps. Alternatively, we are exploring the possibility of embedding the same apps in the JupyterLab environment of the LSST Science Platform.

The QC Tier 0 database

For the QC Tier 0 DB, we opted for a relational database. The motivation behind this choice is that we plan to deploy QC DBs to the Oracle consolidated database as part of the LSP. SQuaSH currently uses an instance of MySQL 5.7 deployed to Cloud SQL. We chose MySQL over MariaDB, which is used in Qserv, because of its support for JSON data types, which we use in this implementation to make the database schema more generic. We store Job metadata and environment metadata, as well as metric definitions and specifications, as JSON blobs.

Current SQuaSH database schema for QC Tier 0 tasks. This implementation supports multiple verification packages and multiple execution environments.

  • Entities:
    • env, user, job, package, blob, measurement, metric, spec
  • Relationships:
    • 1 env : N jobs
    • 1 job : N packages
    • 1 job : N measurements
    • M measurements : N data blobs
    • 1 metric : N specs
    • 1 metric : N measurements
QC Tier 0 Database

Backups of the SQuaSH QC Tier 0 DB are automated in Cloud SQL.

The SQuaSH RESTful API

The SQuaSH RESTful API is a web app implemented in Flask for managing the SQuaSH metrics dashboard.

Current version

By default, all requests to https://squash-restful-api-demo.lsst.codes/ receive version 1.0 of the RESTful API. The default version of the API may change in the future, so we encourage you to explicitly request a version via the Accept header.

You can specify a version like this:

Accept: application/json; version=1.0
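
For example, a hypothetical request that pins the API version through the Accept header:

import requests

# request version 1.0 of the API explicitly via the Accept header
headers = {'Accept': 'application/json; version=1.0'}
r = requests.get('https://squash-restful-api-demo.lsst.codes/version',
                 headers=headers)
print(r.json())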

Schema

All API access is over HTTPS, from https://squash-restful-api-demo.lsst.codes/. All data is sent and received as JSON.

Authentication

Operations like POST and DELETE (see below) require authentication. To authenticate through the SQuaSH RESTful API you need to provide a valid access token in the Authorization header, which can be obtained from the /auth endpoint for a registered user:

import requests

# credentials of a registered SQuaSH user (placeholder values)
credentials = {'username': 'myuser', 'password': 'mypasswd'}
r = requests.post("https://squash-restful-api-demo.lsst.codes/auth", json=credentials)
access_token = 'JWT ' + r.json()['access_token']

# assuming you have a job document (a dict) you want to post to SQuaSH
headers = {'Authorization': access_token}
r = requests.post("https://squash-restful-api-demo.lsst.codes/job", json=job, headers=headers)

Documentation

The SQuaSH RESTful API is documented following the OpenAPI 2.0 specification. The specification is extracted from the docstrings by the flasgger utility, which is also used to create the Swagger UI for the API.

Note

The Swagger UI is experimental; authentication does not work through this interface yet.

This notebook provides an example of how to interact with the SQuaSH RESTful API, from registering a new user in SQuaSH to loading a verification job.

All the available resources and possible operations are listed below:

GET /datasets

Retrieve the list of datasets used in SQuaSH.

Status Codes:
  • 200 OK – Dataset list successfully retrieved.
GET /default

Retrieve default values used for display configuration in the SQuaSH apps.

Status Codes:
  • 200 OK – Default values successfully retrieved.
GET /jenkins/{ci_id}

Retrieve a verification job from the Jenkins environment.

Parameters:
  • ci_id (integer) – ID of the Jenkins job.
Status Codes:
POST /job

Create a verification job.

This is the entry point for dispatch_verify.py. See http://sqr-019.lsst.io for the body content.

Status Codes:
DELETE /job/{job_id}

Delete a verification job and associated measurements, data blobs and metadata.

Parameters:
  • job_id (integer) – ID of the job.
Status Codes:
GET /job/{job_id}

Retrieve a verification job.

Parameters:
  • job_id (integer) – ID of the job.
Status Codes:
GET /measurement/{job_id}

Retrieve all measurements performed by a verification job.

Parameters:
  • job_id (integer) – ID of the job.
Status Codes:
POST /measurement/{job_id}

Create a new measurement for an existing job.

Parameters:
  • job_id (integer) – ID of the job.
Status Codes:
GET /measurements

Retrieve the complete list of measurements.

Status Codes:
  • 200 OK – List of Measurements successfully retrieved.
DELETE /metric/{name}

Delete a metric.

Parameters:
  • name (string) – Fully qualified name of the metric, e.g. validate_drp.AM1
Status Codes:
GET /metric/{name}

Retrieve a metric from its name.

Parameters:
  • name (string) – Fully qualified name of the metric, e.g. validate_drp.AM1
Status Codes:
POST /metric/{name}

Create a metric.

Parameters:
  • name (string) – Fully qualified name of the metric, e.g. validate_drp.AM1
Status Codes:
GET /metrics

Retrieve the complete list of metrics.

Status Codes:
  • 200 OK – List of metrics successfully retrieved.
POST /metrics

Create a list of metrics.

Status Codes:
GET /monitor

Retrieve the data structure used to feed the SQuaSH Monitor app.

Status Codes:
  • 200 OK – Monitor data successfully retrieved.
GET /packages

Retrieve the list of verification packages used in SQuaSH.

Status Codes:
  • 200 OK – Package list successfully retrieved.
POST /register

Create a SQuaSH user.

Status Codes:
DELETE /spec/{name}

Delete a metric specification.

Parameters:
  • name (string) – Fully qualified name of the metric specification, e.g. validate_drp.AM1.minimum_gri
Status Codes:
  • 200 OK – Metric specification deleted.
  • 401 Unauthorized – Authorization Required. Request does not contain a valid access token.
  • 404 Not Found – Metric specification not found.
GET /spec/{name}

Retrieve a metric specification.

Parameters:
  • name (string) – Fully qualified name of the metric specification, e.g. validate_drp.AM1.minimum_gri
Status Codes:
POST /spec/{name}

Create a metric specification.

Parameters:
  • name (string) – Fully qualified name of the metric specification, e.g. validate_drp.AM1.minimum_gri
Status Codes:
GET /specs

Retrieve the complete list of metric specifications.

Status Codes:
  • 200 OK – List of metric specifications successfully retrieved.
POST /specs

Create a list of metric specifications.

Status Codes:
GET /status/{task_id}

Retrieve status of an upload task.

Parameters:
  • task_id (string) – Upload task ID as returned by /job
Status Codes:
  • 200 OK – Upload task status successfully retrieved. PENDING: the task did not start yet. STARTED: the task has started. SUCCESS: the task has completed. FAILURE: something went wrong.
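
For instance, a hypothetical polling loop for an upload task, assuming the response carries a status field with the states above (the task_id value is a placeholder):

import time

import requests

API_URL = 'https://squash-restful-api-demo.lsst.codes'
task_id = 'some-task-id'  # as returned by POST /job

# poll until the upload task reaches a final state
while True:
    r = requests.get(API_URL + '/status/' + task_id)
    status = r.json()['status']  # the 'status' field name is an assumption
    if status in ('SUCCESS', 'FAILURE'):
        break
    time.sleep(5)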
DELETE /user/{username}

Delete a SQuaSH user.

Parameters:
  • username (string) – name of the user
Status Codes:
GET /user/{username}

Retrieve a user from SQuaSH.

Parameters:
  • username (string) – name of the user
Status Codes:
GET /users

Retrieve the complete list of SQuaSH users.

Status Codes:
  • 200 OK – List of users successfully retrieved.
GET /version

Retrieve the default version of the API.

Status Codes:
  • 200 OK – Version successfully retrieved.