Defcon

Defcon is a tool for monitoring external services for specific failure scenarios. You can see it as a lightweight Nagios focused on watching over networked services. This kind of tool is sometimes called an uptime monitoring service.

A common use case for Defcon is to periodically perform an HTTP request to an API endpoint, verify that the response status code is 200 OK and that the content contains the word ready, and send a message to a Slack channel whenever the check fails three times in a row.

This documentation is still being written. It is far from complete and may not contain all the information you need.

Concepts

Checks

The main concept in Defcon is that of a check. A check is a definition for an external service, including how to monitor it, how to detect issues, when to consider it as failing and what to do when it is.

Each check includes what is called a handler specification (or spec), that describes how this service is to be monitored. This specification must be one of the supported handlers, as described in this section.

Alerters

Each check can optionally trigger an alert when an outage is confirmed by its handler. All alerters in Defcon use webhooks to transmit information about the failing check to an HTTP server (outside Defcon's scope) that performs the actual notification.

Site

A site is a distinct location (read, server), where checks can be run from. This allows for monitoring services from different locations concurrently to help avoid detection issues caused by hardware failing, network partitions or transient problems.

Each check can be set to run on one or more different sites, with a configurable number of failing sites needed for an outage to be confirmed.

Getting started

This guide will help you set up a simple Defcon instance, configure a first HTTP check and query the results through the bundled API.

Requirements

The following list describes the infrastructure required to follow this guide:

  • A Linux box
  • A MySQL instance, with a user account and an empty database

Download a release

Binaries are listed under the Releases section of the GitHub repository. A new release is created for each tag on the codebase, and a special tip release follows master and provides the latest snapshot of the code.

$ curl -L https://github.com/apognu/defcon/releases/download/tip/defcon-tip-x86_64 > defcon
$ chmod +x defcon

Running the controller

From here, you can start the controller:

$ RUST_LOG=defcon=debug \
  DSN=mysql://defcon:password@mysql.host/defcon?ssl-mode=DISABLED \
  ./defcon
INFO[2021-02-06T11:48:51.801+0000] starting api process port="8000"
INFO[2021-02-06T11:48:51.801+0000] starting handler process interval="1s"
INFO[2021-02-06T11:48:51.801+0000] no public key found, disabling runner endpoints

Creating your first check

In this guide, we will monitor two HTTP services that need to return a 200 OK status code to pass. Each check will run every 10 seconds and will require three consecutive failures to be considered failed. We will use the Defcon API to create the checks.

Our first check can be represented with the following JSON:

{
  "name": "Successful HTTP request",
  "interval": "10s",
  "sites": ["@controller"],
  "passing_threshold": 3,
  "failing_threshold": 3,
  "site_threshold": 1,
  "spec": {
    "kind": "http",
    "url": "http://jsonplaceholder.typicode.com/users",
    "code": 200
  }
}

This snippet defines the following:

  • The human-readable name for this check
  • The interval at which the check should be run (here, 10 seconds)
  • The sites on which the check will be run (@controller is the implicit name for Defcon's controller)
  • Each site will be considered as failed when the check fails three times in a row, and recovered after three successes in a row
  • An outage will be created when the number of failed sites reaches 1
  • This check uses the http handler, making a GET request to the provided URL, and expects a response status code of 200

You can create the check by performing a POST request to http://127.0.0.1:8000/api/checks:

$ curl -v -XPOST http://127.0.0.1:8000/api/checks -d@book.json
HTTP/1.1 201 Created
location: /api/checks/82a3b532-0883-4544-ba2c-0a7159a89d8e
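The same request can be prepared from a script. The following is a minimal sketch using only Python's standard library, not an official client: the helper names build_http_check, build_create_request and create_check are hypothetical, and the Content-Type header is an assumption (the curl example does not set one explicitly).

```python
import json
import urllib.request


def build_http_check(name, url, code=200, interval="10s"):
    """Return a check definition shaped like the JSON document shown earlier."""
    return {
        "name": name,
        "interval": interval,
        "sites": ["@controller"],
        "passing_threshold": 3,
        "failing_threshold": 3,
        "site_threshold": 1,
        "spec": {"kind": "http", "url": url, "code": code},
    }


def build_create_request(api_base, check):
    """Prepare (without sending) the POST request that creates the check."""
    return urllib.request.Request(
        f"{api_base}/api/checks",
        data=json.dumps(check).encode("utf-8"),
        # Assumption: the API accepts an explicit JSON content type.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def create_check(api_base, check):
    """Send the request and return the path from the location response header."""
    with urllib.request.urlopen(build_create_request(api_base, check)) as response:
        return response.headers.get("location")
```

Calling create_check("http://127.0.0.1:8000", build_http_check("Successful HTTP request", "http://jsonplaceholder.typicode.com/users")) against a running controller should return the path of the new check.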

You can check the configuration for this check by calling the API with the returned path:

$ curl http://127.0.0.1:8000/api/checks/82a3b532-0883-4544-ba2c-0a7159a89d8e
{
    "alerter": null,
    "enabled": true,
    "failing_threshold": 3,
    "interval": "10s",
    "name": "Successful HTTP request",
    "passing_threshold": 3,
    "silent": false,
    "site_threshold": 0,
    "sites": [
        "@controller"
    ],
    "spec": {
        "code": 200,
        "content": null,
        "digest": null,
        "headers": {},
        "kind": "http",
        "timeout": null,
        "url": "http://jsonplaceholder.typicode.com/users"
    },
    "uuid": "82a3b532-0883-4544-ba2c-0a7159a89d8e"
}

Check the check status

If you look at your console where Defcon is running, you should see that the handler for this check is running:

DEBG[2021-02-05T20:30:44.323+0000] check passed site="@controller" kind="http" check="82a3b532-0883-4544-ba2c-0a7159a89d8e" name="Successful HTTP request"
DEBG[2021-02-05T20:30:54.203+0000] check passed site="@controller" kind="http" check="f00ee7ad-b389-4819-bb24-e9797735e2df" name="Successful HTTP request"
DEBG[2021-02-05T20:30:54.207+0000] check passed site="@controller" kind="http" check="cf4a4917-92fb-4721-ab76-cae6a5fda2b8" name="Successful HTTP request"

Create a failing check

As an exercise for the reader, create another check from the above model, but this time, define it as expecting a 201 status code. This check should fail and create an outage when failing_threshold is reached.

DEBG[2021-02-05T20:56:42.185+0000] check failed site="@controller" kind="http" check="4766d0dc-5d39-4ec7-8aee-95b46f33dc55" name="Personal - Website & API" message="status code was 200"
DEBG[2021-02-05T20:56:53.164+0000] check failed site="@controller" kind="http" check="4766d0dc-5d39-4ec7-8aee-95b46f33dc55" name="Personal - Website & API" message="status code was 200"
DEBG[2021-02-05T20:57:04.192+0000] check failed site="@controller" kind="http" check="4766d0dc-5d39-4ec7-8aee-95b46f33dc55" name="Personal - Website & API" message="status code was 200"
INFO[2021-02-05T20:57:04.291+0000] site outage started site="@controller" kind="http" check="4766d0dc-5d39-4ec7-8aee-95b46f33dc55" failed="3/3" passed="0/3"
INFO[2021-02-05T20:57:04.358+0000] outage confirmed check="4766d0dc-5d39-4ec7-8aee-95b46f33dc55" outage="f3bdec24-1f7f-4fd6-bbe8-2e937cc67746" since="2021-02-05 20:57:04 UTC"

Here you can see that a site outage was created, and since our site_threshold was set to 1, an outage was confirmed for the check.

Configuration

The controller has a few configuration knobs you can tweak to adjust its overall behavior. They are described in this document. Most of them are optional and default to maybe-not-so-sensible values. All configuration is applied through environment variables.

Handler configuration

Environment variable    Required  Default value   Description
RUST_LOG                -         defcon=info     -
DSN                     Yes       -               Connection string to the MySQL database
PUBLIC_KEY              Yes       -               Path to a PEM-encoded ECDSA public key
API_ENABLE              -         1               Enable or disable the API process
API_LISTEN              -         127.0.0.1:8000  Set the listen address and port of the API process
WEB_ENABLE              -         0               Enable or disable the Web administration interface
WEB_STATUS_PAGE_ENABLE  -         0               Enable or disable the public status page, requires WEB_ENABLE
HANDLER_ENABLE          -         1               Enable or disable the handler process
HANDLER_INTERVAL        -         1s              Interval between handler loop iterations
HANDLER_SPREAD          -         0s              Maximum random delay applied when a check needs to run
CLEANER_ENABLE          -         0               Enable or disable the cleaner process
CLEANER_INTERVAL        -         10m             Interval between cleaner loop iterations
CLEANER_THRESHOLD       -         1y              Period of time after which to delete stale objects
ALERTER_DEFAULT         -         -               Alerter to create checks with, if unspecified
ALERTER_FALLBACK        -         -               Alerter to be called when none is set on a check

RUST_LOG

This controls the log level for each individual dependency. Defcon uses the defcon identifier, and defaults to info, which mainly prints errors, API access logs and messages when outages are created and resolved.

DSN

This should be a full connection string (with options) to the database Defcon is to use. It must start with mysql://, as MySQL is the only database supported by Defcon. This is an example of a valid DSN:

mysql://user:password@host:3306/defcon

PUBLIC_KEY

This option should contain the path to an existing PEM-encoded ECDSA public key. The following commands generate a compatible key pair (the public key is written to defcon-public.pem) for use with Defcon's controller:

$ openssl ecparam -genkey -name prime256v1 -noout | openssl pkcs8 -topk8 -nocrypt -out defcon-private.pem
$ openssl ec -in defcon-private.pem -pubout -out defcon-public.pem

API_ENABLE

A value of 0 or 1 respectively disables and enables the API process bundled within Defcon.

API_LISTEN

A string representing the IP address and port to which the API process will bind. By default, the API is only reachable from the local host on port 8000.

WEB_ENABLE

A value of 0 or 1 respectively disables and enables the Web administration interface used to manage and visualize Defcon's operations.

WEB_STATUS_PAGE_ENABLE

A value of 0 or 1 respectively disables and enables the public (unauthenticated) Web status page. You will also have to enable each check individually for it to be shown on the page.

HANDLER_ENABLE

A value of 0 here disables check handling on the controller. If it is disabled, no check will run on this node. This is particularly useful if all your checks are configured to run on off-site runners and you would prefer to use the controller only as Defcon's control plane.

HANDLER_INTERVAL

If the handler process is enabled, this setting defines at which interval we should try to determine if checks need to be run. Accepts human-readable durations (such as 1s or 5m), defaults to 1s and cannot be smaller than one second.
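To illustrate the duration format and the one-second lower bound, here is a minimal sketch of how such a value could be parsed and validated. It is not Defcon's own parser: the function names are hypothetical, and the grammar is deliberately reduced to a single integer followed by s, m, h or d, whereas Defcon may accept more forms (such as the 1y default of CLEANER_THRESHOLD).

```python
import re

# Seconds per supported unit (reduced set, chosen for this illustration).
_UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_duration(text):
    """Convert a duration such as "1s" or "5m" into a number of seconds."""
    match = re.fullmatch(r"(\d+)([smhd])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def handler_interval(text="1s"):
    """Parse HANDLER_INTERVAL and reject values below one second."""
    seconds = parse_duration(text)
    if seconds < 1:
        raise ValueError("HANDLER_INTERVAL cannot be smaller than one second")
    return seconds
```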

HANDLER_SPREAD

Maximum amount of time for the controller to wait before executing a stale check. This can be useful to prevent all checks running at the exact same time. When a check needs executing, if HANDLER_SPREAD is set to 5s, Defcon will wait for a random duration between 0s and 5s before executing the related handler.
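The behavior described above amounts to drawing a random delay before each run. The sketch below shows the idea under stated assumptions: spread_delay is a hypothetical helper, the spread is given in seconds, and a uniform distribution is assumed (the text only says the duration is random).

```python
import random


def spread_delay(spread_seconds, rng=random):
    """Return a random delay, in seconds, between 0 and spread_seconds."""
    if spread_seconds <= 0:
        # A spread of 0s (the default) means the check runs immediately.
        return 0.0
    return rng.uniform(0.0, spread_seconds)
```

With a spread of 5 seconds, two checks that become stale at the same instant would most likely start at different moments within the following five seconds.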

CLEANER_ENABLE

Use 1 here if you wish to enable the cleaner process. The cleaner process is used to delete old items (events, site outages and confirmed outages) from the database. Only resolved outages are eligible to be cleaned.

CLEANER_INTERVAL

This option defines the interval at which Defcon will look for database items to delete. This is a maintenance operation and does not need to run as often as the handler process.

CLEANER_THRESHOLD

How old should an outage be to be eligible for deletion? Here, a value of 6 months will delete all resolved outages, site outages and events that are at least six months old.

Off-site runner

By default, all checks are run on the controller node, and require only one site outage for an outage to be confirmed. Defcon lets you offload check handling to other nodes through off-site runners.

A runner is a stripped-down instance of Defcon that only knows how to run handlers and report their status back to the controller. Its workflow is the following:

  • Regularly check with the controller for stale checks that are configured to run on this particular runner
  • Call the handlers for each of those checks
  • Report their status back to the controller
  • Start over
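The workflow above can be sketched as a single loop iteration. This is only an outline of the control flow, not the runner's actual implementation: fetch_stale_checks, handlers and report are hypothetical callables standing in for the authenticated exchanges with the controller, which are not described here.

```python
def run_once(fetch_stale_checks, handlers, report):
    """Run one pass of the runner workflow and return how many checks ran.

    fetch_stale_checks() returns the stale checks assigned to this runner,
    handlers maps a handler kind to a function returning (passed, message),
    and report(uuid, passed, message) sends the result to the controller.
    """
    handled = 0
    for check in fetch_stale_checks():
        handler = handlers[check["spec"]["kind"]]
        passed, message = handler(check["spec"])
        report(check["uuid"], passed, message)
        handled += 1
    return handled
```

A runner would call run_once repeatedly, sleeping for its poll interval between passes.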

An off-site runner authenticates itself to the controller by possessing the private key matching the controller's public key.

Whereas the controller is identified by the static @controller tag, each runner must be configured to have a unique tag, such as eu-1 or home-runner. Site identifiers should only contain lowercase alphanumeric characters and dashes (^[a-z0-9-]+$).
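If you generate runner tags from a script, you can check them against that pattern before deploying. A minimal sketch follows; is_valid_site is a hypothetical helper, not part of Defcon.

```python
import re

# Same character class as the pattern above; fullmatch anchors both ends.
SITE_PATTERN = re.compile(r"[a-z0-9-]+")


def is_valid_site(tag):
    """Return True if tag only contains lowercase alphanumerics and dashes."""
    return SITE_PATTERN.fullmatch(tag) is not None
```

Note that @controller does not match this pattern: it is the controller's reserved tag rather than a runner identifier.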

Download the binary

$ curl -L https://github.com/apognu/defcon/releases/download/tip/defcon-runner-tip-x86_64 > defcon-runner
$ chmod +x defcon-runner

Generate keys

You first need to generate an ECDSA key pair. The controller is given the public key, and each runner authenticates itself with the matching private key. Without this key pair, the API endpoint used by the runners will be disabled.

$ openssl ecparam -genkey -name prime256v1 -noout | openssl pkcs8 -topk8 -nocrypt -out defcon-private.pem
$ openssl ec -in defcon-private.pem -pubout -out defcon-public.pem

Start the controller with runner support

$ PUBLIC_KEY=./defcon-public.pem \
  RUST_LOG=defcon=debug \
  DSN=mysql://defcon:password@mysql.host/defcon?ssl-mode=DISABLED \
  ./defcon
INFO[2021-02-06T11:48:51.801+0000] starting api process port="8000"
INFO[2021-02-06T11:48:51.801+0000] starting handler process interval="1s"

Start a runner

$ PRIVATE_KEY=./defcon-private.pem \
  CONTROLLER_URL=http://127.0.0.1:8000 \
  SITE=eu-1 \
  ./defcon-runner
INFO[2021-02-06T14:18:36.973+0000] starting runner process site="eu-1" poll_interval="1s"

This runner will start running any stale check configured to run on site eu-1.

Checks

A check is used to describe an external service to be monitored. Among other things, it lets you specify some metadata about the check, how and where the check should be run, the actual handler configuration to use and the conditions for confirming outages.

Metadata

Attribute       Type    Example value                           Description
name            string  "acme-public-site"                      A human-friendly name used in logs and alerters
alerter         UUID    "19b9eb20-3e3e-46d5-801f-a912e159913c"  Alerter to be triggered when an outage is created
enabled         bool    true                                    When disabled, a check will not run
on_status_page  bool    false                                   When enabled, the check will appear on the public status page, if that page is enabled
silent          bool    false                                   When silent, a check will not trigger its alerter
group           string  "9b77035c-218e-4d32-bcd7-4a015f7ee147"  Put the check into a pre-existing group

Run and error condition

Attribute          Type      Example value     Description
sites              [string]  ["us-1", "eu-1"]  List of sites where this check should run
interval           string    "10s"             Interval of time between subsequent runs
site_threshold     int       2                 Number of sites that have to fail to confirm an outage
failing_threshold  int       3                 Number of successive fails required to mark a site as failing
passing_threshold  int       3                 Number of successive passes required to mark a site as recovered

Note: if a check is to run on the controller as well as on another site, the controller's identifier should be given explicitly, e.g. "sites": ["@controller", "eu-1"].
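To make the interplay between the three thresholds concrete, here is a small simulation. It is a sketch of the rules as described in this documentation, not Defcon's actual code: the SiteState class and outage_confirmed function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class SiteState:
    """Tracks consecutive results for one check on one site."""
    failing_threshold: int
    passing_threshold: int
    failed: bool = False
    fail_streak: int = 0
    pass_streak: int = 0

    def record(self, passed):
        """Record one check result and return whether the site is failing."""
        if passed:
            self.pass_streak += 1
            self.fail_streak = 0
            if self.failed and self.pass_streak >= self.passing_threshold:
                self.failed = False
        else:
            self.fail_streak += 1
            self.pass_streak = 0
            if not self.failed and self.fail_streak >= self.failing_threshold:
                self.failed = True
        return self.failed


def outage_confirmed(sites, site_threshold):
    """An outage is confirmed once enough sites are failing."""
    return sum(1 for state in sites.values() if state.failed) >= site_threshold
```

With failing_threshold and passing_threshold both set to 3, a site is marked as failing on its third consecutive failure and stays in that state until three consecutive passes are recorded.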

Handler specification

Each check needs one more attribute, spec, detailed in the next section, where the handler specification is configured.

Groups

A group is used to group checks together, and be able to filter which checks you list through the API.

Metadata

Attribute  Type    Example value  Description
name       string  Personal       A human-friendly name for the group

Handlers

A handler is a process that knows how to determine the status of a particular kind of external service. A handler can, for example, perform an HTTP request, open a TCP connection, or check the presence of some domain-specific item on a remote server. The exact list of supported handlers is described in the next sections.

A handler specification is a series of attributes that describes how to perform the check, and conditions on when to report an issue (for example, when the status code is not 200).

Valid attributes vary from handler to handler, but they all have one common attribute, kind, which specifies the kind of handler this is.

In the context of a check, a handler specification is laid out as:

{
  // Check attributes
  "spec": {
    "kind": "<handler type>",
    // Spec attributes
  }
}

Ping

This handler sends one ICMP echo request to the specified host and reports an error if it is unsuccessful.

Note that this may require some sort of elevated privilege to be able to run. For example, on Linux, it needs either to be run as root (not recommended), or to have the CAP_NET_RAW capability.

To set CAP_NET_RAW, you can execute the following command on Defcon's binary:

setcap cap_net_raw+ep defcon

Attributes

Attribute  Type    Example    Description
kind       string  "ping"     -
host       string  "8.8.8.8"  Host to which to send the ICMP echo request

TCP connection

This handler will attempt to open a TCP connection on a provided host and port, and fail if the connection is unsuccessful.

Attributes

Attribute  Type    Example          Description
kind       string  "tcp"            -
host       string  "93.184.216.34"  Domain name or IP address of the target host
port       int     80               Port on which to open the TCP connection
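Conceptually, this check boils down to attempting a connection and reporting whether it succeeded. The sketch below illustrates that with Python's standard library; it is not the handler's implementation, tcp_check is a hypothetical name, and the timeout parameter is an assumption (the handler documents no such attribute).

```python
import socket


def tcp_check(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host and tries each address in turn.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```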

UDP datagram

The UDP handler will attempt to send a datagram to a host and port, and expect to receive a specific response in return.

Attributes

Attribute  Type    Example         Description
kind       string  "udp"           -
host       string  "1.2.3.4"       Target host where to send the datagram
port       int     10000           Port to use as destination
message    string  "aGVsbG8="      Base64-encoded message to send on the socket
timeout    string  "5s"            Timeout before giving up on the response
content    string  "Z29vZGJ5ZQ=="  Base64-encoded value to expect in the response. This must be a substring of the actual response
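The example values above decode to hello (message) and goodbye (content). The following sketch shows how the two Base64 attributes could be used in an equivalent check. It is an illustration rather than the handler's code: the function names are hypothetical, the timeout is given in seconds instead of a duration string, and only IPv4 targets are handled.

```python
import base64
import socket


def response_matches(response, content_b64):
    """Return True if the decoded expected content appears in the response."""
    expected = base64.b64decode(content_b64)
    return expected in response


def udp_check(host, port, message_b64, content_b64, timeout=5.0):
    """Send the decoded message and compare the first datagram received."""
    message = base64.b64decode(message_b64)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(message, (host, port))
            response, _ = sock.recvfrom(65535)
        except OSError:
            # No response before the timeout, or the datagram was rejected.
            return False
    return response_matches(response, content_b64)
```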

HTTP request

The HTTP handler will perform an HTTP GET request with the specified parameters, and check a variety of elements from the response, namely the status code, the string content and a digest of the response body.

Attributes

Attribute   Type                 Example                    Description
kind        string               "http"                     -
url         string               "https://example.com"      Full URL to request
headers     map<string, string>  { "authorization": "me" }  List of headers to add to the request
timeout     int                  2                          Abort the request after this number of seconds
code        int                  201                        Status code of the response
content     string               "ACME"                     Substring to find in the response body
digest      string               "..."                      Hex-encoded SHA-512 sum of the response body
json_query  string               ".status == \"ok\""        JQ-compatible JSON query returning a boolean
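The code, content and digest attributes can be thought of as independent assertions on the response. Here is a minimal sketch of such an evaluation, assuming that unset attributes are simply skipped. The evaluate_response name and the content and digest failure messages are hypothetical, and json_query is left out because it would need a JQ implementation.

```python
import hashlib


def evaluate_response(status, body, code=None, content=None, digest=None):
    """Compare a response (status and body bytes) with the spec attributes.

    Returns a (passed, message) tuple; attributes left as None are ignored.
    """
    if code is not None and status != code:
        return False, f"status code was {status}"
    if content is not None and content.encode("utf-8") not in body:
        return False, "content mismatch"
    if digest is not None and hashlib.sha512(body).hexdigest() != digest.lower():
        return False, "digest mismatch"
    return True, ""
```

A value for digest can be computed ahead of time with hashlib.sha512(body).hexdigest() on a known-good response body.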

DNS

The DNS handler will retrieve DNS records of a specific type, for a specific domain, and check if one of the values matches the specification. Only the following DNS record types are supported:

  • NS
  • MX
  • A
  • AAAA
  • CNAME
  • CAA

Attributes

Attribute  Type    Example        Description
kind       string  "dns"          -
record     string  "A"            Type of DNS record to verify
domain     string  "example.com"  Domain name for which to retrieve the records
value      string  "1.2.3.4"      Value to compare to each retrieved record, must match exactly

Configuration

You can change the DNS resolver used to resolve DNS records by using the DNS_RESOLVER environment variable when starting the controller and the runners, like so: DNS_RESOLVER=1.2.3.4. By default, 1.1.1.1 is used.

Domain expiration

This handler will check Whois databases for the provided domain and will attempt to retrieve the domain's expiration date. The emitted event will be marked as failed if the expiration is within the configured window.

Not all TLDs expose their domains' expiration date in the Whois response. This handler will only work for those that do.

Attributes

Attribute  Type    Example        Description
kind       string  "domain"       -
domain     string  "example.com"  The base domain name to check for
window     string  "60d"          Period of time within which the handler should fail
attribute  string  "expiry date"  Whois attribute to use as the expiration date. Defaults to registry expiry date

TLS expiration

This handler can retrieve TLS certificates for a website and fail if its expiration date falls within a configurable window of time. This can help detect issues in your renewal processes and be used as a last resort reminder if you still do it manually.

For now, only TLS certificates served on port 443 are supported.

Attributes

Attribute  Type    Example        Description
kind       string  "tls"          -
domain     string  "example.com"  Domain to retrieve the certificate for
window     string  "15d"          Period of time before the expiration date to trigger an alert
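The window logic can be illustrated with Python's standard ssl module. This is a sketch of the idea, not the handler's implementation: both function names are hypothetical, the window is passed as a timedelta rather than a string such as 15d, and the certificate is validated with the system's default trust store.

```python
import socket
import ssl
from datetime import datetime, timedelta, timezone


def expires_within(not_after, window, now=None):
    """Return True if the expiration date falls within the window from now."""
    now = now or datetime.now(timezone.utc)
    return not_after - now <= window


def fetch_not_after(domain, timeout=5.0):
    """Retrieve the expiration date of the certificate served on port 443."""
    context = ssl.create_default_context()
    with socket.create_connection((domain, 443), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=domain) as tls:
            cert = tls.getpeercert()
    seconds = ssl.cert_time_to_seconds(cert["notAfter"])
    return datetime.fromtimestamp(seconds, tz=timezone.utc)
```

For example, expires_within(fetch_not_after("example.com"), timedelta(days=15)) would be true once the certificate has fewer than 15 days left, or has already expired.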

Application stores

Two handlers exist to verify the availability of Android and iOS applications on, respectively, the Play Store and the App Store. These can be used to monitor for Google or Apple removing your apps, as well as human error or malice.

App Store

Attributes

Attribute  Type    Example           Description
kind       string  "app_store"       -
bundle_id  string  "com.apple.Maps"  Bundle ID for the iOS app to monitor

Play Store

Attributes

Attribute  Type    Example                         Description
kind       string  "play_store"                    -
app_id     string  "com.google.android.apps.maps"  Application ID for the Android app to monitor

Python

This handler executes an external Python script to perform the actual check.

This script must contain a check() function that returns the status and message of the check. The constants OK, WARNING and CRITICAL are provided in the current module.

The handler looks for a file named <script>.py, so the script name must be provided without the extension.

def check():
  return (CRITICAL, "something unexpected happened")

Attributes

Attribute  Type    Example           Description
script     string  "mycustomscript"  The extension-stripped name of the script to execute

Configuration

The directory in which scripts are looked up can be configured through the SCRIPTS_PATH environment variable, which defaults to /var/lib/defcon/scripts.

Dead Man Switch

The Dead Man Switch (DMS) handler will trigger an alert if an external service has not "checked in" in some time.

More precisely, a separate HTTP server is spawned, to which external services can send GET requests to "check in". These services would usually check in after performing some task successfully (a backup process, for instance) to let Defcon know the task has completed. A missed check-in indicates that the task has failed, which triggers an alert.

Attributes

Attribute    Type    Example          Description
kind         string  "deadmanswitch"  -
stale_after  string  "1h"             The duration after which to create an outage if no check in happened

Configuration

The DMS_ENABLE environment variable can be used to disable the HTTP server used to receive check-ins. Additionally, its listening address (127.0.0.1:8080 by default) can be configured through DMS_LISTEN.

To check in, a service needs to perform a GET request at http://${LISTEN_ADDRESS}/checkin/<check_id>.
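A task can check in from its own code once it has finished successfully. The sketch below uses Python's standard library; checkin_url and check_in are hypothetical helper names, and it is assumed that <check_id> is the identifier of the check as returned by the API.

```python
import urllib.request


def checkin_url(listen_address, check_id):
    """Build the check-in URL for a given listen address and check."""
    return f"http://{listen_address}/checkin/{check_id}"


def check_in(listen_address, check_id, timeout=5.0):
    """Send the GET request and return True on a 2xx response."""
    url = checkin_url(listen_address, check_id)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return 200 <= response.status < 300
```

A backup script, for instance, would call check_in("127.0.0.1:8080", "<check_id>") as its last step, so that a crash earlier in the script results in a missed check-in.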

Alerters

REST API