Defcon
Defcon is a tool for monitoring external services for specific failure scenarios. You can see it as a lightweight Nagios focused on watching over networked services. This kind of tool is sometimes called an uptime monitoring service.
A common use case for Defcon is to periodically perform an HTTP request to an API endpoint, verify that the response status code is 200 OK and that the content contains the word "ready", and send a message to a Slack channel whenever the check fails three times in a row.
This documentation is still being written: it is far from complete and may not give you all the information you need.
Concepts
Checks
The main concept in Defcon is that of a check. A check is a definition for an external service, including how to monitor it, how to detect issues, when to consider it as failing and what to do when it is.
Each check includes what is called a handler specification (or spec), which describes how this service is to be monitored. This specification must be one of the supported handlers, as described in this section.
Alerters
Each check can optionally trigger an alert when an outage is confirmed by its handler. All alerters in Defcon use webhooks to transmit information about the failing check to an HTTP server (outside Defcon's scope) that will handle the actual notification.
Site
A site is a distinct location (read: a server) from which checks can be run. This allows services to be monitored from different locations concurrently, which helps avoid false detections caused by hardware failures, network partitions or transient problems.
Each check can be set to run on one or more different sites, with a configurable number of failing sites needed for an outage to be confirmed.
Getting started
This guide will help you set up a simple Defcon instance, configure a first HTTP check and query the results through the bundled API.
Requirements
The following list describes the infrastructure required to follow this guide:
- A Linux box
- A MySQL instance, with a user account and an empty database
Download a release
Binaries are listed under the Releases section of the GitHub repository. A new release will be created for each tag on the codebase, and a special tip release follows master and provides the latest snapshot of the code.
$ curl -L https://github.com/apognu/defcon/releases/download/tip/defcon-tip-x86_64 > defcon
$ chmod +x defcon
Running the controller
From here, you can start the controller:
$ RUST_LOG=defcon=debug \
DSN=mysql://defcon:password@mysql.host/defcon?ssl-mode=DISABLED \
./defcon
INFO[2021-02-06T11:48:51.801+0000] starting api process port="8000"
INFO[2021-02-06T11:48:51.801+0000] starting handler process interval="1s"
INFO[2021-02-06T11:48:51.801+0000] no public key found, disabling runner endpoints
Creating your first check
In this guide, we will monitor two HTTP services that will need to return a 200 OK status code to pass. Each check will run every 10 seconds and will require three failures to be considered failed. We will use the Defcon API to create the checks.
Our first check can be represented with the following JSON:
{
  "name": "Successful HTTP request",
  "interval": "10s",
  "sites": ["@controller"],
  "passing_threshold": 3,
  "failing_threshold": 3,
  "site_threshold": 1,
  "spec": {
    "kind": "http",
    "url": "http://jsonplaceholder.typicode.com/users",
    "code": 200
  }
}
This snippet defines the following:
- The human-readable name for this check
- The interval at which the check should be run (here, 10 seconds)
- The sites on which the check will be run (@controller is the implicit name for Defcon's controller)
- Each site will be considered as failed when the check fails three times in a row, and recovered after three successes
- An outage will be created when the number of failed sites reaches 1
- This check uses the http handler, making a GET request to the provided URL, and expects a response status code of 200
You can create the check by performing a POST request to http://127.0.0.1:8000/api/checks:
$ curl -v -XPOST http://127.0.0.1:8000/api/checks -d@book.json
HTTP/1.1 201 Created
location: /api/checks/82a3b532-0883-4544-ba2c-0a7159a89d8e
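The same request can be sketched in Python using only the standard library. The payload mirrors the JSON shown earlier in this guide. The helper names build_check, build_request and create_check are illustrative, and the Content-Type header is an assumption about what the API accepts, not something this guide confirms.

```python
import json
import urllib.request

# Default API address used throughout this guide.
API_URL = "http://127.0.0.1:8000/api/checks"


def build_check(name, url, code=200):
    """Return a check definition shaped like the JSON shown in this guide."""
    return {
        "name": name,
        "interval": "10s",
        "sites": ["@controller"],
        "passing_threshold": 3,
        "failing_threshold": 3,
        "site_threshold": 1,
        "spec": {"kind": "http", "url": url, "code": code},
    }


def build_request(check, api_url=API_URL):
    """Prepare (but do not send) the POST request that creates the check."""
    body = json.dumps(check).encode("utf-8")
    # Assumption: the API accepts a JSON body sent with this header.
    headers = {"Content-Type": "application/json"}
    return urllib.request.Request(api_url, data=body, headers=headers, method="POST")


def create_check(check, api_url=API_URL):
    """Send the request and return the Location header of the created check."""
    with urllib.request.urlopen(build_request(check, api_url)) as response:
        return response.headers.get("Location")


check = build_check("Successful HTTP request", "http://jsonplaceholder.typicode.com/users")
request = build_request(check)
```

Calling create_check(check) against a running controller should return the path of the new check, which can then be queried as shown next.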
You can check the configuration for this check by calling the API with the returned path:
$ curl http://127.0.0.1:8000/api/checks/82a3b532-0883-4544-ba2c-0a7159a89d8e
{
  "alerter": null,
  "enabled": true,
  "failing_threshold": 3,
  "interval": "10s",
  "name": "Successful HTTP request",
  "passing_threshold": 3,
  "silent": false,
  "site_threshold": 1,
  "sites": [
    "@controller"
  ],
  "spec": {
    "code": 200,
    "content": null,
    "digest": null,
    "headers": {},
    "kind": "http",
    "timeout": null,
    "url": "http://jsonplaceholder.typicode.com/users"
  },
  "uuid": "82a3b532-0883-4544-ba2c-0a7159a89d8e"
}
Check the check status
If you look at your console where Defcon is running, you should see that the handler for this check is running:
DEBG[2021-02-05T20:30:44.323+0000] check passed site="@controller" kind="http" check="82a3b532-0883-4544-ba2c-0a7159a89d8e" name="Successful HTTP request"
DEBG[2021-02-05T20:30:54.203+0000] check passed site="@controller" kind="http" check="f00ee7ad-b389-4819-bb24-e9797735e2df" name="Successful HTTP request"
DEBG[2021-02-05T20:30:54.207+0000] check passed site="@controller" kind="http" check="cf4a4917-92fb-4721-ab76-cae6a5fda2b8" name="Successful HTTP request"
Create a failing check
As an exercise for the reader, create another check from the above model, but this time, define it as expecting a 201 status code. This check should fail and create an outage when failing_threshold is reached.
DEBG[2021-02-05T20:56:42.185+0000] check failed site="@controller" kind="http" check="4766d0dc-5d39-4ec7-8aee-95b46f33dc55" name="Personal - Website & API" message="status code was 200"
DEBG[2021-02-05T20:56:53.164+0000] check failed site="@controller" kind="http" check="4766d0dc-5d39-4ec7-8aee-95b46f33dc55" name="Personal - Website & API" message="status code was 200"
DEBG[2021-02-05T20:57:04.192+0000] check failed site="@controller" kind="http" check="4766d0dc-5d39-4ec7-8aee-95b46f33dc55" name="Personal - Website & API" message="status code was 200"
INFO[2021-02-05T20:57:04.291+0000] site outage started site="@controller" kind="http" check="4766d0dc-5d39-4ec7-8aee-95b46f33dc55" failed="3/3" passed="0/3"
INFO[2021-02-05T20:57:04.358+0000] outage confirmed check="4766d0dc-5d39-4ec7-8aee-95b46f33dc55" outage="f3bdec24-1f7f-4fd6-bbe8-2e937cc67746" since="2021-02-05 20:57:04 UTC"
Here you can see that a site outage was created, and since our site_threshold was set to 1, an outage was confirmed for the check.
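The interplay between failing_threshold, passing_threshold and site_threshold can be modeled with a small sketch. This is a hypothetical model written for illustration, not Defcon's actual implementation: the SiteState class and outage_confirmed function are invented names, and the exact recovery rules are an assumption based on the descriptions in this guide.

```python
from dataclasses import dataclass


@dataclass
class SiteState:
    """Tracks consecutive results for one site (illustrative model only)."""

    failing_threshold: int
    passing_threshold: int
    failed_streak: int = 0
    passed_streak: int = 0
    in_outage: bool = False

    def record(self, passed: bool) -> None:
        if passed:
            self.passed_streak += 1
            self.failed_streak = 0
            # A failing site recovers after enough successive passes.
            if self.in_outage and self.passed_streak >= self.passing_threshold:
                self.in_outage = False
        else:
            self.failed_streak += 1
            self.passed_streak = 0
            # A site is marked as failing after enough successive failures.
            if self.failed_streak >= self.failing_threshold:
                self.in_outage = True


def outage_confirmed(sites, site_threshold):
    """An outage is confirmed once enough sites are failing."""
    failing = sum(1 for state in sites.values() if state.in_outage)
    return failing >= site_threshold


sites = {"@controller": SiteState(failing_threshold=3, passing_threshold=3)}
history = []
for passed in [False, False, False, True, True, True]:
    sites["@controller"].record(passed)
    history.append(outage_confirmed(sites, site_threshold=1))
```

With three failures followed by three passes, the outage is confirmed on the third failure and cleared on the third pass.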
Configuration
The controller has a few configuration knobs you can tweak to adjust its overall behavior. They are described in this document. Most of them are optional and default to maybe-not-so-sensible values. All configuration is applied through environment variables.
Handler configuration
Environment variable | Required | Default value | Description |
---|---|---|---|
RUST_LOG | | defcon=info | Log level for Defcon and its dependencies |
DSN | Yes | | Connection string to the MySQL database |
PUBLIC_KEY | Yes | | Path to a PEM-encoded ECDSA public key |
API_ENABLE | | 1 | Enable or disable the API process |
API_LISTEN | | 127.0.0.1:8000 | Set the listen address and port of the API process |
WEB_ENABLE | | 0 | Enable or disable the Web administration interface |
WEB_STATUS_PAGE_ENABLE | | 0 | Enable or disable the public status page, requires WEB_ENABLE |
HANDLER_ENABLE | | 1 | Enable or disable the handler process |
HANDLER_INTERVAL | | 1s | Interval between handler loop iterations |
HANDLER_SPREAD | | 0s | Maximum random delay applied when a check needs to run |
CLEANER_ENABLE | | 0 | Enable or disable the cleaner process |
CLEANER_INTERVAL | | 10m | Interval between cleaner loop iterations |
CLEANER_THRESHOLD | | 1y | Period of time after which to delete stale objects |
ALERTER_DEFAULT | | | Alerter to create checks with, if unspecified |
ALERTER_FALLBACK | | | Alerter to be called when none is set on a check |
RUST_LOG
This allows for controlling the log level for each individual dependency. Defcon uses the defcon identifier, and defaults to info, which will mainly print errors, API access logs and messages when outages are created and resolved.
DSN
This should be a full connection string (with options) to the database Defcon is to use, and start with mysql://, as this is the only database supported by Defcon. This is an example of a valid DSN:
mysql://user:password@host:3306/defcon
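If you want to sanity-check a DSN before starting the controller, a minimal sketch with Python's standard library could look like the following. The parse_dsn helper is hypothetical and only verifies the general shape of the string. It does not reflect how Defcon itself parses it.

```python
from urllib.parse import urlsplit


def parse_dsn(dsn):
    """Split a mysql:// connection string into its components."""
    parts = urlsplit(dsn)
    if parts.scheme != "mysql":
        raise ValueError("the DSN must start with mysql://")
    return {
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
        "options": parts.query,
    }


dsn = parse_dsn("mysql://user:password@host:3306/defcon")
```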
PUBLIC_KEY
This option should contain the path to an existing PEM-encoded ECDSA public key. The following commands generate a compatible key pair for use with Defcon's controller:
$ openssl ecparam -genkey -name prime256v1 -noout | openssl pkcs8 -topk8 -nocrypt -out defcon-private.pem
$ openssl ec -in defcon-private.pem -pubout -out defcon-public.pem
API_ENABLE
A value of 0 or 1 respectively disables and enables the API process bundled within Defcon.
API_LISTEN
A string representing the IP address and port on which the API process will listen. By default, the API is only reachable from the local host on port 8000.
WEB_ENABLE
A value of 0 or 1 respectively disables and enables the Web administration interface used to manage and visualize Defcon's operations.
WEB_STATUS_PAGE_ENABLE
A value of 0 or 1 respectively disables and enables the public (unauthenticated) Web status page. You will also have to enable each check individually for it to be shown on the page.
HANDLER_ENABLE
A value of 0 here disables check handling on the controller. If it is disabled, no check will run on this node. This is particularly useful if all your checks are configured to run on off-site runners and you would prefer to use the controller only as Defcon's control plane.
HANDLER_INTERVAL
If the handler process is enabled, this setting defines the interval at which Defcon determines whether checks need to be run. It accepts human-readable durations (such as 1s or 5m), defaults to 1s and cannot be smaller than one second.
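To illustrate what such a duration represents, here is a rough sketch of a parser for the simplest forms. It is an assumption-laden approximation: only single-unit values with the suffixes s, m, h and d are handled, which may not match everything Defcon accepts (for example, values such as 1y are not covered), and the function names are invented for this example.

```python
import re

# Assumption: only these single-unit suffixes are handled in this sketch.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_duration(text):
    """Convert a duration such as '5m' into a number of seconds."""
    match = re.fullmatch(r"(\d+)([smhd])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    value, unit = match.groups()
    return int(value) * UNIT_SECONDS[unit]


def validate_handler_interval(text):
    """Reject intervals shorter than one second, as the setting requires."""
    seconds = parse_duration(text)
    if seconds < 1:
        raise ValueError("HANDLER_INTERVAL cannot be smaller than one second")
    return seconds
```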
HANDLER_SPREAD
Maximum amount of time for the controller to wait before executing a stale check. This can be useful to prevent all checks running at the exact same time. When a check needs executing, if HANDLER_SPREAD is set to 5s, Defcon will wait for a random duration between 0s and 5s before executing the related handler.
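The effect of the spread can be pictured with a few lines of Python. This is only a sketch: the spread_delay name is invented, and the uniform distribution is an assumption, since this document only states that the delay is random.

```python
import random


def spread_delay(spread_seconds, rng=random):
    """Pick a random delay between 0 and spread_seconds."""
    # Assumption: the delay is drawn uniformly from the allowed range.
    return rng.uniform(0, spread_seconds)


# Draw a few delays for HANDLER_SPREAD=5s using seeded generators.
delays = [spread_delay(5.0, random.Random(seed)) for seed in range(100)]
```

Each check would then sleep for its own delay before running, so checks that become stale at the same instant do not all hit the network together.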
CLEANER_ENABLE
Use 1 here if you wish to enable the cleaner process. The cleaner process is used to delete old items (events, site outages and confirmed outages) from the database. Only resolved outages are eligible to be cleaned.
CLEANER_INTERVAL
This option defines the interval at which Defcon will look for items to delete from the database. This is a maintenance operation and does not need to run as often as the handler process.
CLEANER_THRESHOLD
How old should an outage be to be eligible for deletion? Here, a value of 6 months will delete all resolved outages, site outages and events that are at least six months old.
Off-site runner
By default, all checks are run on the controller node, and require only one site outage for an outage to be confirmed. Defcon allows offloading check handling to other nodes through off-site runners.
A runner is a stripped-down instance of Defcon that only knows how to run handlers and report their status back to the controller. Its workflow is the following:
- Regularly check with the controller for stale checks that are configured to run on this particular runner
- Call the handlers for each of those checks
- Report their status back to the controller
- Start over
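The workflow above can be sketched as a simple polling loop. This is a conceptual illustration, not the runner's real code: fetch_stale_checks, run_handler and report are stand-in callables supplied by the caller, and the one-second poll interval is taken from the log output shown later in this section.

```python
import time


def run_once(fetch_stale_checks, run_handler, report):
    """One iteration: fetch stale checks, run each handler, report back."""
    results = []
    for check in fetch_stale_checks():
        status = run_handler(check)
        report(check, status)
        results.append((check["uuid"], status))
    return results


def run_forever(fetch_stale_checks, run_handler, report, poll_interval=1.0):
    """Repeat the workflow until the process is stopped."""
    while True:
        run_once(fetch_stale_checks, run_handler, report)
        time.sleep(poll_interval)


# Stand-in callables, for illustration only.
reported = []
results = run_once(
    fetch_stale_checks=lambda: [{"uuid": "example-check", "spec": {"kind": "http"}}],
    run_handler=lambda check: "passed",
    report=lambda check, status: reported.append((check["uuid"], status)),
)
```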
An off-site runner authenticates itself to the controller by possessing the private key matching the controller's public key.
Whereas the controller is identified by the static @controller tag, each runner must be configured to have a unique tag, such as eu-1 or home-runner. Site identifiers should only contain lowercase alphanumeric characters and dashes (^[a-z0-9-]+$).
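A runner tag can be validated against that pattern before deployment. The snippet below is a small sketch using the regular expression given above. The is_valid_runner_tag helper is an illustrative name.

```python
import re

# Pattern quoted in this section for site identifiers.
SITE_PATTERN = re.compile(r"^[a-z0-9-]+$")


def is_valid_runner_tag(tag):
    """Return True if the tag only uses lowercase alphanumerics and dashes."""
    return SITE_PATTERN.fullmatch(tag) is not None
```

Note that @controller does not match the pattern, which is consistent with it being reserved for the controller.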
Download the binary
$ curl -L https://github.com/apognu/defcon/releases/download/tip/defcon-runner-tip-x86_64 > defcon-runner
$ chmod +x defcon-runner
Generate keys
You first need to generate an ECDSA key pair: the controller uses the public key, and each runner uses the matching private key to authenticate itself. Without this key pair, the API endpoint used by the runners will be disabled.
$ openssl ecparam -genkey -name prime256v1 -noout | openssl pkcs8 -topk8 -nocrypt -out defcon-private.pem
$ openssl ec -in defcon-private.pem -pubout -out defcon-public.pem
Start the controller with runner support
$ PUBLIC_KEY=./defcon-public.pem \
RUST_LOG=defcon=debug \
DSN=mysql://defcon:password@mysql.host/defcon?ssl-mode=DISABLED \
./defcon
INFO[2021-02-06T11:48:51.801+0000] starting api process port="8000"
INFO[2021-02-06T11:48:51.801+0000] starting handler process interval="1s"
Start a runner
$ PRIVATE_KEY=./defcon-private.pem \
CONTROLLER_URL=http://127.0.0.1:8000 \
SITE=eu-1 \
./defcon-runner
INFO[2021-02-06T14:18:36.973+0000] starting runner process site="eu-1" poll_interval="1s"
This runner will start running any stale check configured to run on site eu-1.
Checks
A check is used to describe an external service to be monitored. Among other things, it lets you specify some metadata about the check, how and where the check should be run, the actual handler configuration to use and the conditions for confirming outages.
Metadata
Attribute | Type | Example value | Description |
---|---|---|---|
name | string | "acme-public-site" | A human-friendly name used in logs and alerters |
alerter | UUID | "19b9eb20-3e3e-46d5-801f-a912e159913c" | Alerter to be triggered when an outage is created |
enabled | bool | true | When disabled, a check will not run |
on_status_page | bool | false | When enabled, the check will appear on the public status page, if that page is enabled |
silent | bool | false | When silent, a check will not trigger its alerter |
group | string | "9b77035c-218e-4d32-bcd7-4a015f7ee147" | Put the check into a pre-existing group |
Run and error condition
Attribute | Type | Example value | Description |
---|---|---|---|
sites | [string] | ["us-1", "eu-1"] | List of sites where this check should run |
interval | string | "10s" | Interval of time between subsequent runs |
site_threshold | int | 2 | Number of sites that have to fail to confirm an outage |
failing_threshold | int | 3 | Number of successive fails required to mark a site as failing |
passing_threshold | int | 3 | Number of successive passes required to mark a site as recovered |
Note: if a check is to run on the controller as well as on another site, the controller's identifier should be given explicitly, e.g. "sites": ["@controller", "eu-1"].
Handler specification
Each check needs one more attribute, spec, detailed in the next section, where the handler specification is configured.
Groups
A group gathers checks together, so that you can filter which checks are listed through the API.
Metadata
Attribute | Type | Example value | Description |
---|---|---|---|
name | string | "Personal" | A human-friendly name for the group |
Handlers
A handler is a process that knows how to determine the status of a particular kind of external service. A handler can, for example, perform an HTTP request, open a TCP connection, or check the presence of some domain-specific item on a remote server. The exact list of supported handlers is described in the next sections.
A handler specification is a series of attributes that describes how to perform the check, and the conditions under which to report an issue (for example, when the status code is not 200).
Valid attributes vary from handler to handler, but they all have one common attribute, kind, which specifies the kind of handler this is.
In the context of a check, a handler specification is laid out as:
{
  // Check attributes
  "spec": {
    "kind": "<handler type>",
    // Spec attributes
  }
}
Ping
This handler sends one ICMP echo request to the specified host and reports an error if it is unsuccessful.
Note that this may require some sort of elevated privilege to be able to run. For example, on Linux, it needs either to be run as root (not recommended), or to have the CAP_NET_RAW capability.
To set CAP_NET_RAW, you can execute the following command on Defcon's binary:
$ setcap cap_net_raw+ep defcon
Attributes
Attribute | Type | Example | Description |
---|---|---|---|
kind | string | "ping" | - |
host | string | "8.8.8.8" | Host to which to send the ICMP echo request |
TCP connection
This handler will attempt to open a TCP connection on a provided host and port, and fail if the connection is unsuccessful.
Attributes
Attribute | Type | Example | Description |
---|---|---|---|
kind | string | "tcp" | - |
host | string | "93.184.216.34" | Domain name or IP address of the target host |
port | int | 80 | Port on which to open the TCP connection |
UDP datagram
The UDP handler will attempt to send a datagram to a host and port, and expect to receive a specific response in return.
Attributes
Attribute | Type | Example | Description |
---|---|---|---|
kind | string | "udp" | - |
host | string | "1.2.3.4" | Target host where to send the datagram |
port | int | 10000 | Port to use as destination |
message | string | "aGVsbG8=" | Base64-encoded message to send on the socket |
timeout | string | "5s" | Timeout before giving up on the response |
content | string | "Z29vZGJ5ZQ==" | Base64-encoded value to expect in the response. This must be contained within the actual response |
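The example values in the table are the Base64 encodings of "hello" and "goodbye". The following sketch shows how such values could be produced and how the containment rule on content can be understood. The encode_field and response_matches helpers are illustrative names, not part of Defcon.

```python
import base64


def encode_field(raw: bytes) -> str:
    """Base64-encode a raw payload for use in the message or content field."""
    return base64.b64encode(raw).decode("ascii")


def response_matches(response: bytes, expected_b64: str) -> bool:
    """Return True if the decoded expected value appears in the response."""
    return base64.b64decode(expected_b64) in response


spec = {
    "kind": "udp",
    "host": "1.2.3.4",
    "port": 10000,
    "message": encode_field(b"hello"),
    "timeout": "5s",
    "content": encode_field(b"goodbye"),
}
```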
HTTP request
The HTTP handler will perform an HTTP GET request with the specified parameters and check a variety of elements of the response: the status code, the string content and a digest of the response body.
Attributes
Attribute | Type | Example | Description |
---|---|---|---|
kind | string | "http" | - |
url | string | "https://example.com" | Full URL to request |
headers | map<string, string> | { "authorization": "me" } | List of headers to add to the request |
timeout | int | 2 | Abort the request after this number of seconds |
code | int | 201 | Status code of the response |
content | string | "ACME" | Substring to find in the response body |
digest | string | "..." | Hex-encoded SHA-512 sum of the response body |
json_query | string | ".status == \"ok\"" | JQ-compatible JSON query returning a boolean |
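To fill in the digest attribute, you need the hex-encoded SHA-512 sum of the expected response body. A minimal sketch with Python's hashlib follows. The body_digest name and the sample body are illustrative, and the digest must be computed over the exact bytes the server returns.

```python
import hashlib


def body_digest(body: bytes) -> str:
    """Return the hex-encoded SHA-512 sum of a response body."""
    return hashlib.sha512(body).hexdigest()


# Sample body used for illustration only.
digest = body_digest(b'{"status": "ok"}')
```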
DNS
The DNS handler will retrieve DNS records of a specific type, for a specific domain, and check if one of the values matches the specification. Only the following DNS record types are supported:
- NS
- MX
- A
- AAAA
- CNAME
- CAA
Attributes
Attribute | Type | Example | Description |
---|---|---|---|
kind | string | "dns" | - |
record | string | "A" | Type of DNS record to verify |
domain | string | "example.com" | Domain name for which to retrieve the records |
value | string | "1.2.3.4" | Value to compare to each retrieved record, must match exactly |
Configuration
You can change the DNS resolver used to resolve DNS records by using the DNS_RESOLVER environment variable when starting the controller and the runners, like so: DNS_RESOLVER=1.2.3.4. By default, 1.1.1.1 is used.
Domain expiration
This handler will check Whois databases for the provided domain and will attempt to retrieve the domain's expiration date. The emitted event will be marked as failed if the expiration is within the configured window.
Not all TLDs expose their domains' expiration date in the Whois response; this handler will only work for those that do.
Attributes
Attribute | Type | Example | Description |
---|---|---|---|
kind | string | "domain" | - |
domain | string | "example.com" | The base domain name to check for |
window | string | "60d" | Period of time within which the handler should fail |
attribute | string | "expiry date" | Whois attribute to use as the expiration date. Defaults to registry expiry date |
TLS expiration
This handler can retrieve TLS certificates for a website and fail if its expiration date falls within a configurable window of time. This can help detect issues in your renewal processes and be used as a last resort reminder if you still do it manually.
For now, only TLS certificates served on port 443 are supported.
Attributes
Attribute | Type | Example | Description |
---|---|---|---|
kind | string | "tls" | - |
domain | string | "example.com" | Domain to retrieve the certificate for |
window | string | "15d" | Period of time before the expiration date to trigger an alert |
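The window logic shared by the domain and TLS expiration handlers can be summarized as: fail when the expiration date is closer than the configured window. The sketch below illustrates that comparison. The expires_within helper and the dates used are hypothetical, and the handling of the exact boundary is an assumption.

```python
from datetime import datetime, timedelta, timezone


def expires_within(not_after, window, now=None):
    """Return True if the expiration date falls within the window from now."""
    now = now or datetime.now(timezone.utc)
    return not_after - now <= window


# Fixed reference date so that the example is reproducible.
now = datetime(2021, 2, 6, tzinfo=timezone.utc)
window = timedelta(days=15)

soon = expires_within(datetime(2021, 2, 16, tzinfo=timezone.utc), window, now)
later = expires_within(datetime(2021, 6, 1, tzinfo=timezone.utc), window, now)
```

Here a certificate expiring in 10 days falls within the 15-day window and would fail the check, while one expiring in June would pass.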
Application stores
Two handlers exist to verify the availability of Android and iOS applications on, respectively, the Play Store and the App Store. These can be used to monitor for Google or Apple removing your apps, as well as human error or malice.
App Store
Attributes
Attribute | Type | Example | Description |
---|---|---|---|
kind | string | "app_store" | - |
bundle_id | string | "com.apple.Maps" | Bundle ID for the iOS app to monitor |
Play Store
Attributes
Attribute | Type | Example | Description |
---|---|---|---|
kind | string | "play_store" | - |
app_id | string | "com.google.android.apps.maps" | Application ID for the Android app to monitor |
Python
This handler executes an external Python script to perform the actual check.
This script must contain a check() function that returns the status and message of the check. The constants OK, WARNING and CRITICAL are provided in the current module.
The handler looks for a file named <script>.py, so the script name must be provided without the extension.
def check():
    return (CRITICAL, "something unexpected happened")
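A slightly fuller script might look like the sketch below, which reports on disk usage. Everything specific here is an assumption made for illustration: the thresholds, the classify helper and the monitored path are invented, and the placeholder constants exist only so that the file can also be run outside Defcon, where OK, WARNING and CRITICAL are not provided.

```python
import shutil

# Defcon provides OK, WARNING and CRITICAL in the module. These placeholders
# (assumed values) are only set when the names are missing, so that the
# script can also be run on its own.
for _name in ("OK", "WARNING", "CRITICAL"):
    globals().setdefault(_name, _name)


def classify(used_ratio):
    """Map a disk usage ratio to a status and message (hypothetical thresholds)."""
    if used_ratio >= 0.95:
        return (CRITICAL, f"disk is {used_ratio:.0%} full")
    if used_ratio >= 0.80:
        return (WARNING, f"disk is {used_ratio:.0%} full")
    return (OK, f"disk is {used_ratio:.0%} full")


def check():
    """Entry point called by the handler: returns (status, message)."""
    usage = shutil.disk_usage("/")
    return classify(usage.used / usage.total)
```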
Attributes
Attribute | Type | Example | Description |
---|---|---|---|
script | string | "mycustomscript" | The extension-stripped name of the script to execute |
Configuration
The directory in which scripts are looked up can be configured through the SCRIPTS_PATH environment variable, which defaults to /var/lib/defcon/scripts.
Dead Man Switch
The Dead Man Switch (DMS) handler will trigger an alert if an external service has not "checked in" in some time.
More precisely, a separate HTTP server is spawned, to which external services can send GET requests to "check in". These services would usually check in after performing some task successfully (a backup process, for instance) to let Defcon know the task finished successfully. A missed check-in indicates that the task has failed and triggers an alert.
Attributes
Attribute | Type | Example | Description |
---|---|---|---|
kind | string | "deadmanswitch" | - |
stale_after | string | "1h" | The duration after which to create an outage if no check in happened |
Configuration
The DMS_ENABLE environment variable can be used to disable the HTTP server used to receive check-ins. Additionally, its listening address (127.0.0.1:8080 by default) can be configured through DMS_LISTEN.
To check in, a service needs to perform a GET request at http://${LISTEN_ADDRESS}/checkin/<check_id>.
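A task could check in at the end of a successful run with a few lines of Python. This is a sketch under assumptions: the checkin_url and check_in helpers are illustrative names, the default address is the one documented above, and the identifier used below assumes that check_id is the check's UUID.

```python
import urllib.request


def checkin_url(check_id, listen_address="127.0.0.1:8080"):
    """Build the check-in URL for a given check."""
    return f"http://{listen_address}/checkin/{check_id}"


def check_in(check_id, listen_address="127.0.0.1:8080", timeout=5):
    """Send the GET request and return the HTTP status code."""
    url = checkin_url(check_id, listen_address)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status


# Assumption: the check identifier is the check's UUID.
url = checkin_url("82a3b532-0883-4544-ba2c-0a7159a89d8e")
```

A backup script, for instance, would call check_in(...) as its last step, so that a crash or early exit results in a missed check-in.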