Airbnb Engineering & Data Science

This repo contains the interferon gem. This gem enables you to store your alerts configuration in code. You should create your own repository, with a Gemfile which imports the interferon gem. For an example of such a repository, along with example configuration and alerts files, see https://www.github.com/airbnb/alerts_example

Running This Gem

This gem provides a single executable, called interferon.

You are meant to invoke it like so:

$ bundle exec interferon --config /path/to/config_file

Additional options:

-h, --help -- prints out usage information
-n, --dry-run -- runs interferon without making any changes to alerting destinations

Configuration File

The configuration file is written in YAML.

It accepts the following parameters:

verbose_logging -- whether to print more output
alerts_repo_path -- the location to your alerts repo, containing your interferon DSL files
group_sources -- a list of sources which can return groups of people to alert
host_sources -- a list of sources which can read inventory systems and return lists of hosts to monitor
destinations -- a list of alerting providers, which can monitor metrics and dispatch alerts as specified in your alerts dsl files

For more information, see config.example.yaml file in this repo.

The Moving Parts

This repo knows about four kinds of objects:

host_sources: these query various inventory systems and return lists of hosts or entities to alert on
destinations: these are metric systems, which can watch metrics and alert engineers
groups: these are groups of actual engineers who can be alerted in case of trouble
alerts: these are ruby DSL files which specify when and how engineers and groups are alerted via the destination about hosts

Host Sources

optica: can read a list of AWS instances from optica
optica_services: returns smartstack service information parsed from optica
aws_rds: lists RDS instances
aws_dynamo: lists dynamo-db tables
aws_elasticache: lists elasticache nodes and clusters

Destinations

Datadog

Datadog is our only alerting destination at the moment. Datadog's alerting syntax rule are here: http://docs.datadoghq.com/api/#alerts Here's a chart explaining the datadog metric syntax (generated via asciiflow):

    +---------+ alert condition +-------------------------------------------------+
    |                                                                             |
    |              +-----+ metric to alert on                                     |
    |              |                                                              |
    |              |    tags to slice the metric by +------+                      |
    |              |                                       |                      |
    v              v                                       v                      v
  |----------| |-------------------------||--------------------------|          |---|
  max(last_5m):avg:haproxy_count_by_status{role:<%= role %>,status:up} by {host} > 0
  ^      ^      ^                                                          ^
  |      |      |                                                          |
  |      | +----|------------------------------+                           |
  |      | | math on the metric over all tags  |                           |
  |      | |-----------------------------------|            +------------------------------------+
  |      | | * max, min, avg, sum              |            |trigger a separate alert for each   |
  |      + +-----------------------------------+            |different value of these tags the   |
  | +----|----------------------------------------------+   |entire `by {}` clause can be ommited|
  | | the interval to look at; always starts with last_ |   +------------------------------------+
  | |---------------------------------------------------|
  | | * 5m, 10m, 15m, 30m                               |
  | | * 1h, 2h, 4h                                      |
  + +---------------------------------------------------+
 +-------------------------------------------------------------------------------------------------+
 | metric condition, can be one of:                                                                |
 |-------------------------------------------------------------------------------------------------|
 | * max: the metric gets this high at least once during the interval                              |
 | * avg: the metric is this on average during the interval                                        |
 | * min: the metric is this small at least once during the interval                               |
 | * change: the metric changes this much between a value N minutes ago and now (raw difference).  |
 | * pct_change: the metric changes this much between a value N minutes ago and now (percentage).  |
 +-------------------------------------------------------------------------------------------------+

Groups

Groups actually come from group_sources. We only have a single group source right now, which reads groups in YAML files from the filesystem. However, we would like to add additional group sources, such as LDAP-based ones.

Interferon

Running This Gem

Configuration File

The Moving Parts

Destinations