In the Production Engineering team at Jump Trading, we are responsible for a large global fleet of servers and other devices as well as a wide variety of software. To monitor all this infrastructure, we are big fans of the TICK stack from InfluxData. This post announces the open source release of a piece of infrastructure that we have found critical to our successful deployment of the TICK stack at scale, and provides a high level description of how we approached and solved some of the challenges throughout our journey.
At Jump we initially deployed the telegraf agent across our fleet of machines starting with version 0.9. In the two years since then, the growth of metrics going through telegraf at Jump has increased enormously. As well as the core system metrics, we have written many plugins (some of which are useful outside of Jump and have been open sourced as part of the core telegraf software), and we have implemented metrics sent using the InfluxDB line protocol to many internal applications.
For operational simplicity and control, we adopted the pattern of running telegraf everywhere and teaching applications to write to a specific port on the local host where the telegraf socket listener picks it up. This allowed us to ‘enrich’ the metrics from the application with standardized tags added by telegraf (for example: host, data center, department). It also allowed us to batch and randomize the sending of metrics across our network to avoid microbursts in metrics traffic. On the server side, we adopted the pattern (primarily to increase resiliency and reduce cardinality) of running many InfluxDB servers, each one containing a set of metrics that would be useful for a specific set of users.
As an example, for application A we may run an InfluxDB instance for the metrics from Application A as well as the OS metrics from hosts involved in running Application A. We may run multiple instances, one with a fairly short retention policy that operational teams point their regularly refreshing Grafana dashboards at and another instance that is only used for queries going back much further in time. We heavily use kapacitor for real time monitoring and run multiple Kapacitor instances that should only receive the subset of metrics for the TICK scripts they run. We have additional internal software that consumes InfluxDB line protocol metrics for various internal uses.
One of the challenges of this model was that the resulting data flows became complex, and it became clear that the missing part of our infrastructure was a ‘router’ for metrics data. We are big fans of the Go Programming Language and an initial prototype for what is now influx-spout appeared as a spare time project from a team member. Since that time, with the help of a team consisting of various colleagues in multiple countries, a Summer intern in Chicago and a remote developer in NZ, influx-spout has become a core part of our metrics infrastructure. We are delighted to announce that we are open sourcing it today, under the Apache License 2.0. We hope that others may find it useful. It can be found at github.com/jumptrading/influx-spout.
High level picture
The overall metrics architecture at Jump looks like this:
Architecture of influx-spout
The architecture consists of Listeners, Filters and Writers. All of these are connected via the NATS message bus. All the components can run as independent services, possibly on separate servers, or as containers managed by systems such as Kubernetes.
From telegraf (or anything writing the InfluxDB line protocol), influx-spout receives metrics via either HTTP or UDP, and a Listener process publishes them to a NATS subject. The influx-spout Listeners are responsible for receiving these metrics, carrying out some initial sanity checking, batching them up and publishing them to another NATS subject for further processing.
One or more Filter processes receive the measurements published by the Listener and process them based on a configuration that allows simple or more complicated (regular expression based) rules to map measurements to more fine grained topics. These measurements are published to specific NATS subjects.
A larger number of ‘Writers’ (one per backend - which could be a InfluxDB process, Kapacitor process, or anything else that needs a stream of metrics) receive measurements from the Filters. Within the Writer, it is possible to optionally apply further granular filtering rules as well as batch data into larger chunks for better performance. The Writer is responsible for submitting each batch of metrics that match the configuration to a backend using HTTP POSTs.
The number one cause of problems and outages during the early days of our InfluxDB experimentation (before influx-spout) was that every so often somebody would produce broken metrics (metrics with enormous cardinality, or many gigabits per second of useless metrics, or metrics with erroneous timestamps such as seconds rather than milliseconds). Controlling a piece of (fairly simple) software in between the applications and the backends has proven hugely powerful and allowed us to effectively eliminate this problem. influx-spout already drops many bad things before they get to our backends, and allows us to ‘whitelist’ the metrics that go to specific “production” instances to keep them safe from the authors of other metrics. At Jump, any metrics not matching the Filter’s configuration are sent to a “junkyard” backend for initial evaluation before being fed into suitable production backends. This has meant that a new application writing a new measurement is unlikely to impact other backends. This in turn makes it safe for developers to introduce and experiment with new measurements without worrying about breaking production.
An additional benefit of this infrastructure is that it is also simple to develop custom components that can access the metrics data stream. An example of this is the influx-spout-tap utility which has proven very useful for debugging data flows (you can find this in the utils/ directory).
Influx-spout has comprehensive instrumentation, and we have a collection of Grafana dashboards and Kapacitor scripts to monitor influx-spout itself. In a future post we will make available some sample dashboards and scripts to help others deploying influx-spout.
Care has been taken to handle large numbers of metrics. At Jump, our primary production influx-spout comfortably receives over a gigabit of metrics per second (>300k measurement lines per second). Depending on operational requirements, some of these metrics are discarded while many others are needed in multiple backends. For our configuration, the volume of metrics written out by influx-spout around 4x what it receives (so our influx-spout is receiving ~1GiB/sec and writing ~4GiB/sec). The flexibility of influx-spout’s architecture allows for this. While handling this load, the various influx-spout processes we run consume less than 10GB of memory (RSS) and use less than 10 server-class CPU cores.
We are confident that we can scale this infrastructure significantly. If we hit the limits of what can be handled by a single host, influx-spout’s architecture allows us to trivially scale out across multiple hosts/containers.
We are excited to see if others find influx-spout useful. We have ideas for follow up posts and work, and would love to hear from anybody in the community using our software. Please get in touch at firstname.lastname@example.org.
Posted by Alex Davies, Production Engineering Lead