Metrics and Monitoring at Kloudless

This post was written by our engineering intern, Matthew Soh.

Metrics and monitoring are important for any service. They provide critical information needed to detect and respond to incidents and issues. Kloudless deals with millions of requests, and keeping track of everything can be difficult. Kloudless has recently integrated a new metrics system to tackle this challenge.

Metrics from Kloudless can be sent to the new analytics platform for collection, analysis, and alerting. The analytics integration is available to all operators of Kloudless systems: both our own DevOps team administering our cloud version as well as operators of self-hosted Kloudless Enterprise appliances.

Usage

Let’s walk through a simple use case. Suppose your application uses the Kloudless Storage API and is failing to store files uploaded to your service. The issue could be anywhere in the stack. With the new metrics system, dashboards that display request metrics are readily available, so you can quickly isolate the issue based on these metrics.

Figure: Chronograf dashboard with graphs of Kloudless metrics.

Here we have a simple dashboard. Core health checks to the Kloudless appliance appear to be fine. However, there is a sharp spike in the graph of Request Failures! This graph shows failures for outbound requests to upstream services. Hovering over the graph, we can look at the tags on each data series and see that there’s been a large increase in 500 errors to Box. This suggests that the issue is likely with the upstream service rather than the Kloudless API or the appliance itself. Knowing this allows us to narrow down which logs we need to look at to learn more about the error and take further steps to assess the root cause.

Request status is just the tip of the iceberg when it comes to metrics provided by the Kloudless appliance. For a more detailed reference of available metrics, please refer to the Kloudless Enterprise Configuration guide.

How it works

The dashboard used above is built with Chronograf. It is one part of the metrics system deployed at Kloudless, which uses the TICK stack by InfluxData. TICK comprises Telegraf (collector), InfluxDB (datastore), Chronograf (visualization), and Kapacitor (monitoring).

Figure: The TICK Stack. © 2017 InfluxData, Inc.

The metrics processing chain begins with Telegraf, the metrics collection daemon. Telegraf is designed to aggregate data from different sources and send it to various datastores. Sources include sysstat (a system information tool) and statsd (a common metrics daemon). The default datastore is InfluxDB, though others such as Graphite and CloudWatch can also be used. If required, the Telegraf output in the Kloudless appliance can be modified so that existing metrics collection or storage infrastructure is used instead of InfluxDB.
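
As a rough sketch of how an application might feed Telegraf’s statsd input, the Python below formats a counter with InfluxDB-style tags appended to the metric name, a convention Telegraf’s statsd input can parse. The metric name, the tag names, and the address 127.0.0.1:8125 are assumptions for illustration, not values taken from the Kloudless appliance.

```python
import socket


def format_statsd_counter(name, value=1, tags=None):
    """Build a statsd counter line with optional InfluxDB-style tags.

    Assumes names and tag values contain no commas, colons, or spaces.
    """
    tag_part = "".join(f",{k}={v}" for k, v in sorted((tags or {}).items()))
    return f"{name}{tag_part}:{value}|c"


def send_metric(line, host="127.0.0.1", port=8125):
    """Send one statsd line over UDP (host and port are assumptions)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("utf-8"), (host, port))


# Hypothetical metric: one failed outbound request to an upstream service.
line = format_statsd_counter(
    "upstream_request_failures",
    tags={"service": "box", "status": "500"},
)
print(line)  # upstream_request_failures,service=box,status=500:1|c
```

Calling send_metric(line) would then hand the datagram to whichever statsd listener is running at that address.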

The metrics then proceed to InfluxDB, which is designed for storing time-series data. This means that it has some neat features such as simple data retention policies and continuous queries. Retention policies set a time limit on stored metrics so that old data expires automatically. Continuous queries run at regular intervals on InfluxDB to summarise detailed data into broad overviews. For example, summations over counts of API requests are used to build daily summaries. Together, these features allow InfluxDB resource usage to be managed effectively.
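
To make the two ideas concrete, here is a plain-Python illustration of what a daily rollup and a retention window do to a series of counts. It is only a model of the behaviour, not InfluxDB’s implementation, and the sample data points are made up.

```python
from collections import defaultdict
from datetime import datetime, timezone


def daily_totals(points):
    """Sum (unix_timestamp, count) pairs into totals keyed by UTC date."""
    totals = defaultdict(int)
    for ts, count in points:
        day = datetime.fromtimestamp(ts, tz=timezone.utc).date().isoformat()
        totals[day] += count
    return dict(totals)


def expire_old(points, now, max_age_seconds):
    """Keep only the points that fall inside the retention window."""
    cutoff = now - max_age_seconds
    return [(ts, count) for ts, count in points if ts >= cutoff]


# Hypothetical API request counts; timestamps are seconds since the epoch.
points = [(0, 5), (60, 7), (86400, 3), (86460, 4)]

print(daily_totals(points))
# {'1970-01-01': 12, '1970-01-02': 7}
print(expire_old(points, now=86500, max_age_seconds=200))
# [(86400, 3), (86460, 4)]
```

The detailed points can be expired quickly while the much smaller daily totals are kept for longer, which is how the two features keep storage in check.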

Once the data is stored, Chronograf and Kapacitor work in tandem to help make sense of the collected metrics. Chronograf enables dashboard visualizations like the one described above to be built, while Kapacitor provides automatic monitoring so that there is no need to stare at the dashboard all day. Kapacitor’s monitoring and alerts are managed through TICKscripts, which can be configured using either the Kapacitor command line client or Chronograf. We have provided sample TICKscripts which cover some common use cases.

Why we moved to the TICK stack

As the Kloudless Platform has grown, the volume and complexity of metrics data have grown with it. Previously, we were using StatsD as a collector and Graphite to visualize metrics. This simple solution was easy to work with, but our needs have changed. Here are some of the unsupported scenarios we encountered:

  • StatsD doesn’t support tags. Tags are useful for providing context for the measurement, such as status or type of a request. Our workaround was to append tags onto the measurement name, similar to the tags used by DataDog’s DogStatsD. The downside of this approach was messy measurement names that were difficult to query.
  • Each StatsD measurement only has one value. It is sometimes useful to have a tuple of data grouped together in a single metric, or associate data such as application IDs to a metric. This isn’t possible with the StatsD+Graphite solution either.
  • The language used to query Graphite is limited to nesting of functions, and performing aggregations can be slow since the metrics are stored in flat files. This results in slow queries across multiple metrics and prevents easy use of complex operations.

These factors led us to look for a better alternative. InfluxDB seemed promising as its design was tailored for high volume time-series data and metrics collection. InfluxDB is built around measurements, which are in turn made up of many data points. A data point can have one or more values associated with it, and as many tags as needed. This addressed our first two issues right away and allowed for better querying.
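
The data model described above maps onto InfluxDB’s line protocol, where one line carries a measurement name, a set of tags, one or more fields, and a timestamp. The sketch below builds such a line in Python. Which keys are tags and which are fields, along with the sample path and application ID, are assumptions made for illustration, and the helper skips the escaping of special characters in names and tag values.

```python
def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Format one data point as an InfluxDB line protocol string.

    Assumes measurement, tag keys, tag values, and field keys contain no
    spaces, commas, or equals signs (no escaping is attempted for them).
    """

    def fmt_field(value):
        if isinstance(value, bool):  # check bool before int
            return "true" if value else "false"
        if isinstance(value, int):
            return f"{value}i"  # integers carry an "i" suffix
        if isinstance(value, float):
            return repr(value)
        text = str(value).replace("\\", "\\\\").replace('"', '\\"')
        return f'"{text}"'  # strings are double-quoted

    tag_part = "".join(f",{k}={v}" for k, v in sorted(tags.items()))
    field_part = ",".join(f"{k}={fmt_field(v)}" for k, v in fields.items())
    return f"{measurement}{tag_part} {field_part} {timestamp_ns}"


# Hypothetical point: one tag and two fields on a single measurement.
point = to_line_protocol(
    "request_metrics_api_requests",
    tags={"status": "500"},
    fields={"path": "/v1/files", "application_id": 42},
    timestamp_ns=1500000000000000000,
)
print(point)
```

A single point carries several values at once, along with tags that can be filtered on directly, rather than folding that context into the measurement name.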

For example, let’s try to determine which Kloudless applications were associated with failed API requests in the past day. We can do this with the following query:

select status, path, application_id from request_metrics_api_requests 
where time > now() - 24h and status=~/[^2][0-9]{2}/

In just one query, we’ve selected multiple values (status, path, application_id), filtered by tag values (status is not a 2XX code) and with a time frame limit (last 24 hours). This would have been messier and more tedious with our previous metrics system which did not allow associated metadata to be recorded. Additionally, InfluxDB’s time-series database design allows for tag-based queries to execute more efficiently since all tags are indexed.
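
To see which status codes the regular expression in that query lets through, the following Python sketch applies the same pattern to a few sample values. It assumes that status is stored as a three-digit string and that an unanchored search in Python behaves like the query’s =~ match for this pattern.

```python
import re

# Same pattern as in the query: a character other than "2", then two digits.
NON_2XX = re.compile(r"[^2][0-9]{2}")


def is_failure(status):
    """Return True when a three-digit status string is not a 2XX code."""
    return NON_2XX.search(status) is not None


statuses = ["200", "201", "302", "404", "500"]
print([s for s in statuses if is_failure(s)])
# ['302', '404', '500']
```

Note that the pattern treats every non-2XX code as a failure, so 3XX redirects are included alongside 4XX and 5XX errors.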

Next Steps

We’ve been using the new metrics system at Kloudless to better understand, measure, and monitor the performance of our hosted cloud platform. We’ve found that it has saved us time and effort when triaging issues, and we think you’ll find it useful as well. We’re excited to make this metrics system available to our enterprise customers using the Kloudless appliance. Customers using our cloud platform will also see analytics data for their Kloudless applications exposed in the developer portal in the coming months. As always, we would love to hear your thoughts and feedback on Twitter, in the comments below, or at hello@kloudless.com.