Anton Lindstrom

Monitoring Mesos tasks with Prometheus


I run an infrastructure for my own hobby projects and tests. The environment currently runs on DigitalOcean, with Mesos and Mesosphere Marathon on top running all applications. I needed simple yet powerful monitoring for that environment, without too much setup, maintenance and configuration. I wanted to know both when applications failed and what the load and performance of the system were.

My previous monitoring setup was made up of a few different systems: I used Uptime Robot to collect external alerts, and InfluxDB with a custom dashboard and metrics collector to look at performance graphs. That setup was unfortunately not good enough. It got complex, its automatic deployment was not configured properly, and it did not get enough attention to work as well as it could have.

Prometheus crystal ball

Enter the crystal ball: Prometheus. It has gotten some attention lately and I decided to try it out. The easiest way to install it is via Docker, as described on their installation page. I hooked that up to my deployment system and started looking at it. I was met by a simple but powerful configuration, query language and user interface. However, once it was running I needed a way to get some useful data into it.

I knew about the /monitor/statistics.json endpoint on Mesos slaves and had used it successfully a few times to find out the CPU or memory usage of running tasks. The endpoint would be a good place to fetch information from with a simple program that does a GET request, parses the data and displays it in the format Prometheus wants.

Getting Mesos data into the system was as easy as building a small Go app that uses the client provided by Prometheus and serves HTTP. The Mesos exporter can be found on my GitHub, and any feedback to make it better is welcome. The installation instructions are available in the README.
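To give an idea of the shape of such a program, here is a minimal sketch that uses only the Go standard library rather than the Prometheus client. It is not the code of the actual exporter. The JSON field names (source, cpus_user_time_secs, mem_rss_bytes), the use of source as the app label, the mesos_task_memory_rss_bytes metric name and port 9105 are all assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
	"strconv"
	"strings"
)

// taskStats holds the fields this sketch reads from one entry of the
// slave's statistics JSON. The field names are assumptions.
type taskStats struct {
	Source     string `json:"source"`
	Statistics struct {
		CPUsUserTimeSecs float64 `json:"cpus_user_time_secs"`
		MemRSSBytes      float64 `json:"mem_rss_bytes"`
	} `json:"statistics"`
}

// sample is made-up data so the sketch can be tried without a cluster.
const sample = `[{"source":"web.1","statistics":{"cpus_user_time_secs":12.5,"mem_rss_bytes":1048576}}]`

// parseStats decodes the JSON array returned by the statistics endpoint.
func parseStats(body []byte) ([]taskStats, error) {
	var tasks []taskStats
	if err := json.Unmarshal(body, &tasks); err != nil {
		return nil, err
	}
	return tasks, nil
}

// formatValue prints a float without exponent notation.
func formatValue(v float64) string {
	return strconv.FormatFloat(v, 'f', -1, 64)
}

// render writes the tasks in the Prometheus text exposition format.
func render(tasks []taskStats) string {
	var b strings.Builder
	b.WriteString("# TYPE mesos_task_cpus_user_time_secs counter\n")
	for _, t := range tasks {
		fmt.Fprintf(&b, "mesos_task_cpus_user_time_secs{app=%q} %s\n",
			t.Source, formatValue(t.Statistics.CPUsUserTimeSecs))
	}
	b.WriteString("# TYPE mesos_task_memory_rss_bytes gauge\n")
	for _, t := range tasks {
		fmt.Fprintf(&b, "mesos_task_memory_rss_bytes{app=%q} %s\n",
			t.Source, formatValue(t.Statistics.MemRSSBytes))
	}
	return b.String()
}

// metricsHandler fetches the slave statistics on every scrape and
// serves them as Prometheus metrics.
func metricsHandler(statsURL string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		resp, err := http.Get(statsURL)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadGateway)
			return
		}
		defer resp.Body.Close()
		body, err := io.ReadAll(resp.Body)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadGateway)
			return
		}
		tasks, err := parseStats(body)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadGateway)
			return
		}
		w.Header().Set("Content-Type", "text/plain; version=0.0.4")
		io.WriteString(w, render(tasks))
	}
}

func main() {
	if len(os.Args) < 2 {
		// No slave URL given: render the built-in sample and exit.
		tasks, err := parseStats([]byte(sample))
		if err != nil {
			log.Fatal(err)
		}
		fmt.Print(render(tasks))
		return
	}
	// Otherwise serve metrics for the slave URL given as first argument.
	http.Handle("/metrics", metricsHandler(os.Args[1]))
	log.Fatal(http.ListenAndServe(":9105", nil))
}
```

Run without arguments it prints the metrics for the sample data. Run with the full URL of a slave's statistics endpoint as the first argument, it serves them on /metrics instead.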

After I had built the exporter, the Prometheus configuration was just a few lines to add the exporter endpoint, and graphs started showing.
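As an illustration of how little is needed, a scrape configuration could look roughly like the fragment below. This is a sketch that assumes a Prometheus release using YAML configuration and an exporter listening on localhost:9105. The job name and the target address are placeholders, and the exact format depends on the Prometheus version.

```yaml
# Sketch only: job name and target are assumptions, not my actual setup.
scrape_configs:
  - job_name: mesos
    static_configs:
      - targets: ['localhost:9105']
```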

My relationship with monitoring systems has always been a bit on the complicated side. I love Sensu when I have machines running entire stacks, but with Docker it is not a perfect fit. Even when I use Puppet, Sensu requires a lot of infrastructure to be set up, and in a small environment with high velocity but not many resources that is far from ideal.

Prometheus sits in the perfect spot, where its simplicity and the extensibility of exporters make it easy to run. That it also comes packaged in a Docker container makes it even more so.

Alerting in Prometheus is done on the collected data, and the alerts are then sent off to another system, Alertmanager. Simple and very powerful.

So far Prometheus has been everything I wanted and provides a really good way to monitor my infrastructure and alert on any anomalies in it. Once you start learning the query language, it is amazing how intuitive it feels.

For example, a query to find the top 3 CPU consumers in Mesos looks like this:

topk(3, sum(rate(mesos_task_cpus_user_time_secs[5m])) by (app))
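Read from the inside out, the query takes the per-second rate of each task's user CPU time over the last five minutes, sums those rates per app, and keeps the three largest sums. The Go sketch below walks through the same three steps on made-up numbers. It is only an approximation for illustration: the real rate() also handles counter resets and extrapolation, and the types and sample values here are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// series is one task's counter as seen in a window: the first and last
// cumulative CPU seconds, and the seconds between those two readings.
type series struct {
	App     string
	First   float64
	Last    float64
	Seconds float64
}

// rate approximates rate(): the per-second increase over the window.
func rate(s series) float64 {
	if s.Seconds <= 0 {
		return 0
	}
	return (s.Last - s.First) / s.Seconds
}

// sumByApp mirrors sum(...) by (app): add up the rates per app label.
func sumByApp(in []series) map[string]float64 {
	out := make(map[string]float64)
	for _, s := range in {
		out[s.App] += rate(s)
	}
	return out
}

type appRate struct {
	App  string
	Rate float64
}

// topK mirrors topk(k, ...): keep the k apps with the highest summed rate.
func topK(k int, sums map[string]float64) []appRate {
	all := make([]appRate, 0, len(sums))
	for app, r := range sums {
		all = append(all, appRate{App: app, Rate: r})
	}
	sort.Slice(all, func(i, j int) bool {
		if all[i].Rate != all[j].Rate {
			return all[i].Rate > all[j].Rate
		}
		return all[i].App < all[j].App // stable order for ties
	})
	if k < 0 {
		k = 0
	}
	if k > len(all) {
		k = len(all)
	}
	return all[:k]
}

func main() {
	// Made-up readings over a 300 second (5m) window.
	data := []series{
		{App: "web", First: 100, Last: 250, Seconds: 300},
		{App: "web", First: 40, Last: 100, Seconds: 300},
		{App: "db", First: 10, Last: 100, Seconds: 300},
		{App: "cron", First: 5, Last: 8, Seconds: 300},
		{App: "cache", First: 0, Last: 30, Seconds: 300},
	}
	for _, r := range topK(3, sumByApp(data)) {
		fmt.Printf("%s %.2f\n", r.App, r.Rate)
	}
}
```

With these numbers the two web tasks add up to 0.70 CPU seconds per second, so the program prints web, db and cache in that order, and cron is dropped.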

TLDR: Mesos and Prometheus are great, here's a mesos_exporter to get Mesos task statistics into Prometheus.

Note: Image is probably copyrighted and owned by 20th Century Fox.