Building a Prometheus & Expression Evaluation Service in Go

Alexander Sniffin
Dec 29, 2020 · 5 min read


Prometheus is a popular open-source monitoring tool that provides functionality for displaying, querying, and alerting on time-series data collected from various targets. Typically, it’s used in combination with Grafana for visualizing and building observability dashboards that let you understand the state of your systems at a glance.

One lesser-known feature is that it also provides a pull model that lets you send PromQL queries over HTTP. This model can be used to access both historical data over a time range and current data.

We can combine this model with other tools, such as an expression engine, to build services that solve a number of different problems. In this article, I’ll go over some examples of using each separately and then build a simple expression evaluation service written in Go.

Querying Prometheus

Let’s try the HTTP pull model out by sending a simple query to the API with curl. If you don’t have access to a Prometheus instance, you can set one up easily by following the official Getting Started instructions.

>> curl "$HOST/api/v1/query?query=go_goroutines" | jq .
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "go_goroutines",
          "endpoint": "https-metrics",
          "instance": "...",
          "job": "kubelet",
          "metrics_path": "/metrics",
          "namespace": "kube-system",
          "node": "...",
          "service": "example"
        },
        "value": [
          1608777052.769,
          "291"
        ]
      }
    ]
  }
}

As of the v1 API, the response contains a status along with a result type and an array of results. For this particular query, the result is an instant vector with a sample from every target that emits the go_goroutines metric, one of the default metrics in the official Go client library. For the service named example, it returned a value of 291 at the epoch time 1608777052.

Experimenting With Code

We can expand on the curl example and write some code that takes an expression and dynamically evaluates it against the result of the query response. For this, we can use a Go library called expr. The advantage of dynamic expression evaluation is that you can pass your business logic in at runtime, usually through configuration. This is similar to how Prometheus’s alerting rules work with Alertmanager.

To start, here’s a simple example of using expr to compile and execute an expression that divides two variables.
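A minimal sketch along those lines, using github.com/antonmedv/expr (the variable names a and b are illustrative):

package main

import (
	"fmt"

	"github.com/antonmedv/expr"
)

func main() {
	env := map[string]interface{}{
		"a": 10.0,
		"b": 4.0,
	}

	// Compile the expression once; it can then be run many times
	// against different environments.
	program, err := expr.Compile("a / b", expr.Env(env))
	if err != nil {
		panic(err)
	}

	output, err := expr.Run(program, env)
	if err != nil {
		panic(err)
	}

	fmt.Println(output) // 2.5
}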

Expr also supports boolean logic, custom or built-in functions, custom operators, and more. Now let’s take a look at the Prometheus Go client (the same library mentioned before) and make a query request.
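A minimal sketch using the v1 API client from github.com/prometheus/client_golang (the Prometheus address here is a placeholder):

package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
	if err != nil {
		panic(err)
	}

	promAPI := v1.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Run an instant query at the current time.
	result, warnings, err := promAPI.Query(ctx, "go_goroutines", time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}

	fmt.Println(result)
}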

Making queries against Prometheus is fairly straightforward. You might notice how easy it would be to combine the two and build an application that lets us pass in rules to evaluate.

Let’s do that by checking for when a service exceeds a certain number of HTTP requests per second within a timeframe. For this example we will use a custom metric called http_request_duration_seconds, a histogram that also provides a count of requests, http_request_duration_seconds_count.

For this example, the full PromQL query needs to be an aggregated rate averaged over a timeframe: we will check the average rate over 5 minutes, specify the service in the label selector, and sum by the service.

sum(rate(http_request_duration_seconds_count{service="example"}[5m])) by (service)

For the expression, we will specify the variable name as query_result and check whether the requests per second average greater than or equal to 100.

query_result >= 100

Now we can put together an example procedurally in a standalone program:
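A rough end-to-end sketch, assuming the query and expression above and a Prometheus instance at localhost:9090:

package main

import (
	"context"
	"fmt"
	"time"

	"github.com/antonmedv/expr"
	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
	"github.com/prometheus/common/model"
)

const (
	query      = `sum(rate(http_request_duration_seconds_count{service="example"}[5m])) by (service)`
	expression = `query_result >= 100`
)

func main() {
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := v1.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Query Prometheus for the service's current requests per second.
	result, _, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		panic(err)
	}

	vector, ok := result.(model.Vector)
	if !ok || vector.Len() == 0 {
		panic("expected a non-empty instant vector")
	}

	// Feed the first sample's value into the expression environment.
	env := map[string]interface{}{
		"query_result": float64(vector[0].Value),
	}

	program, err := expr.Compile(expression, expr.Env(env))
	if err != nil {
		panic(err)
	}

	output, err := expr.Run(program, env)
	if err != nil {
		panic(err)
	}

	fmt.Println(output) // true or false
}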

Awesome, it will print either true or false depending on whether our requests per second are at or above 100.

Writing the Service

Let’s continue by building this into a service that sends messages to a Slack channel. It’ll feature a worker pool that runs a set of configured rules: each rule queries Prometheus, evaluates an expression, templates the result into a string, and finally sends the message to a Slack channel using an incoming webhook.

Here’s a list of all of the libraries which I’ll use to build this:

Note: As this is an example, I won’t go into all the details of the design, and it isn’t written to production quality.

To start, the structure of the service follows the project-layout conventions and includes only the necessary components to run the project locally.

This consists of cmd, configs, and internal directories, plus the go.mod and go.sum files. The cmd directory contains the entry point of the application, which handles the creation of the service and its shutdown. The internal/example/server/server.go file sets up the dependencies and wires the application together, and internal/processes/evaluator/evaluator.go handles running the worker pool with the configured rule set.
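Roughly, the layout looks like this (the exact file names under cmd and configs are assumptions based on the description above):

cmd/
  example/
    main.go
configs/
  config.yaml
internal/
  example/
    server/
      server.go
  processes/
    evaluator/
      evaluator.go
go.mod
go.sum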

It’s completely configuration-driven, and new rules can be added without recompiling. I created two example rules: one gets the average GC time over the last 30 minutes and compares it to the last day for a service, and the other gets the total memory a service is currently using.
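To give an idea of what the evaluator does for each rule, here’s a simplified sketch of a single evaluation pass. The Rule fields, template data, and webhook payload shape are illustrative, not the exact types from the repository:

package evaluator

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"text/template"
	"time"

	"github.com/antonmedv/expr"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
	"github.com/prometheus/common/model"
)

// Rule is an illustrative shape for a configured rule.
type Rule struct {
	Name       string
	Query      string
	Expression string
	Template   string
}

// evaluate runs one rule: query Prometheus, evaluate the expression,
// and, if it passes, template a message and post it to a Slack webhook.
func evaluate(ctx context.Context, promAPI v1.API, webhookURL string, rule Rule) error {
	result, _, err := promAPI.Query(ctx, rule.Query, time.Now())
	if err != nil {
		return err
	}

	vector, ok := result.(model.Vector)
	if !ok || vector.Len() == 0 {
		return fmt.Errorf("rule %s: expected a non-empty instant vector", rule.Name)
	}

	env := map[string]interface{}{"query_result": float64(vector[0].Value)}
	passed, err := expr.Eval(rule.Expression, env)
	if err != nil {
		return err
	}
	if ok, _ := passed.(bool); !ok {
		return nil // expression evaluated to false, nothing to report
	}

	// Render the message template with the query result.
	tmpl, err := template.New(rule.Name).Parse(rule.Template)
	if err != nil {
		return err
	}
	var msg bytes.Buffer
	if err := tmpl.Execute(&msg, env); err != nil {
		return err
	}

	// Slack incoming webhooks accept a simple JSON payload with a "text" field.
	payload, _ := json.Marshal(map[string]string{"text": msg.String()})
	resp, err := http.Post(webhookURL, "application/json", bytes.NewReader(payload))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	return nil
}

The worker pool simply runs this for every configured rule on its interval.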

The config is in YAML; I’ve just omitted the hosts for Slack and Prometheus:

Then finally, here’s a sample run with the output in Slack!

That’s all there is to it; the full source code can be found on my GitHub.

Summary

There are plenty of use cases for this beyond a Slack bot; I recently used one to automate database scaling for a managed product that didn’t support autoscaling.

Prometheus is a very versatile tool in the observability stack and can also be used to bring new optimizations or insights to your applications and systems. Thanks for reading!
