Microservice Usage at Honeycomb

I recently joined Honeycomb, an observability startup where we are working to help generalist software engineers solve problems with their apps in an increasingly distributed world. The architecture we use to deliver our product can be described as microservices but, interestingly, adopting microservices was not a deliberate design choice. It was instead the result of several factors that are becoming increasingly common in today’s world:

The desire of software engineers to be involved with infrastructure and vice versa
The desire for increased visibility and debuggability (monoliths make it hard to pinpoint issues)
The maturity of DevOps tooling, which can now take on some of the operational burden

In this article we’ll take a peek behind the scenes at the systems that power Honeycomb. I’ll focus on what our implementation looks like here, but for a general summary of microservices, feel free to peruse the Microservices Journal overview.

Why Microservices?

Some teams start with a “monolith” and split that into separate services over time, but due to a variety of incidental factors the architecture at Honeycomb leaned toward microservices from day one. For instance, the team was made up of strong operational experts, so the infrastructure challenge of running microservices was less intimidating in our case. Having experienced the pitfalls of monoliths in previous roles, the team was strongly inclined toward splitting programs out by functionality from the beginning. This allowed us to avoid conflating things such as traditional SaaS app logic and the unique data layer of Honeycomb.

Before we dive into the architecture details, here is a quick summary of the basic functionality of Honeycomb:

Honeycomb ingests “events,” structured as JSON, which describe some happening worth tracking (e.g., a user made an HTTP request to our app).
Honeycomb stores these events on the backend for later querying.
Users can execute queries to gain insight into their (usually production) systems by accessing our web app.
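To make the idea of an event concrete, here is a minimal sketch of what one might look like when built and serialized as JSON. The field names (method, path, status, duration_ms) are made up for illustration and are not Honeycomb's actual schema.

```python
import json
from datetime import datetime, timezone


def build_event(method, path, status, duration_ms):
    """Describe one HTTP request as a flat, structured event.

    All field names here are hypothetical examples.
    """
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "method": method,
        "path": path,
        "status": status,
        "duration_ms": duration_ms,
    }


event = build_event("GET", "/home", 200, 12.5)
payload = json.dumps(event)  # the JSON string that would be sent for ingestion
```

Each event is simply a bag of key/value pairs, which is what makes it easy to group and filter on any field later.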

This type of app has several pieces, and it is not obvious at first glance whether they should be rolled into one program or split into several. Splitting the functions out has operational challenges, but can provide a lot of benefits for things such as scaling and debugging. Because Honeycomb’s founders and early employees were a world-class ops team who cut their teeth running systems at Linden Lab, Parse, Facebook and elsewhere, the operational challenge of running these as separate components in a distributed system was not as intimidating as it might be to more junior admins.

Another important consideration in splitting these out was that, to handle the large influx of event traffic we expected, we were driven to design a system that leans heavily on a pivotal piece of third-party technology: Apache Kafka.

Kafka is a tool for publishing and receiving “messages” between instances of programs more safely than if they talked to each other directly. For instance, consider two programs running on separate computers that wish to communicate: if one program sends a request and fails to reach the other computer, the receiving program has no way to know whether a message failed to arrive or no message was sent in the first place. This can cause all sorts of problems in real production systems. An in-between “broker” program can help fix this by encoding logic to handle failures and ensure message delivery.
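As a rough illustration of why an acknowledging middleman helps, the sketch below uses a toy in-memory broker. It is not Kafka's real API: ToyBroker, send_with_retry and flaky are hypothetical names, and the network failure is simulated. The point is only that the sender keeps trying until the broker confirms it has the message.

```python
class ToyBroker:
    """In-memory stand-in for a message broker (not Kafka's real API)."""

    def __init__(self):
        self.log = []

    def publish(self, message):
        self.log.append(message)
        return len(self.log) - 1  # acknowledgement: the message's position


def send_with_retry(publish, message, attempts=3):
    """Retry until the broker acknowledges the message or attempts run out."""
    for attempt in range(1, attempts + 1):
        try:
            return publish(message)
        except ConnectionError:
            if attempt == attempts:
                raise


def flaky(publish, failures):
    """Wrap publish so its first `failures` calls fail before reaching the broker."""
    remaining = {"count": failures}

    def wrapper(message):
        if remaining["count"] > 0:
            remaining["count"] -= 1
            raise ConnectionError("broker unreachable")
        return publish(message)

    return wrapper


broker = ToyBroker()
offset = send_with_retry(flaky(broker.publish, failures=2), {"status": 200})
```

Here the first two attempts fail and the third is acknowledged, so the sender knows the message landed. A real system also has to cope with retries that produce duplicates, which this sketch ignores.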

With Kafka as a middleman, it became much easier to separate “sending messages” (ingesting events) and “receiving messages” (storing them for later querying) into distinct components. Let’s take a look at the specific services Honeycomb uses to get a feel for that.

A Peek Into Honeycomb

Honeycomb used to be a company called Hound, so all of Honeycomb’s microservices are named with a dog theme. (Curiously, our actual office dog is named after a famous movie dinosaur named after yet another animal: “Ducky.”) Most of them are written in Go, but there is at least one written in Node.js because it was a better fit. This is a noteworthy advantage of microservices: you can use the right tool for the right job more easily, albeit at the cost of additional operational complexity.

As mentioned above, there is a service that receives events (JSON payloads) in the first place. This service is called Shepherd. Shepherd sits behind the endpoint api.honeycomb.io and serves a variety of functions that are specific to fronting an API. For instance, rate limiting is handled by Shepherd so that a client sending us a huge burst of events does not cause problems for other customers. Shepherd also handles business logic around things such as event schema (e.g., updating your dataset’s metadata if a new field has been received) and hands the received events off to Kafka for storage.
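The two responsibilities described above can be sketched in a few lines. This is an illustration under assumptions, not Shepherd's actual code: a token bucket is one common way to rate limit, chosen here as an example, and the names TokenBucket and new_fields are hypothetical.

```python
class TokenBucket:
    """Token-bucket rate limiter; one bucket per customer is assumed."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = now

    def allow(self, now):
        """Return True if one event may be accepted at time `now` (seconds)."""
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def new_fields(known_fields, event):
    """Record any fields not yet in the dataset's schema and return them."""
    added = set(event) - known_fields
    known_fields |= added  # updates the caller's set in place
    return added


bucket = TokenBucket(rate=1, capacity=2)
decisions = [bucket.allow(0.0), bucket.allow(0.0), bucket.allow(0.0), bucket.allow(1.0)]

schema = {"status"}
added = new_fields(schema, {"status": 200, "duration_ms": 12.5})
```

The bucket allows a burst of two events, rejects the third, and accepts again once a second has passed. The schema helper reports duration_ms as a newly seen field.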

When events hit Kafka, they are received by a process called Retriever, which is our persistent storage engine. Kafka lends itself well to being an intermediary in this split-up use case because even if Retriever is offline for a while (due to a deploy, panic, etc.), events simply accumulate in the backlog and remain available for later processing. This provides a lovely amount of resiliency to the Honeycomb service.
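The resiliency described here comes from the consumer tracking its own position in an append-only log. The following is a minimal in-memory sketch of that idea, with hypothetical class names; it does not use Kafka's real client API.

```python
class EventLog:
    """Append-only log, standing in for a single Kafka partition."""

    def __init__(self):
        self.entries = []

    def append(self, event):
        self.entries.append(event)

    def read_from(self, offset):
        return self.entries[offset:]


class Consumer:
    """Reads from the log and remembers how far it has gotten."""

    def __init__(self, log):
        self.log = log
        self.offset = 0
        self.stored = []

    def poll(self):
        """Store everything written since the last poll; return the batch size."""
        batch = self.log.read_from(self.offset)
        self.stored.extend(batch)
        self.offset += len(batch)
        return len(batch)


log = EventLog()
consumer = Consumer(log)

log.append({"id": 1})
consumer.poll()

# The consumer is "offline" (say, mid-deploy) while writes keep arriving.
log.append({"id": 2})
log.append({"id": 3})

caught_up = consumer.poll()  # back online: picks up the backlog
```

Because the writer never waits on the reader, events written during the outage are simply picked up on the next poll.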

When Retriever receives events, it writes them to disk in the proper location for later querying. Retriever serves as both the writer and reader of persisted event information. Thus, when we want to run a query over a user’s events, our front-end web application called Poodle (which is what backs ui.honeycomb.io) calls Retriever directly. Retriever’s stateful and distributed nature makes it more likely to be fickle and troublesome to debug than some of the other pieces, so having it split into its own service is extremely useful for debugging and maintenance. Since Honeycomb deploys multiple times per day, it can become tricky to identify the source of issues (although it’s worth noting that we also have our own separate “dog food” Honeycomb cluster to debug Honeycomb with Honeycomb!).

If everything were mashed together into one binary, identifying performance issues (such as processes dying due to being out of memory) would be a nightmare. Any piece of the app might have an effect on any other piece of the app. All hope of doing things such as using an appropriate AWS instance type (for instance, one with faster disks) depending on the service would be dashed. But by having our architecture split into several component pieces, we can address these needs and more.

We also have at least two more services that slot in nicely. One is called Doodle, written in Node.js, which pre-renders queries as flattened images for posting in Slack. Another is called Basset and works to supervise Triggers, which tip users off when the value returned by a query of interest changes.
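The Trigger idea can be sketched as a small loop that re-runs a query and notifies when the result differs from last time. This is a hypothetical illustration, not Basset's implementation: the Trigger class, run_query and notify are made-up names.

```python
class Trigger:
    """Re-runs a query and notifies when its result changes."""

    def __init__(self, run_query, notify):
        self.run_query = run_query  # callable returning the current value
        self.notify = notify        # callable taking (old_value, new_value)
        self.last_value = None
        self.primed = False         # False until the first result is seen

    def check(self):
        """Run the query once; return True if a notification was sent."""
        value = self.run_query()
        changed = self.primed and value != self.last_value
        if changed:
            self.notify(self.last_value, value)
        self.last_value = value
        self.primed = True
        return changed


results = iter([5, 5, 9])  # simulated query results over three checks
alerts = []
trigger = Trigger(lambda: next(results), lambda old, new: alerts.append((old, new)))
fired = [trigger.check() for _ in range(3)]
```

The first check only records a baseline, the second sees no change, and the third notifies because the value moved from 5 to 9.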

How Honeycomb Uses Honeycomb

One interesting aspect of Honeycomb’s architecture is that it happens to be exactly the type of system that benefits strongly from being observed and debugged using the user-facing Honeycomb product! However, we cannot send events from our production website to itself, since this could potentially result in nasty feedback loops and unsolvable problems. Therefore, we created another “dog food” (as in “eat your own dog food”) cluster that we use to maintain the health of our production cluster.

For instance, in one case we were alerted that an end-to-end check we use to ensure the quality of our site was failing. Events which were being written to Shepherd could not successfully be read back later by Retriever/Poodle. Using a Honeycomb Break Down (similar to a SQL GROUP BY) in our dogfood cluster, we were able to group the events we attempted to write based on the data partition they were meant to be written to and by whether or not they succeeded (shown below).

This allowed us to quickly identify that one particular data partition (of which there are many) was the one having issues, without having to dig through a mountain of logs or stare bleary-eyed at a pile of pre-aggregated metrics wishing that we had access to more information. From this discovery we were able to compare other metrics such as API latency and write latency, and deduced that the problem seemed to be within Kafka itself rather than in our systems surrounding it. Our dogfood cluster lets us bring the benefits of Honeycomb, a tool for debugging distributed systems, to bear on Honeycomb itself, a complex application with many moving parts and microservices.
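The grouping described above can be approximated in plain code as a count of events per combination of fields, much like a SQL GROUP BY. The event data and field names below (partition, success) are invented for illustration and are not the real incident data.

```python
from collections import Counter


def break_down(events, *fields):
    """Count events for each combination of the given fields."""
    return Counter(tuple(event[field] for field in fields) for event in events)


# Hypothetical write attempts, tagged with the partition they targeted.
events = [
    {"partition": 1, "success": True},
    {"partition": 1, "success": True},
    {"partition": 2, "success": False},
    {"partition": 2, "success": False},
    {"partition": 3, "success": True},
]

counts = break_down(events, "partition", "success")
failing = sorted({partition for (partition, ok) in counts if not ok})
```

With the data grouped this way, the partition with failed writes stands out immediately, which is the same effect the Break Down gave us.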

If you are interested in more details of our dogfood cluster, we have a series of blog posts explaining our setup.

Conclusion

We hope this article gave you a peek at what a microservices world looks like. There are, of course, tradeoffs with any architecture, but we have been quite satisfied with our choices so far.

Please reach out to us at support@honeycomb.io if you have any questions! Honeycomb is a tool designed to help with debugging in a distributed world, so we’re happy to help. We also have a free trial available if you are interested in signing up.

— Nathan LeClaire