Observability has become a buzzword, and the API world is slowly catching up. But be careful when looking for content about API observability: there are many outdated best practices and irrelevant pieces of content floating around.
I was recently invited as a guest to Stephen Townshend’s SlightReliability podcast, and his brilliant blog post on “Bad Observability” inspired me to start ranting about the topic.
Here is a follow-up to our discussion: the observability anti-patterns to avoid when working with APIs. Thanks, Stephen, for the inspiration!
Anti-Pattern: You Forgot Your Users
It’s easy to lose sight of the fact that APIs are built for people to use. APIs are not just technical interfaces; they are bridges that connect digital experiences and enable people to accomplish tasks efficiently.
There are two different kinds of users for your API:
- The developers who are integrating your API into their system.
- Your customers’ users, who will ultimately interact with the applications or services that rely on your API.
If you can’t understand how they interact with your APIs, you’re missing half the picture.
When developers integrate with your APIs, how long does it take them to make their first successful API call? What are typical errors slowing them down? Observability plays an essential role in surfacing insights to improve developer experience.
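For illustration only, here is a minimal sketch of how “time to first successful call” could be computed from a hypothetical list of request records (the record shape and field names are assumptions, not from any particular tool):

```python
from datetime import datetime
from typing import Iterable, Optional

def time_to_first_success(signup_at: datetime,
                          requests: Iterable[dict]) -> Optional[float]:
    """Seconds between a developer signing up and their first 2xx response.

    `requests` is assumed to be an iterable of dicts such as
    {"timestamp": datetime, "status_code": int}, sorted by timestamp.
    """
    for req in requests:
        if 200 <= req["status_code"] < 300:
            return (req["timestamp"] - signup_at).total_seconds()
    return None  # the developer never managed a successful call
```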
Once your customers’ applications or services are in production, can you track how your releases affect their requests? Which version of your APIs are they using? What are the most popular endpoints? Are all customers getting the same performance? The same error rates?
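One way to make those questions answerable is to attach that business context to your telemetry. Here is a minimal sketch using the OpenTelemetry Python API; the attribute names `customer.id` and `api.version` are illustrative conventions, not a standard:

```python
from opentelemetry import trace

tracer = trace.get_tracer("api.handler")

def handle_request(customer_id: str, api_version: str, endpoint: str):
    # Tag each request span with business context so you can later
    # slice error rates and latency per customer and per API version.
    with tracer.start_as_current_span("handle_request") as span:
        span.set_attribute("customer.id", customer_id)
        span.set_attribute("api.version", api_version)
        span.set_attribute("http.route", endpoint)
        ...  # actual request handling goes here
```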
Anti-Pattern: Relying Only on API Monitoring
API monitoring typically involves sending automated test requests to API endpoints and verifying the responses. It helps you keep track of the health of your APIs even when there is no customer activity at all, and can be used to report uptime.
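For illustration, a synthetic check can be as simple as this sketch (the endpoint URL and expected payload are made up):

```python
import requests

def probe(url: str = "https://api.example.com/health") -> bool:
    """One synthetic monitoring probe: call the endpoint, verify the response."""
    try:
        resp = requests.get(url, timeout=5)
    except requests.RequestException:
        return False  # network failure counts as downtime
    # Check both the status code and the body, not just reachability.
    return resp.status_code == 200 and resp.json().get("status") == "ok"
```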
But API monitoring is nothing more than black-box testing: you don’t see what is happening inside. You only know that your requests reach the API and receive responses, and you can only create and maintain tests for a subset of possible use cases. You will miss something.
Anti-Pattern: Using API Access Logs for Troubleshooting
API access logs are records of requests and responses made to an API, capturing details such as the requestor’s IP address, HTTP method, requested endpoint, status codes and other relevant information for monitoring, debugging and security purposes.
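For example, a single entry in the common “combined” log format used by servers like NGINX and Apache looks roughly like this (all values invented):

```
203.0.113.42 - - [12/Mar/2024:14:05:31 +0000] "GET /v2/orders/1234 HTTP/1.1" 200 512 "-" "curl/8.4.0"
```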
Using access logs to troubleshoot API issues is like searching for a needle in your microservices haystack. One of our users had a performance issue with their APIs; they spent days searching for the root cause using access logs. Is it the gateway? Does it need more CPU? Is it a configuration issue? They enabled distributed tracing and, within seconds, they could see what was happening: All the time was spent in the authentication server issuing JWT tokens.
To really understand what is happening in your system, you need to be able to track a single user request from the API gateway to the upstream services and their dependencies for effective troubleshooting.
As Martin Thwaites puts it: “Logs are a poor man’s observability tool.” You need distributed tracing.
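In practice, distributed tracing works by propagating a trace context with every request, so each hop joins the same end-to-end trace. A minimal sketch with the OpenTelemetry Python API (the downstream service URL is hypothetical):

```python
import requests
from opentelemetry import trace
from opentelemetry.propagate import inject

tracer = trace.get_tracer("checkout.service")

def call_downstream():
    with tracer.start_as_current_span("call-payment-service"):
        headers = {}
        inject(headers)  # adds the W3C traceparent header for the current span
        # Every service that does the same ends up in one end-to-end trace.
        requests.get("https://payments.internal/charge", headers=headers)
```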
Anti-Pattern: Different Teams, Different Tools, Different Data
Different teams have different needs. The tool that works for your DevOps team might not be the best fit for your product team. By forcing everyone to use the same tool, you’re promoting not unity but inefficiency.
However, while promoting tool diversity might be the right move for your organization, there should be some coordination and integration between different tools and data sources. Open standards like OpenTelemetry help ensure that information can be shared and leveraged across teams and tools.
The DevOps and product teams should rely on the same error-rate numbers when assessing the performance and reliability of the newest API product launch.
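A sketch of what that coordination can look like in code: every service exports telemetry via OTLP to one shared OpenTelemetry Collector (the endpoint below is illustrative), and each team’s tool of choice reads from that same stream:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# All teams' tools can consume the same data behind this collector,
# so "error rate" means the same thing everywhere.
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
)
trace.set_tracer_provider(provider)
```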
Anti-Pattern: One Size Fits All
Different API architecture styles have different needs when it comes to observability.
REST APIs might be today’s default, but newer styles like GraphQL and gRPC are gaining popularity. Each of these styles has distinct characteristics and requirements regarding observability. Just think about GraphQL’s way of handling errors, which differs significantly from REST APIs. WebSocket and event-driven APIs have their own unique observability challenges as well.
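To illustrate the difference: a GraphQL server typically returns HTTP 200 even when the query failed, with failures reported in an `errors` array in the response body. A status-code-only check would miss them entirely; a minimal sketch:

```python
import requests

def graphql_request_failed(url: str, query: str) -> bool:
    """HTTP 200 is not enough for GraphQL: inspect the `errors` field too."""
    resp = requests.post(url, json={"query": query}, timeout=5)
    if resp.status_code != 200:
        return True
    # Per the GraphQL spec, partial or total failures land here.
    return bool(resp.json().get("errors"))
```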
Different API architectures have different needs, and your observability strategy needs to reflect that to improve the overall reliability of your application.
Anti-Pattern: Observability for Production Only
Why aren’t you using observability in pre-production?
By using distributed tracing during the API development life cycle, developers can trace the flow of requests and responses across different services, getting a comprehensive view of how the API behaves under various scenarios.
With observability, they can identify performance bottlenecks and even detect architectural issues like the N+1 problem in GraphQL queries.
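A rough sketch of how span data can surface an N+1 pattern: flag any trace where near-identical database spans repeat suspiciously often under a single query (the span-dict shape below is hypothetical):

```python
from collections import Counter

def detect_n_plus_one(spans: list, threshold: int = 10) -> list:
    """Return database statements that repeat suspiciously often in one trace.

    `spans` is assumed to be a list of dicts like
    {"name": "SELECT ...", "kind": "db"} collected from a single trace.
    """
    counts = Counter(s["name"] for s in spans if s.get("kind") == "db")
    return [stmt for stmt, n in counts.items() if n >= threshold]
```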
For teams leveraging APIOps practices, observability data can be used as an additional validation before promoting an API to the production stage.
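For instance, a promotion gate in a delivery pipeline could query the observability backend and block the release when the staging numbers are off. A sketch, assuming a hypothetical `query_metric` helper for your backend:

```python
def ready_for_production(query_metric) -> bool:
    """Block promotion if staging telemetry shows regressions.

    `query_metric` is a hypothetical callable that fetches an aggregate
    from your observability backend for the given environment.
    """
    error_rate = query_metric("error_rate", env="staging")
    p95_latency_ms = query_metric("latency_p95_ms", env="staging")
    return error_rate < 0.01 and p95_latency_ms < 300
```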
Anti-Pattern: Starting the Trace at the API Gateway Is Overrated
Tutorials about observability usually start the instrumentation process at the microservice level. Detailed insights into your microservices are great, but when dealing with APIs, a lot happens at the API gateway level.
You want to capture all your user transactions, including those that never reach your microservices because of rate-limiting rules, an authentication problem or a caching mechanism.
Starting the trace at the API gateway gives you a clear entrance point and a complete picture of the journey of all your users. That’s why it’s important to use a modern API gateway with built-in support for OpenTelemetry.
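To see why this matters, here is a sketch of gateway-style middleware that starts the trace before rate limiting and authentication run, so even rejected requests produce spans (the `rate_limited`, `authenticate` and `forward` helpers are hypothetical):

```python
from opentelemetry import trace

tracer = trace.get_tracer("api.gateway")

def gateway_handler(request, rate_limited, authenticate, forward):
    # The span starts before any gateway rule fires, so a request
    # rejected here is still captured as part of the user's journey.
    with tracer.start_as_current_span("gateway") as span:
        if rate_limited(request):
            span.set_attribute("gateway.outcome", "rate_limited")
            return 429
        if not authenticate(request):
            span.set_attribute("gateway.outcome", "auth_failed")
            return 401
        return forward(request)  # only now do we reach the microservices
```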
So there you have it: Seven API observability pitfalls to avoid. Keep your APIs observable and your users happy!
To hear more about cloud-native topics, join the Cloud Native Computing Foundation, Techstrong Group and the entire cloud-native community in Paris, France at KubeCon+CloudNativeCon EU 2024 – March 19-22, 2024.