Our automation platform experienced rapid growth, which gave us a good problem to have. The platform began as a basic application, built quickly on a monolithic codebase. At the time, that approach made sense: one codebase, one database, one team. Everything was in one place.
Once the system was handling over 100,000 automation tasks per hour, cracks began to surface. Downtime became frequent. Deployments grew risky. A small issue in one part of the system could trigger widespread failures across several others.
We realized we needed to scale our architecture, not just our user numbers. This is the story of our transition from a monolithic system to a microservices architecture: the advantages, the challenges and what we would do differently next time.
Chapter 1: Life Inside the Monolith
Everything we needed lived in a single application: job processing, user administration, event logging and analytics. Every team worked in the same shared repository, and a single PostgreSQL database backed every feature.
At first, it was fine. Then the complexity caught up with us:
- Teams blocked each other waiting for deployment windows.
- A single failure could take the whole system down.
- We could not scale individual components; the entire monolith had to scale together.
- Testing and debugging took longer and longer.
- We were soon spending more time fixing bugs than building new features.
Chapter 2: Adoption of Microservices
First, we asked ourselves:
- Do we really need microservices?
- Can we maintain multiple services with our current team size?
- Do we have a strategy for service-to-service communication?
We identified the parts of the system causing the most trouble. The job processor needed to scale independently. Logs were consuming storage. Notifications crashed under heavy load.
That is where we started.
Chapter 3: The First Services
We extracted the job processor into a new Node.js application. It got its own database and a dedicated message queue for communication. It processed jobs on its own and published results over Redis publish/subscribe (Pub/Sub), which we later replaced with Kafka.
It wasn’t easy.
We started without proper error handling, and some messages were lost. Stabilizing the service took weeks of effort. But once it settled, we could deploy the job processor on its own without touching any other part of the system.
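For context, the publish path looked roughly like the sketch below, assuming ioredis and a hypothetical `job.results` channel. Redis Pub/Sub is fire-and-forget: if no subscriber is listening at the moment of publish, the message simply vanishes, which is exactly how some of those messages were lost.

```js
// Minimal sketch of the early job processor publish path.
// Assumes ioredis; runAutomation() and the 'job.results' channel are hypothetical.
const Redis = require('ioredis');

const publisher = new Redis(process.env.REDIS_URL);

async function processJob(job) {
  const result = await runAutomation(job); // hypothetical: executes the automation task

  // Fire-and-forget: publish() resolves to the number of subscribers that
  // received the message. If that number is 0, the result is simply gone.
  await publisher.publish('job.results', JSON.stringify({
    jobId: job.id,
    status: result.ok ? 'completed' : 'failed',
    finishedAt: new Date().toISOString(),
  }));
}
```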
Encouraged, we continued:
- Notifications became their own service.
- User events and logs moved into a separate logging service.
- Analytics became a standalone background batch process.
Chapter 4: Unexpected Challenges
Splitting the system turned out to be harder than drawing boxes on a whiteboard. We quickly ran into several challenges:
1. Authentication and Shared State
In the monolith, authentication and session handling lived in one place. With microservices, every service had to verify tokens on its own, so we ended up building a shared authorization library.
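The library itself is beyond the scope of this story, but a minimal sketch of the kind of middleware it might export looks like this, assuming Express services and JWT bearer tokens signed with a shared secret (all names here are illustrative):

```js
// Sketch of a shared auth middleware, assuming jsonwebtoken and Express.
const jwt = require('jsonwebtoken');

// Each service mounts this with the same secret (or public key) so token
// verification behaves identically everywhere.
function requireAuth(secret) {
  return (req, res, next) => {
    const header = req.headers.authorization || '';
    const token = header.startsWith('Bearer ') ? header.slice(7) : null;
    if (!token) return res.status(401).json({ error: 'missing token' });
    try {
      req.user = jwt.verify(token, secret); // throws if the token is invalid or expired
      next();
    } catch (err) {
      res.status(401).json({ error: 'invalid token' });
    }
  };
}

module.exports = { requireAuth };
```

Any service can then protect its routes with something like `app.use(requireAuth(process.env.JWT_SECRET))`.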
2. Deployment Complexity
Instead of one deployment target, we suddenly had five. Getting Docker and GitLab continuous integration and continuous deployment (CI/CD) pipelines working correctly took substantial effort.
3. Monitoring and Debugging
Testing and debugging became harder once logs were scattered across different containers. We set up centralized logging and monitoring with Grafana and Prometheus to regain visibility into what was happening.
4. Communication Failures
When one service went down, others hung waiting on it. We added retries, fallbacks and circuit breakers to keep failures contained.
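We won't reproduce our exact implementation, but the pattern looks roughly like this hand-rolled sketch; the names and thresholds are illustrative, and in practice a library such as opossum covers the same ground:

```js
// Retry with exponential backoff: give a flaky dependency a few chances.
async function withRetry(fn, { attempts = 3, baseDelayMs = 200 } = {}) {
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      if (i === attempts - 1) throw err;
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** i));
    }
  }
}

// Circuit breaker: after repeated failures, stop calling the dependency and
// use the fallback immediately instead of piling up slow, doomed requests.
class CircuitBreaker {
  constructor({ failureThreshold = 5, resetTimeoutMs = 30000 } = {}) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.failures = 0;
    this.openedAt = null;
  }

  async call(fn, fallback) {
    const open = this.openedAt && Date.now() - this.openedAt < this.resetTimeoutMs;
    if (open) return fallback(); // fail fast while the circuit is open

    try {
      const result = await withRetry(fn);
      this.failures = 0;
      this.openedAt = null;
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.failureThreshold) this.openedAt = Date.now();
      return fallback();
    }
  }
}
```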
Chapter 5: Unanticipated Wins
Not everything was a struggle. Some things turned out better than we expected:
- We deployed faster: a notification bug fix no longer had to wait for the next major release.
- Teams became more independent: each team owned a smaller slice of code and ran its own service.
- We had fewer critical crashes: the core system kept running even when the analytics service was down, because analytics was no longer a critical component.
- Scaling became targeted: the job processor ran up to three containers depending on load, while other services ran on a single container.
That alone felt like a real step up in what the system could handle.
Chapter 6: Our Observability Stack
With several services in production, we urgently needed an observability platform that gave us a view of all of them. We chose practical, proven infrastructure tools:
- Prometheus for metrics
- Grafana for dashboards
- Loki for logs
- Alertmanager for alerting us whenever something broke
Adding tracing let us follow an application programming interface (API) request as it travelled between services, down to the database and back.
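To make the metrics side concrete, here is a hedged sketch of how a Node.js service can expose Prometheus metrics with prom-client; the metric names and port are illustrative rather than our actual configuration:

```js
// Sketch: exposing Prometheus metrics from a Node.js service with prom-client.
const express = require('express');
const client = require('prom-client');

const register = new client.Registry();
client.collectDefaultMetrics({ register }); // CPU, memory, event loop lag, etc.

// Illustrative business metrics for the job processor.
const jobsProcessed = new client.Counter({
  name: 'jobs_processed_total',
  help: 'Total automation jobs processed, labelled by outcome',
  labelNames: ['status'],
  registers: [register],
});
const jobDuration = new client.Histogram({
  name: 'job_duration_seconds',
  help: 'Time spent processing a single job',
  registers: [register],
});
// In the job handler: jobsProcessed.inc({ status: 'completed' });
//                     jobDuration.observe(elapsedSeconds);

const app = express();

// Prometheus scrapes this endpoint; Grafana dashboards and Alertmanager
// rules are built on top of the resulting time series.
app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

app.listen(9100);
```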
Chapter 7: Things We Would Do Differently
Looking back, we made our share of mistakes:
- We ignored observability at the start. Bolting monitoring on after the fact made early debugging much harder than it needed to be.
- We should have built a shared library for common code (authentication, logging and configuration) much earlier.
- Redis Pub/Sub served us well enough at first, but the later switch to Kafka proved a much better fit for our messaging.
Next time, we would spend more time designing the shared foundations and the messaging strategy up front, before splitting out individual services.
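For illustration, the consumer side after the move to Kafka looks roughly like the sketch below, assuming kafkajs and hypothetical topic and group names. Unlike Redis Pub/Sub, Kafka persists messages and a consumer group resumes from its committed offset after a restart, so a crash no longer means lost events.

```js
// Sketch of a Kafka consumer (kafkajs assumed); topic and group names are illustrative.
const { Kafka } = require('kafkajs');

const kafka = new Kafka({ clientId: 'notification-service', brokers: ['kafka:9092'] });
const consumer = kafka.consumer({ groupId: 'notifications' });

async function run() {
  await consumer.connect();
  await consumer.subscribe({ topic: 'job.results', fromBeginning: false });

  await consumer.run({
    eachMessage: async ({ message }) => {
      const event = JSON.parse(message.value.toString());
      // Offsets are committed once eachMessage resolves, so a crash replays
      // unprocessed messages instead of dropping them.
      await handleJobResult(event); // hypothetical handler
    },
  });
}

run().catch((err) => {
  console.error('consumer failed', err);
  process.exit(1);
});
```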
Chapter 8: Where We Are Now
Our automation platform now processes over 100,000 events per hour across at least 10 microservices. Each service has its own repository, its own CI/CD pipeline and its own metrics.
We deploy multiple times a day. Downtime is rare. The team moves faster, with more focus, and is happier than before.
Microservices gave us better scalability, and they also changed how we work.
Final Thoughts
Microservices deserve careful evaluation before you decide they are right for your team or product. Splitting a monolith pays off when the monolith has genuinely become a bottleneck and the team can handle the added complexity.
Microservices offer flexibility, but they also introduce additional management complexities. Start small. Monitor everything. Plan for failure.
Our experience taught us a valuable lesson, and it is why we have no desire to go back to the way we used to work.