A Monolithic Integration Layer Processing 12,000+ Daily Transactions, Split into 7 Independent Services

Microservices architecture migration from monolithic ERP integration

A B2B distribution company was processing roughly 12,000 orders per day through an integration layer that connected its ERP, warehouse management system, three logistics partners, and a payment gateway, and that layer had reached a breaking point. It was a single Node.js monolith that had grown to approximately 180,000 lines of code over several years and was deployed as one unit. Every deployment was a full-system event that required a maintenance window of around 45 minutes, typically scheduled for Sunday nights. In one incident, a bug in the logistics rate-calculation module took down payment processing for several hours because both flows shared the same runtime.

Brainstack Technologies led a seven-month migration that decomposed this monolith into seven independently deployable microservices — without requiring any system downtime and without disrupting the 12,000+ orders flowing through the pipeline daily.

Project Overview

Client: A B2B distribution company (name withheld under NDA)
Industry: Wholesale Distribution & Logistics
Scale: ~12,000 daily order transactions, 180K LOC monolith, 3 logistics partners, 1 payment gateway
Engagement Duration: 7 months (6 weeks architecture & planning, 5 months phased extraction)
Team: 3 backend engineers, 1 DevOps/infrastructure engineer, 1 architect (part-time), 1 QA engineer
Challenge: A Node.js monolith that had grown over several years, handling all integration flows as a single deployable unit, which created cascading failure risk, 45-minute deployment windows, and an inability to scale individual flows independently
Solution: Strangler fig migration extracting seven domain-aligned microservices behind a Kong API gateway, deployed on Kubernetes with independent CI/CD pipelines per service

The Challenge

The integration layer had started life as a straightforward Node.js application that connected the company's ERP to a single logistics provider via REST APIs. Over time, as the company onboarded two additional logistics partners (one using SFTP file exchange, one using SOAP APIs), added a payment gateway integration, and built inventory synchronization with their warehouse management system, the codebase grew to approximately 180,000 lines — all in a single deployable unit.

By the time we were brought in, the problems were compounding:

Cascading failures — a memory leak in logistics rate-calculation took down payment processing for hours, resulting in significant penalties and delayed shipments
45-min deployment windows — every release, even a one-line config change, required a full-system redeploy with 72-hour advance notice to logistics partners
All-or-nothing scaling — during peak seasons, scaling payment processing meant provisioning 3× infrastructure for all flows, most of which didn't need the extra capacity
Developer velocity stalled — ~30% of engineering time spent on integration testing that only existed because of architectural coupling across shared modules

The triggering outage made the case for migration. A rate-calculation change for one logistics partner introduced a memory leak that exhausted the Node.js heap within hours of deployment. Because all integration flows, including payment processing, ran in the same process, the memory leak took down the entire system. No orders could be processed while the team rolled back. The post-mortem put the cost of the outage in the tens of thousands of dollars, from delayed shipments and penalty fees charged by logistics partners.

Deployments required Sunday-night maintenance windows. Every release — even a one-line configuration change — required redeploying the entire monolith, which meant a 45-minute window where no orders were processed. The ops team had to coordinate with logistics partners to pause inbound feeds, which required 72-hour advance notice. This made even minor fixes a multi-day process.

Scaling was all-or-nothing. During peak seasons (Black Friday, holiday pre-orders), the order volume tripled. The payment processing flow needed to scale, but because it shared resources with logistics rate calculation and inventory sync, scaling meant provisioning three times the infrastructure across all flows — most of which didn't need the extra capacity.

Developer velocity had stalled. With four developers working in the same codebase, merge conflicts were constant. A change to the payment flow required regression testing against logistics and inventory integrations because of shared utility modules, database connections, and error-handling patterns that were deeply intertwined. The team estimated they spent roughly 30% of their time on integration testing that only existed because of architectural coupling.

Our Approach

Strangler Fig Migration Strategy

We ruled out a big-bang rewrite immediately. The integration layer was processing 12,000+ orders per day; it couldn't go offline, and the business couldn't tolerate running a new untested system in parallel for months. Instead, we used a strangler fig approach: extracting one service at a time from the monolith while the remaining monolith continued to handle everything else.

The first six weeks were spent on architecture and decomposition planning. We analysed the monolith's code, database schema, and runtime call patterns (we instrumented the monolith with OpenTelemetry for two weeks to capture actual request flows, not just what the code suggested). This analysis revealed seven natural domain boundaries:

  1. Order Ingestion — receiving and validating incoming orders from the ERP
  2. Payment Processing — gateway communication, authorization, settlement
  3. Logistics: Partner A — REST-based carrier integration
  4. Logistics: Partner B — SFTP file exchange (daily batch)
  5. Logistics: Partner C — SOAP API integration
  6. Inventory Sync — bidirectional sync with the warehouse management system
  7. Notification & Alerting — order confirmations, shipment tracking, failure alerts

We intentionally split logistics into three separate services rather than one unified “logistics service.” The three partners used fundamentally different protocols (REST, SFTP, SOAP), had different SLA requirements, and changed at different rates. Combining them into a single service would have recreated the coupling problem at a smaller scale.

The extraction order was deliberate: we started with Notification & Alerting (lowest risk, no transactional data, easiest to validate) and ended with Payment Processing (highest risk, regulatory requirements, most complex error handling). This gave the team progressively harder challenges rather than starting with the most dangerous one.

API Gateway and Service Communication

We deployed Kong as the API gateway in front of both the monolith and the emerging services. During migration, Kong handled the routing logic: requests for extracted domains (e.g., /notifications/*) were routed to the new service, while everything else continued to hit the monolith. As each service was extracted, we updated Kong's routing configuration — no code changes to the monolith required for the switchover.
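The routing behaviour described above can be sketched in a few lines. This is not Kong configuration; it is a plain-Python illustration of prefix-based routing with a monolith fallback. The upstream names and every path prefix other than /notifications are hypothetical.

```python
# Illustrative only: Kong does this through its routing config, not Python.
MONOLITH = "monolith"  # hypothetical upstream name

# Prefixes that have already been extracted; grows as the migration proceeds.
extracted_routes = {
    "/notifications": "notification-service",  # hypothetical upstream name
}


def resolve_upstream(path: str, routes: dict[str, str]) -> str:
    """Return the upstream for a request path, defaulting to the monolith."""
    # Prefer the longest matching prefix so nested routes behave predictably.
    for prefix in sorted(routes, key=len, reverse=True):
        if path == prefix or path.startswith(prefix + "/"):
            return routes[prefix]
    return MONOLITH


print(resolve_upstream("/notifications/order-confirmed", extracted_routes))
print(resolve_upstream("/payments/authorize", extracted_routes))

# Extracting the next service is a routing change, not a monolith code change.
extracted_routes["/inventory"] = "inventory-sync-service"  # hypothetical
print(resolve_upstream("/inventory/sync", extracted_routes))
```

The point of the fallback is that the monolith never needs to know which domains have moved; only the routing table changes.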

For inter-service communication, we used two patterns based on the consistency requirements of each flow:

Synchronous REST for the order-payment flow. When an order comes in, payment authorization must happen immediately and return a success/failure before the order is confirmed. This is a hard consistency requirement — eventual consistency is not acceptable for payment authorization. These calls go through internal REST APIs with circuit breakers (we used Opossum in Node.js) to prevent cascading failures.
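To make the circuit-breaker idea concrete, here is a minimal sketch in Python. It is not Opossum's API and not the project's code; the class name, thresholds, and timeout are assumptions chosen for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # hypothetical default
        self.reset_timeout = reset_timeout          # seconds, hypothetical default
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast instead of piling more load on a struggling service.
                raise CircuitOpenError("downstream service unavailable")
            # Timeout elapsed: let one trial call through (half-open state).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                # Open the breaker, or re-open it after a failed trial call.
                self.opened_at = self.clock()
            raise
        # A success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

In a flow like order-payment, the caller would wrap the authorization request in `breaker.call(...)` and treat `CircuitOpenError` as an immediate, explicit failure rather than waiting on a timeout.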

Asynchronous messaging via RabbitMQ for everything else. Inventory updates, logistics dispatch notifications, and alerting all use event-driven messaging. When an order is confirmed, an “order.confirmed” event is published to RabbitMQ, and the relevant services consume it independently. If the notification service is temporarily down, the message waits in the queue and the order isn't affected.
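The decoupling property is easier to see with a toy model. The sketch below uses an in-memory stand-in for a broker, not RabbitMQ or any client library; the queue names and payload fields are hypothetical.

```python
from collections import deque


class InMemoryBroker:
    """Stand-in for a broker: one queue per subscriber, bound by event name."""

    def __init__(self):
        self.queues = {}    # queue name -> deque of pending messages
        self.bindings = {}  # event name -> list of queue names

    def bind(self, event_name, queue_name):
        self.queues.setdefault(queue_name, deque())
        self.bindings.setdefault(event_name, []).append(queue_name)

    def publish(self, event_name, payload):
        # Every bound queue gets the message, so consumers progress independently.
        for queue_name in self.bindings.get(event_name, []):
            self.queues[queue_name].append(payload)

    def drain(self, queue_name, handler):
        """Deliver every pending message to handler; return how many were handled."""
        queue = self.queues[queue_name]
        handled = 0
        while queue:
            handler(queue.popleft())
            handled += 1
        return handled


broker = InMemoryBroker()
broker.bind("order.confirmed", "inventory")      # hypothetical queue names
broker.bind("order.confirmed", "notifications")

broker.publish("order.confirmed", {"order_id": "A-1"})
broker.publish("order.confirmed", {"order_id": "A-2"})

# Inventory consumes right away; the notification consumer is "down" for now.
inventory_seen = []
broker.drain("inventory", inventory_seen.append)

# Nothing was lost: both messages are still queued for notifications.
pending = len(broker.queues["notifications"])

# When the notification consumer comes back, it catches up in order.
notifications_seen = []
broker.drain("notifications", notifications_seen.append)
```

A real broker adds durability, acknowledgements, and redelivery on top of this; the sketch only shows why a slow or absent consumer does not block the publisher.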

The hardest communication problem was the logistics batch service (Partner B). This partner expected a single consolidated SFTP file every four hours, but orders trickled in continuously. We built a small aggregation service that consumed individual order events from RabbitMQ, batched them into 4-hour windows, generated the SFTP file in the partner's expected format, and uploaded it on schedule. This service was arguably the most custom piece of the entire architecture.
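The windowing step can be sketched as follows. This assumes windows aligned to 00:00, 04:00, 08:00 and so on in UTC, and it invents a CSV-like layout and field names, since the partner's actual file format is not described here.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(hours=4)


def window_start(ts: datetime) -> datetime:
    """Floor a timestamp to the start of its 4-hour window."""
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    windows_elapsed = (ts - midnight) // WINDOW  # timedelta // timedelta -> int
    return midnight + windows_elapsed * WINDOW


def batch_orders(events):
    """Group order events by the 4-hour window they fall into."""
    batches = defaultdict(list)
    for event in events:
        batches[window_start(event["confirmed_at"])].append(event)
    return dict(batches)


def render_batch_file(orders) -> str:
    """Render one batch as text, oldest order first (hypothetical layout)."""
    lines = ["order_id,sku,quantity"]  # hypothetical columns
    for order in sorted(orders, key=lambda o: o["confirmed_at"]):
        lines.append(f'{order["order_id"]},{order["sku"]},{order["quantity"]}')
    return "\n".join(lines) + "\n"


utc = timezone.utc
events = [
    {"order_id": "A-1", "sku": "SKU-1", "quantity": 2,
     "confirmed_at": datetime(2024, 1, 1, 3, 59, tzinfo=utc)},
    {"order_id": "A-2", "sku": "SKU-2", "quantity": 1,
     "confirmed_at": datetime(2024, 1, 1, 4, 0, tzinfo=utc)},
    {"order_id": "A-3", "sku": "SKU-1", "quantity": 5,
     "confirmed_at": datetime(2024, 1, 1, 7, 30, tzinfo=utc)},
]

batches = batch_orders(events)
for start, orders in sorted(batches.items()):
    print(start.isoformat(), len(orders))
    print(render_batch_file(orders), end="")
```

Keying each batch by its window start makes the upload idempotent in principle: re-running a window regenerates the same file rather than appending to it.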

Containerization and Orchestration

Each of the seven services was containerized with Docker and deployed on a managed Kubernetes cluster (AWS EKS). We set up independent CI/CD pipelines using GitHub Actions — each service has its own repository, its own test suite, its own pipeline, and can be deployed to staging or production independently.

Deployment went from a 45-minute Sunday-night maintenance window to a rolling update that completes in under 4 minutes per service with zero downtime (Kubernetes rolling deployment strategy with readiness probes). The team now deploys individual services 8-12 times per week across the seven services combined, compared to the previous cadence of once per week for the entire monolith.

For the payment service specifically, we configured more conservative deployment guardrails: canary deployments that route a small percentage of payment traffic to the new version before full rollout, automatic rollback if the error rate exceeds a defined threshold, and a mandatory staging environment test against the payment gateway's sandbox before production deployment. The earlier outage had made leadership understandably cautious about payment-related changes.
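The rollback guardrail boils down to a small decision rule. The sketch below is an assumption about how such a rule might look; the threshold and minimum-traffic values are hypothetical, and in practice this logic lives in the deployment tooling rather than in application code.

```python
def canary_decision(canary_requests, canary_errors, max_error_rate, min_requests):
    """Return 'wait', 'rollback', or 'promote' for a canary release."""
    if canary_requests < min_requests:
        return "wait"  # not enough traffic yet to judge the new version
    error_rate = canary_errors / canary_requests
    if error_rate > max_error_rate:
        return "rollback"
    return "promote"


# Hypothetical guardrail: at most 1% errors, judged on at least 500 requests.
print(canary_decision(100, 0, max_error_rate=0.01, min_requests=500))
print(canary_decision(1000, 25, max_error_rate=0.01, min_requests=500))
print(canary_decision(1000, 5, max_error_rate=0.01, min_requests=500))
```

The minimum-traffic check matters for a canary that only receives a small share of requests: a single early failure should not trigger a rollback on its own.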

Technology Stack

Service Layer

Node.js (Express): Order Ingestion, Payment Processing, and the three Logistics services. Node was the monolith's original language, so most extraction was straightforward.
Python (FastAPI): Inventory Sync service. The warehouse management system's SDK was Python-only, so this service was written in Python from scratch rather than wrapping the SDK in a Node.js child process.

Gateway & Messaging

Kong API Gateway: chosen over AWS API Gateway because Kong allowed us to run the same gateway configuration in local development and production, simplifying the dev workflow. The routing rules that split traffic between monolith and services were managed as code in Kong's declarative config.
RabbitMQ: chosen over Kafka because message throughput (~12K orders/day) didn't justify Kafka's operational complexity. RabbitMQ's simpler queue model was a better fit for the event patterns we needed.

Data & Infrastructure

PostgreSQL 15 — each service owns its own database schema. We enforced schema-per-service at the PostgreSQL level to prevent accidental cross-service queries.
AWS EKS + Docker + GitHub Actions — containerised deployment on managed Kubernetes, with one independent CI/CD pipeline per service.

Observability

OpenTelemetry for distributed tracing, ELK Stack for centralised logging, Prometheus + Grafana for metrics and SLO dashboards — all established before extracting the first service.

Results

45 min → 4 min: deploy time per service, zero downtime
1 → 8–12×/week: deployments across services per week
Isolated: failures contained to the affected service only
~60% saved: peak-season infrastructure cost reduction
8–10 wks → 3 wks: time to onboard a new logistics partner
~30% → ~10%: developer time on integration overhead

Deployment Speed:

Deploy time per service dropped from 45 minutes (full monolith) to under 4 minutes (rolling update, zero downtime). Maintenance windows eliminated entirely.

Deployment Frequency:

From 1 deployment per week (Sunday night, coordinated) to 8-12 deployments per week across services. Individual services are updated 1-3 times per week.

Incident Blast Radius:

Before: a bug in any module could take down all 12,000+ daily transactions. After: failures are isolated to the affected service. The payment service has had two incidents since migration — neither affected logistics or inventory sync.

Scaling Efficiency:

During the first post-migration peak season, the team scaled the Payment Processing and Order Ingestion services to 3x capacity while the other five services stayed at baseline — saving an estimated 60% on peak-season infrastructure costs compared to scaling the entire monolith.

New Integration Speed:

Onboarding a fourth logistics partner (REST API) took 3 weeks post-migration. Pre-migration, the team estimated the same work would have taken 8-10 weeks because of regression testing requirements and deployment coordination.

Developer Productivity:

Merge conflicts dropped substantially. Developers reported spending roughly 10% of their time on integration-related overhead, down from ~30% pre-migration.

Observability & Monitoring

We established the observability stack before extracting the first service — this was one of the most valuable decisions in the project. By instrumenting the monolith with OpenTelemetry first, the team could see request flows across the monolith's internal modules. When we started extracting services, the tracing data simply reflected the new boundaries without any additional instrumentation work.

The observability stack includes:

Distributed tracing (OpenTelemetry + Jaeger): Every request gets a trace ID that follows it across all seven services. When the ops team investigates a slow order, they can see exactly which service introduced the latency — including time spent waiting for external partner APIs. Mean time to diagnose production issues dropped from ~2 hours to ~15 minutes.
Centralised logging (ELK Stack): All service logs ship to a shared Elasticsearch cluster, correlated by trace ID. Searching for a specific order ID returns logs from every service that touched that order, in chronological order.
SLO dashboards (Prometheus + Grafana): Each service has defined SLOs — for example, the Payment Processing service targets a p99 latency under 800ms and an error rate below 0.1%. When a service approaches its error budget, the dashboard alerts the on-call engineer before users are affected. The team reviews SLO compliance weekly.
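The error-budget idea behind the SLO dashboards can be illustrated with simple arithmetic. This sketch uses the 0.1% error-rate target mentioned above; the request counts and the 80% early-warning threshold are made up for the example.

```python
def error_budget_report(total_requests, failed_requests,
                        slo_error_rate=0.001, alert_at=0.8):
    """Summarise how much of the error budget has been used in a window."""
    allowed_failures = total_requests * slo_error_rate
    consumed = failed_requests / allowed_failures if allowed_failures else 0.0
    return {
        "allowed_failures": allowed_failures,
        "budget_consumed": consumed,      # 1.0 means the budget is exhausted
        "alert": consumed >= alert_at,    # hypothetical early-warning threshold
    }


# Hypothetical window: 12,000 payment requests with 9, then 10, failures.
print(error_budget_report(12_000, 9))
print(error_budget_report(12_000, 10))
```

With a 0.1% target, 12,000 requests allow roughly 12 failures, so the tenth failure pushes consumption past 80% and would raise the alert before the budget is actually spent.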

Key Engineering Lessons

01. Instrument before you extract. We deployed OpenTelemetry on the monolith two weeks before extracting the first service. This gave the team baseline visibility into request flows, latency, and error rates, which meant we could immediately compare a newly extracted service's performance against its monolith-era baseline. Without this, we would have been flying blind on whether each extraction improved or degraded performance.

02. Split logistics into three services, not one. Our initial architecture proposed a single “Logistics Service” handling all three partners. During planning, we realised that Partner B (SFTP batch) and Partner A (REST real-time) had fundamentally different runtime characteristics, failure modes, and change frequencies. Combining them would have created a mini-monolith. The decision to split saved us from re-coupling the architecture we had just decoupled.

03. The shared database was the real migration bottleneck, not the code. Extracting service code was relatively straightforward. The hard part was untangling shared database tables. The monolith used a single PostgreSQL database where the payment module and the logistics module both read from an “orders” table with 47 columns. We had to decide which service owned which columns, build data synchronization events for cross-service reads, and migrate foreign key relationships, all while 12,000 orders per day continued flowing through the system.
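One way to picture the column-ownership exercise is as a mapping from columns to owning services, applied row by row. The sketch below is purely illustrative: the column names and service names are hypothetical, and the real table had 47 columns.

```python
# Hypothetical column-to-owner mapping for a wide "orders" row.
COLUMN_OWNERS = {
    "order_id": "order-ingestion",
    "customer_id": "order-ingestion",
    "payment_status": "payment",
    "authorization_code": "payment",
    "carrier_code": "logistics",
    "tracking_number": "logistics",
}

SHARED_KEY = "order_id"  # every service keeps the key so rows can be correlated


def split_row_by_owner(row, owners=COLUMN_OWNERS, key=SHARED_KEY):
    """Split one wide row into per-service rows, each carrying the shared key."""
    unowned = [col for col in row if col not in owners]
    if unowned:
        # Refuse to migrate columns that nobody has claimed yet.
        raise ValueError(f"columns without an owner: {unowned}")
    per_service = {}
    for column, value in row.items():
        service_row = per_service.setdefault(owners[column], {key: row[key]})
        service_row[column] = value
    return per_service


row = {
    "order_id": "A-1",
    "customer_id": "C-9",
    "payment_status": "authorized",
    "authorization_code": "XYZ",
    "carrier_code": "PARTNER_A",
    "tracking_number": None,
}
for service, data in split_row_by_owner(row).items():
    print(service, data)
```

Failing loudly on unowned columns is a deliberate choice in this sketch: an unclaimed column is usually a sign that an ownership decision is still pending.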

04. Team topology had to change alongside the architecture. The four developers who previously worked on the monolith initially continued reviewing each other's PRs across all services. This recreated the coordination overhead that microservices were supposed to eliminate. We restructured into two pairs, each responsible for a set of services end-to-end (deploy, monitor, fix). Deployment frequency doubled within two weeks of this change.

Tags: Microservices, Monolith Migration, Strangler Fig Pattern, Kubernetes, API Gateway, Distributed Systems, ERP Integration, DevOps

Spending Sunday Nights on Deployments?

If your deployments require maintenance windows, your scaling bills spike because you can't scale individual components, or a bug in one module can take down unrelated flows — the architecture is working against you. We start with a two-week instrumentation and analysis phase to map your actual request flows, identify natural service boundaries, and build a phased migration plan that doesn't require betting the business on a big-bang rewrite.