A Monolithic Layer Processing 12,000+ Daily Transactions Split into 7 Independent Services

A B2B distribution company was processing roughly 12,000 orders per day through an integration layer that connected its ERP, warehouse management system, three logistics partners, and a payment gateway, and that layer had reached a breaking point. It was a single Node.js monolith that had grown to approximately 180,000 lines of code over several years and was deployed as one unit. Every deployment was a full-system event that required a maintenance window of around 45 minutes, typically scheduled for Sunday nights. A bug in the logistics rate-calculation module had taken down payment processing for several hours because both flows shared the same runtime.
Brainstack Technologies led a seven-month migration that decomposed this monolith into seven independently deployable microservices — without requiring any system downtime and without disrupting the 12,000+ orders flowing through the pipeline daily.
Project Overview
The Challenge
The integration layer had started life as a straightforward Node.js application that connected the company's ERP to a single logistics provider via REST APIs. Over time, as the company onboarded two additional logistics partners (one using SFTP file exchange, one using SOAP APIs), added a payment gateway integration, and built inventory synchronization with their warehouse management system, the codebase grew to approximately 180,000 lines — all in a single deployable unit.
By the time we were brought in, the problems were compounding:
The triggering outage made the case for migration. A rate-calculation change for one logistics partner introduced a memory leak that exhausted the Node.js heap within hours of deployment. Because all integration flows — including payment processing — ran in the same process, the memory leak took down the entire system. Orders couldn't be processed while the team rolled back. The post-mortem estimated the outage cost in the tens of thousands of dollars in delayed shipments and penalty fees from logistics partners.
Deployments required Sunday-night maintenance windows. Every release — even a one-line configuration change — required redeploying the entire monolith, which meant a 45-minute window where no orders were processed. The ops team had to coordinate with logistics partners to pause inbound feeds, which required 72-hour advance notice. This made even minor fixes a multi-day process.
Scaling was all-or-nothing. During peak seasons (Black Friday, holiday pre-orders), the order volume tripled. The payment processing flow needed to scale, but because it shared resources with logistics rate calculation and inventory sync, scaling meant provisioning three times the infrastructure across all flows — most of which didn't need the extra capacity.
Developer velocity had stalled. With four developers working in the same codebase, merge conflicts were constant. A change to the payment flow required regression testing against logistics and inventory integrations because of shared utility modules, database connections, and error-handling patterns that were deeply intertwined. The team estimated they spent roughly 30% of their time on integration testing that only existed because of architectural coupling.


Our Approach
Strangler Fig Migration Strategy
We ruled out a big-bang rewrite immediately. The integration layer was processing 12,000+ orders per day; it couldn't go offline, and the business couldn't tolerate running a new untested system in parallel for months. Instead, we used a strangler fig approach: extracting one service at a time from the monolith while the remaining monolith continued to handle everything else.
The first six weeks were spent on architecture and decomposition planning. We analysed the monolith's code, database schema, and runtime call patterns (we instrumented the monolith with OpenTelemetry for two weeks to capture actual request flows, not just what the code suggested). This analysis revealed seven natural domain boundaries:
- Order Ingestion — receiving and validating incoming orders from the ERP
- Payment Processing — gateway communication, authorization, settlement
- Logistics: Partner A — REST-based carrier integration
- Logistics: Partner B — SFTP file exchange (daily batch)
- Logistics: Partner C — SOAP API integration
- Inventory Sync — bidirectional sync with the warehouse management system
- Notification & Alerting — order confirmations, shipment tracking, failure alerts
We intentionally split logistics into three separate services rather than one unified “logistics service.” The three partners used fundamentally different protocols (REST, SFTP, SOAP), had different SLA requirements, and changed at different rates. Combining them into a single service would have recreated the coupling problem at a smaller scale.
The extraction order was deliberate: we started with Notification & Alerting (lowest risk, no transactional data, easiest to validate) and ended with Payment Processing (highest risk, regulatory requirements, most complex error handling). This gave the team progressively harder challenges rather than starting with the most dangerous one.
API Gateway and Service Communication
We deployed Kong as the API gateway in front of both the monolith and the emerging services. During migration, Kong handled the routing logic: requests for extracted domains (e.g., /notifications/*) were routed to the new service, while everything else continued to hit the monolith. As each service was extracted, we updated Kong's routing configuration — no code changes to the monolith required for the switchover.
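The routing behaviour described above can be sketched as a prefix table with the monolith as the fallback. This is only an illustration of the strangler routing idea, not Kong's configuration format or its actual matching algorithm. The upstream names and the `/inventory/` prefix are assumptions made for the example.

```typescript
// Hypothetical upstream names; the real Kong configuration is not shown in the source.
type Upstream = "monolith" | "notification-service" | "inventory-sync-service";

interface Route {
  prefix: string; // path prefix owned by an extracted service
  upstream: Upstream;
}

// Routes are added one at a time as each domain is extracted.
const extractedRoutes: Route[] = [
  { prefix: "/notifications/", upstream: "notification-service" },
];

// Longest matching prefix wins; anything unmatched falls through to the monolith.
function resolveUpstream(path: string, routes: Route[]): Upstream {
  let best: Route | undefined;
  for (const route of routes) {
    const matches = path.startsWith(route.prefix);
    if (matches && (best === undefined || route.prefix.length > best.prefix.length)) {
      best = route;
    }
  }
  return best ? best.upstream : "monolith";
}
```

In this sketch, cutting over a newly extracted domain means appending one entry to the route table. The monolith keeps serving every path that has no entry yet.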
For inter-service communication, we used two patterns based on the consistency requirements of each flow:
Synchronous REST for the order-payment flow. When an order comes in, payment authorization must happen immediately and return a success/failure before the order is confirmed. This is a hard consistency requirement — eventual consistency is not acceptable for payment authorization. These calls go through internal REST APIs with circuit breakers (we used Opossum in Node.js) to prevent cascading failures.
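To make the circuit breaker idea concrete, here is a minimal hand-rolled sketch of the state machine. It is not Opossum's API, which the project actually used. The threshold and timeout parameters, the injectable clock, and the method names are assumptions chosen for illustration.

```typescript
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: BreakerState = "closed";
  private consecutiveFailures = 0;
  private openedAt = 0;
  private readonly failureThreshold: number; // consecutive failures before opening (assumed)
  private readonly resetTimeoutMs: number;   // time to stay open before allowing a trial call
  private readonly now: () => number;        // injectable clock, useful for testing

  constructor(failureThreshold: number, resetTimeoutMs: number, now: () => number = () => Date.now()) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.now = now;
  }

  // Returns the current state, moving open -> half-open once the timeout has elapsed.
  currentState(): BreakerState {
    if (this.state === "open" && this.now() - this.openedAt >= this.resetTimeoutMs) {
      this.state = "half-open";
    }
    return this.state;
  }

  recordSuccess(): void {
    this.consecutiveFailures = 0;
    this.state = "closed";
  }

  // A failed trial call in half-open reopens the circuit immediately.
  recordFailure(): void {
    this.consecutiveFailures += 1;
    if (this.state === "half-open" || this.consecutiveFailures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = this.now();
    }
  }

  // Wraps an async call: fail fast while open, otherwise record the outcome.
  async fire<T>(call: () => Promise<T>): Promise<T> {
    if (this.currentState() === "open") {
      throw new Error("circuit open: call skipped");
    }
    try {
      const result = await call();
      this.recordSuccess();
      return result;
    } catch (err) {
      this.recordFailure();
      throw err;
    }
  }
}
```

While the circuit is open, the caller gets an immediate error instead of waiting on a struggling payment service. This is how a breaker keeps one failing dependency from tying up the services that call it.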
Asynchronous messaging via RabbitMQ for everything else. Inventory updates, logistics dispatch notifications, and alerting all use event-driven messaging. When an order is confirmed, an “order.confirmed” event is published to RabbitMQ, and the relevant services consume it independently. If the notification service is temporarily down, the message waits in the queue and the order isn't affected.
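The decoupling this buys can be shown with a small in-memory stand-in for the broker. This is a sketch only and does not use a RabbitMQ client. The `EventBus` class and the consumer names are hypothetical, and a real broker adds durability, acknowledgements, and redelivery that are omitted here.

```typescript
interface OrderConfirmed {
  type: "order.confirmed";
  orderId: string;
}

// In-memory stand-in for a broker: one queue per consuming service.
class EventBus {
  private queues = new Map<string, OrderConfirmed[]>();

  // Each consumer gets its own queue so one slow service never blocks another.
  bind(consumer: string): void {
    if (!this.queues.has(consumer)) {
      this.queues.set(consumer, []);
    }
  }

  // Fan the event out to every bound queue; the publisher does not wait for consumers.
  publish(event: OrderConfirmed): void {
    this.queues.forEach((queue) => queue.push(event));
  }

  // A consumer drains its own queue whenever it is running; returns how many events it handled.
  drain(consumer: string, handle: (event: OrderConfirmed) => void): number {
    const queue = this.queues.get(consumer) ?? [];
    const pending = queue.splice(0); // remove everything currently waiting
    pending.forEach(handle);
    return pending.length;
  }
}
```

If the notification consumer is down while two orders are confirmed, both events simply wait in its queue and are handled when it next drains. Other consumers keep processing their own copies in the meantime.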
The hardest communication problem was the logistics batch service (Partner B). This partner expected a single consolidated SFTP file every four hours, but orders trickled in continuously. We built a small aggregation service that consumed individual order events from RabbitMQ, batched them into 4-hour windows, generated the SFTP file in the partner's expected format, and uploaded it on schedule. This service was arguably the most custom piece of the entire architecture.
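The batching step can be sketched as bucketing events by the start of their 4-hour window. The event shape, the UTC alignment of the windows, and the comma-separated line format are assumptions for illustration, since the partner's real file layout is not described here.

```typescript
interface OrderEvent {
  orderId: string;
  confirmedAt: Date;
}

const WINDOW_MS = 4 * 60 * 60 * 1000; // 4-hour batching window

// Start of the 4-hour window (aligned to 00:00, 04:00, ... UTC) that contains the timestamp.
function windowStart(timestamp: Date): number {
  return Math.floor(timestamp.getTime() / WINDOW_MS) * WINDOW_MS;
}

// Groups individual order events into one batch per window.
function groupByWindow(events: OrderEvent[]): Map<number, OrderEvent[]> {
  const batches = new Map<number, OrderEvent[]>();
  for (const event of events) {
    const key = windowStart(event.confirmedAt);
    const batch = batches.get(key);
    if (batch) {
      batch.push(event);
    } else {
      batches.set(key, [event]);
    }
  }
  return batches;
}

// Placeholder rendering: one "orderId,timestamp" line per order.
function renderBatchFile(batch: OrderEvent[]): string {
  return batch.map((e) => `${e.orderId},${e.confirmedAt.toISOString()}`).join("\n");
}
```

In this sketch an order confirmed at 03:59:59 UTC lands in the 00:00 batch, while one confirmed at 04:00:00 opens the next batch. A scheduled job would render and upload each batch once its window closes.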
Containerization and Orchestration
Each of the seven services was containerized with Docker and deployed on a managed Kubernetes cluster (AWS EKS). We set up independent CI/CD pipelines using GitHub Actions — each service has its own repository, its own test suite, its own pipeline, and can be deployed to staging or production independently.
Deployment went from a 45-minute Sunday-night maintenance window to a rolling update that completes in under 4 minutes per service with zero downtime (Kubernetes rolling deployment strategy with readiness probes). The team now deploys individual services 8-12 times per week across the seven services combined, compared to the previous cadence of once per week for the entire monolith.
For the payment service specifically, we configured more conservative deployment guardrails: canary deployments that route a small percentage of payment traffic to the new version before full rollout, automatic rollback if the error rate exceeds a defined threshold, and a mandatory staging environment test against the payment gateway's sandbox before production deployment. The earlier outage had made leadership understandably cautious about payment-related changes.
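The promote-or-rollback decision behind such a canary can be sketched as a small pure function. The thresholds, the minimum sample size, and the names below are illustrative assumptions, not the values or tooling used in this project.

```typescript
interface CanarySample {
  requests: number; // payment requests served by the canary version so far
  errors: number;   // how many of those failed
}

type CanaryDecision = "promote" | "rollback" | "wait";

// Decide what to do with the canary once enough traffic has been observed.
function evaluateCanary(sample: CanarySample, maxErrorRate: number, minRequests: number): CanaryDecision {
  if (sample.requests < minRequests) {
    return "wait"; // too little traffic to judge either way
  }
  const errorRate = sample.errors / sample.requests;
  return errorRate > maxErrorRate ? "rollback" : "promote";
}
```

For example, with an assumed 1% threshold, a canary that failed 25 of 1,000 requests would be rolled back, while one that failed 5 of 1,000 would be promoted. A stricter variant could also roll back early, before the minimum sample is reached, if errors pile up quickly.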
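Technology Stack
Service Layer: Node.js services, with Opossum for circuit breaking
Gateway & Messaging: Kong (API gateway), RabbitMQ (asynchronous events)
Data & Infrastructure: PostgreSQL, Docker, Kubernetes on AWS EKS, GitHub Actions for CI/CD
Observability: OpenTelemetry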
Results
Deployment Speed:
Deploy time per service dropped from 45 minutes (full monolith) to under 4 minutes (rolling update, zero downtime). Maintenance windows eliminated entirely.
Deployment Frequency:
From 1 deployment per week (Sunday night, coordinated) to 8-12 deployments per week across services. Individual services are updated 1-3 times per week.
Incident Blast Radius:
Before: a bug in any module could take down all 12,000+ daily transactions. After: failures are isolated to the affected service. The payment service has had two incidents since migration — neither affected logistics or inventory sync.
Scaling Efficiency:
During the first post-migration peak season, the team scaled the Payment Processing and Order Ingestion services to 3x capacity while the other five services stayed at baseline — saving an estimated 60% on peak-season infrastructure costs compared to scaling the entire monolith.
New Integration Speed:
Onboarding a fourth logistics partner (REST API) took 3 weeks post-migration. Pre-migration, the team estimated the same work would have taken 8-10 weeks because of regression testing requirements and deployment coordination.
Developer Productivity:
Merge conflicts dropped substantially. Developers reported spending roughly 10% of their time on integration-related overhead, down from ~30% pre-migration.
Observability & Monitoring
We established the observability stack before extracting the first service — this was one of the most valuable decisions in the project. By instrumenting the monolith with OpenTelemetry first, the team could see request flows across the monolith's internal modules. When we started extracting services, the tracing data simply reflected the new boundaries without any additional instrumentation work.
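The observability stack is built on OpenTelemetry instrumentation, which gives the team visibility into request flows, latency, and error rates across services.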
Key Engineering Lessons
Instrument before you extract. We deployed OpenTelemetry on the monolith two weeks before extracting the first service. This gave the team baseline visibility into request flows, latency, and error rates — which meant we could immediately compare a newly extracted service's performance against its monolith-era baseline. Without this, we would have been flying blind on whether each extraction improved or degraded performance.
Split logistics into three services, not one. Our initial architecture proposed a single “Logistics Service” handling all three partners. During planning, we realised that Partner B (SFTP batch) and Partner A (REST real-time) had fundamentally different runtime characteristics, failure modes, and change frequencies. Combining them would have created a mini-monolith. The decision to split saved us from re-coupling the architecture we had just decoupled.
The shared database was the real migration bottleneck, not the code. Extracting service code was relatively straightforward. The hard part was untangling shared database tables. The monolith used a single PostgreSQL database where the payment module and the logistics module both read from an “orders” table with 47 columns. We had to decide which service owned which columns, build data synchronization events for cross-service reads, and migrate foreign key relationships — all while 12,000 orders per day continued flowing through the system.
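One way to picture the column-ownership decision is an explicit ownership map, with writes checked against it and change events emitted for other services to consume. This is a sketch under assumptions: the column names, service names, and event shape are hypothetical and are not taken from the actual “orders” schema.

```typescript
type Service = "payment" | "logistics" | "order-ingestion";

// Hypothetical column names; each column has exactly one owning service.
const columnOwner: Record<string, Service> = {
  payment_status: "payment",
  authorization_code: "payment",
  carrier_code: "logistics",
  tracking_number: "logistics",
  customer_ref: "order-ingestion",
};

// Returns the columns in a proposed write that the service does not own.
function foreignColumns(service: Service, columns: string[]): string[] {
  return columns.filter((column) => columnOwner[column] !== service);
}

interface ColumnChanged {
  owner: Service;
  orderId: string;
  column: string;
  value: string;
}

// The owner emits one change event per written column so that other services
// can keep their own read copies up to date instead of reading the shared table.
function changeEvents(service: Service, orderId: string, changes: Record<string, string>): ColumnChanged[] {
  const columns = Object.keys(changes);
  const rejected = foreignColumns(service, columns);
  if (rejected.length > 0) {
    throw new Error(`${service} does not own: ${rejected.join(", ")}`);
  }
  return columns.map((column) => ({ owner: service, orderId, column, value: String(changes[column]) }));
}
```

Making ownership explicit like this turns an accidental cross-service write into an immediate error rather than a silent coupling. The emitted events then replace direct reads of another service's columns.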
Team topology had to change alongside the architecture. The four developers who previously worked on the monolith initially continued reviewing each other's PRs across all services. This recreated the coordination overhead that microservices were supposed to eliminate. We restructured into two pairs, each responsible for a set of services end-to-end (deploy, monitor, fix). Deployment frequency doubled within two weeks of this change.
Spending Sunday Nights on Deployments?
If your deployments require maintenance windows, your scaling bills spike because you can't scale individual components, or a bug in one module can take down unrelated flows — the architecture is working against you. We start with a two-week instrumentation and analysis phase to map your actual request flows, identify natural service boundaries, and build a phased migration plan that doesn't require betting the business on a big-bang rewrite.










