Removing Data Silos: Building a Unified Analytics System

A US-based sustainability leader collects assessment data from smallholder farming communities across multiple countries in East Africa and Latin America. The assessments measure farm-level practices — soil management, input usage, yield, income — alongside environmental indicators derived from satellite imagery. The data feeds sustainability benchmarks, certification audits, and advisory programs used by global commodity buyers and development agencies.
Before this engagement, every reporting cycle was a multi-week manual grind. Field assessors submitted surveys from a mobile app — often delayed by days or weeks of patchy connectivity from remote rural areas. Satellite imagery was downloaded and processed manually by a GIS analyst. Partner organisations sent data in everything from structured CSVs to scanned PDF tables. The analytics team spent more time cleaning and reconciling data than analysing it.
Brainstack Technologies designed and built an automated data pipeline. It ingests data from the field survey platform, satellite imagery APIs, and partner submissions. It transforms, validates, and links the records. It delivers clean, analysis-ready datasets into a centralised warehouse. The reporting cycle moved from weeks of manual prep to near-real-time delivery.
Project Overview
- Client
- A US-based sustainability leader (name withheld under NDA)
- Industry
- Agricultural Sustainability & Development
- Data Sources
- Mobile field surveys (ODK-based), satellite imagery (Sentinel-2), partner CSV/Excel/PDF submissions, and government agricultural databases
- Engagement Duration
- Approximately 6 months
- Team
- 3 data engineers, 1 GIS/remote sensing specialist (part-time), 1 QA engineer
- Challenge
- Assessment data arriving from mobile field surveys (with connectivity delays of days or weeks), satellite imagery, partner CSV/PDF submissions, and government databases — all manually cleaned and reconciled by the analytics team over a multi-week reporting cycle
- Solution
- An automated Python/Airflow pipeline ingesting from all sources, standardising formats, validating against domain-specific rules, linking records across sources by farmer and plot identifiers, and delivering clean datasets to a PostgreSQL/PostGIS analytics warehouse
The Challenge
The organisation runs annual sustainability assessments across smallholder farming communities — primarily coffee and cocoa producers — spanning East Africa and Latin America. The assessment data feeds directly into sustainability benchmarks that global commodity buyers and certification bodies use to evaluate supply chain practices.
Data flows into the system from four distinct source types, each with its own challenges:
Mobile Field Surveys
Field assessors use a mobile data collection application to conduct interviews with farmers and record observations at the plot level. In rural areas across East Africa and Latin America, cellular connectivity is intermittent at best. Surveys are completed offline and synced when assessors reach a connectivity point — sometimes days or weeks after collection. The data arrives in batches, with timestamps that reflect sync time rather than collection time, which creates ordering challenges downstream.
Satellite Imagery
The organisation uses Sentinel-2 optical imagery to compute environmental indicators — vegetation indices (NDVI), land-use classification, and deforestation change detection — for the geographic areas where assessments are conducted. The optical/NIR bands used in this pipeline are 10m, while Sentinel-2 also includes 20m and 60m bands; revisit is typically around 5 days at the equator with the full constellation and can be shorter at mid-latitudes. Before automation, a GIS analyst manually downloaded, processed, and summarised this imagery for each reporting region — a process that consumed days of dedicated analyst time per region.
Partner Data Submissions
Multiple partner organisations in different countries submit their own assessment data. Some submit structured CSV exports from their own data systems, others submit Excel workbooks with varying column structures that change between cycles, and at least one submits scanned PDF reports that require manual data extraction. There was no schema standard across partners.
Government Agricultural Databases
Baseline statistics — crop prices, rainfall data, regional yield averages — are pulled from national agricultural databases and FAO datasets to contextualize the farm-level assessment data. These sources update irregularly, and some are only available as downloadable files from government websites with no API access.
The existing process: Before the pipeline was built, the analytics team manually downloaded data from each source, cleaned it in Excel and R scripts, reformatted it to a common structure, resolved identifier mismatches (different partners used different farmer ID schemes), and assembled the combined dataset for analysis. This process took weeks per reporting cycle. During that time, the team was essentially unavailable for actual analysis work. Worse, manual cleaning introduced subtle inconsistencies that were sometimes not discovered until the reporting stage — leading to correction cycles that further delayed outputs.


Our Solution
Data Ingestion Layer
We built a set of source-specific connectors — one per data source — each handling the specific authentication, format, and error-recovery requirements of its source:
Survey platform connector:Pulls completed survey submissions via the field data collection platform's API. Handles the sync-delay problem by using the submission's metadata timestamp (when the assessor recorded it) rather than the server receipt timestamp for ordering. Processes in batches, with idempotent ingestion — if a sync delivers duplicate submissions (common with intermittent connectivity), the pipeline deduplicates by submission UUID without creating duplicate records.
Sentinel-2 connector: Pulls publicly available Sentinel-2 satellite imagery for defined geographic areas of interest. Downloads are scheduled weekly, with automatic retry on failed tile downloads. Raw imagery is stored in S3 before processing.
Partner CSV/Excel connectors:Each partner's connector handles their specific column mapping and format quirks. We built a schema-mapping configuration layer so that when partners changed their column names (which happened multiple times during the engagement), the fix was a config update — not a code change. For partners submitting scanned PDFs, we built a semi-automated extraction pipeline using Tabula (Java-based, invoked from Python) for table extraction and a validation step that flags rows with low extraction confidence for human review.
Government/FAO data connectors: Scheduled scrapers for national agricultural databases and FAO statistical downloads. These sources have no APIs, so the connectors simulate the download process and parse the resulting files. Update checks run weekly; data is only re-ingested when the source file has changed (checked via hash comparison).
All connectors feed into a staging area in S3, organised by source, date, and batch ID. An Airflow DAG orchestrates the ingestion schedule, with per-connector retry logic, alerting on consecutive failures, and a dead-letter queue for submissions that fail validation repeatedly.
Transformation and Validation Pipeline
Raw data from the staging area passes through a four-stage transformation pipeline orchestrated by Airflow:
Stage 1 — Format standardisation:Each source's raw data is transformed to a common internal schema. Column names are mapped, data types are cast (dates, numerics, categoricals), and encoding issues are resolved (a recurring problem with partner Excel files containing mixed-encoding text from local languages).
Stage 2 — Geographic normalization:Farm and plot locations arrive in different formats — some as GPS coordinates (WGS84), some as village/district names, some as relative descriptions (“3km north of [village]”). The pipeline geocodes named locations using a curated gazetteer of farming communities built during the engagement, and all coordinates are projected to a common CRS (EPSG:4326) for consistency with the satellite imagery layer.
Stage 3 — Domain-specific validation:This is where agricultural domain knowledge is encoded into the pipeline. Validation rules flag records where, for example, reported coffee yield per hectare exceeds plausible ranges for the region, where a farmer's reported plot size contradicts the satellite-derived plot boundary by more than 30%, or where assessment dates fall outside the crop season calendar for that country. Flagged records are not discarded — they're routed to a review queue where the analytics team can inspect and either correct or approve them. This was a deliberate design decision: in agricultural data, “outliers” are sometimes the most important data points (a farmer reporting unusually high yield may be using an innovative practice worth studying).
Stage 4 — Record linking:The pipeline links records across sources using a composite key of farmer ID, plot ID, and assessment year. Where farmer IDs don't match across partners (each partner had their own ID scheme), we built a fuzzy matching layer using farmer name, village, and approximate GPS location to suggest matches, with human confirmation required before linking. Over the first two reporting cycles, the confirmed matches were fed back into the system to improve match confidence for subsequent cycles.
Geospatial Data Processing
The geospatial processing component runs as a separate Airflow DAG on dedicated compute-optimised EC2 instances — sized for the imagery workload and kept separate from the tabular data pipeline to avoid resource contention.
The processing flow for each geographic area of interest:
Sentinel-2 tiles covering the area are downloaded from CDSE platform APIs. Cloud masking is applied using the Scene Classification Layer (SCL) map to exclude cloud-covered pixels — a critical step in tropical regions where cloud cover can obscure 50–90% of optical images during rainy seasons.
Vegetation indices are computed from the cloud-free imagery — primarily NDVI (Normalized Difference Vegetation Index) for crop health assessment and EVI (Enhanced Vegetation Index) for areas with dense canopy cover where NDVI saturates.
Land-use classification is run using a supervised classification model trained on labeled ground-truth data from early assessment cycles. The model distinguishes between active cropland, fallow land, forest cover, and non-agricultural land use. Classification accuracy was validated against a held-out test set of manually labeled plots.
Change detection compares current-cycle imagery against the previous cycle to identify deforestation, land-use conversion, or significant changes in vegetation health — indicators that are increasingly relevant for sustainability certification and EUDR compliance reporting.
Per-plot statistics are extracted by overlaying the farmer plot boundaries (from GPS coordinates in the survey data) onto the processed imagery, computing mean and variance of vegetation indices within each plot boundary. These per-plot statistics are then joined with the tabular assessment data in the analytics warehouse.
The full automated geospatial processing run completes in hours. Before automation, the equivalent manual processing by a GIS analyst took weeks per reporting cycle.
Analytics-Ready Data Store
Clean, validated, linked datasets are delivered into a PostgreSQL 15 database with the PostGIS extension for geospatial queries. The schema is organised around three core entities: farmers, plots, and assessments, with satellite-derived indicators stored as plot-level attributes joined by plot ID and assessment year.
The analytics team connects directly via SQL clients and R/Python notebooks. We built a data dictionary (maintained as a versioned Markdown document in the project repository) that documents every table, column, data type, source, and transformation applied. This investment in documentation proved essential: within two months of deployment, the analytics team was writing their own queries against the warehouse without engineering support for most reporting needs.
For audit and reproducibility — critical in sustainability certification — the pipeline maintains full data lineage: every record in the warehouse can be traced back to its source submission, with a log of every transformation and validation step applied. This lineage is queryable, so when a certification auditor asks “where did this number come from?”, the answer is a database query — not a conversation with an analyst trying to remember what they did in a spreadsheet six months ago.


Technology Stack
Data Pipeline
Infrastructure
Data Quality & Lineage
Results
- Reporting cycle went from weeks of manual preparation to automated delivery. The analytics team now receives updated, validated datasets shortly after field data syncs — compared to the previous multi-week lag.
- Geospatial processing time dropped from weeks of manual GIS analyst work to hours of automated processing across all regions.
- The automated validation pipeline catches a meaningful percentage of incoming records per cycle as potential quality issues (range violations, identifier mismatches, implausible values). Before automation, these issues were discovered — if at all — during the analysis phase, often weeks after ingestion.
- The analytics team reclaimed the majority of their working time that had previously been consumed by data preparation, redirecting it to actual analysis and reporting.
- Cross-regional comparative analysis became possible for the first time. Previously, the manual reconciliation effort meant each region's data was analysed separately. The unified warehouse now supports queries that compare sustainability indicators across all operating regions in a single query.
- Satellite-derived environmental indicators (NDVI, land-use classification, deforestation change detection) are now integrated at the individual plot level with farm-level survey data — enabling the organisation to correlate farming practices with measurable environmental outcomes.
- Full data lineage supports certification audit requirements. When auditors request the provenance of a specific data point, the answer is a traceable pipeline record rather than a manual reconstruction.
- The deforestation change detection outputs have positioned the organisation to support EUDR (EU Deforestation Regulation) compliance reporting for commodity buyers in their supply chains — an emerging requirement that was not part of the original project scope but became possible because of the pipeline's satellite processing capability.
- Within two months of deployment, the analytics team was writing their own queries against the warehouse without engineering support for routine reporting needs.
- The data dictionary and schema documentation significantly reduced onboarding time for new analysts — from weeks of learning the manual process to days of familiarisation with the warehouse schema.


Key Engineering Lessons
Our initial validation rules were based on statistical distributions — flag anything outside 2 standard deviations. The analytics team quickly pointed out that this approach would flag every farmer in a drought-affected region as an outlier. We replaced the statistical rules with domain-specific rules co-designed with the research team: crop yield ranges calibrated per country and crop type, plot size plausibility checks cross-referenced against satellite boundaries, and assessment date validation against local crop season calendars. The resulting validation caught genuinely problematic records while preserving legitimate data variance.
Field assessors in remote rural areas sometimes sync the same batch of surveys multiple times when connectivity drops mid-upload. If the pipeline isn't idempotent — if it can't safely re-process the same submission without creating duplicates — you get corrupted data. Every pipeline stage was designed to be safely re-runnable: connectors deduplicate by submission UUID, transformations are deterministic, and the warehouse load uses upsert logic keyed on composite identifiers.
Early in development, we ran everything on the same infrastructure. When the Sentinel-2 processing kicked off, it consumed enough CPU and memory to slow down the tabular pipeline — which meant survey data ingestion stalled while satellite tiles were being processed. We moved geospatial processing to dedicated EC2 instances with a separate Airflow DAG. The cost increase was marginal; the operational stability improvement was significant.
The data dictionary we built — a versioned Markdown file documenting every table, column, type, source, and transformation — was the single highest-ROI deliverable of the project. Within two months, the analytics team stopped asking engineering questions about the data and started writing their own queries. When a new analyst joined the team, onboarding dropped from three weeks to four days. Documentation is not overhead — it's the bridge that turns a pipeline from an engineering artifact into an organisational asset.
Is Your Analytics Team Spending More Time Cleaning Data Than Analysing It?
If your reporting cycles are measured in weeks because data arrives from multiple sources in inconsistent formats, or your analysts spend most of their time on preparation rather than insight generation — you're in the same position this organisation was. We start with a data source audit: understanding what data you have, where it comes from, how often it changes, and what your analysts actually need. From there, we scope an automated pipeline that eliminates the manual preparation work and gives your team same-day access to clean, validated, analysis-ready data.










