Analytics Engineering · 2025–2026

Building a reporting data mart from a production app database

ECD Connect · CHW Connect (Grow Great) · dbt · PostgreSQL · Git

The problem
Two production databases built for an app and a CMS — not for reporting. Getting reliable programme data out of them required understanding a schema that was never designed to be queried directly.
What I built
A production reporting data mart using dbt: 17+ tested dimension and fact tables across two schemas, spanning education and health data for over 4,000 practitioners and 16,000 children and clients.
My role
Solo analytics engineer. Source database exploration, schema design, SQL, dbt modelling, testing, documentation, and data quality investigation — end to end.
Stack
PostgreSQL on Azure · dbt 1.11 · postgres_fdw · DBeaver · Git · Grafana (dashboards, upcoming)

ECD Connect supports over 4,000 early childhood practitioners across South Africa. CHW Connect — built on the same codebase for Grow Great — supports community health workers visiting pregnant clients and infants in underserved communities.

Both platforms generate rich programme data: attendance, child development assessments, practitioner registrations, income statements, CHW visits, clinical screenings. The problem was that all of it lived in a production database built for an app, not for analysis. Querying it directly meant dealing with misleading table names, unreliable flags, missing timestamps, and joins that only made sense once you understood the application logic behind them.

Partner organisations and funders needed reliable programme data. Dashboards were being built directly against production — fragile, slow, and dependent on whoever had written the query understanding all the source quirks. The goal was to build a clean, tested, documented reporting layer that would make all of that sustainable.

The hardest work happened before any SQL was written.

The source databases use naming conventions that made sense for the application but not for reporting. What should be called a "preschool" is named Classroom. What should be called a "class" is ClassroomGroup. The health platform's visit table has a PractitionerId column — inherited from the shared codebase — that is never populated in the health context. CHWs join to their user records via UserId, not Id, and there is exactly one row in production where those two values differ — which means joining on the wrong column mostly works, until it doesn't.

Understanding the database meant months of exploration: tracing foreign keys, reading application code, testing assumptions against real data, and building up a mental model of what the schema actually represented versus what it was named. Every phase of the mart build surfaced new corners of the database that had to be understood before they could be modelled correctly.

One example: Closing a client folder in the app triggers three different event types in the source database — only one of which sits under a close_folder parent type. The other two (baby_was_born and miscarriage) are top-level event types with no parent. Handling only the obvious path left folder-closed counts off by over 1,400 records. There were also 535 records where IsActive had been set to false with no corresponding event at all — a product bug, now ticketed.

Three-layer dbt architecture, following analytics engineering conventions:

Staging layer
One model per source table. Renames PascalCase columns to snake_case, applies plain-English table names, drops irrelevant columns (UI engagement tracking, ghost columns), and filters invalid rows (infinity timestamps, test users). Views, not tables.
Intermediate layer
Complex join logic that would otherwise be repeated across multiple mart models — cohort membership resolution, deduplication patterns, derived flags.
Mart layer
Analysis-ready dimension and fact tables. Denormalised where needed for dashboard performance. Full geography chains pre-resolved. All data quality decisions documented in model descriptions and the project README.
Testing & documentation
dbt data tests on every primary key and critical foreign key. Known anomalies documented with warn-level tests so they surface in the build output rather than silently corrupting downstream queries. One schema.yml per folder.

Three phases complete, two in progress:

Phase 1 — Education core ✅

Practitioners, preschools, classes, children, attendance, register completion, cohorts. The first end-to-end slice proving the pipeline worked.

Phase 2 — Education depth ✅

Child progress reports and skill observations (with reverse-scoring logic), DBE registration tracking (with history-based recovery for a source bug), practitioner self-assessment forms, income statements. Ten models, all with documented data quality handling.

Phase 3 — Health mart ✅

New schema (mart_health), new source database (chwconnect). Community health workers, pregnant clients, child clients, and CHW visits. Over 70,000 visits modelled. CHW resolved via dual caregiver path pre-computed so downstream queries don't repeat the logic.

Phase 4 & 5 — Upcoming

Visit responses (clinical data — HIV status, MUAC, nutrition, immunisation), growth measurements, referrals, and a combined de-identified dataset spanning both platforms for government and funder reporting.

Downstream dashboards that previously queried production directly now run off a clean, tested reporting layer. Clinical and programme queries that required complex multi-table joins with source-specific quirks hardcoded into every panel are now single-table queries against reliable mart tables.

The mart is the foundation for the combined de-identified dataset that will support government planning and funder reporting — the long-term goal of the ECD Connect data infrastructure.

Scale: Over 4,000 practitioners and community health workers. More than 16,000 children and clients. 70,000+ CHW visits. Two platforms, two source databases, one reporting layer.