ECD Connect · CHW Connect (Grow Great) · dbt · PostgreSQL · Git
ECD Connect supports over 4,000 early childhood practitioners across South Africa. CHW Connect — built on the same codebase for Grow Great — supports community health workers visiting pregnant clients and infants in underserved communities.
Both platforms generate rich programme data: attendance, child development assessments, practitioner registrations, income statements, CHW visits, clinical screenings. The problem was that all of it lived in a production database built for an app, not for analysis. Querying it directly meant dealing with misleading table names, unreliable flags, missing timestamps, and joins that only made sense once you understood the application logic behind them.
Partner organisations and funders needed reliable programme data. Dashboards were being built directly against production — fragile, slow, and dependent on whoever had written the query understanding all the source quirks. The goal was to build a clean, tested, documented reporting layer that would make all of that sustainable.
The hardest work happened before any SQL was written.
The source databases use naming conventions that made sense for the application but not for reporting. What should be called a "preschool" is named Classroom. What should be called a "class" is ClassroomGroup. The health platform's visit table has a PractitionerId column — inherited from the shared codebase — that is never populated in the health context. CHWs join to their user records via UserId, not Id, and there is exactly one row in production where those two values differ — which means joining on the wrong column mostly works, until it doesn't.
Understanding the database meant months of exploration: tracing foreign keys, reading application code, testing assumptions against real data, and building up a mental model of what the schema actually represented versus what it was named. Every phase of the mart build surfaced new corners of the database that had to be understood before they could be modelled correctly.
close_folder parent type. The other two (baby_was_born and miscarriage) are top-level event types with no parent. Handling only the obvious path left folder-closed counts off by over 1,400 records. There were also 535 records where IsActive had been set to false with no corresponding event at all — a product bug, now ticketed.
Three-layer dbt architecture, following analytics engineering conventions:
Three phases complete, two in progress:
Practitioners, preschools, classes, children, attendance, register completion, cohorts. The first end-to-end slice proving the pipeline worked.
Child progress reports and skill observations (with reverse-scoring logic), DBE registration tracking (with history-based recovery for a source bug), practitioner self-assessment forms, income statements. Ten models, all with documented data quality handling.
New schema (mart_health), new source database (chwconnect). Community health workers, pregnant clients, child clients, and CHW visits. Over 70,000 visits modelled. CHW resolved via dual caregiver path pre-computed so downstream queries don't repeat the logic.
Visit responses (clinical data — HIV status, MUAC, nutrition, immunisation), growth measurements, referrals, and a combined de-identified dataset spanning both platforms for government and funder reporting.
Downstream dashboards that previously queried production directly now run off a clean, tested reporting layer. Clinical and programme queries that required complex multi-table joins with source-specific quirks hardcoded into every panel are now single-table queries against reliable mart tables.
The mart is the foundation for the combined de-identified dataset that will support government planning and funder reporting — the long-term goal of the ECD Connect data infrastructure.