Data Mart Case Study | Kim Tichmann

At a glance

The problem

Two production databases optimised for app and CMS functionality, not for reporting. I built a dedicated reporting layer so that programme data could be queried reliably and efficiently, without needing to understand the underlying application logic.

What I built

A production reporting data mart using dbt: 17+ tested dimension and fact tables across two schemas, spanning education and health data for over 4,000 practitioners and 16,000 children and clients.

My role

Source database exploration, schema design, SQL, dbt modelling, testing, and documentation.

Stack

PostgreSQL on Azure · dbt · DBeaver · Git · Grafana (dashboards)

Context

ECD Connect and CHW Connect support over 4,000 early childhood practitioners and community health workers across South Africa.

Both platforms generate rich programme data: attendance, child development assessments, practitioner registrations, income statements, CHW visits, clinical screenings. All of it lived in production databases optimised for application logic. The dashboards I was building directly against production were slow and computationally costly, and they created a single point of failure: reliable reporting depended entirely on one person knowing the schema. The goal was a clean, tested reporting layer that captured that knowledge in the models themselves, so any analyst could query reliably without needing to understand the underlying application logic.

Source database exploration

The schema was created for application logic rather than reporting logic.

Understanding the source database meant months of exploration: tracing foreign keys, reading application code, testing assumptions against real data, and building up a mental model of what the schema actually represented versus what its names suggested.

Building the data mart also meant an opportunity to fix naming that had drifted from what the data actually represents. What should be called a "preschool" is named Classroom. What should be called a "class" is ClassroomGroup. Those get corrected once at the staging layer so nothing downstream inherits the confusion.

One example: closing a client folder in the app triggers three different event types in the source database, only one of which sits under a close_folder parent type. The other two (baby_was_born and miscarriage) are top-level event types with no parent. Handling only the obvious path left folder-closed counts off by over 1,400 records.
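
The gap can be reproduced on toy data. The sketch below is a minimal illustration, not the production model: it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, and the table layout, the client_id column, and the client_moved_away and home_visit event types are hypothetical. Only close_folder, baby_was_born, and miscarriage come from the source database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table event_record_type (
        id integer primary key,
        name text,
        parent_id integer
    );
    create table event_record (
        id integer primary key,
        client_id integer,              -- hypothetical column name
        event_record_type_id integer
    );

    insert into event_record_type values
        (1, 'close_folder', null),
        (2, 'client_moved_away', 1),    -- hypothetical child of close_folder
        (3, 'baby_was_born', null),     -- top-level, no parent
        (4, 'miscarriage', null),       -- top-level, no parent
        (5, 'home_visit', null);        -- hypothetical unrelated event

    insert into event_record values
        (10, 100, 2),
        (11, 101, 3),
        (12, 102, 4),
        (13, 103, 5);
""")

# Only events whose parent type is close_folder.
OBVIOUS_PATH = """
    select count(distinct er.client_id)
    from event_record er
    join event_record_type child on er.event_record_type_id = child.id
    left join event_record_type parent on child.parent_id = parent.id
    where parent.name = 'close_folder'
"""

# Same query, plus the two top-level types that also close a folder.
ALL_PATHS = """
    select count(distinct er.client_id)
    from event_record er
    join event_record_type child on er.event_record_type_id = child.id
    left join event_record_type parent on child.parent_id = parent.id
    where parent.name = 'close_folder'
       or child.name in ('baby_was_born', 'miscarriage')
"""

obvious_count = conn.execute(OBVIOUS_PATH).fetchone()[0]
all_paths_count = conn.execute(ALL_PATHS).fetchone()[0]
print(obvious_count, all_paths_count)
```

On this toy data the obvious path finds one closed folder and the combined condition finds three. The left join matters in the second query, because the top-level types have no parent row to match.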

Approach

Three-layer dbt architecture, following analytics engineering conventions:

Staging layer

One model per source table. Renames PascalCase columns to snake_case, applies plain-English table names, drops irrelevant columns (UI engagement tracking, ghost columns), and filters invalid rows (infinity timestamps, test users). Views, not tables.
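
As a rough sketch of what one staging model does, the example below builds a view over a toy copy of the Classroom table. It runs on SQLite through Python's sqlite3 module rather than on PostgreSQL with dbt. The column names (Name, InsertedDate, LastBannerSeenAt), the view name stg_preschools, and the sample rows are assumptions for illustration, and the text value 'infinity' stands in for PostgreSQL's infinity timestamp.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Toy source table; only the table name comes from the real schema.
    create table "Classroom" (
        "Id" integer primary key,
        "Name" text,
        "InsertedDate" text,
        "LastBannerSeenAt" text         -- hypothetical UI-tracking column
    );
    insert into "Classroom" values
        (1, 'Sunrise Preschool', '2023-02-01', '2023-02-03'),
        (2, 'Broken Row', 'infinity', null);

    -- Staging model as a view: plain-English name, snake_case columns,
    -- UI-tracking column dropped, invalid timestamp rows filtered out.
    create view stg_preschools as
    select
        "Id"            as preschool_id,
        "Name"          as preschool_name,
        "InsertedDate"  as created_at
    from "Classroom"
    where "InsertedDate" <> 'infinity';
""")

cursor = conn.execute("select * from stg_preschools order by preschool_id")
columns = [col[0] for col in cursor.description]
rows = cursor.fetchall()
print(columns)
print(rows)
```

Because the rename happens once in the view, every model downstream sees preschool_id and preschool_name and never the Classroom naming.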

Intermediate layer

Complex join logic that would otherwise be repeated across multiple mart models: cohort membership resolution, deduplication patterns, derived flags.

Mart layer

Analysis-ready dimension and fact tables. Denormalised where needed for dashboard performance. Full geography chains pre-resolved. All data quality decisions documented in model descriptions and the project README.

Testing & documentation

dbt data tests on every primary key and critical foreign key. Known anomalies documented with warn-level tests so they surface in the build output rather than silently corrupting downstream queries. One schema.yml per folder.

What was built

Three phases complete, two upcoming:

Phase 1: Education core ✅

Practitioners, preschools, classes, children, attendance, register completion, cohorts. The first end-to-end slice proving the pipeline worked.

Phase 2: Education depth ✅

Child progress reports and skill observations (with reverse-scoring logic), DBE registration tracking (with history-based recovery for a source bug), practitioner self-assessment forms, income statements. Ten models, all with documented data quality handling.

Phase 3: Health mart ✅

New schema (mart_health), new source database (chwconnect). Community health workers, pregnant clients, child clients, and CHW visits. Over 70,000 visits modelled. The CHW for each client is resolved via the dual caregiver path and pre-computed, so downstream queries don't repeat the logic.

Phase 4 & 5: Upcoming

Visit responses (clinical data: HIV status, MUAC, nutrition, immunisation), growth measurements, referrals, and a combined de-identified dataset spanning both platforms for government and funder reporting.

What it enables

Downstream dashboards can now query a clean, tested reporting layer rather than production directly. Clinical and programme queries that required complex multi-table joins with source-specific quirks hardcoded into every panel are now single-table queries against reliable mart tables.

The mart is the foundation for the combined de-identified dataset that will support government planning and funder reporting, the long-term goal of the ECD Connect data infrastructure.

Scale: Over 4,000 practitioners and community health workers. More than 16,000 children and clients. 70,000+ CHW visits. Two platforms, two source databases, one reporting layer.

Code examples

Three excerpts from the working codebase. Each one reflects a decision that had to be figured out from the source data, not assumed from the schema.

1. Dual caregiver path — stg_chwconnect__child_clients.sql

Child clients in CHW Connect can be linked to a CHW via two different paths: a direct Caregiver record (CaregiverId), or a Mother record (MotherCaregiverId). Neither path is guaranteed to be populated. The COALESCE pattern below resolves this at the staging layer so every downstream model and every Grafana query uses the same logic without repeating it. The source Visit table has a PractitionerId column inherited from the shared ECD Connect codebase — it is never populated in the health context, so CHW resolution goes through the client record instead.

-- CHW resolved via dual caregiver path:
--   primary:  Infant.CaregiverId → Caregiver.HealthCareWorkerId
--   fallback: Infant.MotherCaregiverId → Mother.HealthCareWorkerId
-- COALESCE used throughout — same pattern as all Grafana queries.
--
-- Source note: Visit.PractitionerId is a ghost column from the shared
-- ECD Connect codebase. It is never populated in the health context.
-- CHW resolution goes via the client record instead.

folder_close_events as (

    select
        er."InfantId"                   as child_client_id,
        max(er."InsertedDate")          as folder_closed_at

    from chwconnect."EventRecord" er
    join chwconnect."EventRecordType" ert_child
        on er."EventRecordTypeId" = ert_child."Id"
    -- inner join: the filter on ert_parent below discards unmatched rows anyway
    join chwconnect."EventRecordType" ert_parent
        on ert_child."ParentId" = ert_parent."Id"

    where ert_parent."Name" = 'close_folder'
      and er."InfantId" is not null

    group by er."InfantId"

),

renamed as (

    select
        i."Id"                              as child_client_id,

        -- resolved CHW via caregiver path
        coalesce(
            cg."HealthCareWorkerId",
            m."HealthCareWorkerId"
        )                                   as chw_id,

        ...

    from infants i
    left join caregivers    cg  on i."CaregiverId"       = cg."Id"
    left join mothers       m   on i."MotherCaregiverId" = m."Id"
    left join health_care_workers hw
        on coalesce(
            cg."HealthCareWorkerId",
            m."HealthCareWorkerId"
        ) = hw."Id"
    left join folder_close_events fce on i."Id" = fce.child_client_id

)
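
To see how the COALESCE resolves each case, the sketch below runs the same join shape on toy rows. It is an illustration under assumptions rather than the production model: SQLite via Python's sqlite3 module stands in for PostgreSQL, the tables hold only the columns needed, and all ids are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table caregivers ("Id" integer primary key, "HealthCareWorkerId" integer);
    create table mothers    ("Id" integer primary key, "HealthCareWorkerId" integer);
    create table infants (
        "Id" integer primary key,
        "CaregiverId" integer,
        "MotherCaregiverId" integer
    );

    insert into caregivers values (1, 500);
    insert into mothers    values (2, 600);
    insert into infants values
        (10, 1,    null),   -- caregiver path only
        (11, null, 2),      -- mother path only
        (12, 1,    2),      -- both populated: caregiver path wins
        (13, null, null);   -- neither populated: chw_id stays null
""")

rows = conn.execute("""
    select
        i."Id" as child_client_id,
        coalesce(cg."HealthCareWorkerId", m."HealthCareWorkerId") as chw_id
    from infants i
    left join caregivers cg on i."CaregiverId"       = cg."Id"
    left join mothers    m  on i."MotherCaregiverId" = m."Id"
    order by i."Id"
""").fetchall()

for child_client_id, chw_id in rows:
    print(child_client_id, chw_id)
```

The last row shows why downstream models still need to tolerate a null chw_id: when neither path is populated, there is nothing for the COALESCE to fall back on.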

2. Role derivation — schema.yml, dim_practitioners

The source IsPrincipal flag records what a user selected during onboarding, not whether they completed setup. Users commonly pick a role and then exit the flow without finishing, which leaves the source flag unreliable for reporting. The derived role column uses PrincipalHierarchy as the authoritative signal, with preschool ownership as a fallback. The original flag is preserved in source_is_principal_selection for auditing the gap between selection and committed state.

      - name: role
        description: >
          Derived, reliable role of the practitioner. One of
          'principal', 'practitioner', or 'unknown'. See model
          description for the derivation rules. Prefer this over
          source_is_principal_selection for all role reporting.
        data_tests:
          - not_null
          - accepted_values:
              arguments:
                values: ['principal', 'practitioner', 'unknown']

      - name: source_is_principal_selection
        description: >
          Raw Practitioner.IsPrincipal value from source — the user's
          selection during onboarding. Nullable. Unreliable as a role
          signal because users may select a role without completing
          setup. Preserved for auditing the gap between onboarding
          selection and committed role (e.g. "what % of users who
          selected 'principal' actually completed preschool setup?").
          Do NOT use for role reporting — use the role column.

3. Ghost column rename and pre-computed latency — schema.yml, fact_chw_visits

Visit.PractitionerId is renamed to chw_id at the staging layer with a note explaining the source naming quirk, so downstream models and analysts don't need to know about it. visit_latency_days is calculated at mart build time rather than in each dashboard panel — negative values are expected and indicate visits that happened before the planned date.

      - name: chw_id
        description: >
          FK to dim_chws. The CHW who conducted the visit.
          Source column PractitionerId renamed to chw_id —
          it references HealthCareWorker.Id, not an ecdconnect
          practitioner. Visit.PractitionerId is a ghost column
          from the shared codebase and is not populated in the
          health context.

      - name: visit_latency_days
        description: >
          Days between planned and actual visit date. Negative
          when the visit occurred before the planned date.
          Null when either date is missing. Pre-calculated at
          mart build time so Grafana panels do not repeat
          the logic.
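
The calculation behind visit_latency_days can be sketched as follows. This is a minimal illustration, not the mart model itself: planned_date and actual_date are assumed column names, the rows are invented, and SQLite's julianday() (run through Python's sqlite3 module) stands in for PostgreSQL date subtraction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table visits (
        visit_id integer primary key,
        planned_date text,      -- assumed column name
        actual_date text        -- assumed column name
    );
    insert into visits values
        (1, '2024-03-01', '2024-03-05'),   -- visit happened late
        (2, '2024-03-10', '2024-03-08'),   -- visit happened early
        (3, '2024-03-10', null);           -- actual date missing
""")

latencies = conn.execute("""
    select
        visit_id,
        cast(round(julianday(actual_date) - julianday(planned_date)) as integer)
            as visit_latency_days
    from visits
    order by visit_id
""").fetchall()

for visit_id, visit_latency_days in latencies:
    print(visit_id, visit_latency_days)
```

The early visit comes out negative and the visit with a missing date comes out null, matching the column description, so dashboard panels can filter or bucket on the stored value directly.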

Building a reporting data mart from a production app database