dbt Tests and Data Quality Checks: Building Reliable Data Pipelines

Learn how to implement comprehensive data quality checks using dbt tests, from basic assertions to advanced streaming integration for real-time data validation.

Data quality is the foundation of trustworthy analytics. As data pipelines grow in complexity, ensuring data integrity becomes critical. dbt (data build tool) provides a robust testing framework that allows Analytics Engineers and Data Quality Analysts to define, execute, and monitor data quality checks throughout the transformation pipeline.

Understanding dbt's Testing Framework

dbt's testing approach treats data quality as code, enabling version control, peer review, and automated validation. Tests in dbt are essentially SELECT queries that return failing rows. If a test returns zero rows, it passes; any rows returned indicate failures that need attention.
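
For example, the built-in not_null test applied to order_id on fct_orders compiles to a query along these lines (a simplified sketch of the generated SQL; the analytics schema name is a placeholder and the real macro output adds some boilerplate):

-- Any row returned here is a failure; a passing test returns nothing
SELECT *
FROM analytics.fct_orders
WHERE order_id IS NULL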

Generic Tests vs. Singular Tests

dbt offers two primary testing approaches:

Generic tests are reusable, parameterized tests that can be applied to any column or model. The four built-in generic tests are:

  • unique: Ensures all values in a column are unique

  • not_null: Validates that a column contains no null values

  • accepted_values: Confirms values match a predefined list

  • relationships: Enforces referential integrity between tables

Singular tests are custom SQL queries stored in the tests/ directory, providing flexibility for complex business logic validation.

Implementing Basic Data Quality Checks

Let's start with a practical example. Consider a customer orders model where we need to ensure data quality:

# models/schema.yml
version: 2

models:
  - name: fct_orders
    description: "Fact table containing order transactions"
    columns:
      - name: order_id
        description: "Unique identifier for each order"
        tests:
          - unique
          - not_null

      - name: customer_id
        description: "Foreign key to customers dimension"
        tests:
          - not_null
          - relationships:
              to: ref('dim_customers')
              field: customer_id

      - name: order_status
        description: "Current status of the order"
        tests:
          - accepted_values:
              values: ['pending', 'processing', 'shipped', 'delivered', 'cancelled']

      - name: order_amount
        description: "Total order amount in USD"
        tests:
          - not_null
          - dbt_utils.expression_is_true:
              expression: ">= 0"

Running dbt test executes all defined tests and reports failures, enabling quick identification of data quality issues.

Advanced Testing with Custom Assertions

Beyond generic tests, singular tests enable complex validations. Create a file tests/assert_order_totals_match.sql:

-- Validate that order totals equal sum of line items
WITH order_totals AS (
    SELECT
        order_id,
        order_amount
    FROM {{ ref('fct_orders') }}
),

line_item_totals AS (
    SELECT
        order_id,
        SUM(quantity * unit_price) AS calculated_total
    FROM {{ ref('fct_order_lines') }}
    GROUP BY order_id
)

SELECT
    o.order_id,
    o.order_amount,
    l.calculated_total,
    ABS(o.order_amount - l.calculated_total) AS difference
FROM order_totals o
INNER JOIN line_item_totals l
    ON o.order_id = l.order_id
WHERE ABS(o.order_amount - l.calculated_total) > 0.01

This test ensures financial accuracy by validating that order totals match the sum of their line items, with a small tolerance for rounding differences.
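
If the same tolerance check is needed on other fact tables, the singular test can be promoted to a custom generic test. The sketch below is a hypothetical test named totals_match_within_tolerance stored under tests/generic/; it parameterizes the line-items model and the tolerance, while the order_id join key stays hard-coded for brevity:

-- tests/generic/test_totals_match_within_tolerance.sql
{% test totals_match_within_tolerance(model, column_name, line_items, tolerance=0.01) %}

-- Return one row per order whose reported total drifts from the
-- recomputed line-item total by more than the allowed tolerance
SELECT
    o.order_id,
    o.{{ column_name }} AS reported_total,
    l.calculated_total
FROM {{ model }} AS o
INNER JOIN (
    SELECT
        order_id,
        SUM(quantity * unit_price) AS calculated_total
    FROM {{ ref(line_items) }}
    GROUP BY order_id
) AS l
    ON o.order_id = l.order_id
WHERE ABS(o.{{ column_name }} - l.calculated_total) > {{ tolerance }}

{% endtest %}

Once defined, the test can be attached to order_amount in schema.yml with a line_items argument pointing at fct_order_lines, keeping the business rule in one place.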

Test Coverage and Quality Metrics

Measuring test coverage helps identify gaps in your data quality strategy. Tools such as dbt-coverage (a standalone CLI installed with pip, rather than a dbt package) can report which models and columns lack documentation and tests:

dbt-coverage compute doc --cov-report coverage-doc.json
dbt-coverage compute test --cov-report coverage-test.json

Aim for comprehensive coverage on critical business metrics and primary keys. Not every column requires testing, but understanding your coverage helps prioritize testing efforts.

Streaming Integration and Real-Time Data Quality

Modern data architectures increasingly incorporate streaming data. While dbt traditionally operates on batch transformations, integrating with streaming platforms enables near-real-time quality validation.

Streaming Data Quality Integration

Kafka management platforms can complement dbt's testing framework for streaming scenarios. Here's how to architect an integrated approach:

Architecture Pattern:

  1. Stream events flow through Kafka topics

  2. Governance platforms validate schema compliance and basic data quality rules

  3. Data lands in your data warehouse, where dbt models it with incremental materializations

  4. dbt tests run on micro-batches to validate transformations

  5. Failed tests trigger alerts through monitoring systems

Example incremental model with streaming considerations:

-- models/fct_streaming_events.sql
{{
    config(
        materialized='incremental',
        unique_key='event_id',
        on_schema_change='fail',
        tags=['streaming']
    )
}}

SELECT
    event_id,
    user_id,
    event_type,
    event_timestamp,
    properties,
    _kafka_partition,
    _kafka_offset
FROM {{ source('kafka_raw', 'user_events') }}

{% if is_incremental() %}
    WHERE event_timestamp > (SELECT MAX(event_timestamp) FROM {{ this }})
{% endif %}

Corresponding tests for streaming data:

# models/schema.yml
version: 2

models:
  - name: fct_streaming_events
    tests:
      # Ensure no duplicate events within a time window
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns:
            - event_id
            - event_timestamp
    columns:
      - name: event_timestamp
        tests:
          # Validate events aren't too far in the future
          - dbt_utils.expression_is_true:
              expression: "<= CURRENT_TIMESTAMP + INTERVAL '5 minutes'"
          # Flag stale rows (assumes the model retains only a rolling window of recent events)
          - dbt_utils.expression_is_true:
              expression: ">= CURRENT_TIMESTAMP - INTERVAL '7 days'"
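
The Kafka metadata columns carried through the model also allow ingestion-level checks as singular tests. As a sketch that assumes the _kafka_partition and _kafka_offset columns from the model above, the following flags any partition/offset pair that appears more than once, which usually means the same message was loaded twice:

-- tests/assert_no_duplicate_kafka_messages.sql
-- A partition/offset pair identifies a single message within a Kafka topic,
-- so any pair appearing more than once points to duplicate ingestion
SELECT
    _kafka_partition,
    _kafka_offset,
    COUNT(*) AS row_count
FROM {{ ref('fct_streaming_events') }}
GROUP BY _kafka_partition, _kafka_offset
HAVING COUNT(*) > 1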

Orchestrating Quality Checks

For streaming workflows, consider running dbt tests on a schedule (e.g., every 15 minutes) to catch issues quickly:

# .github/workflows/dbt-streaming-tests.yml
name: Streaming Data Quality Checks
on:
  schedule:
    - cron: '*/15 * * * *'  # Every 15 minutes

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install dbt and project packages
        # Swap dbt-snowflake for the adapter that matches your warehouse;
        # connection details (profiles.yml) must be provided, e.g. via repository secrets
        run: |
          pip install dbt-core dbt-snowflake
          dbt deps
      - name: Run dbt tests on streaming models
        run: |
          dbt test --select tag:streaming --fail-fast

Best Practices for Data Quality at Scale

  1. Start with critical paths: Focus testing efforts on models that directly impact business decisions

  2. Test early and often: Run tests in development, CI/CD, and production environments

  3. Document test intent: Add clear descriptions to help team members understand validation logic

  4. Set appropriate thresholds: Use warn_if and error_if configurations for graceful degradation (see the sketch after this list)

  5. Monitor test performance: Track test execution times to prevent bottlenecks

  6. Integrate with alerting: Connect test failures to Slack, PagerDuty, or other notification systems
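
As a sketch of point 4, thresholds can be set in a test's config block. The hypothetical singular test below fails the run only when more than 100 rows violate the rule, while any smaller number of failures raises a warning (the cutoffs are illustrative, not recommendations):

-- tests/assert_no_negative_order_amounts.sql (hypothetical example)
-- With severity 'error', dbt evaluates error_if first and then warn_if:
-- more than 100 failing rows fails the run, 1-100 rows only warns
{{
    config(
        severity = 'error',
        error_if = '>100',
        warn_if = '>0'
    )
}}

SELECT
    order_id,
    order_amount
FROM {{ ref('fct_orders') }}
WHERE order_amount < 0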

Conclusion

dbt's testing framework transforms data quality from an afterthought into a first-class concern. By combining generic tests for common patterns, singular tests for complex business logic, and integration with streaming platforms, teams can build resilient data pipelines that maintain quality from source to consumption.

The key is treating tests as living documentation that evolves with your data models. As your understanding of data quality requirements deepens, continuously refine your testing strategy to catch issues before they impact stakeholders.
