Incremental models in dbt promise faster builds. In practice, late-arriving data silently corrupts them. Here’s the pattern that holds up.
## The naive approach (breaks with late data)
```sql
-- models/orders_daily.sql
{{ config(materialized='incremental') }}

SELECT *
FROM {{ source('raw', 'orders') }}

{% if is_incremental() %}
WHERE created_at > (SELECT MAX(created_at) FROM {{ this }})
{% endif %}
```
This fails the moment an event created three days ago lands in the raw table today: it falls below the `MAX(created_at)` watermark, so the filter skips it on this run and every run after.
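To see why the watermark filter is lossy, here is a minimal Python simulation. The table contents are hypothetical; the filter mirrors the `MAX(created_at)` predicate above:

```python
from datetime import date

def incremental_filter(raw_rows, watermark):
    """Keep only rows strictly newer than the target table's max timestamp."""
    return [r for r in raw_rows if r["created_at"] > watermark]

# Run 1 already loaded data through Jan 10, so that's the watermark.
watermark = date(2024, 1, 10)

# Run 2's raw batch: one fresh row, one late arrival from Jan 8.
raw_batch = [
    {"order_id": 101, "created_at": date(2024, 1, 11)},  # new
    {"order_id": 42,  "created_at": date(2024, 1, 8)},   # late arrival
]

loaded = incremental_filter(raw_batch, watermark)
# The Jan 8 row is filtered out -- and every future run has an even
# later watermark, so it is skipped forever.
print([r["order_id"] for r in loaded])  # [101]
```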
## The robust pattern: lookback window
```sql
{{ config(
    materialized='incremental',
    unique_key='order_id',
    incremental_strategy='merge'
) }}

WITH source AS (
    SELECT *
    FROM {{ source('raw', 'orders') }}
    {% if is_incremental() %}
    -- lookback 3 days to catch late arrivals
    WHERE created_at >= DATEADD('day', -3, (SELECT MAX(created_at) FROM {{ this }}))
    {% endif %}
)

SELECT * FROM source
```
The `DATEADD('day', -3, ...)` lookback re-reads the last three days of source data, so late-arriving events are caught; the `unique_key` plus `merge` strategy deduplicates the rows that get reprocessed.
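The two halves of the pattern can be sketched in Python: a lookback filter widens the window, and an upsert keyed on `order_id` stands in for the warehouse `MERGE`. The in-memory `target` dict is a stand-in for the incremental table; row contents are hypothetical:

```python
from datetime import date, timedelta

def incremental_merge(target, raw_rows, lookback_days=3):
    """Lookback + merge: reprocess the last N days of source rows and
    upsert on order_id, so re-read rows overwrite instead of duplicate."""
    watermark = max(r["created_at"] for r in target.values())
    cutoff = watermark - timedelta(days=lookback_days)
    for row in raw_rows:
        if row["created_at"] >= cutoff:       # lookback filter
            target[row["order_id"]] = row     # MERGE on the unique key

# Target table already holds orders through Jan 10.
target = {
    1: {"order_id": 1, "created_at": date(2024, 1, 9)},
    2: {"order_id": 2, "created_at": date(2024, 1, 10)},
}
raw = [
    {"order_id": 2, "created_at": date(2024, 1, 10)},  # re-read, deduped by key
    {"order_id": 3, "created_at": date(2024, 1, 8)},   # late arrival, now caught
]
incremental_merge(target, raw)
print(sorted(target))  # [1, 2, 3]
```

The re-read row for `order_id` 2 overwrites itself rather than duplicating, which is exactly what `unique_key` + `merge` buys you.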
## Mermaid: incremental merge flow
```mermaid
graph LR
    A[Raw orders table] -->|filter last N+3 days| B[Staging CTE]
    B -->|MERGE on order_id| C[Existing incremental table]
    C --> D[Updated table]
```
## What went wrong
The default `append` strategy meant duplicate rows after a Snowflake task retry replayed a batch. Switching to `merge` with a `unique_key` fixed it immediately.
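A toy Python model of the retry behavior (hypothetical in-memory tables, not the warehouse implementation) shows why append duplicates and merge does not:

```python
def append(table, batch):
    # Append strategy: a replayed batch lands a second time.
    table.extend(batch)

def merge(table_by_key, batch):
    # Merge strategy keyed on order_id: a replay overwrites the same keys.
    for row in batch:
        table_by_key[row["order_id"]] = row

batch = [{"order_id": 1, "amount": 10}, {"order_id": 2, "amount": 20}]

appended = []
append(appended, batch)
append(appended, batch)   # task retry replays the batch
print(len(appended))      # 4 -- duplicates

merged = {}
merge(merged, batch)
merge(merged, batch)      # retry is a no-op upsert
print(len(merged))        # 2 -- idempotent
```

The merge path is idempotent under retries, which is the property the append path lacked.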
## Checklist
- Use `incremental_strategy='merge'` with a `unique_key`
- Add a 2-3 day lookback window for late arrivals
- Test with `dbt test` after each full refresh
- Set `on_schema_change='append_new_columns'` for evolving schemas