Describe the bug
When using derived_columns to overwrite an existing source column (same alias as the source column name) with a SQL expression, the model fails at runtime with an UNRESOLVED_COLUMN error, even when include_source_columns is true.
The macro excludes the source column from source_columns_to_select to avoid duplicate column names in the output, which is correct. But this also removes it from the ldts_rsrc_data CTE entirely. By the time derived_columns tries to evaluate the expression, the column is simply not there.
There is no clean workaround. Using src_cols_required with include_source_columns: false triggers a second error where credits is carried through as an input but then selected twice in derived_columns, causing an ambiguous reference error. Both paths are broken.
Environment
- dbt version: dbt-core 1.11.0rc2
- datavault4dbt version: 1.16.0
- Database/Platform: Databricks
To Reproduce
Source table:
SELECT customer_id, customer_name, customer_segment, credits
FROM (
VALUES
('C001', 'Alice Brown', 'VIP', '1.2'),
('C002', 'Bob Smith', 'VIP', '1.5'),
('C003', 'Carol Jones', 'REGULAR', '33.2')
) AS t(customer_id, customer_name, customer_segment, credits)
Scenario 1: include_source_columns true, no src_cols_required (following the docs)
{{ config(materialized='view') }}
{%- set yaml_metadata -%}
source_model: 'erp_customers'
ldts: 'CURRENT_TIMESTAMP'
rsrc: '!test'
include_source_columns: True
enable_ghost_records: True
derived_columns:
credits:
value: 'TRY_CAST(credits AS DOUBLE)'
datatype: 'DOUBLE'
hashed_columns:
customer_hk:
- customer_id
hashdiff:
is_hashdiff: true
columns:
- customer_name
- customer_segment
- credits
{%- endset -%}
{{ datavault4dbt.stage(yaml_metadata=yaml_metadata) }}
Error:
Database Error in model stg_customers
[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter with name `credits`
cannot be resolved. Did you mean one of the following?
[`loaded_at`, `customer_id`, `record_source`, `customer_name`, `customer_segment`].
SQLSTATE: 42703; line 61 pos 11
The compiled SQL makes the root cause visible. credits is dropped from ldts_rsrc_data so the downstream derived_columns CTE has nothing to work with:
ldts_rsrc_data AS (
SELECT
CAST( CURRENT_TIMESTAMP as TIMESTAMP ) AS loaded_at,
CAST( 'test' as STRING ) AS record_source,
customer_id,
customer_name,
customer_segment
-- credits is missing here
FROM source_data
),
derived_columns AS (
SELECT
loaded_at,
record_source,
customer_id,
customer_name,
customer_segment,
TRY_CAST(credits AS DOUBLE) AS credits -- fails, credits doesn't exist upstream
FROM ldts_rsrc_data
),
Scenario 2: include_source_columns false, with src_cols_required (attempted workaround)
{{ config(materialized='view') }}
{%- set yaml_metadata -%}
source_model: 'erp_customers'
ldts: 'CURRENT_TIMESTAMP'
rsrc: '!test'
include_source_columns: False
enable_ghost_records: True
derived_columns:
credits:
value: 'TRY_CAST(credits AS DOUBLE)'
datatype: 'DOUBLE'
src_cols_required: credits
hashed_columns:
customer_hk:
- customer_id
hashdiff:
is_hashdiff: true
columns:
- customer_name
- customer_segment
- credits
{%- endset -%}
{{ datavault4dbt.stage(yaml_metadata=yaml_metadata) }}
Error:
Database Error in model stg_customers
[AMBIGUOUS_REFERENCE] Reference `credits` is ambiguous, could be:
[`derived_columns`.`credits`, `derived_columns`.`credits`].
SQLSTATE: 42704; line 78 pos 8
The compiled SQL shows that credits is now carried through correctly as an input, but it ends up selected twice in derived_columns, once as the raw source column and once as the derived expression. This duplication then propagates through every subsequent CTE:
ldts_rsrc_data AS (
SELECT
CAST( CURRENT_TIMESTAMP as TIMESTAMP ) AS loaded_at,
CAST( 'test' as STRING ) AS record_source,
credits, -- raw input column
customer_id,
customer_name,
customer_segment
FROM source_data
),
derived_columns AS (
SELECT
loaded_at,
record_source,
credits, -- raw column selected here...
customer_id,
customer_name,
customer_segment,
TRY_CAST(credits AS DOUBLE) AS credits -- ...and again as derived expression
FROM ldts_rsrc_data
),
hashed_columns AS (
SELECT
loaded_at,
record_source,
credits, -- duplicated
customer_id,
customer_name,
customer_segment,
credits, -- duplicated
...
FROM derived_columns
),
...
columns_to_select AS (
SELECT
loaded_at,
record_source,
credits, -- duplicated
customer_id,
customer_name,
customer_segment,
credits, -- duplicated
customer_hk,
hashdiff
FROM hashed_columns
...
)
Expected behavior
In both scenarios, the output should contain a single credits column cast to DOUBLE. The overwrite pattern is documented at Derived Columns as a supported feature and should work out of the box without requiring workarounds or hitting ambiguous reference errors.
Screenshots
If applicable, add screenshots to help explain your problem.
Additional context
Add any other context about the problem here.
Describe the bug
When using
derived_columnsto overwrite an existing source column (same alias as the source column name) with a SQL expression, the model fails at runtime with anUNRESOLVED_COLUMNerror, even wheninclude_source_columnsistrue.The macro excludes the source column from
source_columns_to_selectto avoid duplicate column names in the output, which is correct. But this also removes it from theldts_rsrc_dataCTE entirely. By the timederived_columnstries to evaluate the expression, the column is simply not there.There is no clean workaround. Using
src_cols_requiredwithinclude_source_columns: falsetriggers a second error wherecreditsis carried through as an input but then selected twice inderived_columns, causing an ambiguous reference error. Both paths are broken.Environment
To Reproduce
Source table:
Scenario 1: include_source_columns true, no src_cols_required (following the docs)
{{ config(materialized='view') }} {%- set yaml_metadata -%} source_model: 'erp_customers' ldts: 'CURRENT_TIMESTAMP' rsrc: '!test' include_source_columns: True enable_ghost_records: True derived_columns: credits: value: 'TRY_CAST(credits AS DOUBLE)' datatype: 'DOUBLE' hashed_columns: customer_hk: - customer_id hashdiff: is_hashdiff: true columns: - customer_name - customer_segment - credits {%- endset -%} {{ datavault4dbt.stage(yaml_metadata=yaml_metadata) }}Error:
The compiled SQL makes the root cause visible.
creditsis dropped fromldts_rsrc_dataso the downstreamderived_columnsCTE has nothing to work with:Scenario 2: include_source_columns false, with src_cols_required (attempted workaround)
{{ config(materialized='view') }} {%- set yaml_metadata -%} source_model: 'erp_customers' ldts: 'CURRENT_TIMESTAMP' rsrc: '!test' include_source_columns: False enable_ghost_records: True derived_columns: credits: value: 'TRY_CAST(credits AS DOUBLE)' datatype: 'DOUBLE' src_cols_required: credits hashed_columns: customer_hk: - customer_id hashdiff: is_hashdiff: true columns: - customer_name - customer_segment - credits {%- endset -%} {{ datavault4dbt.stage(yaml_metadata=yaml_metadata) }}Error:
The compiled SQL shows that
creditsis now carried through correctly as an input, but it ends up selected twice inderived_columns, once as the raw source column and once as the derived expression. This duplication then propagates through every subsequent CTE:Expected behavior
In both scenarios, the output should contain a single
creditscolumn cast toDOUBLE. The overwrite pattern is documented at Derived Columns as a supported feature and should work out of the box without requiring workarounds or hitting ambiguous reference errors.Screenshots
If applicable, add screenshots to help explain your problem.
Additional context
Add any other context about the problem here.