Skip to content

[BUG] Databricks: Overwriting a source column via derived_columns fails with UNRESOLVED_COLUMN #426

@cramirez98

Description

@cramirez98

Describe the bug

When using derived_columns to overwrite an existing source column (same alias as the source column name) with a SQL expression, the model fails at runtime with an UNRESOLVED_COLUMN error, even when include_source_columns is true.

The macro excludes the source column from source_columns_to_select to avoid duplicate column names in the output, which is correct. But this also removes it from the ldts_rsrc_data CTE entirely. By the time derived_columns tries to evaluate the expression, the column is simply not there.

There is no clean workaround. Using src_cols_required with include_source_columns: false triggers a second error where credits is carried through as an input but then selected twice in derived_columns, causing an ambiguous reference error. Both paths are broken.

Environment

  • dbt version: dbt-core 1.11.0rc2
  • datavault4dbt version: 1.16.0
  • Database/Platform: Databricks

To Reproduce

Source table:

SELECT customer_id, customer_name, customer_segment, credits
FROM (
    VALUES
        ('C001', 'Alice Brown', 'VIP',     '1.2'),
        ('C002', 'Bob Smith',   'VIP',     '1.5'),
        ('C003', 'Carol Jones', 'REGULAR', '33.2')
) AS t(customer_id, customer_name, customer_segment, credits)

Scenario 1: include_source_columns true, no src_cols_required (following the docs)

{{ config(materialized='view') }}

{%- set yaml_metadata -%}
source_model: 'erp_customers'
ldts: 'CURRENT_TIMESTAMP'
rsrc: '!test'
include_source_columns: True
enable_ghost_records: True

derived_columns:
  credits:
    value: 'TRY_CAST(credits AS DOUBLE)'
    datatype: 'DOUBLE'

hashed_columns:
  customer_hk:
    - customer_id
  hashdiff:
    is_hashdiff: true
    columns:
      - customer_name
      - customer_segment
      - credits
{%- endset -%}

{{ datavault4dbt.stage(yaml_metadata=yaml_metadata) }}

Error:

Database Error in model stg_customers
[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter with name `credits`
cannot be resolved. Did you mean one of the following?
[`loaded_at`, `customer_id`, `record_source`, `customer_name`, `customer_segment`].
SQLSTATE: 42703; line 61 pos 11

The compiled SQL makes the root cause visible. credits is dropped from ldts_rsrc_data so the downstream derived_columns CTE has nothing to work with:

ldts_rsrc_data AS (
  SELECT
    CAST( CURRENT_TIMESTAMP as TIMESTAMP ) AS loaded_at,
    CAST( 'test' as STRING ) AS record_source,
    customer_id,
    customer_name,
    customer_segment
    -- credits is missing here
  FROM source_data
),

derived_columns AS (
  SELECT
    loaded_at,
    record_source,
    customer_id,
    customer_name,
    customer_segment,
    TRY_CAST(credits AS DOUBLE) AS credits  -- fails, credits doesn't exist upstream
  FROM ldts_rsrc_data
),

Scenario 2: include_source_columns false, with src_cols_required (attempted workaround)

{{ config(materialized='view') }}

{%- set yaml_metadata -%}
source_model: 'erp_customers'
ldts: 'CURRENT_TIMESTAMP'
rsrc: '!test'
include_source_columns: False
enable_ghost_records: True

derived_columns:
  credits:
    value: 'TRY_CAST(credits AS DOUBLE)'
    datatype: 'DOUBLE'
    src_cols_required: credits

hashed_columns:
  customer_hk:
    - customer_id
  hashdiff:
    is_hashdiff: true
    columns:
      - customer_name
      - customer_segment
      - credits
{%- endset -%}

{{ datavault4dbt.stage(yaml_metadata=yaml_metadata) }}

Error:

Database Error in model stg_customers
[AMBIGUOUS_REFERENCE] Reference `credits` is ambiguous, could be:
[`derived_columns`.`credits`, `derived_columns`.`credits`].
SQLSTATE: 42704; line 78 pos 8

The compiled SQL shows that credits is now carried through correctly as an input, but it ends up selected twice in derived_columns, once as the raw source column and once as the derived expression. This duplication then propagates through every subsequent CTE:

ldts_rsrc_data AS (
  SELECT
    CAST( CURRENT_TIMESTAMP as TIMESTAMP ) AS loaded_at,
    CAST( 'test' as STRING ) AS record_source,
    credits,         -- raw input column
    customer_id,
    customer_name,
    customer_segment
  FROM source_data
),

derived_columns AS (
  SELECT
    loaded_at,
    record_source,
    credits,         -- raw column selected here...
    customer_id,
    customer_name,
    customer_segment,
    TRY_CAST(credits AS DOUBLE) AS credits  -- ...and again as derived expression
  FROM ldts_rsrc_data
),

hashed_columns AS (
  SELECT
    loaded_at,
    record_source,
    credits,         -- duplicated
    customer_id,
    customer_name,
    customer_segment,
    credits,         -- duplicated
    ...
  FROM derived_columns
),
...
columns_to_select AS (
  SELECT
    loaded_at,
    record_source,
    credits,         -- duplicated
    customer_id,
    customer_name,
    customer_segment,
    credits,         -- duplicated
    customer_hk,
    hashdiff
  FROM hashed_columns
  ...
)

Expected behavior

In both scenarios, the output should contain a single credits column cast to DOUBLE. The overwrite pattern is documented at Derived Columns as a supported feature and should work out of the box without requiring workarounds or hitting ambiguous reference errors.

Screenshots

If applicable, add screenshots to help explain your problem.

Additional context

Add any other context about the problem here.

Metadata

Metadata

Assignees

Labels

bugSomething isn't working

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions