refactor: declarative schema for configuration #963

Open
cmgzn wants to merge 5 commits into datajuicer:main from cmgzn:refactor_config
Conversation

cmgzn (Collaborator) commented Apr 8, 2026

Core Changes

  • Add data_juicer/config/schema.py (938 lines): define the DJConfig Pydantic model as the single source of truth for global config
  • Refactor data_juicer/config/config.py: move 673 lines of field definitions into the schema module, leaving only the parser construction logic
  • Enhance ConfigValidator: support merging BASE_CONFIG_RULES with CONFIG_VALIDATION_RULES
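
As a rough illustration of the third item, the sketch below shows one way a shared base rule table could be merged with a class-specific one before validation. The rule keys (`required_fields`, `field_types`), the `merge_rules` and `validate` helpers, and the example rule contents are all hypothetical; the PR text does not show how ConfigValidator actually performs the merge.

```python
from copy import deepcopy

# Hypothetical rule tables; the real contents in the PR are not shown.
BASE_CONFIG_RULES = {
    "required_fields": ["path"],
    "field_types": {"path": str},
}

CONFIG_VALIDATION_RULES = {
    "required_fields": ["split"],
    "field_types": {"split": str, "limit": int},
}


def merge_rules(base: dict, extra: dict) -> dict:
    """Merge two rule tables without mutating either input.

    Dict-valued rules are merged key by key (extra wins on conflict),
    list-valued rules are concatenated without duplicates, and any
    other value in `extra` replaces the base value.
    """
    merged = deepcopy(base)
    for key, value in extra.items():
        if key not in merged:
            merged[key] = deepcopy(value)
        elif isinstance(value, dict) and isinstance(merged[key], dict):
            merged[key] = {**merged[key], **value}
        elif isinstance(value, list) and isinstance(merged[key], list):
            merged[key] = merged[key] + [v for v in value if v not in merged[key]]
        else:
            merged[key] = value
    return merged


def validate(config: dict, rules: dict) -> list:
    """Return a list of human-readable errors; empty means valid."""
    errors = []
    for name in rules.get("required_fields", []):
        if name not in config:
            errors.append(f"missing required field: {name}")
    for name, expected in rules.get("field_types", {}).items():
        if name in config and not isinstance(config[name], expected):
            errors.append(f"field {name} should be {expected.__name__}")
    return errors


rules = merge_rules(BASE_CONFIG_RULES, CONFIG_VALIDATION_RULES)
errors = validate({"path": "data.jsonl", "limit": "ten"}, rules)
```

With these illustrative tables, the merged rules require both `path` and `split`, so the sample config is reported as missing `split` and as having a wrongly typed `limit`.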

Technical Benefits

  • Config field definitions are centralized, eliminating scattered parser.add_argument() calls
  • Query APIs (get_json_schema(), get_defaults()) are provided for external tool integration
  • Type validation is delegated to Pydantic; checks previously performed at runtime are now expressed in the model definition
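
The benefits above can be sketched with a small model, assuming Pydantic v2. `DJConfigSketch`, its three fields, and `build_parser` are hypothetical stand-ins, not the contents of schema.py; `get_json_schema()` and `get_defaults()` borrow the names from the PR description, but their real signatures are not shown there.

```python
import argparse

from pydantic import BaseModel, Field, ValidationError


class DJConfigSketch(BaseModel):
    """Hypothetical stand-in for DJConfig; fields are illustrative only."""

    project_name: str = Field("hello_world", description="Name of the project.")
    np: int = Field(4, ge=1, description="Number of worker processes.")
    export_path: str = Field("./outputs/result.jsonl", description="Output path.")


def build_parser(model: type[BaseModel]) -> argparse.ArgumentParser:
    """Derive one CLI flag per model field instead of hand-written add_argument() calls."""
    parser = argparse.ArgumentParser()
    for name, field in model.model_fields.items():
        parser.add_argument(f"--{name}", default=field.default, help=field.description)
    return parser


def get_json_schema() -> dict:
    """JSON Schema of the config model, e.g. for external tools."""
    return DJConfigSketch.model_json_schema()


def get_defaults() -> dict:
    """Default value of every field, obtained by instantiating the model with no input."""
    return DJConfigSketch().model_dump()


# CLI values arrive as strings; Pydantic coerces and validates them.
args = build_parser(DJConfigSketch).parse_args(["--np", "8"])
cfg = DJConfigSketch(**vars(args))

# A constraint declared on the field (ge=1) is enforced at construction time.
try:
    DJConfigSketch(np=0)
    rejected = False
except ValidationError:
    rejected = True
```

Because the parser, the JSON Schema, and the defaults are all derived from the same model, adding a field in one place updates all three, which is the "single source of truth" property the PR describes.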

Impact Scope

  • data_juicer/core/data/dataset_builder.py: Adapt to new config structure
  • data_juicer/core/data/load_strategy.py: Adapt to merged validation rules
  • data_juicer/tools/DJ_mcp_recipe_flow.py: Adapt to config API changes
  • Test cases updated accordingly

gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces a declarative configuration schema using Pydantic, centralizing configuration definitions and refactoring the argument parser and default settings logic. It also updates dataset building and event logging to support the new configuration structure. Feedback highlights a redundant conditional block in the schema registration function and an unused helper function in the recipe flow tool that should be removed.

Comment thread data_juicer/config/schema.py
Comment thread data_juicer/tools/DJ_mcp_recipe_flow.py Outdated
cmgzn force-pushed the refactor_config branch from a60bc66 to 8dd7f01 on April 9, 2026 07:31
cmgzn changed the title from "Refactor config schema" to "refactor: declarative schema for configuration" on Apr 9, 2026
cmgzn marked this pull request as ready for review on April 9, 2026 07:54