
Commit 4472789 — authored by yxdycHYLcoolcmgzn
feat(agent): interaction quality ops & recipe, bad-case HTML report, and robust JSONL / HF meta loading (#957)
This squashed merge collects the following work (deduplicated from the per-commit log; each sub-commit carried a "Made-with: Cursor" trailer):

* poc of agent-related ops
* preliminary test of minimal_configs (01 to 05) (#944)
* test of minimal_configs (06 to 08); optimize some agent ops (#947)
* conflicts resolved and Gemini's suggestions adopted
* feat(agent ops): end-to-end YAML, analysis toolchain, and UI developed; tested on small-scale samples (#949)
* fix(agent): multi-turn tool dialog, bad-case gating, report zh-tier + evidence (#950)
* fix(agent): llm_quality record schema + dialog history caps & prompt truncation
  - Normalize LLM recommendation to list[str] in parse_output (fixes HF datasets shard alignment)
  - agent_dialog_normalize_mapper: configurable history caps, head+tail write-back, meta flag
  - dialog_* mappers: shared max_*_chars_for_prompt via dialog_llm_input_utils
  - Recipe/docs: agent_interaction_quality_analysis, PERFORMANCE_LLM, BAD_CASE_INSIGHTS
  - Tests: agent_dialog_normalize_mapper, llm_analysis_filter parse_output
  - build_op_doc: exclude dialog_llm_input_utils helper; video_camera_pose droid_args docstring
* feat(agent): bad-case HTML report UX + HF meta stability for normalize (#951)
  - Bad-case report (generate_bad_case_report.py): CJK fonts for matplotlib/body; bar labels; section order (charts → insights → cases)
  - LLM page-top summary: compact digest, shorter prompt/tokens/timeout; default qwen3.5-plus
  - Drilldown: page cap + sidecar *_drilldown_full.jsonl; copy and nav tweaks
  - Richer agent_insight_llm cards; rule-based fallback summary
  - agent_dialog_normalize_mapper: stable HF Arrow meta — always emit agent_dialog_history_compressed bool; list[str] placeholders for empty tool/skill types; filter falsy in tool_type_mapper and skill_insight_mapper
  - Pipeline: run_bad_case_pipeline report uses an argv array safe under set -u; BAD_CASE_REPORT_LLM=1; tests + recipe YAML aligned
* fix: optional stdlib json for HF datasets JSONL (ujson "Value is too big!") (#952)
  - Add DATA_JUICER_USE_STDLIB_JSON env patch in init_configs
  - Document the workaround in config_all.yaml and the DatasetCfg guides
* feat: lenient JSONL load (stdlib json, skip bad lines) (#953)
  - Add load_jsonl_lenient config and DATA_JUICER_JSONL_LENIENT env
  - Stream jsonl-only inputs via Dataset.from_generator; document in DatasetCfg
  - Add unit tests for jsonl_lenient_loader
* fix(lenient jsonl): do not fall back to HF when a folder mixes .json (#954)
  - Mixed extensions previously forced the HuggingFace JSON loader and ujson ("Value is too big!"). Now only jsonl* shards are read; others are skipped with warnings. The log line "[lenient jsonl] ACTIVE" confirms the path.
* no need to convert to str; use list type for the arg to avoid checkpoint failure
* feat(agent): dialog quality mappers, bad-case report UX, pipeline & tests
  - Add dialog_* LLM axis mappers, trace coherence, tool relevance, PII suspect
  - agent_output_locale; extend bad-case signals, insight & usage/tool mappers
  - generate_bad_case_report: TOC/sidebar, insight↔drill links, snapshot operator row
  - Recipe/docs/Operators.md/pyproject; mapper & locale tests
  - build_op_doc: exclude dialog_quality_llm_utils (helper, not an OP)
* feat(demos/agent): enrich bad-case report and agent skill insights
  - HTML report: macro distributions (tools, skills, intent/topic/sentiment) with bar charts and optional word clouds; TOC and chart section wiring
  - Omit PII audit / redaction-related samples from high_precision and watchlist insight excerpts (drilldown/export unchanged)
  - agent_skill_insight_mapper: prompt asks for concrete ~10-char (zh) / 4–8-word (en) capability phrases; forbid vague read/write-style tags
  - Docs: root README link to demos/agent; maintainer checklist in demos/agent README; YAML/minimal_configs notes
  - Tests: generate_bad_case_report smoke (PII omission); agent_skill_insight prompt assertions
* fix some problems; update OP doc building logic to drop base classes not registered into OPERATORS; fix accelerator assignment; check it is a dict before nested query
* feat(agent): enrich bad-case report and skill insight parsing
  - agent_skill_insight_mapper: split labels on ",", "，", "、", ";", "；" for CN/EN separators
  - generate_bad_case_report: mirror the split in macro stats (no re-run required)
  - Optional semantic clustering for insight headlines/audit (scikit-learn)
  - Insight model tabs: default full model id, family-mode flag; order by batch volume
  - Stacked request_model chart: Top 5 by requests + merged remainder bar
  - Extend model family hints (Kimi, GLM, MiniMax); tab/chart copy updates
  - Smoke tests: PII omission, skill-insight macro split; --no-insight-semantic-cluster in PII run
* feat(agent): PII mappers, recipe notes, and bad-case report PII grouping
  - Expand pii_redaction_mapper (PEM/JWT/URL/IP/MAC ordering, optional extended PII) with tests
  - pii_llm_suspect_mapper: spaCy install/locks, safer logging; prompts mention URL/IP/MAC/JWT/PEM leaks
  - generate_bad_case_report: non-PII vs PII-flagged insight subsections; minimal PII cards; headline clusters use non-PII rows only; align redaction placeholders for grouping
  - agent_interaction_quality_analysis.yaml: pii_redaction indentation and default-behavior comment
* feat(agent): dual PII-safe/audit reports and diversified drilldown
  - Default writes safe HTML plus *_pii_audit.html; --report-pii-variants safe|audit|both
  - Case study: ~half high_precision / half watchlist quota with spillover
  - Reuse char TF-IDF + MiniBatchKMeans round-robin for insight cards and the case-study page
  - Remove the in-page PII minimal-card split; safe variant omits PII rows from insight + drilldown
  - run_bad_case_pipeline.sh echoes the audit path; smoke tests updated
* feat(agent): dialog shape and token/latency metrics in bad-case report
  - New section #sec-dialog-metrics: messages length, user turns, agent_turn_count, text chars, choices length, tool-touch message count, tokens (meta then stats), latency
  - Optional matplotlib histograms when --no-charts is off; TOC and charts intro link to the section
  - Smoke test asserts the sec-dialog-metrics anchor
* feat(agent): multi-jsonl report input and provenance fold
  - Repeatable --input in generate_bad_case_report and verify; load_merged_rows reads paths in order
  - run_bad_case_pipeline report: multiple JSONL, optional trailing OUT.html
  - Multi-input: compact page meta and bottom #sec-data-provenance details for audit
* add dataset encryption and decryption
* feat(agent): streaming bad-case report scan and message in case study
  - Default one-pass stream: exact tier/signal/cohort/macro counts without loading all rows
  - Bounded drill/insight candidate pools (--drill-retention-cap, --insight-retention-cap); dialog histograms use per-metric reservoir sampling (--dialog-metric-samples)
  - --eager-load-all-rows restores the legacy full in-memory load (small data / debug)
  - Case study block shows message (fallback: response); JSONL export includes both; PII/signal snippets and suspect_empty evidence prefer message text
* make json compat work; use DataJuicerTestCaseBase instead of unittest.TestCase; add test cases for datasets_json_compat
* feat(agent): polish bad-case report case study and refresh agent docs
  - Case study: format messages[] as role-labeled turns; show response in its own panel
  - Support User:/Assistant:-style plain-text threads for readability
  - Update BAD_CASE_INSIGHTS.md and QUICKSTART_BAD_CASE.md
* remove redundant docs and contents; add a demo video for the bad-case report; add a script to download the demo dataset
* limit the CPU count of a single container to 128
* add ray tag for ray executor encrypt test cases
* update datasets to >= 4.7.0 to support JSON type and ujson loading; fix test case bugs; update cudf-cu12 to follow newer pyarrow
* refactor: enhance signature annotation resolution for ops search functionality
* use num_proc=None to disable multiprocessing for get_access_log
* fix test cases; remove image_captioning_from_gpt4v_mapper; add normalization to llm_analysis results
* loosen the assertion condition; check num_proc before comparison; normalize the records as well
* use dumped string for LLM tags; fix test cases
* update model dist

---------

Co-authored-by: 烈霖 <lielin.hyl@alibaba-inc.com>
Co-authored-by: cmgzn <zdongs@outlook.com>
Parent: 003e2a8 · Commit: 4472789

File tree

139 files changed: +15739 −1047 lines


.gitignore

Lines changed: 3 additions & 0 deletions
@@ -39,6 +39,9 @@ tmp/
 # perf bench data
 perf_bench_data/
 
+# some local demo data
+demos/local/*
+
 # env file
 .env
 

.pre-commit-hooks/build_op_doc.py

Lines changed: 56 additions & 1 deletion
@@ -76,7 +76,14 @@
 # >>> OP code/test paths and exclusive files/dirs
 OP_CODE_PREFIX = "data_juicer/ops/"
 OP_TEST_PREFIX = "tests/ops/"
-OP_EXCLUDE = {"__init__.py", "common", "__pycache__"}
+OP_EXCLUDE = {
+    "__init__.py",
+    "common",
+    "__pycache__",
+    # Helper module under mapper/ (not a registered OP)
+    "dialog_llm_input_utils.py",
+    "dialog_quality_llm_utils.py",
+}
 
 FORMATTER_CODE_PREFIX = "data_juicer/format/"
 FORMATTER_TEST_PREFIX = "tests/format/"
@@ -273,6 +280,16 @@ def pick_doc_for_op(docstrings: List[tuple], op_stem: str) -> str:
     return docstrings[-1][1]
 
 
+def is_registered_op(code_path):
+    """
+    Return True only if the file contains an OPERATORS.register_module call,
+    indicating it defines a concrete registered OP rather than a base class.
+    """
+    with open(code_path, "r", encoding="utf-8") as fin:
+        content = fin.read()
+    return "OPERATORS.register_module" in content
+
+
 def get_class_and_docstring(code_path):
     """
     Get (class_name, first-sentence doc) for each ClassDef in the file that has a class docstring.
@@ -375,6 +392,8 @@ def get_op_list_from_code():
             continue
         if not code_path.endswith(".py") or "_cpp" in code_path:
             continue
+        if not is_registered_op(code_path):
+            continue
         docstrings = get_class_and_docstring(code_path)
         stem = op.replace(".py", "")
         doc = pick_doc_for_op(docstrings, stem)
@@ -658,6 +677,41 @@ def check_and_update_op_record(old_op_record_list, new_op_record_list):
     return updated_op_record_list
 
 
+def print_op_doc_diff(old_op_num_dict, new_op_num_dict, old_op_record_list, updated_op_record_list):
+    """
+    Print the difference between the old and new op_num_dict and op_record_list.
+    """
+    all_types = set(old_op_num_dict) | set(new_op_num_dict)
+    for t in sorted(all_types):
+        old_cnt = old_op_num_dict.get(t)
+        new_cnt = new_op_num_dict.get(t)
+        if old_cnt != new_cnt:
+            print(f"  [op_num] type={t}: {old_cnt} -> {new_cnt}")
+
+    old_record_dict = {r.name: r for r in old_op_record_list}
+    new_record_dict = {r.name: r for r in updated_op_record_list}
+    old_names = set(old_record_dict)
+    new_names = set(new_record_dict)
+    for name in sorted(new_names - old_names):
+        print(f"  [op_record] ADDED: {new_record_dict[name]}")
+    for name in sorted(old_names - new_names):
+        print(f"  [op_record] REMOVED: {old_record_dict[name]}")
+    for name in sorted(old_names & new_names):
+        old_r, new_r = old_record_dict[name], new_record_dict[name]
+        if old_r != new_r:
+            print(f"  [op_record] CHANGED: {name}")
+            if old_r.type != new_r.type:
+                print(f"    type: {old_r.type!r} -> {new_r.type!r}")
+            if set(old_r.tags) != set(new_r.tags):
+                print(f"    tags: {old_r.tags} -> {new_r.tags}")
+            if old_r.desc != new_r.desc:
+                print(f"    desc: {old_r.desc!r} -> {new_r.desc!r}")
+            if old_r.info != new_r.info:
+                print(f"    info: {old_r.info!r} -> {new_r.info!r}")
+            if old_r.ref != new_r.ref:
+                print(f"    ref: {old_r.ref!r} -> {new_r.ref!r}")
+
+
 def main():
     old_op_record_list, old_op_num_dict = parse_op_record_from_current_doc()
     new_op_record_list, new_op_num_dict = get_op_list_from_code()
@@ -666,6 +720,7 @@ def main():
     if new_op_num_dict == old_op_num_dict and old_op_record_list == updated_op_record_list:
         exit(0)
     else:
+        print_op_doc_diff(old_op_num_dict, new_op_num_dict, old_op_record_list, updated_op_record_list)
         generate_new_doc(updated_op_record_list, old_op_record_list)
         print("Operator document is updated.")
         exit(1)
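The `is_registered_op` gate above is a plain text scan, not an AST check: a module only contributes to the OP doc if it literally contains an `OPERATORS.register_module` call. A self-contained sketch of the heuristic, run against hypothetical file contents (the `Mapper`/`DemoMapper` sources below are illustrative, not repo code):

```python
import os
import tempfile


def is_registered_op(code_path):
    # Same heuristic as in build_op_doc.py: a file counts as a concrete,
    # registered OP only if it contains an OPERATORS.register_module call.
    with open(code_path, "r", encoding="utf-8") as fin:
        return "OPERATORS.register_module" in fin.read()


# Hypothetical module contents: an abstract base class vs. a registered OP.
BASE_CLASS_SRC = "class Mapper:\n    def process(self, sample):\n        ...\n"
CONCRETE_OP_SRC = (
    "@OPERATORS.register_module('demo_mapper')\n"
    "class DemoMapper(Mapper):\n"
    "    ...\n"
)
```

The text-scan approach is cheap and robust to import errors, but it would also match the string inside a comment; the helper-module `OP_EXCLUDE` entries above cover the remaining cases.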

README.md

Lines changed: 1 addition & 0 deletions
@@ -187,6 +187,7 @@ For detailed documentation, please see [here](https://datajuicer.github.io/data-
 
 **Quick Links:**
 - **[operator zoo](https://datajuicer.github.io/data-juicer/en/main/docs/Operators.html)** — Browse 200+ operators with examples
+- **[Agent interaction quality & bad-case](demos/agent/README.md)** — In-repo recipe, JSONL pipeline, HTML report (`demos/agent/`; operators such as `agent_bad_case_signal_mapper` are also listed in [docs/Operators.md](docs/Operators.md))
 - **[data-juicer-hub](https://github.com/datajuicer/data-juicer-hub)** — Community-driven recipes and best practices
 - **[developer guide](https://datajuicer.github.io/data-juicer/en/main/docs/DeveloperGuide.html)** — Build your own code and contribute to DJ
 - **[data-juicer-cookbook](https://datajuicer.github.io/data-juicer/en/main/docs/tutorial/DJ-Cookbook.html)** — resource archive

data_juicer/config/config.py

Lines changed: 89 additions & 7 deletions
@@ -269,6 +269,33 @@ def build_base_parser() -> ArgumentParser:
         "an error will be raised. Should contain aws_access_key_id, aws_secret_access_key, aws_region, "
         "and optionally aws_session_token and endpoint_url.",
     )
+    parser.add_argument(
+        "--decrypt_after_reading",
+        type=bool,
+        default=False,
+        help="Whether to decrypt input dataset files after reading. When True, "
+        "each input file is decrypted in memory using a Fernet key before "
+        "being loaded by HuggingFace datasets. No plaintext file is written "
+        "to disk. HuggingFace cache is automatically disabled to prevent "
+        "plaintext Arrow files from being persisted. Default: False.",
+    )
+    parser.add_argument(
+        "--encrypt_before_export",
+        type=bool,
+        default=False,
+        help="Whether to encrypt output dataset files before writing to disk. "
+        "When True, each exported file is encrypted in-place with a Fernet "
+        "key immediately after being written. Default: False.",
+    )
+    parser.add_argument(
+        "--encryption_key_path",
+        type=Optional[str],
+        default=None,
+        help="Path to a file containing the Fernet encryption key (base64 "
+        "url-safe string). If not provided, the key is read from the "
+        "environment variable DJ_ENCRYPTION_KEY. Required when either "
+        "decrypt_after_reading or encrypt_before_export is True.",
+    )
     parser.add_argument(
         "--keep_stats_in_res_ds",
         type=bool,
@@ -764,6 +791,13 @@ def init_configs(args: Optional[List[str]] = None, which_entry: object = None, l
     setting up logger.
     :return: a global cfg object used by the DefaultExecutor or Analyzer
     """
+    # Optional: stdlib json for HF datasets JSONL (avoids ujson "Value is too big!")
+    from data_juicer.utils.datasets_json_compat import (
+        apply_stdlib_json_patch_for_datasets,
+    )
+
+    apply_stdlib_json_patch_for_datasets()
+
     if args is None:
         args = sys.argv[1:]
     with timing_context("Total config initialization time"):
@@ -985,6 +1019,37 @@ def init_setup_from_cfg(cfg: Namespace, load_configs_only=False):
         os.makedirs(cfg.temp_dir, exist_ok=True)
         tempfile.tempdir = cfg.temp_dir
 
+    # encryption mode: force disable HF cache to prevent plaintext Arrow files
+    # from being persisted to disk, and validate the key early.
+    if cfg.get("decrypt_after_reading", False) or cfg.get("encrypt_before_export", False):
+        if cfg.get("use_cache", True):
+            logger.warning(
+                "Encryption mode is enabled: forcing use_cache=False to "
+                "prevent plaintext Arrow cache files from being written to disk."
+            )
+            from datasets import disable_caching
+
+            disable_caching()
+            cfg.use_cache = False
+        if cfg.cache_compress:
+            logger.warning("Disable cache compression due to disabled cache.")
+            cfg.cache_compress = None
+        import tempfile
+
+        logger.warning(
+            f"Set temp directory to store temp files to [{cfg.temp_dir}]. "
+            "For maximum security, set temp_dir to a memory-backed "
+            "filesystem such as /dev/shm."
+        )
+        if cfg.temp_dir is not None and not os.path.exists(cfg.temp_dir):
+            os.makedirs(cfg.temp_dir, exist_ok=True)
+        tempfile.tempdir = cfg.temp_dir
+        # Validate key availability early so the job fails fast on
+        # misconfiguration rather than deep into processing.
+        from data_juicer.utils.encryption_utils import load_fernet_key
+
+        load_fernet_key(cfg.get("encryption_key_path", None))
+
     # The checkpoint mode is not compatible with op fusion for now.
     if cfg.get("op_fusion", False):
         cfg.use_checkpoint = False
@@ -1203,15 +1268,20 @@ def update_op_process(cfg, parser, used_ops=None):
             # Add new operator
             cfg.process.append({op_name: None if internal_op_para is None else namespace_to_dict(internal_op_para)})
 
-    # Optimize type checking
+    # Optimize type checking: deepcopy(parser) does not replicate nested add_class_arguments,
+    # so only pass global args to temp_parser to avoid "Unrecognized arguments" for op.* keys.
    recognized_args = {
        action.dest for action in parser._actions if hasattr(action, "dest") and isinstance(action, ActionTypeHint)
    }
+    exclude_prefixes = tuple(used_ops) + tuple(f"{op_name}." for op_name in (used_ops or ()))
 
-    # check the op params via type hint
    temp_parser = copy.deepcopy(parser)
-
-    temp_args = namespace_to_arg_list(temp_cfg, includes=recognized_args, excludes=["config"])
+    temp_args = namespace_to_arg_list(
+        temp_cfg,
+        includes=recognized_args,
+        excludes=["config"],
+        exclude_prefixes=exclude_prefixes,
+    )
 
    if temp_cfg.config:
        temp_args.extend(["--config", os.path.abspath(temp_cfg.config[0])])
@@ -1224,15 +1294,27 @@ def update_op_process(cfg, parser, used_ops=None):
     return cfg
 
 
-def namespace_to_arg_list(namespace, prefix="", includes=None, excludes=None):
+def namespace_to_arg_list(namespace, prefix="", includes=None, excludes=None, exclude_prefixes=None):
     arg_list = []
+    exclude_prefixes = exclude_prefixes or ()
 
     for key, value in vars(namespace).items():
+        concat_key = f"{prefix}{key}"
+        if exclude_prefixes and (
+            concat_key in exclude_prefixes
+            or any(concat_key.startswith(p + ".") for p in exclude_prefixes if "." not in p)
+        ):
+            continue
         if issubclass(type(value), Namespace):
-            nested_args = namespace_to_arg_list(value, f"{prefix}{key}.")
+            nested_args = namespace_to_arg_list(
+                value,
+                f"{prefix}{key}.",
+                includes=includes,
+                excludes=excludes,
+                exclude_prefixes=exclude_prefixes,
+            )
             arg_list.extend(nested_args)
         elif value is not None:
-            concat_key = f"{prefix}{key}"
             if includes is not None and concat_key not in includes:
                 continue
             if excludes is not None and concat_key in excludes:
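The `exclude_prefixes` parameter threaded through `namespace_to_arg_list` exists so the temporary type-check parser never sees arguments owned by per-OP namespaces. A simplified standalone sketch of that filtering (it drops the `includes`/`excludes` handling of the real helper, and the config keys in the example are hypothetical):

```python
from argparse import Namespace


def namespace_to_arg_list(namespace, prefix="", exclude_prefixes=()):
    """Flatten a (possibly nested) Namespace into --key value pairs,
    skipping any key that belongs to an excluded per-OP prefix."""
    arg_list = []
    for key, value in vars(namespace).items():
        concat_key = f"{prefix}{key}"
        # Skip the op's own namespace and any dotted key underneath it.
        if concat_key in exclude_prefixes or any(
            concat_key.startswith(p + ".") for p in exclude_prefixes if "." not in p
        ):
            continue
        if isinstance(value, Namespace):
            arg_list.extend(
                namespace_to_arg_list(value, f"{concat_key}.", exclude_prefixes)
            )
        elif value is not None:
            arg_list.extend([f"--{concat_key}", str(value)])
    return arg_list
```

With `exclude_prefixes=("text_length_filter",)`, global keys such as `np` survive while every `text_length_filter.*` key is dropped, which is exactly what keeps the deep-copied parser from reporting "Unrecognized arguments" for op-scoped settings.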

data_juicer/config/config_all.yaml

Lines changed: 11 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -33,7 +33,13 @@ export_in_parallel: false # whether to export
 keep_stats_in_res_ds: false  # whether to keep the computed stats in the result dataset. The intermediate fields storing the stats computed by Filters will be removed if it's False. It's False by default.
 keep_hashes_in_res_ds: false  # whether to keep the computed hashes in the result dataset. The intermediate fields storing the hashes computed by Deduplicators will be removed if it's False. It's False by default.
 export_extra_args: {}  # other optional arguments for exporting, as a dict, e.g. the key-mapping info for exporting the WebDataset format.
+# If loading local JSONL fails with ujson "ValueError: Value is too big!", run with:
+#   DATA_JUICER_USE_STDLIB_JSON=1 dj-process --config your.yaml
+# For per-line tolerance (stdlib json, skip bad lines; jsonl-only inputs), see DatasetCfg:
+#   load_jsonl_lenient: true
+#   # or: DATA_JUICER_JSONL_LENIENT=1
 load_dataset_kwargs: {}  # extra kwargs passed to datasets.load_dataset(). Useful for format-specific options, e.g. chunksize (JSON), columns (Parquet), delimiter (CSV).
+load_jsonl_lenient: false  # if true, stream jsonl* shards with stdlib json and skip bad lines; other suffixes in the same folder are ignored (no HF fallback). Confirm the logs contain "[lenient jsonl] ACTIVE".

 auto_op_parallelism: true  # whether to automatically set num_proc according to system resources. It's true by default.
 np: 4  # number of subprocesses to process your dataset
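The lenient mode described above amounts to streaming each jsonl shard line by line with stdlib `json` and dropping undecodable lines instead of aborting the load. A minimal sketch of that behavior, with a hypothetical helper name (not the actual DatasetCfg loader):

```python
import io
import json


def iter_jsonl_lenient(stream):
    """Yield one record per parseable JSONL line; skip blank or malformed lines."""
    for lineno, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            # a real loader would log the shard name and lineno of the skipped line
            continue


shard = io.StringIO('{"text": "ok"}\n{broken\n\n{"text": "also ok"}\n')
rows = list(iter_jsonl_lenient(shard))
print(len(rows))  # -> 2
```

The trade-off is silent data loss on bad lines, which is why the option is off by default and why the log line mentioned above should be checked to confirm the mode is actually active.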
@@ -56,6 +62,11 @@ fusion_strategy: 'probe' # OP fusion strategy
 cache_compress: null  # the compression method for the cache file; one of ['gzip', 'zstd', 'lz4']. If None, the cache file will not be compressed. We recommend turning this on when your input dataset is larger than tens of GB and disk space is tight.
 adaptive_batch_size: false  # whether to use adaptive batch sizes for each OP according to the probed results. It's False by default.

+# for data encryption / decryption
+decrypt_after_reading: false  # whether to decrypt input files after reading. Each file is decrypted in memory (plaintext is never written to disk). The HF cache is automatically disabled. Requires encryption_key_path or the DJ_ENCRYPTION_KEY env var.
+encrypt_before_export: false  # whether to encrypt output files before writing to disk. Each exported file is encrypted in place immediately after being written. Requires encryption_key_path or the DJ_ENCRYPTION_KEY env var.
+encryption_key_path: null  # path to a file containing the Fernet key (base64 url-safe string). Falls back to the DJ_ENCRYPTION_KEY environment variable if not set.

 # for multimodal data processing
 image_key: 'images'  # key name of the field storing the list of sample image paths.
 image_bytes_key: 'image_bytes'  # key name of the field storing the list of sample image bytes.

@@ -338,16 +349,6 @@ process:
     blur_type: 'gaussian'  # type of blur kernel, one of ['mean', 'box', 'gaussian']
     radius: 2  # radius of blur kernel
     save_dir: null  # the directory where generated files will be stored. If not specified, outputs will be saved in the same directory as their corresponding input files. This path can alternatively be defined via the DJ_PRODUCED_DATA_DIR environment variable.
-  - image_captioning_from_gpt4v_mapper:  # generate samples whose texts are generated by gpt-4-vision from the image
-      mode: 'description'  # mode of the text generated from images, one of ['reasoning', 'description', 'conversation', 'custom']
-      api_key: ''  # the API key to authenticate the request
-      max_token: 500  # the maximum number of tokens to generate. Default is 500.
-      temperature: 1.0  # controls the randomness of the output (range from 0 to 1). Default is 0.
-      system_prompt: ''  # a string prompt used to set the context of the conversation and provide global guidance or rules for gpt-4-vision so that it generates responses in the expected way. Only used when `mode` is set to `custom`.
-      user_prompt: ''  # a string prompt to guide the generation of gpt-4-vision for each sample. It's "" by default, meaning no prompt is provided.
-      user_prompt_key: null  # the key name of the field in samples that stores a per-sample prompt, used to set different prompts for different samples. If None, the prompt in the `user_prompt` parameter is used. It's None by default.
-      keep_original_sample: true  # whether to keep the original sample. If set to False, only the generated text remains in the final dataset and the original text is removed. It's True by default.
-      any_or_all: 'any'  # keep this sample with the 'any' or 'all' strategy over all images. 'any': keep the sample if any image meets the condition; 'all': keep the sample only if all images meet the condition.
   - image_captioning_mapper:  # generate captions for images to augment datasets
       hf_img2seq: 'Salesforce/blip2-opt-2.7b'  # model name on huggingface to generate captions
       caption_num: 1  # how many candidate captions to generate for each image

data_juicer/core/data/dj_dataset.py (1 addition, 1 deletion)

@@ -152,7 +152,7 @@ def __init__(self, *args, **kargs):
         # batched sample: (k, v) pairs are organized in list manner
         for k, v in self.items():
             if isinstance(v, list) and len(v) > 0 and isinstance(v[0], dict):
-                self[k] = [NestedQueryDict(item) for item in v]
+                self[k] = [NestedQueryDict(item) if isinstance(item, dict) else item for item in v]

     def __getitem__(self, key):
         return nested_query(self, key)
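The one-line guard matters when a batched column mixes dicts with other values, e.g. `None` placeholders introduced by schema alignment: the old comprehension only checked the first element, so `NestedQueryDict(None)` on a later element would raise a `TypeError`. A simplified, hypothetical stand-in for `NestedQueryDict` demonstrates the guarded behavior:

```python
class NestedDict(dict):
    """Minimal stand-in for NestedQueryDict: recursively wraps nested dicts."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        for k, v in self.items():
            if isinstance(v, dict):
                self[k] = NestedDict(v)
            elif isinstance(v, list) and v and isinstance(v[0], dict):
                # the fix: only the first element was checked, so the list may
                # still mix dicts with other types -- guard each element
                self[k] = [NestedDict(i) if isinstance(i, dict) else i for i in v]


sample = NestedDict({"dialog": [{"role": "user"}, None, {"role": "assistant"}]})
print(type(sample["dialog"][0]).__name__, sample["dialog"][1])  # -> NestedDict None
```

Without the `isinstance(item, dict)` check, constructing this sample would fail on the `None` element while dict elements were being wrapped.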

data_juicer/core/executor/default_executor.py (2 additions)

@@ -113,6 +113,8 @@ def __init__(self, cfg: Optional[Namespace] = None):
             self.np,
             keep_stats_in_res_ds=self.cfg.keep_stats_in_res_ds,
             keep_hashes_in_res_ds=self.cfg.keep_hashes_in_res_ds,
+            encrypt_before_export=getattr(self.cfg, "encrypt_before_export", False),
+            encryption_key_path=getattr(self.cfg, "encryption_key_path", None),
             **export_extra_args,
         )

data_juicer/core/executor/ray_executor.py (2 additions)

@@ -114,6 +114,8 @@ def __init__(self, cfg: Optional[Namespace] = None):
             self.cfg.export_shard_size,
             keep_stats_in_res_ds=self.cfg.keep_stats_in_res_ds,
             keep_hashes_in_res_ds=self.cfg.keep_hashes_in_res_ds,
+            encrypt_before_export=getattr(self.cfg, "encrypt_before_export", False),
+            encryption_key_path=getattr(self.cfg, "encryption_key_path", None),
             **export_extra_args,
         )

data_juicer/core/executor/ray_executor_partitioned.py (2 additions)

@@ -244,6 +244,8 @@ def __init__(self, cfg: Optional[Namespace] = None):
             getattr(self.cfg, "export_shard_size", 0),
             keep_stats_in_res_ds=getattr(self.cfg, "keep_stats_in_res_ds", True),
             keep_hashes_in_res_ds=getattr(self.cfg, "keep_hashes_in_res_ds", False),
+            encrypt_before_export=getattr(self.cfg, "encrypt_before_export", False),
+            encryption_key_path=getattr(self.cfg, "encryption_key_path", None),
             **export_extra_args,
         )
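All three executors read the new fields with `getattr(..., default)` rather than attribute access, so config objects produced before this change (which lack the encryption fields) keep working. The pattern in isolation; `export_kwargs_from` is a hypothetical helper, not code from the repository:

```python
from types import SimpleNamespace


def export_kwargs_from(cfg):
    """Collect exporter kwargs, defaulting fields absent from older configs."""
    return {
        "encrypt_before_export": getattr(cfg, "encrypt_before_export", False),
        "encryption_key_path": getattr(cfg, "encryption_key_path", None),
    }


old_cfg = SimpleNamespace(np=4)  # predates the encryption options
new_cfg = SimpleNamespace(np=4, encrypt_before_export=True, encryption_key_path="key.txt")
print(export_kwargs_from(old_cfg))
# -> {'encrypt_before_export': False, 'encryption_key_path': None}
```

Direct attribute access (`cfg.encrypt_before_export`) would instead raise `AttributeError` on the old config.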
