fix(#25764): Implement UTF-8 encoding standardization for CSV import/export #27409
Darshan3690 wants to merge 19 commits into open-metadata:main
Conversation
## Overview

Resolve Chinese character garbling in CSV import/export workflows by implementing end-to-end UTF-8 encoding standardization across backend REST endpoints and frontend file handling.

## Root Causes Fixed

1. Missing `charset=UTF-8` declarations on the CSV transport layer (HTTP headers)
2. No UTF-8 BOM handling for Windows Excel compatibility
3. Inconsistent encoding across 9+ independent resource classes
4. Browser `FileReader` lacking an explicit encoding specification
5. No UTF-8 BOM prepended to CSV downloads

## Changes Implemented

### Backend (11 files)

**CSV Utility (`CsvUtil.java`)**
- Added `UTF8_BOM` constant (`\uFEFF`)
- Added `stripUtf8Bom(String value)` utility method for safe BOM removal
- Handles null, empty string, and multi-byte character scenarios

**Shared Import Flow (`EntityResource.java`)**
- Imported the `CsvUtil` dependency
- Normalize CSV input by stripping the BOM before repository parsing
- Applied to all entity types (Table, Glossary, Team, User, TestCase, etc.)

**REST Endpoints (9 resource files)**
- `ColumnResource.java`: updated 3 `@Produces`/`@Consumes` annotations
- `TableResource.java`: updated 4 annotations (export, async export, import, async import)
- `UserResource.java`: updated 3 annotations
- `TeamResource.java`: updated 4 annotations
- `TestCaseResource.java`: updated 3 annotations
- `GlossaryResource.java`: updated 4 annotations
- `GlossaryTermResource.java`: updated 4 annotations
- `LineageResource.java`: updated 1 annotation (export)
- All changed from `TEXT_PLAIN` to `TEXT_PLAIN + "; charset=UTF-8"`
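The BOM-stripping semantics described for `CsvUtil.java` can be sketched as follows. This is an illustrative TypeScript mirror of the described behavior (null/empty passthrough, single leading BOM removed), not the PR's actual Java code.

```typescript
// Illustrative sketch of the described stripUtf8Bom helper; names mirror
// the PR's Java utility but this TypeScript version is not from the PR.
const UTF8_BOM = '\uFEFF';

function stripUtf8Bom(value: string | null | undefined): string | null | undefined {
  // Null and empty inputs pass through unchanged, as the description requires.
  if (!value) {
    return value;
  }
  // Only a single leading BOM is removed; a BOM anywhere else is data.
  return value.startsWith(UTF8_BOM) ? value.slice(1) : value;
}
```

Stripping only the first code unit is safe here because the BOM is exactly one UTF-16 code unit, so multi-byte characters after it are untouched.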
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help!
Hi @harshach, please add the safe to test label.
Pull request overview
Implements end-to-end UTF-8 handling for CSV import/export to prevent non-ASCII (e.g., Chinese) character corruption by standardizing charset usage across UI requests, backend endpoints, and CSV parsing/downloading.
Changes:
- Standardize CSV import request encoding (UI sends `text/plain; charset=UTF-8`; backend consumes/produces UTF-8 explicitly).
- Add UTF-8 BOM handling (backend strips the BOM on import; UI prepends a BOM to CSV downloads for Excel compatibility).
- Extend automated coverage (Java unit tests, Jest tests, and a Playwright E2E scenario with Chinese content).
Reviewed changes
Copilot reviewed 21 out of 22 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| openmetadata-ui/src/main/resources/ui/src/utils/Export/ExportUtils.ts | Prepends BOM and enforces CSV MIME type for downloads to improve Excel UTF-8 handling. |
| openmetadata-ui/src/main/resources/ui/src/utils/Export/ExportUtils.test.tsx | Updates/adds tests for BOM behavior on CSV vs non-CSV downloads. |
| openmetadata-ui/src/main/resources/ui/src/rest/teamsAPI.ts | Adds UTF-8 charset to CSV import request headers for team/user imports. |
| openmetadata-ui/src/main/resources/ui/src/rest/tableAPI.ts | Adds UTF-8 charset to CSV import request headers for table import. |
| openmetadata-ui/src/main/resources/ui/src/rest/importExportAPI.ts | Adds UTF-8 charset to CSV import request headers for multiple entity import APIs. |
| openmetadata-ui/src/main/resources/ui/src/rest/importExportAPI.test.ts | Updates assertions to validate UTF-8 charset headers in import requests. |
| openmetadata-ui/src/main/resources/ui/src/rest/databaseAPI.ts | Adds UTF-8 charset to CSV import request headers for database/schema imports. |
| openmetadata-ui/src/main/resources/ui/src/rest/columnAPI.ts | Adds UTF-8 charset to CSV import request headers for column CSV import APIs. |
| openmetadata-ui/src/main/resources/ui/src/components/UploadFile/UploadFile.tsx | Forces FileReader.readAsText(..., 'utf-8') for CSV uploads. |
| openmetadata-ui/src/main/resources/ui/playwright/e2e/Pages/GlossaryImportExport.spec.ts | Adds Chinese glossary term data to validate E2E import/export behavior. |
| openmetadata-service/src/test/java/org/openmetadata/csv/CsvUtilTest.java | Adds unit tests for BOM stripping and Chinese character preservation in CSV formatting. |
| openmetadata-service/src/main/java/org/openmetadata/service/resources/teams/UserResource.java | Adds UTF-8 charset to CSV import/export endpoint annotations for users. |
| openmetadata-service/src/main/java/org/openmetadata/service/resources/teams/TeamResource.java | Adds UTF-8 charset to CSV import/export endpoint annotations for teams. |
| openmetadata-service/src/main/java/org/openmetadata/service/resources/lineage/LineageResource.java | Adds UTF-8 charset to lineage CSV export endpoint annotation. |
| openmetadata-service/src/main/java/org/openmetadata/service/resources/glossary/GlossaryTermResource.java | Adds UTF-8 charset to glossary term CSV import/export endpoint annotations. |
| openmetadata-service/src/main/java/org/openmetadata/service/resources/glossary/GlossaryResource.java | Adds UTF-8 charset to glossary CSV import/export endpoint annotations. |
| openmetadata-service/src/main/java/org/openmetadata/service/resources/dqtests/TestCaseResource.java | Adds UTF-8 charset to test case CSV import/export endpoint annotations. |
| openmetadata-service/src/main/java/org/openmetadata/service/resources/databases/TableResource.java | Adds UTF-8 charset to table CSV import/export endpoint annotations. |
| openmetadata-service/src/main/java/org/openmetadata/service/resources/columns/ColumnResource.java | Adds UTF-8 charset to column CSV import endpoint annotations. |
| openmetadata-service/src/main/java/org/openmetadata/service/resources/EntityResource.java | Centralizes BOM stripping for entity CSV imports via CsvUtil.stripUtf8Bom(...). |
| openmetadata-service/src/main/java/org/openmetadata/csv/CsvUtil.java | Introduces UTF-8 BOM constant and helper to strip BOM from imported CSV strings. |
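The download-side behavior summarized for `ExportUtils.ts` above (BOM prepend plus a CSV MIME type) can be sketched roughly as below. The function name, return shape, and MIME string are illustrative assumptions, not the file's real exports.

```typescript
// Hedged sketch of BOM-prepending for CSV downloads, per the table above.
// toCsvBlobParts is an illustrative name; the real code builds a Blob directly.
const UTF8_BOM = '\uFEFF';
const CSV_MIME = 'text/csv; charset=utf-8';

function toCsvBlobParts(content: string, isCsv: boolean): { parts: string[]; type: string } {
  if (!isCsv) {
    // Non-CSV downloads are left untouched.
    return { parts: [content], type: 'text/plain' };
  }
  // Guard against double-prepending when the content already starts with a BOM
  // (a bug flagged later in this review thread).
  const body = content.startsWith(UTF8_BOM) ? content : UTF8_BOM + content;
  return { parts: [body], type: CSV_MIME };
}
```

The BOM is what lets Windows Excel detect UTF-8 when opening the downloaded CSV; without it, Excel tends to assume a legacy codepage and garbles Chinese text.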
Comments suppressed due to low confidence (6)
openmetadata-ui/src/main/resources/ui/src/components/UploadFile/UploadFile.tsx:49
`setUploading(false)` runs in the `finally` block immediately after `readAsText(...)` is initiated, but `FileReader` completes asynchronously. This means the loader state will be cleared before `onload`/`onerror` fires (and errors thrown inside `reader.onerror` won't be caught by this try/catch). Move the `setUploading(false)` into `reader.onloadend` (or `onload`/`onerror`) and surface errors via the callback/rejection rather than throwing a string in an async handler.
```typescript
setUploading(true);
try {
  const reader = new FileReader();
  reader.onload = onCSVUploaded;
  reader.onerror = () => {
    throw t('server.unexpected-error');
  };
  reader.readAsText(options.file as Blob, 'utf-8');
} catch (error) {
  showErrorToast(error as AxiosError);
} finally {
  setUploading(false);
}
```
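The fix the reviewer suggests can be sketched as a promise-based wrapper that clears the loading flag in `onloadend` and reports failures through rejection. `ReaderLike` is an illustrative stand-in for the browser `FileReader` so the pattern is shown without a DOM; all names here are assumptions, not the component's actual API.

```typescript
// Minimal sketch of the suggested lifecycle fix, under the assumption that a
// FileReader-shaped object is injected (ReaderLike) so it can run headless.
interface ReaderLike {
  onload: ((result: string) => void) | null;
  onerror: ((err: Error) => void) | null;
  onloadend: (() => void) | null;
  readAsText(blob: unknown, encoding: string): void;
}

function readCsvAsText(
  file: unknown,
  reader: ReaderLike,
  setUploading: (v: boolean) => void
): Promise<string> {
  setUploading(true);
  return new Promise<string>((resolve, reject) => {
    reader.onload = (result) => resolve(result);
    reader.onerror = (err) => reject(err);
    // onloadend fires after either onload or onerror, so the loader flag
    // tracks the real (asynchronous) read lifecycle, not the sync call.
    reader.onloadend = () => setUploading(false);
    // Explicit 'utf-8' matches the PR's change to force the read encoding.
    reader.readAsText(file, 'utf-8');
  });
}
```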
openmetadata-service/src/main/java/org/openmetadata/service/resources/teams/TeamResource.java:755
The sync `exportCsv(...)` endpoint produces plain-text CSV, but the `@ApiResponse` content is still declared as `application/json`. This makes the generated OpenAPI spec incorrect for clients. Update the response `@Content(mediaType = ...)` to `text/plain` (or `text/csv`) to match what is actually returned.
```java
@GET
@Path("/name/{name}/export")
@Produces({MediaType.TEXT_PLAIN + "; charset=UTF-8"})
@Valid
@Operation(
    operationId = "exportTeams",
    summary = "Export teams in CSV format",
    responses = {
      @ApiResponse(
          responseCode = "200",
          description = "Exported csv with teams information",
          content =
              @Content(
                  mediaType = "application/json",
                  schema = @Schema(implementation = String.class)))
```
openmetadata-service/src/main/java/org/openmetadata/service/resources/glossary/GlossaryResource.java:575
The sync `exportCsv(...)` endpoint returns CSV (`String`) and is annotated as `@Produces(text/plain; charset=UTF-8)`, but the `@ApiResponse` still declares `application/json`. Adjust the documented response media type to `text/plain` (or `text/csv`) so generated clients don't try to parse JSON.
```java
@GET
@Path("/name/{name}/export")
@Produces({MediaType.TEXT_PLAIN + "; charset=UTF-8"})
@Valid
@Operation(
    operationId = "exportGlossary",
    summary = "Export glossary in CSV format",
    responses = {
      @ApiResponse(
          responseCode = "200",
          description = "Exported csv with glossary terms",
          content =
              @Content(
                  mediaType = "application/json",
                  schema = @Schema(implementation = String.class)))
```
openmetadata-service/src/main/java/org/openmetadata/service/resources/databases/TableResource.java:622
The sync `exportCsv(...)` endpoint returns plain-text CSV but its `@ApiResponse` still advertises `application/json`. This makes the OpenAPI spec inaccurate for CSV consumers. Update the documented response `@Content(mediaType = ...)` to `text/plain` (or `text/csv`).
```java
@GET
@Path("/name/{name}/export")
@Produces({MediaType.TEXT_PLAIN + "; charset=UTF-8"})
@Valid
@Operation(
    operationId = "exportTable",
    summary = "Export table in CSV format",
    responses = {
      @ApiResponse(
          responseCode = "200",
          description = "Exported csv with columns from the table",
          content =
              @Content(
                  mediaType = "application/json",
                  schema = @Schema(implementation = String.class)))
    })
```
openmetadata-service/src/main/java/org/openmetadata/service/resources/lineage/LineageResource.java:416
`exportLineage(...)` is annotated to produce plain text, but the OpenAPI `@ApiResponse` is documented as returning a `SearchResponse` JSON payload. Since the method returns a CSV `String`, update the documented response content/media type to `text/plain` (or `text/csv`) to avoid generating incorrect clients.
```java
@GET
@Path("/export")
@Produces({MediaType.TEXT_PLAIN + "; charset=UTF-8"})
@Operation(
    operationId = "exportLineage",
    summary = "Export lineage",
    responses = {
      @ApiResponse(
          responseCode = "200",
          description = "search response",
          content =
              @Content(
                  mediaType = "application/json",
                  schema = @Schema(implementation = SearchResponse.class)))
    })
```
openmetadata-service/src/main/java/org/openmetadata/service/resources/teams/UserResource.java:1701
`exportUsersCsv(...)` is annotated as producing plain text, but the OpenAPI `@ApiResponse` content is still declared as `application/json`. This makes the generated spec misleading for CSV consumers. Update the documented response `@Content(mediaType = ...)` to `text/plain` (or `text/csv`) to match the actual response body.
```java
@GET
@Path("/export")
@Produces({MediaType.TEXT_PLAIN + "; charset=UTF-8"})
@Valid
@Operation(
    operationId = "exportUsers",
    summary = "Export users in a team in CSV format",
    responses = {
      @ApiResponse(
          responseCode = "200",
          description = "Exported csv with user information",
          content =
              @Content(
                  mediaType = "application/json",
                  schema = @Schema(implementation = String.class)))
    })
```
Hi @harshach @PubChimps @pmbrull, please add the safe to test label.
Pull request overview
Copilot reviewed 21 out of 22 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
openmetadata-ui/src/main/resources/ui/src/components/UploadFile/UploadFile.tsx:48
`setUploading(false)` runs in the `finally` block immediately after calling `FileReader.readAsText(...)`, but `FileReader` is asynchronous. This makes the loader state inaccurate (it will flip back to false before `onload`/`onerror` fires). Move `setUploading(false)` into the `onload` and `onerror` handlers (and call `options.onSuccess`/`onError` if needed) so the UI reflects the actual read lifecycle.
```typescript
  reader.readAsText(options.file as Blob, 'utf-8');
} catch (error) {
  showErrorToast(error as AxiosError);
} finally {
  setUploading(false);
```
```typescript
displayName: '中文术语展示名',
description: '这是用于验证导入导出编码的中文描述。',
synonyms: '中文同义词;测试',
references: '参考;https://example.com/中文',
```
`references` includes a URL with raw non-ASCII characters (`https://example.com/中文`). In the GlossaryTerm schema, `termReference.endpoint` is `format: uri`, so validators may reject IRIs that are not RFC3986-encoded. To avoid a flaky/invalid test while still exercising Chinese text, keep Chinese in the reference name and percent-encode the URL path (or use an ASCII-only URL).
```diff
- references: '参考;https://example.com/中文',
+ references: '参考;https://example.com/%E4%B8%AD%E6%96%87',
```
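The reviewer's suggested encoding can be produced with the standard `encodeURIComponent`, which RFC3986-encodes the non-ASCII path segment while the surrounding reference name stays readable Chinese. The helper name here is illustrative.

```typescript
// Percent-encode a non-ASCII path segment so `format: uri` validators accept it.
// encodeReferenceUrl is an illustrative helper, not part of the PR.
function encodeReferenceUrl(base: string, segment: string): string {
  return base + encodeURIComponent(segment);
}

// encodeReferenceUrl('https://example.com/', '中文')
// yields the reviewer's suggested 'https://example.com/%E4%B8%AD%E6%96%87'
// because 中 is UTF-8 E4 B8 AD and 文 is E6 96 87.
```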
```diff
- expect(MockBlob).toHaveBeenCalledWith(['content'], {
+ expect(MockBlob).toHaveBeenCalledWith(['\uFEFFcontent'], {
    type: 'text/csv;charset=utf-8;',
  });
```
These tests overwrite `global.Blob` but never restore it. `jest.restoreAllMocks()` won't revert direct assignments, so the mocked `Blob` can leak into later tests/files and cause hard-to-debug failures. Capture the original `global.Blob` and restore it in `afterEach` (or use a spy/mocking approach that is automatically restored).
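The capture-and-restore pattern the comment asks for can be sketched generically as below. This is an assumption-level illustration (in a real Jest suite the capture would live in `beforeEach`/`afterEach`); the helper name and shape are not from the PR.

```typescript
// Restore a directly assigned global after a test body runs, since
// jest.restoreAllMocks() does not undo plain assignments to globals.
const g = globalThis as Record<string, unknown>;

function withMockedGlobal<T>(key: string, mock: unknown, run: () => T): T {
  const original = g[key];
  const hadKey = key in g;
  g[key] = mock;
  try {
    return run();
  } finally {
    // Put back exactly what was there before, including "not defined at all".
    if (hadKey) {
      g[key] = original;
    } else {
      delete g[key];
    }
  }
}
```

Tracking `hadKey` matters: restoring `undefined` is not the same as deleting the key, and a leftover `Blob` key could still shadow environment-provided globals in later test files.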
@Darshan3690 this requires a lot of test coverage.
Okay, I will update the PR; please add the safe to test label.
Hi @harshach @PubChimps, please add the safe to test label.
Code Review ✅ Approved — 2 resolved / 2 findings

Standardizes UTF-8 encoding for CSV imports by removing the redundant BOM double-prepend and cleaning up the accidentally committed large debug.json artifact.

- ✅ Quality: Accidentally committed 101K-line debug.json test artifact (resolved)
- ✅ Bug: BOM may be double-prepended if content already contains one (resolved)
@harshach @PubChimps @pmbrull, please review this PR.



PR Summary: Fix Chinese Character Garbling in CSV Import/Export (#25764)
Issue
Chinese and other non-ASCII characters were getting garbled during CSV import and export flows.
Root Cause
Encoding was not consistently enforced across the full pipeline.
What Changed
Backend updates
- Added `UTF8_BOM` constant and `stripUtf8Bom()` method
- Changed `@Produces` to `APPLICATION_JSON` where endpoints actually return JSON (TableResource, GlossaryResource, GlossaryTermResource)

Frontend updates
- CSV download MIME type set to `text/csv; charset=utf-8` (removed trailing semicolon)

Playwright updates
- Chinese glossary term test data: name `术语{uuid}`, display name `中文术语展示名`, description `这是用于验证导入导出编码的中文描述。`, synonyms `中文同义词;测试`
- Reference URL percent-encoded as `https://example.com/%E4%B8%AD%E6%96%87` (avoids URI validation errors)

Cleanup updates
- Removed the accidentally committed debug.json and added it to .gitignore to prevent future re-commit

Copilot Review Comment Fixes (Completed)
- `text/csv;charset=utf-8;` → `text/csv; charset=utf-8` (RFC-compliant)
- `@Produces(TEXT_PLAIN + "; charset=UTF-8")` → `@Produces(APPLICATION_JSON)` in: TableResource, GlossaryResource, GlossaryTermResource
Based on review comments, this PR includes:
Validation Status
✅ Passed Locally
✅ Test Coverage Added
✅ Code Quality
📝 Environment Note
Integration-test module compile/run requires local snapshot artifacts. All code updates are complete and validated. Full integration module execution depends on snapshot dependencies being available in CI or a fully bootstrapped local build.
Files Modified
Compatibility and Risk
Impact
CSV import/export now reliably preserves Chinese and other Unicode characters across backend and frontend workflows, including Excel-friendly CSV download behavior. The implementation is consistent across all entity types and follows RFC standards for media type declarations.
Merge Status: ✅ READY TO MERGE
b17a8ff264

Summary by Gitar
Updated `UploadFile.tsx` to import `Transi18next` from `../../utils/i18next/LocalUtil` instead of `../../utils/CommonUtils`.

This will update automatically on new commits.