[python] Unicode data generates error upon write to dataframe #415

@bkmartinjr

Description

Creating a SOMADataFrame with string columns works fine, but writing a table whose string column contains non-ASCII (Unicode) characters raises an error. There also appears to be no unit test covering this case, which would be a nice addition.

See #420 for an example unit test (currently marked xfail).

Test case:

import numpy as np
import pyarrow as pa
import pandas as pd
import tiledbsoma as soma

soma_df = soma.SOMADataFrame("./test_dataframe")

df = pd.DataFrame(data={
  'soma_rowid': np.arange(2, dtype=np.int64),
  'soma_joinid': np.arange(2, dtype=np.int64),
  'unicode': ['\N{LATIN CAPITAL LETTER E}\N{COMBINING CIRCUMFLEX ACCENT}', 'a \N{GREEK CAPITAL LETTER DELTA} test'],
  'ascii': ['aa', 'bbb'],
})
tbl = pa.Table.from_pandas(df)
print(tbl.schema)

soma_df.create(schema=tbl.schema)
print(soma_df.schema)

soma_df.write(tbl)

Output:

soma_rowid: int64
soma_joinid: int64
unicode: string
ascii: string
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 718
soma_rowid: int64
soma_joinid: int64
unicode: large_string
ascii: large_string
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
File tiledb/libtiledb.pyx:4617, in tiledb.libtiledb._setitem_impl_sparse()

UnicodeEncodeError: 'ascii' codec can't encode character '\u0302' in position 1: ordinal not in range(128)

During handling of the above exception, another exception occurred:

TileDBError                               Traceback (most recent call last)
Cell In [18], line 20
     17 soma_df.create(schema=tbl.schema)
     18 print(soma_df.schema)
---> 20 soma_df.write(tbl)

File ~/projects/TileDB-SOMA/apis/python/src/tiledbsoma/soma_dataframe.py:263, in SOMADataFrame.write(self, values)
    260 if self._get_is_sparse():
    261     # sparse write
    262     with self._tiledb_open("w") as A:
--> 263         A[rowids] = attr_cols_map
    264 else:
    265     # TODO: This was a quick thing to bootstrap some early ingestion tests but needs more thought.
    266     # In particular, rowids needn't be either zero-up or contiguous.
    267     assert len(rowids) > 0

File tiledb/libtiledb.pyx:4691, in tiledb.libtiledb.SparseArrayImpl.__setitem__()

File tiledb/libtiledb.pyx:4619, in tiledb.libtiledb._setitem_impl_sparse()

TileDBError: Attr's dtype is "ascii" but attr_val contains invalid ASCII characters
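
The first exception in the traceback can be reproduced without TileDB, which may help when writing the unit test. The sketch below assumes, as the traceback suggests, that the write path attempts to encode each value as ASCII; the helper names `first_non_ascii` and `ascii_encode_error` are hypothetical and not part of tiledbsoma or TileDB-Py.

```python
# The two strings from the 'unicode' column of the test case above.
values = [
    "\N{LATIN CAPITAL LETTER E}\N{COMBINING CIRCUMFLEX ACCENT}",
    "a \N{GREEK CAPITAL LETTER DELTA} test",
]


def first_non_ascii(s):
    """Return (index, character) of the first non-ASCII character, or None."""
    for i, ch in enumerate(s):
        if ord(ch) > 127:
            return i, ch
    return None


def ascii_encode_error(s):
    """Return the UnicodeEncodeError message for s, or None if s is pure ASCII."""
    try:
        s.encode("ascii")
    except UnicodeEncodeError as exc:
        return str(exc)
    return None


for v in values:
    # Each value fails an ASCII encode but encodes cleanly as UTF-8.
    print(repr(v), first_non_ascii(v), ascii_encode_error(v), v.encode("utf-8"))
```

For the first value, the offending character is the combining circumflex at position 1, matching the `'\u0302' in position 1` reported in the traceback. Both values encode without error as UTF-8, so the failure appears to come from the attribute being treated as ASCII rather than from the data itself.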
