Creating a dataframe with a `string` column works fine, but writing to it fails with a `TileDBError`: the attribute is treated as ASCII, so non-ASCII values are rejected. There also appears to be no unit test covering this, which would be a nice addition.
See #420 for an example unit test (currently marked `xfail`).
Test case:
import numpy as np
import pyarrow as pa
import pandas as pd
import tiledbsoma as soma

soma_df = soma.SOMADataFrame("./test_dataframe")

df = pd.DataFrame(data={
    'soma_rowid': np.arange(2, dtype=np.int64),
    'soma_joinid': np.arange(2, dtype=np.int64),
    'unicode': ['\N{LATIN CAPITAL LETTER E}\N{COMBINING CIRCUMFLEX ACCENT}', 'a \N{GREEK CAPITAL LETTER DELTA} test'],
    'ascii': ['aa', 'bbb']
})

tbl = pa.Table.from_pandas(df)
print(tbl.schema)
soma_df.create(schema=tbl.schema)
print(soma_df.schema)

soma_df.write(tbl)
Output:
soma_rowid: int64
soma_joinid: int64
unicode: string
ascii: string
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 718
soma_rowid: int64
soma_joinid: int64
unicode: large_string
ascii: large_string
---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
File tiledb/libtiledb.pyx:4617, in tiledb.libtiledb._setitem_impl_sparse()
UnicodeEncodeError: 'ascii' codec can't encode character '\u0302' in position 1: ordinal not in range(128)
During handling of the above exception, another exception occurred:
TileDBError Traceback (most recent call last)
Cell In [18], line 20
     17 soma_df.create(schema=tbl.schema)
     18 print(soma_df.schema)
---> 20 soma_df.write(tbl)

File ~/projects/TileDB-SOMA/apis/python/src/tiledbsoma/soma_dataframe.py:263, in SOMADataFrame.write(self, values)
    260 if self._get_is_sparse():
    261     # sparse write
    262     with self._tiledb_open("w") as A:
--> 263         A[rowids] = attr_cols_map
    264 else:
    265     # TODO: This was a quick thing to bootstrap some early ingestion tests but needs more thought.
    266     # In particular, rowids needn't be either zero-up or contiguous.
    267     assert len(rowids) > 0
File tiledb/libtiledb.pyx:4691, in tiledb.libtiledb.SparseArrayImpl.__setitem__()
File tiledb/libtiledb.pyx:4619, in tiledb.libtiledb._setitem_impl_sparse()
TileDBError: Attr's dtype is "ascii" but attr_val contains invalid ASCII characters
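
The inner `UnicodeEncodeError` can be reproduced without TileDB at all, which may help when writing the unit test. The sketch below is plain Python and only illustrates the encoding step; it assumes nothing about how `tiledbsoma` or TileDB-Py encode strings internally beyond what the traceback shows. `try_encode` is a hypothetical helper introduced just for this illustration.

```python
# Same two values as the 'unicode' column in the test case above.
values = [
    "\N{LATIN CAPITAL LETTER E}\N{COMBINING CIRCUMFLEX ACCENT}",
    "a \N{GREEK CAPITAL LETTER DELTA} test",
]


def try_encode(value: str, codec: str):
    """Hypothetical helper: return the encoded bytes, or the exception on failure."""
    try:
        return value.encode(codec)
    except UnicodeEncodeError as exc:
        return exc


# Encoding as ASCII fails for both values; encoding as UTF-8 succeeds.
ascii_results = [try_encode(v, "ascii") for v in values]
utf8_results = [try_encode(v, "utf-8") for v in values]

for value, as_ascii, as_utf8 in zip(values, ascii_results, utf8_results):
    print(repr(value))
    print("  ascii:", repr(as_ascii))
    print("  utf-8:", repr(as_utf8))
```

The first value fails at position 1 (the combining circumflex, `'\u0302'`), matching the message in the traceback, while both values encode cleanly as UTF-8. That suggests the column ends up backed by an ASCII attribute rather than a UTF-8 one, though the exact place where that dtype is chosen is not confirmed here.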