feat(collection): add get_embeddings_by_id method#70
feat(collection): add get_embeddings_by_id method#70lpbirch wants to merge 7 commits intoAlayaDB-AI:mainfrom
Conversation
76cd8fc to
e6c3d6e
Compare
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
e6c3d6e to
c6c2a74
Compare
There was a problem hiding this comment.
Pull request overview
Adds a new low-level API on Collection to fetch embeddings by external IDs (preserving input order) and introduces unit tests to validate correct ordering and missing-ID behavior, as groundwork for future MMR/VectorStore features.
Changes:
- Added
Collection.get_embeddings_by_id(ids)to retrieve stored vectors by ID in input order and raise on missing IDs. - Added a new test module covering successful roundtrip retrieval and missing-ID error handling.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| python/src/alayalite/collection.py | Implements get_embeddings_by_id on Collection. |
| python/tests/test_collection_get_embedding_by_id.py | Adds tests for order preservation and missing-ID error behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # this function is mainly for MMR purpose,add by liangpeiran | ||
| def get_embeddings_by_id(self, ids: List[str]) -> List[List[float]]: | ||
| """ | ||
| Retrieve embeddings (vectors) from the underlying index by external ids. |
There was a problem hiding this comment.
The comment above get_embeddings_by_id includes a non-ASCII comma and personal attribution ("add by ..."). This is likely to be noisy in a public API module and can be removed or replaced with a neutral rationale in the docstring (e.g., "Used by MMR retrieval").
| # this function is mainly for MMR purpose,add by liangpeiran | |
| def get_embeddings_by_id(self, ids: List[str]) -> List[List[float]]: | |
| """ | |
| Retrieve embeddings (vectors) from the underlying index by external ids. | |
| def get_embeddings_by_id(self, ids: List[str]) -> List[List[float]]: | |
| """ | |
| Retrieve embeddings (vectors) from the underlying index by external ids. | |
| This method is primarily used by MMR-based retrieval algorithms. |
| if not ids: | ||
| return [] | ||
|
|
||
| _assert(self.__index_py is not None, "Index is not initialized yet") |
There was a problem hiding this comment.
The docstring says this method raises RuntimeError when the index is not initialized, but the _assert(...) helper raises ValueError (see common._assert). Either update the docstring (and any public API docs) to reflect ValueError, or raise RuntimeError explicitly here to match the documented contract.
| _assert(self.__index_py is not None, "Index is not initialized yet") | |
| if self.__index_py is None: | |
| raise RuntimeError("Index is not initialized yet") |
| for outer_id in ids: | ||
| inner_id = self.__outer_inner_map[outer_id] | ||
| vec = self.__index_py.get_data_by_id(inner_id) | ||
| vectors.append(np.asarray(vec, dtype=np.float32).tolist()) |
There was a problem hiding this comment.
This unconditionally converts stored vectors to float32, which can silently change precision (e.g., float64 -> float32) and ignores the index's configured dtype (self.__index_py.get_dtype()). Consider returning the native dtype (or casting to self.__index_py.get_dtype()), or clearly documenting that the API always returns float32.
| vectors.append(np.asarray(vec, dtype=np.float32).tolist()) | |
| dtype = None | |
| if hasattr(self.__index_py, "get_dtype"): | |
| index_dtype = self.__index_py.get_dtype() | |
| if index_dtype is not None: | |
| dtype = index_dtype | |
| if dtype is not None: | |
| arr = np.asarray(vec, dtype=dtype) | |
| else: | |
| arr = np.asarray(vec) | |
| vectors.append(arr.tolist()) |
| from alayalite.collection import Collection | ||
|
|
There was a problem hiding this comment.
Other tests in python/tests import Collection from the package root (from alayalite import Collection). Using the same import style here would keep test code consistent and avoids relying on internal module paths.
| """ | ||
| Tests for Collection.get_embeddings_by_id. | ||
| """ |
There was a problem hiding this comment.
The test module filename is test_collection_get_embedding_by_id.py (singular) while the API under test is get_embeddings_by_id (plural). Renaming the file to match the method name would make it easier to discover/grep related tests.
Description
This PR adds and validates the
get_embeddings_by_idAPI in theCollectionclass, enabling retrieval of stored embeddings by a list of IDs while preserving input order.It also updates and fixes the corresponding unit tests to fully comply with the project’s linting and CI requirements.
This change is part of the groundwork for upcoming MMR and async VectorStore support.
Related Issue
N/A
Type of Change
Changes Made
Added
get_embeddings_by_idmethod toCollectionKeyErrorwhen any requested ID does not existAdded and updated unit tests for
get_embeddings_by_idFixed test module docstring to satisfy
pylint(C0114)Ensured code passes
ruff,pylint, and existing CI checksTesting
The changes were tested locally using the project’s standard test and lint workflow.
make lint make testChecklist
make lintand fixed any issuesmake test)Screenshots (if applicable)
N/A
Additional Notes
This PR focuses on establishing a reliable low-level embedding retrieval API, which will be reused by higher-level features such as Maximal Marginal Relevance (MMR) search and async VectorStore integration in subsequent PRs.