A modular, production-ready vector search engine for IKEA products using Qdrant, CLIP, and OpenAI embeddings. Supports both text and image similarity search with a clean, extensible architecture.
- Text-to-text, image-to-image, and text-to-image search
- Dual embedding backends: CLIP for images and text, OpenAI for text-only embeddings
- Clean separation of concerns with pluggable components
- Comprehensive error handling, logging, and configuration
- Command-line interface
- Efficient processing of large datasets
- Full integration with Qdrant vector database
```
src/vector_search/
├── core/                  # Core functionality
│   ├── search_engine.py   # Main search engine
│   ├── embedders.py       # CLIP and OpenAI embedders
│   └── qdrant_client.py   # Qdrant operations
├── data/                  # Data loading and processing
│   └── product_loader.py  # Product data utilities
├── utils/                 # Utilities
│   ├── config.py          # Configuration management
│   └── logger.py          # Logging setup
└── cli/                   # Command-line interface
    └── main.py            # CLI entry point
```
Install with Poetry:

```
cd qdrant
poetry install
```

Or with pip:

```
cd qdrant
pip install -e .
```

For web scraping functionality:

```
poetry install --extras scraping
```

Create a `.env` file in the project root:
```
QDRANT_URL=http://localhost:6333
QDRANT_API_KEY=your_qdrant_api_key
OPENAI_API_KEY=your_openai_api_key
TEXT_COLLECTION=ikea_products
IMAGE_COLLECTION=furniture_images
BATCH_SIZE=32
DEFAULT_LIMIT=10
DEFAULT_THRESHOLD=0.7
```
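The `Config` class referenced later in this README lives in `src/vector_search/utils/config.py` and is not shown here; the following is a hypothetical sketch of how it might map these variables onto attributes, assuming plain `os.getenv` lookups with the defaults above:

```python
import os


class Config:
    """Hypothetical sketch; the real class is in src/vector_search/utils/config.py."""

    QDRANT_URL = os.getenv("QDRANT_URL", "http://localhost:6333")
    QDRANT_API_KEY = os.getenv("QDRANT_API_KEY", "")
    OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "")
    TEXT_COLLECTION = os.getenv("TEXT_COLLECTION", "ikea_products")
    IMAGE_COLLECTION = os.getenv("IMAGE_COLLECTION", "furniture_images")
    BATCH_SIZE = int(os.getenv("BATCH_SIZE", "32"))
    DEFAULT_LIMIT = int(os.getenv("DEFAULT_LIMIT", "10"))
    DEFAULT_THRESHOLD = float(os.getenv("DEFAULT_THRESHOLD", "0.7"))

    @classmethod
    def validate(cls) -> None:
        # Fail fast when a required secret is missing.
        if not cls.OPENAI_API_KEY:
            raise ValueError("OPENAI_API_KEY is not set")
```

A loader like this is what makes `Config.validate()` in the Python examples below fail early instead of surfacing an authentication error mid-run.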
The main entry point is the `vector-search` command:

```bash
# Build text embeddings from JSON file
poetry run vector-search build-text --source json --input-file scripts/out/ikea_products.json

# Build image embeddings from existing Qdrant collection
poetry run vector-search build-image --source qdrant --source-collection ikea_products

# Search by text
poetry run vector-search search-text --query "modern white sofa"

# Search by image
poetry run vector-search search-image --query "https://example.com/sofa.jpg"

# List available collections
poetry run vector-search list-collections
```

The engine can also be used directly from Python:

```python
from vector_search import VectorSearchEngine, ProductLoader

# Initialize search engine
search_engine = VectorSearchEngine(
    qdrant_url="http://localhost:6333",
    openai_api_key="your_key"
)

# Load products
products = ProductLoader.load_from_json("products.json")

# Build embeddings
search_engine.build_text_embeddings(products, "ikea_products")
search_engine.build_image_embeddings(products, "furniture_images")

# Search
results = search_engine.search_by_text("modern sofa", "ikea_products")
similar_images = search_engine.search_by_image("sofa.jpg", "furniture_images")
```

Build text embeddings using OpenAI or CLIP:
```
poetry run vector-search build-text [OPTIONS]

Options:
  --source {json,qdrant}      Source of products
  --input-file PATH           Input JSON file (for json source)
  --source-collection TEXT    Source collection (for qdrant source)
  --collection TEXT           Target collection name
  --batch-size INTEGER        Batch size for processing
```

Build image embeddings using CLIP:
```
poetry run vector-search build-image [OPTIONS]

Options:
  --source {json,qdrant}      Source of products
  --input-file PATH           Input JSON file (for json source)
  --source-collection TEXT    Source collection (for qdrant source)
  --collection TEXT           Target collection name
  --batch-size INTEGER        Batch size for processing
```

Search using text queries:
```
poetry run vector-search search-text [OPTIONS]

Options:
  --query TEXT        Search query
  --collection TEXT   Collection to search
  --limit INTEGER     Number of results
  --threshold FLOAT   Similarity threshold
  --use-clip          Use CLIP instead of OpenAI
```

Search using image URLs:
```
poetry run vector-search search-image [OPTIONS]

Options:
  --query TEXT        Image URL
  --collection TEXT   Collection to search
  --limit INTEGER     Number of results
  --threshold FLOAT   Similarity threshold
```

List available Qdrant collections:
```
poetry run vector-search list-collections
```

The package is organized into four modules:

- `core/`: Core search engine functionality
- `data/`: Data loading and processing utilities
- `utils/`: Configuration and logging utilities
- `cli/`: Command-line interface
To add a new embedder:

- Create a new embedder class inheriting from `BaseEmbedder`
- Implement the `get_embedding()` method
- Add it to the search engine initialization

To add a new data source:

- Add a new method to `ProductLoader`
- Update the CLI to support the new source
- Add appropriate validation
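The embedder steps above can be sketched as follows. Note that the `BaseEmbedder` interface shown here is an assumption (the real one is in `src/vector_search/core/embedders.py`), and `HashEmbedder` is a toy stand-in for an actual model:

```python
from abc import ABC, abstractmethod
from typing import List


class BaseEmbedder(ABC):
    """Assumed interface; the real base class lives in core/embedders.py."""

    @abstractmethod
    def get_embedding(self, text: str) -> List[float]:
        ...


class HashEmbedder(BaseEmbedder):
    """Toy embedder illustrating the extension steps; not a real model."""

    def __init__(self, dim: int = 8):
        self.dim = dim

    def get_embedding(self, text: str) -> List[float]:
        # Deterministic per-character bucketing stands in for a model call.
        vec = [0.0] * self.dim
        for i, ch in enumerate(text):
            vec[i % self.dim] += ord(ch)
        # L2-normalize so scores are comparable under cosine similarity.
        norm = sum(v * v for v in vec) ** 0.5 or 1.0
        return [v / norm for v in vec]
```

Anything satisfying the `get_embedding()` contract can then be passed to the search engine at initialization, alongside the built-in CLIP and OpenAI embedders.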
- Batch Processing: Configurable batch sizes for optimal performance
- GPU Support: Automatic GPU detection for CLIP models
- Memory Efficient: Streaming data processing for large datasets
- Parallel Processing: Concurrent embedding generation where possible
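The batch-processing point above boils down to chunking the product stream before each embedding call. A generic illustration (not the engine's actual code):

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int = 32) -> Iterator[List[T]]:
    """Yield fixed-size chunks so embeddings are requested in batches.

    Works on any iterable, so large datasets can be streamed without
    materializing everything in memory first.
    """
    it = iter(items)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch
```

Lowering `batch_size` (the `--batch-size` / `BATCH_SIZE` setting) trades throughput for a smaller memory footprint, which is the usual first fix for CUDA out-of-memory errors below.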
- CUDA out of memory: Reduce batch size or use CPU
- Image download failures: Check image URLs and network connectivity
- Qdrant connection errors: Verify QDRANT_URL and authentication
- Missing API keys: Ensure required environment variables are set
Enable debug logging by setting the log level:

```python
import logging

logging.getLogger("vector_search").setLevel(logging.DEBUG)
```

A complete end-to-end workflow from the command line:

```bash
# 1. Build text embeddings
poetry run vector-search build-text --source json --input-file data/products.json

# 2. Build image embeddings
poetry run vector-search build-image --source qdrant --source-collection ikea_products

# 3. Search by text
poetry run vector-search search-text --query "modern white sofa" --limit 5

# 4. Search by image
poetry run vector-search search-image --query "https://example.com/sofa.jpg" --limit 5
```

The same workflow from Python:

```python
from vector_search import VectorSearchEngine, ProductLoader, Config

# Load configuration
Config.validate()

# Initialize search engine
engine = VectorSearchEngine(
    qdrant_url=Config.QDRANT_URL,
    qdrant_api_key=Config.QDRANT_API_KEY,
    openai_api_key=Config.OPENAI_API_KEY
)

# Load and process products
products = ProductLoader.load_from_json("products.json")
products = ProductLoader.filter_products_with_images(products)

# Build embeddings
engine.build_text_embeddings(products, "ikea_products")
engine.build_image_embeddings(products, "furniture_images")

# Search
text_results = engine.search_by_text("modern sofa", "ikea_products")
image_results = engine.search_by_image("sofa.jpg", "furniture_images")
```