Building Scalable RAG Systems with Vector Databases
Learn how we architected Rivet's RAG pipeline to handle millions of documents with sub-100ms query times using advanced vector embeddings and intelligent caching strategies.
Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for building AI applications that need to access and reason over large amounts of domain-specific knowledge. At ElseBlock Labs, we've built Rivet, a production-grade RAG system that processes millions of documents while maintaining sub-100ms response times. In this deep dive, we'll share the architecture decisions, optimization techniques, and lessons learned from building and scaling our RAG infrastructure.
The Challenge: Scale Meets Speed
When we started building Rivet, we faced several critical challenges:
- Volume: Processing and indexing over 10 million documents from various sources
- Velocity: Maintaining query response times under 100ms for 95% of requests
- Variety: Handling diverse document formats including PDFs, HTML, Markdown, and structured data
- Veracity: Ensuring accurate retrieval with minimal hallucination
Traditional approaches using simple semantic search weren't sufficient. We needed a sophisticated architecture that could handle enterprise-scale requirements while delivering consumer-grade performance.
Architecture Overview
Our RAG system consists of several key components working in harmony:
1. Document Processing Pipeline
The ingestion pipeline is built on Apache Kafka for reliable, distributed processing:
```python
# Imports assume LangChain-style components; any equivalent splitter/embedder works.
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter


class DocumentProcessor:
    def __init__(self, vector_store):
        self.chunker = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200,
            separators=["\n\n", "\n", ".", " ", ""]
        )
        self.embedder = OpenAIEmbeddings(model="text-embedding-3-large")
        self.vector_store = vector_store  # Qdrant-backed store, injected by the caller

    async def process_document(self, document: Document):
        # Extract text based on document type (format-specific logic lives elsewhere)
        text = await self.extract_text(document)
        # Smart chunking with context preservation
        chunks = self.chunker.split_text(text)
        # Generate embeddings in batches (async variant of embed_documents)
        embeddings = await self.embedder.aembed_documents(chunks)
        # Store in vector database
        await self.vector_store.add_documents(chunks, embeddings)
```
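The class above only shows the processing logic; Kafka feeds it. As a rough illustration of how a worker might pull documents off a topic and hand them to DocumentProcessor, here is a sketch using aiokafka — the topic name, broker address, and message format are assumptions for the example, not our actual configuration, and it reuses the Document and DocumentProcessor definitions from the snippet above.

```python
import json

from aiokafka import AIOKafkaConsumer  # assumption: any async Kafka client would do


async def run_worker(processor: DocumentProcessor):
    # Topic, broker address, and consumer group are placeholders for illustration.
    consumer = AIOKafkaConsumer(
        "documents.ingest",
        bootstrap_servers="kafka:9092",
        group_id="rag-ingestion-workers",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    await consumer.start()
    try:
        async for message in consumer:
            # Assumes each message deserializes into the fields Document expects.
            document = Document(**message.value)
            await processor.process_document(document)
    finally:
        await consumer.stop()
```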
Key innovations in our processing pipeline:
- Adaptive Chunking: We dynamically adjust chunk sizes based on document structure and content type (sketched below)
- Context Preservation: Each chunk maintains metadata about its position and surrounding context
- Parallel Processing: Documents are processed in parallel across multiple workers with automatic load balancing
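To make the adaptive chunking point concrete, here is a minimal sketch of selecting chunk parameters per content type before splitting. The size table and content-type labels are illustrative assumptions rather than our production values, and the splitter import assumes LangChain.

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Illustrative defaults; real values should come from evaluation, not this table.
CHUNK_SIZES = {
    "markdown": (1200, 200),   # prose benefits from larger chunks
    "html": (1000, 200),
    "code": (600, 100),        # keep functions/classes mostly intact
    "table": (400, 0),         # rows are largely self-contained
}


def build_splitter(content_type: str) -> RecursiveCharacterTextSplitter:
    chunk_size, overlap = CHUNK_SIZES.get(content_type, (1000, 200))
    return RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=overlap,
        separators=["\n\n", "\n", ".", " ", ""],
    )
```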
2. Vector Database Architecture
We chose Qdrant as our primary vector database for several reasons:
- Native support for multiple vector spaces
- Efficient HNSW (Hierarchical Navigable Small World) indexing
- Built-in filtering and metadata support
- Horizontal scalability through sharding
Our vector database schema:
```typescript
interface VectorDocument {
  id: string;
  vector: number[];
  payload: {
    text: string;
    source: string;
    document_id: string;
    chunk_index: number;
    total_chunks: number;
    metadata: {
      title?: string;
      author?: string;
      date?: string;
      tags?: string[];
      category?: string;
    };
    context: {
      previous_chunk?: string;
      next_chunk?: string;
    };
  };
}
```
3. Hybrid Search Strategy
Pure semantic search alone isn't always optimal. We implemented a hybrid approach combining:
- Dense Retrieval: Vector similarity search for semantic understanding
- Sparse Retrieval: BM25 for keyword matching and exact terms
- Re-ranking: Cross-encoder models for final relevance scoring
```python
class HybridRetriever:
    async def retrieve(self, query: str, k: int = 10):
        # Dense retrieval using embeddings
        dense_results = await self.vector_search(query, k=k * 2)
        # Sparse retrieval using BM25
        sparse_results = await self.keyword_search(query, k=k * 2)
        # Reciprocal Rank Fusion
        combined = self.reciprocal_rank_fusion(
            dense_results,
            sparse_results,
            weights=[0.7, 0.3]
        )
        # Re-rank with cross-encoder
        reranked = await self.cross_encoder.rerank(
            query,
            combined[:k * 2]
        )
        return reranked[:k]
```
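The reciprocal_rank_fusion step above is where the two ranked lists get merged. A minimal sketch of weighted RRF, assuming each result carries a stable id attribute and both lists are ordered best-first:

```python
from collections import defaultdict


def reciprocal_rank_fusion(dense_results, sparse_results, weights=(0.7, 0.3), k=60):
    """Weighted reciprocal rank fusion over two ranked result lists.

    k=60 is the smoothing constant commonly used in the RRF literature;
    the weights scale each list's contribution.
    """
    scores = defaultdict(float)
    by_id = {}
    for weight, results in zip(weights, (dense_results, sparse_results)):
        for rank, result in enumerate(results, start=1):
            scores[result.id] += weight * (1.0 / (k + rank))
            by_id[result.id] = result
    ranked_ids = sorted(scores, key=scores.get, reverse=True)
    return [by_id[rid] for rid in ranked_ids]
```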
Performance Optimizations
1. Intelligent Caching
We implement multi-level caching to minimize latency:
- Query Cache: LRU cache for frequent queries (Redis); see the example after this list
- Embedding Cache: Pre-computed embeddings for common phrases
- Result Cache: Cached retrieval results with TTL based on update frequency
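As an illustration of the query-level layer, here is a hedged sketch of a Redis-backed cache in front of the retriever; the key scheme, TTL, and use of redis.asyncio are assumptions for the example.

```python
import hashlib
import json

import redis.asyncio as redis

cache = redis.Redis(host="localhost", port=6379)


async def cached_retrieve(query: str, retriever, ttl_seconds: int = 300):
    # Normalize the query so trivially different strings share a cache entry.
    key = "rag:query:" + hashlib.sha256(query.strip().lower().encode()).hexdigest()
    hit = await cache.get(key)
    if hit is not None:
        return json.loads(hit)
    results = await retriever.retrieve(query)
    # Assumes retrieval results are JSON-serializable (e.g. dicts).
    await cache.set(key, json.dumps(results), ex=ttl_seconds)
    return results
```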
2. Quantization and Compression
To reduce memory footprint and improve search speed:
```python
# Product quantization for vector compression
quantizer = ProductQuantizer(
    num_subvectors=32,
    bits_per_subvector=8
)
# Reduces vector size by 75% with minimal accuracy loss
compressed_vectors = quantizer.fit_transform(vectors)
```
3. Asynchronous Processing
All I/O operations are asynchronous, allowing efficient handling of concurrent requests:
```python
import asyncio


async def process_query(query: str):
    # Parallel execution of independent operations
    embedding_task = asyncio.create_task(embed_query(query))
    metadata_task = asyncio.create_task(extract_metadata(query))
    embedding = await embedding_task
    metadata = await metadata_task
    # Stream results as they become available
    async for result in retrieve_streaming(embedding, metadata):
        yield result
```
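For completeness, a caller consumes process_query like any async generator; the query string here is made up.

```python
async def main():
    async for result in process_query("how does chunk overlap affect recall?"):
        print(result)


asyncio.run(main())
```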
Scaling Strategies
1. Horizontal Scaling
Our infrastructure is designed for horizontal scaling:
- Vector DB Sharding: Documents are distributed across multiple Qdrant nodes based on content hash (see the sketch after this list)
- Load Balancing: HAProxy distributes queries across retrieval servers
- Auto-scaling: Kubernetes HPA scales pods based on CPU and memory metrics
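To illustrate the content-hash routing mentioned above, here is a minimal sketch. The node URLs are placeholders, and Qdrant's own distributed mode can handle shard placement in practice; this only shows the application-level idea.

```python
import hashlib

# Illustrative shard map; in practice this would come from service discovery.
QDRANT_NODES = [
    "http://qdrant-0:6333",
    "http://qdrant-1:6333",
    "http://qdrant-2:6333",
]


def shard_for_document(document_id: str) -> str:
    """Pick a Qdrant node deterministically from a hash of the document id."""
    digest = hashlib.sha256(document_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(QDRANT_NODES)
    return QDRANT_NODES[index]
```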
2. Index Optimization
We continuously optimize our vector indices:
```yaml
# Qdrant collection configuration
collections:
  documents:
    vector_size: 3072
    distance: Cosine
    hnsw_config:
      m: 16
      ef_construct: 200
      full_scan_threshold: 10000
    optimizers_config:
      deleted_threshold: 0.2
      vacuum_min_vector_number: 1000
      default_segment_number: 4
      memmap_threshold: 50000
      indexing_threshold: 20000
      flush_interval_sec: 5
      max_optimization_threads: 2
```
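The same parameters can also be applied when creating a collection programmatically. A sketch using the qdrant-client Python package; the collection name and client URL are assumptions for the example.

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

client.create_collection(
    collection_name="documents",
    vectors_config=models.VectorParams(size=3072, distance=models.Distance.COSINE),
    hnsw_config=models.HnswConfigDiff(m=16, ef_construct=200, full_scan_threshold=10000),
    optimizers_config=models.OptimizersConfigDiff(
        deleted_threshold=0.2,
        vacuum_min_vector_number=1000,
        default_segment_number=4,
        memmap_threshold=50000,
        indexing_threshold=20000,
        flush_interval_sec=5,
        max_optimization_threads=2,
    ),
)
```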
Monitoring and Observability
We track key metrics to ensure system health:
- Latency Metrics: P50, P95, P99 query response times
- Throughput: Queries per second, documents processed per hour
- Quality Metrics: Retrieval accuracy, relevance scores
- Resource Utilization: CPU, memory, and storage usage
```python
from prometheus_client import Gauge, Histogram

# Prometheus metrics
query_latency = Histogram(
    'rag_query_latency_seconds',
    'Query latency in seconds',
    buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0]
)
retrieval_quality = Gauge(
    'rag_retrieval_quality_score',
    'Average retrieval quality score'
)

@query_latency.time()
async def handle_query(query: str):
    results = await retriever.retrieve(query)
    retrieval_quality.set(calculate_quality_score(results))
    return results
```
Lessons Learned
After running Rivet in production for over a year, here are our key takeaways:
1. Chunking Strategy Matters
The way you chunk documents significantly impacts retrieval quality. We found that:
- Preserving paragraph boundaries improves context understanding
- Including overlap between chunks reduces information loss
- Adaptive chunk sizes based on content type yields better results
2. Hybrid Search is Essential
Pure semantic search fails for:
- Acronyms and technical terms
- Exact phrase matching requirements
- Queries with specific numerical values or codes
3. Metadata is Powerful
Rich metadata enables:
- Filtered searches by date range, category, or author (example below)
- Contextual ranking adjustments
- Better explainability of results
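For example, a filtered search can combine vector similarity with payload conditions. A minimal sketch against the payload schema shown earlier, assuming the qdrant-client package; the category and tag values are made up, and client and query_embedding are taken as already set up.

```python
from qdrant_client import models

# `client` as created in the collection setup above; `query_embedding` from the embedder.
hits = client.search(
    collection_name="documents",
    query_vector=query_embedding,
    query_filter=models.Filter(
        must=[
            models.FieldCondition(key="metadata.category", match=models.MatchValue(value="security")),
            models.FieldCondition(key="metadata.tags", match=models.MatchAny(any=["encryption", "compliance"])),
        ]
    ),
    limit=10,
)
```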
4. Continuous Evaluation is Critical
We continuously evaluate our system using:
- A curated dataset of 1,000+ query-document pairs
- A/B testing of retrieval strategies
- User feedback integration
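On the offline side, the evaluation itself can be as simple as recall@k and MRR over the labeled pairs; a minimal sketch, with the data shapes assumed for illustration:

```python
def evaluate(retriever_results: dict[str, list[str]],
             relevant: dict[str, set[str]],
             k: int = 10) -> dict[str, float]:
    """retriever_results maps query -> ranked doc ids; relevant maps query -> gold doc ids."""
    recalls, reciprocal_ranks = [], []
    for query, ranked in retriever_results.items():
        gold = relevant[query]
        # Recall@k: fraction of gold documents found in the top k results.
        recalls.append(len(set(ranked[:k]) & gold) / max(len(gold), 1))
        # MRR: reciprocal rank of the first relevant document.
        rr = 0.0
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in gold:
                rr = 1.0 / rank
                break
        reciprocal_ranks.append(rr)
    n = max(len(retriever_results), 1)
    return {"recall@k": sum(recalls) / n, "mrr": sum(reciprocal_ranks) / n}
```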
Future Directions
We're actively working on several enhancements:
1. Multi-modal RAG
Extending our system to handle images, tables, and diagrams alongside text.
2. Adaptive Retrieval
Using reinforcement learning to automatically adjust retrieval parameters based on query patterns.
3. Federated Search
Enabling secure search across distributed data sources without centralizing sensitive information.
Conclusion
Building a scalable RAG system requires careful consideration of architecture, performance optimization, and continuous refinement. The techniques we've shared here have enabled Rivet to serve millions of queries daily while maintaining excellent performance and accuracy.
Key takeaways for building your own RAG system:
- Start with a solid document processing pipeline
- Implement hybrid search from the beginning
- Design for horizontal scalability
- Invest in monitoring and observability
- Continuously evaluate and improve retrieval quality
The RAG landscape is evolving rapidly, and we're excited to continue pushing the boundaries of what's possible with retrieval-augmented generation. If you're building similar systems or have questions about our approach, we'd love to hear from you.
Want to learn more about our AI solutions? Contact our team to discuss how we can help you build scalable AI systems.