Selecting a Vector Database for Enhanced Semantic Search
| Status | (Proposed / Accepted / Implemented / Obsolete) |
|---|---|
| RFC # | 13 |
| Author(s) | Sarfraaz Talat (sarfraaz.talat@oslash.com) |
| Updated | 2023-06-16 |
Objective
This RFC aims to find a robust vector database solution that efficiently manages and retrieves high-dimensional document vectors, enhancing our semantic search capabilities. The ideal solution must provide fast, accurate retrieval, handle customer data segregation, support metadata storage, and offer efficient search with filtering. It should also be scalable and cost-efficient, require minimal maintenance, and come with dependable support and clear documentation. We are particularly interested in options that are actively maintained and well regarded by the community, and we evaluate several candidates to find the one that best fits these requirements.
Motivation
In our quest to provide a cutting-edge assistant service, we aim to enable efficient and precise query answering based on extensive customer documentation. The richness and extent of this documentation make it crucial to adopt a strategy that transcends traditional text search. With text-based search, limitations like exact vocabulary matches, inability to capture semantic nuances, and sensitivity to syntactical differences could potentially hamper the effectiveness of the service. Furthermore, simple text search might struggle with language ambiguity, as it does not account for the context in which words are used. User queries can range widely and may not adhere strictly to the terminology or phrasing present in the documentation. Therefore, we require a solution that promotes semantic-based search, capable of extracting relevant document sections, even when the user's language doesn't mirror the documentation's vocabulary or phrasing perfectly.
To accomplish this, we propose converting our text data into vector embeddings. These embeddings can capture the semantic essence of the text data and make it possible to perform similarity-based searches. We aim to index and query these vectors using an efficient vector database.
The chosen vector database must meet the following criteria:
- Performance: The solution must retrieve the nearest matching vectors swiftly and accurately, aiming for a response time of less than 100ms.
- Data Segregation: The database should support separate namespaces or indexes to prevent customer data overlap.
- Metadata Handling: The solution must support the storage of additional data (metadata) alongside vectors.
- Search Efficiency: The database should provide efficient filtering during nearest neighbor searches based on the metadata fields without substantial performance degradation.
- Scalability: The system must maintain performance when handling a large number of vectors in a single index. For benchmarking purposes, we consider a large dataset to be 1 million vectors.
- Economic Efficiency: The total cost, including the database and hosting, should be affordable while storing 1.3M vectors (approximately 100 customers), with 400k write operations (an x/4 approximation based on our data) and 10M read operations per month (100k users, 50k monthly active, each making 200 requests per month), at a peak of 100 queries per second (QPS) (roughly 6,000 active users in a given minute; see the sizing sketch after this list), with p99 latency below 100ms.
- Maintenance: After initial development and deployment, the ongoing maintenance of the solution should ideally not exceed 2 hours per week.
- Support and Community: The chosen solution should have a reliable support line or a thriving community to help us troubleshoot and overcome potential obstacles.
- Documentation: Clear and well-maintained documentation is vital, along with client libraries for Node.js and Python.
- Active Development: If the solution is open-source, it should have an active group of maintainers and contributors.
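As a quick sanity check of the load figures in the economic-efficiency item, here is a back-of-the-envelope sketch; the inputs come from the list above, the arithmetic is ours:

```python
# Back-of-the-envelope check of the economic-efficiency figures above.
monthly_active_users = 50_000
requests_per_user = 200                     # per user per month
monthly_reads = monthly_active_users * requests_per_user
print(f"reads/month: {monthly_reads:,}")    # 10,000,000

# ~6,000 active users in a peak minute, assuming ~1 query per user per minute.
peak_qps = 6_000 / 60
print(f"peak QPS: {peak_qps:.0f}")          # 100
```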
With these requirements in mind, the objective of this RFC is to evaluate multiple vector database solutions and identify the one best suited to our unique needs.
Design Proposal
Pinecone is a managed vector database, so we do not need to spend much time on setup or infrastructure. We can create projects and base indexes from their dashboard, allocate pods to an index, and start using it. It also provides client libraries for Node.js and Python, so we can easily integrate it with our existing codebase. The documentation is extensive, at least for the features we use and care about, so that's good too.
They provide indexes, with namespaces on each index. One limitation is that each index resides on a separate machine, so a separate index per customer would be prohibitively expensive. Fortunately, namespaces guarantee the data separation we require, so we can create a single index and use one namespace per customer to maintain complete data segregation.
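As a minimal sketch of this single-index, namespace-per-customer setup using the Python client (index name, namespace, and ids are illustrative placeholders; the calls follow the pinecone-client API as we understand it):

```python
import pinecone

# Placeholder credentials from the Pinecone dashboard.
pinecone.init(api_key="YOUR_API_KEY", environment="us-west1-gcp")

# One shared index, sized for OpenAI embeddings (1536 dimensions).
pinecone.create_index("docs", dimension=1536, metric="cosine", pod_type="p1.x1")
index = pinecone.Index("docs")

# A namespace per customer keeps tenant data segregated within the index.
embedding = [0.0] * 1536  # stand-in for a real OpenAI embedding
index.upsert(
    vectors=[("page-1#chunk-0", embedding, {"url": "https://acme.example/page-1"})],
    namespace="customer-acme",
)
```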
They support metadata, which we can use to store additional information alongside each vector, and they support filtering based on metadata values. The only issue is that metadata allows a very limited set of data types: string, int, float, bool, and list of strings. We cannot store complex JSON data in it, but that should not be a big issue at the moment.
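A nearest-neighbour query scoped to one customer and filtered on a metadata field would then look roughly like this (continuing the sketch above; the filter field name is illustrative):

```python
# Top-5 neighbours for a query embedding, restricted to one customer's
# namespace and to vectors whose metadata matches the filter.
results = index.query(
    vector=embedding,                       # query embedding (1536 floats)
    top_k=5,
    namespace="customer-acme",
    filter={"source": {"$eq": "website"}},  # illustrative metadata field
    include_metadata=True,
)
for match in results.matches:
    print(match.id, match.score, match.metadata)
```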
Scaling is zero-downtime as per their documentation, so we can add pods to the index as and when required, directly from the dashboard. From a maintenance perspective this also looks good, since we need not do any heavy lifting to manage and scale the infrastructure.
The support response-time SLA for our plan ranges from 4 hours (critical production issues) to 2 days (maximum for any query), which seems pretty good. They also have a forum; it does not seem very active as of now, but I assume the support team provides some help there.
They suggest using p1 pods for low-latency requirements, at around 70 USD per month each. We mostly query with top_k=5; Pinecone estimates ~30 QPS for a single p1 instance at top_k=10. Throughput scales in multiples of pods by adding replica pods to the same index, so to reach 100 QPS we would need roughly 3 p1 instances (given we use top_k=5 rather than their top_k=10 estimate), costing around 210 USD per month. Very roughly, a single instance at 30 QPS should handle a peak of 1,000 active users making 100 queries each over an hour, and we can add replicas as required to reach 100 QPS.
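In sizing terms (the 30 QPS/replica figure is Pinecone's estimate quoted above; `configure_index` is our understanding of the client call for changing replica count):

```python
import pinecone  # assumes pinecone.init(...) as in the earlier sketch

# Pinecone estimates ~30 QPS per p1 replica at top_k=10; since we query with
# top_k=5, we assume each replica does at least that and budget 3 replicas.
replicas = 3
monthly_cost_usd = replicas * 70            # ~210 USD/month at 70 USD/pod
print(replicas, monthly_cost_usd)

# Replicas can be changed later without downtime, per their docs:
pinecone.configure_index("docs", replicas=replicas)
```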
In terms of storage, a single p1 instance can hold 1 million vectors of 768 dimensions. Since we are using OpenAI embeddings, which have 1536 dimensions, we can halve that estimate to roughly 500k vectors without any metadata. They give no indication of exactly how metadata size affects the number of vectors we can store, but based on their vector-storage estimate we appear to get roughly 3 GB of storage per p1 pod. Assuming that 3 GB limit, we can approximate how many vectors a p1 pod can hold:
- Without metadata: 3,221,225,472 bytes (3 GB) / 6,144 bytes per vector (1536 dimensions × 4 bytes per float) = 524,288, roughly 525k vectors.
- With ~15 KB metadata per vector (10 KB for our 10k-character text chunks plus 5 KB of other metadata): 3,221,225,472 / (6,144 + 15,360) = 3,221,225,472 / 21,504 = 149,796, roughly 150k vectors.
- With ~1 KB metadata per vector (storing only necessary metadata): 3,221,225,472 / (6,144 + 1,024) = 3,221,225,472 / 7,168 = 449,389, roughly 450k vectors.
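The same arithmetic as a script, for tweaking the assumptions (the 3 GB usable storage per p1 pod is our inference from Pinecone's published sizing, not an official number):

```python
POD_BYTES = 3 * 1024**3       # assumed ~3 GB usable storage per p1 pod
VECTOR_BYTES = 1536 * 4       # 1536-dim embedding, 4 bytes per float

def pod_capacity(metadata_bytes: int) -> int:
    """Vectors that fit in one pod at a given per-vector metadata size."""
    return POD_BYTES // (VECTOR_BYTES + metadata_bytes)

print(pod_capacity(0))          # 524288 -> ~525k vectors, no metadata
print(pod_capacity(15 * 1024))  # 149796 -> ~150k vectors, 15 KB metadata
print(pod_capacity(1 * 1024))   # 449389 -> ~450k vectors, 1 KB metadata
```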
The only catch with storing bare-minimum metadata is that we need one additional query to OpenSearch (or wherever the actual data is stored) to fetch the full content for each vector. This adds latency to the request; it is the tradeoff we make to store more vectors in a single instance (optimising cost).
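The two-hop lookup would look roughly like this, assuming the full chunk text lives in an OpenSearch index named `documents` keyed by the same chunk id (names and host are illustrative):

```python
from opensearchpy import OpenSearch

opensearch = OpenSearch(hosts=["https://localhost:9200"])  # placeholder host

# Hop 1: nearest neighbours from Pinecone, ids + minimal metadata only.
results = index.query(vector=embedding, top_k=5, namespace="customer-acme")

# Hop 2: fetch the full chunk text from OpenSearch by id (the extra latency).
chunks = [
    opensearch.get(index="documents", id=match.id)["_source"]
    for match in results.matches
]
```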
These estimates are for the p1 pod; with an s1 instance (discussed below) the same estimates can be multiplied by 4.
Now for some rough estimates of how much data we will have per customer: we crawled www.oslash.com to a depth of 5, which yielded 320 pages totalling around 40 MB of content. That is about 41,943,040 bytes, i.e. roughly the same number of characters, which translates to 131,072 characters per page. Based on the current token limits of the embedding model, we split text into chunks of 10,000 characters before encoding each chunk into a vector. So for our 320 pages we would have 320 × 131,072 / 10,000 ≈ 4,194 vectors in total. If we estimate the average customer website at around 1,000 pages of similar size, we would store around 13,000 vectors per customer.
With these estimates, we can store around 450,000 / 13,000 ≈ 35 customers' data in a single p1 instance if we store only necessary metadata, and around 150,000 / 13,000 ≈ 11-12 customers' data per p1 instance if we store full data (embedding + 15 KB metadata) for each vector.
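In script form, using the rounded figures above (the 1,000-page average site is an assumption):

```python
chars_per_page = 131_072   # ~40 MB over 320 crawled pages
chunk_chars = 10_000       # current chunk size before embedding

vectors_per_customer = round(1_000 * chars_per_page / chunk_chars)
print(vectors_per_customer)   # ~13,107, rounded to ~13k in the text

print(450_000 / 13_000)   # ~34.6 -> roughly 35 customers, minimal metadata
print(150_000 / 13_000)   # ~11.5 -> 11-12 customers, full metadata
```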
If we are willing to trade performance for storage, we could use an s1 instance, which holds 4x as many vectors as a p1 at exactly the same cost. For now we want < 100ms latency from Pinecone, so we should go with the p1 instance.
TL;DR for Pinecone
We can start with a single p1 instance for now, based on our limited storage needs and the < 100ms performance requirement. That gives us around 30 QPS of throughput and capacity for roughly 12 customers with full data, or 35 customers with embeddings and only the necessary metadata, at around 70 USD per month.
The ideal solution for storing 1.3M vectors at 100 QPS under 100ms latency would require storing the data inside the index, meaning 1.3M / 150k = 9 p1 instances at around 630 USD per month. That is the cost of the ideal solution at a scale of 100 customers, but we can start with a single p1 instance and scale as required.
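And the scale-out cost, using the 150k-vectors-per-pod estimate from above:

```python
import math

total_vectors = 1_300_000                    # ~100 customers, full metadata
pods = math.ceil(total_vectors / 150_000)    # 9 p1 pods
print(pods, pods * 70)                       # 9 pods, ~630 USD/month
```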
Drawbacks
- Pinecone is costly; cheaper solutions are available with comparable or even better claimed performance (we have not verified this ourselves).
- Pinecone is 100% managed, and the machine configuration for running the database offers very limited options, so we cannot tune it much to get performance that is also cost-optimised for our needs.
- Pinecone's managed offering is available in only a few regions, so a multi-replica setup located close to customers may not be possible in some regions (the closest region to India is Singapore).
Alternatives
There are many alternatives to Pinecone; Qdrant and Weaviate are the more popular ones among them. We are going with Pinecone because there are no major red flags and we have worked with it before, so development will be faster. We will evaluate other solutions when we have more time and resources, and publish a further RFC if a change of vector database is needed.