# Choosing the optimal chunking strategy for Vectors
| Status | Proposed |
|---|---|
| RFC # | 18 |
| Author(s) | Sarfraaz Talat (sarfraaz.talat@oslash.com) |
| Updated | 2023-07-14 |
## Objective
We are seeing accuracy issues with the answers returned by the question answering solution we have implemented. This RFC aims to identify the root cause of the inaccuracies and propose a solution to fix it.
## Motivation
To answer questions based on a client's existing KnowledgeBase (we'll call it KB from here onwards), we first crawl the raw content from all the pages under the KB URL provided by the user. We then generate an embedding from the content of each full page and feed it into the vector database. At query time, we embed the question, find the most similar embeddings from the KB, and use the top 3 matches to produce the most relevant answer.
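For concreteness, a minimal sketch of this flow is shown below, assuming the 2023-era `openai` and `pinecone-client` SDKs. The index name, key handling, and helper names are illustrative, not our production code.

```python
import openai
import pinecone

openai.api_key = "OPENAI_API_KEY"
pinecone.init(api_key="PINECONE_API_KEY", environment="PINECONE_ENV")
index = pinecone.Index("kb-embeddings")  # placeholder index name


def embed(text: str) -> list[float]:
    """Generate an embedding for a piece of text."""
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return resp["data"][0]["embedding"]


def top_matches(question: str, k: int = 3):
    """Embed the question and return the k most similar KB vectors."""
    return index.query(vector=embed(question), top_k=k, include_metadata=True)
```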
The crucial part here is the chunking strategy: how we decide the optimal chunk of text to generate an embedding from. We started with a very naive approach and chose the chunk size based on the token limit of our embedding model. At that point we were using OpenAI's text-embedding-ada-002 model, which accepts up to 8192 tokens per embedding request, so after some trial and error we settled on 10000 characters as the chunk size. That became the main logic of our chunking strategy, and we started chunking content on the 10k-character limit.
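As a rough illustration, this chunking logic boils down to a fixed-size character split along these lines (function name and constant are illustrative):

```python
CHUNK_SIZE = 10_000  # characters; chosen by trial and error, not a property of the model


def chunk_page(content: str, size: int = CHUNK_SIZE) -> list[str]:
    """Split raw page content into fixed-size character chunks."""
    return [content[i:i + size] for i in range(0, len(content), size)]
```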
The issue with this approach is that the bigger the chunk grows, the more likely its embedding is to lose the context of the exact sentences or words inside it, and the more likely it is to capture generalised semantics instead of semantics specific to the content.
This was highlighted by one particular question we tried to answer through our QnA API as described above. The most specific answer did exist in a KB record, but it was buried in the ninth or tenth paragraph, and the title of the document wasn't about that topic specifically. As a result, that record ranked far lower than other documents that were broadly along the lines of the question asked but did not actually contain the answer to the specific question. So we ran a PoC: if we generate embeddings on smaller chunks and query against those in the vector DB, would the right record rank higher? It did, and it came in the top 3 matching documents, confirming our intuition that the bigger the chunk, the more likely it is to lose context and the less likely it is to rank highly. We achieved this by reducing the chunk size from 10k characters to 1k characters.
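The PoC roughly took the shape below: re-chunk the document at the smaller size, index the chunks, and check where the chunk containing the known answer ranks for the question. This sketch reuses the `embed`, `chunk_page`, and `index` helpers from earlier; the names and the exact Pinecone response handling are assumptions.

```python
def rank_of_answer(question: str, doc_id: str, content: str, answer_snippet: str,
                   chunk_size: int = 1_000, k: int = 10):
    """Re-chunk one document, index the chunks, and return the 1-based rank of
    the first matching chunk that contains the known answer snippet."""
    chunks = chunk_page(content, size=chunk_size)
    index.upsert(vectors=[
        (f"{doc_id}-{i}", embed(chunk), {"doc_id": doc_id, "chunk": i})
        for i, chunk in enumerate(chunks)
    ])
    res = index.query(vector=embed(question), top_k=k, include_metadata=True)
    for rank, match in enumerate(res.matches, start=1):
        if answer_snippet in chunks[int(match.metadata["chunk"])]:
            return rank
    return None  # the answer chunk did not make it into the top k
```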
## Design Proposal
We have established that chunk size is the main issue, and that we need to find the optimal chunk size for generating embeddings. We could do this by trial and error, but that would be very time consuming: we would need a process to evaluate each candidate chunk size, and the optimum may also vary across different kinds of knowledge bases. For now we have decided that a 1k-character chunk works fine, and we can research the optimal chunk size further if we run into issues with it; a sketch of what such an evaluation could look like follows below.
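If we do revisit chunk size later, the evaluation could be a small loop over candidate sizes and a curated set of question/answer pairs per KB. This is a hypothetical helper built on the `rank_of_answer` sketch above, not an existing tool.

```python
def evaluate_chunk_sizes(sizes, qa_pairs, docs, k=3):
    """Fraction of questions whose known answer lands in the top k, per chunk size.

    qa_pairs: list of (question, doc_id, answer_snippet) tuples curated per KB.
    docs: mapping of doc_id -> full page content.
    """
    results = {}
    for size in sizes:
        hits = sum(
            rank_of_answer(question, doc_id, docs[doc_id], answer_snippet,
                           chunk_size=size, k=k) is not None
            for question, doc_id, answer_snippet in qa_pairs
        )
        results[size] = hits / len(qa_pairs)
    return results
```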
Since we now create multiple chunks per document, storing only each chunk's content in the vector database may not give us the full context when preparing the final answer from the matching KB records. Storing the full document content inside each vector's metadata would cost a lot, and we would still be bound by the 40kb metadata limit per vector, which we might hit in our next big KB indexing. So we need to store the full content of the KB somewhere else and keep only the embeddings and some metadata in the vector database. We can store the full content in an OpenSearch index and store the document id in the vector database, so that we can fetch the full content from OpenSearch when we need it while preparing the answer.
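A sketch of the proposed write path, assuming the `opensearch-py` client alongside the Pinecone index and helpers sketched earlier; the host, index names, and field names are placeholders.

```python
from opensearchpy import OpenSearch

os_client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])  # placeholder host


def index_document(doc_id: str, content: str, chunk_size: int = 1_000) -> None:
    """Store full content in OpenSearch; keep only embeddings and light metadata in the vector DB."""
    os_client.index(index="kb-content", id=doc_id, body={"content": content})
    index.upsert(vectors=[
        (f"{doc_id}-{i}", embed(chunk), {"doc_id": doc_id, "chunk": i})
        for i, chunk in enumerate(chunk_page(content, size=chunk_size))
    ])
```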
## Drawbacks
We will be using OpenSearch as the storage layer for content, and the vector DB for embeddings and essential metadata only. There will be an additional cost for storing the content in OpenSearch that we would have to bear. However, we would also save significantly on Pinecone storage, since not storing large content blobs as metadata lets us fit more vectors in the same pod, so this is more of a tradeoff than a drawback.
This solution adds the overhead of an extra call to OpenSearch to fetch the full content for the matching vectors, but it will always be a single request and significantly faster than typical OpenSearch queries, since it is an exact match on document ids, the primary key for all records in OpenSearch. We can later add caching on top of this if we need it to be faster. For now, we expect an overhead of around 50ms on the first response of every QnA answer after introducing this change.
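For reference, the read path after this change would look roughly like the sketch below, with a single `mget` by document id against OpenSearch (helpers and index names as assumed earlier).

```python
def fetch_answer_context(question: str, k: int = 3) -> list[str]:
    """Query the vector DB, then fetch the full matching documents from OpenSearch in one mget call."""
    res = index.query(vector=embed(question), top_k=k, include_metadata=True)
    doc_ids = list({match.metadata["doc_id"] for match in res.matches})
    docs = os_client.mget(index="kb-content", body={"ids": doc_ids})
    return [d["_source"]["content"] for d in docs["docs"] if d.get("found")]
```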