Release: openai-python 1.33.0
A new minor version of the openai-python package was released late on Friday 7 June 2024, only a couple of days after the last minor release. This release adds a chunking_strategy argument to the methods for adding files to vector stores.
What is Chunking?
Chunking is the process of breaking a large piece of text down into smaller segments (or “chunks”). Choosing an appropriate chunk size helps to keep LLM results accurate and relevant. Ideally you want the chunks to be neither too small nor too large. If the chunks are too small then the LLM might lack the context surrounding a chunk that it needs to interpret it (although chunk overlap can help with this!). If the chunks are too large then the LLM might find it difficult to identify the relevant content within the chunk. As a general principle, if a chunk makes sense to a human without the surrounding context, then it should also make sense to the LLM. A good chunk size can be determined empirically, starting from large chunks and gradually decreasing the size until there’s a marked deterioration in results.
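To make the idea of chunk size and overlap concrete, here is a minimal sketch of a sliding-window chunker. It is only an illustration: whitespace-separated words stand in for model tokens, the function name chunk_tokens is made up for this example, and it is not how OpenAI chunks files internally.

```python
def chunk_tokens(tokens, max_chunk_size, overlap):
    """Split a list of tokens into overlapping windows (illustrative only)."""
    if max_chunk_size <= 0:
        raise ValueError("max_chunk_size must be positive")
    if not 0 <= overlap < max_chunk_size:
        raise ValueError("overlap must be at least 0 and less than max_chunk_size")

    # Each new chunk starts this many tokens after the previous one.
    step = max_chunk_size - overlap

    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_chunk_size])
        # Stop once a chunk has reached the end of the token list.
        if start + max_chunk_size >= len(tokens):
            break
    return chunks


# Assumption: words are used as a rough stand-in for tokens.
words = "the quick brown fox jumps over the lazy dog".split()
chunks = chunk_tokens(words, max_chunk_size=4, overlap=2)

for chunk in chunks:
    print(" ".join(chunk))
```

Each chunk shares its first two words with the end of the previous chunk, which is what gives neighbouring chunks some common context.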
Let’s take a quick look at how this works.
Create an OpenAI Client
import os
from openai import OpenAI
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")
client = OpenAI(api_key=OPENAI_API_KEY)
Create a Vector Store
If you haven’t already created a vector store then do so now.
store = client.beta.vector_stores.create(name="Test")
This is what the resulting vector store object looks like:
VectorStore(
  id='vs_T7xl1a13glOcHLK7Xzjn08DT',
  created_at=1717821463,
  file_counts=FileCounts(cancelled=0, completed=0, failed=0, in_progress=0, total=0),
  last_active_at=1717821463,
  metadata={},
  name='Test',
  object='vector_store',
  status='completed',
  usage_bytes=0,
  expires_after=None,
  expires_at=None
)
Assign the vector store ID to a variable.
VECTOR_STORE_ID = "vs_T7xl1a13glOcHLK7Xzjn08DT"
Upload File (Default Chunking)
First let’s upload a file using the default chunking strategy.
# Path of file to upload.
FILE_PATH = "pg844.txt"
with open(FILE_PATH, "rb") as f:
    file = client.beta.vector_stores.files.upload_and_poll(
        vector_store_id=VECTOR_STORE_ID,
        file=f
    )
The resulting file object looks like this:
VectorStoreFile(
  id='file-dxc8vvwhT5j3obhaaZw0WVdF',
  created_at=1717825354,
  last_error=None,
  object='vector_store.file',
  status='completed',
  usage_bytes=236235,
  vector_store_id='vs_T7xl1a13glOcHLK7Xzjn08DT',
  chunking_strategy=ChunkingStrategyStatic(
    static=ChunkingStrategyStaticStatic(
      chunk_overlap_tokens=400,
      max_chunk_size_tokens=800
    ),
    type='static'
  )
)
The default values for max_chunk_size_tokens and chunk_overlap_tokens mean that files are indexed by being split into 800-token chunks with 400-token overlap between consecutive chunks.
You get the same result if you explicitly set chunking_strategy to auto, which is the default strategy.
with open(FILE_PATH, "rb") as f:
    file = client.beta.vector_stores.files.upload_and_poll(
        vector_store_id=VECTOR_STORE_ID,
        file=f,
        chunking_strategy={"type": "auto"}
    )
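To get a feel for what the default settings imply, you can estimate how many chunks a file of a given length would produce. The sketch below assumes a simple sliding window in which each chunk starts max_chunk_size_tokens minus chunk_overlap_tokens tokens after the previous one. That is an assumption for illustration (the service's exact behaviour may differ) and estimate_chunk_count is a hypothetical helper, not part of the openai package.

```python
import math


def estimate_chunk_count(n_tokens, max_chunk_size_tokens=800, chunk_overlap_tokens=400):
    """Rough chunk count under an assumed sliding-window scheme."""
    if chunk_overlap_tokens < 0 or chunk_overlap_tokens >= max_chunk_size_tokens:
        raise ValueError("overlap must be at least 0 and less than the chunk size")
    if n_tokens <= 0:
        return 0
    if n_tokens <= max_chunk_size_tokens:
        return 1

    # Distance between the starts of consecutive chunks.
    step = max_chunk_size_tokens - chunk_overlap_tokens
    # One full first chunk, then one more chunk per step (or part of a step).
    return 1 + math.ceil((n_tokens - max_chunk_size_tokens) / step)


print(estimate_chunk_count(2000))            # default 800/400 settings
print(estimate_chunk_count(2000, 400, 200))  # smaller chunks, same overlap ratio
```

Under this assumption, halving both parameters roughly doubles the number of chunks for the same document.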
Upload File (Static Chunking)
Alternatively, you can specify static chunking and provide particular values for chunk_overlap_tokens and max_chunk_size_tokens.
with open(FILE_PATH, "rb") as f:
    file = client.beta.vector_stores.files.upload_and_poll(
        vector_store_id=VECTOR_STORE_ID,
        file=f,
        chunking_strategy={
            "type": "static",
            "static": {
                "chunk_overlap_tokens": 200,
                "max_chunk_size_tokens": 400
            }
        }
    )
And you can see the changes in the returned object.
VectorStoreFile(
  id='file-rnEXDvko5UEfW1DSKwelWawQ',
  created_at=1717825361,
  last_error=None,
  object='vector_store.file',
  status='completed',
  usage_bytes=332491,
  vector_store_id='vs_T7xl1a13glOcHLK7Xzjn08DT',
  chunking_strategy=ChunkingStrategyStatic(
    static=ChunkingStrategyStaticStatic(
      chunk_overlap_tokens=200,
      max_chunk_size_tokens=400
    ),
    type='static'
  )
)
Note that chunk_overlap_tokens and max_chunk_size_tokens reflect the specified values, and usage_bytes has changed because the file content has been chunked differently.
There are a couple of constraints on these parameters:
- max_chunk_size_tokens must be between 100 and 4096, while
- chunk_overlap_tokens must be non-negative (zero overlap is allowed) and not more than half of max_chunk_size_tokens.
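If you would rather catch bad values before making an API call, the constraints above are easy to check locally. This is a small sketch: static_chunking_strategy is a hypothetical helper (not part of the openai package) that validates the two values and returns a dictionary in the same shape as the one passed to chunking_strategy earlier.

```python
def static_chunking_strategy(max_chunk_size_tokens, chunk_overlap_tokens):
    """Validate the values and build a static chunking_strategy dictionary."""
    if not 100 <= max_chunk_size_tokens <= 4096:
        raise ValueError("max_chunk_size_tokens must be between 100 and 4096")
    if chunk_overlap_tokens < 0:
        raise ValueError("chunk_overlap_tokens must be non-negative")
    # Integer comparison avoids any floating point rounding on the "half" check.
    if chunk_overlap_tokens * 2 > max_chunk_size_tokens:
        raise ValueError(
            "chunk_overlap_tokens must not exceed half of max_chunk_size_tokens"
        )

    return {
        "type": "static",
        "static": {
            "chunk_overlap_tokens": chunk_overlap_tokens,
            "max_chunk_size_tokens": max_chunk_size_tokens,
        },
    }


strategy = static_chunking_strategy(max_chunk_size_tokens=400, chunk_overlap_tokens=200)
print(strategy)
```

The returned dictionary can then be passed as the chunking_strategy argument when uploading.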
💡 You can use client.beta.vector_stores.file_batches.upload_and_poll to upload multiple files in a batch.
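A batch upload might look something like the sketch below. The wrapper function upload_files_in_batch and the example file names are made up for illustration. It assumes that file_batches.upload_and_poll accepts vector_store_id and a files argument holding open binary file handles, so check the signature in your installed version of the package.

```python
from contextlib import ExitStack


def upload_files_in_batch(client, vector_store_id, paths):
    """Open every file, upload them as one batch and wait for processing."""
    with ExitStack() as stack:
        # ExitStack closes all of the opened files when the block exits.
        streams = [stack.enter_context(open(path, "rb")) for path in paths]
        return client.beta.vector_stores.file_batches.upload_and_poll(
            vector_store_id=vector_store_id,
            files=streams,
        )


# Example usage (hypothetical second file name; requires a real client):
# batch = upload_files_in_batch(client, VECTOR_STORE_ID, ["pg844.txt", "other.txt"])
# print(batch.status, batch.file_counts)
```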
List Files in Vector Store
List the files in the vector store.
client.beta.vector_stores.files.list(vector_store_id=VECTOR_STORE_ID)
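The value returned by list can be iterated over, so a short summary of each file is easy to put together. In this sketch summarise_files is a hypothetical helper, and it assumes each item has the id, status and usage_bytes attributes seen in the VectorStoreFile objects above.

```python
def summarise_files(client, vector_store_id):
    """Return (id, status, usage_bytes) for each file in the vector store."""
    files = client.beta.vector_stores.files.list(vector_store_id=vector_store_id)
    return [(f.id, f.status, f.usage_bytes) for f in files]


# Example usage (requires a real client and vector store):
# for file_id, status, usage_bytes in summarise_files(client, VECTOR_STORE_ID):
#     print(file_id, status, usage_bytes)
```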
Delete Vector Store
Finally, delete the vector store.
client.beta.vector_stores.delete(vector_store_id=VECTOR_STORE_ID)