# tiktoken
tiktoken is a fast BPE tokeniser for use with OpenAI's models.
The open source version of `tiktoken` can be installed from PyPI (`pip install tiktoken`). The tokeniser API is documented in `tiktoken/core.py`.

Example code using `tiktoken` can be found in the OpenAI Cookbook.

## Performance

`tiktoken` is between 3-6x faster than a comparable open source tokeniser. Performance was measured on 1GB of text using the GPT-2 tokeniser, comparing against Hugging Face's `GPT2TokenizerFast`.

## Getting help

Please post questions in the issue tracker. If you work at OpenAI, make sure to check the internal documentation or feel free to contact @shantanu.

## What is BPE anyway?

Language models don't see text like you and I do; instead, they see a sequence of numbers (known as tokens). Byte pair encoding (BPE) is a way of converting text into tokens. It has a couple of desirable properties:

1. It's reversible and lossless, so you can convert tokens back into the original text.
2. It works on arbitrary text, even text that is not in the tokeniser's training data.
3. It compresses the text: the token sequence is shorter than the bytes corresponding to the original text. On average, in practice, each token corresponds to about 4 bytes.
4. It attempts to let the model see common subwords. For instance, "ing" is a common subword in English, so BPE encodings will often split "encoding" into tokens like "encod" and "ing" (instead of e.g. "enc" and "oding"). Because the model then sees the "ing" token again and again in different contexts, it helps models generalise and better understand grammar.

`tiktoken` contains an educational submodule that is friendlier if you want to learn more about the details of BPE, including code that helps visualise the BPE procedure.

## Extending tiktoken

You may wish to extend `tiktoken` to support new encodings. There are two ways to do this:

1. **Create your `Encoding` object exactly the way you want and simply pass it around.**
2. **Use the plugin mechanism to register your `Encoding` objects with `tiktoken`.** This is only useful if you need `tiktoken.get_encoding` to find your encoding; otherwise prefer option 1.
To do this, you'll need to create a namespace package under `tiktoken_ext`. Lay out your project as a directory containing a `tiktoken_ext/` folder and a `setup.py`, making sure to omit `tiktoken_ext/__init__.py` (namespace packages must not have one). Inside `tiktoken_ext`, your module should contain a variable named `ENCODING_CONSTRUCTORS`. This is a dictionary from an encoding name to a function that takes no arguments and returns arguments that can be passed to `tiktoken.Encoding` to construct that encoding. For an example, see `tiktoken_ext/openai_public.py`; for precise details, see `tiktoken/registry.py`. Then simply `pip install` your package and you should be able to use your custom encodings! Make sure **not** to use an editable install.
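Concretely, a plugin project might look like the sketch below. The names `my_tiktoken_extension`, `my_encodings.py`, and `my_encoding` are placeholders; the `tiktoken_ext` namespace and the `ENCODING_CONSTRUCTORS` variable are what the plugin mechanism looks for:

```python
# Project layout (note: no __init__.py under tiktoken_ext/):
#
#   my_tiktoken_extension/
#   ├── tiktoken_ext/
#   │   └── my_encodings.py
#   └── setup.py

# --- setup.py ---
from setuptools import setup, find_namespace_packages

setup(
    name="my_tiktoken_extension",
    # find_namespace_packages picks up tiktoken_ext even without an __init__.py.
    packages=find_namespace_packages(include=["tiktoken_ext*"]),
    install_requires=["tiktoken"],
)

# --- tiktoken_ext/my_encodings.py ---
def my_encoding():
    # Returns keyword arguments for tiktoken.Encoding. Here we just clone an
    # existing encoding under a new name, purely for illustration.
    import tiktoken

    base = tiktoken.get_encoding("cl100k_base")
    return {
        "name": "my_encoding",
        "pat_str": base._pat_str,
        "mergeable_ranks": base._mergeable_ranks,
        "special_tokens": base._special_tokens,
    }

# Maps encoding name -> zero-argument constructor function.
ENCODING_CONSTRUCTORS = {"my_encoding": my_encoding}
```

After a regular (non-editable) `pip install ./my_tiktoken_extension`, `tiktoken.get_encoding("my_encoding")` should find the new encoding via the `tiktoken_ext` namespace package.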