# Tiktokenizer

![Tiktokenizer](https://user-images.githubusercontent.com/1443449/222597674-287aefdc-f0e1-491b-9bf9-16431b1b8054.svg)

---
**An online playground for token counting and inspection** — accurately count tokens for prompts using OpenAI’s [tiktoken](https://github.com/openai/tiktoken) and Hugging Face tokenizers, with visual segment-level breakdowns.

[Demo video](https://user-images.githubusercontent.com/1443449/222598119-0a5a536e-6785-44ad-ba28-e26e04f15163.mp4)

---

## Table of contents

- [What it does](#what-it-does)
- [Features](#features)
- [Supported models and encodings](#supported-models-and-encodings)
- [Tech stack](#tech-stack)
- [Project structure](#project-structure)
- [Getting started](#getting-started)
- [Environment variables](#environment-variables)
- [Scripts](#scripts)
- [How tokenization works](#how-tokenization-works)
- [API](#api)
- [Configuration](#configuration)
- [Development](#development)
- [Testing](#testing)
- [License and acknowledgments](#license-and-acknowledgments)

---

## What it does

Tiktokenizer lets you:

- **Count tokens** for any supported model or encoding before sending requests to OpenAI or other APIs, so you can stay within context limits and estimate cost.
- **Inspect token boundaries** — see exactly which substring corresponds to each token, with grapheme-aware segment highlighting (handles emoji and complex scripts).
- **Compare encodings** — switch between raw encodings (e.g. `cl100k_base`, `o200k_base`) and full model presets (e.g. GPT-4o, GPT-3.5-turbo) to see how tokenization differs.
- **Use chat-style formatting** — for chat models, build multi-turn conversations with system/user/assistant messages and see the exact token count of the serialized prompt (including special tokens like `<|im_start|>` and `<|im_end|>`).
- **Try open-source tokenizers** — run Hugging Face tokenizers (CodeLlama, Llama 3, Phi-2, Gemma, etc.) in the browser via [Transformers.js](https://huggingface.co/docs/transformers.js), with tokenizer files either pre-downloaded at build time or loaded from your deployment.

The app runs tokenization in the browser where possible (e.g. tiktoken for OpenAI, Transformers.js for Hugging Face), and falls back to server-side encoding when needed. The encode API can be used programmatically for automation or integration.

---

## Features

- **Real-time token count** as you type, with optional whitespace visualization (spaces, tabs, newlines).
- **Segment-level highlighting** — hover over a segment to see its token IDs and the exact text span; each segment is colored for clarity.
- **URL-driven model selection** — the chosen model/encoder is stored in the query string (`?model=...`), so you can share or bookmark a specific configuration.
- **Chat composer** for OpenAI chat models: add system/user/assistant messages and optional names; the serialized format (with chat special tokens) is tokenized so the count matches what the API will see.
- **Searchable model picker** — filter by name to quickly switch between dozens of models and encodings.
- **Pre-downloaded Hugging Face tokenizers** — `tokenizer.json` and `tokenizer_config.json` are fetched at build/dev time and served from `public/hf/`, so open-source tokenizers work without hitting Hugging Face on every load (gated models require `HF_API_KEY`).
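The URL-driven model selection can be sketched with the standard `URL`/`URLSearchParams` API. The helpers below are illustrative only (the app itself manages this through Next.js router state); only the `model` query parameter name comes from the feature above:

```typescript
// Hypothetical helpers showing how a model choice can round-trip
// through the query string, so a configuration is shareable by URL.
function modelToQuery(baseUrl: string, model: string): string {
  const url = new URL(baseUrl);
  url.searchParams.set("model", model);
  return url.toString();
}

function modelFromQuery(href: string, fallback = "gpt-4o"): string {
  return new URL(href).searchParams.get("model") ?? fallback;
}
```

Sharing the resulting URL restores the same tokenizer selection on load.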

---

## Supported models and encodings

### OpenAI encodings (raw)

Use these when you only care about the encoding, not a specific model:

| Encoding | Typical use |
|-------------|---------------------------------|
| `gpt2` | Legacy GPT-2 |
| `r50k_base` | Base BPE (e.g. Davinci, Codex) |
| `p50k_base` | Codex, text-davinci-* |
| `p50k_edit` | Edit models |
| `cl100k_base` | GPT-3.5-turbo, GPT-4, embeddings |
| `o200k_base` | GPT-4o |

### OpenAI chat models

- `gpt-4o`, `gpt-3.5-turbo`, `gpt-4`, `gpt-4-32k`, `gpt-4-1106-preview`

These use the correct encoding and chat special tokens (`<|im_start|>`, `<|im_end|>`, `<|im_sep|>`).
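The serialized chat format can be approximated as follows. This is an illustrative sketch of ChatML-style framing, not the app's actual implementation; the exact layout (and whether `<|im_sep|>` appears) varies by model family:

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
  name?: string;
}

// Illustrative sketch: wrap each message in <|im_start|>/<|im_end|>
// and prime the assistant turn at the end, ChatML-style.
function serializeChat(messages: ChatMessage[]): string {
  const parts = messages.map(
    (m) => `<|im_start|>${m.name ?? m.role}\n${m.content}<|im_end|>`
  );
  return parts.join("\n") + "\n<|im_start|>assistant\n";
}
```

Tokenizing this serialized string (with the chat special tokens enabled) is what makes the displayed count match what the API sees.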

### OpenAI legacy text and embedding models

Including `text-davinci-003`, `text-embedding-ada-002`, `text-embedding-3-small`, `text-embedding-3-large`, and other legacy text/embedding/code models defined in the codebase.
*(Note: some newer embedding models may be disabled with a “Model may be too new” message until support is added.)*

### Open-source (Hugging Face)

Tokenizers are loaded via [@xenova/transformers](https://github.com/xenova/transformers.js) from pre-downloaded files under `public/hf/`:

- **Code Llama:** `codellama/CodeLlama-7b-hf`, `codellama/CodeLlama-70b-hf`
- **Llama 3:** `meta-llama/Meta-Llama-3-8B`, `meta-llama/Meta-Llama-3-70B`
- **Others:** `microsoft/phi-2`, `google/gemma-7b`, `deepseek-ai/DeepSeek-R1`, `Qwen/Qwen2.5-72B`, `tiiuae/falcon-7b`, `01-ai/Yi-6B`, `openai/whisper-tiny`

Gated models require a Hugging Face API key (see [Environment variables](#environment-variables)).

---

## Tech stack

| Layer | Technology |
|-------------|------------|
| **Framework** | [Next.js 13](https://nextjs.org/) (Pages Router) |
| **UI** | [React 18](https://react.dev/), [Tailwind CSS](https://tailwindcss.com/), [Radix UI](https://www.radix-ui.com/) (Select, Popover, Checkbox), [cmdk](https://cmdk.paco.me/) (command palette), [Lucide](https://lucide.dev/) icons |
| **Styling** | [class-variance-authority](https://cva.style/), [clsx](https://github.com/lukeed/clsx), [tailwind-merge](https://github.com/dcastil/tailwind-merge) |
| **Data / API** | [tRPC](https://trpc.io/) v10, [TanStack React Query](https://tanstack.com/query), [Zod](https://zod.dev/) for validation |
| **Tokenizers** | [tiktoken](https://github.com/openai/tiktoken) (OpenAI), [@xenova/transformers](https://github.com/xenova/transformers.js) (Hugging Face) |
| **Text / segments** | [graphemer](https://github.com/orling/graphemer) for grapheme-aware splitting (emoji, etc.) |
| **Misc** | [superjson](https://github.com/blitz-js/superjson), [bignumber.js](https://mikemcl.github.io/bignumber.js/), [Vercel Analytics](https://vercel.com/analytics) |

The project follows the [T3 Stack](https://create.t3.gg/) (TypeScript, tRPC, Tailwind, Next.js) and uses shadcn/ui-style components with Radix primitives.

---

## Project structure

```
tiktokenizer/
├── public/
│   └── hf/                     # Pre-downloaded Hugging Face tokenizers (org/model/tokenizer.json, etc.)
├── src/
│   ├── components/             # Reusable UI (Button, Input, Select, Command, Popover, Checkbox)
│   ├── env.mjs                 # Server/client env validation (Zod)
│   ├── models/
│   │   ├── index.ts            # Model/encoding enums and helpers (AllOptions, isChatModel, etc.)
│   │   └── tokenizer.ts        # TiktokenTokenizer, OpenSourceTokenizer, createTokenizer()
│   ├── pages/
│   │   ├── _app.tsx            # App shell, React Query + tRPC providers
│   │   ├── index.tsx           # Main page: encoder select, editor, token viewer
│   │   └── api/
│   │       ├── trpc/[trpc].ts  # tRPC handler
│   │       └── v1/
│   │           ├── encode.ts   # REST: encode text by model or encoder
│   │           └── edge.ts     # Edge runtime demo (tiktoken WASM)
│   ├── sections/
│   │   ├── ChatGPTEditor.tsx   # Chat message composer for OpenAI chat models
│   │   ├── EncoderSelect.tsx   # Model/encoder dropdown with search
│   │   └── TokenViewer.tsx     # Token count + segment highlighting
│   ├── server/
│   │   └── api/                # tRPC router and root
│   ├── scripts/
│   │   └── download.ts         # Fetches HF tokenizer files into public/hf/
│   ├── styles/
│   │   └── globals.css
│   └── utils/
│       ├── api.ts
│       ├── cn.ts               # className helper
│       └── segments.ts         # Map tokens to text segments (tiktoken + Hugging Face)
├── .env.example
├── next.config.mjs             # Env validation, webpack async WASM
├── package.json
├── tailwind.config.cjs
└── tsconfig.json
```

- **`models/`** — Defines which models and encodings exist and how they map to tokenizers.
- **`tokenizer.ts`** — Implements `TiktokenTokenizer` (tiktoken) and `OpenSourceTokenizer` (Transformers.js), plus `createTokenizer(model)` which chooses and instantiates the right one.
- **`utils/segments.ts`** — Builds segment lists (text + token IDs) for the UI; uses graphemer so boundaries respect graphemes (e.g. emoji).
- **`scripts/download.ts`** — Run at `dev`/`build`; downloads `tokenizer.json` and `tokenizer_config.json` for each open-source model into `public/hf/`.

---

## Getting started

### Prerequisites

- **Node.js** 18+ (for Next.js 13 and current tooling)
- **Yarn** (recommended; the repo uses `yarn` in scripts)

### Installation

```bash
git clone https://github.com/dqbd/tiktokenizer.git
cd tiktokenizer
yarn install
```

Copy environment variables and set a Hugging Face API key if you need gated open-source models:

```bash
cp .env.example .env
# Edit .env and set HF_API_KEY (required for build/dev unless you skip validation — see below)
```

Run the app (this will run the download script first, then start Next.js):

```bash
yarn dev
```

Open [http://localhost:3000](http://localhost:3000). Select a model or encoding, type or paste text (or use the chat editor for chat models), and see the token count and segment view.

---

## Environment variables

Defined and validated in `src/env.mjs` (Zod). Server-side only unless prefixed with `NEXT_PUBLIC_`.

| Variable | Required | Description |
|----------|----------|-------------|
| `NODE_ENV` | Inferred | `development`, `test`, or `production`. Defaults to `development` if unset. |
| `HF_API_KEY` | Yes (by default) | [Hugging Face API token](https://huggingface.co/settings/tokens). Used by `download.ts` to fetch tokenizer files for gated models (e.g. Llama, Code Llama). |

To skip env validation (e.g. Docker builds or when not using open-source tokenizers), set:

```bash
SKIP_ENV_VALIDATION=1
```

Then you can run `yarn build` or `yarn dev` without setting `HF_API_KEY`. The app will still work for OpenAI models and encodings; open-source tokenizers that need the download step may fail if files are missing.

---

## Scripts

| Command | Description |
|---------|-------------|
| `yarn dev` | Runs `download.ts` (with env), then `next dev`. Use for local development. |
| `yarn build` | Runs `download.ts`, then `next build`. Use for production build. |
| `yarn start` | Runs `next start`. Serve a previously built app. |
| `yarn lint` | Runs `next lint` (ESLint). |

The download script writes into `public/hf/<org>/<model>/tokenizer.json` and `tokenizer_config.json`. If a file already exists, it is skipped. Remove files or the directory to re-download.
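The on-disk layout can be sketched as a simple path mapping. The helper below is illustrative only; the real logic lives in `src/scripts/download.ts`:

```typescript
import * as path from "node:path";

// Illustrative mapping from a model ID like "meta-llama/Meta-Llama-3-8B"
// to the two files served from public/hf/.
function tokenizerPaths(modelId: string): string[] {
  return ["tokenizer.json", "tokenizer_config.json"].map((file) =>
    path.posix.join("public", "hf", modelId, file)
  );
}
```

Because the `<org>/<model>` segment is preserved verbatim, the browser can request the same relative path the Hugging Face Hub would use.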

---

## How tokenization works

### Two backends

1. **OpenAI (tiktoken)**
   For OpenAI encodings and models, the app uses the [tiktoken](https://github.com/openai/tiktoken) library. Encodings are resolved via `get_encoding()` or `encoding_for_model()`. Chat models (e.g. GPT-4o, GPT-3.5-turbo) use the correct encoding and optional extra special tokens for chat (e.g. `cl100k_base` / `o200k_base` with `<|im_start|>`, `<|im_end|>`, `<|im_sep|>`).
   Tokenization runs in the main thread; for a minimal Edge example using tiktoken WASM, see `src/pages/api/v1/edge.ts`.

2. **Open-source (Hugging Face)**
   For open-source models, the app uses [@xenova/transformers](https://github.com/xenova/transformers.js) (`PreTrainedTokenizer`). Tokenizer files are loaded from `public/hf/` (populated by `download.ts`). On the client, `env.remoteHost` is set to the current origin so requests go to your server.
   Some models (e.g. Code Llama, Llama 2) use a leading `<s>` token that is stripped in the segment logic via `hackModelsRemoveFirstToken` so the highlighted segments align with the visible text.
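How `createTokenizer` chooses between the two backends can be sketched as a dispatch on the model name. This is an illustrative simplification (the actual selection in `src/models/tokenizer.ts` works from the typed model enums, not string matching):

```typescript
type Backend = "tiktoken" | "transformers.js";

// Hypothetical dispatch: OpenAI encodings and model names have no slash,
// while Hugging Face IDs follow the "org/name" convention.
function pickBackend(model: string): Backend {
  return model.includes("/") ? "transformers.js" : "tiktoken";
}
```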

### Segments

- **Purpose:** Show which substring corresponds to which token(s), including multi-token graphemes (e.g. emoji).
- **Implementation:**
  - **Tiktoken:** `utils/segments.ts` uses `getTiktokenSegments()`: encode with special tokens allowed (`"all"`), decode each token's bytes, and align them to the input's graphemes (via [graphemer](https://github.com/orling/graphemer)) to build segment boundaries.
  - **Hugging Face:** `getHuggingfaceSegments()` uses the tokenizer’s `convert_ids_to_tokens` and aligns to the same grapheme-split input.
- **UI:** `TokenViewer` shows each segment with a background color; hovering highlights the segment and can show whitespace (spaces, tabs, newlines) for debugging.

---

## API

### `POST /api/v1/encode`

Encodes text with a given model or encoder and returns token IDs and count. Useful for scripts or external tools.

**Request body (JSON):**

- **By encoder:** `{ "text": string, "encoder": "<encoding>" }`
`encoder` must be one of the OpenAI encoding names (e.g. `cl100k_base`, `o200k_base`).
- **By model:** `{ "text": string, "model": "<model>" }`
`model` must be one of the supported OpenAI or open-source model names.

**Response (JSON):**

```json
{
  "name": "cl100k_base",
  "tokens": [9906, 1917],
  "count": 2
}
```

- `name` — Encoding or tokenizer name.
- `tokens` — Array of token IDs.
- `count` — Length of `tokens`.

Validation is done with Zod; an invalid `encoder` or `model` is rejected with a parse error before any encoding happens.
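A minimal client for the endpoint might look like this. The `fetch` wrapper is an illustrative sketch that assumes a local deployment, and `parseEncodeResponse` only checks the response shape documented above:

```typescript
interface EncodeResponse {
  name: string;
  tokens: number[];
  count: number;
}

// Validate the documented response shape without an external library.
function parseEncodeResponse(data: unknown): EncodeResponse {
  const d = data as EncodeResponse;
  if (
    typeof d?.name !== "string" ||
    !Array.isArray(d?.tokens) ||
    d.count !== d.tokens.length
  ) {
    throw new Error("unexpected /api/v1/encode response shape");
  }
  return d;
}

// Illustrative caller; assumes the app is running on localhost:3000.
async function encode(text: string, encoder: string): Promise<EncodeResponse> {
  const res = await fetch("http://localhost:3000/api/v1/encode", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text, encoder }),
  });
  return parseEncodeResponse(await res.json());
}
```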

---

## Configuration

- **Next.js** (`next.config.mjs`):
  - Imports `src/env.mjs` for env validation unless `SKIP_ENV_VALIDATION` is set.
  - Enables `asyncWebAssembly` and `layers` in webpack for tiktoken WASM (e.g. the Edge route).
  - `i18n`: single locale, `en`.

- **Tailwind** (`tailwind.config.cjs`), **PostCSS** (`postcss.config.cjs`), **ESLint** (`.eslintrc.cjs`), **Prettier** (`prettier.config.cjs`) are standard for a T3/Next + Tailwind setup.

---

## Development

- **Adding a new OpenAI model/encoding:** Extend the Zod enums in `src/models/index.ts` and, if needed, add branch logic in `src/models/tokenizer.ts` (e.g. for new chat special tokens or encodings).
- **Adding an open-source model:** Add it to `openSourceModels` in `src/models/index.ts`. Run `yarn dev` or `yarn build` so `download.ts` fetches its tokenizer files into `public/hf/`. If the tokenizer uses a leading sentinel token (like `<s>`), add the model to `hackModelsRemoveFirstToken` in `src/models/index.ts` so segments align correctly.
- **Llama 3 revisions:** Some Llama 3 variants use a non-`main` revision; `tempLlama3HackGetRevision()` in `src/models/index.ts` maps them to the correct ref (e.g. `refs/pr/35`).

---

## Testing

Tests use [Vitest](https://vitest.dev/). Segment logic is covered in `src/utils/segments.test.ts`. Run tests with:

```bash
yarn test
```

(If a test script is not in `package.json`, add one, e.g. `"test": "vitest"`.)

---

## License and acknowledgments

- **License:** [MIT](LICENSE). Copyright (c) 2023 Tat Dat Duong.
- **Author:** [dqbd](https://duong.dev)
- **Sponsorship:** Thanks to [Diagram](https://diagram.com/) for sponsorship and guidance.
- **Thanks:** [T3 Stack](https://create.t3.gg/), [shadcn/ui](https://github.com/shadcn/ui), [openai/tiktoken](https://github.com/openai/tiktoken), [Transformers.js](https://github.com/xenova/transformers.js).