Tokens and Context Windows Explained: The Foundation of Context Engineering
Table of Contents
- Before You Can Engineer Context, You Need to Understand Tokens
- What Is a Token?
- What Is a Context Window?
- Context Window Sizes Across Providers (March 2026)
- Token Pricing: The Economics of Context
- How Different Content Types Tokenize
- The Token Budget Mental Model
- Key Takeaways
Before You Can Engineer Context, You Need to Understand Tokens
Every interaction with an AI model (every question, every tool call, every agent step) is measured in tokens. Tokens are the fundamental unit in which a language model reads input and produces output. They determine what the model can process, how much it costs, and how fast it responds. Yet most developers have only a vague understanding of what tokens actually are.
This post breaks down tokens, tokenization, context windows, and pricing: the foundational knowledge that everything in context engineering builds upon.
What Is a Token?
A token is a chunk of text that the model processes as a single unit. Tokens are not words. They are not characters. They are subword units determined by a tokenization algorithm.
How Tokenization Works
Modern LLMs use Byte Pair Encoding (BPE) or a similar subword tokenization algorithm (such as the unigram model offered by SentencePiece). BPE training works roughly like this:
- Start with individual characters (or bytes) as the initial vocabulary
- Find the most frequent pair of adjacent tokens in the training data
- Merge that pair into a new token and add it to the vocabulary
- Repeat until the vocabulary reaches a target size (typically 50,000-100,000+ tokens)
The result is a vocabulary where:
- Common words are single tokens: "the", "is", "and"
- Common subwords are single tokens: "ing", "tion", "pre"
- Uncommon words are split into multiple tokens: "counterintuitive" → "counter" + "intu" + "itive"
- Very rare strings become individual characters or bytes
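The merge loop described above can be sketched in a few lines of Python. This is a toy illustration only, not how any production tokenizer is implemented: the names `train_bpe` and `merge_pair` and the tiny word list are made up for the example, and real tokenizers typically work on bytes and add pre-tokenization rules.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words, num_merges):
    """Toy BPE trainer: returns the learned merges and the re-tokenized corpus."""
    # Each word starts as a tuple of single characters, weighted by frequency.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair across the corpus.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # nothing left to merge
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Apply the winning merge everywhere.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges, corpus


# Hypothetical mini-corpus: after two merges, "low" has become a single symbol.
merges, corpus = train_bpe(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)
print(list(corpus))
```

Running more merges on a larger corpus is what produces the mix of whole-word, subword, and single-character tokens listed above.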
The Practical Rules of Thumb
For English text:
- 1 token ≈ 4 characters (or roughly 0.75 words)
- 1 word ≈ 1.3 tokens on average
- 100 tokens ≈ 75 words
- 1,000 tokens ≈ 750 words (about 1.5 pages of text)
These are approximations. Actual token counts vary by:
- Language: Non-Latin scripts (Chinese, Japanese, Korean, Arabic) often require more tokens per character
- Content type: Code tends to use more tokens than prose due to special characters, indentation, and camelCase splitting
- Vocabulary: Technical jargon and rare words consume more tokens than common words
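For quick budgeting before a real tokenizer is available, the rules of thumb above can be turned into a rough estimator. This is a sketch under the stated assumptions (English prose, 4 characters per token, 1.3 tokens per word); the function names are illustrative, and the result should be treated as a ballpark figure, not a substitute for the provider's own token counter.

```python
CHARS_PER_TOKEN = 4     # rule of thumb for English prose
TOKENS_PER_WORD = 1.3   # rule of thumb for English prose


def estimate_tokens_by_chars(text: str) -> int:
    """Estimate tokens from character count."""
    return round(len(text) / CHARS_PER_TOKEN)


def estimate_tokens_by_words(text: str) -> int:
    """Estimate tokens from whitespace-separated word count."""
    return round(len(text.split()) * TOKENS_PER_WORD)


sample = "Tokens are the currency of context engineering."
print(estimate_tokens_by_chars(sample), estimate_tokens_by_words(sample))
```

The two estimates will usually differ a little; for code, JSON, or non-English text both are likely to undercount.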
Why Tokens Matter for Context Engineering
Tokens are the currency of context engineering. Every component of your context (system prompt, conversation history, tool definitions, retrieved documents, the model's response) is measured in tokens. Understanding token counts lets you:
- Budget context effectively: Know how much space each component consumes
- Predict costs: Calculate the dollar cost of each API call
- Optimize efficiency: Identify where tokens are being wasted
- Avoid silent degradation: Know when you're approaching limits that affect quality
What Is a Context Window?
The context window is the maximum number of tokens a model can process in a single inference call. It is the model's entire working memory: everything it can "see" and reason about at the moment of generating a response.
The context window includes:
- Input tokens: Everything you send to the model (system prompt, conversation history, tool definitions, retrieved content, the current message)
- Output tokens: Everything the model generates (its response, tool call requests, reasoning tokens)
Both input and output tokens share the same window. If a model has a 200K context window and you send 180K tokens of input, the model can only generate up to 20K tokens of output.
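The shared-window arithmetic is simple enough to check in code. The helper below is a minimal sketch; `output_budget` and its `output_cap` parameter (standing in for a model's separate maximum-output limit) are names chosen for this example.

```python
from typing import Optional


def output_budget(context_window: int, input_tokens: int,
                  output_cap: Optional[int] = None) -> int:
    """Tokens left for generation once the input has been placed in the window."""
    if input_tokens > context_window:
        raise ValueError("input alone exceeds the context window")
    remaining = context_window - input_tokens
    # A model's own max-output limit, if any, caps the result further.
    return remaining if output_cap is None else min(remaining, output_cap)


print(output_budget(200_000, 180_000))         # the example from the text
print(output_budget(200_000, 10_000, 32_000))  # limited by the output cap instead
```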
How It Works Mechanically
When you make an API call:
- Your entire input is tokenized
- The model processes all input tokens through its attention mechanism
- The model generates output tokens one at a time, each attending to all previous tokens (input + already-generated output)
- Generation stops when the model produces a stop token, hits max_tokens, or reaches the context window limit
Because LLMs are stateless, there is no persistent memory between API calls. Every call must include the entire conversation history. A 10-turn conversation means the full history is resent on every message: turn 1 is processed 10 times, turn 2 is processed 9 times, and so on.
This statelessness is why context management is so important. The model retains nothing between calls unless you explicitly include it in the next request.
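To see how quickly resent history adds up, the sketch below totals the input tokens processed across a whole conversation. It assumes a simplified model in which each list entry is the token count of one turn (user message plus reply) and every call resends all earlier turns; it ignores the system prompt and any caching. The function name is illustrative.

```python
from itertools import accumulate


def total_input_tokens(turn_tokens):
    """Sum of input tokens over all calls when each call resends the full history."""
    # accumulate() yields the size of the history sent on call 1, call 2, ...
    return sum(accumulate(turn_tokens))


turns = [100] * 10                 # ten turns of 100 tokens each
print(sum(turns))                  # tokens of unique content
print(total_input_tokens(turns))   # tokens actually processed as input
```

The gap between the two numbers grows quadratically with the number of turns, which is why long sessions become expensive.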
What Happens When You Exceed It
When input tokens exceed the context window:
- Anthropic (Claude): Returns an error before processing. You must reduce input to fit.
- OpenAI (GPT): Returns an error. Some older models silently truncated, but current models reject oversized inputs.
- Google (Gemini): Returns an error for inputs exceeding the limit.
There is no graceful degradation: you either fit within the window or the request fails.
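One common way to "reduce input to fit" is to drop the oldest messages first. The sketch below shows that single strategy under stated assumptions: `trim_to_fit` is a hypothetical helper, and `count_tokens` is a placeholder for whatever token counter you use (the provider's counting endpoint, a tokenizer library, or a heuristic). Summarizing old turns or pruning tool outputs are alternatives it does not cover.

```python
from typing import Callable, List


def trim_to_fit(messages: List[str], budget: int,
                count_tokens: Callable[[str], int]) -> List[str]:
    """Keep the most recent messages whose combined size fits within `budget`."""
    kept: List[str] = []
    used = 0
    for msg in reversed(messages):   # walk from newest to oldest
        cost = count_tokens(msg)
        if used + cost > budget:
            break                    # this message and everything older is dropped
        kept.append(msg)
        used += cost
    return list(reversed(kept))      # restore chronological order


# Using len() as a stand-in token counter purely for illustration.
print(trim_to_fit(["aaaa", "bbb", "cc"], budget=5, count_tokens=len))
```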
Context Window Sizes Across Providers (March 2026)
The context window race has been one of the most visible competitions in AI. Here's where things stand:
Anthropic (Claude)
| Model | Context Window | Max Output |
|---|---|---|
| Claude Opus 4.6 | 200K tokens (1M beta) | 32K tokens |
| Claude Sonnet 4.6 | 200K tokens (1M beta) | 16K tokens |
| Claude Haiku 4.5 | 200K tokens | 8K tokens |
Claude models support extended thinking, where reasoning tokens are billed as output tokens. With interleaved thinking (Claude 4+ models), the thinking budget can extend up to the entire context window.
OpenAI (GPT)
| Model | Context Window | Max Output |
|---|---|---|
| GPT-5.4 | 1M tokens (Codex) | 100K tokens |
| GPT-4.1 | 1M tokens | 32K tokens |
| GPT-4o | 128K tokens | 16K tokens |
| o3 / o4-mini | 200K tokens | 100K tokens |
OpenAI's reasoning models (o-series) generate internal reasoning tokens that consume context but are not shown to the user.
Google (Gemini)
| Model | Context Window | Max Output |
|---|---|---|
| Gemini 2.5 Pro | 1M tokens (2M coming) | 65K tokens |
| Gemini 2.5 Flash | 1M tokens | 65K tokens |
Google leads on raw context window size, with a 2M token window announced for Gemini 2.5 Pro.
Open Source
| Model | Context Window |
|---|---|
| Llama 3.1 405B | 128K tokens |
| Qwen3-Coder 480B | 256K tokens |
| Mistral Large | 128K tokens |
The Reality Behind the Numbers
These are advertised maximums. The effective context, where the model still performs reliably, is significantly lower:
- Gemini 2.5 Pro: Effective up to ~200K tokens
- GPT-5: Effective up to ~200K tokens
- Claude Sonnet 4 (Thinking): Effective in the 60-120K range
As a rule of thumb: plan for 60-70% of the advertised context window as your practical ceiling. Performance degrades gradually before that point, but the degradation becomes pronounced beyond it.
Token Pricing: The Economics of Context
Token pricing directly impacts context engineering decisions. Understanding the cost structure helps you optimize where it matters most.
Current Pricing (March 2026)
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Cached Input |
|---|---|---|---|
| Claude Opus 4.6 | $5.00 | $25.00 | $0.50 (90% off) |
| Claude Sonnet 4.6 | $3.00 | $15.00 | $0.30 (90% off) |
| Claude Haiku 4.5 | $0.80 | $4.00 | $0.08 (90% off) |
| GPT-5.4 | $2.50 | $10.00 | $1.25 (50% off) |
| GPT-4.1 | $2.00 | $8.00 | $1.00 (50% off) |
| GPT-4o | $2.50 | $10.00 | $1.25 (50% off) |
| Gemini 2.5 Pro | $1.25 | $10.00 | ~$0.05 (~96% off) |
| Gemini 2.5 Flash | $0.15 | $0.60 | ~$0.01 (~96% off) |
Prices have dropped approximately 80% from mid-2025 to early 2026.
The Critical Insight: Output Tokens Cost 4-6x More Than Input
This asymmetry has major implications for context engineering:
- A 500-token response at Claude Opus 4.6 output pricing ($25.00 per 1M tokens) costs $0.0125
- A 100-token response costs $0.0025, 5x cheaper for the same information if the output is constrained effectively
- Constraining output is often the highest-ROI optimization because output tokens are so much more expensive
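The per-response figures follow directly from the per-million-token price. A minimal sketch, assuming the Claude Opus 4.6 output price from the table above; `token_cost` is an illustrative helper name.

```python
def token_cost(tokens: int, price_per_mtok: float) -> float:
    """Dollar cost of `tokens` at a price quoted per 1M tokens."""
    return tokens * price_per_mtok / 1_000_000


OPUS_OUTPUT_PRICE = 25.00  # USD per 1M output tokens, taken from the pricing table

long_reply = token_cost(500, OPUS_OUTPUT_PRICE)
short_reply = token_cost(100, OPUS_OUTPUT_PRICE)
print(f"${long_reply:.4f} vs ${short_reply:.4f}")
```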
Cost at Scale: A Real Example
Consider an agentic coding assistant processing 10,000 requests per month:
Without optimization:
- Average input: 50,000 tokens/request → 500M tokens/month → $1,500/month (Sonnet 4.6, $3.00/MTok)
- Average output: 2,000 tokens/request → 20M tokens/month → $300/month ($15.00/MTok)
- Total: $1,800/month
With prompt caching (90% of input is stable):
- Cached input: 45,000 tokens/request at $0.30/MTok → $135/month
- Non-cached input: 5,000 tokens/request at $3.00/MTok → $150/month
- Output (constrained): 800 tokens/request → $120/month
- Total: $405/month, a 77.5% reduction
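A scenario like this is easy to recompute from its inputs, which helps when request volume or prices change. The sketch below assumes the Claude Sonnet 4.6 prices from the pricing table and ignores any extra charge a provider may apply when writing to the cache; `monthly_cost` and its parameters are names invented for this example.

```python
def monthly_cost(requests: int, input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float,
                 cached_fraction: float = 0.0, cached_price: float = 0.0) -> float:
    """Monthly USD cost; prices are per 1M tokens, token counts are per request."""
    cached = input_tokens * cached_fraction   # input tokens served from cache
    fresh = input_tokens - cached             # input tokens billed at full price
    per_request = (fresh * input_price
                   + cached * cached_price
                   + output_tokens * output_price) / 1_000_000
    return requests * per_request


baseline = monthly_cost(10_000, 50_000, 2_000, input_price=3.00, output_price=15.00)
optimized = monthly_cost(10_000, 50_000, 800, input_price=3.00, output_price=15.00,
                         cached_fraction=0.9, cached_price=0.30)
print(f"baseline ${baseline:,.2f}, optimized ${optimized:,.2f}, "
      f"saving {1 - optimized / baseline:.1%}")
```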
How Different Content Types Tokenize
Understanding how different content types consume tokens helps you make informed context engineering decisions.
Text
Standard English prose: ~1 token per 4 characters. A typical blog post of 1,500 words ≈ 2,000 tokens.
Code
Code is generally more token-dense than prose due to:
- Special characters ({}, [], (), =>) each consuming 1-2 tokens
- Indentation (spaces/tabs) consuming tokens
- camelCase and snake_case identifiers being split: getUserProfile → get, User, Profile
- Comments adding tokens without functional value
A 100-line Python file typically consumes 500-1,500 tokens depending on complexity.
JSON
JSON is notably verbose in tokens:
- {"temperature": 72, "unit": "fahrenheit"} consumes significantly more tokens than temperature: 72F
- JSON's structural characters ({, }, ", :, ,) all consume tokens
- For internal processing, compact formats (YAML, CSV, custom) can reduce token usage by 18-40%
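The overhead is easiest to see with repeated records, where JSON restates every key on every row. The sketch below serializes the same made-up rows as JSON and as CSV and compares character counts. Character count is only a proxy for token count, so the exact saving on a real tokenizer will differ; the data is hypothetical.

```python
import csv
import io
import json

# Hypothetical records, invented for this comparison.
rows = [
    {"city": "Austin", "temperature": 72, "unit": "F"},
    {"city": "Boston", "temperature": 58, "unit": "F"},
    {"city": "Denver", "temperature": 64, "unit": "F"},
]

as_json = json.dumps(rows)

# CSV states the keys once in a header row instead of once per record.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(rows[0]), lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
as_csv = buf.getvalue()

saving = 1 - len(as_csv) / len(as_json)
print(len(as_json), len(as_csv), f"{saving:.0%} fewer characters")
```

The saving grows with the number of rows, since the per-record key and punctuation overhead is what CSV removes.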
Structured Data
Tables, lists, and structured data consume tokens proportional to their character count. Markdown formatting characters (|, -, *, #) all consume tokens.
The Token Budget Mental Model
Think of your context window as a budget. Every component has a cost:
┌─────────────────────────────────────────────────┐
│                 CONTEXT WINDOW                  │
│               (e.g., 200K tokens)               │
│                                                 │
│ ┌───────────────┐  Fixed costs (every call)     │
│ │ System Prompt │  ~2,000-3,000 tokens          │
│ ├───────────────┤                               │
│ │ Tool Defs     │  ~5,000-55,000 tokens         │
│ ├───────────────┤                               │
│ │ Safety Buffer │  ~30,000-40,000 tokens        │
│ ├───────────────┤                               │
│ │               │                               │
│ │ Conversation  │  Variable (grows each turn)   │
│ │ History       │                               │
│ │               │                               │
│ ├───────────────┤                               │
│ │ Tool Outputs  │  Variable (can be huge)       │
│ ├───────────────┤                               │
│ │ Retrieved     │                               │
│ │ Documents     │                               │
│ ├───────────────┤                               │
│ │ Current Msg   │  Small                        │
│ ├───────────────┤                               │
│ │ Model Output  │  Shares the same window       │
│ └───────────────┘                               │
└─────────────────────────────────────────────────┘
In a typical agentic session (like Claude Code), the budget breaks down approximately:
| Component | Tokens | % of 200K |
|---|---|---|
| System prompt | ~2,600 | 1.3% |
| Built-in tools | ~17,600 | 8.8% |
| MCP tools | 900-51,000 | 0.5-25% |
| Compaction buffer | ~33,000 | 16.5% |
| Available for work | ~114,000 | ~57% |
You start with roughly 57% of the advertised window for actual work. Add MCP servers, and that drops further. This is why context engineering matters: the window is already nearly half consumed before you type a single message.
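Because the MCP row spans such a wide range, it can help to compute the remaining budget for your own setup instead of relying on a single figure. The sketch below plugs in the component sizes from the table and evaluates both ends of the MCP range; `available_for_work` is an illustrative name, and your own measurements should replace these numbers.

```python
def available_for_work(window: int, system_prompt: int, builtin_tools: int,
                       mcp_tools: int, buffer: int) -> int:
    """Tokens left for conversation, tool output, and documents after fixed costs."""
    return window - (system_prompt + builtin_tools + mcp_tools + buffer)


WINDOW = 200_000
for mcp in (900, 51_000):  # low and high ends of the MCP range in the table
    free = available_for_work(WINDOW, system_prompt=2_600, builtin_tools=17_600,
                              mcp_tools=mcp, buffer=33_000)
    print(f"MCP tools {mcp:>6,}: {free:,} tokens free ({free / WINDOW:.0%})")
```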
Key Takeaways
- Tokens are subword units, not words. 1 token ≈ 4 characters, 1 word ≈ 1.3 tokens. Code and JSON are more token-dense than prose.
- The context window is shared between input and output. Large inputs leave less room for the model's response.
- LLMs are stateless. Every API call resends the entire conversation. Context costs compound with each turn.
- Effective context is 60-70% of advertised. Plan for performance degradation well before the limit.
- Output tokens cost 4-6x more than input tokens. Constraining output length is often the highest-ROI optimization.
- Prompt caching can reduce costs by 50-90%. This is the single most impactful optimization for applications with repetitive context patterns.
- Your actual working space may be only ~57% of the window after system prompts, tools, and safety buffers.