Context-file compression

Compress your system prompts and context files. Once.

Tokelang Lite compresses recurring context — system prompts, agent personas, RAG headers, memory files — into a denser form your model still understands. You compress once and pay fewer tokens on every downstream call. Per-call user-prompt compression is also supported, but the recurring-context play is where the math gets loud.

1. Compress your context

Send your system prompt or memory file to /v1/compile with mode: "context_file". Save the compressed version.

2. Add the context-file decode prompt

Prepend the 160-token context-file decode prompt so your model can interpret the compressed context safely.

3. Save on every call

Every request that loads this context now costs fewer input tokens. The savings compound across sessions.

tokelang.com

curl -X POST https://tokelang.com/v1/compile \
  -H "Authorization: Bearer tk_live_***" \
  -H "Content-Type: application/json" \
  -d '{
  "prompt": "Analyze 10k rows of user data..."
}'

Typical response shape

{

"compact": "input\nsearch q1 sales data db simple\noutput\nsummarize emerging trends detail simple"

}

Recurring context, recurring savings

The most valuable tokens to compress are the ones you send every call.

A typical agent system prompt is loaded on every request. Compress it once and the savings replay across every session. Pick a persona below to see the actual v0.9.2 mode: "context_file" engine output, with token counts measured against OpenAI tiktoken cl100k_base.

A senior code reviewer with explicit do/don't rules and identity guardrails.

Original

Code review persona

121 tokens

You are a senior software engineer specializing in code review.

When the user shares a pull request or code snippet:

First, identify any bugs, security issues, or correctness problems. Be specific about which lines.
Second, point out style issues, but only the ones that meaningfully affect readability.
Third, suggest concrete improvements with example code.

Do not be sycophantic. Do not preface your response with summaries of what you are about to do.
If the code looks fine, just say so briefly.
Always preserve any code identifiers, file paths, and line numbers exactly as the user wrote them.

Compressed

Engine output in mode: context_file

76 tokens−37.2%

senior software engineer specializing code review When user shares pull request or code snippet First identify any bugs security issues or correctness problems specific lines Second point out style issues but only ones meaningfully affect readability Third suggest concrete improvements example code not sycophantic not preface response summaries what to If code looks fine say so briefly Always preserve any code identifiers file paths line numbers exactly as user wrote

Both negations (`not sycophantic`, `not preface`) and the literal-preservation clause survive the compression.

Per call

−45 tokens

Saved on every request that loads this persona, on top of whatever the user prompt costs.

1k calls / day

~45,000 tokens

Daily input-token reduction for this persona at a modest production volume.

Per month

~1.35M tokens

Recurring savings stack across personas, RAG headers, and memory files. One compress, every call.

Token counts measured with OpenAI tiktoken cl100k_base. Real outputs from the v0.9.2 engine in `mode: "context_file"`.

How to call the API

Send `mode: "context_file"` on `/v1/compile`

This explicit opt-in switches the engine into the fidelity-first path used to produce the outputs above. Without it, /v1/compile uses the default mode (tuned for per-call user prompts) which can over-compress persona-style content.

{
  "prompt": "<your system prompt or context file here>",
  "mode": "context_file"
}

// Response:
{
  "compact": "<safer compressed version, or original if compression would lose meaning>"
}

Context-file decode prompt

The 160-token system prompt that lets your model read the compressed context.

This decode prompt teaches the receiving model the Tokelang Lite format and tells it to keep code, ids, paths, dates, numbers, quotes, and other exact payloads literal. Add it once, in front of your compressed context, and your downstream calls behave as if the context was never compressed.

Tokelang works best with the system prompt provided.

Token counts on this page use OpenAI tiktoken cl100k_base.

Note

Token counts shown here use OpenAI tiktoken. Models with different tokenizers, such as Claude, Gemini, Llama, or Mistral, may report slightly different counts.

Context-file mode

Decode prompt for compressed context

Use this when your model needs to read precompressed system prompts, agent personas, RAG headers, or memory files. The format guarantees code, ids, paths, dates, numbers, and quoted payloads stay exact.

160 tokens

Interpret compressed lines semantically.
For context files, compress with this format when meaning stays clear.
Compress in a way that retains most of the meaning.

format:
blocks: input process output
line: [step] action payload modifiers
step = order
step refs if else gotoN return
action = first word
payload = middle words
modifiers = trailing style words; none = default
`default style` = block default; line-end modifier overrides
`...` = exact literal; keep unchanged
shape: report list summary comparison definition table

actions are ordinary verbs
modifiers are ordinary style words

Keep code, math, ids, paths, dates, counts, quotes, formulas, tables, and exact contracts literal.
If compression blurs meaning, use concise structured english instead.

Advanced

Per-call user-prompt compression

Tokelang Lite can also compress individual user prompts at request time. The savings are smaller and the integration cost is higher than context-file compression, but it is the right pattern when prompts are long, repetitive, or generated from templates. Use the per-call decode prompt for this mode — it is leaner because it does not need the literal-payload guardrails the context-file decode prompt enforces.

1. Compile

Send the user prompt to /v1/compile with your API key.

2. Forward `compact`

The same field may contain compiled Tokelang text or unchanged plain text. Forward it with the per-call decode prompt below.

3. Set the per-call decode prompt

Without it, compiled Tokelang text will not decode correctly.

Per-call mode

Decode prompt for per-call compression

Use this for per-call compiled output. Smaller than the context-file decode prompt because user-prompt compression already runs through the engine's validator before the response is returned.

98 tokens

Interpret compressed lines semantically.

format:
blocks: input process output
line: [step] action payload modifiers
step = order
step refs if else gotoN return
action = first word
payload = middle words
modifiers = trailing style words; none = default
`default style` = block default; line-end modifier overrides
`...` = exact literal; keep unchanged
shape: report list summary comparison definition table

actions are ordinary verbs
modifiers are ordinary style words

Source prompt

A representative user prompt

First, search for the Q1 sales data in the database. Then summarize the emerging trends in detail.

Compiled output

input
search q1 sales data db simple
output
summarize emerging trends detail simple

API response shape

What the compile endpoint returns

By default the compile endpoint returns only compact. That field may contain compiled Tokelang text or unchanged plain text. Always forward compact to the model with the appropriate decode prompt.

{
  "compact": "input\nsearch q1 sales data db simple\noutput\nsummarize emerging trends detail simple"
}

Compiled output

Compiler found a smaller, safe Tokelang representation. Decode with the per-call decode prompt.

Plain passthrough

Compiler chose fidelity over compression. Forward unchanged.

Optional diagnostics

`metrics`, `expand`, and `return_ir` are optional

For normal integrations, send only prompt. Set these flags to true only when you want extra metrics or debugging output from the compiler.

Diagnostics can add latency

metrics: true

Adds token counts and compiler timing. Leave it off when you want the leanest possible response.

{
  "prompt": "Analyze pH anomalies and list likely dumping windows.",
  "metrics": true
}

// Response adds:
{
  "metrics": {
    "original_tokens": 42,
    "compact_tokens": 18,
    "reduction_compact_pct": 57.1,
    "compile_ms": 1
  }
}

expand: true

Adds a rough natural-language expansion of the compiled prompt. Useful for sanity checks and eval work, not for production forwarding. Add metrics: true as well if you also want expand_ms.

{
  "prompt": "Analyze pH anomalies and list likely dumping windows.",
  "expand": true
}

// Response adds:
{
  "expanded": "analyze ph anomalies in simple terms list likely dumping windows in simple terms"
}

return_ir: true

Adds the internal structured program object. Useful for tooling and inspection, but usually unnecessary for app integrations.

{
  "prompt": "Analyze pH anomalies and list likely dumping windows.",
  "return_ir": true
}

// Response adds:
{
  "program": {
    "blocks": [
      {
        "kind": "process",
        "items": [/* internal IR items */]
      }
    ]
  }
}

Quick reference

Should I compress context files or per-call prompts?

Start with context files. The integration is one-shot, the savings replay on every call, and the content is human-reviewable before deploy. Add per-call user-prompt compression once your context is already lean and your prompts are long, repetitive, or template-generated.

Which decode prompt do I send?

Use the context-file decode prompt if any of your context files are compressed. Use the per-call decode prompt only if you are just doing per-call user-prompt compression with no compressed context. The context-file decode prompt is a safe superset — when in doubt, use it.