Cerebras

Tokenomics analysis of the Cerebras WSE-3 for inference with leading open source models. Source: SemiAnalysis Tokenomics team.


Model

Total parameters (locked)

Active parameters (inference perf only) (locked)

B params

Model layers (locked)

layers

Max context length supported

Average input sequence length (ISL, P50): 96.3k tokens

Default (96.3k) is the P50 input sequence length from our internal testing across Claude Code, Codex, Cursor, OpenCode, and Pi. Output tokens are derived from the workload mix in the Cost to serve panel below.

Concurrent requests

requests

System overhead (networking, scheduling, etc.)

CS-3 system cost (capex)

$M / system

Systems needed to run one copy of the model

Memory composition: 1.84 TB total

Weights: 60 GB

KV cache: 1.78 TB (97%)

Est. capex cost (@ $1.00M / WSE-3): $42M

Layers / wafer: 0.86
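The memory, capex, and layers-per-wafer figures above are consistent with simple sizing arithmetic, sketched below. This is a reconstruction under stated assumptions, not the calculator's actual logic: it assumes 44 GB of on-wafer SRAM per WSE-3, that weights and KV cache both live entirely in SRAM, zero system overhead, decimal units (1 TB = 1000 GB), and a 36-layer model (inferred from 0.86 layers per wafer across 42 systems; the panel does not state it). The function and field names are hypothetical.

```python
import math

WSE3_SRAM_GB = 44.0        # assumption: on-wafer SRAM capacity per WSE-3
COST_PER_SYSTEM_M = 1.00   # $M per WSE-3, taken from the capex line above


def size_deployment(weights_gb, kv_cache_gb, n_layers, overhead=0.0):
    """Estimate systems, capex, and layers per wafer for one model copy.

    `overhead` is a fractional memory overhead (assumed 0.0 here).
    """
    model_gb = weights_gb + kv_cache_gb
    total_gb = model_gb * (1.0 + overhead)
    # Round up: a partially filled wafer is still a whole system.
    systems = math.ceil(total_gb / WSE3_SRAM_GB)
    return {
        "total_tb": total_gb / 1000.0,
        "kv_share": kv_cache_gb / model_gb,
        "systems": systems,
        "capex_musd": systems * COST_PER_SYSTEM_M,
        "layers_per_wafer": n_layers / systems,
    }


# Panel values: 60 GB of weights, 1.78 TB of KV cache; 36 layers is an assumption.
result = size_deployment(weights_gb=60.0, kv_cache_gb=1780.0, n_layers=36)
for key, value in result.items():
    print(f"{key:18s} {value}")
```

With these inputs the sketch lands on 42 systems, which matches the $42M capex at $1.00M per WSE-3. A nonzero overhead or a different SRAM budget per wafer would shift the system count.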

Weight quantization

KV cache quantization

WSE-3 compute

Feasibility phase diagram

32 systems, 64 systems, 128 systems

Where the workload mix comes from

Token-type breakdown, by volume and by spend, observed across our internal workload on the leading coding assistants (Claude Code, Codex, Cursor, OpenCode, Pi). Cache reads dominate by volume (94.7% of tokens), but cache writes and outputs drive a disproportionate share of spend: together they are 5.0% of tokens and 42.9% of cost.

Token Mix by Volume: Cache Read 94.7%, Cache Write 4.7%, Input 0.4%, Output 0.3%.

Token Mix by Cost (current pricing): Cache Read 56.5%, Cache Write 34.7%, Output 8.2%, Input 0.6%.
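One way to see which token types drive spend disproportionately is to divide each type's cost share by its volume share. The sketch below does that with the percentages quoted above; the dictionary and function names are hypothetical. Because the published shares are rounded to one decimal place, the ratios for the small categories (input and output) are only approximate.

```python
# Shares in percent, copied from the two token-mix breakdowns above.
volume_pct = {"cache_read": 94.7, "cache_write": 4.7, "input": 0.4, "output": 0.3}
cost_pct = {"cache_read": 56.5, "cache_write": 34.7, "input": 0.6, "output": 8.2}


def cost_intensity(volume, cost):
    """Cost share divided by volume share per token type.

    A ratio of 1.0 means a token type costs exactly its share of volume;
    above 1.0 means it is over-represented in spend.
    """
    return {name: cost[name] / volume[name] for name in volume}


intensity = cost_intensity(volume_pct, cost_pct)
for name, ratio in sorted(intensity.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:12s} {ratio:6.2f}x")
```

Output tokens and cache writes come out well above 1.0, and cache reads come out below it, which is the pattern the breakdown describes.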

Where our ISL assumption comes from

Input sequence length distribution across agentic coding harnesses (Claude Code, Codex, Cursor, OpenCode, Pi). P50 lands at ~96.3k tokens.

Figure: Histogram of input sequence lengths across agentic coding harnesses, with a P50 around 96.3k tokens.

Where our OSL assumption comes from

Output sequence length across the same harnesses. P50 lands at ~213 tokens: most turns are short replies.

Figure: Histogram of output sequence lengths across agentic coding harnesses, with a P50 around 213 tokens.

Where our interactivity assumption comes from

Interactivity (output tok/s) on Cerebras: smaller models go faster, larger models go slower.

Sources: Artificial Analysis: Cerebras provider page; Cerebras: Kimi K2.6 (981 tok/s on a 1T-param model); Kimi API model list (256k context capacity).

Figure: Output tokens per second comparison from Artificial Analysis: Llama 3.1 8B at 2349, gpt-oss 120B high at 2059, gpt-oss 120B low at 1755, GLM-4.7 at 1201 and 1159.

Per-model interactivity defaults used in the calculator

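The defaults follow a power-law fit, interactivity ≈ 3322 / (active_params_B)^0.294 tok/s, where active_params_B is the active parameter count in billions, anchored on gpt-oss 120B (2,059 tok/s) and GLM 4.7 (1,201 tok/s). Kimi K2.6 uses the Cerebras-reported measurement of 981 tok/s directly. DeepSeek V3 and V4 are bumped slightly above the curve to reflect the smaller per-step KV bandwidth of MLA (multi-head latent attention).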

Model	Active params	Interactivity (tok / sec)
DeepSeek V4	80 B	1,150
Kimi K2.6	32 B	981
gpt-oss 120B	5.1 B	2,059
GLM 4.7	32 B	1,201
DeepSeek V3	37 B	1,350
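The fit described above can be reproduced from its two anchors. The sketch below solves a power law of the form a / p^b exactly through the gpt-oss 120B and GLM 4.7 points, then compares the resulting curve against the table defaults. The helper names are hypothetical, and the calculator's own fitting code may differ; this only checks that the stated constants follow from the stated anchors.

```python
import math


def fit_power_law(p1, y1, p2, y2):
    """Solve y = a / p**b exactly through two (active_params_B, tok/s) anchors."""
    b = math.log(y1 / y2) / math.log(p2 / p1)
    a = y1 * p1 ** b
    return a, b


def interactivity(active_params_b, a, b):
    """Evaluate the fitted curve in output tokens per second."""
    return a / active_params_b ** b


# Anchors from the text: gpt-oss 120B (5.1 B active) and GLM 4.7 (32 B active).
a, b = fit_power_law(5.1, 2059.0, 32.0, 1201.0)
print(f"a = {a:.0f}, b = {b:.3f}")

# (active params in B, calculator default in tok/s), copied from the table.
table_defaults = {
    "DeepSeek V4": (80.0, 1150),
    "Kimi K2.6": (32.0, 981),
    "gpt-oss 120B": (5.1, 2059),
    "GLM 4.7": (32.0, 1201),
    "DeepSeek V3": (37.0, 1350),
}
for model, (params_b, default) in table_defaults.items():
    curve = interactivity(params_b, a, b)
    print(f"{model:14s} curve = {curve:6.0f}  default = {default}")
```

Under this fit the curve evaluates to roughly 920 tok/s at 80 B and 1,150 tok/s at 37 B, so both DeepSeek defaults sit above it, while the Kimi K2.6 default sits below the curve because it is a direct measurement.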

Cerebras WSE-3 vs other chips

Comparing chip and system specs.

Figure: Comparison chart of Cerebras WSE-3 vs H100, H200, B200, B300, Groq LP30, and R200 across FP16/FP8/FP4 performance, HBM capacity and bandwidth, SRAM capacity and bandwidth, and perf ratios.