Cerebras

Tokenomics analysis of the Cerebras WSE-3 for inference with leading open source models. Source: SemiAnalysis Tokenomics team.

The default average input sequence length (ISL), 96.3k tokens, is the P50 from our internal testing across Claude Code, Codex, Cursor, OpenCode, and Pi. Output tokens are derived from the workload mix in the Cost to serve panel below.

Systems needed to run one copy of the model: 42
Memory composition (1.84 TB total): weights 60 GB, KV cache 1.78 TB (97%)
Est. capex cost (@ $1.00M / WSE-3): $42M
Layers / wafer: 0.86
Other calculator inputs: weight quantization, KV cache quantization, WSE-3 compute
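
The headline figures above can be cross-checked with a few lines of arithmetic. The sketch below is not the calculator's actual code: it assumes 44 GB of on-chip SRAM per WSE-3 (Cerebras's published spec, not stated on this page), that weights and KV cache must both fit in that SRAM, and a 36-layer model (an assumption chosen because it reproduces the 0.86 layers-per-wafer figure). The function name and parameters are hypothetical.

```python
import math

SRAM_PER_WSE3_GB = 44.0    # assumption: published WSE-3 on-chip SRAM capacity
CAPEX_PER_SYSTEM_M = 1.00  # $M per WSE-3, the capex figure quoted on this page


def size_deployment(weights_gb, kv_cache_gb, n_layers):
    """Estimate systems, capex, and layers per wafer for one model copy."""
    total_gb = weights_gb + kv_cache_gb
    # Round up: a partially filled wafer still counts as a whole system.
    systems = math.ceil(total_gb / SRAM_PER_WSE3_GB)
    return {
        "total_gb": total_gb,
        "kv_share": kv_cache_gb / total_gb,
        "systems": systems,
        "capex_musd": systems * CAPEX_PER_SYSTEM_M,
        "layers_per_wafer": n_layers / systems,
    }


# 60 GB of weights and 1.78 TB (1780 GB, decimal units) of KV cache.
# n_layers=36 is an assumed value, not taken from the page.
result = size_deployment(weights_gb=60, kv_cache_gb=1780, n_layers=36)
print(result)
```

Under these assumptions the memory footprint, not compute, sets the system count: KV cache alone accounts for nearly all of the SRAM that has to be provisioned.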
Feasibility phase diagram
Figure: model size (log scale, 125B to 4T parameters) against average ISL (log scale, 16k to 10M tokens), with boundaries at 32, 64, and 128 systems. Models plotted at 1M ISL: DeepSeek V4, DeepSeek V3, Kimi K2.6, gpt-oss 120B, GLM 4.7.

Where the workload mix comes from

Token-type breakdown observed across our internal workload and spend on the leading coding assistants (Claude Code, Codex, Cursor, OpenCode, Pi). Cache reads dominate by volume but cache writes and outputs disproportionately drive spend.

Token Mix by Volume
Token Mix by Cost (current pricing)

Where our ISL assumption comes from

Input sequence length distribution across agentic coding harnesses (Claude Code, Codex, Cursor, OpenCode, Pi). P50 lands at ~96.3k tokens.

Where our OSL assumption comes from

Output sequence length across the same harnesses. P50 lands at ~213 tokens: most turns are short replies.

Where our interactivity assumption comes from

Interactivity (output tok/s) on Cerebras: smaller models go faster, larger models go slower.

Sources: Artificial Analysis: Cerebras provider page; Cerebras: Kimi K2.6 (981 tok/s on a 1T-param model); Kimi API model list (256k context capacity).

Per-model interactivity defaults used in the calculator

Fit to interactivity ≈ 3322 / active_params_B^0.294 using gpt-oss 120B (2059) and GLM 4.7 (1201) as anchors. Kimi K2.6 uses the Cerebras-reported 981 tok/s measurement directly. DeepSeek V3/V4 are bumped slightly above the curve to reflect MLA's smaller per-step KV bandwidth.

Model           Active params   Interactivity (tok / sec)
DeepSeek V4     80 B            1,150
Kimi K2.6       32 B            981
gpt-oss 120B    5.1 B           2,059
GLM 4.7         32 B            1,201
DeepSeek V3     37 B            1,350
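
The power-law fit described above can be reproduced from its two anchors. The page does not say how the fit was performed; the sketch below assumes an exact two-point fit of interactivity = a / active_params_B^b through gpt-oss 120B and GLM 4.7, which happens to recover the quoted constants to rounding. The function names are hypothetical.

```python
import math

# Anchor points from the table: (active params in billions, tok/s).
ANCHORS = {
    "gpt-oss 120B": (5.1, 2059.0),
    "GLM 4.7": (32.0, 1201.0),
}


def fit_power_law(p1, p2):
    """Solve y = a / x**b exactly through two (x, y) points."""
    (x1, y1), (x2, y2) = p1, p2
    # y1 / y2 = (x2 / x1)**b, so b follows from the ratio of logs.
    exponent = math.log(y1 / y2) / math.log(x2 / x1)
    coefficient = y1 * x1 ** exponent
    return coefficient, exponent


def interactivity(active_params_b, coefficient=3322.0, exponent=0.294):
    """Evaluate the quoted curve: tok/s as a function of active params (B)."""
    return coefficient / active_params_b ** exponent


coefficient, exponent = fit_power_law(ANCHORS["gpt-oss 120B"], ANCHORS["GLM 4.7"])
print(f"fitted: {coefficient:.0f} / active_params_B^{exponent:.3f}")
for name, (params_b, reported) in ANCHORS.items():
    print(f"{name}: curve {interactivity(params_b):.0f} vs reported {reported:.0f}")
```

With the rounded constants the curve lands within a few tok/s of both anchors. The Kimi K2.6 and DeepSeek rows are set by hand as described above, so they are not expected to sit on the curve.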

Cerebras WSE-3 vs other chips

Comparing chip and system specs.
