
Know which model to use

Compare local and API models on your own prompts. Pick the one that works before you ship.

Real prompts
Test your actual task.
Local vs API
Compare cost and quality.
Prompt checks
See where models fail.
Proof
Keep evidence for the choice.
USE CASES
Arena run
User input: Support reply tone check
Claude: best fit, 96%
GPT mini: score 82%
Local 8B: score 78%
01

Pick a model for a customer workflow

Compare candidates on the exact task your customer needs.

Use when a real workflow needs a model decision.

Provider comparison
Claude (API): score 92%, cost $0.009
Local 8B (Local): score 88%, cost $0.000, viable
GPT mini (API): score 84%, cost $0.004
02

Compare local models against API models

See whether paid providers beat the models you can run locally.

Use when cost and quality both matter.

Evaluation setup
Input: Return strict JSON
Expected JSON: { r_count: 3 }
Claude: correct
GPT mini: format drift
Qwen: count error
03

Validate prompts before production

Run the same prompt across models before it reaches users.

Use when model behavior must be predictable.
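
A prompt check like the one in the evaluation setup can be sketched in a few lines of Python. This is a minimal illustration, not the product's implementation: the function name `check_output` and the sample replies are hypothetical, and the three labels are borrowed from the example card.

```python
import json


def check_output(raw: str, expected: dict) -> str:
    """Classify one model reply as 'correct', 'format drift', or 'count error'."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        # Anything that is not strict JSON (extra prose, code fences) is drift.
        return "format drift"
    if not isinstance(parsed, dict) or set(parsed) != set(expected):
        # Valid JSON, but not the object shape that was asked for.
        return "format drift"
    if parsed != expected:
        # Right shape, wrong value.
        return "count error"
    return "correct"


# Hypothetical replies, made up for illustration only.
expected = {"r_count": 3}
sample_replies = {
    "model_a": '{"r_count": 3}',
    "model_b": 'Sure! Here is the JSON: {"r_count": 3}',
    "model_c": '{"r_count": 2}',
}

for name, raw in sample_replies.items():
    print(f"{name}: {check_output(raw, expected)}")
```

Running the same check on every model's reply makes failures comparable: a model that wraps the answer in prose fails differently from one that returns the wrong number.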

Fine-tune review
Base model: 78% accuracy (ok)
Fine-tune v2: 91%, +13 pts (winner)
Alt model: 83% task fit (mid)
04

Evaluate fine-tuned models

Check if a fine-tune actually improves the task.

Use when training needs proof, not hope.
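
One way to get that proof is to score the base model and the fine-tune on the same labeled examples and report the difference in percentage points. The sketch below assumes exact-match scoring. The helper names and the tiny data set are hypothetical.

```python
def accuracy(predictions: list[str], expected: list[str]) -> float:
    """Exact-match accuracy as a percentage."""
    if not expected or len(predictions) != len(expected):
        raise ValueError("predictions and expected must be non-empty and equal in length")
    hits = sum(p == e for p, e in zip(predictions, expected))
    return 100.0 * hits / len(expected)


def improvement_points(base: list[str], tuned: list[str], expected: list[str]) -> float:
    """Percentage-point gain of the fine-tune over the base model on the same examples."""
    return accuracy(tuned, expected) - accuracy(base, expected)


# Hypothetical labels and predictions, made up for illustration only.
expected = ["refund", "billing", "shipping", "refund"]
base_predictions = ["refund", "billing", "other", "other"]
tuned_predictions = ["refund", "billing", "shipping", "other"]

print(f"base:  {accuracy(base_predictions, expected):.0f}%")
print(f"tuned: {accuracy(tuned_predictions, expected):.0f}%")
print(f"gain:  {improvement_points(base_predictions, tuned_predictions, expected):+.0f} pts")
```

Scoring both models on identical examples is what makes the gain meaningful. A fine-tune tested on different prompts than the base model proves nothing.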

Cost decision (quality vs. cost)
Claude: $0.009
GPT mini: $0.004
Local 8B: $0.000
05

Reduce model cost without losing quality

Test cheaper options against the model you already trust.

Use when you want lower cost without weaker output.
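
The decision rule behind this use case can be sketched as: keep every candidate whose score stays within an allowed drop of the trusted model, then take the cheapest. The scores and costs below are the ones shown in the provider comparison card. The page does not state the cost unit, and the `max_drop` tolerance and all names in the code are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    score: float  # percent, as shown on the page
    cost: float   # USD, unit not stated on the page


def cheapest_within(candidates: list[Candidate], trusted: str, max_drop: float) -> Candidate:
    """Return the cheapest candidate scoring no more than max_drop points below the trusted model."""
    baseline = next((c for c in candidates if c.name == trusted), None)
    if baseline is None:
        raise ValueError(f"trusted model {trusted!r} is not among the candidates")
    viable = [c for c in candidates if c.score >= baseline.score - max_drop]
    # Cheapest first; on equal cost, prefer the higher score.
    return min(viable, key=lambda c: (c.cost, -c.score))


candidates = [
    Candidate("Claude", 92.0, 0.009),
    Candidate("Local 8B", 88.0, 0.000),
    Candidate("GPT mini", 84.0, 0.004),
]

# Assumed tolerance: accept at most a 5-point drop from the trusted model.
choice = cheapest_within(candidates, trusted="Claude", max_drop=5.0)
print(f"{choice.name}: score {choice.score:.0f}%, cost ${choice.cost:.3f}")
```

With a 5-point tolerance the local model qualifies and wins on cost. With a tolerance of 0 only the trusted model remains, so the tolerance is the number a team has to agree on before reading the result.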

Saved result
Decision record (saved evidence)
Prompt: JSON count
Models: 3 compared
Winner: Claude
Score: 96%
06

Create evidence for model decisions

Keep prompts, outputs, and scores behind the choice.

Use when a team or client asks why this model won.
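
A decision record of this kind could be stored as plain JSON so it can be attached to a ticket or shared with a client. The sketch below is one possible shape, not the product's actual format: the class names, field names, and sample values are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ModelResult:
    model: str
    output: str
    score: float  # percent


@dataclass
class DecisionRecord:
    prompt: str
    expected_output: str
    results: list[ModelResult]

    @property
    def winner(self) -> str:
        """Name of the highest-scoring model."""
        return max(self.results, key=lambda r: r.score).model

    def to_json(self) -> str:
        """Serialize the prompt, every output and score, and the winner."""
        payload = asdict(self)
        payload["winner"] = self.winner
        return json.dumps(payload, indent=2)


# Hypothetical values, made up for illustration only.
record = DecisionRecord(
    prompt="Count the letter r in 'strawberry' and return strict JSON.",
    expected_output='{"r_count": 3}',
    results=[
        ModelResult("model_a", '{"r_count": 3}', 100.0),
        ModelResult("model_b", '{"r_count": 2}', 0.0),
    ],
)

print(record.to_json())
```

Keeping the raw outputs next to the scores lets a reviewer re-check the decision later without rerunning the models.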

Prompt
Models
Expected output
Evidence
Decision