Know which model to use

Compare local and API models on your own prompts. Pick the one that works before you ship.

Built for prompt checks, local model trials, fine-tune reviews, and cost decisions.

Real prompts

Test your actual task.

Local vs API

Compare cost and quality.

Prompt checks

See where models fail.

Proof

Keep evidence for the choice.

USE CASES

Arena run

User input

Support reply tone check

Claude: best fit, 96%

GPT mini: score 82%

Local 8B: score 78%

Pick a model for a customer workflow

Compare candidates on the exact task your customer needs.

Use when a real workflow needs a model decision.

Provider comparison

Claude (API): score 92%, cost $0.009

Local 8B (Local): score 88%, cost $0.000, viable

GPT mini (API): score 84%, cost $0.004

Compare local models against API models

See whether paid providers beat the models you can run locally.

Use when cost and quality both matter.

Evaluation setup

Input

Return strict JSON

Expected JSON

{ "r_count": 3 }

Claude: correct

GPT mini: format drift

Qwen: count error

Validate prompts before production

Run the same prompt across models before it reaches users.

Use when model behavior must be predictable.
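
A strict-JSON check like the one in the evaluation setup can be sketched in a few lines of Python. This is a minimal sketch, not the product's own scoring logic: the function name `classify_output`, the label strings, and the sample outputs keyed `model_a` to `model_c` are all hypothetical. It assumes the expected answer is a flat JSON object and that strict JSON means the output must parse with a standard parser, which requires double-quoted keys.

```python
import json


def classify_output(raw: str, expected: dict) -> str:
    """Label one raw model output against the expected JSON object.

    Labels mirror the example card: "correct", "format drift", "count error".
    """
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        # Extra prose, unquoted keys, trailing commas and similar all land here.
        return "format drift"
    if not isinstance(parsed, dict) or set(parsed) != set(expected):
        # Valid JSON, but not the object shape that was asked for.
        return "format drift"
    if parsed != expected:
        # Right shape, wrong value.
        return "count error"
    return "correct"


# Hypothetical raw outputs for illustration only; not real model responses.
expected = {"r_count": 3}
outputs = {
    "model_a": '{"r_count": 3}',
    "model_b": "Sure! {r_count: 3}",
    "model_c": '{"r_count": 2}',
}

for name, raw in outputs.items():
    print(name, classify_output(raw, expected))
```

Running the same check over every candidate's output separates formatting failures from wrong answers, which usually call for different fixes.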

Fine-tune review

Base model: 78% (accuracy: ok)

Fine-tune v2 (winner): 91%, +13 pts

Alt model: 83% (task fit: mid)

Evaluate fine-tuned models

Check if a fine-tune actually improves the task.

Use when training needs proof, not hope.
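
One way to get that proof is to score the base model and the fine-tune on the same held-out cases and report the difference in percentage points. The sketch below assumes exact-match scoring, which may be too strict for free-form tasks; the function names and the toy predictions are hypothetical and only show the arithmetic.

```python
def accuracy(predictions: list[str], expected: list[str]) -> float:
    """Exact-match accuracy in percent over a shared evaluation set."""
    if not expected:
        raise ValueError("evaluation set is empty")
    if len(predictions) != len(expected):
        raise ValueError("predictions and expected must have the same length")
    hits = sum(p.strip() == e.strip() for p, e in zip(predictions, expected))
    return 100.0 * hits / len(expected)


def delta_points(base_acc: float, tuned_acc: float) -> float:
    """Improvement of the fine-tune over the base model, in percentage points."""
    return tuned_acc - base_acc


# Toy data for illustration only; not real evaluation results.
expected = ["refund", "billing", "shipping", "refund"]
base_preds = ["refund", "billing", "refund", "billing"]
tuned_preds = ["refund", "billing", "shipping", "billing"]

base_acc = accuracy(base_preds, expected)
tuned_acc = accuracy(tuned_preds, expected)
print(f"base {base_acc:.0f}%, fine-tune {tuned_acc:.0f}%, "
      f"{delta_points(base_acc, tuned_acc):+.0f} pts")
```

Scoring both models on identical cases is what makes the point difference meaningful; a gain measured on different test sets says little about the fine-tune itself.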

Cost decision

quality vs. cost

Claude: $0.009

GPT mini: $0.004

Local 8B: $0.000

Reduce model cost without losing quality

Test cheaper options against the model you already trust.

Use when you want lower cost without weaker output.
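
The decision rule behind this use case can be sketched as: keep the trusted model as the baseline, allow a small quality drop, and take the cheapest candidate that stays above that floor. The `Candidate` type, the `cheapest_within` function, and the `max_drop` tolerance are hypothetical names for illustration. The scores and costs reuse the illustrative figures shown on this page, and the cost unit (for example, dollars per run) is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    score: float  # task score in percent
    cost: float   # cost per run; the unit is an assumption


def cheapest_within(candidates, baseline, max_drop=2.0):
    """Return the cheapest option scoring within max_drop points of the baseline.

    The baseline itself is always eligible, so the result is never worse than
    the allowed floor.
    """
    floor = baseline.score - max_drop
    eligible = [baseline] + [c for c in candidates if c.score >= floor]
    # Prefer lower cost; break cost ties by higher score.
    return min(eligible, key=lambda c: (c.cost, -c.score))


# Illustrative figures taken from the example cards on this page.
baseline = Candidate("Claude", 92.0, 0.009)
candidates = [
    Candidate("Local 8B", 88.0, 0.000),
    Candidate("GPT mini", 84.0, 0.004),
]

print(cheapest_within(candidates, baseline, max_drop=5.0).name)
print(cheapest_within(candidates, baseline, max_drop=2.0).name)
```

Making the tolerance explicit turns "without losing quality" into a number the team can agree on before looking at the results.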

Saved result

Decision record: saved evidence

prompt: JSON count

models: 3 compared

winner: Claude

score: 96%

Create evidence for model decisions

Keep prompts, outputs, and scores behind the choice.

Use when a team or client asks why this model won.
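
A saved decision record could be as simple as one JSON document that holds the prompt, the expected output, each model's output and score, and the resulting winner. The sketch below is one possible shape, not the product's actual storage format: `ModelResult`, `DecisionRecord`, their field names, and the sample values are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ModelResult:
    model: str
    output: str
    score: float  # task score in percent


@dataclass
class DecisionRecord:
    prompt: str
    expected_output: str
    results: list[ModelResult]

    @property
    def winner(self) -> str:
        """Name of the highest-scoring model in this record."""
        return max(self.results, key=lambda r: r.score).model

    def to_json(self) -> str:
        """Serialize the evidence and the derived winner as one JSON document."""
        data = asdict(self)
        data["winner"] = self.winner
        return json.dumps(data, indent=2, sort_keys=True)


# Hypothetical values for illustration only.
record = DecisionRecord(
    prompt="Count the letter r in 'strawberry'. Return strict JSON.",
    expected_output='{"r_count": 3}',
    results=[
        ModelResult("model_a", '{"r_count": 3}', 100.0),
        ModelResult("model_b", '{"r_count": 2}', 0.0),
    ],
)

print(record.to_json())
```

Storing the raw outputs next to the scores lets someone re-check the decision later without rerunning the models.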

Prompt → Models → Expected output → Evidence → Decision