
How We Track AI Model Costs: Real Data, Not Marketing Claims

Behind the scenes of our cost analysis methodology. See how we track token usage, calculate real costs, and determine which AI models actually save you money.

Zane Merrick
January 24, 2026
Tags: AI, technical, costs, methodology, transparency

Every AI tool review I write includes real cost data. Not estimates. Not “contact sales for pricing.” Actual dollars spent running standardized tests.

Here’s how that works.

The Problem with AI Pricing Transparency

AI companies love talking about “affordability” and “competitive pricing” without telling you what anything actually costs.

  • ChatGPT: Flat $20/month. Simple. But what if you hit limits?
  • Claude Pro: Also $20/month. Different limits.
  • API pricing: Per token. But how many tokens is a typical task?
  • Enterprise tiers: “Contact sales” = pricing opacity

When I test AI tools, I need to know: What does it actually cost to run 100 MMLU questions? Or generate 50 code solutions? Or process 1000 customer queries?

Marketing pages won’t tell you. So I built tools to measure it myself.

Our Cost Tracking System

Every benchmark run automatically tracks:

  • Input tokens: What you send to the model
  • Output tokens: What the model generates
  • Total cost: Calculated from current API pricing
  • Cost per task: How much each question/task costs
  • Cost-effectiveness: Cost per accuracy point
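As a sketch, those per-run fields map onto a small record like this (the field names here are illustrative, not the repo's actual schema):

```python
from dataclasses import dataclass

@dataclass
class RunCosts:
    """Cost record for one benchmark run (hypothetical field names)."""
    input_tokens: int    # tokens sent to the model
    output_tokens: int   # tokens the model generated
    total_cost: float    # dollars, from current API pricing
    num_tasks: int       # questions/tasks in the run
    score: float         # benchmark score in points, e.g. 86.4

    @property
    def cost_per_task(self) -> float:
        return self.total_cost / self.num_tasks

    @property
    def cost_per_point(self) -> float:
        # cost-effectiveness: cost per accuracy point
        return self.total_cost / self.score
```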

  • Models tracked: 30+
  • Pricing updates: weekly
  • Token precision: 100%
  • Cost accuracy: ±$0.0001

How Token Counting Works

Different providers count tokens differently:

OpenAI (GPT-4, GPT-3.5):

  • Uses tiktoken library
  • Same tokenizer the API uses
  • Exact token counts

Anthropic (Claude):

  • Returns exact token counts in the API response
  • Rule of thumb: roughly 1 token per 4 characters

Google (Gemini):

  • Different tokenization than OpenAI
  • Longer prompts = relatively fewer tokens
  • Uses official tokenizer

We don’t estimate. We count exactly what the API charges for.
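Since OpenAI and Anthropic both return exact counts in the response body, a provider-agnostic tracker only has to normalize the field names. A minimal sketch, assuming dict-shaped responses as the APIs return them:

```python
def extract_usage(response: dict) -> tuple[int, int]:
    """Return (input_tokens, output_tokens) from an API response dict.

    OpenAI reports usage as prompt_tokens/completion_tokens;
    Anthropic reports it as input_tokens/output_tokens.
    """
    usage = response.get("usage", {})
    input_tokens = usage.get("prompt_tokens", usage.get("input_tokens", 0))
    output_tokens = usage.get("completion_tokens", usage.get("output_tokens", 0))
    return input_tokens, output_tokens
```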

Current Model Pricing (January 2026)

Here’s what the major models actually cost per 1 million tokens:

Model           | Input $/1M tokens
GPT-4           | $30.00
GPT-4 Turbo     | $10.00
Claude 3 Opus   | $15.00
Claude 3 Sonnet | $3.00
Gemini 1.5 Pro  | $3.50
GPT-3.5 Turbo   | $0.50
Claude 3 Haiku  | $0.25

Output tokens typically cost 2-3x more than input tokens.

Why? Generating text requires more compute than processing it.
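Put the input prices together with the 2-3x output multiplier and per-call cost reduces to one line of arithmetic. In this sketch, only the input prices come from the table above; the output prices are illustrative fill-ins at roughly the multipliers providers published at the time:

```python
# ($ input, $ output) per 1M tokens — input prices from the table above,
# output prices illustrative (roughly 2-3x input, as noted)
PRICING = {
    "gpt-4":           (30.00, 60.00),
    "gpt-4-turbo":     (10.00, 30.00),
    "claude-3-opus":   (15.00, 75.00),
    "claude-3-sonnet": (3.00, 15.00),
    "gpt-3.5-turbo":   (0.50, 1.50),
    "claude-3-haiku":  (0.25, 1.25),
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in dollars for one API call."""
    in_price, out_price = PRICING[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000
```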

Real Example: MMLU Benchmark Costs

I ran 100 MMLU questions (knowledge test across 57 subjects) on 4 models. Here’s what it cost:

MMLU 100 Questions - Actual Costs

Same test. Wildly different costs.

  • GPT-4: $2.10
  • Claude 3 Opus: $1.50
  • GPT-4 Turbo: $0.70
  • Claude 3 Sonnet: $0.30
  • GPT-3.5 Turbo: $0.04

GPT-4 costs 52.5x more than GPT-3.5 Turbo for the same benchmark.

But does it perform 52.5x better? Let’s see:

Cost vs. Performance

Here’s where it gets interesting. Cost means nothing without quality.

Model           | MMLU Score | Cost (100q) | Cost per Point
GPT-4           | 86.4%      | $2.10       | $0.0243
Claude 3 Opus   | 86.8%      | $1.50       | $0.0173
GPT-4 Turbo     | 85.2%      | $0.70       | $0.0082
Claude 3 Sonnet | 79.0%      | $0.30       | $0.0038
GPT-3.5 Turbo   | 70.0%      | $0.04       | $0.0006

Cost per point = Total cost ÷ Score

This is the metric that actually matters. Not “cheap” or “expensive” - value.
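The table above is just that formula applied to each model's measured score and cost:

```python
# (MMLU score in points, total cost for 100 questions in $) from the runs above
results = {
    "GPT-4":           (86.4, 2.10),
    "Claude 3 Opus":   (86.8, 1.50),
    "GPT-4 Turbo":     (85.2, 0.70),
    "Claude 3 Sonnet": (79.0, 0.30),
    "GPT-3.5 Turbo":   (70.0, 0.04),
}

# cost per accuracy point = total cost / score
cost_per_point = {
    model: round(cost / score, 4) for model, (score, cost) in results.items()
}
```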

[Chart: Performance vs Cost (MMLU Benchmark)]

Claude 3 Opus beats GPT-4 outright: the best score at a lower cost per point ($0.0173 vs. $0.0243).

GPT-3.5 Turbo is absurdly cheap but gives up nearly 17 points against Claude 3 Opus.

GPT-4 Turbo hits the sweet spot: within 1.2 points of GPT-4’s score at a third of the cost.

How We Calculate Costs

The system is fully automated:

  1. API call made → Model returns token counts
  2. Token counts recorded → Input + output tokens
  3. Pricing looked up → Current rates for that model
  4. Cost calculated → (Input tokens × input price) + (Output tokens × output price)
  5. Results saved → JSON with full cost breakdown

Every benchmark result includes:

{
  "cost_summary": {
    "total_cost": 0.4350,
    "total_tokens": 12450,
    "total_input_tokens": 10200,
    "total_output_tokens": 2250,
    "num_calls": 20,
    "cost_per_call": 0.0218,
    "model": "gpt-4"
  }
}

Nothing hidden. Full transparency.

Why This Matters for Reviews

When I review an AI tool, I can tell you:

  • Exact cost to run specific tasks
  • Cost comparisons vs. alternatives
  • Cost-effectiveness (performance per dollar)
  • Real-world projections (1000 queries = $X)

Not “it’s affordable” or “competitively priced.” Actual numbers.

Example: Coding Assistant Review

Testing Cursor AI vs GitHub Copilot:

  • Task: Generate 50 Python functions
  • Cursor (GPT-4): $1.20, 95% correct
  • Copilot (Codex): $0.45, 89% correct
  • Verdict: Copilot is 62% cheaper but 6 points less accurate

Is 6 points of accuracy worth 2.7x the cost? Depends on your use case.

For production code? Maybe worth it.
For prototyping? Probably not.

The data lets you decide. Not marketing.

Estimating Before You Run

Our tools can estimate costs before you commit:

python scripts/analyze_costs.py estimate \
  --models gpt-4 claude-3-opus-20240229 \
  --benchmark mmlu \
  --num-questions 100

Output:

Model                          | Cost      | Tokens
gpt-4                          | $2.1000   | 60,000
claude-3-opus-20240229         | $1.5000   | 60,000

Prevents expensive surprises.
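Under the hood, an estimate like this is simple: assume an average token footprint per question and multiply by current prices. A sketch, where the ~500 input / ~100 output tokens per question are assumed averages rather than measured constants:

```python
def estimate_cost(num_questions: int, input_price: float, output_price: float,
                  avg_input_tokens: int = 500, avg_output_tokens: int = 100) -> float:
    """Estimated benchmark cost in dollars; prices are $ per 1M tokens."""
    per_question = (avg_input_tokens * input_price +
                    avg_output_tokens * output_price) / 1_000_000
    return num_questions * per_question
```

With GPT-4-class prices ($30 in / $60 out per 1M tokens), 100 questions lands around the $2.10 figure above.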

Common Pricing Gotchas

1. Context Window Costs

Longer context = more input tokens = higher costs.

  • GPT-4: 8K context is standard. The 32K variant charges roughly double per token, and longer prompts mean more tokens on top of that.
  • Claude 3: 200K context! Sounds great until you realize you pay for every token you actually fill it with, on every call.

2. Streaming vs. Batch

Some APIs charge different rates for streaming vs. batch. We track both.

3. Image Tokens

Vision models (GPT-4 Vision, Claude 3) charge based on image resolution:

  • Low res: ~85 tokens per image
  • High res: ~170-765 tokens per image

We include image token costs in vision model tests.
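Those per-image numbers come from a base cost plus a per-tile cost in GPT-4 Vision-style pricing. A simplified sketch that skips the provider's image-resize step, so treat it as an approximation:

```python
import math

def image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Approximate vision-input tokens for one image (GPT-4 Vision-style)."""
    if detail == "low":
        return 85  # flat low-resolution cost
    # high detail: 85 base tokens plus 170 tokens per 512x512 tile
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles
```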

4. Tool Use Overhead

Function calling adds tokens:

  • Function definitions in prompt
  • Function call results
  • Can double token usage

Tracked in our tool use benchmarks.

How Pricing Changes Over Time

We update pricing weekly from official provider docs.

When prices drop (they often do), we re-run key benchmarks to update reviews.

Example: GPT-4 Turbo launched at 70% lower cost than GPT-4. Immediately changed recommendations.

[Chart: GPT-4 Pricing History (Input, per 1M tokens)]

Open Source Cost Tracking

All our cost tracking code is open source: github.com/noelniles/ai-tools-testing

You can:

  • Run the same tests
  • Verify our numbers
  • Track your own costs
  • Contribute pricing updates

No proprietary magic. Just code.

Limitations & Accuracy

What we track accurately:

  • API costs (100% accurate)
  • Token usage (exact counts)
  • Benchmark costs (down to $0.0001)

What we estimate:

  • Real-world usage patterns (everyone’s different)
  • Subscription value (depends on usage)
  • Hidden costs (rate limits, infrastructure)

What we can’t track:

  • Enterprise pricing (varies by contract)
  • Compute costs for self-hosted models
  • Internal API costs (some companies don’t expose them)

When we estimate, we say so. When we measure, the data is exact.

Why Transparency Matters

AI pricing is deliberately opaque. Companies want you to “contact sales” or use vague “credits” instead of real dollars.

We believe:

  1. Users deserve to know what things cost
  2. Comparisons should use real data
  3. Marketing claims should be verified

Every review includes actual cost data because you can’t make informed decisions without it.

Tools You Can Use

Want to track costs yourself?

For API Users:

  • OpenAI: Token counts in API response
  • Anthropic: Usage data in the API response
  • Our tools: Automatic tracking across providers

For Developers:

from utils.cost_tracker import CostTracker

tracker = CostTracker()
tracker.start_session("my-test", "gpt-4")

# ... make API calls ...

summary = tracker.end_session()
print(f"Total cost: ${summary['total_cost']:.4f}")

Full example code in our GitHub repo.
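The real tracker lives in the repo; a stripped-down sketch of the same idea, with a hypothetical one-model price table, might look like:

```python
class CostTracker:
    """Minimal per-session cost tracker (sketch, not the repo's implementation)."""

    # ($ input, $ output) per 1M tokens — illustrative snapshot
    PRICES = {"gpt-4": (30.00, 60.00)}

    def start_session(self, name: str, model: str) -> None:
        self._session = {"name": name, "model": model, "total_cost": 0.0,
                         "total_input_tokens": 0, "total_output_tokens": 0,
                         "num_calls": 0}

    def record_call(self, input_tokens: int, output_tokens: int) -> None:
        in_price, out_price = self.PRICES[self._session["model"]]
        s = self._session
        s["total_cost"] += (input_tokens * in_price + output_tokens * out_price) / 1e6
        s["total_input_tokens"] += input_tokens
        s["total_output_tokens"] += output_tokens
        s["num_calls"] += 1

    def end_session(self) -> dict:
        s = self._session
        s["total_tokens"] = s["total_input_tokens"] + s["total_output_tokens"]
        s["cost_per_call"] = s["total_cost"] / s["num_calls"] if s["num_calls"] else 0.0
        return s
```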

The Bottom Line

Real cost tracking changes everything:

  • Reveals which “cheap” models are actually expensive at scale
  • Shows which “premium” models offer better value
  • Exposes hidden costs (rate limits, token overhead)
  • Enables actual cost-benefit analysis

When I say Claude 3 Opus is worth the premium, I can show you the numbers.

When I say GPT-3.5 Turbo is a better deal for most tasks, the data backs it up.

When I recommend GPT-4 Turbo over GPT-4, you can see the 70% cost savings vs. 1.4% accuracy loss.

Data beats marketing. Every time.


Want to verify our numbers? All code is open source. All results are published. All methodology is documented.

Because if I’m going to tell you what’s worth paying for, you deserve to see the receipts.