The Focus AI logo

The Focus AI

Archives
Subscribe
March 25, 2026

Same Weights, Different Results

We gave the same model — NVIDIA Nemotron Nano 30B — the same prompt on four different providers. Analyze battery health and charging data by calling tools via the TezLab MCP server. Here's what each one did.

same-weights-different-results.png Ollama (running locally, no middleware) called 17 tools autonomously:

list_vehicles → get_vehicle_status → get_battery_health → get_charges_brief →
get_drives_brief → find_nearby_chargers → get_my_chargers → get_efficiency →
find_nearby_chargers → search_public_chargers → search_public_chargers →
current_time → get_aggregations → get_aggregations → current_time →
get_aggregations ...

It explored everything — vehicle status, battery health, charges, drives, nearby chargers, efficiency stats, aggregations. Score: 6/6.

NVIDIA NIM called 7 tools:

list_vehicles → get_battery_health → get_vehicle_status → get_charges →
get_my_chargers → find_nearby_chargers → search_public_chargers

Methodical and complete. It covered all the key tool categories. Score: 5/6.

OpenRouter called 1 tool:

list_vehicles

Then stopped and asked:

"I see you have two vehicles linked to your TezLab account: Rivian R1S and Tesla X. Which of these vehicles would you like me to analyze?"

Score: 1/6. Same weights. Same prompt. It asked for permission instead of acting.

DeepInfra didn't even get that far — it hit a context overflow error. The tool definitions pushed it past DeepInfra's 34K token context limit, while the same model handled 54K+ tokens fine on the other three providers.

Same weights. Same architecture. Same training data. Four completely different outcomes.

How We Got Here

We didn't set out to test inference providers. We were building a model showdown — 48 LLMs across 5 dimensions — and we happened to include the same Nemotron models on multiple providers. Nemotron Nano 30B scored 83.7% on OpenRouter and 77.0% on DeepInfra. That 6.7-point gap is larger than the difference between many distinct models.

So we ran a focused follow-up: the same three Nemotron models across four providers — OpenRouter, DeepInfra, Ollama (local), and NVIDIA's own NIM inference platform. Same prompts. Same evaluation harness. Three tests: MCP tool use, instruction following, and reasoning.

Here's the Nano 30B side by side:

Test OpenRouter DeepInfra Ollama NVIDIA NIM
MCP Tool Use (/6) 1 ERR 6 5
Instruction (/15) 8 15 10 15
Reasoning time 51s 57s 32s 27s

It's Not Just Tool Calling

The more we looked, the wider the divergence spread.

Instruction following breaks down. We asked each provider's Nano 30B to write a 12-word sentence about the ocean:

Provider Response Words Pass?
DeepInfra "The endless blue horizon whispers secrets of distant worlds beyond boundless imagination." 12 Yes
NVIDIA NIM "The ocean waves rhythmically crash a soothing sound that calms the soul" 12 Yes
Ollama "Ocean waves whisper ancient secrets, painting shifting horizons beneath endless skies." 10 No
OpenRouter "Waves whispered secrets to wandering soulsbeneath endless blue horizons of mystery." 11 No

Notice OpenRouter's response — "soulsbeneath" is missing a space. The tokenization or response processing is subtly corrupting the output.

Speed varies 16x. OpenRouter's Nano 9B took 729 seconds for a reasoning question that NVIDIA NIM answered in 44 seconds. Twelve minutes versus forty-four seconds. Same model.

Coding swings wildly. In the original showdown, DeepInfra's Nano 30B scored 110/126 on coding while OpenRouter scored 71/126. Simple tasks (FizzBuzz) were identical. Complex tasks (Custom Cipher, Data Pipeline) diverged completely — 7-vs-0, pass-vs-fail differences.

The Reversal

Here's where it gets worse. On Nano 9B, the pattern inverts.

OpenRouter's Nano 9B called 4 tools autonomously — list_vehicles, get_battery_health, get_charges, find_nearby_chargers — and produced a useful analysis. Score: 4/6.

DeepInfra's Nano 9B called 1 tool (list_vehicles) and asked which vehicle to analyze. Score: 1/6.

NVIDIA NIM's Nano 9B did the same — called list_vehicles and asked for clarification. Score: 1/6.

The provider that was worst for the 30B model was best for the 9B model. Whatever OpenRouter does differently in its tool-calling pipeline, it helps the smaller model act more autonomously while hindering the larger one. This isn't just "providers differ" — it's that the interaction between model and provider is unpredictable.

What Survives the Infrastructure Layer

Not everything breaks. The divergence follows a gradient.

Reasoning is rock solid. The surgeon riddle, bat-and-ball, lily pads — these are pattern-recognition tasks with unambiguous answers. Quantization and prompt formatting don't move the needle on pure logic.

Knowledge is mostly stable. Factual recall — capital of Kazakhstan, speed of light, when the Berlin Wall fell — barely changes across providers. The facts are baked deep into the weights.

The further you get from pure pattern recognition — into instruction following, coding, and especially tool calling — the more the infrastructure layer determines the outcome. The more a task depends on prompt formatting and API translation, the more it breaks.

The Full Picture

Here are all three models across all providers.

Nemotron Super 120B

Test OpenRouter DeepInfra NVIDIA NIM
MCP Tool Use (/6) 2 3 5
Instruction (/15) 15 15 15
Reasoning time 177s 59s 215s

Nemotron Nano 9B

Test OpenRouter DeepInfra NVIDIA NIM
MCP Tool Use (/6) 4 1 1
Instruction (/15) 10 10 5
Reasoning time 729s 56s 44s

Speed Across the Board

Model OpenRouter DeepInfra NVIDIA NIM Ollama
Nano 30B 51s 57s 27s 32s
Super 120B 177s 59s 215s —
Nano 9B 729s 56s 44s —

A pattern emerges in the data: NVIDIA NIM — the model maker's own platform — consistently outperforms third-party providers on tool calling. On Nano 30B, NIM scores 5/6 while OpenRouter scores 1/6. On Super 120B, NIM gets 5/6 while OpenRouter gets 2/6. This isn't surprising — they control the serving stack, the chat templates, and the quantization. They know exactly how the model was trained and can serve it the way it was intended.

The exception is Nano 9B, where NIM scores only 1/6 on MCP and 5/15 on instruction following. The smaller model may be more sensitive to serving configuration, or NVIDIA may optimize their platform primarily for their larger models.

Why This Happens

They're not actually the same model. DeepInfra uses FP8 quantization on their own GPU clusters. OpenRouter's free endpoint routes through NVIDIA's infrastructure with "unknown" quantization. NVIDIA NIM runs natively on NVIDIA's inference stack. Same weights, different numerics, different behavior.

Tool calling is implemented differently. All four providers present an OpenAI-compatible API, but the transformation to what the model actually sees differs. DeepInfra applies the native chat template. OpenRouter adds a translation layer. NVIDIA NIM presumably uses the exact template the model was trained with. OpenRouter has acknowledged this — they created "Exacto," a routing variant specifically because tool-calling accuracy varies so dramatically across providers.

Context limits vary. DeepInfra couldn't fit the Nano 30B MCP test into its 34K context window. The same model handled 54K+ tokens fine everywhere else. Providers make different decisions about context length, and those decisions can prevent your application from working at all.

Free tiers are a different product. On OpenRouter, the free Nemotron endpoint routes to different infrastructure than the paid one. You're not getting a free version of the same thing — you're getting a different thing entirely. If you're choosing a model for production, test it on the provider and tier you'll actually use.

The Uncomfortable Implication

The model evaluation ecosystem treats model names as the unit of comparison. Leaderboards rank "Llama 4 Maverick" or "Nemotron 3 Super" as if these are fixed, reproducible artifacts. They're not. A model name is a pointer to a set of weights. What you get when you call that name depends on who's serving it, how they've quantized it, what chat template they've applied, and what middleware sits between you and the inference engine.

We found MCP scores ranging from 1/6 to 6/6 across providers. Instruction following from 8/15 to 15/15. Speed from 27 seconds to 729 seconds. All on the same model.

The industry benchmarks every new model to death while largely ignoring the infrastructure layer that determines what users actually experience. Until we start treating inference infrastructure as a first-class variable in model evaluation, we're measuring the wrong thing.


This analysis is based on the Model Showdown — 48 LLMs tested across reasoning, knowledge, instruction following, coding, and MCP tool use for $4.63 — and a focused follow-up testing three Nemotron models across four providers. The interactive report includes per-model detail panels where you can see every tool call and judge verdict. The full technical investigation of provider differences is available on GitHub.

Don't miss what's next. Subscribe to The Focus AI:
Share this email:
Share on Facebook Share on LinkedIn Share on Reddit
GitHub
Twitter
LinkedIn
Powered by Buttondown, the easiest way to start and grow your newsletter.