AI COST CONTROL

The End of Cheap Unlimited AI Coding Is Here

GitHub paused new individual Copilot subscriptions. Flat-fee AI is buckling under the weight of GPU compute. Cost-aware engineering is the next core skill, and most developers have never had to think this way before.

Curated by Matt Perry

CTO

28 April 2026

The Canary in the GPU Mine

In late April 2026, GitHub quietly paused new individual Copilot subscriptions. There was no big announcement. No press tour. Just a quiet message on the signup page and a long waitlist forming behind it.

This is the first visible crack in the flat-fee AI model. And it is not going to be the last.

For the past two years, developers have lived through what felt like infinite AI. Twenty pounds a month. Type a prompt. Watch a feature appear. Type another one. Refactor an entire service. Try three different framings of the same idea, just to compare. The marginal cost felt like zero.

The economics behind that experience were never sustainable. Running modern coding agents is not like serving a static website. Every interaction lights up a GPU cluster. Every long context, every tool call, every retry pulls real electricity from real data centres. The infrastructure has finally grown too heavy for £20 a month to carry.

The era of cheap unlimited AI coding is ending. What replaces it is an engineering discipline most developers have never had to learn: cost-aware AI engineering.

Why Flat-Fee AI Is Buckling

Cloud hosting got cheaper every year for two decades. AI inference has done the opposite. Each new generation of models is more capable and more expensive to run. Reasoning models burn through tokens internally before they answer. Coding agents loop through tool calls dozens of times per task. The compute behind a single thirty-second prompt can rival a full hour of traditional cloud usage.

The Three Forces Squeezing Subscriptions

  • GPU scarcity: Nvidia H100 and B200 chips are still sold out months in advance. Cloud GPU pricing has not fallen the way CPU pricing did.
  • Reasoning models eat tokens: A modern agent does not answer in one shot. It plans, calls tools, reflects, and retries. A simple feature request can consume 200,000 tokens of reasoning before the user sees a single line of code.
  • Heavy users break the average: A small group of power users running coding agents all day can consume 100x the compute of a casual user. Flat-fee economics depend on the opposite distribution.

Expect to see more metering, tier-based billing, hard usage caps, and credit-style top-ups across every major AI tool over the next twelve months. GitHub paused new signups. Cursor moved to credit tiers. Anthropic and OpenAI both ship usage limits on their plans. The pattern is consistent.

What Cost-Aware Engineering Actually Means

Cost-aware AI engineering is the practice of building AI features that deliver business value at sustainable unit economics. It is not about being cheap. It is about being deliberate.

Think of it like the early days of cloud. AWS launched in 2006. By 2012, every senior engineer needed to understand what an EC2 reserved instance was, why an unindexed query could cost a fortune, and how to read an S3 bill. The skill was not glamorous, but the people who had it shipped products that survived contact with reality.

AI is now at the same crossroads. The skill stack has five layers.

| Layer | What It Is | Why It Matters |
| --- | --- | --- |
| Model selection | Picking the cheapest model that meets the quality bar | Cost differences between tiers are 10x to 75x |
| Prompt efficiency | Sending only the context the model actually needs | Input tokens dominate cost in long-context apps |
| Caching | Reusing prompts and responses where safe | Cached input tokens cost up to 90% less |
| Usage controls | Per-user caps, budgets, and rate limits | Stops one bad actor from blowing your monthly bill |
| Quality controls | Evals, observability, guardrails | Cheap output is worthless if it is wrong |

The rest of this post walks through each layer.

Model Selection: Match the Model to the Job

The single biggest cost lever you have is choosing the right model. Most teams default to the most capable model on every call. That is the AI equivalent of running every web request on the largest VM in the catalogue.

The Three-Tier Mental Model

Every major provider now ships a small, medium, and large model. The naming differs, but the pattern is the same: costs climb steeply at each step, with the large tier roughly 15x the price of the small tier on input tokens.

  • Small (Haiku, GPT mini, Gemini Flash): Fast, cheap, good at narrow tasks. Roughly £0.80 per million input tokens.
  • Medium (Sonnet, GPT, Gemini Pro): The workhorse for most coding and reasoning. Roughly £2.40 per million input tokens.
  • Large (Opus, GPT large reasoning, Gemini Ultra): Best in class for hard reasoning, planning, and agents. Roughly £12 per million input tokens.

How to Pick

  • Use small for: classification, extraction, summarisation, simple rewrites, structured data parsing
  • Use medium for: code generation, multi-step reasoning, customer-facing chat, content writing
  • Use large for: agentic workflows, complex planning, ambiguous requirements, anything where a wrong answer costs more than the price difference

A common pattern is the cascade. Start with the small model. If confidence is low, escalate to medium. If still uncertain, escalate to large. Done well, you can serve 80% of traffic on the small tier and reserve the expensive model for the cases that genuinely need it.
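
A minimal sketch of that cascade, assuming a hypothetical call_model() provider wrapper and a score_confidence() quality check of your own; neither is a real SDK function, and the threshold is something you tune against your evals:

```python
# Minimal model cascade: try the cheap tier first, escalate only on low confidence.
# call_model() and score_confidence() are hypothetical stand-ins for your own
# provider wrapper and quality heuristic; tier names and threshold are illustrative.

TIERS = ["small", "medium", "large"]  # cheapest first
CONFIDENCE_THRESHOLD = 0.8            # tune against your evals

def call_model(tier: str, prompt: str) -> str:
    """Stand-in for your provider SDK call (OpenAI, Anthropic, Gemini, etc.)."""
    raise NotImplementedError

def score_confidence(answer: str) -> float:
    """Stand-in for your own check: a logprob heuristic, schema validation, or a judge model."""
    raise NotImplementedError

def cascade(prompt: str) -> str:
    answer = ""
    for tier in TIERS:
        answer = call_model(tier, prompt)
        if score_confidence(answer) >= CONFIDENCE_THRESHOLD:
            return answer  # good enough: stop before paying for a bigger model
    return answer          # the large tier's answer is the final fallback
```

The hard part is not the loop; it is building a score_confidence() you trust, which is exactly what the evals section later in this post is for.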

When NOT to Optimise

Skip model optimisation when the volume is low. If a feature runs ten times a day, the difference between Haiku and Opus is pence. Spend your engineering hours where the volume justifies them.

Prompt Efficiency: Stop Sending the Kitchen Sink

Input tokens are quietly the largest line on most AI bills. People worry about output, but a 50,000 token system prompt sent on every request burns more money than the model's response.

Where the Waste Lives

  • Stale system prompts: Old instructions that no longer apply but still ship on every call
  • Whole-document context: Pasting an entire 200-page manual when only one section is relevant
  • Conversation history bloat: Sending the full chat back on every turn, including reasoning the model already finished
  • Verbose few-shot examples: Five 1,000-token examples when one 200-token example would do

What to Do Instead

  • Retrieve only relevant context using semantic search before the model call. This is what RAG (retrieval-augmented generation) is for
  • Compress chat history. Summarise older turns instead of replaying them verbatim (a minimal trimming sketch follows this list)
  • Use structured outputs (JSON schemas) so the model does not waste tokens on prose wrapping
  • Audit your system prompt monthly. Old rules accumulate like dead code
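
For the history-compression point above, here is a minimal sketch that trims a transcript to a token budget using the tiktoken library, dropping the oldest turns first. The encoding name and budget are assumptions to match to your model; a fuller version would summarise the dropped turns with a small model rather than discard them:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding is an assumption; match your model

def trim_history(messages: list[dict], budget: int = 4000) -> list[dict]:
    """Keep the most recent turns that fit within `budget` input tokens."""
    kept: list[dict] = []
    used = 0
    for msg in reversed(messages):  # walk newest-first so recent turns win
        tokens = len(enc.encode(msg["content"]))
        if used + tokens > budget:
            break
        kept.append(msg)
        used += tokens
    return list(reversed(kept))     # restore chronological order

history = [
    {"role": "user", "content": "How do I reset my password?"},
    {"role": "assistant", "content": "Go to Settings > Security and click Reset."},
    {"role": "user", "content": "It says my token has expired."},
]
print(trim_history(history, budget=50))
```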

Caching: The Cheapest Tokens Are the Ones You Do Not Send

Prompt caching is the most underused cost lever in AI engineering. Most major providers now offer it. Cached input tokens are charged at roughly 10% of the standard rate, sometimes less.

How Prompt Caching Works

The provider stores the prefix of your prompt (system instructions, tool definitions, large reference documents) and reuses the computation on subsequent calls. As long as the prefix is identical, you pay the cached rate.
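
As one concrete illustration, Anthropic's API exposes this through cache_control markers on the stable blocks of a request; other providers cache matching prefixes automatically, so check your provider's docs. A minimal sketch, with a placeholder model name and a hypothetical reference file:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

reference_doc = open("product_catalogue.txt").read()  # hypothetical large, stable document

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder; use a current model name
    max_tokens=1024,
    system=[
        {"type": "text", "text": "You are a support agent for Acme Ltd."},
        {
            "type": "text",
            "text": reference_doc,
            # Mark the end of the stable prefix as cacheable; later calls with
            # an identical prefix are billed at the cached rate.
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": "What is your returns policy?"}],
)

# The usage block reports tokens written to and read from the cache.
print(response.usage)
```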

Practical Wins

  • Stable system prompts: Your instructions, persona, and tool list. Cache the lot
  • Large reference docs: A pricing schedule, a product catalogue, a knowledge base. Cache once, query many times
  • Few-shot examples: The teaching examples in your prompt. Static, perfect for caching

A well-cached production app routinely sees 70% to 90% of input tokens billed at the cached rate. That is the difference between a feature that pays for itself and one that quietly bleeds your margin.

Usage Caps, Metering, and Budgets

One bad prompt loop can cost £500 in an afternoon. One leaked API key can cost £50,000 overnight. Production AI without usage controls is a credit card with no PIN.

The Controls You Need

  • Per-user daily caps: Each end user has a maximum daily token budget. Stops one heavy user breaking unit economics (sketched after this list)
  • Per-feature budgets: Each AI feature has a monthly cap. Spend goes to alerts and pauses before it goes to support tickets
  • Rate limits: Maximum calls per minute per user. Stops runaway agent loops
  • Cost telemetry: Every call logged with model, input tokens, output tokens, and cost. You cannot optimise what you do not measure
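
A minimal sketch of the first control, using an in-memory dictionary where production code would use Redis or a database row; the cap value is illustrative:

```python
import datetime
from collections import defaultdict

DAILY_TOKEN_CAP = 50_000  # illustrative; tune to your unit economics

# In production this lives in Redis or your database; a dict keeps the sketch self-contained.
_usage: dict[tuple[str, datetime.date], int] = defaultdict(int)

def allow_call(user_id: str, estimated_tokens: int) -> bool:
    """Record usage and return True if the user is still under today's budget."""
    key = (user_id, datetime.date.today())
    if _usage[key] + estimated_tokens > DAILY_TOKEN_CAP:
        return False  # capped: refuse politely or queue for tomorrow
    _usage[key] += estimated_tokens
    return True

if allow_call("user-42", estimated_tokens=1_200):
    print("proceed with the model call")
else:
    print("daily limit reached")
```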

The Tier-Based Future

Customer-facing AI products are moving toward tiered pricing as fast as the providers behind them. Free tier with daily limits. Paid tier with monthly limits. Top-up credits for power users. This is not greedy product strategy. It is the only structure that survives the underlying compute economics.

Production Quality Controls

Cheap output is worthless if it is wrong. The cost-aware developer also has to be a quality-aware developer. The tools for this are evals, observability, and guardrails.

Evals

An eval is a test suite for AI behaviour. You write a set of input cases and expected outcomes (or quality criteria), and you run them every time the prompt or model changes. Without evals, you cannot tell if a cheaper model is actually cheaper, or just worse.
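
An eval harness can start very small. The sketch below assumes the same hypothetical call_model() wrapper as the cascade earlier and scores with naive substring checks; real suites use stricter scoring or a judge model:

```python
# Minimal eval harness: fixed cases, a pass-rate score, run on every prompt or model change.
EVAL_CASES = [
    {"input": "Extract the order ID from: 'Order #A1234 has shipped.'", "expect": "a1234"},
    {"input": "Classify the sentiment of: 'This is brilliant!'", "expect": "positive"},
]

def call_model(tier: str, prompt: str) -> str:
    """Hypothetical stand-in for your provider wrapper."""
    raise NotImplementedError

def pass_rate(tier: str) -> float:
    passed = sum(
        1 for case in EVAL_CASES
        if case["expect"] in call_model(tier, case["input"]).lower()
    )
    return passed / len(EVAL_CASES)

# Only move traffic to a cheaper tier when it holds the bar you set, e.g.:
# if pass_rate("small") >= 0.95: route_most_traffic_to("small")
```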

Observability

You need to see, in production, what your AI is actually doing. Tools like Langfuse, Helicone, and built-in provider dashboards let you trace every call, inspect the prompt, see the response, and tag failures. This is where you find the wasteful patterns and fix them.

Guardrails

Guardrails are runtime checks that catch bad outputs before they reach users. Examples: content moderation filters, schema validation on structured outputs, cost ceilings per call, refusal detection. Guardrails turn AI from a probabilistic toy into a system you can put in front of customers.
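
Schema validation is the easiest guardrail to start with. A minimal sketch using pydantic, with a hypothetical RefundDecision shape standing in for whatever structure your feature requires:

```python
from pydantic import BaseModel, ValidationError

class RefundDecision(BaseModel):
    """The shape we require from the model's structured output."""
    approve: bool
    amount_gbp: float
    reason: str

def guard(raw_output: str) -> RefundDecision | None:
    """Validate the model's JSON before it reaches any downstream system."""
    try:
        return RefundDecision.model_validate_json(raw_output)
    except ValidationError:
        return None  # reject, retry, or escalate to a human

print(guard('{"approve": true, "amount_gbp": 12.50, "reason": "damaged item"}'))
print(guard("not json at all"))  # None: caught before it touches a customer
```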

Real-World Example: A Cost-Aware Customer Support Agent

Imagine a UK e-commerce company building an AI support agent. The naive version uses the most capable model on every call, sends the full FAQ document with every message, and has no usage caps. Monthly bill: £8,500 for 12,000 conversations. Unit cost: £0.71 per conversation.

The cost-aware version, built by an experienced team:

  • Routes 70% of queries to a small model that handles common questions
  • Caches the FAQ and product catalogue (90% of input tokens hit the cache)
  • Uses RAG to retrieve only the relevant policy section instead of the whole handbook
  • Caps each customer at 20 messages per day
  • Includes evals to confirm the small model still hits the quality bar
  • Escalates to the medium model only when confidence is low

Same volume. Same quality bar. Monthly bill: £950. Unit cost: £0.08 per conversation. That is the difference between a feature that scales and one that gets quietly switched off in a board meeting.
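
A back-of-the-envelope cost model makes numbers like these easy to sanity-check before you build. The sketch below uses illustrative token counts and an assumed output rate, chosen to land near the naive build's figures; plug in your own measurements for the cost-aware design:

```python
def cost_per_conversation(input_tokens: int, output_tokens: int, cached_share: float,
                          in_rate: float, out_rate: float) -> float:
    """Pounds per conversation; rates are £ per million tokens, cache billed at 10%."""
    cached = input_tokens * cached_share
    fresh = input_tokens - cached
    return (fresh * in_rate + cached * in_rate * 0.10 + output_tokens * out_rate) / 1e6

# Naive build: large model (£12/M input, assumed £36/M output), full FAQ resent on
# every turn (~55k input tokens per conversation), nothing cached.
naive = cost_per_conversation(55_000, 1_400, cached_share=0.0, in_rate=12.0, out_rate=36.0)
print(f"£{naive:.2f} per conversation, £{naive * 12_000:,.0f} per month")  # ≈ £0.71, ≈ £8,500
```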

What Every Engineer Needs to Learn

The skills below are not optional any more. They are the AI equivalent of knowing how to read a SQL query plan or set up a CDN.

  • Read your AI bill. Find the line items: which features cost the most, which users drive the bill, which model tiers consume the budget
  • Measure tokens, not just latency. Every call should log input tokens, output tokens, cached tokens, and the model used (see the logging sketch after this list)
  • Run evals before you switch models. Cheaper is only cheaper if quality holds
  • Design caching into the prompt structure. Static prefix first. Dynamic content last
  • Add usage caps before launch, not after the first incident. Defaults should be safe
  • Keep a model menu. Document which model your team uses for which task and why. Revisit it quarterly
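
For the telemetry point above, a minimal sketch of a structured cost log with illustrative per-million rates; in production these lines would ship to your log pipeline or a tracing tool like Langfuse:

```python
import json
import time

# (input_rate, output_rate) in pounds per million tokens; illustrative numbers.
RATES = {"small": (0.80, 2.40), "medium": (2.40, 9.60), "large": (12.00, 36.00)}

def log_call(model: str, input_tokens: int, cached_tokens: int, output_tokens: int) -> None:
    """Emit one structured log line per model call so cost questions become queries."""
    in_rate, out_rate = RATES[model]
    cost = ((input_tokens - cached_tokens) * in_rate
            + cached_tokens * in_rate * 0.10   # cached tokens at roughly 10% of rate
            + output_tokens * out_rate) / 1e6
    print(json.dumps({
        "ts": time.time(),
        "model": model,
        "input_tokens": input_tokens,
        "cached_tokens": cached_tokens,
        "output_tokens": output_tokens,
        "cost_gbp": round(cost, 6),
    }))

log_call("small", input_tokens=6_000, cached_tokens=5_400, output_tokens=800)
```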

How Original Objective Builds Cost-Aware AI

We have spent the last two years building AI features for UK businesses. The patterns above are not theory. They are what separates the AI projects that ship and survive from the ones that get quietly killed when the first invoice arrives.

What We Bring

  • Cost modelling before we write code: We estimate per-conversation, per-user, and per-feature cost upfront. You see the unit economics before commitment
  • Model selection by use case: We match each task to the cheapest model that meets the quality bar, with evals to prove it
  • Caching and retrieval architecture: Prompt caching, semantic search, and structured context. Standard in everything we ship
  • Production observability: Every AI call is logged, tagged, and traceable. You see exactly where the money goes
  • Usage controls from day one: Per-user caps, per-feature budgets, alerting before overage. No surprises

The Bottom Line

The flat-fee era trained a generation of developers to ignore AI cost. That training is now a liability.

GitHub pausing new Copilot subscriptions is a small headline today. In twelve months, every AI product you depend on will price like a cloud service: tiered, metered, and unforgiving of waste. The developers and businesses that adapt now will ship AI features that scale. The ones that do not will watch their margins evaporate.

Cost-aware engineering is no longer a specialism. It is part of the job.

Talk to Us

If you are building AI features and want them to survive the post-unlimited era, we can help. Original Objective designs and ships production AI systems for UK businesses, with cost, quality, and reliability engineered in from the start.

Book a free discovery call. We will review your AI roadmap, model the unit economics, and show you the highest-impact cost levers in your current build.

Frequently Asked Questions

Why did GitHub pause new Copilot subscriptions?

The compute cost of running coding agents has outgrown what a flat-fee subscription can sustainably cover. Modern AI assistants do not just answer one prompt. They plan, call tools, retry, and reason internally, often consuming 200,000 tokens or more per task. With GPU capacity scarce and reasoning models burning more compute than ever, GitHub is the first major platform to publicly hit the wall. Expect every other AI tool to follow with usage caps, tier-based billing, and credit-style pricing over the next twelve months.

What does cost-aware AI engineering mean for developers?

Cost-aware AI engineering is the practice of building AI features that deliver business value at sustainable unit economics. It covers five layers: choosing the right model for each task, writing efficient prompts, using prompt caching, setting per-user usage caps, and putting evals and observability in place to confirm cheaper choices still hit the quality bar. It is the AI equivalent of the cloud cost discipline every senior engineer learned between 2010 and 2015.

How do I choose between Haiku, Sonnet, and Opus, or GPT mini and GPT large?

Match the model to the job. Use the small tier (Haiku, GPT mini, Gemini Flash) for classification, extraction, and simple rewrites, at around £0.80 per million input tokens. Use the medium tier (Sonnet, GPT, Gemini Pro) for code generation, multi-step reasoning, and customer chat, at around £2.40 per million input tokens. Reserve the large tier (Opus, GPT large, Gemini Ultra) for agentic workflows and hard reasoning, at around £12 per million input tokens. A common pattern is the cascade: start small, escalate only when confidence is low.

What is prompt caching and how much can it save?

Prompt caching stores the stable prefix of your prompt (system instructions, tool definitions, large reference docs) so the provider can reuse the computation on subsequent calls. Cached input tokens cost roughly 10% of the standard rate. A well-cached production app routinely sees 70% to 90% of input tokens billed at the cached rate. For a high-volume feature, that can be the difference between unit economics that work and unit economics that quietly bleed margin.

Subscribe to the AI Growth Newsletter

Get weekly AI insights, tools, and success stories straight to your inbox.

Here's what you'll get when you subscribe:

  • Cost Modelling - know your unit economics before you ship
  • Model Selection - match each task to the cheapest model that meets the bar
  • Prompt Caching - cut input token spend by 70% to 90%
  • Usage Controls - per-user caps and per-feature budgets from day one
  • Production Evals - prove cheaper models still hit the quality bar
  • Observability - every AI call traced, tagged, and costed
  • UK-Focused - GDPR, accessibility, and industry compliance handled

No spam. Just practical AI tips for growing your business.
