Running a local AI model for client work

We started running an open-weight language model on our own hardware. Here's why we did it, what we use it for, and what we learned.

For most of 2025 we used cloud AI APIs like every other small studio: pay per token, no infrastructure to manage, results back in seconds. That works great until you're running at real volume, or until you have a client whose data shouldn't leave the building.

Earlier this year we set up an open-weight language model running locally on our own hardware. It's now part of our internal toolkit and powers AI features in some of our client tools. Here's the honest take on what it does, what it doesn't, and when it's worth doing.

What "local AI" actually means

"Local AI" doesn't mean training a model from scratch. We use publicly available open-weight models (there are several good ones, with new releases every month) and run them on a server we own. The model files are large (gigabytes), but they're a one-time download. After that, every inference happens on our hardware, with no data leaving the network.

It's the same idea as running your own database instead of using a managed cloud one. You take on the operational burden in exchange for control, privacy, and predictable cost.

Why we do it

Cost predictability. A constantly running workload on a local model has a roughly flat cost: the hardware we've already paid for, plus the electricity to power it. There's no per-token bill that grows with usage. For features we use heavily, this matters.

Data privacy. Some client data shouldn't be sent to a third-party API. Internal HR documents, financial records, anything covered by a contractual confidentiality clause. Running the model locally means the data never leaves a server we control.

No rate limits and no third-party outages. Cloud AI APIs occasionally go down, and they impose rate limits that bite when you don't expect them. A local model depends only on our own infrastructure: if our server is up, the model is up.

Iteration speed. We can experiment with different models, different prompts, and different parameters as much as we want without watching a meter. That has made it much easier to try new ideas during product development.

What we use it for

Three main use cases so far:

  • SEO content drafting for landing page systems. Generating the content variation that keeps thousands of programmatic pages from looking identical.
  • Internal summaries. Distilling long meeting transcripts, support tickets, or document piles into actionable highlights.
  • Embedded AI features in client tools. Where a client needs "ask a question about my data" in their dashboard, the model runs on our server rather than an external API.

What it's not

Local AI isn't a drop-in replacement for the frontier cloud models. The largest open-weight models we can practically run are still meaningfully smaller than what OpenAI or Anthropic offer at the top end. For complex reasoning, code generation in unfamiliar languages, or anything where you'd notice the gap, the cloud is still better.

It also isn't free. The hardware costs real money up front. The electricity costs real money over time. If you're going to run it for a use case that processes ten requests a week, you'll pay way more than you would on a cloud API.
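
To make that trade-off concrete, here is a rough break-even sketch. Every number in it (hardware price, amortization period, power draw, electricity rate, cloud price per million tokens) is a made-up placeholder, not a figure from our setup, and the function names are ours for illustration. Plug in your own values.

```python
def monthly_local_cost(hardware_cost, amortization_months, power_watts,
                       price_per_kwh, hours_per_month=730):
    """Flat monthly cost of a local server: amortized hardware plus electricity.

    Assumes the server runs around the clock (about 730 hours per month).
    """
    hardware = hardware_cost / amortization_months
    electricity = (power_watts / 1000) * hours_per_month * price_per_kwh
    return hardware + electricity


def break_even_tokens_per_month(local_monthly_cost, cloud_price_per_million_tokens):
    """Monthly token volume above which the local server is cheaper than the cloud."""
    if cloud_price_per_million_tokens <= 0:
        raise ValueError("cloud price must be positive")
    return local_monthly_cost / cloud_price_per_million_tokens * 1_000_000


# Placeholder inputs: a 3,600 server written off over 36 months, drawing 400 W,
# electricity at 0.15 per kWh, and a cloud API at 2.00 per million tokens.
local = monthly_local_cost(hardware_cost=3600, amortization_months=36,
                           power_watts=400, price_per_kwh=0.15)
tokens = break_even_tokens_per_month(local, cloud_price_per_million_tokens=2.0)

print(f"Local server: {local:.2f} per month")          # Local server: 143.80 per month
print(f"Break-even: {tokens:,.0f} tokens per month")   # Break-even: 71,900,000 tokens per month
```

Below the break-even volume the cloud API is cheaper. The sketch ignores the time spent maintaining the server, which pushes the real break-even point higher.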

When does local AI make sense?

The trade-offs line up clearly. Local AI wins when:

  • You're running the model constantly enough that the hardware cost amortizes
  • The use case can tolerate a smaller model than the frontier
  • Privacy or data residency is a real requirement
  • You have the ops capacity to maintain the server

Cloud AI wins when:

  • Usage is bursty or low-volume
  • You need cutting-edge reasoning capability
  • You don't want to think about infrastructure at all

Both can coexist. We use cloud APIs for some workloads and local for others. The right choice is per-feature, not per-company.
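
A per-feature decision like this can be written down as a small routing rule. The sketch below is one way to encode the criteria above, not the code we run in production. The `Feature` fields, the `choose_backend` name, and the 500-requests-per-day threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    name: str
    sensitive_data: bool        # data that must not leave our network
    needs_frontier_model: bool  # complex reasoning where a smaller model falls short
    requests_per_day: int


def choose_backend(feature, min_daily_volume=500):
    """Return "local" or "cloud" for a single feature.

    min_daily_volume is an assumed cutoff for "constant enough to amortize
    the hardware"; tune it against your own cost numbers.
    """
    # A privacy requirement overrides everything else.
    if feature.sensitive_data:
        return "local"
    # Cutting-edge reasoning is where the cloud models still win.
    if feature.needs_frontier_model:
        return "cloud"
    # Steady, high-volume work is where a local server pays for itself.
    if feature.requests_per_day >= min_daily_volume:
        return "local"
    # Bursty or low-volume usage is cheaper on a pay-per-token API.
    return "cloud"


features = [
    Feature("hr-document-qa", sensitive_data=True, needs_frontier_model=False, requests_per_day=40),
    Feature("seo-drafting", sensitive_data=False, needs_frontier_model=False, requests_per_day=2000),
    Feature("code-generation", sensitive_data=False, needs_frontier_model=True, requests_per_day=30),
]
for f in features:
    print(f.name, "->", choose_backend(f))
```

Checking privacy first is a deliberate choice: a feature that handles confidential data stays local even if a frontier model would give better answers, so it has to be designed around what a smaller model can do.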

If you're thinking about adding AI to your business

We help clients build AI-powered features into their existing tools — whether that means routing to a cloud API, running things locally, or some combination. The best starting point is usually a 30-minute call about what you're actually trying to accomplish, not a debate about which model to use.

If that's something you're exploring, drop us a line.
