
Running Local AI Models for Compliance-Sensitive Organizations

By Leonid Sokolovskiy · Published February 25, 2026 · 11 min read · Source: Level Up Coding

Part 4 of our series on self-hosting LLMs

Terraform hardening, startup automation, throughput benchmarking, and a cost overview. Part 1: infrastructure setup. Part 2: model selection and quota. Part 3: Ollama limits and vLLM setup.

Introduction

Part 3 left us with a working vLLM setup — MiniMax-M2.5 running on dual A100s, Claude Code connecting, real code reviews completing in ~25 seconds. But “working” and “production-ready” are different things. We had bugs in our Terraform, a default VPC we’d never hardened, a startup process that required manual docker commands after every VM boot, and no systematic performance data.

Part 4 closes most of the gaps: infrastructure hardening, startup automation, throughput benchmarking under load, and a cost overview.

TL;DR: Terraform bug fixes and a dedicated VPC, a templated startup script supporting both Ollama and vLLM, a context window pushed from 65K to 92K tokens (+42%), ~5 comfortable concurrent users at ~286 tok/s, and ~$4.63/hr on spot pricing.

Part 1: Terraform Hardening

Bug Fixes from Code Review

Part 3’s MiniMax-powered code review found issues in our own infrastructure code. We addressed them all.

1. Duplicate startup script

We had both metadata_startup_script and metadata["startup-script"] defined — Terraform accepted it, GCP silently used one and ignored the other. Removed the duplicate, consolidated to metadata["startup-script"] using templatefile().

2. Firewall resource — preventive improvement

The firewall resource used count = var.enable_iap ? 1 : 0. No downstream references were broken, but count-based resources are a latent footgun — any future reference written as [0].name breaks the moment the condition flips the count to zero. We switched to for_each with a fixed key, making references explicit and unambiguous:

resource "google_compute_firewall" "iap_firewall" {
  for_each = var.enable_iap ? toset(["main"]) : toset([])
  name     = "allow-iap-${each.key}"
  # ...
}
# Reference: google_compute_firewall.iap_firewall["main"].name

3. deletion_protection variable

Added deletion_protection = var.deletion_protection to the instance resource, defaulting to false for dev environments. A small guard that prevents accidental terraform destroy from removing a VM that took approximately 30 minutes to load 130GB of model weights.
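A sketch of the wiring, with the variable defaulted for dev; the instance resource name (llm_vm) is illustrative, not necessarily the name used in the repository:

```hcl
variable "deletion_protection" {
  type        = bool
  default     = false # flip to true for long-lived environments
  description = "Guard the instance against accidental terraform destroy"
}

resource "google_compute_instance" "llm_vm" {
  # ...
  deletion_protection = var.deletion_protection
}
```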

Note on hf_token: The original review flagged HuggingFace token exposure via startup script metadata. For our current setup this isn't an issue — we're pulling a public model with no auth required. If you're pulling private models, move the token to GCP Secret Manager and fetch it at boot time.
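If you do need a private model, a boot-time fetch might look like the fragment below. It assumes a secret named hf-token and a VM service account granted roles/secretmanager.secretAccessor; both names are placeholders for your own setup:

```shell
# Fetch the HuggingFace token from Secret Manager at boot instead of
# passing it through instance metadata, where any reader of the
# metadata server could see it.
HF_TOKEN=$(gcloud secrets versions access latest --secret=hf-token)
export HF_TOKEN
```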

Boot Disk Orphan Warning

Our config has auto_delete = var.termination_action == "DELETE". In dev, termination_action is STOP, so auto_delete = false. This means when Terraform replaces the instance — for example after a VPC or subnet change — the old boot disk is not deleted. It becomes an orphaned resource accumulating cost. Either set auto_delete = true for dev, or remember to clean up old disks after instance replacements.

Dedicated VPC

The default VPC is shared across all GCP resources in the project. For compliance workloads, a dedicated network boundary is the safer baseline.

resource "google_compute_network" "llm_vpc" {
  name                    = "llm-network"
  auto_create_subnetworks = false
}

resource "google_compute_subnetwork" "dedicated_subnet" {
  name          = var.subnet_name
  ip_cidr_range = var.vpc_cidr # e.g. "10.0.1.0/24"
  region        = var.region
  network       = google_compute_network.llm_vpc.id
}

The dedicated subnet has no external IP. Outbound traffic (HuggingFace downloads, OS updates) routes via Cloud NAT attached to the new subnet. The IAP tunnel for SSH and port forwarding continues to work — it doesn’t require a public IP.
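For reference, the NAT attachment looks roughly like this. The resource names match the snippets in this article, but the repository's router is count-based (hence the [0] index in the lifecycle block shown later); it is written here without count for brevity:

```hcl
resource "google_compute_router" "nat_router" {
  name    = "llm-nat-router"
  region  = var.region
  network = google_compute_network.llm_vpc.id
}

resource "google_compute_router_nat" "nat" {
  name                               = "llm-nat"
  router                             = google_compute_router.nat_router.name
  region                             = var.region
  nat_ip_allocate_option             = "AUTO_ONLY"
  source_subnetwork_ip_ranges_to_nat = "LIST_OF_SUBNETWORKS"

  subnetwork {
    name                    = google_compute_subnetwork.dedicated_subnet.id
    source_ip_ranges_to_nat = ["ALL_IP_RANGES"]
  }
}
```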

NAT Lifecycle Fix

During testing, terraform apply failed mid-run with a not found in list error when Terraform tried to update the NAT resource after replacing the router. The new router existed but the NAT hadn't been re-attached yet.

The fix is a replace_triggered_by lifecycle rule that keeps NAT and router in sync:

resource "google_compute_router_nat" "nat" {
  # ...
  lifecycle {
    replace_triggered_by = [
      google_compute_router.nat_router[0].id,
      google_compute_subnetwork.dedicated_subnet.id
    ]
  }
}

Without this, a partial apply leaves the VM without outbound internet on the next boot — meaning the model weights can’t download.

Part 2: Startup Script Template

As the startup script grew — vLLM docker command, conditional engine selection, HuggingFace token handling — keeping it inline in main.tf became a burden. We moved it to terraform/templates/startup.sh.tpl.
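The wiring in main.tf then reduces to a single templatefile() call. The variable names mirror the placeholders used in the template; the instance resource name is illustrative:

```hcl
resource "google_compute_instance" "llm_vm" {
  # ...
  metadata = {
    startup-script = templatefile("${path.module}/templates/startup.sh.tpl", {
      inference_engine            = var.inference_engine
      ollama_model                = var.ollama_model
      vllm_model                  = var.vllm_model
      gpu_count                   = var.gpu_count
      vllm_gpu_memory_utilization = var.vllm_gpu_memory_utilization
      vllm_max_model_len          = var.vllm_max_model_len
      vllm_trust_remote_code      = var.vllm_trust_remote_code
    })
  }
}
```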

Inference Engine Conditional

The template supports both Ollama and vLLM via an inference_engine variable:

# startup.sh.tpl (simplified)
%{ if inference_engine == "ollama" }
docker run -d --gpus all --restart always --name ollama \
  -v ollama:/root/.ollama -p 11434:11434 \
  ollama/ollama
sleep 10 && docker exec ollama ollama pull ${ollama_model}
%{ endif }
%{ if inference_engine == "vllm" }
# Build optional flags as shell variables first —
# inline conditionals inside docker run multiline commands don't work reliably
TRUST_FLAG=""
%{ if vllm_trust_remote_code }
TRUST_FLAG="--trust-remote-code"
%{ endif }
docker run -d --gpus all --restart always \
  -p 8000:8000 --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:nightly \
  ${vllm_model} \
  --tensor-parallel-size ${gpu_count} \
  --gpu-memory-utilization ${vllm_gpu_memory_utilization} \
  --max-model-len ${vllm_max_model_len} \
  --enable-auto-tool-choice \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2 \
  $TRUST_FLAG
%{ endif }

Two Bugs We Hit During Testing

Bug 1: Missing trust_remote_code

Our first automated boot failed with a ValidationError — the repository requires trust_remote_code=True. We had --trust-remote-code in the manual docker command from Part 3 but forgot to add vllm_trust_remote_code = true to terraform.tfvars, so the template rendered the flag as an empty string. This was hard to spot because the container starts, loads weights for ~8 minutes, and only then crashes.

Bug 2: Conditional flags inside docker run

Putting the trust_remote_code conditional directly inside the multiline docker run command didn't render correctly — Terraform's template conditionals and shell line continuations don't mix well. The reliable solution is to pre-build optional flags as shell variables above the docker command and reference them at the end.
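The pattern in isolation, as a minimal shell sketch; the hard-coded TRUST_REMOTE_CODE stands in for the value Terraform's %{ if } would render, and echo stands in for docker run so the final command line is visible:

```shell
#!/bin/sh
# Stand-in for the Terraform-rendered conditional.
TRUST_REMOTE_CODE=true

# Pre-build the optional flag as a plain shell variable...
TRUST_FLAG=""
if [ "$TRUST_REMOTE_CODE" = "true" ]; then
  TRUST_FLAG="--trust-remote-code"
fi

# ...then reference it once, at the end of the command
# (echo stands in for docker run here).
echo vllm-serve $TRUST_FLAG
```

When the condition is false, $TRUST_FLAG expands (unquoted) to nothing at all, so no stray empty argument reaches the command.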

Further Improvements

This template covers the common case. Production deployments would add a health check polling vLLM’s /health endpoint before declaring the instance ready, a model_family variable to select the right --tool-call-parser per model, disk space pre-checks before starting a 90GB download, and GPU count validation against --tensor-parallel-size. We'll revisit these in a future iteration.
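The health-check piece could be a small readiness loop along these lines. The probe function here is a stub that succeeds on the third call; a real script would replace it with `curl -sf http://localhost:8000/health` (an endpoint vLLM's OpenAI-compatible server does expose) and a real sleep interval:

```shell
#!/bin/sh
ATTEMPTS=0
probe() {
  # Stub standing in for: curl -sf http://localhost:8000/health
  ATTEMPTS=$((ATTEMPTS + 1))
  [ "$ATTEMPTS" -ge 3 ]
}

i=0
until probe; do
  i=$((i + 1))
  if [ "$i" -ge 60 ]; then
    echo "vLLM did not become healthy in time" >&2
    exit 1
  fi
  sleep 0   # real script: sleep 10 (up to 10 min with 60 attempts)
done
echo "ready after $ATTEMPTS probes"
```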

Part 3: Context Window — Iterating to the Real Maximum

In Part 3 we capped --max-model-len at 65,536. vLLM calculated the hardware maximum as 92,544. Attempting to use that exact value fails:

ValueError: To serve at least one request with the model's max seq len (92544), 
10.94 GiB KV cache is needed, which is larger than the available KV cache
memory (10.94 GiB). Based on the available memory, the estimated maximum model
length is 92528.

So we try 92,528 — which also fails, with the ceiling now reported as 92,160. The safe approach: try the reported maximum, subtract ~384 if it fails, repeat. We landed on:

--max-model-len 92160

That’s a 42% increase over the Part 3 limit. Still not the full 196K context (which requires 4× A100), but meaningful for real codebase reviews.
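The step-down search can be sketched as plain shell. try_len is a stub wired to the value that actually worked in our runs; a real version would start vLLM with --max-model-len "$1" and check whether it comes up:

```shell
#!/bin/sh
# Walk down from vLLM's reported ceiling in ~384-token steps
# until a length actually fits in the KV cache.
MAX=92544
STEP=384

try_len() {
  # Stub: accepts the length that worked for us (92160).
  [ "$1" -le 92160 ]
}

while ! try_len "$MAX"; do
  MAX=$((MAX - STEP))
done
echo "$MAX"   # → 92160
```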

The Context Window in Practice

To validate, we ran the codebase overview prompt that previously failed — this time with explicit instruction to read everything:

“Explain what this codebase does. Please review all code, not only README.”

Result:

Glob pattern: "**/*.tf" → 5 files
Glob pattern: "**/*.yml" → 3 files
Glob pattern: "**/*.md" → 11 files
Read: README.md, QUICKSTART.md, main.tf, variables.tf, dev.tfvars, startup.sh.tpl

The model produced a structured overview covering the compliance motivation, architecture diagram, file-by-file breakdown, and current configuration. It correctly identified the 2× A100 vLLM setup with MiniMax-M2.5 AWQ — picking that up from dev.tfvars, not the README. It reconstructed the GitHub Actions CI/CD setup from the workflow files.

This is the response you’d get from a senior engineer doing a proper day-one onboarding review: reads the code, not just the docs. The 92K context window made it possible.

Part 4: Throughput Benchmarking

All tests run against a warm server (model already loaded in VRAM). Same prompt throughout: “Write a Python quicksort implementation” with max_tokens: 500. Load testing done with hey. Full infrastructure code is in the repository.

Single Request

real    0m5.218s
completion_tokens: 500
→ ~96 tokens/second

Concurrent Requests — Manual Test (5 requests)

Manual test — 2 and 5 requests fired simultaneously:

Concurrency    Avg latency    Tok/s per req    System throughput
1              5.2s           96               96 tok/s
2              6.1s           82               164 tok/s
5              7.7s           65               325 tok/s

Tokens/sec per request = 500 tokens ÷ elapsed time (from sending request to receiving full response). System throughput = tokens/sec × concurrency.
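Those two formulas, applied to the timings above (awk handles the floating-point division):

```shell
#!/bin/sh
# Per-request tok/s = 500 tokens / elapsed seconds;
# system throughput = per-request tok/s x concurrency.
for pair in "1 5.2" "2 6.1" "5 7.7"; do
  set -- $pair
  awk -v c="$1" -v t="$2" 'BEGIN {
    per = 500 / t
    printf "c=%s  %.0f tok/s per req, %.0f tok/s system\n", c, per, per * c
  }'
done
```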

All 5 concurrent requests completed correctly. Total system throughput scales as vLLM batches requests — 5 concurrent users get 3.4× the output of a single user in the same window.

Extended Test (50 requests via hey)

Concurrency    Req/sec    System tok/s    Avg latency    Errors
5              0.57       286             8.7s           0 / 50
10             0.78       388             11.1s          10 / 50

System tok/s = req/sec × 500 tokens (fixed max_tokens per request).

At 5 concurrent users: 286 tokens/sec, zero errors, consistent latency. At 10 concurrent: throughput increases but 10/50 requests hit client timeout. The server was still processing — increasing client timeout would recover most of these — but 11+ second latency for interactive coding starts to feel slow.

The 325 tok/s (manual testing) vs 286 tok/s (extended testing) difference is expected: extended testing includes queueing effects, HTTP client overhead, and timeout behavior under prolonged load.

Practical capacity guidance: 5 concurrent developers is the comfortable operating point for this hardware.

Part 5: Cost Overview

What We’re Spending

Resource                 Cost/hr (spot pricing)
2× A100-80GB             $4.54
250GB persistent disk    $0.04
Cloud NAT                ~$0.05
Total                    ~$4.63/hr

At 4 hours/day, 20 days/month: roughly $370/month when actively running.
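The monthly figure is straightforward arithmetic:

```shell
#!/bin/sh
# ~$4.63/hr x 4 hr/day x 20 days/month
awk 'BEGIN { printf "$%.0f/month\n", 4.63 * 4 * 20 }'   # → $370/month
```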

API Pricing for Context

For reference, what the equivalent hosted API costs:

Provider             Input $/1M    Output $/1M    Context
MiniMax M2.5 API     $0.30         $1.20          200K tokens
Claude Sonnet 4.5    $3.00         $15.00         200K tokens
OpenAI GPT-4o        $2.50         $10.00         128K tokens

MiniMax’s own API is priced aggressively — at low usage, it’s cheaper than self-hosting. That’s fine. Cost isn’t the primary reason compliance-sensitive organizations self-host.
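To make "cheaper at low usage" concrete, a rough back-of-envelope that ignores input tokens entirely: our ~$370/month of GPU time buys roughly 300M output tokens at MiniMax's API rate, far more than a small team generates.

```shell
#!/bin/sh
# Output tokens/month that $370 buys at $1.20 per 1M output tokens
# (input tokens ignored for simplicity).
awk 'BEGIN { printf "%.0fM output tokens\n", 370 / 1.20 }'   # → 308M output tokens
```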

The Real Reason: Data Residency

For HIPAA, GDPR, or SOC 2 Type II workloads, the calculation isn’t about price per token. It’s about where the data goes.

With a hosted API, inference happens on vendor hardware you don’t control, in data centers you haven’t audited. A Data Processing Agreement helps, but it doesn’t change the physical reality of where your data is processed. With self-hosted infrastructure, inference traffic stays within your controlled GCP project and VPC boundary rather than being sent to an external model API. You still rely on your cloud provider as a processor, but you avoid adding an external model vendor to the data path, which simplifies governance and audit scope. Full audit trail in your own logs.

A secondary consideration: model pinning. Hosted APIs update models without notice. Self-hosted infrastructure runs exactly the model version you tested and validated — meaningful for regulated environments where model changes require re-validation.

The $4.63/hr is the price of that guarantee. For organizations where the alternative is a compliance review, legal sign-off, and ongoing audit obligations, that’s not expensive.

Results Summary

Metric                          Value
────────────────────────────────────────────────────────
Context window (Part 3)         65,536 tokens
Context window (Part 4)         92,160 tokens (+42%)
Single request throughput       96 tokens/sec
5 concurrent (extended)         286 tokens/sec total, 0 errors
10 concurrent (extended)        388 tokens/sec, 20% timeout rate
Comfortable capacity            ~5 concurrent users
Self-hosted cost                ~$4.63/hr (spot)

Lessons Learned

  1. trust_remote_code must be explicit in tfvars — missing it causes a 30-minute wait followed by a crash at the next step.
  2. Don’t put Terraform conditionals inside docker run multiline commands — pre-build optional flags as shell variables, reference them at the end.
  3. vLLM’s reported maximum context length is not always achievable — vLLM reported 92,544 as the ceiling, but the actual working value was 92,160. Internal block alignment shifts the limit slightly. If the reported max fails, step down by a few hundred tokens and try again.
  4. NAT and router replacements need explicit lifecycle triggers — without replace_triggered_by, a partial apply leaves the VM without outbound internet on next boot.
  5. For compliance orgs, cost isn’t the primary self-hosting argument — the API may be cheaper at low usage. The justification is data residency, model pinning, and audit control.
  6. On this hardware, 5 concurrent users is the comfortable operating point — plan capacity accordingly for interactive coding workflows.

What’s Next

Four parts in, the infrastructure is close to production-ready. What remains: health checks, download resilience, and validating the setup against a real multi-developer workflow rather than synthetic benchmarks. If your compliance requirements demand self-hosted inference, $4.63/hr on spot (or ~$8–9/hr on-demand for continuous availability) is a reasonable starting point.

Repository: github.com/leonids2005/local-llm-test

