Articles · 2025-03-15 · 18 min

Why your company should consider on-premise LLMs in 2025

José Pedro Lecha

2025-03-15

Data governance regulations are tightening across Latin America. We break down when it makes sense to run language models on your own infrastructure, when it's a waste of money, and what technical stack you actually need to do it right.

The regulatory landscape: why this is no longer optional

Over the past 18 months, the regulatory landscape in Latin America has shifted dramatically. Argentina moved forward with implementing its Personal Data Protection Law, Brazil tightened the LGPD with penalties that have already exceeded R$50 million in cumulative fines, and Mexico updated its Federal Data Protection Law with specific guidelines for artificial intelligence. Colombia and Chile are following the same path.

For a 200-person company in the financial or healthcare sector, this has direct implications: every time an employee pastes client data into ChatGPT or your system sends sensitive information to OpenAI's API, you are potentially violating local regulations. This isn't paranoia — it's the current legal framework.

The issue isn't that OpenAI, Anthropic, or Google APIs are insecure. The issue is that you don't control where the data is processed, who accesses it, or how it's retained. And for a regulator, that's enough to consider it an unauthorized international data transfer.

The trend is clear: data sovereignty has gone from being a compliance concern to an operational requirement. Companies that don't adapt will lose contracts, face fines, or simply be excluded from public and private tenders.

What 'on-premise' means in 2025 (it's not what you think)

When we say 'on-premise,' many CTOs picture a server rack in the office basement with a sysadmin swapping out hard drives at 3 AM. That image is outdated.

On-premise in 2025 has three real modalities. First: private cloud with isolation — a dedicated VPC on AWS, GCP, or Azure with network policies that guarantee data never leaves the region. Second: bare metal in a local datacenter — dedicated servers in a datacenter like Equinix, EdgeUno, or DataCenter Paraguay, where you have physical control of the hardware. Third: your own hardware — GPUs in your existing infrastructure, ideal for large companies that already have computing capacity.

What matters in all three cases is the same: data never crosses a perimeter you don't control. You decide which model runs, what logs are kept, who has access, and how long information is retained. That's real data sovereignty, not marketing.

One detail many overlook: on-premise doesn't mean disconnected. You can have an on-premise deployment that periodically updates with new models, reports usage metrics (without sensitive data) to a central dashboard, and scales automatically based on demand. The end-user experience can be identical to using an external API.

The open-source models that already compete with GPT-4

The open-source model ecosystem exploded in 2024-2025. We're no longer talking about mediocre models that give generic answers — there are options that seriously compete with closed models on specific tasks.

Meta's Llama 3.1 405B is the most impressive in terms of general capability. For most enterprise tasks — document summarization, classification, entity extraction, report generation — it performs on par with GPT-4. The 70B version is excellent for production on more accessible hardware, and the 8B version is surprisingly capable for simple tasks with minimal latency.

Mistral Large and Mixtral 8x22B are European options with excellent performance in Spanish and Portuguese, which is critical for the LATAM market. Alibaba's Qwen 2.5 surprised everyone with its multilingual capability and efficiency on limited hardware. And DeepSeek V3 demonstrated that frontier-level performance can be achieved with more efficient architectures.

The key point is that for 80% of enterprise use cases — which don't require complex frontier reasoning — these models are more than sufficient. And you can run them on your own infrastructure without paying per token.


Real cost comparison: API vs. on-premise

Let's run the numbers with a real case. A financial services company with 150 employees that uses LLMs to analyze legal documents, generate compliance reports, and assist with customer service.

With external APIs (GPT-4o): they process approximately 2 million input tokens and 500K output tokens per day. At current OpenAI prices, that's roughly USD 25/day in input and USD 7.50 in output. About USD 975/month. Sounds cheap, right? But add: USD 200/month in orchestration tools, USD 150 in external logging and monitoring, and the hidden cost of variable latency affecting user experience. Real total: ~USD 1,400/month.

With on-premise (Llama 3.1 70B on 2x NVIDIA A100): GPU leasing costs approximately USD 3,500/month. Add USD 500 for supporting infrastructure (networking, storage, power) and USD 300 for maintenance. Total: ~USD 4,300/month. But that cost is fixed — it doesn't matter whether you process 2M tokens or 20M.

The breakeven point is at roughly 6-8 million daily tokens. If your company is going to scale AI usage (and they all do), on-premise becomes cheaper within 6-12 months. Plus, you eliminate dependency on prices that change without notice — OpenAI has already raised and lowered prices multiple times.
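The arithmetic above can be sketched in a few lines. The per-token rates and fixed costs below are illustrative assumptions reverse-engineered from the figures in this article, not actual provider pricing; with these numbers the crossover lands around 10M tokens/day, and factoring in the hidden costs mentioned above pulls it toward the 6-8M range. Plug in your own volumes and rates.

```python
# Back-of-the-envelope API vs. on-premise breakeven.
# All rates and fixed costs are illustrative assumptions -- adjust to your case.

API_RATE_PER_M_INPUT = 12.50    # USD per million input tokens (assumed)
API_RATE_PER_M_OUTPUT = 15.00   # USD per million output tokens (assumed)
API_FIXED_MONTHLY = 350.0       # orchestration + logging tooling around the API
ONPREM_FIXED_MONTHLY = 4300.0   # GPU leasing + infra + maintenance, volume-independent
DAYS_PER_MONTH = 30

def api_monthly_cost(input_tokens_per_day: float, output_tokens_per_day: float) -> float:
    """Monthly cost of the external-API path at a given daily volume."""
    variable_per_day = (input_tokens_per_day * API_RATE_PER_M_INPUT
                        + output_tokens_per_day * API_RATE_PER_M_OUTPUT) / 1_000_000
    return API_FIXED_MONTHLY + variable_per_day * DAYS_PER_MONTH

def breakeven_daily_tokens(output_share: float = 0.2) -> float:
    """Daily token volume at which both paths cost the same."""
    blended_rate = ((1 - output_share) * API_RATE_PER_M_INPUT
                    + output_share * API_RATE_PER_M_OUTPUT)  # USD per million tokens
    monthly_gap = ONPREM_FIXED_MONTHLY - API_FIXED_MONTHLY
    return monthly_gap / (DAYS_PER_MONTH * blended_rate) * 1_000_000

# The worked example from the text: 2M input + 500K output tokens/day.
print(f"API cost at current volume: ~USD {api_monthly_cost(2_000_000, 500_000):.0f}/month")
print(f"breakeven: ~{breakeven_daily_tokens() / 1e6:.1f}M tokens/day")
```

The fixed-cost structure is the whole argument: the API line grows with volume while the on-premise line stays flat, so the decision reduces to which side of the crossover your projected usage sits on.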

There's a third cost nobody puts in the spreadsheet: the cost of a data incident. A breach of client data processed through an external API can cost millions in fines and reputational damage. On-premise drastically reduces that risk.


Cost comparison: External API vs. On-Premise

External API (GPT-4o): ~USD 1,400/month — variable cost per token, variable latency, dependency on provider pricing
On-premise (Llama 3.1 70B): ~USD 4,300/month — fixed cost regardless of volume, no token limits, full control
Breakeven: 6-8M tokens/day — beyond this volume, on-premise is more economical
Hidden cost: a data incident with an external API can mean millions in fines plus reputational damage

Industry use cases: where on-premise is essential

Fintech and banking: banks and fintechs in the region are already using LLMs for credit risk analysis, real-time fraud detection, and automated regulatory reporting. A mid-sized bank in Argentina that implemented Llama 3 on-premise for credit application analysis reduced evaluation time from 48 hours to 15 minutes, processing data from BCRA, Veraz, and internal documentation without anything leaving its network. The regulator approved it precisely because data never left the perimeter.

Healthcare: hospitals and health insurers process medical records, lab results, and medical images containing extremely sensitive data. A clinic network in Uruguay implemented Mistral to generate medical record summaries and medication interaction alerts. Everything runs on a dedicated cluster within their datacenter, complying with local Health Data Protection laws.

Legal: law firms and corporate legal departments handle contracts, litigation, and confidential documentation. A large law firm in Buenos Aires uses Llama 3 to review contracts and detect problematic clauses. They process over 500 contracts per month without a single byte leaving their infrastructure.

Energy and mining: companies with operations in remote locations where connectivity is intermittent. On-premise guarantees that models keep running even if the internet link goes down.

The technical stack: what you actually need

Let's be concrete about the stack. For a production deployment of Llama 3.1 70B you need at minimum 2x NVIDIA A100 80GB or equivalent (H100s are better but more expensive and harder to get in the region). For the 8B model, a single A10G or even an RTX 4090 is enough.

At the inference layer, we use vLLM as the inference server — it's the de facto standard for serving LLMs in production. It supports continuous batching, PagedAttention for efficient memory usage, and is compatible with the OpenAI API, which makes migration easier. As an alternative, HuggingFace's TGI is solid as well.
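The OpenAI compatibility mentioned above is what makes migration cheap: existing client code keeps working once you point it at the local server. A minimal sketch, assuming a hypothetical vLLM deployment at `localhost:8000` serving a Llama 3.1 model (both the URL and the model name are placeholders for your own setup):

```python
import json
import urllib.request

# Placeholders for a hypothetical local vLLM deployment -- adjust to yours.
VLLM_BASE_URL = "http://localhost:8000/v1"
MODEL = "meta-llama/Llama-3.1-70B-Instruct"

def build_chat_request(prompt: str, max_tokens: int = 512) -> dict:
    """OpenAI-style /chat/completions payload, which vLLM accepts as-is."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.2,
    }

def chat(prompt: str) -> str:
    """POST the payload to the local server and return the reply text."""
    req = urllib.request.Request(
        f"{VLLM_BASE_URL}/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Any OpenAI SDK works the same way by overriding its base URL, so code written against the external API needs essentially no changes beyond configuration.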

For orchestration, LangChain or LlamaIndex if you need RAG (Retrieval-Augmented Generation), which is the most common enterprise use case. The vector store can be Qdrant, Weaviate, or pgvector if you already use PostgreSQL.
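The core of RAG is the retrieval step: embed the query, find the nearest stored chunks, and prepend them to the prompt. A toy in-memory sketch of that step — in production the vectors would live in Qdrant, Weaviate, or pgvector, and the embeddings would come from a real model rather than the hand-written stubs below:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec: list[float], store: list[dict], k: int = 2) -> list[str]:
    """Return the k documents whose embeddings are closest to the query."""
    ranked = sorted(store, key=lambda d: cosine(query_vec, d["vec"]), reverse=True)
    return [d["text"] for d in ranked[:k]]

# Illustrative document chunks with made-up 3-dimensional "embeddings".
store = [
    {"text": "Clause 4: early termination penalties", "vec": [0.9, 0.1, 0.0]},
    {"text": "Annex A: holiday schedule",             "vec": [0.0, 0.2, 0.9]},
    {"text": "Clause 7: liability caps",              "vec": [0.8, 0.3, 0.1]},
]

# The retrieved chunks are then injected into the prompt sent to the LLM.
print(retrieve([1.0, 0.2, 0.0], store))
```

A dedicated vector store does exactly this ranking, just with approximate-nearest-neighbor indexes so it stays fast at millions of chunks.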

Monitoring with Prometheus + Grafana for inference metrics (latency, throughput, GPU utilization, queue depth). LangSmith or Langfuse for LLM chain observability — traces, quality evaluation, hallucination detection.

All of this runs on Kubernetes (EKS, GKE, or k3s on-premise) with Helm charts that we maintain and version. The internal team receives complete documentation and training to operate the cluster.

Technical stack for on-premise LLMs

Hardware: 2x NVIDIA A100 80GB (or H100) — dedicated GPUs for inference
Inference: vLLM — server with continuous batching, PagedAttention, OpenAI-compatible API
Orchestration + RAG: LangChain / LlamaIndex + vector store (Qdrant, Weaviate, or pgvector)
Observability: Prometheus + Grafana (GPU metrics) + LangSmith/Langfuse (LLM traces)
Platform: Kubernetes (EKS, GKE, or k3s) with versioned Helm charts

When on-premise does NOT make sense

I'll be direct: for many companies, on-premise is a bad idea. And part of our job is to tell you that when it applies.

If your company has fewer than 50 people and isn't in a regulated sector, external APIs are almost always the best option. The infrastructure cost, the maintenance overhead, and the iteration speed you lose don't justify it. Use GPT-4o or Claude through their APIs, implement basic DLP (Data Loss Prevention) controls, and you're set.

If your use case is experimental — you're testing whether AI can improve a process but don't yet have real volume — start with APIs. Validate the use case, measure the ROI, and when you're certain it works and the volume justifies it, migrate to on-premise.

If you don't have an infrastructure team (even one person) that can monitor the deployment, don't go on-premise without a support contract. Models need updates, GPUs need monitoring, and pipelines need maintenance.

It also doesn't make sense if your use case constantly requires the latest frontier model. If you always need the newest version of GPT or Claude as soon as it's released, on-premise will always leave you one step behind. But let's be honest: most enterprise use cases don't need frontier models.

The hybrid path: the best of both worlds

The reality is that most of our clients end up with a hybrid architecture. It's not all on-premise or all API — it's an intelligent combination based on data type and use case.

The pattern we implement most often: sensitive data (customer information, financial data, medical records) is processed exclusively with the on-premise model. Non-sensitive data (marketing content, public trend analysis, generic internal documentation generation) goes to external APIs where latency is lower and models are more powerful.

This requires a smart router that classifies requests by sensitivity and directs them to the appropriate model. It sounds complex, but with a good gateway architecture it can be solved in a week of implementation.
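The skeleton of such a router fits in a few lines. This is a minimal sketch with illustrative placeholder patterns; a production gateway would combine DLP rules, named-entity detection, and per-source policies rather than a short regex list:

```python
import re

# Illustrative sensitivity rules -- placeholders, not a real DLP policy.
SENSITIVE_PATTERNS = [
    re.compile(r"\b\d{2}-\d{8}-\d\b"),  # CUIT-style tax identifiers
    re.compile(r"\b(medical record|diagnosis|iban|account number)\b", re.IGNORECASE),
]

def route(request_text: str) -> str:
    """Decide which backend serves this request based on content sensitivity."""
    if any(p.search(request_text) for p in SENSITIVE_PATTERNS):
        return "on_premise"    # e.g. local Llama 3.1 behind vLLM
    return "external_api"      # e.g. GPT-4o or Claude for non-regulated content

print(route("Summarize the medical record for patient 1042"))
print(route("Draft three taglines for the new campaign"))
```

Because both backends can expose the same OpenAI-style interface, the router's decision is invisible to the calling application: it just gets a response.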

The benefit is clear: you comply with regulations where it matters, you leverage the power of closed models where you can, and you optimize costs. One of our clients in the insurance sector reduced their total AI spending by 40% with this approach while improving their compliance posture.


Hybrid architecture: intelligent request routing

Incoming request: the user or system generates a query involving data
Sensitivity classifier: a gateway analyzes the content and determines whether it contains regulated data
Sensitive route → on-premise LLM: financial, clinical, or personal data is processed locally (Llama 3.1)
Non-sensitive route → external API: marketing content and public analyses go to GPT-4o or Claude
Unified response: the result is delivered to the user regardless of which model produced it

How to get started: the process we follow at Orionis

If you're evaluating going on-premise, this is the process we follow with every client. It's not a sales pitch — it's the methodology we actually use.

Weeks 1-2: Diagnosis. We audit your current data flows, identify which information is regulated, map existing and potential AI use cases, and evaluate your infrastructure. We deliver a feasibility document with clear recommendations.

Weeks 3-4: Proof of Concept. We set up a deployment in a staging environment with anonymized data. We test the selected model with your real use cases and measure performance, latency, and response quality compared to whatever API you're currently using.

Weeks 5-8: Production deployment. We configure the full stack — inference, RAG if applicable, monitoring, alerts, backups, and security policies. We integrate with your existing systems via an OpenAI-compatible API.

Weeks 9-12: Handoff and stabilization. We train your team, document everything, and provide active support while the system stabilizes in production.

After deployment, we offer a continuous support contract that includes model updates, proactive monitoring, and consulting for new use cases. But what's important is that if you decide to part ways with us, you have everything you need to operate autonomously. The code, the configuration, and the knowledge are yours.

On-premise implementation process

Phase 1: Diagnosis (weeks 1-2) — data flow audit, infrastructure assessment, feasibility document
Phase 2: Proof of Concept (weeks 3-4) — staging deployment, testing with anonymized data, benchmarks vs. the current API
Phase 3: Production (weeks 5-8) — full stack: inference, RAG, monitoring, alerts, system integration
Phase 4: Handoff (weeks 9-12) — team training, documentation, active support, operational autonomy

The question you should be asking

It's not 'should I go on-premise?' The right question is: 'what happens to my data when I send it to an external API, and can I live with that answer?'

If the answer is 'I'm not sure,' you need to investigate. If the answer is 'I can't afford that risk,' you need a plan. And if the answer is 'my regulator is going to ask me about it,' you need to act now.

Open-source models have reached a level of maturity that makes on-premise viable for mid-sized companies. The hardware is accessible. The technical stack is mature. And regulation is only going to get stricter. Companies that move now will have a real competitive advantage — not just in compliance, but in the ability to customize and control their AI models.

If you'd like to evaluate your specific case, write to us at [email protected]. We offer a no-cost initial assessment where we'll honestly tell you whether on-premise makes sense for your company or if you're better off with external APIs. Our commitment is to give you the best recommendation, even if that means we don't work together.
