Orionis launches on-premise LLM service for regulated companies
Orionis
2025-03-01
Companies with sensitive data in Argentina, Uruguay, and the rest of LATAM can now run language models on their own infrastructure, with full support, without relying on external APIs, and in compliance with local data regulations.
The problem we solve
Since we started working with companies in the financial, healthcare, and legal sectors, the same question has come up in every conversation: 'How can we use LLMs without sending our clients' data to third-party servers?' Until now the answer was complicated: it required assembling an internal ML team, researching models, configuring infrastructure, and hoping everything would work in production.
Today we launch a service that eliminates that complexity. We deploy open-source language models directly on client infrastructure — whether it's a private cloud, a local datacenter, or your own hardware — with a complete production stack ready to use from day one.
This isn't a SaaS product with a nice wrapper. It's a real deployment, on your infrastructure, with you in full control of the data, the models, and access. When we finish the project, all the code, configuration, and documentation are yours.
What exactly the service includes
The service has four main components that cover everything needed to go from zero to production.
Component 1 — Diagnosis and design (2 weeks): we audit your data flows, identify which information is regulated, assess your existing infrastructure, and design the target architecture. We deliver an architecture document with diagrams, hardware specifications, operational cost estimates, and a week-by-week implementation plan.
Component 2 — Deployment and configuration (3-4 weeks): we install and configure the full stack. This includes the inference server (vLLM or TGI), the selected model (Llama 3.1, Mistral Large, Qwen 2.5, or another based on the use case), the RAG pipeline if applicable (with Qdrant or pgvector), the OpenAI-compatible API gateway, and the monitoring stack (Prometheus, Grafana, Langfuse). A minimal sketch of an application talking to this stack appears right after the component list.
Component 3 — Integration and fine-tuning (2-3 weeks): we connect the deployment with your existing systems via API, configure prompts and workflows for your specific use cases, and if necessary, fine-tune the model with your data (always within your infrastructure; a fine-tuning sketch appears after the component summary below).
Component 4 — Handoff and support (2 weeks + ongoing contract): we train your IT team to operate and maintain the system, deliver complete operations documentation, and begin the post-deployment support period where we proactively monitor and resolve incidents.
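To make Component 2 concrete before the summary: once the stack is up, internal applications talk to it like any OpenAI endpoint, optionally grounded by the vector store. Here's a minimal Python sketch, assuming a Qdrant deployment and the OpenAI-compatible gateway; host names, the collection name, and the model ID are illustrative placeholders, and embedding generation is omitted for brevity:

    # Minimal RAG round trip against the deployed stack.
    # Illustrative sketch: URLs, collection, and model are placeholders.
    from qdrant_client import QdrantClient
    from openai import OpenAI

    qdrant = QdrantClient(url="http://qdrant.internal:6333")      # vector store
    llm = OpenAI(base_url="http://llm-gateway.internal:8000/v1",  # on-prem gateway
                 api_key="unused-on-prem")                        # no external API involved

    def answer(question: str, query_vector: list[float]) -> str:
        # 1. Retrieve the most relevant internal documents.
        hits = qdrant.search(collection_name="internal_docs",
                             query_vector=query_vector, limit=4)
        context = "\n\n".join(hit.payload["text"] for hit in hits)

        # 2. Ask the locally served model, grounded in that context.
        resp = llm.chat.completions.create(
            model="meta-llama/Llama-3.1-70B-Instruct",
            messages=[
                {"role": "system",
                 "content": f"Answer using only this context:\n{context}"},
                {"role": "user", "content": question},
            ],
        )
        return resp.choices[0].message.content

Because the gateway speaks the OpenAI API, existing SDKs and tooling work against the local endpoint without code changes.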
The 4 components of the on-premise LLM service:
1. Diagnosis and design (2 weeks): data audit, infrastructure assessment, target architecture, implementation plan.
2. Deployment and configuration (3-4 weeks): vLLM/TGI, selected model, RAG pipeline, API gateway, monitoring stack.
3. Integration and fine-tuning (2-3 weeks): connection with existing systems, prompt and workflow configuration, fine-tuning if applicable.
4. Handoff and support (2 weeks + ongoing): hands-on training, operational documentation, proactive monitoring, 4-hour SLA.
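On Component 3's fine-tuning: when it applies, a parameter-efficient approach such as LoRA keeps the base model frozen and trains only small adapter matrices, which keeps GPU requirements modest. A minimal sketch using Hugging Face's peft library follows; it illustrates one common technique, not necessarily the exact recipe for a given project, and the model name and hyperparameters are placeholders:

    # LoRA fine-tuning sketch via peft. Illustrative placeholders throughout;
    # all of this runs inside the client's own infrastructure.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, get_peft_model

    base = "meta-llama/Llama-3.1-8B-Instruct"   # smaller model for the example
    tokenizer = AutoTokenizer.from_pretrained(base)
    model = AutoModelForCausalLM.from_pretrained(base)

    lora = LoraConfig(
        r=16,                                   # adapter rank
        lora_alpha=32,                          # adapter scaling factor
        target_modules=["q_proj", "v_proj"],    # attention projections to adapt
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()          # only the adapters will train
    # ...training loop (e.g. transformers.Trainer) over the client's data...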
Target industries and use cases
We designed the service with four main industries in mind, but the architecture is agnostic — it applies to any company with data sovereignty requirements.
Fintech and banking: automated credit application analysis, fraud detection with internal data, regulatory report generation (BCRA, BCU, CMF), internal assistants for compliance officers, and KYC/AML documentation processing without exposing client data.
Healthcare: automated medical record summaries, medication interaction alerts, diagnostic coding assistants (ICD-10), lab result analysis, and structured medical report generation. All in compliance with local health data regulations.
Legal: automated contract review and problematic clause detection, semantic search across internal case law, legal document draft generation, and litigation risk analysis. Law firms handle extremely confidential information that cannot leave their perimeter.
Insurance: automated claims processing, policy analysis, claims fraud detection, and report generation for reinsurers. The volume of documentation in insurance makes AI's operational impact enormous.
Pricing and engagement model
We are transparent about costs because we believe pricing surprises destroy trust.
The implementation service is priced as a fixed fee that depends on deployment complexity. To give you an indicative range: a standard deployment (one model, one primary use case, private cloud infrastructure) starts at USD 25,000-35,000. A complex deployment (multiple models, fine-tuning, integration with several legacy systems, pure on-premise infrastructure) can reach USD 60,000-80,000.
This includes all the diagnosis, deployment, integration, fine-tuning if applicable, and knowledge transfer work. There are no hidden costs or surprises.
The continuous support contract (optional but recommended) has a monthly cost that includes: proactive 24/7 monitoring, model updates (we evaluate new releases and deploy them if they improve performance), technical support with a 4-hour SLA for critical incidents, and 8 monthly hours of consulting for new use cases or improvements. The support cost varies by deployment size, but as a reference it's in the range of USD 3,000-6,000/month.
Important: infrastructure costs (GPUs, storage, networking) are borne by the client. We advise you on selection and help you negotiate with providers, but the infrastructure is yours.
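On GPU selection, the arithmetic behind our advice is simple enough to sketch. A back-of-the-envelope VRAM estimate in Python (our illustration; real sizing also depends on quantization, context length, batch size, and the inference server's memory manager):

    # Rough GPU memory estimate for serving an LLM at FP16/BF16.
    import math

    def vram_estimate_gb(params_billions: float,
                         bytes_per_param: float = 2.0,  # FP16/BF16 weights
                         kv_overhead: float = 0.3) -> float:
        """Weights plus a rough allowance for KV cache and activations."""
        return params_billions * bytes_per_param * (1 + kv_overhead)

    for name, size_b in [("Llama 3.1 8B", 8), ("Llama 3.1 70B", 70)]:
        need = vram_estimate_gb(size_b)
        gpus = math.ceil(need / 80)             # 80 GB cards (e.g. A100/H100)
        print(f"{name}: ~{need:.0f} GB VRAM -> {gpus}x 80 GB GPUs")

Numbers like these are why a 70B-class deployment typically lands on multiple 80 GB GPUs, while an 8B model fits comfortably on one.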
Full pricing transparency: standard implementation from USD 25,000-35,000, complex up to USD 60,000-80,000. Ongoing support USD 3,000-6,000/month. No hidden costs or surprises.
The onboarding process step by step
Week 0 — Initial assessment: we meet via video call, understand your case, and honestly tell you whether the service makes sense for your company. If it doesn't, we'll tell you and recommend alternatives. This assessment carries no cost or commitment.
Weeks 1-2 — Diagnosis: our technical team audits your infrastructure, data flows, and regulatory requirements. Together we define the project scope, the model to use, and the target architecture. We sign the contract with locked-in scope, timeline, and pricing.
Weeks 3-6 — Implementation: we deploy the stack on your infrastructure. We run load tests, security tests, and integration tests. We run a pilot with real data (or anonymized data, depending on your preference) to validate response quality and performance.
Weeks 7-8 — Go-live and handoff: we go to production with intensive monitoring. We train your team with hands-on sessions (no PowerPoints — open terminals and real practice). We deliver runbooks for the most common scenarios: how to restart the service, how to update a model, how to add a new use case, what to do if a GPU fails. A sample smoke test of the kind those runbooks reference closes this section.
Weeks 9-12 — Stabilization: we continue actively monitoring, adjust configurations based on real production behavior, and resolve any incidents. By the end of this period, your team should be able to operate the system autonomously.
Important detail: throughout the entire process, we work in pairs with your team. We don't do anything alone in a closed room. Knowledge transfer starts on day one, not at the end.
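To give a flavor of those runbooks, here's a hypothetical post-restart smoke test. The gateway URL is a placeholder; vLLM's OpenAI-compatible server does expose /health and /v1/models, but adapt the routes to your actual deployment:

    # Post-restart smoke test of the kind a runbook might reference.
    # Hypothetical sketch: URL and model are deployment-specific.
    import time
    import requests

    GATEWAY = "http://llm-gateway.internal:8000"    # placeholder URL

    def smoke_test() -> None:
        # 1. Liveness check.
        requests.get(f"{GATEWAY}/health", timeout=5).raise_for_status()

        # 2. Confirm the expected model is loaded.
        models = requests.get(f"{GATEWAY}/v1/models", timeout=5).json()
        print("loaded models:", [m["id"] for m in models["data"]])

        # 3. One real completion, with latency measured.
        t0 = time.time()
        resp = requests.post(
            f"{GATEWAY}/v1/chat/completions",
            json={"model": models["data"][0]["id"],
                  "messages": [{"role": "user", "content": "ping"}],
                  "max_tokens": 8},
            timeout=30,
        )
        resp.raise_for_status()
        print(f"completion OK in {time.time() - t0:.2f}s")

    if __name__ == "__main__":
        smoke_test()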
Case study: lending fintech in Buenos Aires
To illustrate how this works in practice, we're sharing a recent case (with client authorization, anonymized data).
A Buenos Aires fintech with 130 employees processes over 2,000 credit applications per month. Each application requires analyzing documentation (pay stubs, bank statements, Veraz credit reports), cross-referencing data with external sources, and generating a risk report for the credit committee. The manual process took between 45 minutes and 2 hours per application.
The regulatory problem: BCRA requires that applicants' financial data not leave the bank or fintech's perimeter. Using GPT-4 via API to analyze pay stubs was legally unviable.
What we implemented: Llama 3.1 70B running in a dedicated VPC on AWS (São Paulo region, the closest with GPU availability). RAG pipeline with BCRA regulations and the fintech's internal policies as the knowledge base. Direct integration with their core system via API.
Results after 3 months: analysis time per application dropped from an average of 90 minutes to 12 minutes (an 87% reduction). The rate of incorrect approvals remained the same (the model is just as conservative as the analysts). The credit team went from processing 10 applications per person per day to 35. And most importantly: the regulator audited the system and approved it without objections.
If you have a similar case or want to evaluate whether the service applies to your company, write to us at [email protected]. The initial assessment is free of charge.
87% reduction in analysis time: from 90 minutes to 12 minutes per credit application. The team went from 10 to 35 applications/day per person, and the regulator approved the system without objections.