Artificial Intelligence · March 26, 2026 · 15 min read
LLMs in Enterprise: Why Foundation Models Need Domain Fine-Tuning
General-purpose foundation models are remarkable. They are also wrong in highly specific, predictable ways the moment you put them in front of regulated, jargon-heavy, document-dense enterprise workflows. This is what domain adaptation actually requires — and why retrieval alone is not enough.
Three years on from the release of GPT-4, and two from Claude 3, the enterprise picture of large language models has clarified in a way that the 2023 hype cycle made difficult to see. Foundation models are a genuinely useful general-purpose capability. They are not a substitute for domain knowledge, and the cost of pretending they are has turned out to be substantial, as anyone running a regulated workflow has now discovered. The question is no longer 'do LLMs work for enterprise?'. The question is which architecture pattern turns a general-purpose model into a system that an operator, treasurer, lawyer, or geoscientist can actually use.
The pattern that has converged across serious deployments — and which Hwodye Research Labs builds on for the LLM components of both Hwodye Energy and Ajovi/Wuladi — is a three-layer stack: a strong base model, a domain-adapted retrieval layer, and a fine-tuned reasoning layer specific to the workflow. Each layer does different work. Most failed enterprise LLM projects we see have skipped at least one. This post is the case for why all three matter, what each one adds, and how to think about the build-vs-buy tradeoff at each layer.
12+ production deployments · 94% retrieval recall @10 · <2s end-to-end latency · 30× hallucination reduction
Where general-purpose models fail in production
The failure modes of out-of-the-box foundation models in enterprise contexts are by now well documented. The single most consequential one is plausible-sounding wrongness on domain-specific factual matter. A well-instructed Claude or GPT will answer a question about Nigerian petrophysical regulation, or about Wuladi's dispute resolution framework, or about a specific Niger Delta reservoir — and it will be confidently wrong in ways a domain expert would catch but a general user would not. The wrongness is not random; it is shaped by the model's training distribution, which over-represents American, European, and global-North English-language sources and under-represents almost everything else.
The second failure mode is jargon collapse. Specialist domains use terms whose meaning is precise and shared within the community but obscured or repurposed in general-purpose corpora. Net pay means one thing to a petrophysicist (the thickness of reservoir rock that can produce hydrocarbons) and another to a payroll accountant (take-home pay after deductions). Escrow in the Wuladi context is technically distinct from escrow in a US real estate transaction. The model tends to blend the two senses. The output is text that sounds reasonable, uses the right vocabulary, and is in fact wrong about the operative concept. Retrieval helps with this. Retrieval alone does not solve it.
Why retrieval alone is not enough
The dominant enterprise architecture pattern of 2024 was retrieval-augmented generation (RAG): supply the model with relevant context at inference time, ask it to answer grounded in that context, and treat hallucination as bounded by the retrieved documents. RAG is a real advance and Hwodye uses it extensively, but it has structural limits that have become clear with two years of production data. The limits are well documented in the research literature and worth being explicit about. First, retrieval is only as good as the embedding and the corpus: a poorly curated corpus produces answers that are confidently grounded and still wrong. Second, the model still reasons over the retrieved context with whatever general knowledge it has, which means jargon collapse still occurs. Third, retrieval-only systems struggle on questions that require synthesis across many documents, even when each individual document is correctly retrieved.
The architectural response that has emerged — and that we believe will define the next phase of enterprise LLM deployments — is hybrid: a retrieval layer for factual grounding, plus a fine-tuned reasoning layer that has internalised the domain's vocabulary and inference patterns. The fine-tuning component does not replace retrieval. It complements it by ensuring the model's reasoning over retrieved context is itself domain-correct. The LoRA and QLoRA techniques make this fine-tuning cheap enough to be operationally feasible — typically tens of dollars of compute per fine-tune for a domain-adapted variant of a 7B-parameter open model.
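To make the cost claim concrete, here is a rough back-of-the-envelope sketch of why a LoRA adapter is cheap to train: it adds two small low-rank matrices beside each adapted weight matrix and leaves the base weights frozen. The dimensions below (hidden size 4096, 32 layers, rank 8, two adapted projections per layer) are illustrative assumptions for a 7B-class decoder, not a description of any specific Hwodye adapter.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter.

    LoRA learns a low-rank update B @ A, where A is (rank x d_in) and
    B is (d_out x rank), so it adds rank * (d_in + d_out) parameters.
    """
    return rank * (d_in + d_out)


# Assumed, illustrative dimensions for a 7B-class decoder.
HIDDEN = 4096                 # width of each attention projection
LAYERS = 32                   # number of transformer blocks
ADAPTED_PER_LAYER = 2         # e.g. the query and value projections
RANK = 8                      # LoRA rank (hypothetical choice)
BASE_PARAMS = 7_000_000_000   # nominal size of the frozen base model

per_matrix = lora_trainable_params(HIDDEN, HIDDEN, RANK)
total = per_matrix * ADAPTED_PER_LAYER * LAYERS
fraction = total / BASE_PARAMS

print(f"{total:,} trainable parameters ({fraction:.4%} of the base model)")
```

Under these assumptions the adapter trains roughly four million parameters, well under a tenth of a percent of the base model, which is the main reason the compute bill stays small. QLoRA lowers memory further by quantising the frozen base weights, which this sketch does not model.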
Retrieval gets the right facts in front of the model. Fine-tuning makes the model reason about those facts in the right vocabulary, with the right inference rules, in the right professional voice. Neither alone is sufficient for serious enterprise work.
The three-layer stack that actually works
The architecture Hwodye uses across its production LLM workloads has three layers. The base layer is a frontier model — currently a mix of Claude Sonnet and an open-weights model from the Llama 3 or Mistral families, chosen per workload based on cost, latency, and capability requirements. The base layer provides general reasoning, language fluency, and the world model that everything else builds on. We do not modify the base layer.
The retrieval layer uses a hybrid vector + keyword search over a domain corpus — petrophysical literature and operator reports for Hwodye Energy, regulatory texts and Ajovi/Wuladi product documentation for the finance ecosystem. The corpus is versioned, the embeddings are regenerated on a schedule, and the retrieval results are passed to the base model with explicit source attribution. The retrieval implementation uses LangChain primitives with a custom relevance ranker on top, and the corpus is stored in a Qdrant vector database for the dense retrieval and a Tantivy-backed index for the lexical retrieval.
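One common way to merge a dense ranking and a lexical ranking is reciprocal rank fusion (RRF). The sketch below is a generic illustration of that technique, not Hwodye's custom relevance ranker, and it deliberately avoids any LangChain, Qdrant, or Tantivy calls. The document IDs and the constant k=60 are assumptions for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) for every list it appears in
    (rank starts at 1). Documents ranked well by both retrievers rise
    to the top; ties are broken alphabetically for determinism.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from the dense (vector) and lexical (keyword) retrievers.
dense_hits = ["doc_b", "doc_a", "doc_c"]
lexical_hits = ["doc_a", "doc_d", "doc_b"]

fused = reciprocal_rank_fusion([dense_hits, lexical_hits])
print(fused)
```

RRF uses only rank positions, so it sidesteps the problem that vector similarity scores and keyword scores sit on different scales. A learned or hand-tuned ranker can replace it once there is enough labelled relevance data.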
The fine-tuning layer is where the system stops looking generic. For each high-value workflow, we maintain a LoRA adapter trained on domain-specific examples, typically a few thousand carefully curated input-output pairs that teach the model the domain's professional voice, common inference patterns, and edge cases. The fine-tuning data is the most important and most expensive part of the stack: it is hand-curated by domain experts who understand exactly which behavioural difference each example is teaching. A typical domain adapter takes 2–8 weeks of dedicated expert time to assemble. There is no shortcut.
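As a minimal sketch of what "each example teaches a known behavioural difference" could look like in practice, the record below carries an explicit field for the behaviour being taught and refuses to export incomplete examples. The field names (`teaches`, `reviewer`) and the JSONL layout are hypothetical, chosen for illustration rather than taken from Hwodye's pipeline.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CuratedExample:
    prompt: str     # input shown to the model
    response: str   # expert-written target output
    teaches: str    # the behavioural difference this example is meant to instil
    reviewer: str   # domain expert who signed the example off


def validate(example: CuratedExample) -> list[str]:
    """Return a list of problems; an empty list means the record is usable."""
    return [
        f"{name} is empty"
        for name, value in asdict(example).items()
        if not value.strip()
    ]


def to_jsonl(examples: list[CuratedExample]) -> str:
    """Serialise examples to JSON Lines, rejecting any incomplete record."""
    for index, example in enumerate(examples):
        problems = validate(example)
        if problems:
            raise ValueError(f"example {index}: {', '.join(problems)}")
    return "\n".join(json.dumps(asdict(e), ensure_ascii=False) for e in examples)


example = CuratedExample(
    prompt="Define net pay for this well log summary.",
    response=(
        "Net pay is the cumulative thickness of reservoir rock that meets "
        "the porosity, shale-volume and water-saturation cutoffs."
    ),
    teaches="petrophysical sense of 'net pay', not the payroll sense",
    reviewer="expert-01",
)
print(to_jsonl([example]))
```

Keeping the `teaches` field mandatory makes it possible to audit coverage later, for instance by counting how many examples target each failure mode.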
[Figure: Hallucination rate (% of factual claims) on the Niger Delta petrophysics QA benchmark]
Evaluation — and why most teams do this badly
The hardest part of an enterprise LLM deployment is not building it. It is evaluating it. Standard benchmarks — MMLU, HELM, BIG-bench — measure capabilities that have almost no correlation with whether a domain-specific system works in production. The model that scores highest on MMLU may also be the model that confidently misinterprets a regulatory document on which your downstream workflow depends. You cannot use general benchmarks to make production decisions about a domain-adapted system.
What works instead is domain-specific evaluation: a benchmark assembled, curated, and maintained by your own domain experts, designed to test exactly the failure modes you care about. The HELM-medicine and HELM-law variants are the closest published examples of what this should look like for specialist domains. For Hwodye Energy, our internal petrophysics benchmark contains roughly 2,400 question-answer pairs across lithology, fluid typing, regulatory submission, and CCUS site screening. The benchmark is regenerated quarterly. Every model deployment passes through it before any production release. The benchmark is, in our view, more valuable than any specific model we have ever evaluated against it.
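A release gate over such a benchmark can be sketched in a few lines: score each category separately and block the release if any category falls below a floor, so that a strong average cannot hide a weak area. The 0.9 threshold, the toy results, and the function names are assumptions for illustration; the category names follow the four areas listed above.

```python
from collections import defaultdict


def accuracy_by_category(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Compute accuracy per category from (category, is_correct) pairs."""
    totals: dict[str, int] = defaultdict(int)
    correct: dict[str, int] = defaultdict(int)
    for category, is_correct in results:
        totals[category] += 1
        correct[category] += int(is_correct)
    return {category: correct[category] / totals[category] for category in totals}


def release_gate(
    results: list[tuple[str, bool]], min_accuracy: float = 0.9
) -> tuple[bool, list[str]]:
    """Pass only if every category meets the floor; also list the failures."""
    scores = accuracy_by_category(results)
    failing = sorted(c for c, score in scores.items() if score < min_accuracy)
    return (not failing, failing)


# Toy graded results; a real run would cover the full benchmark.
results = (
    [("lithology", True)] * 9 + [("lithology", False)]
    + [("fluid_typing", True)] * 3 + [("fluid_typing", False)]
    + [("regulatory_submission", True)] * 5
    + [("ccus_site_screening", True)] * 4
)

passed, failing = release_gate(results)
print(passed, failing)
```

Gating per category rather than on the overall score reflects the point made above: the failure modes that matter are specific, and an aggregate number tends to average them away.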
What the next 18 months should bring — and what they will not
Three predictions, in roughly increasing order of confidence. First, agentic systems (LLMs that take multi-step actions, call tools, and operate over time) will move from research demos into production for narrow workflows where the action space is well-defined. Anthropic's computer use capability, released in public beta in late 2024, and Microsoft's Copilot agent platform signal where this is heading. Most enterprise applications, including the ones Hwodye builds, are not yet ready for full autonomy and will retain a human-in-the-loop posture for the foreseeable future.
Second, open-weights models will close more of the capability gap with frontier closed models. The Llama 3 series, Mistral Large, DeepSeek-V3, and the Qwen series from Alibaba have made remarkable strides in the last 12 months. For most enterprise workloads, the open-weights option is now genuinely competitive, and the data-residency arguments for self-hosted deployment are increasingly decisive. Hwodye's default position for any new deployment is to evaluate at least one open-weights option as part of the architecture decision.
Third — and this is the prediction we hold most strongly — domain adaptation will become the defining commercial moat in enterprise AI. The base model layer is increasingly a commodity, with capabilities converging fast and pricing collapsing. The retrieval layer is engineering work that competent teams can replicate. The fine-tuning data — the curated examples, the domain expert hours, the institutional knowledge encoded in the adapter — is the part that does not commoditise. The companies that win the next phase of enterprise AI are the companies that invest now in building domain-adapted systems and the proprietary evaluation infrastructure to keep them honest. If that maps to a problem you are working on, Hwodye Research Labs is open to partnership engagements at exactly this layer.
