We were talking to a client this week who basically said “We want to build it ourselves, how do we do it?”.
Now, normally we’d politely tell them to jog on, but this time I thought I’d share the ‘recipe’ with them so they could fully understand the implications of trying to build their own private and secure Retrieval-Augmented Generation (RAG) system.
And the real ‘cost’ isn’t in the AI. That’s the cheap part.
Hosting
When we talk about ‘private and secure’, we mean creating your own isolated environment on Azure or AWS. We’ve never tried it in Google, but I’m sure it can be done. We’re then calling a hosted Large Language Model (LLM) through a no-training, data-residency-guaranteed endpoint (Azure OpenAI, AWS Bedrock, the Anthropic API on enterprise terms). Azure OpenAI is used by most of our clients, but we’ve just become a Claude partner, so we’ll be testing the API as soon as it’s available through Azure UK South. A point to note is that doing it properly in AWS is currently 10x the price of doing it in Azure. And that’s after their engineers have ratified our solution and told us we’re doing it correctly, so factor that into your calculations if yours is an AWS shop.
Building
The good news is, the build is basically the same regardless of hosting choice:
- Ingestion and parsing. Connect to the source systems (SharePoint, file shares, intranets, case management) and get clean text out of messy formats. Scanned PDFs need OCR, tables need structure-aware extraction, and a badly parsed document can mess up every answer that touches it. You need to budget real time here. It’s never the bit people expect, but it makes or breaks the quality of the outputs and therefore user confidence.
- Chunking and embedding. Split documents into retrievable pieces with sensible boundaries and overlap, attach metadata (source, date, access level), then convert each chunk to a vector via an embedding model. Note the embedding model choice now, because changing it later means re-embedding the entire index. For large indexes, that can take some time and effort.
- Vector store. Somewhere to hold and search those vectors. Our tool of choice is Azure AI Search.
- Retrieval and generation. Query comes in, gets embedded, you search (ideally hybrid, keyword plus semantic, with a reranking pass), then feed the top chunks plus the question to the model with a prompt that forces it to ground answers in the retrieved context and cite sources. Grounding and citation are super important for obvious reasons and guard against hallucinations and the associated nonsense.
- The security layer. This is probably the most important section. It ensures people don’t see things they’re not supposed to, and it guards against poisoned documents and injected instructions. Your security people will understand all of this, but the headlines are:
- Permission-aware retrieval. Bizarrely, this is the single most-missed requirement by people doing it themselves. A user must only ever retrieve from documents they’re already entitled to see. That means propagating Access Control Lists (ACLs) from the source system into the index and filtering at query time, not just bolting auth onto the UI. Get this wrong and you’ve built a very efficient data-leak engine.
- Authentication and Role-Based Access Control (RBAC) via Single Sign-On (SSO), ideally tied to existing identity.
- Encryption in transit and at rest, network isolation (private endpoints, no public ingress), and proper key management.
- Audit logging of who asked what and what was returned, both for security and for the inevitable “prove it” conversation with an auditor (or worse, law enforcement).
- Input/output guarding – prompt-injection defences (a poisoned document can carry instructions), Personally Identifiable Information (PII) redaction where needed, and output filtering.
- A Data Protection Impact Assessment (DPIA) and data-flow map done up front, not retrofitted.
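To make the chunking and permission-aware retrieval steps above a bit more concrete, here is a minimal sketch in plain Python. It is not our production code and it isn’t tied to any particular vector store. The names (Chunk, chunk_text, visible_chunks, allowed_groups) and the character-based window sizes are assumptions made purely for illustration. The idea is simply that every chunk carries the ACL of its source document, and that the filter runs at retrieval time rather than in the UI.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    text: str
    # ACL copied from the source system at ingestion time (hypothetical field name)
    allowed_groups: frozenset


def chunk_text(doc_id, text, allowed_groups, size=200, overlap=50):
    """Split text into fixed-size character windows that overlap by `overlap`."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(Chunk(doc_id, text[start:start + size], frozenset(allowed_groups)))
        if start + size >= len(text):
            break  # this window already reached the end of the document
    return chunks


def visible_chunks(chunks, user_groups):
    """Keep only chunks whose ACL shares at least one group with the user."""
    groups = set(user_groups)
    return [c for c in chunks if c.allowed_groups & groups]


# Two documents with different access levels (made-up data)
index = (
    chunk_text("hr-policy", "x" * 500, {"hr"})
    + chunk_text("staff-handbook", "y" * 120, {"hr", "all-staff"})
)

# A user who is only in "all-staff" never sees the HR-only chunks
results = visible_chunks(index, {"all-staff"})
```

In a real deployment you would normally express the same check as a filter inside the search query itself, so restricted chunks never leave the index, and you would split on sentence or heading boundaries rather than raw character counts.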
Ongoing maintenance
The piece most people don’t think about during setup.
- Up-to-date content. Source documents change, and crucially they get deleted. Stale or withdrawn content in your index is a genuine liability. You need incremental re-sync that adds, updates, and removes.
- Quality and evaluation. Maintain an evaluation set of representative questions with known-good answers and run it as a regression test whenever anything changes. Capture user feedback and actually review the failures. Retrieval quality drifts as the library grows.
- Model and embedding changes. A new LLM version means re-testing your prompts. This is super important because model behaviour shifts with every new release. This one catches out a lot of people who assume that output will be the same. It never is.
- Permission drift. People join, leave, change roles, documents get reclassified. Access controls have to stay in sync, or you have a problem.
- Security upkeep. Patch dependencies, rotate secrets, review access logs, and regularly pen-test. Prompt-injection techniques in particular keep evolving.
- Cost and capacity. Monitor token spend and infrastructure as usage climbs. RAG costs creep quietly and are likely to escalate as the labs try to monetise their investments.
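The “adds, updates, and removes” point on up-to-date content is the one people tend to get wrong, so here is a rough sketch of how an incremental re-sync could decide what to do. It assumes you store a content hash per document at indexing time. The function names (content_hash, plan_sync) and the dictionary shapes are hypothetical, and a real connector would more likely use change feeds or modified dates from the source system.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_sync(source_docs, indexed_hashes):
    """Compare the source system with the index and decide what to change.

    source_docs:    dict of doc_id -> current text in the source system
    indexed_hashes: dict of doc_id -> hash stored when the doc was last indexed
    """
    to_add, to_update = [], []
    for doc_id, text in source_docs.items():
        if doc_id not in indexed_hashes:
            to_add.append(doc_id)       # new document
        elif indexed_hashes[doc_id] != content_hash(text):
            to_update.append(doc_id)    # content changed since last index
    # Anything indexed but no longer in the source has been deleted or withdrawn
    to_remove = [doc_id for doc_id in indexed_hashes if doc_id not in source_docs]
    return {
        "add": sorted(to_add),
        "update": sorted(to_update),
        "remove": sorted(to_remove),
    }


# Made-up example: "a" changed, "b" is unchanged, "c" was withdrawn, "d" is new
indexed = {
    "a": content_hash("old policy"),
    "b": content_hash("unchanged"),
    "c": content_hash("withdrawn"),
}
source = {"a": "new policy", "b": "unchanged", "d": "brand new"}

plan = plan_sync(source, indexed)
# plan == {"add": ["d"], "update": ["a"], "remove": ["c"]}
```

The “remove” list is the part to watch. Updated documents also need their old chunks deleted before the new ones go in, otherwise the withdrawn wording stays retrievable.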
In summary
So, there you go. You now know more than 90% of people about how to build your own private and secure RAG AI solution.
In short, building it is a few weeks of engineering, but running it well is a big commitment. Or you could get the experts to do it for you. Just saying…