In conversations about AI’s potential, one point keeps coming up: it’s only as good as the data you give it. Then, almost immediately, everyone starts talking about Retrieval Augmented Generation (RAG), because we’re in software and boy do we love fancy-sounding acronyms for simple things. (We’re just pasting in some useful data to help the AI 🤷.) From there the conversation quickly moves on to vector embeddings and similarity measures: “Vectors! Cosine similarity!”
But does it have to?
We’ve been using databases and text search for a really long time (decades, in fact!) and we’re pretty good with them. Often, "Just query your data the old-fashioned way" is a better choice. And it’s not that vector embeddings aren’t useful (they really are), but most of the time it’s easier than that; the data you want just isn’t that hard to find.
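Here is a minimal sketch of what "the old-fashioned way" can look like, using Python's built-in sqlite3 module. The `orders` table, its columns, and the `build_orders_prompt` helper are all made up for illustration; the point is only that a plain SQL query can produce the context you paste into the prompt.

```python
import sqlite3


def build_orders_prompt(conn: sqlite3.Connection, customer_id: int, question: str) -> str:
    """Fetch a customer's recent orders with plain SQL and paste them into the prompt."""
    rows = conn.execute(
        "SELECT order_date, item, status FROM orders "
        "WHERE customer_id = ? ORDER BY order_date DESC LIMIT 5",
        (customer_id,),
    ).fetchall()
    context = "\n".join(f"- {day}: {item} ({status})" for day, item, status in rows)
    return f"Recent orders for this customer:\n{context}\n\nQuestion: {question}"


# Hypothetical demo data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (customer_id INTEGER, order_date TEXT, item TEXT, status TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "2024-05-01", "keyboard", "delivered"),
        (1, "2024-06-12", "monitor", "shipped"),
        (2, "2024-06-15", "mouse", "processing"),
    ],
)

prompt = build_orders_prompt(conn, 1, "Where is my monitor?")
print(prompt)
```

No embeddings, no similarity search: the customer ID already tells you which rows matter.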
What about the other case, you ask? What if it’s not easy? Is that when you use a vector DB? Yes and no. You are now staring hard at the problem of relevance.
Imagine you are discussing "software patterns for applications that use LLMs with healthcare data." Documents about "AI for diagnosing diseases" and "AI for video game development" are both related, and possibly close to "building software applications with LLMs for healthcare" in vector space, but they miss the crucial connections: one says nothing about software patterns, the other nothing about healthcare.
OK, this is a contrived example (also suggested by an AI), but the point stands. How do you get the right contextual data to help the AI? A good rule of thumb: if it would help a human do the task well, it will help the AI do the task well. So, as others have said before, "You should be building a search engine, not a vector DB." And as they’ve also said, that’s good news and bad news.
The good news: A lot of the time, the data you need is obvious and straightforward.
The bad news: When it’s not, you’re effectively building a search engine or at least configuring one and feeding your data into it.
More good news though: We’ve been doing search engines for a long time and there’s a lot of technology and expertise you can pull off the shelf. But you’ve signed up for some hard questions.
You will need to continuously monitor and evaluate the relevance of your results in the same way you need to evaluate the quality of responses from your model. Models themselves can help with some of this. Specifically, they are pretty good at scoring and ranking results in context. Folks will often use techniques like Reciprocal Rank Fusion to merge and score results from different sources (like keyword search and vector similarity), and this is helpful, but it is just a blending of the different search mechanisms’ native scoring.
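For reference, here is a small sketch of Reciprocal Rank Fusion. Each document scores 1 / (k + rank) in every list it appears in, and the scores are summed. The document IDs and the two input rankings are invented for the example, and k = 60 is just the commonly used default, not a tuned value.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists of document IDs into one list, best first."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top result in a list contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from two different search mechanisms.
keyword_hits = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

Notice that the function only ever sees ranks. It has no idea what the task is, which is exactly the limitation described above.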
I recall reading about this approach somewhere, though the exact source escapes me despite my best efforts to locate it. A better way to assess, "How good are the results at answering this prompt or helping do this task?" is to ask an LLM:
Given this list of results, which are most useful in helping to perform <TASK>? <RESULTS>
Much like, "What’s helpful to a human is helpful to an LLM," an LLM is good at assessing what information is useful in a similar way to a human.
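A rough sketch of that idea in code, assuming you already have some way to send a prompt to a model and get text back. The `ask_llm` callable is a stand-in for whatever client you use, not a real library API, and the "reply with numbers" instruction is my own addition to make the answer easy to parse.

```python
from typing import Callable, List


def build_rerank_prompt(task: str, results: List[str]) -> str:
    """Number the candidate results and ask which ones best help with the task."""
    numbered = "\n".join(f"[{i}] {text}" for i, text in enumerate(results, start=1))
    return (
        "Given this list of results, which are most useful in helping to perform "
        f"{task}?\n"
        "Reply with the result numbers only, most useful first, separated by commas.\n\n"
        f"{numbered}"
    )


def rerank(task: str, results: List[str], ask_llm: Callable[[str], str]) -> List[str]:
    """Return the results the model picked, in the order it picked them."""
    reply = ask_llm(build_rerank_prompt(task, results))
    picked: List[int] = []
    for token in reply.replace(",", " ").split():
        # Ignore anything that is not a valid, not-yet-seen result number.
        if token.isdigit() and 1 <= int(token) <= len(results) and int(token) not in picked:
            picked.append(int(token))
    return [results[i - 1] for i in picked]


# A canned reply stands in for a real model call so the sketch runs offline.
def fake_llm(prompt: str) -> str:
    return "2, 1"


candidates = [
    "AI for video game development",
    "Patterns for LLM apps over patient records",
    "History of databases",
]
best = rerank("designing an LLM application for healthcare data", candidates, fake_llm)
print(best)
```

In practice you would still want to check the model's picks against your own relevance evaluations rather than trust them blindly.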
Side Note: The Nature of LLM "reasoning" and contextual data
Ask a model: "My daughter was born on 1/13/2022. How old is she?"
A raw model will likely get it wrong. ChatGPT and other assistants like Claude will likely get this right, but that’s because what you type is not all the information they are getting, and just answering the prompt is not all the computation they are doing. Try it with one of the Llama models. Then ask: "My daughter was born on 1/13/2022 and the current date is <current date>. How old is she?"
It will do a lot better.
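In code, "injecting the right context" can be as small as this. It is a sketch with made-up helper names; `age_in_years` is only there to show that the correct answer depends on a value the model cannot know unless you tell it.

```python
from datetime import date
from typing import Optional

PROMPT = "My daughter was born on 1/13/2022 and the current date is {today}. How old is she?"


def build_dated_prompt(today: Optional[date] = None) -> str:
    """Fill in the current date, the one fact the model has no way of knowing."""
    today = today or date.today()
    return PROMPT.format(today=f"{today.month}/{today.day}/{today.year}")


def age_in_years(born: date, today: date) -> int:
    """Whole years between two dates, for checking the model's answer."""
    before_birthday = (today.month, today.day) < (born.month, born.day)
    return today.year - born.year - int(before_birthday)


print(build_dated_prompt())
print(age_in_years(date(2022, 1, 13), date.today()))
```

The answer changes from one day to the next, so no amount of training data can stand in for that one line of context.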
This distinction is important: it’s not just about a full-fledged model doing the heavy lifting but about injecting the right context to guide the response. Knowing which parts of the data are relevant to the task is key to making AI responses accurate and useful.
The second key issue with RAG is its ephemerality. RAG results are temporary: they disappear after each request. That sounds annoying, right? You mean I have to keep reinjecting it into the context? But it’s an important advantage for real-time applications. LLMs have one big thing in common with humans: you can con them into telling you secrets. The ephemeral nature of RAG lets you isolate data to a particular conversation and makes it easier to prevent it from leaking out. It doesn’t eliminate the need for strong protections, but it does provide some benefits, especially when you contrast it with fine-tuning or distillation.
This opens a broader conversation around managing data in ephemeral contexts—a critical factor in LLM applications, with implications for how we secure and control AI knowledge. I’ll dive deeper into these security questions in a future post.
We’re at an exciting time in AI and LLM development. Much like web applications in their early days, we have a chance to set the foundation right. As the industry evolves, the standards we build today will shape how AI applications are developed tomorrow. That’s why it’s crucial to approach tools like RAG with a more thoughtful perspective—not just as a one-size-fits-all solution, but as a part of a broader toolkit that can be fine-tuned and optimized for specific use cases.
By focusing on the nuances of relevance and ephemerality, we can create systems that deliver better, more precise results. The opportunity to establish best practices is here, and if we’re careful, we can guide AI’s evolution toward smarter, more secure, and more impactful outcomes.