Observations from improving a RAG chatbot from useless to amazing!
Build a solid base first, then let magicians do magic
Everyone is building chatbots, riding an AI wave that will undoubtedly crash, if it isn’t crashing already. However, because we are in an industry where hype trumps literally everything else, if you’re not building with AI, you’re being left behind, and you risk being forgotten or at least surpassed by direct competitors who do leverage it.
If you don’t do it, someone else will, and… then what? This way of thinking is leading to a lot of rushed implementations with half-baked objectives that just look and feel bad. The realization will take a long time to catch up with the hype drivers, but it will surely come.
So, what is needed to build something useful in this space?
Data is king!
Unsurprisingly, the better your data, the better any implementation built on top of it. The decades-old adage of GIGO (garbage in, garbage out) has never been truer: if you feed garbage, you will get garbage, no way around it. These models are not silver bullets. In fact, the models themselves are trained on what are announced as “high-quality datasets”, which means teams of data engineers and data scientists build very complex pipelines just to get the data into decent enough shape to feed through these models’ training loops, which generally run over several weeks or months.
A big investment (the biggest?) for a lot of companies these days is building private “company LLMs”: models that use private data, unique to your company, business domain, operations, and customers, to deliver a highly customized and focused system that is much more useful for “John at company X wanting to know about project Y” than it is for Janice, who just wants to know how to bake the perfect mac and cheese.
However, even here, at the very beginning of the journey, a lot of questions start popping up: do we work with all the different data source formats we have internally, or do we try to align on something specific?
Do we want only text or images too?
It all depends on the use case, the sensitivity of your data, and the complexity of the pipeline you want to assemble. However, for a lot of use cases, where the vast majority of the data being ingested is textual, it pays off immensely to adopt a single standard format. Make everything Markdown. That’s it.
When we adopt a single format for data ingestion, things get drastically simplified: it becomes easy to prepare document ingestion pipelines that only need to deal with one format. We can use tools like Docling or similar to convert from N different formats to Markdown, extract metadata, summarize content, and organize the data with headers, sub-headers, and lists _inside_ the document itself. This is a huge win, and if it’s doable for your particular use case, you should do it.
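For illustration, here’s a minimal sketch of that normalization step using Docling in Python; the directory names are placeholders for wherever your documents actually live:

```python
# A minimal sketch of the "make everything Markdown" step using Docling.
# Directory names are placeholders for wherever your documents actually live.
from pathlib import Path

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
out_dir = Path("markdown_docs")
out_dir.mkdir(exist_ok=True)

for source in Path("raw_docs").iterdir():
    if source.suffix.lower() not in {".pdf", ".docx", ".pptx", ".html"}:
        continue
    # Docling detects the input format and builds a unified document model.
    result = converter.convert(source)
    markdown = result.document.export_to_markdown()
    (out_dir / f"{source.stem}.md").write_text(markdown, encoding="utf-8")
```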
Parsing and chunking are not on the critical path
When I first started looking into this space, there were a lot of unknowns, and even today, around a year and a few months later, the landscape keeps shifting so fast that it becomes very hard to keep track of everything. Still, just like with “classical software engineering”, if we can call it that, focusing on the fundamentals is always the best choice.
In this case, the fundamentals all stem from Information Retrieval algorithms and natural language processing, with the actual “novelty”, the LLMs, only coming into play at the very end of the pipeline.
The most important piece of advice here can be summarized in two points:
- You need to ensure that the chunks of your documents can be sent to an embedding model (basically, they need to fit in that model’s context window);
- You probably don’t need sophisticated parsing and chunking algorithms; we will tie this point together near the end.
Don’t overthink this part too much. Just ensure that the chunks you produce fit the context window of whatever embedding model you use, or that the model at least has predefined truncation strategies, to keep your pipelines nimble.
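As a concrete example, here’s a rough sketch of such a guard. The 512-token limit is a made-up value and cl100k_base is only a stand-in tokenizer; swap in the real limit and tokenizer of whatever embedding model you actually pick:

```python
# A rough sketch of a "fit the embedding model" guard. MAX_TOKENS is a
# hypothetical limit and cl100k_base is only a stand-in tokenizer; use the
# real values for whatever embedding model you actually choose.
import tiktoken

MAX_TOKENS = 512
enc = tiktoken.get_encoding("cl100k_base")

def chunk_text(text: str, max_tokens: int = MAX_TOKENS) -> list[str]:
    """Split on paragraphs, then hard-truncate anything still oversized."""
    chunks: list[str] = []
    for paragraph in text.split("\n\n"):
        tokens = enc.encode(paragraph)
        # Hard truncation keeps the pipeline nimble; smarter splitting can
        # come later if retrieval quality turns out to demand it.
        for start in range(0, len(tokens), max_tokens):
            chunks.append(enc.decode(tokens[start:start + max_tokens]))
    return chunks
```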
Don’t reinvent the wheel
Coupling solid engineering practices with providers that expose APIs for interacting with both foundation and embedding models is the way to go.
Look into your language of choice and see if there are frameworks dedicated to working with AI in this brave new world, and use them. Even better, improve them and help others navigate them, to give everyone a chance to do the right thing!
Simply focus on your secret sauce, which is your data, and then let things fit together naturally by following best practices, leveraging frameworks, and offloading the real API work to vendors that shield your apps and users from all the complexity.
Like this, you will be able to build on top of very solid foundations that make it easier to improve or extend things as needed.
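To make the “offload the API work to vendors” point concrete, here’s a hedged sketch of an embedding call going through AWS Bedrock with boto3. The model ID and request shape follow Titan Text Embeddings V2; verify both against the Bedrock docs for the model you actually choose:

```python
# A hedged sketch of offloading embedding calls to a hosted provider, here
# AWS Bedrock via boto3. The model ID and request shape follow Titan Text
# Embeddings V2; verify both against the Bedrock docs for your chosen model.
import json

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def embed(text: str) -> list[float]:
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(response["body"].read())["embedding"]
```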
Finally, if you focus on reading literature from top companies in the space while looking for ways to apply or adapt their advice to your own work, you will make great progress while staying grounded in the fundamentals. Personally, I recommend reading anything Anthropic puts out!
Practical tips
To wrap this up, here are some practical tips to try out:
- Standardize the documents in your RAG pipelines into a single, unified format (I like Markdown);
- Use proven technologies where possible (Spring Boot, Postgres, hosted model providers like AWS Bedrock);
- You don’t need a dedicated vector DB, you just need Postgres with the pgvector extension (see the first sketch after this list);
- You NEED reranking, even if you think you don’t. Use Cohere Rerank 3.5 (see the second sketch below);
- Contexts can probably be longer than you think. Sending 20 context chunks does not confuse more recent LLMs and actually actively improves their output;
- Indexes might or might not be critical, but they work differently for embeddings than for traditional DB fields, so learn them first (the pgvector sketch below includes one).
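To ground the Postgres and indexing tips, here’s a sketch of a pgvector setup plus a top-k retrieval query; the table and column names are illustrative, and the vector dimension must match your embedding model’s output:

```python
# A sketch of Postgres + pgvector standing in for a dedicated vector DB.
# Table and column names are illustrative; the vector dimension (1024) must
# match your embedding model's output.
import psycopg

SETUP_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS chunks (
    id bigserial PRIMARY KEY,
    content text NOT NULL,
    embedding vector(1024)
);
-- HNSW is an *approximate* nearest-neighbour index: unlike a B-tree, it
-- trades exactness for speed, which is why it pays to learn how it behaves.
CREATE INDEX IF NOT EXISTS chunks_embedding_idx
    ON chunks USING hnsw (embedding vector_cosine_ops);
"""

def top_k(conn: psycopg.Connection, query_embedding: list[float], k: int = 20) -> list[str]:
    # pgvector accepts vectors as '[x,y,z]' literals; <=> is cosine distance.
    literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
    with conn.cursor() as cur:
        cur.execute(
            "SELECT content FROM chunks ORDER BY embedding <=> %s::vector LIMIT %s",
            (literal, k),
        )
        return [row[0] for row in cur.fetchall()]
```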
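And for the reranking tip, a sketch of a second retrieval stage through Cohere’s Python SDK; the client and call shape follow the current v2 SDK and the rerank-v3.5 model name, so double-check Cohere’s docs before wiring it in:

```python
# A sketch of a second-stage reranker using Cohere's Python SDK. The client
# and call shape follow the current v2 SDK and the rerank-v3.5 model name;
# double-check Cohere's docs before wiring this in.
import cohere

co = cohere.ClientV2()  # picks up CO_API_KEY from the environment

def rerank(query: str, candidates: list[str], top_n: int = 20) -> list[str]:
    response = co.rerank(
        model="rerank-v3.5",
        query=query,
        documents=candidates,
        top_n=top_n,
    )
    # Results come back sorted by relevance, each pointing at an input index.
    return [candidates[r.index] for r in response.results]
```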
Overall, if you stay grounded in the fundamentals, make sensible choices that keep things simple at critical junctures, and use state-of-the-art LLMs, the quality of what you build will only be limited by the quality of your data, and for many shops, that’s where you’ll be able to shine and outdo the competition!
Above all, have fun!! This is an exciting space to be working in, and you can learn a lot about engineering fundamentals while toying around with the latest shiny toy! There aren’t many opportunities for that, so enjoy it, have fun, and build cool stuff!