LLM series - Pathlight: Why good AI starts with infrastructure

This is a guest post for the Computer Weekly Developer Network written by Trey Doig in his position as co-founder and CTO at conversational intelligence company Pathlight

Keen to address the fact that no real standards or infrastructure exist when building with LLMs, Doig suggests that every developer essentially has to build from scratch… so where does that leave us?

Doig writes as follows…

Every developer today is eager to harness the power of Large Language Models (LLMs), yet many embark on this journey without a clear understanding of what it entails.

Diving into the world of LLMs demands a substantial level of technical expertise. Building and fine-tuning these models requires a deep understanding of natural language processing (NLP), machine learning (ML) and data handling.

Beyond talent, the technical hurdles cannot be overstated. First, let’s remember that the technical infrastructure to build and maintain LLMs is non-existent. The computational resources required for training and fine-tuning LLMs can be bank-breaking, especially for startups. It’s no secret that compute power is not cheap and securing access to powerful Graphics Processing Units (GPUs), which are essential for accelerating the training process, remains a challenge.

Then, there are the safety and security implications. LLMs, if not carefully trained and monitored, can generate harmful or biased content, which can have severe consequences. Ensuring the ethical and responsible use of LLMs is an ongoing challenge that developers must address.

Too many choices

I’ll get straight to the point. In the world of LLMs, model choice has exploded. It’s been pretty much exactly a year since the OpenAI model GPT-4 was made publicly available and we’re now flooded with every kind of model imaginable spanning the worlds of both proprietary and open source software. Names here include Llama2, Mistral, Claude, Falcon 40B – the list is long.

Each of those models has varying limitations in terms of scale, rate and performance. Given rate limits (cloud networks, and all computer systems for that matter, have a maximum rate at which they can process workloads), real-time or high-throughput applications often struggle to handle processing delays. Even the trust and safety mechanisms baked into the models can contribute to the unpredictability of results, especially as the hosted foundational models out there are being regularly updated behind the scenes.
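One common way to absorb those rate limits on the client side is retry with exponential backoff. A minimal Python sketch, assuming a generic call_model client function and a placeholder RateLimitError standing in for whatever exception a given provider raises when it throttles you:

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for whatever exception your model provider raises on a rate limit."""

def call_with_backoff(call_model, prompt, max_retries=5, base_delay=1.0):
    """Retry a model call with exponential backoff plus jitter when rate limited."""
    for attempt in range(max_retries):
        try:
            return call_model(prompt)
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries, surface the error to the caller
            # Double the wait each time; add jitter so concurrent clients don't sync up
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
```

This softens transient throttling, but it does nothing for sustained throughput ceilings, which is exactly why a single hosted model cannot be the whole answer.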


The reality is that for any application making use of LLMs in the enterprise, IT teams cannot rely on any one hosted model, because the throughput constraints imposed by rate limits apply no matter how much you are willing to spend.

On top of that, a diverse mixture of model backends, especially across highly specific use cases, will produce a customer experience that is far too unpredictable and unreliable for the purpose of data analysis (which is, after all, the primary use case for LLMs as we stand today).

Instead, what we have found great success in is adopting best-in-class open source models, typically at the smaller end in terms of parameter count. We then use our own models to fine-tune these much smaller and more easily hosted models, each optimised for the specific use cases our application requires of LLMs today. In doing so, we’ve mitigated over-reliance on any one LLM vendor and opened a path towards horizontal scalability that doesn’t massively inflate our COGS.
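The tooling isn’t specified above, but as a rough sketch of the pattern, here is what fine-tuning a small open source model with a lightweight LoRA adapter can look like, assuming the Hugging Face transformers and peft libraries and a hypothetical base checkpoint:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Hypothetical small open source base model; swap in whichever checkpoint you host
base_model = "mistralai/Mistral-7B-v0.1"

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

# LoRA trains a small set of adapter weights instead of the full model,
# which keeps fine-tuning (and later hosting) cheap for narrow use cases.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# From here, training proceeds with a standard Trainer / supervised fine-tuning
# loop on task-specific examples, producing one lightweight adapter per use case.
```

The design point is less about any particular library and more about the shape of the outcome: several small, cheap-to-host models, each tuned to one job, instead of one large model doing everything.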

Build your own infrastructure

Because of these limitations, it’s critical to be able to make use of multiple LLMs, even for a single application. Relying on a single LLM is akin to the days when companies relied on one cloud provider. It won’t work in the long run.

But stringing LLMs together is not quite like connecting microservices. Unlike traditional software development, we don’t yet have standard practices (think DevOps) or tested infrastructure that any one developer or team could buy out of the box. Instead, they have to build all of it.

There are the obvious components, like having systems in place to support Retrieval Augmented Generation (RAG)-based functionality and storage of embeddings, which normally requires a vector store of some kind even at the smaller end of the scale. Unless a dataset for retrieval is small enough to fit in memory, which normally only applies to prototypes or demos, a purpose-built vector store is a requirement for any non-trivial application. As many will know, RAG itself is an AI framework used to refine and improve the responses that come from LLMs, elevating their consistency and quality by making an additional connection from the AI model to external sources of ratified knowledge data.
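As a rough sketch of the retrieval half of that picture (none of these libraries or documents are specified above; sentence-transformers and the in-memory search are assumptions standing in for a real embedding pipeline and vector store), a prototype might look like this:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Assumed embedding model; any sentence-level encoder would do
encoder = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Refunds are processed within 5 business days.",
    "Support hours are 9am to 6pm, Monday to Friday.",
    "Enterprise plans include a dedicated account manager.",
]
doc_vectors = encoder.encode(documents, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the query (cosine similarity)."""
    q = encoder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vectors @ q
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

def build_prompt(query: str) -> str:
    """Ground the model's answer in retrieved context rather than in its weights."""
    context = "\n".join(retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

print(build_prompt("How long do refunds take?"))
```

In production, the numpy search is exactly the piece you would hand over to a purpose-built vector store; the surrounding prompt-assembly logic stays much the same.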

What we’ve discovered to be one of the more complicated pieces of infrastructure to get right is a service-level abstraction layer for prompt generation. In order to actually achieve multi-model support in your application, it’s essential that your application’s business logic is capable of compiling into the various prompt formats that different models expect. Architecting a pluggable infrastructure to achieve this can be complicated for a number of reasons and the tooling and libraries that help make this possible (such as LangChain) are never the silver bullet you may expect them to be.
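Pathlight’s own implementation isn’t described here, but the shape of such an abstraction layer can be sketched in a few lines: the business logic emits one neutral request object and per-backend formatters (all hypothetical in this example) compile it into whatever each model expects.

```python
from dataclasses import dataclass

@dataclass
class Request:
    """Model-agnostic representation produced by the application's business logic."""
    system: str
    user: str

class ChatFormatter:
    """Compiles a request into the messages format used by chat-style APIs."""
    def format(self, req: Request):
        return [
            {"role": "system", "content": req.system},
            {"role": "user", "content": req.user},
        ]

class InstructFormatter:
    """Compiles a request into a single [INST]-style string used by some open models."""
    def format(self, req: Request):
        return f"<s>[INST] <<SYS>>\n{req.system}\n<</SYS>>\n\n{req.user} [/INST]"

# Registry keyed by backend; routing picks the entry, the business logic never changes
FORMATTERS = {"hosted-chat": ChatFormatter(), "self-hosted-llama": InstructFormatter()}

def compile_prompt(backend: str, req: Request):
    return FORMATTERS[backend].format(req)

req = Request(system="You are a data analyst.", user="Summarise last week's tickets.")
print(compile_prompt("self-hosted-llama", req))
```

The hard part in practice is not this mapping but keeping it correct as models, templates and safety wrappers change underneath you, which is where off-the-shelf libraries tend to fall short.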

In practice, once you’ve reached this point and have an architecture that supports multiple LLMs and routes prompts across a variety of models, you then need advanced monitoring and logging systems to better understand performance across those models.

These systems might involve a combination of playground tools, prompt monitoring and logging, evaluation frameworks and so on. Commercial solutions for some of these jobs are beginning to appear, but most are still very early in development, tightly coupled to the three or four major LLM vendors and not easily adopted by companies deep into customising their own LLM infrastructure.
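As a closing illustration (an assumption about approach, not a description of any particular product), a home-grown starting point for that monitoring layer can be as simple as a wrapper that emits one structured log line per request, recording which model served it, the latency and any failure:

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("llm_monitor")

def monitored_call(model_name: str, call_model, prompt: str):
    """Call a model backend and emit a structured log line for every request."""
    start = time.perf_counter()
    record = {"model": model_name, "prompt_chars": len(prompt)}
    try:
        response = call_model(prompt)  # call_model is whatever client wraps this backend
        record.update(status="ok", response_chars=len(response))
        return response
    except Exception as exc:
        record.update(status="error", error=type(exc).__name__)
        raise
    finally:
        record["latency_ms"] = round((time.perf_counter() - start) * 1000, 1)
        logger.info(json.dumps(record))

# Usage: wrap each backend's client, then compare latency and error rates per model
# response = monitored_call("self-hosted-llama", llama_client, prompt)
```

Feeding those log lines into whatever evaluation or dashboarding you already run is a modest start, but it is the kind of infrastructure every team building seriously with LLMs currently has to assemble for itself.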