LLM series - Stack Overflow: Winning wars with data quality & responsible AI
This is a guest post for the Computer Weekly Developer Network written by Prashanth Chandrasekar, CEO of Stack Overflow.
Chandrasekar writes in full as follows…
Data is the lifeblood that fuels Large Language Models (LLMs) and their ability to translate plain-language prompts into highly informed, specific content.
But as the old adage says, more isn’t always better.
For AI to succeed as an accurate and reliable resource to the world, LLM developers must not fall into the trap of feeding more and more data into their models with no understanding of where that data comes from or how good it is.
Picture this: an LLM brimming with potential but trained on data riddled with inaccuracies, biases or significant inconsistencies. The ramifications could be profound: flawed language generation, biased outputs or erroneous conclusions. A single misplaced word or skewed interpretation could propagate misinformation. With the stakes now so high, businesses need to keep a close eye on the data they are using in their LLMs. Now, continue this thought exercise and ask how detrimental it would be if this were a wide-scale ripple across all models, both those publicly available and those used internally at a company.
The rise of the LLM wars
The exponential boom in gen-AI, supported by a growing open source ecosystem and combined with fears of hallucinations and IP leakage, has led businesses to invest in their own LLMs. Countless more will be launched in the coming months across every cloud platform.
Companies want to move quickly, and those lacking the resources of their larger competitors may choose to use data sets that are easy to find, or to take free AI models and customise them for their own purposes and data sets.
However, that foundation means nothing if the sources are not viable. Garbage in, garbage out. As the LLM wars accelerate, the true winners will be those that emerge with trusted, accurate and attributed data sources.
The models that are trained on high-quality, diverse and representative data will be superior in understanding, generating and contextualising language. Those businesses that can source such high-quality training data are off to a good start in the race against their competitors.
However, maintaining this edge by incorporating continuously evolving and constantly verified new data is the real differentiator in winning over both business and technical users and creating a truly impactful and trusted LLM.
Navigating AI regulation
Regulation of AI is as important as it is inevitable. And while some concerns have been raised about future regulation’s potential impact on innovation and development of LLMs, we believe that balanced regulation, not over-regulation, will support innovation while allowing accurate, high-quality data to amplify the growth of AI tools.
This is particularly important considering the growing concerns around biases, trust and ethical implications of AI technology. Ensuring that models are trained on high-quality data that is diverse, representative and free from bias makes it more likely that harmful stereotypes are removed, which in turn helps to foster user trust and industry admiration.
Naturally, choice and flexibility will be required to meet the individual needs of developers, but this is only possible with an accurate layer of data at the foundation of your systems. Choice is important to developers, but there cannot be a choice between accurate and inaccurate data. Accuracy is not a ‘nice to have’ but an imperative.
How to generate trusted data?
LLMs are typically trained on content drawn from an abundance of different online communities and sources. Ultimately, that means GenAI can only regurgitate what has already been published; it is not going to provide any response to, or insightful feedback on, anything created after its last data ingestion. It can never replace communities of action, where users are actively engaged and learning directly from each other in the moment they need it most.
At Stack Overflow, we are now working actively with the world’s largest companies building LLMs to leverage our data, powered by a very active community, in socially responsible ways, and ultimately to give communities further tools, features and opportunities to continue to generate up-to-date, high-quality data. If the efforts of these crucial communities stop, we know what that means for LLMs: no more fuel for innovation, just the garbage that will create future waves of misinformation and vulnerabilities across the entire landscape.
As Generative AI and LLMs continue to advance, our foundations of vetted, trusted and accurate data will be central to how technology solutions are built. We are committed to ensuring that developers are not only contributing to the foundation of what Generative AI is today, but also that they are an integral part of helping to build its future – one of socially responsible AI.