One of the biggest criticisms of AI-generated content is hallucination: the tendency for AI models to confidently make things up. Human writers have this problem too, but industry critics have a point. Given how quickly you can spin up thousands of articles on a website with AI these days, it’s important that we try to correct for mistakes in those articles.
I work as a prompt engineer and published a prompt engineering book with O’Reilly, so I face this problem all the time in my work. Let me take you through the problem so you can see what we’re dealing with and how people handle it manually, then I’ll show you an automated AI solution you can implement today. Yes, often the answer to bad AI is more AI.
Here’s an example of an AI-generated article that caught flak online:
https://x.com/exceljet/status/1724861513608175911
The AI model that wrote this article is clearly out of date and hasn’t heard of XLOOKUP. It therefore made its best guess and assumed XLOOKUP was an add-in you install, rather than a new native function. To their credit, the site took the article down and reviewed their practices around AI-generated content.
This is a tricky spot to be in. On the one hand, most AI answers are actually really good (better than many of the human freelancers I have hired in the past), and taking down AI-generated content would mean a lot of people not getting the answers they need, and the business getting less valuable traffic. On the other hand, we don’t want to be putting slop out there and getting accused of spam.
Here’s what I typically advise my clients to do in these situations:
Here’s what I mean by that last one:
Here’s a Google Colab if you know how to code, or if you want something to pass on to your developers. Let’s go through these steps in order so you can see how to implement something like this in your AI content generation process:
I made a fake article based on this one from the New York Times on Elon Musk’s pay package shareholder vote, except I added three fake claims in there. These are the fake claims I added to the many real claims made in the article:
Using AI to find claims is fairly trivial: given a basic prompt asking GPT-4o to identify any claims that are made, it picks up pretty much everything it should. If you are in a regulated industry such as healthcare, you might want to spend much more time on this prompt and measure its accuracy in finding claims, but the task basically works out of the box without much prompt engineering.
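As a rough sketch of that step (not the exact prompt from the Colab), here’s how claim extraction might look with the OpenAI Python SDK. The prompt wording and the one-claim-per-line output format are my own assumptions:

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in your environment

def extract_claims(article_text: str) -> list[str]:
    """Ask GPT-4o to list the factual claims made in an article."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a fact-checking assistant. List every factual claim "
                    "made in the article below, one claim per line, with no commentary."
                ),
            },
            {"role": "user", "content": article_text},
        ],
    )
    raw = response.choices[0].message.content
    # One claim per line; strip bullets and drop blank lines.
    return [line.lstrip("-• ").strip() for line in raw.splitlines() if line.strip()]
```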
The rise of LLMs and the attempts to mitigate hallucinations have driven a huge amount of interest in giving AIs the ability to search for themselves. Whether it’s a vector database that can search your documents or good old-fashioned web search, gathering additional context for the LLM to make a decision is usually what solves hallucination. In this case we’re using a service called Tavily, built by the team behind AgentGPT, because it is easy to use and offers 1,000 free searches per month.
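For reference, a single query with Tavily’s Python client (tavily-python) looks roughly like this; treat the exact response fields as an assumption based on how the client behaved when I used it:

```python
from tavily import TavilyClient

tavily = TavilyClient(api_key="YOUR_TAVILY_API_KEY")

# A plain search returns ranked results, each with a URL and a relevant
# snippet of page content, much like a page of Google results.
response = tavily.search("Is XLOOKUP a native Excel function or an add-in?")
for result in response["results"]:
    print(result["url"])
    print(result["content"][:200])
```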
For each claim GPT-4o detects in our article, we simply send a query to Tavily, which runs a web search on our behalf. We can get back either a list of links like those that would show up on Google (with relevant content from each link), or we can let Tavily’s AI read that text and respond with a final verdict. I opted for the latter, but you might find it helpful to display citations to your users, or to parse the search results with your own LLM call.
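Here’s a sketch of that per-claim loop, reusing the tavily client from above and asking Tavily for a direct answer rather than just links. The include_answer flag and the answer field reflect my understanding of the client, so verify them against Tavily’s docs:

```python
def fact_check(claims: list[str]) -> list[dict]:
    """Run a web search for each claim and collect Tavily's verdict plus sources."""
    checked = []
    for claim in claims:
        result = tavily.search(
            f"Is the following claim accurate? {claim}",
            include_answer=True,  # let Tavily's AI summarise a verdict from the results
        )
        checked.append({
            "claim": claim,
            "verdict": result.get("answer"),
            "sources": [r["url"] for r in result.get("results", [])],
        })
    return checked
```

Keeping the source URLs around makes it easy to show citations to a human reviewer later, even if you only act on the verdict.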
In my use case, I just wanted to flag the claims and have the LLM provide some context, which a human would then review and investigate further. However, you could take this further and automatically rewrite sections of the article that contain an inaccuracy. You can also convert the fact-checking results into an accuracy score and use it as an evaluation metric for A/B testing whether one version of your article generation prompt produces fewer hallucinations than another.
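To make the scoring idea concrete, here’s one hedged way to turn those fact-check results into a number you could compare across prompt variants. The judge prompt and the SUPPORTED/CONTRADICTED labels are my own illustration, not a prescribed method; it reuses the OpenAI client and the checked list from the earlier sketches:

```python
def accuracy_score(checked: list[dict]) -> float:
    """Fraction of claims whose search verdict supports the claim, judged by GPT-4o."""
    if not checked:
        return 0.0
    supported = 0
    for item in checked:
        judgement = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": (
                    f"Claim: {item['claim']}\n"
                    f"Search verdict: {item['verdict']}\n"
                    "Answer with exactly one word: SUPPORTED or CONTRADICTED."
                ),
            }],
        ).choices[0].message.content.strip().upper()
        if judgement.startswith("SUPPORTED"):
            supported += 1
    return supported / len(checked)
```

You could then generate a batch of articles with each version of your prompt and compare mean accuracy scores to decide which prompt hallucinates less.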
While this code isn’t production-ready, it should give you a reasonable mental model for how a system like this could be built. In production you might make many LLM calls to review and rewrite parts of an article. I’ve even had some success building tools that generate multiple articles at once, then use fact-checking and other evaluation scores to pick the best one. One thing AI critics get wrong is that this is the worst AI will ever be: we’re seeing a step change in performance every six months, and that’s without implementing systems like this, which use existing capabilities in smarter ways. It won’t be long before AI-generated content is better and more trusted than human content.