Claim detection and fact checking for AI SEO

This is a guest post written by Mike Taylor, author of Prompt Engineering for Generative AI: Future-Proof Inputs for Reliable AI Outputs.

One of the biggest criticisms of AI-generated content is hallucination, or the tendency for AI models to confidently make things up. Although human writers also have this problem, industry critics have a point. Given the speed at which you can spin up 1,000s of articles on a website these days with AI, it’s important that we try to correct for mistakes in those articles.

I work as a prompt engineer, and published a prompt engineering book with O’Reilly, so I face this problem all the time in my work. Let me take you through the problem so you see what we’re dealing with, and how people deal with it (manually), then I’ll show you an automated AI solution you can implement today that solves for this. Yes, often the answer to bad AI, is more AI.

Here’s an example of an AI generated article that caught flack online:

https://x.com/exceljet/status/1724861513608175911

The AI model that wrote this article is clearly out-of-date, and hasn’t heard of XLookup. Therefore it made its best guess, and assumed it was an Add-in you install, rather than a new native function. To their credit, they took the article down, and reviewed their practices relating to AI-generated content.

This is a tricky spot to be in, because on the one hand, most AI answers are actually really great (better than many of the human freelancers I have hired in the past), and taking down AI-generated content would mean a lot of people not getting the answers they need, and the business getting less valuable traffic. However, we also don’t want to be putting slop out there, and getting accused of spam.

Here’s what I typically advise my clients to do in these situations:

Have a ‘fix it if anyone complains” approach (least advised)
Implement a human review stage before publish (most honorable, but expensive)
Fix the mistakes of one AI by fact-checking it with another!

Here’s what I mean by that last one:

Have GPT-4o review the article for any claims that are made
Ping a search API to find up-to-date information on each claim
Use the resulting context to correct or fix incorrect information

Here’s a Google Colab if you know how to code, or if you want something to pass onto your developers. Let’s go through these in order so you can see how to implement something like this in your AI content generation process:

1. Have GPT-4o review the article for any claims that are made

I made a fake article based on this one from the New York Times on Elon Musk’s pay package shareholder vote, except I added three fake claims in there. These are the fake claims I added to the many real claims made in the article:

# 1. Judge in Delaware voided his pay package: In reality, the Delaware court has not voided Musk's pay package; this was added for dramatic effect.
# 2. Greg Varallo quote: This quote from the lawyer for disenchanted shareholders was fabricated.
# 3. Mr. Musk would own 20.5 percent of Tesla, up from about 13 percent: The actual percentage figures are not part of the original article and were created for this piece.

Using AI to find claims is fairly trivial: given a basic prompt asking GPT-4o to check for any claims that are made, it identifies pretty much anything it should. If you are in a regulated industry such as medical you might want to spend a lot more time on this prompt and check its accuracy in finding claims, but this task basically just works out of the box without much prompt engineering.

2. Ping a search API to find up-to-date information on each claim

The rise of LLMs and the attempts to mitigate hallucinations have driven a huge amount of interest in giving AI’s the ability to search for themselves. Whether it’s vector databases, which can search your documents, or good old-fashioned web search, gathering additional context for an LLM to make a decision is usually the solution that works to solve hallucination. In this case we’re using a service called Tavily, which is by the team who made AgentGPT, as it is easy to use and offers 1,000 free searches per month.

For each claim in our article that is detected by GPT-4o, we simply make a query to Tavily, which executes a web search session on our behalf. We can get back either a list of links that would show up on Google (with some relevant content from each link), or we can opt to let Tavily’s AI read that text and respond back with a final verdict. In this case I opted for the latter, but you might find it helpful to display citations to your users, or parse the information from the search with your own LLM call.

3. Use the resulting context to correct or fix incorrect information

In my use case, I just wanted to note the claims and then have the LLM provide some context, which a human would then review and do their own investigation. However, you could take this further and automatically rewrite sections of the article based on whether it contained an inaccuracy. In addition, you can convert the results of this fact-checking into an accuracy score, and then use that as an evaluation metric for A/B testing whether one version of your article generation prompt works better than another, in terms of producing fewer hallucinations.

While this code isn’t production-ready, it should give you a reasonable mental model for how a system like this could be built. In reality, you may do many LLM calls to review and rewrite parts of your article in production. I’ve even had some success building tools that generate multiple articles at once, and then use fact checking and other evaluation scores to pick the one that works best. One thing that AI critics get wrong, is that this is the worst that AI will ever be: we’re seeing a step change in performance every six months, and that’s without even implementing systems like this, which use existing capabilities in smarter ways. It won’t be long before AI-generated content is better and more trusted than human content.

This is a guest post written by Mike Taylor, author of Prompt Engineering for Generative AI: Future-Proof Inputs for Reliable AI Outputs.

This piece was co-written with

mack grenfell

forecasting

other writing

Claim detection and fact checking for AI SEO

1. Have GPT-4o review the article for any claims that are made

2. Ping a search API to find up-to-date information on each claim

3. Use the resulting context to correct or fix incorrect information

Thanks for reading

I'd love to hear your thoughts; come say hi to me on Twitter.

If you want to join 400 other growth marketers in hearing about when I post new stuff, drop your email below. No spam, I promise.