At Packt Publishing, we’re continually exploring new ways to enhance our editorial processes, especially as we manage highly technical content across a range of subjects. With OpenAI’s recent release of o1-preview and o1-mini, we saw an exciting opportunity to experiment with these advanced reasoning models to assist with quality assurance (QA). These AI models promise to excel at tasks like code validation and clarity improvement, areas where precision and speed are critical for technical publishing.
Recently, I led an experiment to compare the feedback generated by o1-preview and o1-mini against that of our experienced human editors. We wanted to see if these AI tools could enhance our workflow while addressing practical considerations such as cost, response speed, and functionality.
Why OpenAI o1-preview and o1-mini?
o1-preview is designed to solve complex problems through deeper reasoning, making it a potential asset for our editorial team, particularly for high-level tasks like validating code snippets, ensuring technical accuracy, and suggesting improvements for clarity. At the same time, o1-mini, a faster and more cost-efficient version of the model, offered an appealing option for discrete, STEM-focused work such as math-related QA checks.
Our goal was to see how these models performed in real-world editorial scenarios, using content already in development at Packt.
The Experiment: Testing AI in Publishing
We fed several draft chapters through o1-preview and o1-mini and compared the models' feedback with the results from our human editors; a sketch of what one of these calls might look like follows the list below. We focused on three core areas:
- Code validation: Could the AI models identify bugs, inefficiencies, or errors in the code?
- Clarity of explanations: Could they suggest ways to improve the readability and accessibility of technical content?
- Fact-checking: Would the models be able to flag any outdated or inaccurate information?
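To make the setup concrete, here is a minimal sketch, in Java, of how a draft excerpt might be sent to o1-preview through the standard Chat Completions endpoint. It is illustrative rather than a record of our actual pipeline: the class name, prompt wording, and placeholder excerpt are hypothetical, and a real run would include the full chapter text and parse the model's reply before handing it to an editor.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class DraftChapterQa {

    public static void main(String[] args) throws Exception {
        // Hypothetical prompt; a real run would paste in the full excerpt
        // from the draft chapter rather than a placeholder.
        String prompt = """
                Review the following excerpt from a draft technical chapter.
                Flag bugs or inefficiencies in the code, unclear explanations,
                and any facts that look outdated.

                <chapter excerpt goes here>
                """;

        // o1-preview and o1-mini are called through the standard
        // Chat Completions endpoint; at launch they accept only
        // user-role messages, so the instructions live in the prompt itself.
        String body = """
                {
                  "model": "o1-preview",
                  "messages": [{"role": "user", "content": %s}]
                }
                """.formatted(jsonString(prompt));

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://api.openai.com/v1/chat/completions"))
                .header("Authorization", "Bearer " + System.getenv("OPENAI_API_KEY"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // Raw JSON back from the API; a fuller pipeline would parse
        // choices[0].message.content before handing it to an editor.
        System.out.println(response.body());
    }

    // Minimal JSON string escaping so the prompt can be embedded in the body.
    private static String jsonString(String s) {
        return "\"" + s.replace("\\", "\\\\")
                       .replace("\"", "\\\"")
                       .replace("\n", "\\n") + "\"";
    }
}
```

Changing the model field to "o1-mini" sends the same check to the cheaper model, which makes a side-by-side comparison of the two straightforward.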
Results: How o1-preview and o1-mini Performed
1. Code Validation: o1-preview vs. o1-mini
Both models showed strong capabilities when it came to code validation, but they excelled in different ways. o1-preview was able to handle more complex reasoning tasks. For example, it identified an issue in a piece of Java code where a logic error would have caused the application to behave unexpectedly in specific environments. The model not only flagged the bug but also provided a more efficient solution that optimised performance—something that might have gone unnoticed in a traditional review process.
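The chapter's actual code isn't reproduced here, but a hypothetical Java snippet along these lines shows the class of bug in question: a string comparison that only misbehaves under particular runtime environments (here, the default locale), together with the kind of simpler, slightly more efficient fix a reviewer might propose.

```java
import java.util.Locale;

public class CommandDispatcher {

    // BUG: toUpperCase() without an explicit locale uses the default locale,
    // so on a machine configured for Turkish, "login".toUpperCase() becomes
    // "LOGİN" (dotted capital I) and the comparison silently fails.
    static boolean isLoginCommand(String command) {
        return command.toUpperCase().equals("LOGIN");
    }

    // Suggested fix: compare case-insensitively without allocating a new
    // string, which is locale-safe and slightly more efficient.
    static boolean isLoginCommandFixed(String command) {
        return command.equalsIgnoreCase("LOGIN");
    }

    public static void main(String[] args) {
        Locale.setDefault(Locale.forLanguageTag("tr-TR")); // simulate the problem environment
        System.out.println(isLoginCommand("login"));       // false under a Turkish locale
        System.out.println(isLoginCommandFixed("login"));  // true everywhere
    }
}
```

Run on a machine with a Turkish default locale, the original check silently fails, while the locale-safe version behaves the same everywhere and avoids allocating an extra string.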
On the other hand, o1-mini handled more targeted tasks like math validation and basic coding logic checks well. While its reasoning capacity is lower than o1-preview's, it was much faster and still provided high-quality, contextually relevant feedback, particularly for STEM-related content.
2. Clarity of Explanations
When it came to improving the clarity of technical explanations, o1-preview provided useful insights but didn’t significantly outperform our human editors. It made some suggestions for simplifying dense sections, but the improvements were incremental rather than transformative. The model occasionally offered useful rewording for specific technical terms, but it didn’t provide consistent, high-impact suggestions.
o1-mini, optimised for STEM reasoning, performed admirably on tasks that required mathematical explanations or coding logic. However, it struggled when tasked with more general editorial work, as it lacks the broad world knowledge required to excel in non-STEM content.
3. Fact-Checking
Fact-checking proved to be a mixed bag for o1-preview. While the model successfully flagged outdated data in a cloud architecture guide and highlighted areas that needed updates, it struggled with newer, niche technologies, where human expertise was still necessary. o1-preview was adept at verifying general tech specifications but less reliable when diving into emerging fields.
o1-mini, focused on STEM domains, wasn’t as effective in broader fact-checking tasks. While it excelled in mathematical validation and code logic, its limited world knowledge made it less useful for verifying non-technical details or handling general editorial queries.
Challenges: Cost, Speed, and Limitations
Although the models showed promise, they also presented several key challenges that make them less practical for large-scale use at this stage:
1. Higher Cost
o1-preview is significantly more expensive per API call than the GPT models we typically use, and running it across large projects quickly becomes cost-prohibitive. While its reasoning capabilities are impressive, the price is difficult to justify for routine editorial tasks, especially when human editors or less costly AI models can handle simpler checks.
o1-mini, on the other hand, is 80% cheaper than o1-preview and better suited for discrete tasks like math reasoning and code validation. This cost-efficiency makes o1-mini an attractive option for targeted use cases, especially when we need quick, accurate feedback on specific technical tasks.
2. Slower Response Speed
One major drawback of o1-preview was its slower response speed, a result of its reasoning-intensive design. The extra processing time occasionally led to more thoughtful feedback, but it significantly slowed our workflow and is hard to scale in a fast-paced publishing environment. o1-mini, by contrast, delivered feedback much faster, making it better suited to large-scale tasks where quick turnaround is essential.
3. Limited Functionality
Both models showed some limitations in terms of functionality. o1-preview, being in its preview phase, lacks certain capabilities like batch processing and structured data support, which are essential for scaling editorial tasks. This made it difficult to use for larger projects, where processing multiple files or batches at once is critical to efficiency.
o1-mini, while effective for its intended tasks, also has limitations. Its focus on STEM reasoning means it lacks the broader world knowledge required for fact-checking or handling more diverse editorial tasks. This leaves it less versatile than other models, but its specialisation in coding and math still makes it an invaluable tool for those specific areas.
A Promising Supplement, Not a Full Replacement
Our experiment with o1-preview and o1-mini revealed that while these models have significant potential, they aren’t yet ready to replace human editors or serve as a comprehensive solution for all QA tasks. o1-preview excels in complex code validation and reasoning but is hampered by its high cost and slower response times. o1-mini is a faster, more cost-effective option for discrete tasks like math validation and coding checks, making it a strong candidate for specialised use cases.
At Packt, we see both models as valuable supplements to our human-led QA process. They can save time and improve accuracy in targeted areas, particularly in code-heavy content. However, for more general tasks—like broad fact-checking or content clarity—human expertise remains irreplaceable. As OpenAI continues to evolve these models, we look forward to exploring their potential even further.