Beyond Generic Metrics: Building AI Benchmarks That Actually Deliver
Flashy AI demos arrive almost daily, and new models post record-breaking scores on public benchmarks every few months. While impressive, many of us are still waiting for those models and apps to translate into tangible tools that make us more productive in our personal and professional lives. Here, I share my observations on why standard benchmarking is becoming less applicable to real-world problems and offer a framework for building tests that evaluate when breakthroughs in AI truly matter for your specific needs.
The Role and Limitations of Standard Benchmarks
Benchmarks like MMLU (measuring multitask language understanding), HumanEval (for coding abilities), Humanity’s Last Exam, and HELM (for holistic evaluation) have advanced our understanding of AI capabilities. They provide valuable standardized measurements that enable scientific progress. However, because these benchmarks must be general enough to apply across many models and use cases, they are often insufficient for evaluating practical applications in specific contexts.
Public benchmarks succeed in:
Establishing a common reference point for comparing different models
Tracking progress in foundational capabilities over time
Setting baselines for core skills like reasoning, knowledge retrieval, and specialized tasks
Public benchmarks fail to:
Accurately assess which AI capabilities deliver real value to your specific organization
Provide actionable information for when you should adopt new technologies
Show whether advancements will offer a return on your AI investment now and in the future
Translating Benchmark Performance to Business Value
Notably, the relationship between benchmark improvements and practical value isn't strictly linear. A model with a hypothetical 2% higher score on an academic benchmark might not deliver proportionate improvements in your specific application (i.e., 2% more intelligence does not equal 2% more value for you). This isn't because benchmarks lack value; I believe they're crucial for progress. It's because real-world value emerges from a more complex combination of factors, as the small sketch after the list below illustrates. Generally, these factors include:
Foundational capabilities (what benchmarks measure)
Domain-specific knowledge (your industry, company, or field)
Integration quality (how well the AI works within your workflows and systems)
Task alignment (how appropriate the AI is for your particular needs)
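To make that concrete, here is a deliberately simplified sketch. The weights and scores are made-up numbers, not a measured model of value; the point is only that a small benchmark gain can wash out once the other factors are weighted in.

```python
# Hypothetical illustration: overall value as a weighted mix of factors.
# The weights and scores below are invented for the sake of the example.

weights = {
    "foundational_capability": 0.25,  # roughly what public benchmarks measure
    "domain_knowledge": 0.25,
    "integration_quality": 0.30,
    "task_alignment": 0.20,
}

model_a = {"foundational_capability": 0.80, "domain_knowledge": 0.60,
           "integration_quality": 0.50, "task_alignment": 0.70}

# Model B scores 2% higher on the benchmark-like factor, everything else equal.
model_b = {**model_a, "foundational_capability": 0.82}

def value(scores: dict) -> float:
    """Weighted sum of the factor scores."""
    return sum(weights[k] * scores[k] for k in weights)

print(f"Model A value: {value(model_a):.3f}")  # 0.640
print(f"Model B value: {value(model_b):.3f}")  # 0.645
# The 2% benchmark gain moves the overall value score by under 1% here,
# because integration quality and task alignment dominate in this toy example.
```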
In some industries, like software development, we already see meaningful returns on AI. In other sectors, however, progress lags. Each model advancement opens up possibilities that were previously out of reach; capturing that value, however, requires thoughtfulness in adapting general intelligence tools for your specific benefit.
“Stuck In Demo”
Many businesses, especially startups, get stuck in neutral; that is to say, they never get off the ground. AI-first companies and products are not immune to this. A new phenomenon I am calling “stuck in demo” is becoming common.
Public demonstrations showcase AI capabilities under favorable conditions. These demos serve an important purpose: they illustrate potential and inspire new applications. Currently, many AI apps are “stuck” in the demo phase, where videos with elegant voiceovers wow users only to never deliver any real value to any real user.
The journey from demonstration to production involves addressing additional considerations and careful tailoring of general intelligence to specific use cases. The following considerations keep AI apps stuck in demo:
A lack of consistent performance across diverse inputs (prompt engineering should not be the job of the end user)
Ineffectiveness in handling messy, real-world data
Clunky integration with existing workflows and systems (lots of copy/pasting and extensive reformatting before any AI output is usable)
Limited management of edge cases
Irregularities in the probabilistic output of the selected AI model (OpenAI’s Structured Outputs feature helps with this but isn’t perfect; a small validation sketch follows this list)
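To illustrate that last point, one common mitigation is to validate the model's output against a schema and retry when it doesn't conform. The sketch below is a rough outline of that idea: `call_model` is a stand-in for whatever API you actually use, and the `TicketSummary` schema is a hypothetical example, not a prescribed format.

```python
# Minimal sketch: validate probabilistic model output against a schema and retry.
import json
from dataclasses import dataclass

@dataclass
class TicketSummary:
    customer_issue: str
    severity: str          # expected: "low", "medium", or "high"
    suggested_action: str

def call_model(prompt: str) -> str:
    """Placeholder: replace with a real API call that returns a JSON string."""
    raise NotImplementedError

def parse_or_retry(prompt: str, max_attempts: int = 3) -> TicketSummary:
    for attempt in range(max_attempts):
        raw = call_model(prompt)
        try:
            data = json.loads(raw)
            summary = TicketSummary(**data)
            if summary.severity not in {"low", "medium", "high"}:
                raise ValueError(f"unexpected severity: {summary.severity}")
            return summary
        except (json.JSONDecodeError, TypeError, ValueError) as err:
            # Feed the error back so the next attempt can self-correct.
            prompt = f"{prompt}\n\nYour last reply was invalid ({err}). Return valid JSON only."
    raise RuntimeError(f"No valid output after {max_attempts} attempts")
```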
Reflecting on history, the gap between demonstration and production is a natural part of technology adoption. Bridging from proof-of-concept to practical application requires recognizing nuance and investing real effort. Nuance is where you, as an AI user, can create the largest impact right now.
Developing Your Personal Benchmark(s)
Moving past public benchmarks requires work, but it isn’t hard. AI users and organizations looking to adopt AI should work to develop a customized benchmark suite focused specifically on the 3-5 workflows most important to them. The goal is not to fully test emerging AI capabilities, but rather to identify advancements in AI systems that translate to increases in your personal or professional success. By concentrating on 3-5 key tasks, you can create a focused evaluation framework that measures what matters most to your operations. Below, I have created a rough outline to help you think about this process:
1. Identify your critical tasks
Begin by defining 3-5 specific tasks or workflows that:
Currently consume significant time or resources
Would deliver substantial value if improved
Represent different aspects of your operations
For example:
Summarizing specific documents you regularly process (customer support tickets, research papers, legal contracts)
Creating content that follows your brand voice and compliance requirements
Answering employee questions using your internal knowledge base
Analyzing data in new formats with contexts specific to your business
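If it helps to make this step concrete, your critical tasks can live in something as simple as a small data structure that the later steps build on. The task names, time estimates, and notes below are hypothetical placeholders.

```python
# Illustrative sketch: capture your 3-5 critical tasks as plain data.
from dataclasses import dataclass

@dataclass
class CriticalTask:
    name: str
    description: str
    hours_per_week: float      # rough current time cost
    value_if_improved: str     # short note on the expected payoff

CRITICAL_TASKS = [
    CriticalTask(
        name="support_ticket_summaries",
        description="Summarize incoming customer support tickets",
        hours_per_week=6.0,
        value_if_improved="Faster triage, more consistent escalation",
    ),
    CriticalTask(
        name="brand_voice_content",
        description="Draft content in brand voice with compliance checks",
        hours_per_week=4.0,
        value_if_improved="Less editing, fewer compliance reworks",
    ),
]
```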
2. Create representative test examples
For each task, develop a realistic test set that reflects your actual work:
Include examples from your operations
Incorporate challenging cases that currently require significant effort (if the AI can successfully complete every task you ask of it, your test is too simple)
Use your domain-specific terminology and context
Balance routine scenarios with potential edge cases that test system limits
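One lightweight way to organize such a test set (the format here is an assumption, not a requirement) is a collection of records holding the real input, the task it belongs to, and what a good answer must contain.

```python
# Illustrative sketch: a small, realistic test set stored as plain records.
# The cases shown are placeholders; in practice, pull them from your own work.
from dataclasses import dataclass, field

@dataclass
class TestCase:
    task: str                     # which critical task this case belongs to
    input_text: str               # the actual document, ticket, or question
    must_include: list[str] = field(default_factory=list)  # facts a good answer contains
    is_edge_case: bool = False    # flag the hard cases that currently take real effort

TEST_SET = [
    TestCase(
        task="support_ticket_summaries",
        input_text="Customer reports intermittent 504 errors on checkout since the v2.3 rollout...",
        must_include=["504", "checkout", "v2.3"],
    ),
    TestCase(
        task="support_ticket_summaries",
        input_text="Ticket written half in English, half in German, referencing an order ID that does not exist...",
        must_include=["missing order ID"],
        is_edge_case=True,
    ),
]
```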
3. Define meaningful success metrics
While standard accuracy metrics provide value, consider additional dimensions that matter to your operations. Don’t be afraid to use anecdotal experiences to judge subjective results:
Time savings compared to current processes
Error rates on critical aspects specific to your needs
Consistency across different input types
Usability for your team members
Qualitative quality of response
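Some of these dimensions reduce to simple numbers; others stay anecdotal, and that’s fine. The functions below are a rough sketch of the quantifiable ones, with names and signatures I’ve invented for illustration.

```python
# Rough sketch of a few quantifiable metrics; names and signatures are illustrative.

def time_savings(baseline_minutes: float, ai_assisted_minutes: float) -> float:
    """Fraction of time saved versus the current process (0.25 = 25% faster)."""
    return (baseline_minutes - ai_assisted_minutes) / baseline_minutes

def critical_error_rate(outputs: list[str], required_facts: list[list[str]]) -> float:
    """Share of outputs missing at least one fact that matters for your use case."""
    misses = sum(
        1 for output, facts in zip(outputs, required_facts)
        if any(fact.lower() not in output.lower() for fact in facts)
    )
    return misses / len(outputs)

def consistency(scores_by_input_type: dict[str, float]) -> float:
    """Spread between the best- and worst-handled input type (smaller is better)."""
    return max(scores_by_input_type.values()) - min(scores_by_input_type.values())
```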
4. Test in realistic environments
Evaluate systems in settings that approximate your workflow:
Under typical operational conditions
With input from actual end users (both AI-native and non-native employees)
Within your technology ecosystem
Against any specific quality standards you might have
5. Measure relative improvement
Assess each solution compared to:
Your current approach
Previous AI implementations you've tested
Ubiquitous consumer options (ChatGPT by OpenAI or Claude by Anthropic)
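In practice, this can be as simple as putting the same test set through each option and comparing a shared score. The option names and the `score` function below are placeholders for whatever you actually run.

```python
# Illustrative comparison harness: run the same test set through each option
# and report scores relative to your current approach. Names are placeholders.

def score(option_name: str) -> float:
    """Placeholder: run your test set through this option and return an overall score."""
    raise NotImplementedError

OPTIONS = [
    "current_manual_process",     # your baseline
    "previous_ai_pipeline",       # the last implementation you tested
    "consumer_chat_tool",         # e.g., ChatGPT or Claude out of the box
    "new_candidate_model",        # the release you are evaluating
]

def relative_report(results: dict[str, float], baseline: str = "current_manual_process") -> None:
    """Print each option's score and its change relative to the baseline."""
    base = results[baseline]
    for name, value in results.items():
        delta = (value - base) / base if base else float("nan")
        print(f"{name:28s} score={value:.2f}  vs baseline: {delta:+.1%}")

# Usage (once `score` is wired up to your evaluation):
# relative_report({name: score(name) for name in OPTIONS})
```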
By developing this focused benchmark approach, you create a systematic method for evaluating AI progress in terms of your specific needs.
Conclusion
As AI continues to mature, the gap between standard benchmark performance and real-world utility is likely to widen. Additionally, as AI begins to tackle more subjective tasks, it will become increasingly hard to quantify gains. This means that the models with the highest benchmark scores will not necessarily be the ones that deliver the most value for your goals.
Predicting AI progress is daunting, but the future of AI evaluation won’t be about abandoning benchmarks; it’ll be about creating ones that actually measure what matters to you. If standard benchmarks continue to become less relevant for predicting real-world performance, your personalized benchmark system will become increasingly valuable as a decision-making tool.
The people and organizations that thrive in this new era won't be those that reinvent their entire stack every time a new AI model is announced with the highest benchmark scores, nor will they be those who never adopt the latest tech. Success will come to those that systematically identify and implement the specific AI capabilities that deliver measurable value for their unique needs. In my opinion, building personal benchmarks is no longer optional: it is essential for anyone seeking to maximize the value of AI in an increasingly noisy landscape.