
    OpenAI Introduces the Evals API: Streamlined Model Evaluation for Developers

    April 9, 2025

    In a significant move to empower developers and teams working with large language models (LLMs), OpenAI has introduced the Evals API, a new toolset that brings programmatic evaluation capabilities to the forefront. While evaluations were previously accessible via the OpenAI dashboard, the new API allows developers to define tests, automate evaluation runs, and iterate on prompts directly from their workflows.

    Why the Evals API Matters

    Evaluating LLM performance has often been a manual, time-consuming process, especially for teams scaling applications across diverse domains. With the Evals API, OpenAI provides a systematic approach to:

    • Assess model performance on custom test cases
    • Measure improvements across prompt iterations
    • Automate quality assurance in development pipelines

    Now, every developer can treat evaluation as a first-class citizen in the development cycle—similar to how unit tests are treated in traditional software engineering.

    Core Features of the Evals API

    1. Custom Eval Definitions: Developers can write their own evaluation logic by extending base classes.
    2. Test Data Integration: Seamlessly integrate evaluation datasets to test specific scenarios.
    3. Parameter Configuration: Configure model, temperature, max tokens, and other generation parameters.
    4. Automated Runs: Trigger evaluations via code, and retrieve results programmatically.

    The Evals API supports a YAML-based configuration structure, allowing for both flexibility and reusability.
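
    To give a flavor of that structure, here is a hedged sketch of what such a config might look like. The file name, keys, and class path are assumptions modeled on the open-source openai/evals registry format, not an official schema:

    # eval_config.yaml -- hypothetical eval definition (all names are assumptions)
    my_eval:
      id: my_eval.v0
      description: Checks that sampled answers match the ideal answers.
      metrics: [accuracy]
    my_eval.v0:
      class: evals.elsuite.basic.match:Match
      args:
        samples_jsonl: my_eval/samples.jsonl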

    Getting Started with the Evals API

    To use the Evals API, you first install the OpenAI Python package:

    pip install openai

Then you can run an evaluation using a built-in eval, such as factuality_qna:

    oai evals registry:evaluation:factuality_qna \
      --completion_fns gpt-4 \
      --record_path eval_results.jsonl
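
    Each run writes its records to the file passed as --record_path. The exact record schema is not documented here, so as a hedged sketch, you can inspect the output with a few lines of Python:

    import json

    # Each line of the record file is one JSON object describing a sample,
    # its completion, or an aggregate metric.
    with open("eval_results.jsonl") as f:
        records = [json.loads(line) for line in f]

    print(f"{len(records)} records logged")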

    Or define a custom eval in Python:

    import openai.evals

    class MyRegressionEval(openai.evals.Eval):
        def run(self):
            # Iterate over the eval's dataset, generate a completion for
            # each input, and score it against the ideal answer.
            for example in self.get_examples():
                result = self.completion_fn(example['input'])
                score = self.compute_score(result, example['ideal'])
                yield self.make_result(result=result, score=score)

    This example shows how you can define custom evaluation logic: subclass the base Eval class, loop over the dataset, and score each completion against its ideal answer.
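
    If you prefer to see the same loop without the framework, here is a minimal, self-contained sketch using the standard OpenAI Python client; the sample data and the exact-match scoring are assumptions for illustration, not part of the Evals API:

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Hypothetical test cases; the {'input', 'ideal'} schema mirrors the
    # fields used by the eval class above.
    examples = [
        {"input": "What is the capital of France?", "ideal": "Paris"},
        {"input": "How many legs does a spider have?", "ideal": "8"},
    ]

    def run_eval(model: str = "gpt-4") -> float:
        """Score each example 1/0 on exact match and return mean accuracy."""
        scores = []
        for ex in examples:
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": ex["input"]}],
            )
            answer = resp.choices[0].message.content.strip()
            scores.append(1.0 if answer == ex["ideal"] else 0.0)
        return sum(scores) / len(scores)

    if __name__ == "__main__":
        print(f"accuracy: {run_eval():.2f}")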

    Use Case: Regression Evaluation

    OpenAI’s cookbook example walks through building a regression evaluator using the API. Here’s a simplified version:

    import openai.evals
    from sklearn.metrics import mean_squared_error

    class RegressionEval(openai.evals.Eval):
        def run(self):
            predictions, labels = [], []
            for example in self.get_examples():
                response = self.completion_fn(example['input'])
                # float() assumes the prompt asks the model to reply with
                # a bare number.
                predictions.append(float(response.strip()))
                labels.append(example['ideal'])
            mse = mean_squared_error(labels, predictions)
            # Negate the MSE so that, as with accuracy-style scores,
            # higher is better.
            yield self.make_result(result={"mse": mse}, score=-mse)

    This allows developers to benchmark numerical predictions from models and track changes over time.
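
    To make the data side concrete, here is a hedged sketch of the kind of records this evaluator could consume; the schema mirrors the {'input', 'ideal'} fields used above, and the prompts are hypothetical:

    # Hypothetical regression examples: 'ideal' is numeric, and each prompt
    # asks for a bare number so float(response.strip()) parses cleanly.
    regression_examples = [
        {"input": "Respond with only a number: 12 * 12 = ?", "ideal": 144.0},
        {"input": "Respond with only a number: the square root of 81", "ideal": 9.0},
    ]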

    Seamless Workflow Integration

    Whether you’re building a chatbot, summarization engine, or classification system, evaluations can now be triggered as part of your CI/CD pipeline. This ensures that every prompt or model update maintains or improves performance before going live.

    import openai.evals

    # Trigger a configured eval run programmatically; the YAML file supplies
    # the dataset and eval definition.
    openai.evals.run(
      eval_name="my_eval",
      completion_fn="gpt-4",
      eval_config={"path": "eval_config.yaml"}
    )
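
    In practice, such a call can sit behind an ordinary test so that a quality regression blocks the merge. A minimal sketch, assuming pytest and an accuracy-style score; the 0.90 threshold and the run_eval helper from the earlier sketch are hypothetical:

    # Hypothetical CI quality gate (pytest): fail the build when eval
    # accuracy drops below an agreed threshold.
    def test_prompt_quality_gate():
        accuracy = run_eval(model="gpt-4")  # helper from the earlier sketch
        assert accuracy >= 0.90, f"eval accuracy {accuracy:.2f} fell below 0.90"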

    Conclusion

    The launch of the Evals API marks a shift toward robust, automated evaluation standards in LLM development. By offering the ability to configure, run, and analyze evaluations programmatically, OpenAI is enabling teams to build with confidence and continuously improve the quality of their AI applications.

    To explore further, check out the official OpenAI Evals documentation and the cookbook examples.

