Benchmarks and Evaluations¶
In software engineering, the purpose of a benchmark is to associate some performance metric (accuracy, latency, memory usage, etc.) with a specific piece of code or a system. That metric is then used to compare different implementations of the same functionality, with the goal of improving it over time. This is especially important for machine learning systems, where benchmarks measure functional capabilities on specific tasks.
To measure the capabilities of machine learning systems, the nearai project provides a benchmark tool that compares different agents and solvers on sets of reference evaluations, e.g. mbpp. The core metric for benchmarks like these is "percent true", or accuracy.
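As a minimal sketch (not nearai's actual aggregation code), such a score is simply the fraction of per-example results that are correct:

results = [True, True, False, True]  # one boolean per benchmark example (True = solver answered correctly)
accuracy = sum(results) / len(results)
print(f"Final score: {sum(results)}/{len(results)} - {accuracy:.2%}")  # Final score: 3/4 - 75.00%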
How is a benchmark implemented?¶
In the nearai project, a benchmark is the combination of a dataset and a solver.
Adding a benchmark dataset¶
nearai leverages Hugging Face datasets as the primitive for working with datasets and benchmarks (see load_dataset). This means that to add a new benchmark, you need to create a new dataset and register it with the nearai registry (we will go over this in Implementing the "3 digit addition" benchmark below). There is also support for datasets in custom formats.
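For illustration, the underlying Hugging Face primitive can be used directly; note that nearai's own load_dataset helper builds on this and resolves registry entries, so its exact call may differ from this plain Hugging Face example:

from datasets import load_dataset

# Plain Hugging Face usage; nearai's load_dataset wraps the registry on top of this.
mbpp = load_dataset("mbpp", split="test")
print(mbpp[0]["text"])  # the first MBPP task description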
Adding a solver¶
To implement a solver, you will need to implement the SolverStrategy interface under the nearai.solvers
module. The most important method the solver should implement is the solve
method. This method should take a datum, run your implementation specific agentic strategy / strategies, and return a result.
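A minimal skeleton might look like this (method names mirror the 3 digit addition example later on this page; the actual SolverStrategy base class may require additional methods):

from typing import Dict, List

from nearai.solvers import SolverStrategy


class MyBenchmarkSolver(SolverStrategy):
    def evaluation_name(self) -> str:
        return "my_benchmark"

    def compatible_datasets(self) -> List[str]:
        # Names of registry datasets this solver can be run against.
        return ["my_benchmark"]

    def solve(self, datum: Dict[str, str]) -> bool:
        # Parse the datum, run your agentic strategy, and return whether
        # the produced answer is correct.
        ...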
Implementing the "3 digit addition" benchmark¶
In this section we will implement a benchmark we'll call "3 digit addition". Its goal is to test an agent's ability to add two 1-3 digit numbers. The dataset consists of 1000 addition problems and their solutions; the solver adjudicates the agent's answers and returns a single accuracy score. While this task is trivial to solve with an ordinary program, it serves as a good example of how to implement a benchmark in nearai.
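Each example is a simple input/output pair of strings, for instance:

{"input": "3 + 4", "output": "7"}
{"input": "123 + 456", "output": "579"}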
Step 1: Creating the dataset¶
To create this dataset, we will first synthetically generate the data and then register it with the nearai registry.
import random
from itertools import product

import datasets

SAMPLE_SIZE = 1000
SEED = 42
PATH = "3_digit_addition"

random.seed(SEED)

# Sample 1000 distinct (a, b) pairs with 0 <= a, b <= 999 and save the
# resulting input/output pairs as a Hugging Face dataset on disk.
datasets.Dataset.from_generator(
    lambda: iter(
        {
            "input": f"{a} + {b}",
            "output": str(a + b),
        }
        for a, b in random.sample(list(product(range(1000), range(1000))), SAMPLE_SIZE)
    ),
    features=datasets.Features(
        {
            "input": datasets.Value("string"),
            "output": datasets.Value("string"),
        }
    ),
).save_to_disk(PATH)
With the dataset saved locally, the next step is to upload it to the nearai registry.
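The exact syntax depends on your nearai CLI version; assuming a registry upload subcommand that takes the local dataset path, the command looks like:

# Assumed CLI syntax; check `nearai registry --help` for your version.
nearai registry upload 3_digit_addition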
Step 2: Creating the solver¶
To create the solver, we implement the SolverStrategy interface. The solver takes in a datum, parses the input, performs any setup for the agent, runs the agent, and returns whether the agent's result is correct.
Remember

To ensure this solver is registered with nearai:

- Write this implementation in the nearai.solvers module.
- Import it in the __init__.py file of the nearai.solvers module.
# ... other imports ...
from typing import Dict, List

from datasets import Dataset
from pydantic import BaseModel

from nearai.solvers import SolverStrategy


class ThreeDigitAdditionDatum(BaseModel):
    input: str
    output: str


class ThreeDigitAdditionSolver(SolverStrategy):
    """Solver for the 3 digit addition benchmark."""

    def __init__(self, dataset_ref: Dataset, model: str = "", agent: str = ""):
        super().__init__(model, agent)
        self.dataset_ref = dataset_ref

    def evaluation_name(self) -> str:
        return "3_digit_addition"

    def compatible_datasets(self) -> List[str]:
        return ["3_digit_addition"]

    def solve(self, datum: Dict[str, str]) -> bool:
        datum = ThreeDigitAdditionDatum(**datum)
        # Label the inference session after the problem, e.g. "12+345".
        label = datum.input.replace(" + ", "+")
        session = self.start_inference_session(label)
        goal = f"""Please add the following numbers together: {datum.input}\n\nOutput the result only."""
        result = session.run_task(goal).strip()
        return result == datum.output
The code above can run for both models and agents. If both model and agent are given, the model value will be inserted into the agent metadata.
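For example, a single run can supply both (the model name and agent path here are placeholders, following the example runs further below):

nearai benchmark run near.ai/3_digit_addition/1.00 ThreeDigitAdditionSolver --model 'llama-v3p1-405b-instruct' --agent ~/.nearai/registry/<my_agent>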
To check an agent's ability to write files, the solve method can instead ask the agent to write its answer to a file and read it back:
# Requires `import os` among the other imports.
def solve(self, datum: Dict[str, str]) -> bool:
    datum = ThreeDigitAdditionDatum(**datum)
    label = datum.input.replace(" + ", "+")
    session = self.start_inference_session(label)
    goal = f"""Please add the following numbers together: {datum.input}\n\nOutput the result in a file called 'result.txt'."""
    session.run_task(goal)
    # Read the agent's answer back from the file it was asked to write.
    with open(os.path.join(session.path, "result.txt"), "r") as f:
        result = f.read().strip()
    return result == datum.output
Step 3: Running the benchmark¶
To run the benchmark, we use the nearai CLI, specifying the dataset and solver we want to use:
nearai benchmark run near.ai/3_digit_addition/1.00 ThreeDigitAdditionSolver --agent ~/.nearai/registry/<my_agent>
Benchmarks Cache¶
Individual benchmark tasks and completions are cached in the registry or locally. To see the registry's benchmark completion caches:
To force execution and overwrite the cache, pass the --force flag:
nearai benchmark run near.ai/mbpp/1.0.0 MBPPSolverStrategy --model 'llama-3p2-1b-instruct' --subset test --force
Example runs¶
$ nearai benchmark run near.ai/mbpp/1.0.0 MBPPSolverStrategy --model 'llama-3p2-1b-instruct' --subset test
$ nearai benchmark run near.ai/mmlu/1.0.0 MMLUSolverStrategy --model 'llama-v3p1-405b-instruct' --subset test
$ nearai benchmark run near.ai/mbpp/1.0.0 MBPPSolverStrategy --model 'qwen2p5-72b-instruct' --subset test --agent ~/.nearai/registry/flatirons.near/example-travel-agent/1
$ nearai benchmark run near.ai/live_bench/1.0.0 LiveBenchSolverStrategy --model 'qwen2p5-72b-instruct' --agent ~/.nearai/registry/flatirons.near/example-travel-agent/1
Evaluations¶
Recording benchmark result as an evaluation¶
To record benchmark results as an evaluation, pass the --record flag. It is strongly recommended to do this only after verifying that the benchmark runs successfully.
$ nearai benchmark run near.ai/mbpp/1.0.0 MBPPSolverStrategy --model 'llama-3p2-1b-instruct' --subset test
Final score: 131/500 - 26.20%
$ nearai benchmark run near.ai/mbpp/1.0.0 MBPPSolverStrategy --model 'llama-3p2-1b-instruct' --subset test --record
This creates a new evaluation entry in the registry:
$ nearai registry list --category=evaluation
┌────────────────────────────────────────────────────────────────────────┬────────────┬───────────────┬────────┐
│ entry │ category │ description │ tags │
├────────────────────────────────────────────────────────────────────────┼────────────┼───────────────┼────────┤
│ alomonos.near/evaluation_mbpp_model_llama-v3p2-1b- │ evaluation │ │ │
│ instruct_provider_fireworks/0.0.1 │ │ │ │
View evaluation table¶
To view the evaluation table in the CLI:
$ nearai evaluation table --num_columns=8
$ nearai evaluation table --all_key_columns --num_columns=8
$ nearai evaluation table --all_metrics
https://app.near.ai/evaluations also lets you choose which columns to display.
Issues¶
- Overwriting an existing evaluation entry is currently not supported
- litellm.Timeout errors when running benchmark
- Feature request: tag individual evaluation metrics
- Feature request: add view for a metric
- Feature request: add cost of running benchmark to evaluation results as a separate metric
- Feature request: hide evaluation results for hidden agents and models
- Capabilities Benchmarks Tracking: list of benchmarks we want to add