MoLE - Modalix Language Model Evaluator

Overview

MoLE (Modalix Language Model Evaluator) is a benchmarking tool for evaluating the accuracy and performance of LLMs running on the Modalix platform.

It extends EleutherAI’s lm-evaluation-harness and supports two backends:

hf — runs evaluation on the host using HuggingFace transformers (baseline reference)
modalix — runs evaluation on a Modalix board via the llima benchmark-server

Installation

MoLE requires LLiMa to be installed on the Modalix device first. See Introduction to LLiMa for installation instructions.

Install MoLE on your Modalix device using the sima-cli:

host:~$ sima-cli install tools/mole

This installs MoLE into a virtual environment at ~/sima-mole-venv.

Usage

First, activate the MoLE virtual environment:

host:~$ source ~/sima-mole-venv/bin/activate

MoLE is then invoked through the llima-benchmark CLI, which provides two subcommands: accuracy and perf. For both, the <model_id> argument is always the HuggingFace model ID (e.g., meta-llama/Llama-3.2-3B-Instruct).

Accuracy Benchmarking

The accuracy subcommand evaluates model quality on standard evaluation tasks:

(sima-mole-venv) host:~$ llima-benchmark accuracy <model_id> -b modalix \
    -t <task> \
    -o <output_dir> \
    --max_num_tokens <max_num_tokens> \
    --board_ip <board_ip> \
    --board_model <model_path_on_board>

| Argument | Description |
| --- | --- |
| `model_id` | HuggingFace model ID (e.g., `meta-llama/Llama-3.2-3B-Instruct`). |
| `-b` | Backend to use: `modalix` (run on board) or `hf` (run on host as reference baseline). |
| `-t` | Required. One or more evaluation tasks. Example tasks: `hellaswag`, `triviaqa`, `piqa`, `winogrande`, `wikitext`. See the task list for all available tasks. |
| `-o` | Output directory for benchmark results. |
| `--board_ip` | IP address of the Modalix board. Required for `-b modalix`. |
| `--board_model` | Path to the compiled model directory on the Modalix device (e.g., `/media/nvme/llima/models/Llama-3.2-3B-Instruct-a16w4`). Required for `-b modalix`. |
| `--max_num_tokens` | Maximum context length. Must be equal to or smaller than the value used during compilation. |
| `-n, --num_samples` | Number of samples to evaluate. Runs the full task set if not specified. |
| `--board_ssh_user` | SSH username for the Modalix board. Optional, default: `sima`. |
| `--board_ssh_pass` | SSH password for the Modalix board. Optional. Set to enable non-interactive automated benchmarking. |
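
As a concrete illustration of the arguments above, the following sketch fills in the template with the example model ID and board model path from the table. The board IP (`192.168.1.20`), output directory, and token limit are placeholder assumptions, and `run_or_print` is a hypothetical helper defined only for this example. It prints the assembled command and executes it only when `RUN=1` is set, so the invocation can be reviewed first.

```shell
#!/bin/sh
# Sketch: a filled-in accuracy run against a Modalix board.
# BOARD_IP, OUT_DIR, and MAX_TOKENS are placeholder assumptions, and
# run_or_print is a hypothetical helper defined here, not part of MoLE.

MODEL_ID="meta-llama/Llama-3.2-3B-Instruct"
BOARD_MODEL="/media/nvme/llima/models/Llama-3.2-3B-Instruct-a16w4"
BOARD_IP="${BOARD_IP:-192.168.1.20}"
OUT_DIR="${OUT_DIR:-results/hellaswag}"
MAX_TOKENS="${MAX_TOKENS:-1024}"

# Print the command; execute it only when RUN=1 is set.
run_or_print() {
    printf '%s\n' "$*"
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    fi
}

# -n 100 limits the run to 100 samples for a quick smoke test.
run_or_print llima-benchmark accuracy "$MODEL_ID" -b modalix \
    -t hellaswag \
    -o "$OUT_DIR" \
    --max_num_tokens "$MAX_TOKENS" \
    --board_ip "$BOARD_IP" \
    --board_model "$BOARD_MODEL" \
    -n 100
```

Saved as, say, `accuracy.sh`, running `sh accuracy.sh` prints the command, and `RUN=1 sh accuracy.sh` executes it from within the activated MoLE environment. Dropping `-n 100` runs the full task set.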

Important

Accuracy benchmarking with -b modalix requires the model to be compiled with the --return_logits flag (see GenAI Model Compilation). If the model was compiled without this flag, benchmarking will fail at runtime.

To use the HuggingFace backend as a reference baseline:

(sima-mole-venv) host:~$ llima-benchmark accuracy <model_id> -b hf -t <task> -o <output_dir>
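
To compare the board against the host baseline, the same task can be run on both backends with separate output directories. The sketch below only prints the two commands. The task (`piqa`), the `results/<task>/<backend>` layout, the board IP, and the token limit are illustrative assumptions, and `accuracy_cmd` is a hypothetical helper, not part of MoLE.

```shell
#!/bin/sh
# Sketch: build matching accuracy commands for the hf and modalix backends.
# BOARD_IP, the token limit, and the results/ layout are placeholder
# assumptions; accuracy_cmd is a hypothetical helper for this example.

MODEL_ID="meta-llama/Llama-3.2-3B-Instruct"
BOARD_MODEL="/media/nvme/llima/models/Llama-3.2-3B-Instruct-a16w4"
BOARD_IP="192.168.1.20"
TASK="piqa"

# Prints the accuracy command line for one backend ($1 = hf or modalix).
accuracy_cmd() {
    backend=$1
    set -- llima-benchmark accuracy "$MODEL_ID" -b "$backend" \
        -t "$TASK" -o "results/$TASK/$backend"
    # Board-specific arguments are only needed for the modalix backend.
    if [ "$backend" = "modalix" ]; then
        set -- "$@" --max_num_tokens 1024 \
            --board_ip "$BOARD_IP" --board_model "$BOARD_MODEL"
    fi
    printf '%s\n' "$*"
}

for b in hf modalix; do
    accuracy_cmd "$b"
done
```

Keeping one output directory per backend leaves the two result sets side by side, so the baseline and board scores for the same task can be compared directly.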

For all available options, run llima-benchmark accuracy -h.

Performance Benchmarking

The perf subcommand measures Time To First Token (TTFT) and Tokens Per Second (TPS) on a Modalix board for different input lengths:

(sima-mole-venv) host:~$ llima-benchmark perf <model_id> \
    -o <output_dir> \
    --board_ip <board_ip> \
    --board_model <model_path_on_board> \
    --max_num_tokens <max_num_tokens> --max_new_tokens <max_new_tokens>

| Argument | Description |
| --- | --- |
| `model_id` | HuggingFace model ID (e.g., `meta-llama/Llama-3.2-3B-Instruct`). |
| `-o` | Output directory for benchmark results. |
| `--board_ip` | IP address of the Modalix board. |
| `--board_model` | Path to the compiled model directory on the Modalix device (e.g., `/media/nvme/llima/models/Llama-3.2-3B-Instruct-a16w4`). |
| `--max_num_tokens` | Maximum context length. Must be equal to or smaller than the value used during compilation. |
| `--max_new_tokens` | Maximum number of tokens to generate in the output. |
| `--board_ssh_user` | SSH username for the Modalix board. Optional, default: `sima`. |
| `--board_ssh_pass` | SSH password for the Modalix board. Optional. Set to enable non-interactive automated benchmarking. |
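
Because `--max_num_tokens` must not exceed the value used during compilation, a small pre-flight check can catch a mismatch before contacting the board. The sketch below is one way to do this, assuming you record the compile-time value yourself in a hypothetical `COMPILED_MAX_TOKENS` variable (MoLE is not known to expose it). The board IP, output directory, and token counts are likewise placeholders.

```shell
#!/bin/sh
# Sketch: validate --max_num_tokens before building a perf command.
# COMPILED_MAX_TOKENS and BOARD_IP are hypothetical variables introduced
# for this example; fits_compiled_limit and perf_cmd are not part of MoLE.

COMPILED_MAX_TOKENS="${COMPILED_MAX_TOKENS:-2048}"
BOARD_IP="${BOARD_IP:-192.168.1.20}"

# Succeeds when the requested context length ($1) fits the compiled limit ($2).
fits_compiled_limit() {
    [ "$1" -le "$2" ]
}

# Prints the perf command line ($1 = max_num_tokens, $2 = max_new_tokens).
perf_cmd() {
    printf '%s\n' "llima-benchmark perf meta-llama/Llama-3.2-3B-Instruct \
-o results/perf \
--board_ip $BOARD_IP \
--board_model /media/nvme/llima/models/Llama-3.2-3B-Instruct-a16w4 \
--max_num_tokens $1 --max_new_tokens $2"
}

REQUESTED=1024
if fits_compiled_limit "$REQUESTED" "$COMPILED_MAX_TOKENS"; then
    perf_cmd "$REQUESTED" 128
else
    echo "max_num_tokens $REQUESTED exceeds compiled limit $COMPILED_MAX_TOKENS" >&2
fi
```

The same check applies to the accuracy subcommand, which takes the same `--max_num_tokens` argument.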

For all available options, run llima-benchmark perf -h.