Benchmarking latency and quality

Benchmarking is an essential part of the model development process, allowing researchers to evaluate the performance and quality of their models under various conditions. In this section, we will cover how to benchmark both the latency and quality of elastic models.

To run the benchmarks, clone the benchmark scripts from the TheStageAI/ElasticModels repository:

git clone https://github.com/TheStageAI/ElasticModels.git
cd ElasticModels/benchmark
pip install -r requirements.txt

Note

The benchmark scripts are actively improved and new features are added regularly. Currently, quality benchmarks are available for text-to-text models; latency benchmarks are available for all models.

Language models

Language models support both latency and quality benchmarking. Latency benchmarks report tokens per second and time to first token.

Benchmarking latency

For latency benchmarking you can configure:

Parameters for LLM latency benchmarking

  • model_name: name of the model to benchmark, e.g. mistralai/Mistral-7B-Instruct-v0.3

  • input_context: size of the input context. Supported values:

    • small: 100 tokens

    • medium: 1000 tokens

    • large: 4000 tokens

  • batch_size: batch size for the latency benchmark

  • mode: elastic mode of the model; one of S, M, L, XL, original

  • hf_token: Hugging Face token for accessing the model

  • cache_dir: Hugging Face cache directory for storing the model weights

Returned metrics

  • tps: throughput in tokens per second

  • ttft: time to first token in seconds

  • max_memory_usage: maximum memory usage in MB
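
These metrics compose into a rough end-to-end estimate: total generation time is approximately ttft plus the number of output tokens divided by tps. As a quick illustration using the L40s numbers from the run below (tps ≈ 77, ttft ≈ 0.036 s), generating 256 tokens at batch size 1 takes about 3.4 seconds:

# Rough end-to-end estimate: ttft + output_tokens / tps
awk 'BEGIN { printf "%.2f s\n", 0.036 + 256 / 77 }'   # prints 3.36 s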

Example running the latency benchmark on Mistral:

# Supported modes: S, M, L, XL, original
python benchmark_llm.py \
    --model_name mistralai/Mistral-7B-Instruct-v0.3 \
    --input_context small \
    --batch_size 1 \
    --mode S \
    --hf_token <your_hf_token> \
    --cache_dir <your_hf_cache_dir>

Output on the L40s GPU:

2025-08-06 17:27:47: [ElasticModels Benchmark]: INFO: Loading model mistralai/Mistral-7B-Instruct-v0.3 in S mode.
Loading checkpoint shards: 100%|████████████████████████████████████████████████████| 3/3 [00:00<00:00, 62.85it/s]
Fetching elastic checkpoint: 100%|██████████████████████████████████████████████████| 66/66 [00:00<00:00, 33895.44it/s]
Loading elastic checkpoint: 100%|███████████████████████████████████████████████████| 64/64 [00:15<00:00,  4.01it/s]
2025-08-06 17:28:05: [ElasticModels Benchmark]: INFO: Model mistralai/Mistral-7B-Instruct-v0.3 in S mode loaded successfully.
2025-08-06 17:28:05: [ElasticModels Benchmark]: INFO: Starting latency benchmark for mistralai/Mistral-7B-Instruct-v0.3 in S mode
2025-08-06 17:28:15: [ElasticModels Benchmark]: INFO: latency benchmark for mistralai/Mistral-7B-Instruct-v0.3 in S are ready:
2025-08-06 17:28:15: [ElasticModels Benchmark]: INFO: tps: 76.93793246821232
2025-08-06 17:28:15: [ElasticModels Benchmark]: INFO: ttft: 0.03589480509981513
2025-08-06 17:28:15: [ElasticModels Benchmark]: INFO: batch_size: 1
2025-08-06 17:28:15: [ElasticModels Benchmark]: INFO: input_tokens: 102
2025-08-06 17:28:15: [ElasticModels Benchmark]: INFO: max_memory_usage: 15825.75
2025-08-06 17:28:15: [ElasticModels Benchmark]: INFO: latency benchmarking completed.
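
To see how context length affects these numbers, you can sweep the input_context values in a small shell loop. This is a minimal sketch, assuming the flags shown above and that HF_TOKEN and HF_CACHE_DIR are set in your environment; the per-run log filenames are purely illustrative:

# Sweep context sizes in S mode, saving each run's output to its own log
for ctx in small medium large; do
    python benchmark_llm.py \
        --model_name mistralai/Mistral-7B-Instruct-v0.3 \
        --input_context "$ctx" \
        --batch_size 1 \
        --mode S \
        --hf_token "$HF_TOKEN" \
        --cache_dir "$HF_CACHE_DIR" \
        | tee "mistral_S_${ctx}_bs1.log"
done

# Pull the headline numbers back out of the logs
grep -E 'tps|ttft|max_memory_usage' mistral_S_*_bs1.log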

Benchmarking quality

Computed quality benchmarks for each model are provided in the Hugging Face model cards (see the Transformers collection). To run quality benchmarking of LLMs using the Hugging Face lm-eval library, use the script benchmark/benchmark_llm.py:

# bench_tasks: any tasks from the lm_eval library
# Supported modes: S, M, L, XL, original
python benchmark_llm.py \
    --model_name mistralai/Mistral-7B-Instruct-v0.3 \
    --bench_tasks 'mmlu' 'arc_challenge' 'piqa' \
    --mode S \
    --hf_token <your_hf_token>
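
To compare quality across elastic modes, the same script can be run once per mode. A minimal sketch under the same assumptions as above (flags as shown, HF_TOKEN set in the environment; log naming is illustrative):

# Run the same lm_eval task suite for every elastic mode plus the original
for mode in S M L XL original; do
    python benchmark_llm.py \
        --model_name mistralai/Mistral-7B-Instruct-v0.3 \
        --bench_tasks 'mmlu' 'arc_challenge' 'piqa' \
        --mode "$mode" \
        --hf_token "$HF_TOKEN" \
        | tee "mistral_quality_${mode}.log"
done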

Diffusion models

To run latency benchmarking of diffusion models:

python benchmark_diffusers.py \
    --model_name black-forest-labs/FLUX.1-schnell \
    --device cuda:0 \
    --mode S \
    --hf_token <your_hf_token> \
    --cache_dir <your_hf_cache_dir>

Output on the L40s GPU:

19:49:47: [ElasticModels Benchmark]: INFO: Model black-forest-labs/FLUX.1-schnell in S mode loaded successfully.
19:49:47: [ElasticModels Benchmark]: INFO: Starting latency benchmark for black-forest-labs/FLUX.1-schnell in S mode
19:49:56: [ElasticModels Benchmark]: INFO: Latency benchmark for black-forest-labs/FLUX.1-schnell in S mode are ready:
19:49:56: [ElasticModels Benchmark]: INFO: time: 1.4327480830810964
19:49:56: [ElasticModels Benchmark]: INFO: batch_size: 1
19:49:56: [ElasticModels Benchmark]: INFO: width: 1024
19:49:56: [ElasticModels Benchmark]: INFO: height: 1024
19:49:56: [ElasticModels Benchmark]: INFO: max_memory_usage: 29407.75
19:49:56: [ElasticModels Benchmark]: INFO: Latency benchmarking completed.

To give an example of the results, the table below reports inference time in seconds for the FLUX.1-schnell model on different GPUs (lower is better). For instance, on the H100 GPU the S mode is more than 2x faster than the original model, the M mode is about 1.8x faster, the L mode 1.6x, and the XL mode roughly 1.5x:

GPU / Model    S       M       L       XL      Original, BF16
H100           0.5     0.57    0.65    0.7     1.04
L40s           1.4     1.6     1.9     2.1     2.5
RTX 5090       0.94    -       -       -       -
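
The speedup of a mode is simply the original time divided by the mode's time. A quick check of the H100 row (values taken from the table above):

# Speedup of each elastic mode vs. the original BF16 model (H100 row)
orig=1.04
for entry in S:0.5 M:0.57 L:0.65 XL:0.7; do
    awk -v o="$orig" -v e="$entry" 'BEGIN {
        split(e, a, ":")
        printf "%s: %.2fx faster\n", a[1], o / a[2]
    }'
done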

Whisper model

To run latency benchmarking of the Whisper model:

# Supported modes: S, M, L, XL, original
python benchmark_whisper.py \
    --model_name openai/whisper-large-v3 \
    --mode S \
    --batch_size 1 \
    --chunk_length 30 \
    --hf_token <your_hf_token> \
    --cache_dir <your_hf_cache_dir>

Output on the L40s GPU:

19:27:05: [ElasticModels Benchmark]: INFO: Loading model openai/whisper-large-v3 in S mode.
19:27:20: [ElasticModels Benchmark]: INFO: Model openai/whisper-large-v3 in S mode loaded successfully.
19:27:28: [ElasticModels Benchmark]: INFO: Latency benchmark for openai/whisper-large-v3 in S are ready:
19:27:28: [ElasticModels Benchmark]: INFO: tps: 199.6423047802856
19:27:28: [ElasticModels Benchmark]: INFO: ttft: 0.08691206807270646
19:27:28: [ElasticModels Benchmark]: INFO: batch_size: 1
19:27:28: [ElasticModels Benchmark]: INFO: max_memory_usage: 4205.75
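
Throughput typically grows with batch size at the cost of memory, so it is worth sweeping batch_size on your target GPU. A minimal sketch under the same assumptions as above (flags as shown, HF_TOKEN and HF_CACHE_DIR set in the environment):

# Sweep batch sizes for whisper-large-v3 in S mode
for bs in 1 2 4 8; do
    python benchmark_whisper.py \
        --model_name openai/whisper-large-v3 \
        --mode S \
        --batch_size "$bs" \
        --chunk_length 30 \
        --hf_token "$HF_TOKEN" \
        --cache_dir "$HF_CACHE_DIR" \
        | tee "whisper_S_bs${bs}.log"
done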