Benchmarking latency and quality¶
Benchmarking is an essential part of the model development process, allowing researchers to evaluate the performance and quality of their models under various conditions. In this section, we will cover how to benchmark both the latency and quality of elastic models.
To run benchmarks, clone the benchmark scripts from the TheStageAI/ElasticModels repository:
git clone https://github.com/TheStageAI/ElasticModels.git
cd ElasticModels/benchmark
pip install -r requirements.txt
Note
The benchmark scripts are actively being improved and new features are added regularly. Currently, quality benchmarks are available for text-to-text models; latency benchmarks are available for all models.
Language models¶
Language models support both latency and quality benchmarking. For latency benchmarks we report tokens per second (tps) and time to first token (ttft).
Benchmarking latency¶
For latency benchmarking you can configure:
Parameters for LLM latency benchmarking
model_name: name of the model to benchmark, e.g. mistralai/Mistral-7B-Instruct-v0.3
input_context: size of the input context. You can configure
small: 100 tokens
medium: 1000 tokens
large: 4000 tokens
batch_size: batch size for the latency benchmark
mode: mode of the model, e.g. S, M, L, XL, original
hf_token: Hugging Face token for accessing the model
cache_dir: Hugging Face cache directory for storing the model weights
Returned metrics (see the sketch after this list for how tps and ttft are typically measured)
tps: throughput in tokens per second
ttft: time to first token in seconds
max_memory_usage: maximum memory usage in MB
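If you want to reproduce these metrics outside the provided script, the snippet below is a minimal, illustrative sketch of how tps and ttft can be measured for any Hugging Face causal LM. It does not mirror the internals of benchmark_llm.py; the model name and prompt are placeholders.
# Minimal sketch of measuring tps and ttft for a Hugging Face causal LM.
# Illustrative only; the actual benchmark_llm.py logic may differ.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.3"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda")

inputs = tokenizer("Explain benchmarking in one paragraph.", return_tensors="pt").to(model.device)

# ttft: time until the first new token is available (prefill + one decode step)
torch.cuda.synchronize()
start = time.perf_counter()
model.generate(**inputs, max_new_tokens=1, do_sample=False)
torch.cuda.synchronize()
ttft = time.perf_counter() - start

# tps: number of generated tokens divided by total generation time
n_new = 256
torch.cuda.synchronize()
start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=n_new, min_new_tokens=n_new, do_sample=False)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start
tps = (out.shape[1] - inputs["input_ids"].shape[1]) / elapsed

print(f"ttft: {ttft:.4f} s, tps: {tps:.1f} tokens/s")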
Example: running the benchmark on Mistral:
# --mode supports: S, M, L, XL, original
python benchmark_llm.py \
--model_name mistralai/Mistral-7B-Instruct-v0.3 \
--input_context small \
--batch_size 1 \
--mode S \
--hf_token <your_hf_token> \
--cache_dir <your_hf_cache_dir>
Output on the L40s GPU:
2025-08-06 17:27:47: [ElasticModels Benchmark]: INFO: Loading model mistralai/Mistral-7B-Instruct-v0.3 in S mode.
Loading checkpoint shards: 100%|████████████████████████████████████████████████████| 3/3 [00:00<00:00, 62.85it/s]
Fetching elastic checkpoint: 100%|██████████████████████████████████████████████████| 66/66 [00:00<00:00, 33895.44it/s]
Loading elastic checkpoint: 100%|███████████████████████████████████████████████████| 64/64 [00:15<00:00, 4.01it/s]
2025-08-06 17:28:05: [ElasticModels Benchmark]: INFO: Model mistralai/Mistral-7B-Instruct-v0.3 in S mode loaded successfully.
2025-08-06 17:28:05: [ElasticModels Benchmark]: INFO: Starting latency benchmark for mistralai/Mistral-7B-Instruct-v0.3 in S mode
2025-08-06 17:28:15: [ElasticModels Benchmark]: INFO: latency benchmark for mistralai/Mistral-7B-Instruct-v0.3 in S are ready:
2025-08-06 17:28:15: [ElasticModels Benchmark]: INFO: tps: 76.93793246821232
2025-08-06 17:28:15: [ElasticModels Benchmark]: INFO: ttft: 0.03589480509981513
2025-08-06 17:28:15: [ElasticModels Benchmark]: INFO: batch_size: 1
2025-08-06 17:28:15: [ElasticModels Benchmark]: INFO: input_tokens: 102
2025-08-06 17:28:15: [ElasticModels Benchmark]: INFO: max_memory_usage: 15825.75
2025-08-06 17:28:15: [ElasticModels Benchmark]: INFO: latency benchmarking completed.
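To compare all modes of a model side by side, you can simply invoke the script once per mode. The sketch below is a hypothetical helper that uses only the CLI flags documented above; the token and cache directory placeholders must be filled in.
# Sweep all supported modes for one model by calling benchmark_llm.py repeatedly.
import subprocess

for mode in ["S", "M", "L", "XL", "original"]:
    subprocess.run(
        [
            "python", "benchmark_llm.py",
            "--model_name", "mistralai/Mistral-7B-Instruct-v0.3",
            "--input_context", "small",
            "--batch_size", "1",
            "--mode", mode,
            "--hf_token", "<your_hf_token>",
            "--cache_dir", "<your_hf_cache_dir>",
        ],
        check=True,
    )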
Benchmarking quality¶
Precomputed quality benchmarks for each model are provided in the corresponding HF model card: Transformers collection.
To run quality benchmarking of LLMs using the Hugging Face lm-eval library, use the benchmark/benchmark_llm.py script:
# --bench_tasks lists tasks from the lm_eval library
# --mode supports: S, M, L, XL, original
python benchmark_llm.py \
--model_name mistralai/Mistral-7B-Instruct-v0.3 \
--bench_tasks 'mmlu' 'arc_challenge' 'piqa' \
--mode S \
--hf_token <your_hf_token>
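When comparing a compressed mode against the original model, a convenient summary is the fraction of quality retained per task. The sketch below uses hypothetical per-task accuracies; replace them with the scores reported by the lm-eval runs above.
# Hypothetical per-task accuracies from two lm-eval runs (original vs. S mode).
original = {"mmlu": 0.62, "arc_challenge": 0.57, "piqa": 0.81}
mode_s = {"mmlu": 0.61, "arc_challenge": 0.56, "piqa": 0.80}

for task, base in original.items():
    retained = 100.0 * mode_s[task] / base
    print(f"{task}: {retained:.1f}% of original quality retained")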
Diffusion models¶
To run latency benchmarking of the diffusion models:
python benchmark_diffusers.py \
--model_name black-forest-labs/FLUX.1-schnell \
--device cuda:0 \
--mode S \
--hf_token <your_hf_token> \
--cache_dir <your_hf_cache_dir>
Output on L40s GPU:
19:49:47: [ElasticModels Benchmark]: INFO: Model black-forest-labs/FLUX.1-schnell in S mode loaded successfully.
19:49:47: [ElasticModels Benchmark]: INFO: Starting latency benchmark for black-forest-labs/FLUX.1-schnell in S mode
19:49:56: [ElasticModels Benchmark]: INFO: Latency benchmark for black-forest-labs/FLUX.1-schnell in S mode are ready:
19:49:56: [ElasticModels Benchmark]: INFO: time: 1.4327480830810964
19:49:56: [ElasticModels Benchmark]: INFO: batch_size: 1
19:49:56: [ElasticModels Benchmark]: INFO: width: 1024
19:49:56: [ElasticModels Benchmark]: INFO: height: 1024
19:49:56: [ElasticModels Benchmark]: INFO: max_memory_usage: 29407.75
19:49:56: [ElasticModels Benchmark]: INFO: Latency benchmarking completed.
To illustrate the results, the table below shows latency benchmarks of the FLUX.1-schnell model on different GPUs (inference time in seconds for a 1024x1024 image at batch size 1). For instance, on an H100 GPU the FLUX S model is more than 2x faster than the original model, the M model is roughly 1.8x faster, the L model roughly 1.6x faster, and the XL model roughly 1.5x faster (see the sketch after the table for the arithmetic):
GPU/Model | S | M | L | XL | Original, BF16
---|---|---|---|---|---
H100 | 0.5 | 0.57 | 0.65 | 0.7 | 1.04
L40s | 1.4 | 1.6 | 1.9 | 2.1 | 2.5
RTX 5090 | 0.94 | | | |
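The speedups quoted above follow directly from the table: divide the original BF16 time by the elastic mode's time. A minimal sketch of that arithmetic for the H100 row:
# Speedup of each elastic mode relative to the original BF16 model,
# using the H100 times from the table above (seconds per image).
times = {"S": 0.5, "M": 0.57, "L": 0.65, "XL": 0.7, "original": 1.04}

for mode in ["S", "M", "L", "XL"]:
    speedup = times["original"] / times[mode]
    print(f"{mode}: {speedup:.2f}x faster than original")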
Whisper model¶
To run latency benchmarking of the Whisper model:
# --mode supports: S, original
python benchmark_whisper.py \
--model_name openai/whisper-large-v3 \
--mode S \
--batch_size 1 \
--hf_token <your_hf_token> \
--cache_dir <your_hf_cache_dir>
Output on L40S GPU:
19:27:05: [ElasticModels Benchmark]: INFO: Loading model openai/whisper-large-v3 in S mode.
19:27:20: [ElasticModels Benchmark]: INFO: Model openai/whisper-large-v3 in S mode loaded successfully.
19:27:28: [ElasticModels Benchmark]: INFO: Latency benchmark for openai/whisper-large-v3 in S are ready:
19:27:28: [ElasticModels Benchmark]: INFO: tps: 199.6423047802856
19:27:28: [ElasticModels Benchmark]: INFO: ttft: 0.08691206807270646
19:27:28: [ElasticModels Benchmark]: INFO: batch_size: 1
19:27:28: [ElasticModels Benchmark]: INFO: max_memory_usage: 4205.75
Musicgen model¶
To run latency benchmarking of the Musicgen model:
# --mode supports: S, M, L, XL, original
python benchmark_musicgen.py \
--model_name facebook/musicgen-large \
--mode S \
--batch_size 1 \
--hf_token <your_hf_token> \
--cache_dir <your_hf_cache_dir>
Output on L40S GPU:
19:30:39: [ElasticModels Benchmark]: INFO: Loading model facebook/musicgen-large in S mode.
19:31:41: [ElasticModels Benchmark]: INFO: Model facebook/musicgen-large in S mode loaded successfully.
19:31:41: [ElasticModels Benchmark]: INFO: Starting latency benchmark for facebook/musicgen-large in S mode
19:31:58: [ElasticModels Benchmark]: INFO: Latency benchmark for facebook/musicgen-large in S are ready:
19:31:58: [ElasticModels Benchmark]: INFO: tps: 98.85491494717672
19:31:58: [ElasticModels Benchmark]: INFO: ttft: 0.0876833382062614
19:31:58: [ElasticModels Benchmark]: INFO: batch_size: 1
19:31:58: [ElasticModels Benchmark]: INFO: max_memory_usage: 6639.75
19:31:58: [ElasticModels Benchmark]: INFO: Latency benchmarking completed.