Get started with TheStage Elastic Transformers¶

Overview
- Installation
- Benchmarking quality and latency
Running inference

Overview ¶

Elastic models are models optimized by TheStage AI framework Qlip for Nvidia GPUs. Each model has 4 performance tiers:

XL: Mathematically equivalent neural network, optimized with our DNN compiler.
L: Near lossless model, with less than 0.5% degradation obtained on corresponding benchmarks.
M: Faster model, defined as averaged performance between L and S models.
S: The fastest model, with accuracy degradation less than ~2%.

Optimized models (L, M, S) can use different otpimizations as: int8/fp8/fp4 quantization, sparsification, pruning.

In this tutorial we will cover:

Install. Installation of Elastic Models: optimized models with HF interface for self-serving.
Performance tiers. Load different performance tiers from fastest to slowest: S, L, M, XL.
Benchamrking latency and quality. Using benchmarking scripts.
Single prompt inference. Getting started with model local inference.
Batch inferfence. Combining several prompts together.
Recognizing ASCII images. Funny experiment on pattern recognition with LLMs.

Before start be sure that you have:

OS: Linux
GPU: Nvidia H100 or L40s
Python: 3.10-3.12
CPU: x86-64

Installation ¶

!pip install thestage
!pip install thestage_elastic_models[nvidia] --extra-index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple

!pip install flash_attn==2.7.3 --no-build-isolation

Generate your API token on the platform and setup through CLI:

!thestage config set --api-token <YOUR_API_TOKEN>

Set your Hugging Face token:

hf_token = <YOUR_HF_TOKEN>

Check installation:

import torch
import elastic_models

elastic_models.print_available_models()

----------------------------------------------------------------------------------------------------------------------------------
Model                                              | B200        | GeForce-RTX-4090 | GeForce-RTX-5090 | H100        | L40S
----------------------------------------------------------------------------------------------------------------------------------
DavidAU/MN-GRAND-Gutenberg-Lyra4-Lyra-12B-DARKNESS |             | S, M             | S, M, L          |             |
Qwen/Qwen2.5-14B-Instruct                          |             |                  |                  | S, M, L, XL | S, M, L, XL
Qwen/Qwen2.5-7B-Instruct                           |             |                  |                  | S, M, L, XL | S, M, L, XL
black-forest-labs/FLUX.1-dev                       | S, M, L, XL |                  | S                | S, M, L, XL | S, M, L, XL
black-forest-labs/FLUX.1-schnell                   | S, M, L, XL |                  | S                | S, M, L, XL | S, M, L, XL
deepseek-ai/DeepSeek-R1-Distill-Qwen-14B           |             |                  |                  | S, M, L, XL | S, M, L, XL
deepseek-ai/DeepSeek-R1-Distill-Qwen-7B            |             |                  |                  | S, M, L, XL | S, M, L, XL
facebook/musicgen-large                            |             |                  |                  | S, M, L, XL | S, M, L, XL
genmo/mochi-1-preview                              | S, XL       |                  |                  | S, XL       |
meta-llama/Llama-3.1-8B-Instruct                   |             |                  |                  | S, M, L, XL | S, M, L, XL
meta-llama/Llama-3.2-1B-Instruct                   |             |                  |                  | S, M, L, XL | S, M, L, XL
mistralai/Mistral-7B-Instruct-v0.3                 |             |                  |                  | S, M, L, XL | S, M, L, XL
mistralai/Mistral-Nemo-Instruct-2407               |             |                  |                  | S, M, L, XL | S, M, L, XL
mistralai/Mistral-Small-3.1-24B-Instruct-2503      |             |                  |                  | S, M, L, XL | S, M, L
openai/whisper-large-v3                            |             |                  |                  | S           | S
stabilityai/stable-diffusion-xl-base-1.0           |             |                  |                  | S, XL       | XL
----------------------------------------------------------------------------------------------------------------------------------

Benchmarking quality and latency ¶

For benchmarks we have created straightforward python scripts. We just need to clone Github repo:

!git clone https://github.com/TheStageAI/ElasticModels.git
!pip install -r ElasticModels/benchmark/requirements.txt

Now we can benchmark speed of models. For each model we can select several modes: 'S', 'M', 'L', 'XL', 'original'.

Running latency benchmarks for S model:

!python ElasticModels/benchmark/benchmark_llm.py \
    --model_name meta-llama/Llama-3.1-8B-Instruct \
    --input_context small \
    --batch_size 1 \
    --mode S \
    --hf_token <YOUR_HF_TOKEN>

32:54: [ElasticModels Benchmark]: INFO: Loading model meta-llama/Llama-3.1-8B-Instruct in S mode.
Loading checkpoint shards: 100%|█████████████████| 4/4 [00:00<00:00, 130.31it/s]
Fetching elastic checkpoint: 100%|███████████| 66/66 [00:00<00:00, 28328.29it/s]
Loading elastic checkpoint: 100%|███████████████| 64/64 [00:13<00:00,  4.72it/s]
33:10: [ElasticModels Benchmark]: INFO: Model meta-llama/Llama-3.1-8B-Instruct in S mode loaded successfully.
33:10: [ElasticModels Benchmark]: INFO: Starting latency benchmark for meta-llama/Llama-3.1-8B-Instruct in S mode
33:16: [ElasticModels Benchmark]: INFO: Latency benchmark for meta-llama/Llama-3.1-8B-Instruct in S are ready:
33:16: [ElasticModels Benchmark]: INFO: tps: 187.84188146130018
33:16: [ElasticModels Benchmark]: INFO: ttft: 0.027731652022339404
33:16: [ElasticModels Benchmark]: INFO: batch_size: 1
33:16: [ElasticModels Benchmark]: INFO: input_tokens: 93
33:16: [ElasticModels Benchmark]: INFO: max_memory_usage: 17555.25
33:16: [ElasticModels Benchmark]: INFO: Latency benchmarking completed.

Running latency benchmarks for original model:

!python ElasticModels/benchmark/benchmark_llm.py \
    --model_name meta-llama/Llama-3.1-8B-Instruct \
    --input_context small \
    --batch_size 1 \
    --mode original \
    --hf_token <YOUR_HF_TOKEN>

35:46: [ElasticModels Benchmark]: INFO: Loading model meta-llama/Llama-3.1-8B-Instruct in original mode.
Loading checkpoint shards: 100%|█████████████████| 4/4 [00:00<00:00, 135.62it/s]
35:50: [ElasticModels Benchmark]: INFO: Model meta-llama/Llama-3.1-8B-Instruct in original mode loaded successfully.
35:50: [ElasticModels Benchmark]: INFO: Starting latency benchmark for meta-llama/Llama-3.1-8B-Instruct in original mode
35:59: [ElasticModels Benchmark]: INFO: Latency benchmark for meta-llama/Llama-3.1-8B-Instruct in original are ready:
35:59: [ElasticModels Benchmark]: INFO: tps: 59.091880279143794
35:59: [ElasticModels Benchmark]: INFO: ttft: 0.019373000948689878
35:59: [ElasticModels Benchmark]: INFO: batch_size: 1
35:59: [ElasticModels Benchmark]: INFO: input_tokens: 93
35:59: [ElasticModels Benchmark]: INFO: max_memory_usage: 16755.25
35:59: [ElasticModels Benchmark]: INFO: Latency benchmarking completed.

Running compiled model in bfloat16 without quantization, i.e 'XL':

!python ElasticModels/benchmark/benchmark_llm.py \
    --model_name meta-llama/Llama-3.1-8B-Instruct \
    --input_context small \
    --batch_size 1 \
    --mode XL \
    --hf_token <YOUR_HF_TOKEN>

39:47: [ElasticModels Benchmark]: INFO: Loading model meta-llama/Llama-3.1-8B-Instruct in XL mode.
Loading checkpoint shards: 100%|█████████████████| 4/4 [00:00<00:00, 127.99it/s]
Fetching elastic checkpoint: 100%|███████████| 66/66 [00:00<00:00, 25186.43it/s]
Loading elastic checkpoint: 100%|███████████████| 64/64 [00:36<00:00,  1.74it/s]
40:27: [ElasticModels Benchmark]: INFO: Model meta-llama/Llama-3.1-8B-Instruct in XL mode loaded successfully.
40:27: [ElasticModels Benchmark]: INFO: Starting latency benchmark for meta-llama/Llama-3.1-8B-Instruct in XL mode
40:33: [ElasticModels Benchmark]: INFO: Latency benchmark for meta-llama/Llama-3.1-8B-Instruct in XL are ready:
40:33: [ElasticModels Benchmark]: INFO: tps: 131.46156827035406
40:33: [ElasticModels Benchmark]: INFO: ttft: 0.02746944793034345
40:33: [ElasticModels Benchmark]: INFO: batch_size: 1
40:33: [ElasticModels Benchmark]: INFO: input_tokens: 93
40:33: [ElasticModels Benchmark]: INFO: max_memory_usage: 31089.25
40:33: [ElasticModels Benchmark]: INFO: Latency benchmarking completed.

For quality benchmarking we are using lm_eval library. To run quality benchmakrs you can specify task arg for the same script:

!python ElasticModels/benchmark/benchmark_llm.py \
    --model_name meta-llama/Llama-3.1-8B-Instruct \
    --bench_tasks 'mmlu' \
    --mode S \
    --batch_size 32 \
    --hf_token <YOUR_HF_TOKEN>

For all models you should get following results:

Metric/Model	S	M	L	XL	Original
MMLU	65.8	66.8	67.5	68.2	68.2
PIQA	77.6	79.3	79.8	79.8	79.8
Arc Challenge	50.7	50.3	52.3	51.7	51.7
Winogrande	72.5	72	73.3	73.9	73.9

Running inference ¶

Models follows the same trasnformers interface, so nothing special is required for model intialization.

import torch
from transformers import AutoTokenizer
from elastic_models.transformers import AutoModelForCausalLM

def create_elastic_model(model_name, mode='S'):
    """
    """
    device = torch.device("cuda")
    tokenizer = AutoTokenizer.from_pretrained(
        model_name, token=hf_token
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        token=hf_token,
        torch_dtype=torch.bfloat16,
        attn_implementation="sdpa",
        mode=mode
    ).to(device)

    return model, tokenizer

model_name_qwen = "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B"
model, tokenizer = create_elastic_model(model_name_qwen, mode='S')

Loading checkpoint shards: 100%|██████████| 4/4 [00:00<00:00, 56.06it/s]
Fetching elastic checkpoint: 100%|██████████| 98/98 [00:00<00:00, 342249.62it/s]
Loading elastic checkpoint: 100%|██████████| 96/96 [00:26<00:00,  3.67it/s]

Recognizing ASCII images. Funny test ¶

We will implement simple function to recognize ASCII images. There we will set a role for our LLM as an image recognizer.

def ask_about_ascii_image(model, tokenizer, image_prompt, device):
    """
    model: HF transformers model.
    image_prompt: str, ASCII image.
    """
    role_system = """
        You are an ASCII art recognizer. A simple ASCII drawing is provided below.
        Your job:
          1. Identify the single object being depicted.
          2. Explain your reasoning by pointing to specific patterns in the drawing.

        Format your response exactly as:

        Answer: <Object>
        Description: <Explain which characters form key parts of the object and how they map to its shape.>
        """

    messages = [
      {
        "role": "system",
        "content": role_system
      },
      {
        "role": "user",
        "content": image_prompt
      }
    ]

    chat_prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=False
    )
    inputs = tokenizer(chat_prompt, return_tensors="pt")
    inputs.to(device)

    with torch.inference_mode():
        generate_ids = model.generate(
            **inputs, max_length=1024,
            temperature=0.6,
            do_sample=True,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.eos_token_id
        )

    input_len = inputs['input_ids'].shape[1]
    generate_ids = generate_ids[:, input_len:]
    output = tokenizer.batch_decode(
        generate_ids,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False
    )[0]

    return output

Let’s start from the simple case, small house:

house = """
   /\
  /  \
 /____\
 | [] |
 |____|
"""
answer_house = ask_about_ascii_image(model, tokenizer, house, "cuda")
print(answer_house)

Okay, so I'm looking at this ASCII art, and I need to figure out what object it's depicting. Let me start by breaking it down piece by piece. The drawing is made up of several lines with different characters.
First, the top part is " /", which looks like a peak or a roof. Then the next line is "/  ", which seems like two diagonal lines meeting at the bottom, maybe forming a V shape. Following that, there's "/____". The underscores here make me think of a horizontal line, so this line looks like a wider base with diagonal lines on either side. So, putting the first three lines together, it seems like some sort of container or maybe a house with a roof.
Moving on to the fourth line: " | [] | ". Here, the vertical bars on either side could represent walls or sides, and the "[]" in the middle looks like a door or a window. It's a bit small, but it's a clear square, so that makes me think of a door.
The last line is " |____| ", which again has vertical bars on the sides and underscores in the middle. This looks like a base or a foundation, maybe the floor of whatever this object is. The underscores suggest something solid and flat.
Putting all these parts together, the roof, walls, door, and base—this seems to fit the description of a house. The roof is made with slashes and underscores, the walls are the vertical bars with a door in the middle, and the base reinforces the structure. There's no indication of anything else, like a person or an animal, so it's more likely a simple house.
I don't see any other common shapes or objects that fit these elements. It's not a car because there are no wheels or specific details like that. It's not a tree because there's no trunk or leaves. So, the most logical conclusion is that this ASCII art is a house.
</think>

Answer: House
Description: The ASCII art depicts a house with a peaked roof formed by the characters "/", followed by walls made of vertical bars " | " and a door represented by "[]". The base of the house is suggested by the line " |____| ", creating a simple, recognizable structure.

The next would be bigger image of a cat:

cat = """
 /\     /\
{  `---'  }
{  O   O  }
~~>  V  <~~
 \  \|/  /
  `-----'____
  /     \    \_
 {       }\  )_\_   _
 |  \_/  |/ /  \_\_/ )
  \__/  /(_/     \__/
    (__/
"""
answer_cat = ask_about_ascii_image(model, tokenizer, cat, "cuda")
print(answer_cat)

Okay, so I'm looking at this ASCII art, and I need to figure out what object it's depicting. Let me start by examining the structure and the patterns of the characters used.
First, I notice that the ASCII art has several lines with different symbols. The top part has a structure that looks like a peaked roof, made up of '/' and '' characters. That peaked shape might suggest something like a roof of a building or maybe an animal's head. But looking further down, I see more details that could help clarify.
In the middle of the ASCII art, there are two 'O's placed next to each other, enclosed within curly braces. That makes me think of eyes, like in a face. So maybe it's an animal or a character with a face. The 'O's are often used to represent eyes in ASCII art.
Moving down, there's a line with a '>' and a 'V', which are symbols that can indicate direction or movement. The '>' is pointing to the right, and the 'V' is a downward-pointing triangle. This could suggest movement or a specific feature like a tail or a beak.
The next line has a mix of underscores and slashes, creating a wavy or segmented line. This might represent a mouth or some kind of ground. The underscores are commonly used to denote lines in ASCII art, so maybe this is part of the face or a base.
Looking further down, I see parentheses and more slashes. The parentheses are used here in a way that might form a body or a tail. The '/' and '' characters are creating curves and angles, which could be part of a body or legs.
The last few lines have a structure that looks like a tail or a limb, with the use of underscores and slashes. The repetition of these symbols suggests a pattern, maybe indicating movement or a flowing part of the object.
Putting it all together, the combination of a peaked roof, eyes, a mouth, and a tail-like structure makes me think of a cat. Cats are often depicted in ASCII art with these features, especially the face with eyes and a playful tail. The movement indicators like '>' and 'V' could represent the cat's active nature.
I'm pretty confident it's a cat because the elements like the peaked head, the eyes, and the tail all align with how a cat is typically portrayed in ASCII. The use of slashes and underscores to create curves and lines also supports this idea, as cats have smooth and flowing bodies.
</think>

The ASCII art depicts a cat. The peaked roof suggests a head, the 'O's are the eyes, the '>' and 'V' represent movement, and the wavy lines and slashes form a body and tail, characteristic of a cat.

Sometimes network misclassify image, suggesting “frog” or “bird” instead of “cat.” This happens because it try to factorize the task to understand geometric structure. The network first recognizes features on each row, then attempts to build image understanding through sequential combination of these recognized features. It doesn’t process 2D structure simultaneously. Interestingly, self-attention for ASCII images would resemble a heatmap of that image, creating a spatial representation inside the transformer. This suggests it might be relatively straightforward to fine-tune an LLM for ASCII image recognition. While this has limited practical applications, it provides valuable insights into the natural visual perception capabilities of LLMs.

Now lets try more complicated case:

gorilla = """
                _
            ,.-" "-.,
           /   ===   \
          /  =======  \
       __|  (o)   (0)  |__
      / _|    .---.    |_ \
     | /.----/ O O \----.\ |
      \/     |     |     \/
      |                   |
      |                   |
      |                   |
      _\   -.,_____,.-   /_
  ,.-"  "-.,_________,.-"  "-.,
 /          |       |          \
|           l.     .l           |
|            |     |            |
l.           |     |           .l
 |           l.   .l           | \,
 l.           |   |           .l   \,
  |           |   |           |      \,
  l.          |   |          .l        |
   |          |   |          |         |
   |          |---|          |         |
   |          |   |          |         |
   /"-.,__,.-"\   /"-.,__,.-"\"-.,_,.-"\
  |            \ /            |         |
  |             |             |         |
   \__|__|__|__/ \__|__|__|__/ \_|__|__/

"""
answer_gorilla = ask_about_ascii_image(model, tokenizer, gorilla, "cuda")
print(answer_gorilla)

Okay, so I'm trying to figure out what object is depicted in this ASCII art. Let me look at the characters and patterns carefully.
First, I notice a lot of underscores and other lines, which might indicate some sort of outline or structure. There are also repeated patterns with slashes and backslashes, which could be forming curves or rounded edges. The presence of parentheses like (o) and (0) might represent eyes or features of a face.
Looking closer, there are multiple sections with repeated characters. For example, the lines with "/ _" and "_" could be forming the outline of a head or a face. The eyes are probably the (o) or (0) characters, which are common in ASCII faces.
I also see some parts with "====" which might be a mouth or a line in the face. The repeated sections with "/" and "/" could be indicating hair or maybe ears. The structure seems symmetrical, which is typical in faces or heads.
There are also some parts with pipes "|" and dots ".", which might be detailing the face, like features or the inside of the eyes. The overall shape seems to have a rounded top and a more angular bottom, which fits the profile of a face with some details.
Putting it all together, the repeated use of eyes, a possible mouth, and symmetrical lines suggests that the object is a face or head. The various characters are forming the different parts of a face, like eyes, mouth, and outline.
</think>

The ASCII art represents a face or head, characterized by its symmetrical use of characters like (o) for eyes, underscores for outlines, and slashes for features. The repeated patterns and symmetrical structure further support this depiction.

Answer: Face
Description: The ASCII art depicts a face using characters such as (o) for eyes, underscores for outlines, and slashes for features. The symmetrical patterns and repeated sections suggest a face with eyes, a mouth, and a head shape.

Batching prompts for faster generation. Testing on simple math problems ¶

Batching of imput prompts allows to utilize GPU better and provides significant speedup comparing to sequential generation. For batch processing we will pad sequences with to make them have same size and create attention mask to skip attention for added padding tokens. Here we are creating function for batched generation:

def batch_process(model, tokenizer, prompts, device, do_sample=True):

    model.generation_config.pad_token_id = None
    chat_messages_batch = [
        [
            {
                "role": "system",
                "content": "You are a helpful AI assistant."
                "Be direct and concise in your responses.",
            },
            {
                "role": "user",
                "content": prompt,
            },
        ]
        for prompt in prompts
    ]

    # Tokenize inputs using chat template formatting
    tokenized_inputs = tokenizer.apply_chat_template(
        chat_messages_batch, add_generation_prompt=True,
        padding=True, return_tensors="pt",
    ).to(device)
    attention_mask = (tokenized_inputs != tokenizer.pad_token_id).long()

    with torch.inference_mode():
        generate_ids = model.generate(
            input_ids=tokenized_inputs,
            attention_mask=attention_mask,
            max_length=1024,
            temperature=0.6,
            do_sample=do_sample,
            pad_token_id=tokenizer.eos_token_id
        )

    input_len = tokenized_inputs.shape[1]
    generate_ids = generate_ids[:, input_len:]
    output = tokenizer.batch_decode(
        generate_ids, skip_special_tokens=True,
        clean_up_tokenization_spaces=False
    )

    return output

Math testing ¶

For test we will create 2 simple math problems:

questions = [
    'Solve equation: x^2 + x = 0. Give answer in a format "Answer: $x=..., x=...$"',
    'What is the result of 9.9 - 9.11? Give answer as "Answer: decimal number".'
]

from IPython.display import Markdown, display

out = batch_process(model, tokenizer, questions, "cuda")

box_style = """
<div style="
    border: 1px solid #fff845;
    border-radius: 8px;
    padding: 10px;
    ">
    {} {}
</div>
"""

display(Markdown(box_style.format("<h4>❓Question</h4>", questions[0])))
display(Markdown(box_style.format("<h4>📝Answer</h4>", out[0])))
display(Markdown('-------------------'))
display(Markdown(box_style.format("<h4>❓Question</h4>", questions[1])))
display(Markdown(box_style.format("<h4>📝Answer</h4>", out[1])))

Question

Solve equation: x^2 + x = 0. Give answer in a format “Answer: $x=…, x=…$”

Answer

Okay, so I have this equation to solve: x squared plus x equals zero. Hmm, let me think about how to approach this. I remember from algebra that when you have a quadratic equation like this, you can sometimes factor it or use the quadratic formula. Let me try factoring first because that’s usually quicker if it works.

The equation is x² + x = 0. I notice that both terms have an x in them. So maybe I can factor out an x. If I do that, I’d get x times (x + 1) equals zero. That seems right because when I distribute the x back in, it’s x² + x, which matches the original equation.

Now, according to the zero product property, if the product of two factors is zero, then at least one of the factors must be zero. So that means either x = 0 or x + 1 = 0. Solving these separately, if x = 0, that’s straightforward. If x + 1 = 0, then subtracting 1 from both sides gives x = -1.

So, the solutions are x = 0 and x = -1. Let me just double-check by plugging them back into the original equation. For x = 0: 0² + 0 = 0, which is true. For x = -1: (-1)² + (-1) = 1 - 1 = 0, which also works.

I think that’s all the solutions. The equation is quadratic, so there should be two roots, and I found both. I don’t think I made any mistakes here, but just to be thorough, I can also use the quadratic formula as another method to confirm.

The quadratic formula is x = [-b ± sqrt(b² - 4ac)] / (2a). In this equation, a = 1, b = 1, and c = 0. Plugging those in: x = [-1 ± sqrt(1 - 0)] / 2 = [-1 ± 1]/2. That gives two solutions: (-1 + 1)/2 = 0/2 = 0, and (-1 - 1)/2 = -2/2 = -1. Yep, same answers as before. So I’m confident that x = 0 and x = -1 are the correct solutions.

The equation x² + x = 0 can be solved by factoring out an x, resulting in x(x + 1) = 0. Applying the zero product property gives the solutions x = 0 and x = -1.

Answer: x=0, x=-1

Question

What is the result of 9.9 - 9.11? Give answer as “Answer: decimal number”.

Answer

Okay, so I need to figure out the result of 9.9 minus 9.11. Hmm, let’s see. I’m a bit rusty on decimal subtraction, but I think I can handle this.

First, I should probably write down the numbers to make it clearer. So, it’s 9.9 minus 9.11. I know that when subtracting decimals, it’s important to line up the decimal points to ensure each place value is correctly aligned.

Let me write them one under the other:

  9.9
-9.11

Wait, but 9.9 has only one decimal place, while 9.11 has two. I think I should add a zero to 9.9 to make it 9.90 so that both numbers have the same number of decimal places. That way, I can subtract them properly.

So now it looks like:

  9.90
-9.11

Alright, starting from the rightmost digit, which is the hundredths place. 0 minus 1. Hmm, I can’t subtract 1 from 0, so I’ll need to borrow from the tenths place. The 9 in the tenths place becomes 8, and the 0 in the hundredths place becomes 10.

So now, 10 minus 1 is 9. That’s the hundredths place.

Next, moving to the tenths place: 8 minus 1 is 7.

Finally, the ones place: 9 minus 9 is 0.

Putting it all together, I get 0.79.

Wait, does that make sense? Let me double-check by adding 9.11 and 0.79 to see if I get back to 9.9.

9.11 plus 0.79 is 9.90, which is the same as 9.9. That seems right.

So, the result of 9.9 minus 9.11 is 0.79.

The result of 9.9 minus 9.11 is 0.79.

Answer: 0.79

Inference time batched vs sequential ¶

import time

t = time.time()
out = batch_process(model, tokenizer, questions, "cuda", do_sample=False)
t = time.time() - t

print("Time batched: ", t)

t = time.time()
out = batch_process(model, tokenizer, [questions[0]], "cuda", do_sample=False)
out = batch_process(model, tokenizer, [questions[1]], "cuda", do_sample=False)
t = time.time() - t

print("Time sequential: ", t)

Time batched:  4.177047967910767
Time sequential:  9.199042320251465

We can see that sequential runs are about 2x slower than batched inference! And increasing batch size can yield even greater speedups!

Get started with TheStage Elastic Transformers
- Overview
  - Installation
  - Benchmarking quality and latency
- Running inference