
A Getting-Started Guide to the NVIDIA Transformer Engine: Mixed Precision, FP8 Testing, Benchmarking, and Fallback Support

In this tutorial, we build a practical, end-to-end implementation of the NVIDIA Transformer Engine in Python, focusing on how to evaluate mixed-precision speedups and accuracy in a realistic deep learning workflow. We set up the environment, verify GPU and CUDA readiness, attempt to install the necessary Transformer Engine components, and handle compatibility issues gracefully so that the notebook remains functional even when the full extension cannot be built. Step by step, we build teacher and student networks, compare a plain PyTorch path with a Transformer Engine path, train both models, measure their speed and memory usage, and visualize the results, giving a clear picture of how a performance-oriented training workflow is assembled in practice.

import os
import sys
import json
import time
import math
import random
import shutil
import platform
import subprocess
import statistics


def run(cmd, check=True):
   print("\n[RUN]", " ".join(cmd))
   result = subprocess.run(cmd, text=True, capture_output=True)
   if result.stdout.strip():
       print(result.stdout[-4000:])
   if result.returncode != 0 and result.stderr.strip():
       print(result.stderr[-4000:])
   if check and result.returncode != 0:
       raise subprocess.CalledProcessError(result.returncode, cmd)
   return result


def has_cmd(name):
   return shutil.which(name) is not None


run([sys.executable, "-m", "pip", "install", "-q", "--upgrade", "pip"])
run([sys.executable, "-m", "pip", "install", "-q", "ninja", "packaging", "matplotlib"])


import torch
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt


assert torch.cuda.is_available(), "This notebook needs a GPU runtime in Colab."


gpu_name = torch.cuda.get_device_name(0)
cc_major, cc_minor = torch.cuda.get_device_capability(0)
cuda_runtime = torch.version.cuda
python_version = sys.version.split()[0]
torch_version = torch.__version__
cuda_home = os.environ.get("CUDA_HOME", "/usr/local/cuda")
nvcc_path = shutil.which("nvcc") or os.path.join(cuda_home, "bin", "nvcc")
cudnn_header_candidates = [
   os.path.join(cuda_home, "include", "cudnn.h"),
   "/usr/include/cudnn.h",
   "/usr/local/include/cudnn.h",
]


nvcc_exists = os.path.exists(nvcc_path)
cudnn_header_exists = any(os.path.exists(p) for p in cudnn_header_candidates)


print("=" * 120)
print("ENVIRONMENT CHECK")
print("=" * 120)
print(json.dumps({
   "python": python_version,
   "platform": platform.platform(),
   "torch": torch_version,
   "torch_cuda": cuda_runtime,
   "gpu_name": gpu_name,
   "compute_capability": f"{cc_major}.{cc_minor}",
   "cuda_home": cuda_home,
   "nvcc_exists": nvcc_exists,
   "nvcc_path": nvcc_path if nvcc_exists else None,
   "cudnn_header_exists": cudnn_header_exists,
}, indent=2))
print("=" * 120)

We configure the Colab environment by importing the required Python libraries, defining a helper function for running shell commands, and installing the key dependencies for the tutorial. We then import PyTorch and Matplotlib, verify that a GPU is available, and gather essential system information, including the GPU name, CUDA version, Python version, and the locations of the CUDA toolkit and cuDNN headers. This gives us a clear picture of the system's state before we attempt any Transformer Engine installation or modeling.
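One point worth making explicit: based on NVIDIA's published hardware support (an assumption about your runtime, not something the notebook itself enforces), FP8 Tensor Core execution generally requires compute capability 8.9 (Ada) or 9.0 (Hopper) and newer, while older Colab GPUs such as the T4 (7.5) or A100 (8.0) are limited to FP16/BF16 mixed precision. A tiny helper like the following, fed the capability tuple gathered above, makes that gate explicit; `supports_fp8` is an illustrative name, not part of any library:

```python
# Hedged rule of thumb: FP8 Tensor Core math needs Ada (SM 8.9) or
# Hopper (SM 9.0) and newer. Earlier GPUs like T4 (7.5) or A100 (8.0)
# can still use BF16/FP16 mixed precision via torch.autocast.

def supports_fp8(cc_major: int, cc_minor: int) -> bool:
    """Return True if the compute capability can run FP8 Tensor Core math."""
    return (cc_major, cc_minor) >= (8, 9)

print(supports_fp8(7, 5))  # T4 in free Colab -> False
print(supports_fp8(9, 0))  # H100 -> True
```

In the notebook this check is effectively what `te.fp8` availability queries do for us, but having it in plain Python helps set expectations before any installation is attempted.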

te_available = False
te_mode = "fallback"
te_import_error = None


try:
   run([sys.executable, "-m", "pip", "install", "-q", "transformer_engine[core_cu12]"])
except Exception as e:
   print("Core wheel install failed:", repr(e))


can_try_te_torch = nvcc_exists and cudnn_header_exists


if can_try_te_torch:
   env = os.environ.copy()
   env["NVTE_FRAMEWORK"] = "pytorch"
   env["MAX_JOBS"] = "1"
   env["NVTE_BUILD_THREADS_PER_JOB"] = "1"
   env["CUDA_PATH"] = cuda_home
   env["CUDA_HOME"] = cuda_home
   try:
       print("\nAttempting to build the PyTorch extension for Transformer Engine...")
       result = subprocess.run(
           [sys.executable, "-m", "pip", "install", "-q", "--no-build-isolation", "transformer_engine[pytorch]"],
           text=True,
           capture_output=True,
           env=env,
       )
       if result.stdout.strip():
           print(result.stdout[-4000:])
       if result.returncode != 0 and result.stderr.strip():
           print(result.stderr[-4000:])
       if result.returncode == 0:
           import transformer_engine.pytorch as te
           from transformer_engine.common import recipe
           te_available = True
           te_mode = "transformer_engine"
       else:
           te_import_error = result.stderr[-4000:] if result.stderr else "Unknown pip build error"
   except Exception as e:
       te_import_error = repr(e)
else:
   te_import_error = "Missing nvcc or cuDNN headers in this Colab runtime, so TE PyTorch extension cannot be built here."


if te_available:
   try:
       fp8_available, fp8_reason = te.fp8.check_fp8_support()
   except Exception as e:
       fp8_available, fp8_reason = False, f"Could not query FP8 availability: {e}"
   bf16_available = torch.cuda.is_bf16_supported()
else:
   fp8_available = False
   fp8_reason = "Transformer Engine not installed; using fallback PyTorch path."
   bf16_available = torch.cuda.is_bf16_supported()


amp_dtype = torch.bfloat16 if bf16_available else torch.float16


print("\n" + "=" * 120)
print("INSTALL STATUS")
print("=" * 120)
print(json.dumps({
   "te_available": te_available,
   "te_mode": te_mode,
   "fp8_available": fp8_available,
   "fp8_reason": fp8_reason,
   "te_import_error": te_import_error,
   "amp_dtype": str(amp_dtype),
}, indent=2))
print("=" * 120)


device = "cuda"
random.seed(42)
torch.manual_seed(42)
torch.cuda.manual_seed_all(42)


if te_available:
   fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.E4M3)


def baseline_autocast():
   return torch.autocast(device_type="cuda", dtype=amp_dtype)


def te_forward_context(use_fp8):
   if te_available and use_fp8:
       return te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe)
   return baseline_autocast()

We attempt to install the Transformer Engine core package and check whether the Colab runtime can build the PyTorch extension by verifying the presence of nvcc and the cuDNN headers. If the environment supports it, we install the Transformer Engine PyTorch backend and query whether FP8 and BF16 are available on the current hardware. We also select the mixed-precision dtype and define autocast context helpers, which later let us switch between standard mixed precision and Transformer Engine FP8 execution.
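The dispatch idea behind `te_forward_context` — pick a context manager at call time based on availability flags — can be sketched without any GPU dependencies. The names below (`fp8_context`, `amp_context`, `forward_context`) are illustrative stand-ins for the real TE and PyTorch autocast contexts:

```python
# Framework-free sketch of the context-dispatch pattern used above:
# choose which context manager wraps the forward pass based on flags
# known only at runtime. The yielded strings stand in for the real
# FP8 / mixed-precision execution modes.
from contextlib import contextmanager

@contextmanager
def fp8_context():
    yield "fp8"   # placeholder for te.fp8_autocast(...)

@contextmanager
def amp_context():
    yield "amp"   # placeholder for torch.autocast(...)

def forward_context(te_available: bool, use_fp8: bool):
    if te_available and use_fp8:
        return fp8_context()
    return amp_context()

with forward_context(True, True) as mode:
    print(mode)   # fp8
with forward_context(False, True) as mode:
    print(mode)   # amp
```

Because the dispatch happens per call rather than at import time, the same model code runs unchanged whether or not the TE extension built successfully.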

class TeacherNet(nn.Module):
   def __init__(self, hidden_size=512, intermediate_size=2048, num_layers=3, vocab_size=4096):
       super().__init__()
       self.embed = nn.Embedding(vocab_size, hidden_size)
       self.layers = nn.ModuleList([
           nn.Sequential(
               nn.LayerNorm(hidden_size),
               nn.Linear(hidden_size, intermediate_size),
               nn.GELU(),
               nn.Linear(intermediate_size, hidden_size),
           ) for _ in range(num_layers)
       ])
       self.head = nn.Linear(hidden_size, hidden_size)


   def forward(self, token_ids):
       x = self.embed(token_ids)
       for layer in self.layers:
           x = x + layer(x)
       return self.head(x)


class BaselineStudent(nn.Module):
   def __init__(self, hidden_size=512, intermediate_size=2048, num_layers=3, vocab_size=4096):
       super().__init__()
       self.embed = nn.Embedding(vocab_size, hidden_size)
       self.norms = nn.ModuleList([nn.LayerNorm(hidden_size) for _ in range(num_layers)])
       self.fc1 = nn.ModuleList([nn.Linear(hidden_size, intermediate_size) for _ in range(num_layers)])
       self.fc2 = nn.ModuleList([nn.Linear(intermediate_size, hidden_size) for _ in range(num_layers)])
       self.head = nn.Linear(hidden_size, hidden_size)


   def forward(self, token_ids):
       x = self.embed(token_ids)
       for ln, fc1, fc2 in zip(self.norms, self.fc1, self.fc2):
           residual = x
           x = ln(x)
           x = fc1(x)
           x = F.gelu(x, approximate="tanh")
           x = fc2(x)
           x = x + residual
       return self.head(x)


if te_available:
   class TEStudent(nn.Module):
       def __init__(self, hidden_size=512, intermediate_size=2048, num_layers=3, vocab_size=4096):
           super().__init__()
           self.embed = nn.Embedding(vocab_size, hidden_size)
           self.norms = nn.ModuleList([te.LayerNorm(hidden_size) for _ in range(num_layers)])
           self.fc1 = nn.ModuleList([te.Linear(hidden_size, intermediate_size, bias=True) for _ in range(num_layers)])
           self.fc2 = nn.ModuleList([te.Linear(intermediate_size, hidden_size, bias=True) for _ in range(num_layers)])
           self.head = te.Linear(hidden_size, hidden_size, bias=True)


       def forward(self, token_ids, use_fp8=False):
           x = self.embed(token_ids)
           with te_forward_context(use_fp8):
               for ln, fc1, fc2 in zip(self.norms, self.fc1, self.fc2):
                   residual = x
                   x = ln(x)
                   x = fc1(x)
                   x = F.gelu(x, approximate="tanh")
                   x = fc2(x)
                   x = x + residual
               x = self.head(x)
           return x
else:
   class TEStudent(nn.Module):
       def __init__(self, hidden_size=512, intermediate_size=2048, num_layers=3, vocab_size=4096):
           super().__init__()
           self.embed = nn.Embedding(vocab_size, hidden_size)
           self.norms = nn.ModuleList([nn.LayerNorm(hidden_size) for _ in range(num_layers)])
           self.fc1 = nn.ModuleList([nn.Linear(hidden_size, intermediate_size) for _ in range(num_layers)])
           self.fc2 = nn.ModuleList([nn.Linear(intermediate_size, hidden_size) for _ in range(num_layers)])
           self.head = nn.Linear(hidden_size, hidden_size)


       def forward(self, token_ids, use_fp8=False):
           x = self.embed(token_ids)
           with baseline_autocast():
               for ln, fc1, fc2 in zip(self.norms, self.fc1, self.fc2):
                   residual = x
                   x = ln(x)
                   x = fc1(x)
                   x = F.gelu(x, approximate="tanh")
                   x = fc2(x)
                   x = x + residual
               x = self.head(x)
           return x


def count_params(model):
   return sum(p.numel() for p in model.parameters() if p.requires_grad)


def format_millions(n):
   return f"{n / 1e6:.2f}M"

We define the neural network architectures used throughout the tutorial: the teacher model, the baseline student model, and the Transformer Engine student model. We keep the model shapes aligned so that comparisons remain meaningful, while the TE path swaps in Transformer Engine layers when the extension is available. We also define small utility functions to count parameters and format model sizes, which help us check the scale of the models before training begins.
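As a sanity check on `count_params`, the baseline student's size can be derived by hand from the shapes above (embedding, three LayerNorm/Linear/Linear blocks, and a head). The helper below is a hypothetical, framework-free recomputation under those shape assumptions:

```python
# Analytic parameter count for BaselineStudent, derived from the layer
# shapes in the class definition above (default sizes assumed):
#   embedding: vocab x hidden
#   per block: LayerNorm (2*hidden) + fc1 (hidden*inter + inter)
#              + fc2 (inter*hidden + hidden)
#   head:      hidden*hidden + hidden

def student_param_count(hidden=512, inter=2048, layers=3, vocab=4096):
    embed = vocab * hidden
    per_layer = (
        2 * hidden                      # LayerNorm weight + bias
        + hidden * inter + inter        # fc1 weight + bias
        + inter * hidden + hidden       # fc2 weight + bias
    )
    head = hidden * hidden + hidden     # output projection
    return embed + layers * per_layer + head

print(f"{student_param_count() / 1e6:.2f}M")  # 8.66M
```

If `count_params(baseline_model)` reports a different figure, a shape mismatch has crept into one of the layers, so this is a cheap guard before training starts.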

hidden_size = 512
intermediate_size = 2048
num_layers = 3
vocab_size = 4096
seq_len = 128
batch_size = 8
steps = 25
benchmark_iters = 20
lr = 2e-4
weight_decay = 1e-2


teacher = TeacherNet(hidden_size, intermediate_size, num_layers, vocab_size).to(device).eval()
baseline_model = BaselineStudent(hidden_size, intermediate_size, num_layers, vocab_size).to(device)
te_model = TEStudent(hidden_size, intermediate_size, num_layers, vocab_size).to(device)


optimizer_baseline = torch.optim.AdamW(baseline_model.parameters(), lr=lr, weight_decay=weight_decay)
optimizer_te = torch.optim.AdamW(te_model.parameters(), lr=lr, weight_decay=weight_decay)


print("Teacher params :", format_millions(count_params(teacher)))
print("Baseline params:", format_millions(count_params(baseline_model)))
print("TE-path params :", format_millions(count_params(te_model)))


def make_batch(batch_size, seq_len, vocab_size, device):
   tokens = torch.randint(0, vocab_size, (batch_size, seq_len), device=device)
   with torch.no_grad():
       target = teacher(tokens)
   return tokens, target


def peak_mem_mb():
   return torch.cuda.max_memory_allocated() / (1024 ** 2)


def train_baseline_step():
   baseline_model.train()
   optimizer_baseline.zero_grad(set_to_none=True)
   tokens, target = make_batch(batch_size, seq_len, vocab_size, device)
   with baseline_autocast():
       pred = baseline_model(tokens)
       loss = F.mse_loss(pred, target)
   loss.backward()
   optimizer_baseline.step()
   return float(loss.detach().item())


def train_te_step(use_fp8):
   te_model.train()
   optimizer_te.zero_grad(set_to_none=True)
   tokens, target = make_batch(batch_size, seq_len, vocab_size, device)
   pred = te_model(tokens, use_fp8=use_fp8)
   loss = F.mse_loss(pred, target)
   loss.backward()
   optimizer_te.step()
   return float(loss.detach().item())

We set the hyperparameters for the experiment, move all models to the GPU, and create the optimizers used during training. We also print parameter counts to confirm that the baseline and TE paths are comparable in model size. In addition, we define the batch-generation logic, a peak-memory helper, and the per-model training step functions that each perform one optimization step.
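One caveat worth noting: when `amp_dtype` falls back to `torch.float16` (no BF16 support), FP16 gradients can underflow, and PyTorch's `GradScaler` counters this with dynamic loss scaling. The toy class below mimics that update rule in plain Python — grow the scale after a streak of overflow-free steps, halve it on overflow. The constants mirror `GradScaler`'s documented defaults; the class is an illustration of the mechanism, not a replacement for the real one:

```python
# Toy dynamic loss scaler mimicking the core GradScaler update rule:
# multiply the scale by growth_factor after growth_interval consecutive
# finite steps, multiply by backoff_factor whenever inf/NaN gradients
# are found. Defaults follow GradScaler's documented values.

class ToyLossScaler:
    def __init__(self, init_scale=2.0**16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = 0

    def update(self, found_inf: bool):
        if found_inf:
            self.scale *= self.backoff_factor   # back off on overflow
            self._good_steps = 0
        else:
            self._good_steps += 1
            if self._good_steps == self.growth_interval:
                self.scale *= self.growth_factor
                self._good_steps = 0

scaler = ToyLossScaler(init_scale=8.0, growth_interval=2)
scaler.update(found_inf=True)
print(scaler.scale)  # 4.0
scaler.update(found_inf=False)
scaler.update(found_inf=False)
print(scaler.scale)  # 8.0
```

On BF16-capable GPUs the notebook's unscaled `loss.backward()` is fine, since BF16 shares FP32's exponent range; the scaler matters mainly on older FP16-only hardware.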

baseline_losses = []
te_losses = []
mode_name = "TE-FP8" if (te_available and fp8_available) else ("TE-BF16/FP16" if te_available else "Fallback-PyTorch")


print("\n" + "=" * 120)
print("TRAINING")
print("=" * 120)


for step in range(1, steps + 1):
   b_loss = train_baseline_step()
   t_loss = train_te_step(use_fp8=fp8_available)
   baseline_losses.append(b_loss)
   te_losses.append(t_loss)
   if step == 1 or step % 5 == 0 or step == steps:
       print(f"step={step:02d} | baseline_loss={b_loss:.6f} | te_path_loss={t_loss:.6f} | mode={mode_name}")


@torch.no_grad()
def evaluate_model(model, is_te=False, use_fp8=False, eval_batches=8):
   model.eval()
   vals = []
   for _ in range(eval_batches):
       tokens, target = make_batch(batch_size, seq_len, vocab_size, device)
       if is_te:
           pred = model(tokens, use_fp8=use_fp8)
       else:
           with baseline_autocast():
               pred = model(tokens)
       vals.append(float(F.mse_loss(pred, target).item()))
   return sum(vals) / len(vals)


baseline_eval = evaluate_model(baseline_model, is_te=False)
te_eval = evaluate_model(te_model, is_te=True, use_fp8=fp8_available)


def benchmark_train_step(model, optimizer, is_te=False, use_fp8=False, warmup=5, iters=20):
   times_ms = []
   mems_mb = []
   for _ in range(warmup):
       optimizer.zero_grad(set_to_none=True)
       tokens, target = make_batch(batch_size, seq_len, vocab_size, device)
       if is_te:
           pred = model(tokens, use_fp8=use_fp8)
       else:
           with baseline_autocast():
               pred = model(tokens)
       loss = F.mse_loss(pred, target)
       loss.backward()
       optimizer.step()
   torch.cuda.synchronize()
   for _ in range(iters):
       torch.cuda.reset_peak_memory_stats()
       optimizer.zero_grad(set_to_none=True)
       tokens, target = make_batch(batch_size, seq_len, vocab_size, device)
       start = time.perf_counter()
       if is_te:
           pred = model(tokens, use_fp8=use_fp8)
       else:
           with baseline_autocast():
               pred = model(tokens)
       loss = F.mse_loss(pred, target)
       loss.backward()
       optimizer.step()
       torch.cuda.synchronize()
       end = time.perf_counter()
       times_ms.append((end - start) * 1000.0)
       mems_mb.append(peak_mem_mb())
   return {
       "mean_ms": statistics.mean(times_ms),
       "median_ms": statistics.median(times_ms),
       "max_memory_mb": max(mems_mb),
   }


baseline_bench = benchmark_train_step(baseline_model, optimizer_baseline, is_te=False, use_fp8=False, iters=benchmark_iters)
te_bench = benchmark_train_step(te_model, optimizer_te, is_te=True, use_fp8=fp8_available, iters=benchmark_iters)

We run the main training loop for both the baseline model and the TE path, tracking their losses over multiple steps. We then define and run an evaluation routine to measure how well each model matches the teacher's outputs after training. Finally, we use a benchmarking routine to measure per-step execution time and peak CUDA memory usage, enabling a quantitative comparison of performance characteristics.
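To turn the two benchmark dictionaries into headline numbers, a small comparison helper can compute the TE path's speedup and memory ratio relative to the baseline. The `compare` function below is a hypothetical convenience that assumes the key names returned by `benchmark_train_step`:

```python
# Hedged helper for interpreting benchmark_train_step results:
# speedup_x > 1.0 means the TE path's mean step is faster than the
# baseline; mem_ratio < 1.0 means it peaked at less CUDA memory.

def compare(baseline: dict, te_path: dict) -> dict:
    return {
        "speedup_x": baseline["mean_ms"] / te_path["mean_ms"],
        "mem_ratio": te_path["max_memory_mb"] / baseline["max_memory_mb"],
    }

# Illustrative numbers only, not measured results:
b = {"mean_ms": 20.0, "max_memory_mb": 900.0}
t = {"mean_ms": 16.0, "max_memory_mb": 720.0}
print(compare(b, t))  # {'speedup_x': 1.25, 'mem_ratio': 0.8}
```

On hardware without FP8 support the two paths run near-identical math, so ratios close to 1.0 are expected there rather than a sign that something is wrong.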

summary = {
   "gpu_name": gpu_name,
   "compute_capability": f"{cc_major}.{cc_minor}",
   "te_available": te_available,
   "fp8_available": fp8_available,
   "fp8_reason": fp8_reason,
   "mode": mode_name,
   "baseline_eval_mse": baseline_eval,
   "te_path_eval_mse": te_eval,
   "baseline_mean_step_ms": baseline_bench["mean_ms"],
   "te_path_mean_step_ms": te_bench["mean_ms"],
   "baseline_peak_mem_mb": baseline_bench["max_memory_mb"],
   "te_path_peak_mem_mb": te_bench["max_memory_mb"],
}


print("\n" + "=" * 120)
print("SUMMARY")
print("=" * 120)
print(json.dumps(summary, indent=2))


plt.figure(figsize=(10, 5))
plt.plot(baseline_losses, label="Baseline loss")
plt.plot(te_losses, label=f"{mode_name} loss")
plt.xlabel("Training step")
plt.ylabel("MSE loss")
plt.title("Training Loss Comparison")
plt.legend()
plt.grid(True)
plt.show()


plt.figure(figsize=(8, 5))
plt.bar(["Baseline", mode_name], [baseline_bench["mean_ms"], te_bench["mean_ms"]])
plt.ylabel("Mean train step time (ms)")
plt.title("Speed Comparison")
plt.grid(True, axis="y")
plt.show()


plt.figure(figsize=(8, 5))
plt.bar(["Baseline", mode_name], [baseline_bench["max_memory_mb"], te_bench["max_memory_mb"]])
plt.ylabel("Peak memory (MB)")
plt.title("Peak CUDA Memory Comparison")
plt.grid(True, axis="y")
plt.show()

We collect all the final metrics into a compact summary dictionary and print the combined results in a structured format. We then plot the training loss, mean training-step time, and peak memory usage so we can interpret the differences between the baseline and TE paths at a glance. This last section helps us move from raw numbers to practical insight by showing how the two implementations compare across precision, speed, and memory.

In conclusion, we have built much more than a simple installation script; we have created a complete evaluation pipeline that shows how the NVIDIA Transformer Engine fits into modern GPU-accelerated model training. We probed the runtime environment, adapted to Colab's limitations, maintained a working fallback path, and then trained, evaluated, and benchmarked two implementations to see real differences in efficiency, precision behavior, and resource usage. Along the way, we learned how to use the Transformer Engine in a Colab-friendly setting and built a reusable foundation that can support larger transformer architectures, richer experiments, and a more production-oriented development workflow.

