Compute Performance (FP4 Tensor) ▲ 5.0x vs H100

20.0 PetaFLOPS

Peak FP4 Tensor performance with sparsity, using the 2nd Gen Transformer Engine

B200 Blackwell	20.0 PF

H200 Hopper	3.9 PF
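The headline multiplier can be checked from the two bar values above. This is a minimal sketch that uses only the figures quoted on this page; the variable names are illustrative and do not come from any NVIDIA tool.

```python
# Peak Tensor throughput figures quoted on this page, in PetaFLOPS.
b200_pf = 20.0
hopper_pf = 3.9

# Ratio of the two quoted peaks.
speedup = b200_pf / hopper_pf
print(f"B200 vs Hopper: {speedup:.1f}x")  # prints "B200 vs Hopper: 5.1x"
```

The result is slightly above the rounded 5.0x shown in the headline.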

Memory System

192GB HBM3e

8-Hi stacks / 1024-bit interface per stack

9.5 TB/s Bandwidth
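The memory figures can be cross-checked with simple arithmetic. The sketch below assumes eight HBM3e stacks, which is what the 8192-bit total in the specifications table implies at 1024 bits per stack, and it treats 1 TB as 10^12 bytes. The implied per-pin rate is a derived estimate, not a published figure.

```python
STACKS = 8               # assumption: 8192-bit total / 1024 bits per stack
BITS_PER_STACK = 1024    # per-stack interface width quoted above
BANDWIDTH_TBPS = 9.5     # aggregate bandwidth quoted above, decimal TB/s

# Total bus width across all stacks.
bus_width_bits = STACKS * BITS_PER_STACK

# Bandwidth share of each stack.
per_stack_tbps = BANDWIDTH_TBPS / STACKS

# Implied data rate per pin: bytes/s -> bits/s, divided across every pin.
per_pin_gbps = BANDWIDTH_TBPS * 1e12 * 8 / bus_width_bits / 1e9

print(f"Bus width: {bus_width_bits} bits")
print(f"Per stack: {per_stack_tbps:.2f} TB/s")
print(f"Per pin:   {per_pin_gbps:.2f} Gbit/s")
```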

Interconnect & I/O

1.8 TB/s NVLink 5

Bi-directional total bandwidth

PCIe Gen 6.0 x16
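To put the two links in perspective, the sketch below estimates how long it would take to move a full 192 GB of HBM contents over each one. It assumes the 1.8 TB/s NVLink figure splits evenly into 0.9 TB/s per direction and uses the raw PCIe 6.0 signaling rate of 64 GT/s per lane; encoding and protocol overhead are ignored, so real transfers would be slower.

```python
PAYLOAD_BYTES = 192e9          # full HBM capacity quoted on this page

# NVLink 5: quoted bidirectional total, assumed symmetric per direction.
NVLINK_BIDIR_BYTES_PER_S = 1.8e12
nvlink_one_way = NVLINK_BIDIR_BYTES_PER_S / 2

# PCIe 6.0 x16: 64 GT/s per lane, 16 lanes, 8 bits per byte (raw rate).
PCIE_GT_PER_S = 64e9
PCIE_LANES = 16
pcie_one_way = PCIE_GT_PER_S * PCIE_LANES / 8

nvlink_seconds = PAYLOAD_BYTES / nvlink_one_way
pcie_seconds = PAYLOAD_BYTES / pcie_one_way

print(f"NVLink 5:     {nvlink_seconds:.2f} s")
print(f"PCIe 6.0 x16: {pcie_seconds:.2f} s")
```

Under these assumptions NVLink is about seven times faster per direction, which is why GPU-to-GPU traffic is routed over it rather than over PCIe.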

Real-World Applications

Large Language Model Training

Train frontier-scale models with 192GB of HBM3e per GPU. On peak FP4 Tensor throughput, one B200 matches roughly five H100s, which can substantially reduce cluster size and operating costs for models above 70B parameters.
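A rough way to relate the 192GB capacity to model size is to count weight bytes at each precision. The sketch below is a back-of-envelope estimate for weights only; training also needs memory for gradients, optimizer state, and activations, which it deliberately leaves out. The helper name `weight_gb` is hypothetical.

```python
HBM_BYTES = 192e9  # per-GPU capacity quoted above, decimal bytes

# Storage cost of one parameter at each numeric format.
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "FP4": 0.5}

def weight_gb(params: float, fmt: str) -> float:
    """Weight footprint in GB for a model with `params` parameters."""
    return params * BYTES_PER_PARAM[fmt] / 1e9

def max_params_billion(fmt: str) -> float:
    """Largest parameter count (in billions) whose weights fit in HBM."""
    return HBM_BYTES / BYTES_PER_PARAM[fmt] / 1e9

for fmt in BYTES_PER_PARAM:
    print(f"{fmt}: 70B model = {weight_gb(70e9, fmt):.0f} GB, "
          f"weights-only limit = {max_params_billion(fmt):.0f}B params")
```

For a 70B-parameter model this gives 140 GB at FP16, 70 GB at FP8, and 35 GB at FP4.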

High-Throughput Inference

Serve production LLM workloads at scale with the 2nd Gen Transformer Engine and native FP4 quantization. A single B200 node can handle the inference throughput of an entire H100 rack for latency-sensitive applications such as real-time chat and code completion.
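For latency-sensitive serving, single-stream token generation is usually limited by how fast the weights can be read from memory rather than by compute. The sketch below applies that common rule of thumb to a hypothetical 70B-parameter model quantized to FP4, using the 9.5 TB/s bandwidth quoted on this page. It ignores KV-cache traffic, kernel overhead, and batching, so it is an upper bound, not a benchmark.

```python
BANDWIDTH_BYTES_PER_S = 9.5e12   # memory bandwidth quoted on this page
PARAMS = 70e9                    # hypothetical model size
BYTES_PER_PARAM_FP4 = 0.5        # 4 bits per weight

# Each generated token requires reading every weight once (batch size 1).
weight_bytes = PARAMS * BYTES_PER_PARAM_FP4

# Upper bound on tokens per second for one sequence.
tokens_per_s = BANDWIDTH_BYTES_PER_S / weight_bytes

# Corresponding lower bound on per-token latency, in milliseconds.
ms_per_token = 1000.0 / tokens_per_s

print(f"Weights: {weight_bytes / 1e9:.0f} GB")
print(f"Upper bound: {tokens_per_s:.0f} tokens/s ({ms_per_token:.2f} ms/token)")
```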

Scientific Computing & Simulation

Accelerate molecular dynamics, climate modeling, and computational fluid dynamics. The 9.5 TB/s memory bandwidth and NVLink 5 interconnect make multi-GPU simulation workloads up to 4x faster than on the previous generation.

Generative AI & Media

Power video generation, 3D rendering, and multimodal AI pipelines. The B200's 192GB memory capacity supports Sora-class video generators and real-time neural radiance fields without the memory bottlenecks that limit H100-based deployments.

Full Technical Specifications

Transistor Count	208 Billion (4NP Process)
Die Size	Dual-Die CoWoS-L (Reticle Limit x2)
Streaming Multiprocessors	160 (Est.)
Tensor Cores	5th Gen Tensor Core Architecture
Memory Interface	8192-bit HBM3e
L2 Cache	128MB Unified
Form Factor	SXM6 / PCIe Add-in Card
Thermal Design Power	1000W (Configurable)

NVIDIA Blackwell B200