Nvidia pulled back part of the curtain on its long-anticipated Volta GPU
architecture, revealing the GV100 GPU and the first derivative product,
the Tesla V100, here at GTC in San Jose today.
Nvidia first dropped the
Volta name at GTC in 2013,
and it's taken the company four years to reveal the juicy details. If
you're a gamer, don't get too excited yet; Nvidia is still pitching
Pascal-derived products (only a year old, or less).
If you work in the
AI and high-performance computing (HPC) markets, however, this first
phase of Volta is coming your way.
The Volta GV100 GPU Architecture
The Volta GV100 GPU is built on TSMC's 12nm FFN process, packs over 21 billion transistors, and is designed for deep learning applications.
We're talking about an 815mm² die here, which pushes the limits of TSMC's current capabilities; Nvidia said it's not possible to build a larger GPU on the current process technology.
Before the GV100, the GP100 was the largest GPU Nvidia had ever produced, occupying 610mm² and housing 15.3 billion transistors. The GV100 is more than 30% larger.
Volta's full GV100 GPU sports 84 SMs (each SM features four texture
units, 64 FP32 cores, 64 INT32 cores, and 32 FP64 cores), with each SM fed by
128KB of combined L1 data cache and shared memory that can be configured
in varying ratios.
For comparison, the full GP100 featured 60 SMs and a total of 3,840
CUDA cores. The Volta SMs also feature a new type of core that
specializes in the 4x4 matrix operations at the heart of deep learning
workloads. The GV100 contains eight Tensor cores per SM, and together
they deliver up to 120 TFLOPS for training and inference operations.
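To make that concrete, here is a minimal numpy sketch of the math a single Tensor core performs each clock: a fused multiply-add of 4x4 matrices, D = A x B + C, with FP16 inputs and FP32 accumulation. This only emulates the arithmetic in software; it is not how the hardware executes it.

```python
import numpy as np

# Emulate the per-clock Tensor core operation: D = A x B + C on 4x4 matrices.
# A and B are FP16 inputs; the multiply-accumulate happens at FP32 precision.
A = np.random.rand(4, 4).astype(np.float16)
B = np.random.rand(4, 4).astype(np.float16)
C = np.random.rand(4, 4).astype(np.float32)  # FP32 accumulator

D = A.astype(np.float32) @ B.astype(np.float32) + C
print(D.dtype, D.shape)  # float32 (4, 4)
```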
Scaling those per-SM figures across 84 SMs brings the full GV100 GPU to
an impressive 5,376 FP32 and INT32 cores, 2,688 FP64 cores, 672 Tensor
cores, and 336 texture units.
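If you want to check the math yourself, here's a quick Python sketch using only the per-SM figures quoted above (the names are ours, purely for illustration):

```python
# Full GV100 totals, scaled from the per-SM figures quoted above.
SMS = 84
PER_SM = {
    "FP32 cores": 64,
    "INT32 cores": 64,
    "FP64 cores": 32,
    "Tensor cores": 8,
    "Texture units": 4,
}

for name, count in PER_SM.items():
    print(f"{name}: {SMS * count:,}")
# FP32 cores: 5,376 / INT32 cores: 5,376 / FP64 cores: 2,688
# Tensor cores: 672 / Texture units: 336
```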
GV100 also features four HBM2 memory stacks, like GP100, with each stack controlled by a pair of memory controllers. In total there are eight 512-bit memory controllers, giving the GPU a 4,096-bit memory bus.
Each memory controller is attached to 768KB of L2 cache, for a total of 6MB of L2 cache (vs. 4MB for Pascal).
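Those memory-subsystem numbers add up cleanly; here is a quick sanity check, again using only figures from the article:

```python
# GV100 memory subsystem, per the figures above.
hbm2_stacks = 4
controllers_per_stack = 2
controller_width_bits = 512
l2_per_controller_kb = 768

controllers = hbm2_stacks * controllers_per_stack        # 8 memory controllers
bus_width = controllers * controller_width_bits          # 4,096-bit bus
l2_total_mb = controllers * l2_per_controller_kb / 1024  # 6.0 MB of L2

print(controllers, bus_width, l2_total_mb)  # 8 4096 6.0
```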
Tesla V100
The new Nvidia Tesla V100 features 80 SMs for a total of 5,120 CUDA cores, and it has the potential to reach 7.5, 15, and 120 TFLOPS in FP64, FP32, and Tensor computations, respectively.
The Tesla V100 sports 16GB of HBM2 memory, which is capable of reaching up to 900 GB/s. The Samsung memory that Nvidia installed on the Tesla V100 is also 180 GB/s faster than the memory found on the Tesla P100 cards.
Nvidia said it used the fastest memory available on the market.
The Tesla V100 also introduces the second generation of NVLink, which allows for up to 300 GB/s of total bidirectional bandwidth over six NVLinks per GPU, each capable of 25 GB/s in each direction.
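The bandwidth claims are easy to verify from the figures above; note that the 300 GB/s NVLink number counts both directions of each link:

```python
# Tesla V100 bandwidth figures quoted in this article.
hbm2_v100_gbs = 900
hbm2_p100_gbs = 720
print(hbm2_v100_gbs - hbm2_p100_gbs)  # 180 GB/s advantage over the P100

nvlinks = 6
per_direction_gbs = 25  # each NVLink 2.0 link, per direction
print(nvlinks * per_direction_gbs * 2)  # 300 GB/s total bidirectional
```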
To put those numbers into perspective, Nvidia's Pascal-derived Tesla P100
sports 56 SMs and 3,584 CUDA cores, which produce up to 5.3 TFLOPS in
FP64 computations and 10.6 TFLOPS in FP32 computations.
That makes the V100 roughly 40% faster than the P100 in both FP32 and
FP64 throughput.
Nvidia also increased the total NVLink bandwidth of the Tesla V100 to
300 GB/s, up from the Tesla P100's 160 GB/s, by adding two NVLinks per GPU
and raising the per-direction bandwidth of each NVLink by 5GB/s (from
20 GB/s to 25 GB/s).
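Putting the generational deltas in one place, here is a short sketch of the ratios. Note that the P100's 160 GB/s aggregate NVLink bandwidth (four links at 20 GB/s per direction) is Nvidia's published figure, not something stated earlier in this article:

```python
# V100 vs. P100 deltas, computed from the figures above.
v100 = {"FP32 TFLOPS": 15.0, "FP64 TFLOPS": 7.5, "NVLink GB/s": 300}
p100 = {"FP32 TFLOPS": 10.6, "FP64 TFLOPS": 5.3, "NVLink GB/s": 160}

for key in v100:
    print(f"{key}: +{(v100[key] / p100[key] - 1) * 100:.0f}%")
# FP32 TFLOPS: +42%, FP64 TFLOPS: +42%, NVLink GB/s: +88%
```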
Nvidia said the Tesla V100 carries a TDP of 300W, which is the same power requirement as the Tesla P100.
| | Tesla V100 | Tesla P100 |
| --- | --- | --- |
| SMs | 80 | 56 |
| CUDA Cores | 5,120 (FP32) / 2,560 (FP64) | 3,584 (FP32) / 1,792 (FP64) |
| Boost Clock | 1,455MHz | 1,480MHz |
| Peak TFLOPS | 7.5 (FP64) / 15 (FP32) / 120 (Tensor) | 5.3 (FP64) / 10.6 (FP32) |
| Texture Units | 320 | 224 |
| Memory | 16GB 4,096-bit HBM2 | 16GB 4,096-bit HBM2 |
| Memory Bandwidth | 900 GB/s | 720 GB/s |
| Transistors | 21.1 Billion | 15.3 Billion |
| Process | 12nm FFN | 16nm FinFET+ |
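Working backwards from the table, the FP32 and FP64 figures follow from core counts and boost clocks, with each fused multiply-add counted as two floating-point operations. Here's a quick check (the helper name is ours):

```python
# Peak TFLOPS = cores x 2 FLOPs per FMA x boost clock, using the table's figures.
def peak_tflops(cores, boost_mhz):
    return cores * 2 * boost_mhz * 1e6 / 1e12

print(round(peak_tflops(5120, 1455), 1))  # V100 FP32: 14.9 (marketed as 15)
print(round(peak_tflops(2560, 1455), 1))  # V100 FP64: 7.4  (marketed as 7.5)
print(round(peak_tflops(3584, 1480), 1))  # P100 FP32: 10.6
print(round(peak_tflops(1792, 1480), 1))  # P100 FP64: 5.3
```

The 120 TFLOPS Tensor figure doesn't come from this formula; it reflects each of the V100's 640 Tensor cores performing a 4x4x4 matrix multiply-accumulate (64 FMAs) per clock.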