Tensor Parallelism
A parallelization strategy that splits individual tensor operations (typically large matrix multiplications) across multiple GPUs, enabling training and inference of models whose layers are too large to fit on a single GPU.
Explanation
Megatron-LM (NVIDIA) splits the weight matrices of the attention and FFN blocks: the first matrix is partitioned column-wise (column parallel) and the second row-wise (row parallel), so each block needs only a single all-reduce to combine the partial results. This requires fast GPU interconnects (NVLink) and is typically combined with data and pipeline parallelism for maximum scaling.
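A minimal sketch of the column-then-row split, assuming PyTorch on CPU: the tensor-parallel ranks are simulated in a loop and the all-reduce is replaced by a plain sum. Shard count, dimensions, and variable names are illustrative, not Megatron-LM's actual API.

```python
import torch

torch.manual_seed(0)
torch.set_default_dtype(torch.float64)  # double precision keeps the numeric check tight

tp = 4                       # number of simulated tensor-parallel GPUs
d_model, d_ff = 64, 256      # toy dimensions
x = torch.randn(8, d_model)  # a batch of token activations

# Unsharded FFN for reference: y = gelu(x @ W1) @ W2
W1 = torch.randn(d_model, d_ff)
W2 = torch.randn(d_ff, d_model)
y_ref = torch.nn.functional.gelu(x @ W1) @ W2

# Column parallel: each rank holds a slice of W1's columns ...
W1_shards = W1.chunk(tp, dim=1)
# ... row parallel: and the matching slice of W2's rows.
W2_shards = W2.chunk(tp, dim=0)

partials = []
for rank in range(tp):
    h = torch.nn.functional.gelu(x @ W1_shards[rank])  # purely local compute
    partials.append(h @ W2_shards[rank])               # partial output per rank

# The single all-reduce per block: in real training a NCCL all-reduce over
# NVLink, here just a sum over the simulated ranks.
y_tp = torch.stack(partials).sum(dim=0)

assert torch.allclose(y_ref, y_tp)
print("sharded FFN matches the unsharded reference")
```

The ordering is the point of the design: because the GeLU is elementwise, the column split of W1 lines up with the row split of W2, so no communication is needed between the two matmuls and a single all-reduce at the end suffices.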
Marketing Relevance
Tensor parallelism is essential for training and serving models with 100B+ parameters, whose individual layers no longer fit on a single GPU.
Example
Llama-3 405B uses tensor parallelism across the 8 GPUs of a node: the FFN weight matrices (model dimension 16,384, FFN dimension 53,248) are sharded across the 8 GPUs, each computing 1/8 of the output.
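For scale, a back-of-the-envelope sharding calculation: the dimensions match the Llama 3 paper, but the variable names and the bf16 memory estimate are illustrative assumptions (the SwiGLU gate projection is omitted for brevity).

```python
# Llama-3 405B FFN: model dim 16,384, FFN dim 53,248, tensor parallel over 8 GPUs
d_model, d_ff, tp = 16_384, 53_248, 8

up_params = d_model * d_ff    # column-parallel projection, sharded along d_ff
down_params = d_ff * d_model  # row-parallel projection, sharded along d_ff
total = up_params + down_params

per_gpu = total // tp
print(f"up + down projections: {total / 1e9:.2f}B parameters per layer")
print(f"per-GPU shard:         {per_gpu / 1e9:.2f}B parameters "
      f"(~{per_gpu * 2 / 2**30:.2f} GiB in bf16)")
```

That works out to roughly 1.74B parameters per FFN layer, i.e. about 0.22B (~0.41 GiB in bf16) per GPU shard.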
Common Pitfalls
Requires very fast GPU interconnects (NVLink); communication overhead across node boundaries is high, which is why tensor parallelism is usually confined to a single node. The implementation is complex, and not all operations can be split easily.
Origin & History
Shoeybi et al. (NVIDIA, 2019) introduced tensor parallelism in Megatron-LM. The technique has since become standard at the 100B+ scale: GPT-3, PaLM, and Llama-3 all use tensor parallelism as a core strategy.
Comparisons & Differences
Tensor Parallelism vs. Pipeline Parallelism
Tensor parallelism splits the work within a layer (intra-layer); pipeline parallelism splits the model between layers (inter-layer), assigning each GPU a contiguous group of layers.
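A toy contrast, using Python dicts as stand-ins for GPU assignments; all names here are hypothetical.

```python
layers = [f"layer_{i}" for i in range(8)]
n_gpus = 4

# Pipeline parallelism: whole layers per GPU (inter-layer split).
pipeline = {gpu: layers[gpu * 2:(gpu + 1) * 2] for gpu in range(n_gpus)}
print("pipeline:", pipeline)
# -> GPU 0 holds layer_0 and layer_1, GPU 1 holds layer_2 and layer_3, ...

# Tensor parallelism: every GPU holds a 1/4 slice of every layer (intra-layer
# split), so all GPUs cooperate on each layer and must communicate within it.
tensor = {gpu: [f"{layer}[shard {gpu}/{n_gpus}]" for layer in layers]
          for gpu in range(n_gpus)}
print("tensor:", tensor[0])
```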