Tensor Parallelism
A parallelization strategy that splits individual tensor operations (typically large matrix multiplications) across multiple GPUs, enabling training and inference of models whose layers are too large to fit on a single GPU.
Explanation
Megatron-LM (NVIDIA) splits the weight matrices of the attention and FFN blocks: the first matrix is partitioned column-wise (column parallel) and the second row-wise (row parallel), so each block needs only a single all-reduce to combine the partial results. This requires fast GPU interconnects (NVLink) and is typically combined with data and pipeline parallelism for maximum scaling.
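A minimal sketch of the column-then-row split, assuming PyTorch on CPU: the tensor-parallel ranks are simulated in a loop and the all-reduce is replaced by a plain sum. Shard count, dimensions, and variable names are illustrative, not Megatron-LM's actual API.

```python
import torch

torch.manual_seed(0)
torch.set_default_dtype(torch.float64)  # double precision keeps the numeric check tight

tp = 4                       # number of simulated tensor-parallel GPUs
d_model, d_ff = 64, 256      # toy dimensions
x = torch.randn(8, d_model)  # a batch of token activations

# Unsharded FFN for reference: y = gelu(x @ W1) @ W2
W1 = torch.randn(d_model, d_ff)
W2 = torch.randn(d_ff, d_model)
y_ref = torch.nn.functional.gelu(x @ W1) @ W2

# Column parallel: each rank holds a slice of W1's columns ...
W1_shards = W1.chunk(tp, dim=1)
# ... row parallel: and the matching slice of W2's rows.
W2_shards = W2.chunk(tp, dim=0)

partials = []
for rank in range(tp):
    h = torch.nn.functional.gelu(x @ W1_shards[rank])  # purely local compute
    partials.append(h @ W2_shards[rank])               # partial output per rank

# The single all-reduce per block: in real training a NCCL all-reduce over
# NVLink, here just a sum over the simulated ranks.
y_tp = torch.stack(partials).sum(dim=0)

assert torch.allclose(y_ref, y_tp)
print("sharded FFN matches the unsharded reference")
```

The ordering is the point of the design: because the GeLU is elementwise, the column split of W1 lines up with the row split of W2, so no communication is needed between the two matmuls and a single all-reduce at the end suffices.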
Marketing Relevance
Tensor parallelism is essential for training and serving models with 100B+ parameters, whose individual layers no longer fit on a single GPU.
Example
Llama-3 405B uses tensor parallelism across the 8 GPUs of a node: the FFN weight matrices (model dimension 16,384, FFN dimension 53,248) are sharded across the 8 GPUs, each computing 1/8 of the output.
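For scale, a back-of-the-envelope sharding calculation: the dimensions match the Llama 3 paper, but the variable names and the bf16 memory estimate are illustrative assumptions (the SwiGLU gate projection is omitted for brevity).

```python
# Llama-3 405B FFN: model dim 16,384, FFN dim 53,248, tensor parallel over 8 GPUs
d_model, d_ff, tp = 16_384, 53_248, 8

up_params = d_model * d_ff    # column-parallel projection, sharded along d_ff
down_params = d_ff * d_model  # row-parallel projection, sharded along d_ff
total = up_params + down_params

per_gpu = total // tp
print(f"up + down projections: {total / 1e9:.2f}B parameters per layer")
print(f"per-GPU shard:         {per_gpu / 1e9:.2f}B parameters "
      f"(~{per_gpu * 2 / 2**30:.2f} GiB in bf16)")
```

That works out to roughly 1.74B parameters per FFN layer, i.e. about 0.22B (~0.41 GiB in bf16) per GPU shard.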
Common Pitfalls
Requires very fast GPU interconnects (NVLink); communication overhead across node boundaries is high, which is why tensor parallelism is usually confined to a single node. The implementation is complex, and not all operations can be split easily.
Origin & History
Shoeybi et al. (NVIDIA, 2019) introduced tensor parallelism in Megatron-LM. The technique has since become standard at the 100B+ scale: GPT-3, PaLM, and Llama-3 all use tensor parallelism as a core strategy.
Comparisons & Differences
Tensor Parallelism vs. Pipeline Parallelism
Tensor parallelism splits the work within a layer (intra-layer); pipeline parallelism splits the model between layers (inter-layer), assigning each GPU a contiguous group of layers.
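A toy contrast, using Python dicts as stand-ins for GPU assignments; all names here are hypothetical.

```python
layers = [f"layer_{i}" for i in range(8)]
n_gpus = 4

# Pipeline parallelism: whole layers per GPU (inter-layer split).
pipeline = {gpu: layers[gpu * 2:(gpu + 1) * 2] for gpu in range(n_gpus)}
print("pipeline:", pipeline)
# -> GPU 0 holds layer_0 and layer_1, GPU 1 holds layer_2 and layer_3, ...

# Tensor parallelism: every GPU holds a 1/4 slice of every layer (intra-layer
# split), so all GPUs cooperate on each layer and must communicate within it.
tensor = {gpu: [f"{layer}[shard {gpu}/{n_gpus}]" for layer in layers]
          for gpu in range(n_gpus)}
print("tensor:", tensor[0])
```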