
    Tensor Parallelism

    Also known as:
    Intra-Layer Parallelism
    Megatron Parallelism
    Column/Row Parallel
    Updated: 2/11/2026

    A parallelization strategy that splits individual tensor operations (such as large matrix multiplications) across multiple GPUs – necessary when a single layer is too large to fit on one GPU.

    Quick Summary

    Tensor parallelism splits individual matrix multiplications across GPUs, enabling training and inference of models whose layers do not fit on a single GPU.

    Explanation

    Megatron-LM (NVIDIA) splits the weight matrices of the attention and FFN blocks: the first matrix is partitioned column-wise (column parallel) and the second row-wise (row parallel), so each block needs only a single all-reduce in the forward pass. This requires fast GPU interconnects such as NVLink. In practice it is combined with data and pipeline parallelism for maximum scaling.
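The column/row split described above can be sketched in plain Python, simulating the GPUs as list shards. All function names, shapes, and values are illustrative assumptions, not Megatron-LM's actual API:

```python
# Megatron-style tensor parallelism for a 2-layer FFN: y = (x @ A) @ B,
# with A split column-wise and B row-wise across simulated "GPUs".

def matmul(X, W):
    """Naive matrix multiply for nested-list matrices."""
    return [[sum(x * w for x, w in zip(row, col)) for col in zip(*W)]
            for row in X]

def split_columns(W, parts):
    """Column-parallel shards: each rank holds a slice of A's columns."""
    n = len(W[0]) // parts
    return [[row[i * n:(i + 1) * n] for row in W] for i in range(parts)]

def split_rows(W, parts):
    """Row-parallel shards: each rank holds a slice of B's rows."""
    n = len(W) // parts
    return [W[i * n:(i + 1) * n] for i in range(parts)]

def parallel_ffn(x, A, B, world_size=2):
    A_shards = split_columns(A, world_size)
    B_shards = split_rows(B, world_size)
    partials = []
    for rank in range(world_size):
        h = matmul(x, A_shards[rank])          # local activation shard, no communication
        partials.append(matmul(h, B_shards[rank]))
    # all-reduce: sum the partial outputs -- the single communication step
    return [[sum(p[r][c] for p in partials) for c in range(len(partials[0][0]))]
            for r in range(len(partials[0]))]
```

Because the column shards of A line up with the row shards of B, each rank computes an independent partial product and only the final sum needs communication, which is exactly why Megatron pairs the two split directions.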

    Marketing Relevance

    Tensor parallelism is essential for training and inference of models with 100B+ parameters – at that scale, individual layers no longer fit on a single GPU.

    Example

    Llama-3 405B uses tensor parallelism across the 8 GPUs of a node: the large FFN weight matrices are sharded across all 8 GPUs, with each GPU computing 1/8 of the output.
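The memory saving from 8-way sharding can be checked with back-of-envelope arithmetic. The dimensions below are illustrative placeholders, not the exact Llama-3 405B values:

```python
# Parameters held per GPU for a two-matrix FFN under tensor parallelism:
# the up-projection (d_model x d_ffn) is split by columns, the
# down-projection (d_ffn x d_model) by rows.

def ffn_shard_params(d_model, d_ffn, tp_degree):
    assert d_ffn % tp_degree == 0
    up = d_model * (d_ffn // tp_degree)     # column-parallel shard
    down = (d_ffn // tp_degree) * d_model   # row-parallel shard
    return up + down

# Illustrative dimensions (assumed, not the published model config):
total = ffn_shard_params(16384, 65536, 1)
per_gpu = ffn_shard_params(16384, 65536, 8)
print(per_gpu, total // per_gpu)  # each GPU stores exactly 1/8 of the FFN weights
```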

    Common Pitfalls

    Requires very fast GPU interconnects (e.g., NVLink); across nodes, the communication overhead becomes prohibitive. The implementation is complex, and not all operations can be split easily.
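The communication-overhead pitfall can be quantified with a rough estimate. A ring all-reduce moves about 2·(n−1)/n of the tensor's bytes through each GPU; all the concrete numbers below are illustrative assumptions, not measured values:

```python
# Rough per-GPU communication volume for one tensor-parallel all-reduce
# over the activation tensor (batch x seq_len x d_model).

def allreduce_bytes_per_gpu(batch, seq_len, d_model, tp, bytes_per_elem=2):
    tensor_bytes = batch * seq_len * d_model * bytes_per_elem
    # ring all-reduce: each GPU sends/receives ~2*(n-1)/n of the tensor
    return int(2 * (tp - 1) / tp * tensor_bytes)

# Example: batch 1, sequence 4096, hidden 8192, fp16, 8-way TP:
vol = allreduce_bytes_per_gpu(1, 4096, 8192, 8)
print(vol / 2**20, "MiB per all-reduce")  # ~112 MiB, repeated twice per layer
```

At two all-reduces per transformer layer (one for attention, one for the FFN), this volume recurs dozens of times per forward pass, which is why NVLink-class bandwidth inside a node is effectively a requirement.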

    Origin & History

    Shoeybi et al. (NVIDIA, 2019) introduced tensor parallelism in Megatron-LM. The technique became standard for all 100B+ models: GPT-3, PaLM, and Llama-3 all use tensor parallelism as a core strategy.

    Comparisons & Differences

    Tensor Parallelism vs. Pipeline Parallelism

    Tensor parallelism splits within a layer (intra-layer); pipeline parallelism splits between layers (inter-layer).

