ReLU (Rectified Linear Unit)
ReLU is the most widely used activation function in deep learning: f(x) = max(0, x). It is simple, fast, and effective against vanishing gradients, which helped make training deep networks practical.
Explanation
ReLU passes positive values through unchanged and sets negative values to 0. Its gradient is 1 for every positive input, so repeated backpropagation does not shrink gradients the way saturating Sigmoid/Tanh do, which accelerates training. Common variants: Leaky ReLU, PReLU, GELU, and SiLU/Swish.
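A minimal sketch of ReLU and the Leaky ReLU variant in plain NumPy; the function names and the alpha default here are illustrative, not tied to any framework:

```python
import numpy as np

def relu(x):
    # Pass positive values through unchanged, clamp negatives to 0.
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Like ReLU, but negative inputs get a small slope alpha instead of 0,
    # keeping a nonzero gradient and mitigating "dead neurons".
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))        # [0.   0.   0.   1.5]
print(leaky_relu(x))  # [-0.02  -0.005  0.     1.5 ]
```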
Marketing Relevance
ReLU was a key enabler of deep learning's success: without non-saturating activations like ReLU, very deep networks were far harder to train.
Origin & History
ReLU-like units were described as early as the 1960s, but Nair & Hinton (2010) first demonstrated their superiority for deep networks. AlexNet (2012) used ReLU in its ImageNet breakthrough. GELU (Hendrycks & Gimpel, 2016) and SiLU/Swish (Ramachandran et al., 2017) are smoother variants that became standard in transformers (GPT, BERT).
Comparisons & Differences
ReLU (Rectified Linear Unit) vs. Sigmoid
ReLU: no vanishing gradient for positive inputs and cheap to compute, but neurons can "die" (output stuck at 0 with zero gradient). Sigmoid: smooth output in (0, 1), but it saturates for large |x|, so gradients vanish in deep networks. The sketch below illustrates the gradient difference.
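A small sketch contrasting the two gradients with hand-written derivative formulas (nothing here comes from a specific library):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative s(x) * (1 - s(x)): peaks at 0.25 and vanishes for large |x|.
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    # Derivative is 1 for positive inputs, 0 otherwise (undefined exactly at 0).
    return (x > 0).astype(float)

x = np.array([0.0, 5.0, 10.0])
print(sigmoid_grad(x))  # [0.25  6.6e-03  4.5e-05] -- shrinks fast
print(relu_grad(x))     # [0. 1. 1.] -- stays 1 on the positive side
```

Multiplied across many layers, the sigmoid's tiny gradients shrink toward zero, which is exactly the vanishing-gradient problem ReLU sidesteps.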
ReLU (Rectified Linear Unit) vs. GELU
ReLU has a hard kink at 0; GELU is smooth and probabilistically motivated (it weights each input by the Gaussian CDF) and is the standard activation in transformers (GPT, BERT).
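A sketch of GELU using its common tanh approximation (from Hendrycks & Gimpel, 2016), side by side with ReLU's hard cutoff:

```python
import numpy as np

def gelu(x):
    # tanh approximation of x * Phi(x), where Phi is the Gaussian CDF.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

x = np.array([-1.0, 0.0, 1.0])
print(gelu(x))             # [-0.159  0.     0.841] -- smooth; small negative values pass through slightly
print(np.maximum(0.0, x))  # [0. 0. 1.] -- ReLU clips all negatives to exactly 0
```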