CTC (Connectionist Temporal Classification)
CTC is a loss function for sequence-to-sequence problems in which the input and output have different lengths and no known alignment – a cornerstone of modern ASR.
It trains ASR models without frame-level labels by summing the probability of every frame-to-text alignment that collapses to the target transcript.
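A minimal sketch of that many-to-one mapping (the blank symbol "_" and the example alignments are illustrative): every frame-level path is collapsed by first merging repeated symbols and then removing blanks, so many paths map to the same transcript.

```python
# Sketch of CTC's collapse rule: merge repeated symbols, then drop blanks.
# The blank symbol "_" and the example alignments are illustrative.
from itertools import groupby

BLANK = "_"

def collapse(alignment):
    """Map a frame-level alignment to its transcript."""
    merged = [symbol for symbol, _ in groupby(alignment)]
    return "".join(s for s in merged if s != BLANK)

# Different alignments collapse to the same transcript "cat":
print(collapse(list("cc_a_tt")))   # -> "cat"
print(collapse(list("_c_aa_t_")))  # -> "cat"
# A blank between repeats is what makes doubled letters expressible:
print(collapse(list("hel_lo")))    # -> "hello"
```

Training sums the probabilities of all paths that collapse to the target transcript, which is why no frame-level labels are needed.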
Explanation
CTC sums over all possible alignments between audio frames and output characters. A special blank token lets the model emit nothing at a frame; decoding merges repeated symbols and then removes blanks. Greedy (best-path) or beam search decoding produces the final text.
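As a hedged sketch of how this looks in practice (the shapes and the toy target sequence are assumptions), PyTorch's torch.nn.CTCLoss marginalizes over all valid alignments, and greedy decoding is just a per-frame argmax followed by the collapse rule:

```python
# Sketch: CTC loss over model outputs, plus greedy (best-path) decoding.
import torch

T, N, C = 50, 1, 5          # frames, batch size, classes (index 0 = blank)
log_probs = torch.randn(T, N, C).log_softmax(dim=-1)

targets = torch.tensor([[1, 2, 3, 2]])   # one toy target sequence
input_lengths = torch.full((N,), T)
target_lengths = torch.tensor([4])

# The loss sums over every alignment that collapses to the target.
ctc_loss = torch.nn.CTCLoss(blank=0)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
print(loss.item())

# Greedy decoding: per-frame argmax, merge repeats, drop blanks.
best_path = log_probs.argmax(dim=-1).squeeze(1)  # shape (T,)
decoded, prev = [], None
for idx in best_path.tolist():
    if idx != prev and idx != 0:
        decoded.append(idx)
    prev = idx
print(decoded)
```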
Marketing Relevance
CTC enables end-to-end ASR without manual alignment annotation. Wav2Vec 2.0 uses CTC as its fine-tuning objective.
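For illustration, a minimal inference sketch with a CTC-fine-tuned Wav2Vec 2.0 checkpoint from Hugging Face transformers; the checkpoint name and the random stand-in audio are assumptions:

```python
# Sketch: greedy CTC decoding with a fine-tuned Wav2Vec 2.0 checkpoint.
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

waveform = torch.randn(16000)  # stand-in for 1 s of 16 kHz audio
inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits  # (batch, frames, vocab)

# Argmax per frame; the tokenizer collapses repeats and removes blanks.
predicted_ids = logits.argmax(dim=-1)
print(processor.batch_decode(predicted_ids))
```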
Common Pitfalls
CTC assumes the output tokens are conditionally independent given the input, so the model learns no language model of its own; an external LM is often added during beam search. CTC output distributions also tend to be peaky – blank dominates most frames – which can complicate decoding and alignment extraction.
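A toy calculation (the probabilities are made up) shows one way decoding gets complicated: the single most probable path can collapse to a different transcript than the label with the highest total probability, so greedy best-path decoding is not exact.

```python
# Toy example: greedy best path vs. the most probable label sequence.
from itertools import product, groupby

BLANK = "_"
# Made-up per-frame distributions over {blank, "a"} for two frames.
frames = [{"_": 0.6, "a": 0.4}, {"_": 0.6, "a": 0.4}]

def collapse(path):
    merged = [s for s, _ in groupby(path)]
    return "".join(s for s in merged if s != BLANK)

totals = {}
for path in product(*[f.keys() for f in frames]):
    p = 1.0
    for frame, symbol in zip(frames, path):
        p *= frame[symbol]
    label = collapse(path)
    totals[label] = totals.get(label, 0.0) + p

print(totals)  # {'': 0.36, 'a': 0.64}
# Greedy picks the best path ("_", "_") -> "" with prob 0.36, yet the
# label "a" accumulates 0.64 across three alignments.
```

Beam search that merges hypotheses by their collapsed label, optionally combined with an external language model, mitigates this.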
Origin & History
Graves et al. (2006) introduced CTC for labelling unsegmented sequence data, first demonstrated on phoneme recognition; it was soon applied to handwriting recognition as well. DeepSpeech (Baidu, 2014) made CTC the standard for end-to-end ASR. Wav2Vec 2.0 (2020) uses CTC for fine-tuning.
Comparisons & Differences
CTC (Connectionist Temporal Classification) vs. Attention-based ASR
CTC assumes conditional independence and a strictly monotonic alignment (fast and simple); attention-based ASR learns flexible alignments and models output dependencies (slower, but more powerful).
CTC (Connectionist Temporal Classification) vs. RNN-Transducer
CTC has no dependency between output labels; RNN-T adds a prediction network so each output is conditioned on previous labels while the alignment stays monotonic – well suited to streaming ASR.