Detokenization
The process of converting tokens back into readable text – the reverse of tokenization.
Detokenization converts token sequences back into readable text: it removes subword markers and correctly reconstructs whitespace and punctuation.
Explanation
Detokenization must correctly reconstruct whitespace, punctuation, and special characters. With subword tokenization, "▁" (SentencePiece) or "##" (WordPiece) markers are removed.
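A minimal sketch of both marker conventions, assuming simplified, hypothetical helper functions (real tokenizers handle many more edge cases, such as punctuation attachment and byte fallback):

```python
def detokenize_wordpiece(tokens):
    # WordPiece: "##" marks a subword continuation (attach without a space)
    text = ""
    for tok in tokens:
        if tok.startswith("##"):
            text += tok[2:]
        else:
            text += (" " if text else "") + tok
    return text

def detokenize_sentencepiece(tokens):
    # SentencePiece: "▁" marks a word start (i.e., a preceding space)
    return "".join(tokens).replace("▁", " ").strip()

print(detokenize_wordpiece(["de", "##token", "##ization"]))   # detokenization
print(detokenize_sentencepiece(["▁Hello", "▁world", "!"]))    # Hello world!
```

Note the asymmetry: WordPiece marks continuations, SentencePiece marks word starts, so the reconstruction logic differs even though both produce the same kind of output.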
Marketing Relevance
Detokenization is essential for correctly displaying LLM outputs in applications.
Common Pitfalls
Whitespace reconstruction with subword tokens is complex. Special characters and Unicode can be problematic: a single multi-byte character may be split across several byte-level tokens. Streaming detokenization must therefore buffer partial tokens until they form valid text.
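The partial-token problem can be illustrated with a hypothetical streaming decoder: byte chunks are buffered until they form valid UTF-8, because emitting an incomplete multi-byte sequence would render as garbage in the UI.

```python
def stream_decode(byte_chunks):
    # Buffer incoming bytes; only emit once they decode as valid UTF-8.
    buf = b""
    for chunk in byte_chunks:
        buf += chunk
        try:
            yield buf.decode("utf-8")
            buf = b""
        except UnicodeDecodeError:
            # Incomplete multi-byte sequence — wait for more bytes.
            continue
    if buf:
        # Trailing malformed bytes: replace rather than crash.
        yield buf.decode("utf-8", errors="replace")

# "é" is two bytes (0xC3 0xA9); split across chunks, nothing is emitted
# until the second chunk arrives.
print(list(stream_decode([b"Hi", b"\xc3", b"\xa9"])))  # ['Hi', 'é']
```

Real LLM serving stacks apply the same principle at the token level: hold back output until the accumulated tokens detokenize to stable, valid text.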
Origin & History
Detokenization was trivial with word-level tokenization. Subword tokenization (BPE, 2016) made detokenization more complex. SentencePiece solved the problem with the "▁" marker for word starts. Streaming detokenization became critical for chat interfaces (ChatGPT, 2022).
Comparisons & Differences
Detokenization vs. Tokenization
Tokenization splits text into tokens; detokenization reassembles tokens into readable text. The round trip is not always lossless.
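The lossiness can be shown with a deliberately naive whitespace tokenizer (an illustrative example, not how production tokenizers work): runs of spaces and trailing newlines are collapsed, so detokenization cannot recover the original string.

```python
text = "Hello,   world!\n"
tokens = text.split()            # ['Hello,', 'world!'] — formatting is discarded
restored = " ".join(tokens)      # 'Hello, world!'
print(restored == text)          # False: extra spaces and the newline are lost
```

Subword tokenizers preserve far more (SentencePiece is designed to be reversible), but normalization steps such as Unicode NFKC can still make the round trip lossy.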