Stratified Sampling
Sampling method that ensures class/group proportions in the sample match the overall distribution.
Stratified sampling preserves class distribution when splitting data – essential with class imbalance so rare classes are represented in every split.
Explanation
Especially important with class imbalance: prevents rare classes from being under- or over-represented in test or validation sets.
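To make the mechanism concrete, here is a minimal standard-library sketch of a stratified split. The function name `stratified_split`, its parameters, and the per-class rounding rule are illustrative assumptions, not a reference implementation.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with per-class proportions roughly preserved."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction from every class (assumed rounding rule).
        n_test = round(len(indices) * test_fraction)
        if len(indices) > 1:
            # Keep at least one sample of the class on each side of the split.
            n_test = min(max(n_test, 1), len(indices) - 1)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

labels = ["neg"] * 90 + ["pos"] * 10  # 10% minority class
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
print(Counter(labels[i] for i in train_idx))
print(Counter(labels[i] for i in test_idx))
```

In this toy data set both the training and the test indices keep the 90/10 class ratio.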
Marketing Relevance
Stratified sampling is standard in train/test splits and K-Fold CV to ensure representative evaluations.
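As a sketch of how this usually looks in practice, assuming scikit-learn and NumPy are installed, the snippet below stratifies a train/test split and a 5-fold cross-validation. The toy data (90 negatives, 10 positives) is made up for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

# Toy data: 100 samples, 10% positive class.
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

# Passing stratify=y keeps the class ratio in both parts of the split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
print("test class counts:", np.bincount(y_test))

# StratifiedKFold keeps the class ratio in every fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_counts = [np.bincount(y[test_idx]) for _, test_idx in skf.split(X, y)]
for i, counts in enumerate(fold_counts):
    print(f"fold {i} class counts:", counts)
```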
Common Pitfalls
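Stratification breaks down with very rare classes: a class with fewer samples than the number of splits cannot appear in every split. Multi-label data requires dedicated methods such as iterative stratification, because each sample can belong to several classes at once.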
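One simple way to spot classes that are too rare to stratify is to compare each class count with the planned number of folds. The helper `classes_too_rare` below is a hypothetical name used only for this sketch.

```python
from collections import Counter

def classes_too_rare(labels, n_splits):
    """Return classes with fewer samples than n_splits, with their counts."""
    counts = Counter(labels)
    return {cls: n for cls, n in counts.items() if n < n_splits}

labels = ["a"] * 50 + ["b"] * 7 + ["c"] * 3
rare = classes_too_rare(labels, n_splits=5)
print(rare)  # {'c': 3}
```

Classes flagged this way could be merged, collected further, or evaluated with fewer folds, depending on the use case.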
Origin & History
The method comes from survey statistics (Neyman 1934). In ML, it became standard through scikit-learn: StratifiedKFold stratifies by design, while train_test_split stratifies only when the stratify argument is passed (it is not the default).
Comparisons & Differences
Stratified Sampling vs. Random Sampling
Random sampling can leave rare classes out of a split purely by chance; stratified sampling keeps each class at approximately its original proportion in every split.
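The risk with plain random sampling can be quantified. For a simple random sample drawn without replacement, the probability that it contains no item of a rare class follows the hypergeometric distribution. The sketch below uses made-up numbers (100 samples, 2 of the rare class, a 20-sample test set), and the function name is an assumption for illustration.

```python
from math import comb

def prob_class_missing(n_total, n_rare, n_sample):
    """P(a simple random sample without replacement contains no rare-class item)."""
    return comb(n_total - n_rare, n_sample) / comb(n_total, n_sample)

p = prob_class_missing(n_total=100, n_rare=2, n_sample=20)
print(f"{p:.3f}")  # 0.638
```

In this example a purely random 20% test split misses the rare class in roughly two out of three draws.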
Stratified Sampling vs. Oversampling
Stratified sampling preserves proportions; oversampling intentionally changes them to strengthen minority classes.
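For contrast, a minimal sketch of random oversampling shows how the class proportions change. The function `random_oversample` is a hypothetical helper written with the standard library only; real projects often use a dedicated library for this.

```python
import random
from collections import Counter

def random_oversample(labels, seed=0):
    """Return indices in which every class is resampled up to the majority count."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    target = max(len(indices) for indices in by_class.values())
    resampled = []
    for indices in by_class.values():
        resampled.extend(indices)
        # Draw the missing samples with replacement from the same class.
        resampled.extend(rng.choices(indices, k=target - len(indices)))
    return resampled

labels = ["neg"] * 90 + ["pos"] * 10
idx = random_oversample(labels)
print(Counter(labels[i] for i in idx))  # Counter({'neg': 90, 'pos': 90})
```

The two techniques are often combined: a stratified split first, then oversampling applied to the training portion only, so that the test set keeps the original distribution.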