# How to Split a Dataset for Machine Learning
Most supervised ML workflows need three data splits: training data to learn patterns, validation data to tune hyperparameters, and test data to evaluate final performance. Getting the ratios wrong leads to overfitting, unreliable metrics, or wasted data.
## Recommended Split Ratios by Dataset Size
| Dataset Size | Recommended Split | Why |
|---|---|---|
| < 1,000 | Use cross-validation | Too small for a fixed split — k-fold gives better estimates |
| 1,000 – 10,000 | 70 / 15 / 15 | Larger val/test sets for reliable evaluation |
| 10,000 – 100,000 | 80 / 10 / 10 | Standard split — enough data for all three sets |
| 100,000 – 1M | 90 / 5 / 5 | Even 5% gives 5K+ samples for evaluation |
| > 1M | 98 / 1 / 1 | 1% still gives 10K+ samples — maximize training data |
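The two-stage logic behind these ratios can be sketched in plain NumPy: shuffle the indices once, then carve off the train, validation, and test partitions. This is a minimal illustration (libraries such as scikit-learn offer a `train_test_split` helper that does the same job); the function name and fractions here are our own.

```python
import numpy as np

def split_indices(n, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle 0..n-1 and carve out train/val/test index arrays.

    The remainder after train and val becomes the test split,
    so the three partitions are disjoint and cover all n samples.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]
    return train, val, test

# 50,000 samples falls in the 10K–100K row, so use 80/10/10
train, val, test = split_indices(50_000)
```

Fixing the random seed makes the split reproducible, which matters when you compare models trained at different times against the same held-out data.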
## Stratified vs. Random Splitting
Random splitting shuffles all samples and assigns them to splits at random. It works well when your data is balanced and IID (independent and identically distributed).
Stratified splitting preserves the class distribution in each split. If your dataset is 90% class A and 10% class B, each split will maintain that 90/10 ratio. Always use stratified splits for imbalanced classification problems.
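One way to implement stratification is to split each class separately and then pool the results, so every split inherits the overall class proportions. The sketch below assumes NumPy and a hypothetical `stratified_split` helper; scikit-learn users can get the same effect by passing `stratify=y` to `train_test_split`.

```python
import numpy as np

def stratified_split(y, test_frac=0.2, seed=0):
    """Return (train_idx, test_idx) preserving each class's proportion."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for cls in np.unique(y):
        # Shuffle this class's indices, then take test_frac of them for test
        cls_idx = rng.permutation(np.flatnonzero(y == cls))
        n_test = int(round(len(cls_idx) * test_frac))
        test_idx.extend(cls_idx[:n_test])
        train_idx.extend(cls_idx[n_test:])
    return np.array(train_idx), np.array(test_idx)

# 90/10 imbalanced labels — both splits keep the 90/10 class ratio
y = np.array([0] * 900 + [1] * 100)
train, test = stratified_split(y)
```

With a purely random split on this data, an unlucky shuffle could leave the test set with very few minority-class samples; stratification guarantees it contains the expected 10%.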
## Common Mistakes
- Data leakage: Never normalize or preprocess data before splitting. Fit scalers on training data only, then apply to val/test.
- Temporal leakage: For time-series data, split chronologically — don't shuffle. Future data must never appear in training.
- Peeking at test data: Only evaluate on the test set once, at the very end. Use the validation set for all tuning decisions.
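The first point above — fit preprocessing on training data only — looks like this in practice. This is a minimal NumPy sketch of standardization; the variable names are ours, and scikit-learn's `StandardScaler` (`fit` on train, `transform` on val/test) follows the same pattern.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))
X_train, X_val = X[:800], X[800:]

# Compute normalization statistics from the TRAINING split only
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

X_train_scaled = (X_train - mean) / std
# Reuse the train statistics on validation data — no information
# from the held-out samples leaks into the preprocessing step
X_val_scaled = (X_val - mean) / std
```

Computing `mean` and `std` over all of `X` before splitting would let test-set statistics influence the transformed training data, which is exactly the leakage the bullet warns against.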
For splitting amounts by ratio or percentage, use our ratio calculator or percentage split calculator.