If you’ve ever built a machine learning model and thought it was performing well, only to watch it collapse when tested on new data, you’re not alone... I’ve been there too, staring at a high accuracy score that felt almost too good to be true, only to realize it was. The problem usually wasn’t the model itself but the way I validated it. Without proper validation, you’re essentially training blind: hoping the model generalizes instead of knowing it does.
That’s where cross-validation becomes your safety net. It forces you to test your model on multiple slices of your data, revealing weaknesses you wouldn’t see otherwise. And here’s the kicker: choosing the right cross-validation strategy isn’t just a technical decision, it’s a strategic one. Different business problems call for different validation methods, and knowing which to use can save you from making bad decisions that look good on paper.
What you will learn: In this edition, we will explore cross-validation strategies and how to put them to work using scikit-learn’s model_selection module. By the time you’re done, you’ll know when to reach for K-Fold versus StratifiedKFold, how to handle grouped data with GroupKFold, and why TimeSeriesSplit is the only safe option for time-based problems. We’ll also walk through practical Python examples in Microsoft Fabric so you can see these strategies in action and apply them right away.
When data professionals first encounter cross-validation, the most natural starting point is K-Fold. Let's say you’ve got 1,000 rows of data. Instead of a single train-test split (say 80/20), K-Fold splits the dataset into k equal parts. If you choose k=5, then you train on 800 rows and test on 200. But you don’t stop there... you repeat the process five times, each time using a different 200 rows for testing. The result is five performance scores, which you can average for a more reliable estimate.
But here’s where nuance comes in. K-Fold splits rows without paying any attention to class balance. That works fine for balanced datasets but falls apart for imbalanced problems. Suppose you’re predicting fraud and only 3% of your data points are fraudulent transactions. In one of the folds, you might end up with hardly any (or even no) fraud cases in the test set. Suddenly, your model looks perfect on that fold... it predicts “no fraud” every time and still scores high. This false sense of security is dangerous, especially if your stakeholders are making business decisions based on these results.
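To see that pitfall in numbers, here’s a minimal sketch on a synthetic dataset with roughly 3% positives (the dataset and seed are my own illustration, so your exact figures will differ): it prints the positive-class share of each test fold under plain KFold, and you’ll typically see the rare class drift noticeably from fold to fold.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold

# Synthetic imbalanced data: about 3% positive ("fraud") cases
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, weights=[0.97, 0.03], random_state=7)
kf = KFold(n_splits=5, shuffle=True, random_state=7)
for i, (train_idx, test_idx) in enumerate(kf.split(X_demo), start=1):
    # Share of positive cases that landed in this test fold
    print(f"KFold fold {i}: positive rate in test = {y_demo[test_idx].mean():.3f}")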
That’s where StratifiedKFold becomes your ally. It ensures that each fold mirrors the overall class distribution. If your dataset has 3% fraud, every fold also has about 3% fraud. This keeps the evaluation fair and consistent across folds. In practice, I almost always reach for StratifiedKFold when dealing with classification tasks - it’s a small change in code but a massive improvement in reliability.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

np.random.seed(7)

# Synthetic imbalanced dataset: roughly 90% negatives, 10% positives
X, y = make_classification(
    n_samples=2500, n_features=25, n_informative=8, n_redundant=4,
    n_classes=2, weights=[0.9, 0.1], class_sep=1.5, flip_y=0.01, random_state=7
)

# Each fold keeps (approximately) the same class ratio as the full dataset
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)

# Scaling lives inside the pipeline, so the scaler is fit on the training fold only
pipe = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression(max_iter=500, random_state=7))])

accs, precs, recs, f1s, aucs = [], [], [], [], []
fold = 1
for train_idx, test_idx in skf.split(X, y):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    y_proba = pipe.predict_proba(X_test)[:, 1]
    accs.append(accuracy_score(y_test, y_pred))
    precs.append(precision_score(y_test, y_pred, zero_division=0))
    recs.append(recall_score(y_test, y_pred, zero_division=0))
    f1s.append(f1_score(y_test, y_pred, zero_division=0))
    aucs.append(roc_auc_score(y_test, y_proba))
    print(f"StratifiedKFold Fold {fold} | ACC={accs[-1]:.4f} PREC={precs[-1]:.4f} REC={recs[-1]:.4f} F1={f1s[-1]:.4f} AUC={aucs[-1]:.4f}")
    fold += 1

print(f"StratifiedKFold Mean | ACC={np.mean(accs):.4f} PREC={np.mean(precs):.4f} REC={np.mean(recs):.4f} F1={np.mean(f1s):.4f} AUC={np.mean(aucs):.4f}")
Now that you’ve seen how StratifiedKFold can handle class imbalance, let’s move on to something I run into quite often as a data scientist: grouped data. Imagine you’re predicting customer churn and your dataset has multiple rows per customer - for example, one row per subscription month. If you blindly use StratifiedKFold, some months from the same customer might end up in training and others in testing. The problem is that the model effectively “sees” the customer during training, making the test set less challenging. It’s like letting students practice with half the answers before an exam.
GroupKFold solves this by keeping entire groups together. You define the group and the cross-validation ensures no group is split across folds. This prevents leakage and forces your model to prove it can generalize across entirely unseen customers, not just unseen rows. I’ve observed that this is one of the most overlooked areas in practice. Teams often validate on row-level splits and then wonder why their model fails in production. The hidden culprit here is leakage through group overlap.
The best way to internalize this is with code. Suppose you have data on loan applications and each applicant can submit multiple applications. To evaluate your model fairly, you need to ensure all applications from the same person stay together. In the example below, notice how the groups control the splitting: no applicant ID appears in both train and test, which forces the model to learn across applicants, not just within them.
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

rng = np.random.default_rng(21)

# Simulate 80 applicants, each with between 3 and 24 loan applications
n_groups = 80
sizes = rng.integers(3, 25, size=n_groups)
groups = np.concatenate([np.full(s, i) for i, s in enumerate(sizes)])
n_samples = len(groups)

# Features plus an applicant-level base probability, so rows within a group are related
X = rng.normal(size=(n_samples, 18))
base_prob = rng.uniform(0.05, 0.5, size=n_groups)
y = np.concatenate([rng.binomial(1, base_prob[i], size=sizes[i]) for i in range(n_groups)])

# GroupKFold keeps every applicant's applications within a single fold
gkf = GroupKFold(n_splits=5)
clf = RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=21)

accs, precs, recs, f1s = [], [], [], []
fold = 1
for train_idx, test_idx in gkf.split(X, y, groups=groups):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    accs.append(accuracy_score(y_test, y_pred))
    precs.append(precision_score(y_test, y_pred, zero_division=0))
    recs.append(recall_score(y_test, y_pred, zero_division=0))
    f1s.append(f1_score(y_test, y_pred, zero_division=0))
    print(f"GroupKFold Fold {fold} | ACC={accs[-1]:.4f} PREC={precs[-1]:.4f} REC={recs[-1]:.4f} F1={f1s[-1]:.4f}")
    fold += 1

print(f"GroupKFold Mean | ACC={np.mean(accs):.4f} PREC={np.mean(precs):.4f} REC={np.mean(recs):.4f} F1={np.mean(f1s):.4f}")
Having completed our exploration of grouped data, let’s now turn to the world of time-series problems. Here, the rules change completely! If you’re working with time, you can’t randomly shuffle data. Doing so would let the model peek into the future, something it will never get to do in production. Time-series models must respect chronological order.
That’s exactly what TimeSeriesSplit handles. Instead of random splits, it creates expanding windows: you train on January - March, test on April; then train on January - April, test on May; and so on. Each new fold respects the flow of time. This setup mirrors deployment conditions where you’re always forecasting the future based on past data. Personally, I find this one of the most satisfying validation strategies because it feels so natural... your training data grows as you move forward, just like in reality.
Enough talk - let’s make it practical. Say you have 12 months of sales data. With 3 splits, TimeSeriesSplit creates three train-test setups where the training window grows each time and the test window always lies ahead in time; the short sketch below prints exactly those splits, and the fuller example after it trains two regressors on a longer synthetic series. Every fold respects causality: you never use “future” data to predict the past. For time-series problems, this is non-negotiable. A model validated with random shuffling might look brilliant but collapse once deployed. TimeSeriesSplit is your safeguard against that illusion.
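Here’s that 12-month, 3-split scenario as a minimal sketch (the month values are just dummy placeholders of my own): it prints the indices TimeSeriesSplit actually hands you. Note that, by default, it sizes the test window automatically, so each test block here covers three months rather than one.

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

months = np.arange(12)  # stand-in for 12 consecutive months of sales
tscv_demo = TimeSeriesSplit(n_splits=3)
for i, (train_idx, test_idx) in enumerate(tscv_demo.split(months), start=1):
    # The training window always ends where the test window begins
    print(f"Split {i}: train months {train_idx.tolist()} -> test months {test_idx.tolist()}")

Notice how each training window contains everything before its test window and nothing after it. Now for the fuller example: a synthetic series with trend, seasonality and noise, turned into a supervised problem with lag features.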
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Synthetic series: level + trend + seasonal cycle + noise
n = 360
t = np.arange(n)
trend = 0.1 * t
seasonal = 3.5 * np.sin(2 * np.pi * t / 24)
noise = np.random.default_rng(33).normal(scale=1.5, size=n)
series = 20 + trend + seasonal + noise

def make_supervised(series, n_lags=24):
    # Predict each value from its previous n_lags values
    X_list = []
    y_list = []
    for i in range(n_lags, len(series)):
        X_list.append(series[i-n_lags:i])
        y_list.append(series[i])
    return np.array(X_list), np.array(y_list)

X, y = make_supervised(series, n_lags=24)

# Expanding-window splits: each fold trains on everything before its test window
tscv = TimeSeriesSplit(n_splits=5)
lr = LinearRegression()
rg = Ridge(alpha=1.0, random_state=33)

mae_lr, rmse_lr, mae_rg, rmse_rg = [], [], [], []
fold = 1
for train_idx, test_idx in tscv.split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    lr.fit(X_train, y_train)
    y_pred_lr = lr.predict(X_test)
    rg.fit(X_train, y_train)
    y_pred_rg = rg.predict(X_test)
    mae_lr.append(mean_absolute_error(y_test, y_pred_lr))
    rmse_lr.append(np.sqrt(mean_squared_error(y_test, y_pred_lr)))  # RMSE, avoiding the deprecated squared=False argument
    mae_rg.append(mean_absolute_error(y_test, y_pred_rg))
    rmse_rg.append(np.sqrt(mean_squared_error(y_test, y_pred_rg)))
    print(f"TimeSeriesSplit Fold {fold} | LR MAE={mae_lr[-1]:.3f} RMSE={rmse_lr[-1]:.3f} | Ridge MAE={mae_rg[-1]:.3f} RMSE={rmse_rg[-1]:.3f}")
    fold += 1

print(f"TimeSeriesSplit Mean | LR MAE={np.mean(mae_lr):.3f} RMSE={np.mean(rmse_lr):.3f} | Ridge MAE={np.mean(mae_rg):.3f} RMSE={np.mean(rmse_rg):.3f}")
So the next time you open your notebook, don’t settle for a single train-test split. Stop and ask: what does my data truly demand? Then act on it. Because the best data professionals aren’t the ones who only build models - they’re the ones who ensure their models can stand strong in the real world. Today, you know how to do exactly that. Now it’s time to put it into practice.
Thanks for taking the time to read my post! I’d love to hear what you think and connect with you 🙂