Traditional Backtesting is Outdated. Use CPCV Instead
- Nikhil Adithyan
- Apr 6
- 9 min read
Exploring a potential alternative to traditional backtesting with Python

In quantitative finance, backtesting is an essential tool for strategy validation. It is based on a simple premise: We simulate a trading strategy using historical data and use the results to gauge its potential future performance. However, beneath the simplicity lies a host of pitfalls that can render backtest results misleading or worthless when deployed in live markets.
In this article, we make a case for a CPCV-based framework as a robust alternative to traditional backtesting. To ensure the credibility of our results, we rely on high-quality historical data from EODHD’s API; data quality matters because even the most sophisticated validation technique cannot compensate for flawed inputs. We compare the two approaches to argue that CPCV should become the new gold standard in financial model validation.
Traditional Backtesting
Traditional backtesting methods, particularly the widely adopted train/test split approaches, suffer from the following critical flaws:
Overfitting — where the model learns patterns specific to the training set that do not generalize
Look-ahead bias — where future information inadvertently leaks into the training phase
Data snooping — where we repeatedly tweak the model until we get favorable results
Ignoring market dynamics — where we assume that past structures and relationships will persist unchanged
One of the most commonly used approaches is the 80/20 split, where we train the model on the first 80% of the data and test it on the most recent 20%. This approach inherently assumes that the future will behave like the recent past. Worse, it biases the strategy towards patterns that dominated the most recent regime, potentially discarding valuable signals from earlier periods. A common workaround is random sampling, but that introduces a new problem: the model peeks into future data that would never have been available in a real-world setting.
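To make the contrast concrete, here is a minimal sketch of the two splitting styles, assuming a hypothetical DataFrame df indexed chronologically; the shuffled variant mixes future rows into the training set:

from sklearn.model_selection import train_test_split

# Chronological 80/20 split: the test set strictly follows the training set in time
split_idx = int(len(df) * 0.8)
train_df, test_df = df.iloc[:split_idx], df.iloc[split_idx:]

# Random 80/20 split: "future" rows leak into training,
# something a live strategy could never have seen
train_rnd, test_rnd = train_test_split(df, test_size=0.2, shuffle=True, random_state=42)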
Combinatorial Purged Cross Validation
Traditional backtesting techniques often rely on a single or fixed number of train/test splits, leading to a biased and fragile model performance assessment. This challenge is even more pronounced in financial time series where market regimes shift and autocorrelation is prevalent. Consequently, we tend to overstate the strategy’s robustness. To address these challenges, we explore Combinatorial Purged Cross-Validation.
What is CPCV?
CPCV is a cross-validation scheme for time-dependent data. It was introduced to overcome challenges such as overfitting to a narrow regime, information leakage from overlapping samples, and temporal bias from naïve splits. It solves these challenges by generating multiple train-test combinations from sequential blocks of data, purging, applying an embargo period, and averaging performance across splits.
Multiple Train-Test Combinations from Sequential Blocks of Data
CPCV does not rely on a single train-test split. Instead, it divides the dataset into several sequential, non-overlapping blocks and then forms multiple combinations in which some blocks serve as the training set and others as the test set. This ensures that the model is evaluated across different historical segments, capturing various market behaviors and reducing dependence on any specific time window.
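As a quick illustration of how many combinations this yields: with six sequential blocks and two of them held out for testing, there are C(6, 2) = 15 distinct splits, the same count the implementation later in this article produces. The snippet below only enumerates block labels:

import itertools as itt

n_splits, n_test_splits = 6, 2
blocks = list(range(n_splits))

# Every way of choosing 2 of the 6 sequential blocks as the test set
combos = list(itt.combinations(blocks, n_test_splits))
print(len(combos))  # 15 train/test combinations
print(combos[:3])   # [(0, 1), (0, 2), (0, 3)]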
Purging
When dealing with time series data, information can bleed between the training and test sets when their windows are too close or overlap. In such cases, we implement purging: removing data points from the training set that are too close in time to the test set. This is especially important for data points whose outcomes might influence the test window due to delayed effects. Purging prevents look-ahead bias and ensures a clean separation between training and testing data. We implement it using the Python code below.
def split_combinatorial_purged_cv(X: pd.DataFrame,
                                  pred_times: pd.Series,
                                  eval_times: pd.Series,
                                  n_splits: int = 6,
                                  n_test_splits: int = 2,
                                  embargo_td: pd.Timedelta = pd.Timedelta(days=1)) -> List[Tuple[np.ndarray, np.ndarray]]:
    """Create CPCV splits"""
    if not X.index.equals(pred_times.index) or not X.index.equals(eval_times.index):
        raise ValueError("Indices must match")
    indices = np.arange(X.shape[0])
    fold_bounds = [(fold[0], fold[-1] + 1) for fold in np.array_split(indices, n_splits)]
    selected_fold_bounds = list(itt.combinations(fold_bounds, n_test_splits))
    selected_fold_bounds.reverse()
    splits = []
    for fold_bound_list in selected_fold_bounds:
        test_indices, test_bounds = compute_test_set(indices, fold_bound_list)
        train_indices = compute_train_set(indices, pred_times, eval_times, test_bounds, test_indices, embargo_td)
        if len(train_indices) > 0 and len(test_indices) > 0:
            splits.append((train_indices, test_indices))
    return splits
The function prevents the model from accidentally learning from future information. It removes any training data that overlaps with the period used for testing. In so doing, it ensures that the model never trains on observations whose information overlaps with the test window, making the evaluation more realistic.
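The splitter delegates to two helpers, compute_test_set and compute_train_set, which the article does not list. Below is a minimal sketch of what they might look like; it assumes that test_bounds is simply the list of (start, end) fold boundaries, that purging drops any training row whose evaluation window overlaps a test block, and it relies on the embargo helper defined in the next subsection:

from typing import List, Tuple
import numpy as np
import pandas as pd

def compute_test_set(indices: np.ndarray,
                     fold_bound_list) -> Tuple[np.ndarray, List[Tuple[int, int]]]:
    """Collect the row indices covered by the chosen test folds."""
    test_indices = np.empty(0, dtype=int)
    for start, end in fold_bound_list:
        test_indices = np.union1d(test_indices, indices[start:end])
    return test_indices, list(fold_bound_list)

def compute_train_set(indices: np.ndarray,
                      pred_times: pd.Series,
                      eval_times: pd.Series,
                      test_bounds: List[Tuple[int, int]],
                      test_indices: np.ndarray,
                      embargo_td: pd.Timedelta) -> np.ndarray:
    """Everything outside the test folds, after purging and embargoing."""
    train_indices = np.setdiff1d(indices, test_indices)
    for test_start, test_end in test_bounds:
        # Purge: drop training rows whose evaluation window overlaps this test block
        block_start = pred_times.iloc[test_start]
        block_end = eval_times.iloc[test_end - 1]
        overlap = ((eval_times >= block_start) & (pred_times <= block_end)).to_numpy()
        train_indices = np.setdiff1d(train_indices, indices[overlap])
        # Embargo: drop training rows that begin too soon after this test block
        train_indices = embargo(train_indices, indices, pred_times, eval_times,
                                test_indices, test_end - 1, embargo_td)
    return train_indices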
Applying an Embargo Period
Sometimes, even after purging, residual information may linger around the boundaries of the test set. CPCV addresses this problem by applying an embargo period — a time buffer after the test window during which no data is used for training. This prevents the model from indirectly learning from the market’s reaction immediately after the test set, which would be unavailable in real-time decision-making. The following code applies an embargo period.
def embargo(train_indices: np.ndarray,
            indices: np.ndarray,
            pred_times: pd.Series,
            eval_times: pd.Series,
            test_indices: np.ndarray,
            test_end: int,
            embargo_td: pd.Timedelta) -> np.ndarray:
    """Apply embargo period"""
    if len(test_indices[test_indices <= test_end]) == 0:
        return train_indices
    test_end = min(test_end, len(pred_times) - 1)
    last_test_eval_time = eval_times.iloc[test_indices[test_indices <= test_end]].max()
    embargo_limit = last_test_eval_time + embargo_td
    allowed_indices = indices[(pred_times < pred_times.iloc[test_end]) | (pred_times > embargo_limit)]
    return np.intersect1d(train_indices, allowed_indices)
The function adds a safety buffer after the test period to prevent the model from learning from data that is too close to it. This is essential because in a time series, events just after the test set may still be influenced by it. To resolve the problem, we skip a short period (the embargo) between the test and training data to reduce the risk of picking up hidden patterns that could leak future information and distort model performance.
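To see the effect on a toy example: the snippet below, a sketch that assumes the embargo function above and ten synthetic daily rows, shows the rows immediately after a two-row test block being dropped when a two-day embargo is applied.

import numpy as np
import pandas as pd

# Ten synthetic daily observations; each label is evaluated one day after its prediction time
dates = pd.date_range("2024-01-01", periods=10, freq="D")
pred_times = pd.Series(dates, index=dates)
eval_times = pd.Series(dates + pd.Timedelta(days=1), index=dates)
indices = np.arange(10)

test_indices = np.array([4, 5])                      # rows 4-5 form the test block
train_indices = np.setdiff1d(indices, test_indices)  # everything else starts as training data

trimmed = embargo(train_indices, indices, pred_times, eval_times,
                  test_indices, test_end=5, embargo_td=pd.Timedelta(days=2))
print(trimmed)  # [0 1 2 3 9]: rows 6-8 fall inside the embargo window and are removed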
Averaging Performance Across Splits
Once all the train-test combinations have been evaluated, CPCV computes performance metrics for each split, such as accuracy, Sharpe ratio, and drawdown. It then averages these metrics or analyzes them as a distribution. This aggregation provides a more reliable and generalizable estimate of the model’s true performance.
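A minimal sketch of this aggregation step, assuming the splits returned by split_combinatorial_purged_cv above and the scaled feature matrix, labels, and MLP model built in the implementation section below:

import numpy as np
from sklearn.metrics import accuracy_score

# `splits`, `X_scaled`, `y`, and `model` are assumed to exist as in the implementation below
fold_scores = []
for train_idx, test_idx in splits:
    model.fit(X_scaled.iloc[train_idx], y.iloc[train_idx])
    preds = model.predict(X_scaled.iloc[test_idx])
    fold_scores.append(accuracy_score(y.iloc[test_idx], preds))

fold_scores = np.array(fold_scores)
print(f"Accuracy across folds: {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}")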
Python Implementation
1. Importing Packages
We start by importing the necessary libraries as follows
import numpy as np
import pandas as pd
import requests
import matplotlib.pyplot as plt
from datetime import datetime, timedelta
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)
from itertools import combinations
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
import itertools as itt
from typing import Tuple, List, Dict, Union
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from sklearn.utils.class_weight import compute_class_weight
import seaborn as sns
from tqdm import tqdm
2. Extracting data via EODHD API
Next, we use EODHD’s API to extract Apple Inc.’s historical data. Make sure to use the appropriate API key.
def get_historical_data(symbol, api_token, years=10):
    end_date = datetime.today()
    start_date = end_date - timedelta(days=365 * years)
    start_date_str = start_date.strftime('%Y-%m-%d')
    url = f'https://eodhistoricaldata.com/api/eod/{symbol}?api_token={api_token}&from={start_date_str}&fmt=json'
    response = requests.get(url)
    data = response.json()
    df = pd.DataFrame(data)
    df['date'] = pd.to_datetime(df['date'])
    df.set_index('date', inplace=True)
    df.sort_index(inplace=True)
    return df

api_key = 'YOUR_API_KEY'
symbol = 'AAPL'
df = get_historical_data(symbol, api_key)
print(df.tail())
We extract data for the last ten years to ensure we have sufficient data points for deploying purging and embargoing. The code successfully extracts the 10-year historical data of AAPL, which looks like this:

3. Feature engineering
After extracting the historical data, we engineer several features that we will use in both the traditional backtesting and CPCV-based frameworks. The following code computes common technical indicators:
# Functions to compute various technical indicators
def compute_rsi(series, window):
    """Compute Relative Strength Index"""
    delta = series.diff()
    gain = delta.where(delta > 0, 0)
    loss = -delta.where(delta < 0, 0)
    avg_gain = gain.rolling(window).mean()
    avg_loss = loss.rolling(window).mean()
    rs = avg_gain / avg_loss
    rsi = 100 - (100 / (1 + rs))
    return rsi

def create_features(df):
    """Create comprehensive technical indicators"""
    df['returns'] = df['close'].pct_change()
    df['log_returns'] = np.log1p(df['returns'])
    df['volatility_5'] = df['returns'].rolling(5).std()
    df['volatility_10'] = df['returns'].rolling(10).std()
    windows = [5, 10, 12, 20, 26, 50]  # 12 and 26 are needed for the MACD below
    for w in windows:
        df[f'sma_{w}'] = df['close'].rolling(w).mean()
        df[f'ema_{w}'] = df['close'].ewm(span=w, adjust=False).mean()
    df['rsi_14'] = compute_rsi(df['close'], 14)
    df['macd'] = df['ema_12'] - df['ema_26']
    df['momentum_5'] = df['close'] / df['close'].shift(5) - 1
    if 'volume' in df.columns:
        df['volume_ma_5'] = df['volume'].rolling(5).mean()
        df['volume_ma_10'] = df['volume'].rolling(10).mean()
        df['volume_change'] = df['volume'].pct_change()
    df['signal'] = np.where(df['returns'].shift(-1) > 0, 1, 0)
    df.dropna(inplace=True)
    return df
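The article does not show the call that builds these columns; presumably the feature set is generated with a one-liner along the following lines (a sketch assuming the df returned by get_historical_data above):

# Build the feature matrix on the downloaded price history (hypothetical usage)
df = create_features(df)
print(df.columns.tolist())  # inspect the engineered feature names
print(df.shape)             # rows remaining after dropna()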
The code returns the following features:

4. Traditional Backtesting
In this section, we simulate a traditional backtesting workflow by training a Multi-Layer Perceptron model to predict market direction based on the technical indicators we computed from historical price data. The workflow is as follows.
def train_mlp_and_trade(df, threshold=0.55, initial_cash=10_000):
    # Step 1: Define feature columns
    exclude_cols = [
        'signal', 'returns', 'log_returns', 'volatility_5', 'volatility_10',
        'sma_12', 'ema_12', 'sma_26', 'ema_26', 'macd', 'momentum_5',
        'volume_ma_5', 'volume_ma_10'
    ]
    feature_cols = [col for col in df.columns if col not in exclude_cols]
    X = df[feature_cols]
    y = df['signal']

    # Step 2: Scale features
    scaler = StandardScaler()
    X_scaled = pd.DataFrame(scaler.fit_transform(X), index=X.index, columns=X.columns)

    # Step 3: Train-test split
    split_idx = int(len(df) * 0.8)
    X_train, X_test = X_scaled.iloc[:split_idx], X_scaled.iloc[split_idx:]
    y_train, y_test = y.iloc[:split_idx], y.iloc[split_idx:]
    price_test = df['close'].iloc[split_idx:]

    # Step 4: Train MLP model
    model = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=42)
    model.fit(X_train, y_train)

    # Step 5: Evaluate accuracy
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Traditional Backtest Accuracy: {accuracy:.4f}")

    # Step 6: Predict probabilities for trading signal
    y_proba = model.predict_proba(X_test)[:, 1]
    signals = (y_proba > threshold).astype(int)
    buy_signals = np.sum(signals == 1)
    sell_signals = np.sum(signals == 0)
    print(f"Buy signals: {buy_signals}, Sell signals: {sell_signals}")

    # Step 7: Simulate trading strategy
    cash = initial_cash
    position = 0
    portfolio_values = []
    for i in range(len(signals)):
        price = price_test.iloc[i]
        if signals[i] == 1 and cash > 0:
            position = cash / price  # Buy
            cash = 0
        elif signals[i] == 0 and position > 0:
            cash = position * price  # Sell
            position = 0
        portfolio_value = cash + position * price
        portfolio_values.append(portfolio_value)

    # Step 8: Report performance
    final_value = portfolio_values[-1]
    returns = (final_value - initial_cash) / initial_cash * 100
    print(f"Final portfolio value: ${final_value:.2f}")
    print(f"Total return: {returns:.2f}%")
    return portfolio_values

# Run the function
portfolio_values = train_mlp_and_trade(df)
We use a subset of engineered features while excluding noisy or redundant ones. We split the data into training and testing sets using a fixed 80/20 time-based split — a common practice in financial modeling.
We use class probabilities to make buy/sell decisions instead of using binary predictions directly. Next, we simulate trading with an initial capital of $10,000, track the portfolio value over time, and report final returns.
The results were as follows:

Using a traditional 80/20 split and an MLPClassifier trained on engineered technical features, we executed trades based on the model’s predicted probabilities. Applying a threshold of 0.55 to generate buy signals, the simulated strategy achieved a 321% return on an initial $10,000 investment. While promising, this result reinforces the importance of robust validation, which we put to the test next with Combinatorial Purged Cross-Validation (CPCV).
5. CPCV-based Backtesting
In this section, we implement a more robust framework using combinatorial purged cross-validation. Unlike a single train-test split, CPCV evaluates the strategy over multiple train/test combinations built from sequential blocks, so every block of the data is used for both training and testing. The MLP classifier is trained on the various combinations of sequential data blocks, as illustrated in the following code.
def cpcv_trading_simulation(df, threshold=0.55, initial_cash=10_000, n_splits=6, n_test_splits=2):
    from sklearn.preprocessing import StandardScaler
    from sklearn.neural_network import MLPClassifier

    # Step 1: Prepare features and target
    exclude_cols = [
        'signal', 'returns', 'log_returns', 'volatility_5', 'volatility_10',
        'sma_12', 'ema_12', 'sma_26', 'ema_26', 'macd', 'momentum_5',
        'volume_ma_5', 'volume_ma_10'
    ]
    feature_cols = [col for col in df.columns if col not in exclude_cols]
    X = df[feature_cols]
    y = df['signal']
    prices = df['close']

    # Step 2: Scale features
    scaler = StandardScaler()
    X_scaled = pd.DataFrame(scaler.fit_transform(X), index=X.index, columns=X.columns)

    # Step 3: Generate CPCV splits
    pred_times = pd.Series(X_scaled.index, index=X_scaled.index)
    eval_times = pd.Series(X_scaled.index + pd.Timedelta(days=1), index=X_scaled.index)
    splits = split_combinatorial_purged_cv(X_scaled, pred_times, eval_times,
                                           n_splits=n_splits, n_test_splits=n_test_splits)

    total_return = 0
    all_portfolio_values = []
    all_buy_signals = 0
    all_sell_signals = 0
    model = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=42)

    for fold_num, (train_idx, test_idx) in enumerate(splits, 1):
        # Subsets
        X_train, y_train = X_scaled.iloc[train_idx], y.iloc[train_idx]
        X_test, y_test = X_scaled.iloc[test_idx], y.iloc[test_idx]
        price_test = prices.iloc[test_idx]

        # Train model
        model.fit(X_train, y_train)
        y_proba = model.predict_proba(X_test)[:, 1]
        signals = (y_proba > threshold).astype(int)

        # Trade simulation
        cash = initial_cash
        position = 0
        portfolio_values = []
        for i in range(len(signals)):
            price = price_test.iloc[i]
            if signals[i] == 1 and cash > 0:
                position = cash / price
                cash = 0
                all_buy_signals += 1
            elif signals[i] == 0 and position > 0:
                cash = position * price
                position = 0
                all_sell_signals += 1
            portfolio_values.append(cash + position * price)

        # Track result from last test point of this fold
        final_fold_value = portfolio_values[-1]
        fold_return = (final_fold_value - initial_cash) / initial_cash * 100
        total_return += fold_return
        all_portfolio_values.append(final_fold_value)
        print(f"Fold {fold_num} Return: {fold_return:.2f}%, Final Value: ${final_fold_value:.2f}")

    # Final report
    avg_final_value = np.mean(all_portfolio_values)
    avg_return = total_return / len(splits)
    print(f"\nCPCV Strategy Summary:")
    print(f"Total Folds: {len(splits)}")
    print(f"Avg Final Portfolio Value: ${avg_final_value:.2f}")
    print(f"Avg Return per Fold: {avg_return:.2f}%")
    print(f"Buy signals: {all_buy_signals}, Sell signals: {all_sell_signals}")
    return all_portfolio_values

cpcv_values = cpcv_trading_simulation(df)
We use purging and embargo periods to prevent data leakage. For each split, we rely on the MLP probability outputs to generate trading signals, entering a position when the model is sufficiently confident. The results were as follows.

The Combinatorial Purged Cross-Validation (CPCV) strategy was tested across 15 unique train-test fold combinations. For each fold, the model made out-of-sample predictions using an MLP classifier and traded based on the model’s confidence levels. The trading logic allowed the model to simulate buying when it was confident of an upward move and selling otherwise.
The returns were remarkably high, with some folds yielding over 1,700% growth. The wide variance across folds underscores the value of CPCV: it exposes how performance shifts across different market conditions and mitigates the risk of overly optimistic single-split evaluations. This makes a strong case for CPCV as a statistically robust, leak-resistant alternative to traditional backtesting.
Conclusion
In this article, we set out to demonstrate the limitations of traditional backtesting and the need to shift to a more robust validation framework using combinatorial purged cross-validation. We trained an MLP model using the traditional 80/20 split. The strategy’s trading performance was impressive, producing a final portfolio value of $42,141.12, a 321% return on the initial investment.
However, when we applied CPCV, the difference was striking. The model was trained on 15 unique folds, each with purging and an embargo to prevent data leakage. The average final portfolio value across folds was $104,707, with some folds reaching over $180,000, for an average return of 947.07% per fold. These results were accompanied by far more active and balanced buy/sell behavior: while the traditional approach generated 136 buy and 96 sell signals, CPCV produced 1,447 and 1,443, respectively. On these measures, the CPCV-validated strategy far outperforms the traditional approach.
With that being said, you’ve reached the end of the article. Hope you learned something new and useful today. Thank you very much for your time.