April 4, 2026 18 min read Tutorial Python

How to Backtest a Trading Strategy in Python

A trading idea is worthless until you test it against historical data. This tutorial walks you through building a complete momentum backtest from scratch using Python — from downloading data to computing performance metrics and spotting the pitfalls that make most backtests unreliable.

Prerequisites

You need three Python libraries installed before we begin. All three are available on PyPI:

pip install pandas numpy yfinance matplotlib

pandas handles tabular data and time series. numpy provides fast numerical computation. yfinance downloads historical OHLCV (open, high, low, close, volume) data from Yahoo Finance. matplotlib handles plotting.

This tutorial assumes basic Python fluency — you should be comfortable with DataFrames, indexing, and simple arithmetic operations on Series. No prior quantitative finance knowledge is required.

Step 1: Download Historical Data with yfinance

The first step in any backtest is acquiring clean historical data. We will use SPY (the SPDR S&P 500 ETF Trust) as our test instrument. SPY is the most liquid ETF in the world by trading volume, which means its historical prices are reliable and minimally affected by bid-ask bounce.

import pandas as pd
import numpy as np
import yfinance as yf
import matplotlib.pyplot as plt

# Download daily OHLCV data for SPY
# auto_adjust=False keeps both Close and Adj Close columns
# (recent yfinance versions adjust prices in place by default)
data = yf.download("SPY", start="2020-01-01", end="2025-01-01", auto_adjust=False)

# Inspect the first few rows
print(data.head())
print(f"\nShape: {data.shape}")
print(f"Date range: {data.index[0].date()} to {data.index[-1].date()}")

The yf.download() function returns a pandas DataFrame indexed by date, with columns Open, High, Low, Close, Adj Close, and Volume. For backtesting, you should use the Adj Close (adjusted close) column, which accounts for dividends and stock splits. Computing returns from the unadjusted Close column will introduce errors, because raw closes show spurious jumps on split and ex-dividend dates. Two caveats with recent yfinance versions: auto_adjust now defaults to True, which adjusts the OHLC columns in place and drops Adj Close entirely (pass auto_adjust=False to yf.download() to keep it), and single-ticker downloads may come back with MultiIndex column headers, which you can flatten with data.columns = data.columns.droplevel(1).
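To make the adjustment mechanics concrete, here is a toy sketch of how back-adjustment works, using made-up prices and a single $1 dividend. This is a simplification of what data vendors actually do, not yfinance's internals:

```python
import pandas as pd

# On each ex-dividend date, all *earlier* prices are scaled down by
# (1 - dividend / previous close) so that percent changes in the
# adjusted series include the payout.
close = pd.Series([100.0, 101.0, 99.0, 100.0],
                  index=pd.date_range("2024-01-02", periods=4))
dividends = pd.Series([0.0, 0.0, 1.0, 0.0], index=close.index)  # $1 paid on day 3

per_date_factor = (1 - dividends / close.shift(1)).fillna(1.0)

# Product of the factors for all dates strictly after t, applied backwards:
# the most recent price is left unadjusted.
cum_factor = per_date_factor[::-1].cumprod()[::-1].shift(-1).fillna(1.0)
adj_close = close * cum_factor

print(adj_close.round(4).tolist())
```

Across the ex-dividend date the adjusted series shows roughly a -1% move (the price drop net of the $1 payout) where the raw closes would show -1.98%, which is exactly the error the Adj Close column prevents.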

# Use adjusted close for all calculations
price = data["Adj Close"].copy()

# Calculate daily returns
daily_returns = price.pct_change().dropna()

print(f"Mean daily return: {daily_returns.mean():.6f}")
print(f"Daily volatility:  {daily_returns.std():.6f}")
print(f"Trading days:      {len(daily_returns)}")

Data Quality Note

yfinance pulls data from Yahoo Finance, which is adequate for learning and prototyping. For production backtesting, consider data vendors that provide survivorship-bias-free, point-in-time data with proper corporate action adjustments. Yahoo Finance data occasionally has gaps or errors, particularly for delisted securities.

Step 2: Generate Trading Signals

We will implement a dual moving average crossover strategy, one of the most well-known trend-following approaches. The logic is straightforward: buy when a shorter-term moving average crosses above a longer-term moving average, and sell when it crosses below.

Specifically, we use the 50-day and 200-day simple moving averages (SMAs). When the 50-day SMA crosses above the 200-day SMA, this is commonly called a “golden cross” and is interpreted as a bullish signal. When the 50-day crosses below the 200-day, it is called a “death cross” and is interpreted as bearish.

# Compute moving averages
data["SMA_50"] = price.rolling(window=50).mean()
data["SMA_200"] = price.rolling(window=200).mean()

# Generate signal: 1 = long, 0 = flat
# Buy when 50 MA > 200 MA, sell when 50 MA < 200 MA
data["signal"] = 0
data.loc[data["SMA_50"] > data["SMA_200"], "signal"] = 1

# Shift signal by 1 day to avoid lookahead bias
# The signal on day T is based on day T's close prices,
# so we can only act on it at day T+1's open
data["position"] = data["signal"].shift(1)

# Drop rows where we don't have enough data for the 200-day MA
data = data.dropna()

# Count trades (transitions between 0 and 1)
trades = (data["position"].diff().abs() > 0).sum()
print(f"Number of trades (entries + exits): {trades}")
print(f"Days in position: {data['position'].sum():.0f} / {len(data)}")

Lookahead Bias

The .shift(1) in the signal-generation code above is critical. Without it, you are using today’s closing prices to generate a signal and then pretending you traded at today’s close — which is impossible in practice. You can only observe the close after the market has closed, so the earliest you could act is the next trading day. Failing to shift signals is the single most common backtesting mistake.
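To see how large the distortion can be, here is a small synthetic demonstration (fabricated random returns, not market data). A “signal” that is long whenever today’s return is positive has no predictive power at all, yet without the shift it appears wildly profitable:

```python
import numpy as np
import pandas as pd

# Zero-mean noise: no real strategy can make money on this.
rng = np.random.default_rng(0)
returns = pd.Series(rng.normal(0.0, 0.01, 1000))

# Deliberately broken "signal": long whenever today's return is positive.
signal = (returns > 0).astype(int)

biased = (signal * returns).sum()           # trade at today's close: impossible
honest = (signal.shift(1) * returns).sum()  # trade the next day: realistic

print(f"with lookahead: {biased:.2f}, properly shifted: {honest:.2f}")
```

The unshifted version harvests every positive day and skips every negative one, so its total return is large and positive by construction, while the shifted version hovers around zero — exactly what you should expect from noise.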

Step 3: Simulate Trades

With our position series in hand, simulating the strategy’s returns is a single line of vectorized pandas code. When we are in position (position = 1), we earn the daily return of SPY. When we are flat (position = 0), we earn nothing (we assume cash earns zero for simplicity).

# Daily returns of the asset
data["asset_return"] = data["Adj Close"].pct_change()

# Strategy returns: asset return when in position, 0 when flat
data["strategy_return"] = data["position"] * data["asset_return"]

# Buy-and-hold benchmark for comparison
data["benchmark_return"] = data["asset_return"]

# Cumulative returns (growth of $1)
data["strategy_cumulative"] = (1 + data["strategy_return"]).cumprod()
data["benchmark_cumulative"] = (1 + data["benchmark_return"]).cumprod()

print(f"Strategy total return:  {data['strategy_cumulative'].iloc[-1] - 1:.2%}")
print(f"Benchmark total return: {data['benchmark_cumulative'].iloc[-1] - 1:.2%}")

This vectorized approach is much faster than looping through rows one at a time. For SPY over five years, we have roughly 1,250 trading days. Vectorized computation handles this instantly; an explicit Python loop would also be fast at this scale, but the habit of vectorizing matters when you scale to thousands of securities.
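If you want to see the gap for yourself, here is a quick, self-contained timing sketch on a synthetic array of one million rows (timings will vary by machine):

```python
import time
import numpy as np

# Same computation two ways on a panel-sized array.
n = 1_000_000
rng = np.random.default_rng(0)
position = rng.integers(0, 2, n).astype(float)
asset_ret = rng.normal(0.0, 0.01, n)

t0 = time.perf_counter()
vec = position * asset_ret          # vectorized: one array multiply
t_vec = time.perf_counter() - t0

t0 = time.perf_counter()
out = np.empty(n)
for i in range(n):                  # explicit Python loop, element by element
    out[i] = position[i] * asset_ret[i]
t_loop = time.perf_counter() - t0

print(f"vectorized: {t_vec * 1e3:.2f} ms, loop: {t_loop * 1e3:.1f} ms")
```

Both produce identical results; the vectorized version is typically one to two orders of magnitude faster at this scale.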

Step 4: Apply Transaction Costs

A backtest without transaction costs is a fantasy. Every time you enter or exit a position, you incur costs from commissions, bid-ask spread, and market impact (slippage). For a liquid instrument like SPY, a realistic round-trip cost estimate is approximately 10 basis points (0.10%) — roughly 5 bps per side. This accounts for the bid-ask spread (typically 1 penny on SPY, which is about 0.2 bps at a $500 price) plus slippage from execution delay.

# Transaction cost: 10 basis points per trade (5 bps per side)
cost_per_trade = 0.0010  # 10 bps round trip

# Identify trade days (position changes)
data["trade"] = data["position"].diff().abs().fillna(0)

# Subtract cost on each trade day
data["strategy_return_net"] = (
    data["strategy_return"] - data["trade"] * cost_per_trade / 2
)

# Recalculate cumulative returns after costs
data["strategy_net_cumulative"] = (1 + data["strategy_return_net"]).cumprod()

gross_return = data["strategy_cumulative"].iloc[-1] - 1
net_return = data["strategy_net_cumulative"].iloc[-1] - 1
cost_drag = gross_return - net_return

print(f"Gross return:  {gross_return:.2%}")
print(f"Net return:    {net_return:.2%}")
print(f"Cost drag:     {cost_drag:.2%}")
print(f"Total trades:  {data['trade'].sum():.0f}")

We divide the round-trip cost by 2 and apply it on each position change because data["trade"] fires on both entries and exits separately. The cost drag may look small per trade, but it compounds over time. A strategy that trades frequently (say 200 round trips per year) would lose roughly 18% annually to transaction costs alone at 10 bps per round trip.
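The compounding arithmetic is easy to check directly:

```python
# Annual cost drag at 10 bps per round trip, compounded per trade.
cost_per_round_trip = 0.0010

drag = {}
for trips_per_year in (12, 50, 200):
    drag[trips_per_year] = 1 - (1 - cost_per_round_trip) ** trips_per_year
    print(f"{trips_per_year:>4} round trips/year -> {drag[trips_per_year]:.2%} cost drag")
```

At 12 round trips a year the drag is barely over 1%; at 200 it eats roughly 18% of your capital annually before the strategy earns anything.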

Step 5: Compute Performance Metrics

Raw returns are not enough to evaluate a strategy. You need standardized metrics that account for risk, consistency, and the distribution of outcomes. Here are the essential metrics every backtest should report:

def compute_metrics(returns):
    """Compute standard backtest performance metrics."""
    # Annualized return
    total_days = len(returns)
    total_return = (1 + returns).prod() - 1
    ann_return = (1 + total_return) ** (252 / total_days) - 1

    # Annualized volatility
    ann_vol = returns.std() * np.sqrt(252)

    # Sharpe ratio (assuming risk-free rate = 0 for simplicity)
    sharpe = returns.mean() / returns.std() * np.sqrt(252)

    # Maximum drawdown
    cumulative = (1 + returns).cumprod()
    running_max = cumulative.cummax()
    drawdown = (cumulative - running_max) / running_max
    max_drawdown = drawdown.min()

    # Win rate
    winning_days = (returns[returns != 0] > 0).sum()
    total_active_days = (returns != 0).sum()
    win_rate = winning_days / total_active_days if total_active_days > 0 else 0

    # Profit factor: gross profits / gross losses
    profits = returns[returns > 0].sum()
    losses = abs(returns[returns < 0].sum())
    profit_factor = profits / losses if losses > 0 else float("inf")

    return {
        "Total Return": f"{total_return:.2%}",
        "Annualized Return": f"{ann_return:.2%}",
        "Annualized Volatility": f"{ann_vol:.2%}",
        "Sharpe Ratio": f"{sharpe:.2f}",
        "Max Drawdown": f"{max_drawdown:.2%}",
        "Win Rate (active days)": f"{win_rate:.2%}",
        "Profit Factor": f"{profit_factor:.2f}",
    }

# Compute metrics for strategy and benchmark
strat_metrics = compute_metrics(data["strategy_return_net"].dropna())
bench_metrics = compute_metrics(data["benchmark_return"].dropna())

print("\n--- Strategy ---")
for k, v in strat_metrics.items():
    print(f"  {k:<25s} {v}")

print("\n--- Buy & Hold ---")
for k, v in bench_metrics.items():
    print(f"  {k:<25s} {v}")

Understanding Each Metric

The Sharpe ratio is the most widely used measure of risk-adjusted returns. It divides the mean return by the standard deviation and annualizes by multiplying by the square root of 252 (the approximate number of trading days in a year). A Sharpe above 1.0 is generally considered good for a long-only strategy; above 2.0 is excellent. Below 0.5 suggests the strategy is not generating sufficient return for the risk taken.

Maximum drawdown measures the worst peak-to-trough decline in cumulative returns. It answers the question: “If I started at the worst possible time, how much would I have lost before recovering?” A max drawdown of -30% means the strategy lost 30% from its peak before eventually making new highs. This metric is critical because drawdowns directly affect investor behavior — most people cannot tolerate drawdowns beyond 20-25% without abandoning a strategy.

Win rate measures the percentage of active trading days that produced positive returns. Note that a low win rate does not necessarily mean a bad strategy — trend-following systems often win on only 40-45% of trades but make significantly more on winners than they lose on losers. That asymmetry is captured by the profit factor, which divides total gross profits by total gross losses. A profit factor above 1.0 means the strategy is profitable; above 1.5 is solid; above 2.0 is strong.
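A tiny hand-checkable example ties these definitions together. The series below is fabricated: it wins on only 2 of 5 days, yet its profit factor is well above 1 because the wins are larger than the losses.

```python
import pandas as pd

# Fabricated returns: two large wins, three small losses.
r = pd.Series([0.02, -0.01, 0.03, -0.01, -0.01])

win_rate = (r > 0).mean()                          # 2 of 5 days
profit_factor = r[r > 0].sum() / -r[r < 0].sum()   # 0.05 / 0.03

cumulative = (1 + r).cumprod()
max_dd = ((cumulative - cumulative.cummax()) / cumulative.cummax()).min()

print(f"Win rate:      {win_rate:.0%}")        # 40%
print(f"Profit factor: {profit_factor:.2f}")   # 1.67
print(f"Max drawdown:  {max_dd:.2%}")
```

Despite losing on most days, the toy strategy is net profitable, and its worst peak-to-trough decline comes from the two consecutive losses at the end.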

Step 6: Plot the Equity Curve

A visual inspection of the equity curve often reveals things that summary statistics hide — long flat periods, sudden drawdowns, or returns concentrated in a short window.

fig, axes = plt.subplots(2, 1, figsize=(12, 8), gridspec_kw={"height_ratios": [3, 1]})

# Equity curve
axes[0].plot(data.index, data["strategy_net_cumulative"], label="Strategy (net)", linewidth=1.5)
axes[0].plot(data.index, data["benchmark_cumulative"], label="Buy & Hold", linewidth=1.5, alpha=0.7)
axes[0].set_title("Equity Curve: 50/200 SMA Crossover on SPY", fontsize=14)
axes[0].set_ylabel("Growth of $1")
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Drawdown
cumulative = (1 + data["strategy_return_net"]).cumprod()
running_max = cumulative.cummax()
drawdown = (cumulative - running_max) / running_max
axes[1].fill_between(data.index, drawdown, 0, alpha=0.4, color="red")
axes[1].set_title("Strategy Drawdown", fontsize=12)
axes[1].set_ylabel("Drawdown")
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig("backtest_equity_curve.png", dpi=150, bbox_inches="tight")
plt.show()

The top panel shows the cumulative growth of $1 invested in the strategy versus buying and holding SPY. The bottom panel shows the strategy’s drawdown at each point in time. Look for extended periods of drawdown (underwater time) and compare the shape of the equity curves. A strategy that underperforms buy-and-hold on a risk-adjusted basis is not providing value.

Step 7: Compare to the Buy-and-Hold Benchmark

Every strategy must be evaluated relative to a benchmark. For a long-only equity strategy, the natural benchmark is buying and holding the same instrument. This comparison answers the fundamental question: did the trading signals add value, or would you have been better off doing nothing?

# Summary comparison table
comparison = pd.DataFrame({
    "Strategy (net)": compute_metrics(data["strategy_return_net"].dropna()),
    "Buy & Hold": compute_metrics(data["benchmark_return"].dropna()),
})
print(comparison.to_string())

Trend-following strategies like the SMA crossover tend to underperform buy-and-hold during strong bull markets because they are periodically out of the market. Their value shows up during bear markets: by going flat when the short-term average drops below the long-term average, they avoid the worst drawdowns. The tradeoff is missing some upside to avoid the deepest losses.

A fair comparison should also account for the opportunity cost of sitting in cash. When the strategy is flat, that capital could be invested in Treasury bills. At a 5% annualized risk-free rate, parking cash for 40% of the year adds approximately 2% in returns that our simple backtest ignores.
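A minimal sketch of that adjustment, assuming a constant 5% annualized rate and using fabricated returns and positions, is to credit the daily risk-free rate on flat days:

```python
import numpy as np
import pandas as pd

# Assumption: a flat 5% annualized T-bill rate, converted to a daily rate.
rf_daily = (1 + 0.05) ** (1 / 252) - 1

# Fabricated one-year sample: random returns and on/off positions.
rng = np.random.default_rng(1)
asset_return = pd.Series(rng.normal(0.0003, 0.01, 252))
position = pd.Series(rng.integers(0, 2, 252)).astype(float)

no_cash_yield = position * asset_return
with_cash_yield = position * asset_return + (1 - position) * rf_daily

extra = (1 + with_cash_yield).prod() - (1 + no_cash_yield).prod()
print(f"Return added by cash yield over one year: {extra:.2%}")
```

With the strategy flat about half the time, the cash yield adds a couple of percentage points a year, which is material when comparing against buy-and-hold.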

Common Backtesting Mistakes

Most backtests produce results that are far more optimistic than what you would experience in live trading. Here are the primary sources of error:

1. Lookahead Bias

Lookahead bias occurs when your backtest uses information that was not available at the time of the trading decision. The most common form is failing to shift signals, as we discussed in Step 2. More subtle forms include applying price-level rules (for example, “only trade stocks above $5”) to split-adjusted prices, since the adjusted level is not the level that was actually quoted at the time, or using fundamental data that was restated after the reporting period.

2. Not Modeling Transaction Costs

A strategy that looks profitable before costs can easily be unprofitable after. This is especially true for high-frequency strategies. Even for daily strategies, costs of 10-20 bps per round trip compound significantly over hundreds of trades. Always include realistic cost estimates. For illiquid small-cap stocks, costs can be 50-100 bps or more per side due to wide bid-ask spreads and market impact.
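A quick sensitivity sweep makes the point. The returns and trade days below are fabricated, but the pattern holds for any strategy: the same gross performance can range from healthy to deeply negative depending on the cost assumption.

```python
import numpy as np
import pandas as pd

# Fabricated one-year sample: gross daily returns and entry/exit days.
rng = np.random.default_rng(2)
gross = pd.Series(rng.normal(0.0008, 0.01, 252))
trade = pd.Series(rng.integers(0, 2, 252)).astype(float)  # ~126 position changes

net_by_cost = {}
for cost_bps in (0, 5, 10, 25, 50):
    per_side = cost_bps / 10000 / 2   # round-trip cost split across entry and exit
    net = gross - trade * per_side
    net_by_cost[cost_bps] = (1 + net).prod() - 1
    print(f"{cost_bps:>3} bps round trip -> {net_by_cost[cost_bps]:7.2%} net annual return")
```

Rerunning a backtest across a range of cost assumptions like this is cheap insurance: if the strategy only survives at 0-5 bps, it is not tradable in anything but the most liquid instruments.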

3. Overfitting to Historical Data

If you optimize 15 parameters on five years of daily data, you will almost certainly find a combination that produces spectacular historical returns. This is not skill — it is curve fitting. The parameters were tuned to fit random noise in that specific data sample, and they will not generalize to future data. We cover overfitting in detail in a dedicated article.

4. Survivorship Bias

If your backtest universe only contains stocks that currently exist, you are systematically excluding companies that went bankrupt, were delisted, or were acquired at distressed prices. This inflates returns because your universe is full of “winners.” We discuss this in depth in our article on survivorship bias.

5. Ignoring Market Impact

When you place an order, the act of buying or selling moves the price against you. For SPY, this is negligible. But in a small-cap stock trading 50,000 shares per day, a 100,000-share order represents two full days of volume and will cause significant price impact. Square-root impact models, such as the one described by Jim Gatheral in “No-Dynamic-Arbitrage and Market Impact” (2010), estimate that impact scales as the square root of the trade size relative to daily volume.
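As a rough sketch, a square-root model puts numbers on the difference. The coefficient and the 2% daily volatility below are illustrative placeholders, not calibrated values:

```python
import math

def sqrt_impact_bps(order_shares, adv_shares, daily_vol=0.02, c=1.0):
    """Stylized square-root market-impact estimate, in basis points.

    impact ~ c * daily_vol * sqrt(order size / average daily volume).
    The coefficient c and the default volatility are illustrative
    assumptions, not calibrated values.
    """
    return c * daily_vol * math.sqrt(order_shares / adv_shares) * 10000

# 100,000 shares against 50,000 shares/day of volume (two days of volume)
print(f"illiquid small-cap: {sqrt_impact_bps(100_000, 50_000):.0f} bps")
# The same order against 50 million shares/day
print(f"liquid large-cap:   {sqrt_impact_bps(100_000, 50_000_000):.1f} bps")
```

Under these toy assumptions the identical order costs roughly 280 bps of impact in the small-cap versus under 10 bps in the liquid name, which is why backtests on illiquid universes need explicit impact modeling.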

Advanced: Walk-Forward Optimization

Walk-forward analysis is the gold standard for validating a backtested strategy. Instead of optimizing parameters on the full dataset and testing on the same data (in-sample only), you split the data into sequential train-test windows and roll forward through time.

def walk_forward_backtest(data, train_years=3, test_years=1):
    """
    Walk-forward optimization for SMA crossover.
    Train on train_years, test on test_years, roll forward.
    """
    results = []
    price = data["Adj Close"]

    # Define candidate parameter pairs
    fast_windows = [20, 30, 40, 50, 60]
    slow_windows = [100, 150, 200, 250]

    start = data.index[0]
    end = data.index[-1]

    train_start = start
    while True:
        train_end = train_start + pd.DateOffset(years=train_years)
        test_start = train_end
        test_end = test_start + pd.DateOffset(years=test_years)

        if test_end > end:
            break

        # Training period: find best parameter pair
        best_sharpe = -np.inf
        best_params = (50, 200)

        for fast in fast_windows:
            for slow in slow_windows:
                if fast >= slow:
                    continue
                train_data = price[train_start:train_end]
                sma_fast = train_data.rolling(fast).mean()
                sma_slow = train_data.rolling(slow).mean()
                sig = (sma_fast > sma_slow).astype(int).shift(1)
                ret = train_data.pct_change() * sig
                ret = ret.dropna()
                if len(ret) > 0 and ret.std() > 0:
                    sharpe = ret.mean() / ret.std() * np.sqrt(252)
                    if sharpe > best_sharpe:
                        best_sharpe = sharpe
                        best_params = (fast, slow)

        # Test period: apply best parameters from training
        fast, slow = best_params
        test_data = price[train_start:test_end]  # need history for MA
        sma_fast = test_data.rolling(fast).mean()
        sma_slow = test_data.rolling(slow).mean()
        sig = (sma_fast > sma_slow).astype(int).shift(1)
        ret = test_data.pct_change() * sig
        test_ret = ret[test_start:test_end].dropna()

        results.append({
            "train": f"{train_start.date()} to {train_end.date()}",
            "test": f"{test_start.date()} to {test_end.date()}",
            "params": best_params,
            "test_return": (1 + test_ret).prod() - 1,
            "test_sharpe": test_ret.mean() / test_ret.std() * np.sqrt(252) if test_ret.std() > 0 else 0,
        })

        # Roll forward
        train_start = train_start + pd.DateOffset(years=test_years)

    return pd.DataFrame(results)

# Run walk-forward analysis
wf_results = walk_forward_backtest(data)
print(wf_results.to_string(index=False))

Walk-forward analysis reveals the true out-of-sample performance of your strategy. The test period returns are genuine out-of-sample results because the parameters were chosen using only the training data that preceded each test window. If the Sharpe ratio degrades dramatically from train to test, your strategy is likely overfit.

A realistic expectation: if your training Sharpe is 1.5, the out-of-sample Sharpe will often be 0.5-0.8. This degradation is normal and expected. If your out-of-sample Sharpe is consistently above 0.5, you likely have a genuine edge.

Backtesting Frameworks

Once you understand the fundamentals, you may want to use a dedicated backtesting framework instead of building everything from scratch. Here are three widely used options:

backtrader

backtrader is an event-driven backtesting framework written in Python. It simulates the order-by-order mechanics of trading, including broker models, commission structures, and order types (market, limit, stop). It is well-suited for strategies that require realistic order execution modeling. The event-driven architecture means it processes data bar by bar, which is slower than vectorized approaches but more realistic. backtrader is open source and available on PyPI (pip install backtrader).

vectorbt

vectorbt takes the opposite approach: it uses numpy and pandas vectorization to run backtests extremely fast. It can test thousands of parameter combinations in seconds by computing all variations simultaneously using array operations. This makes it ideal for parameter optimization and strategy screening. The tradeoff is that it is harder to model complex order execution logic. vectorbt is open source and installable via pip install vectorbt.

zipline

zipline was originally developed by Quantopian, the now-defunct crowd-sourced quantitative hedge fund. Since Quantopian shut down in 2020, zipline has been maintained by the open-source community; Stefan Jansen (author of Machine Learning for Algorithmic Trading) maintains an active fork called zipline-reloaded. Zipline uses an event-driven architecture similar to backtrader and includes a data-bundle system for ingesting historical data (it historically shipped with a Quandl bundle). It is installable via pip install zipline-reloaded.

Framework         Architecture   Speed      Best For
backtrader        Event-driven   Moderate   Realistic execution modeling
vectorbt          Vectorized     Very fast  Parameter sweeps, screening
zipline-reloaded  Event-driven   Moderate   Full pipeline with data bundles

Putting It All Together

Here is the complete, self-contained backtest script that combines all the steps above into a single runnable file:

import pandas as pd
import numpy as np
import yfinance as yf
import matplotlib.pyplot as plt

# --- Configuration ---
TICKER = "SPY"
START = "2020-01-01"
END = "2025-01-01"
FAST_MA = 50
SLOW_MA = 200
COST_BPS = 10  # round-trip cost in basis points

# --- Step 1: Data ---
data = yf.download(TICKER, start=START, end=END, auto_adjust=False)
price = data["Adj Close"]

# --- Step 2: Signals ---
sma_fast = price.rolling(FAST_MA).mean()
sma_slow = price.rolling(SLOW_MA).mean()
signal = (sma_fast > sma_slow).astype(int)
position = signal.shift(1)  # avoid lookahead bias

# --- Step 3: Returns ---
asset_return = price.pct_change()
strategy_return = position * asset_return

# --- Step 4: Costs ---
trades = position.diff().abs().fillna(0)
cost_per_trade = COST_BPS / 10000
strategy_net = strategy_return - trades * cost_per_trade / 2

# --- Step 5: Metrics ---
strategy_net = strategy_net.dropna()
benchmark = asset_return.loc[strategy_net.index]

sharpe = strategy_net.mean() / strategy_net.std() * np.sqrt(252)
cumulative = (1 + strategy_net).cumprod()
max_dd = ((cumulative - cumulative.cummax()) / cumulative.cummax()).min()
total_return = cumulative.iloc[-1] - 1

print(f"Total Return: {total_return:.2%}")
print(f"Sharpe Ratio: {sharpe:.2f}")
print(f"Max Drawdown: {max_dd:.2%}")

# --- Step 6: Plot ---
fig, ax = plt.subplots(figsize=(12, 5))
ax.plot(cumulative.index, cumulative, label="Strategy (net)")
ax.plot(cumulative.index, (1 + benchmark).cumprod(), label="Buy & Hold", alpha=0.7)
ax.set_title(f"{FAST_MA}/{SLOW_MA} SMA Crossover on {TICKER}")
ax.set_ylabel("Growth of $1")
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("backtest_result.png", dpi=150)
plt.show()

Copy this into a file called backtest_sma.py, run it with python backtest_sma.py, and you will have a complete, working backtest in under 30 seconds. From here, you can experiment with different moving average windows, different instruments, or different signal generation logic entirely.

The most important lesson from this exercise is not the specific results of the SMA crossover strategy — it is the process. Every strategy, no matter how complex, follows these same steps: acquire data, generate signals, simulate trades, apply costs, compute metrics, and compare to a benchmark. The more rigorous you are at each step, the less likely you are to fool yourself with a backtest that looks great on paper but fails with real money.

Run Backtests with Alpha Suite

Alpha Suite includes a built-in walk-forward backtesting engine with parameter sweeps, Sharpe calculation, and drawdown analysis. Test insider-trading-based signals against real market data.
