PerpForge

Meta-Analysis

Combining results from multiple independent backtests (or studies) into one pooled estimate, by weighting each result by its precision (inverse-variance weighting). The proper way to summarize per-venue or per-period results — but ONLY when those results are genuinely independent.

In plain English

Meta-analysis means: combining several separate backtest results into a single, tighter estimate of a strategy's true edge — giving more weight to the more precise (larger-sample, lower-noise) results.

Say you ran one strategy separately on three independent slices of data. Three Sharpe ratios (Sharpe = mean per-trade return divided by its volatility) came out. If they're roughly consistent (their confidence intervals overlap around a similar value) AND the slices are truly independent, you can combine them into one tighter estimate.

Doing this naively (averaging the three Sharpes) is wrong — it treats them as if they had equal precision, which they don't. The dataset with more trades has a tighter SE (Standard Error) and deserves more weight. The dataset with fewer trades is noisier and deserves less.

The standard method is inverse-variance weighting: each estimate is weighted by 1 / SE². Datasets with smaller standard error contribute more.
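
To make "more trades means a tighter SE" concrete, here is a minimal sketch. It assumes the common large-sample approximation for the standard error of a Sharpe ratio under i.i.d., roughly normal per-trade returns (Lo, 2002); that approximation, the function names, and the trade counts are illustrative assumptions, not something this page specifies.

```python
import math


def sharpe_se(sharpe: float, n_trades: int) -> float:
    """Approximate SE of a per-trade Sharpe ratio.

    Assumption: i.i.d., roughly normal per-trade returns, so
    SE ~= sqrt((1 + sharpe**2 / 2) / n_trades).
    """
    if n_trades < 2:
        raise ValueError("need at least 2 trades")
    return math.sqrt((1.0 + 0.5 * sharpe ** 2) / n_trades)


def inverse_variance_weight(se: float) -> float:
    """Weight used in inverse-variance pooling: 1 / SE^2."""
    return 1.0 / se ** 2


# Hypothetical slices: same Sharpe, but one has 4x as many trades.
se_small = sharpe_se(0.05, 400)
se_large = sharpe_se(0.05, 1600)
weight_ratio = inverse_variance_weight(se_large) / inverse_variance_weight(se_small)

print(f"SE with  400 trades: {se_small:.4f}")
print(f"SE with 1600 trades: {se_large:.4f}")
print(f"weight ratio (large / small): {weight_ratio:.2f}")
```

Under this approximation, four times the trades halves the SE and quadruples the weight. Autocorrelated or fat-tailed returns would make the real SE wider than this formula suggests.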

Formula

Given k independent estimates S_i with standard errors SE(S_i):

w_i = 1 / SE(S_i)²
S_pooled = Σ (S_i × w_i) / Σ w_i
SE(S_pooled) = sqrt(1 / Σ w_i)

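The pooled estimate has a tighter standard error than any individual estimate, because Σ w_i is larger than every single w_i. That is the whole point of pooling, and it is only trustworthy when the estimates really are independent.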
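
The three formulas above can be sketched directly in code. The Sharpe values and standard errors below are made-up illustration numbers, and `pool_inverse_variance` is a hypothetical helper name, not part of any PerpForge API.

```python
import math


def pool_inverse_variance(estimates, std_errors):
    """Return (pooled estimate, pooled SE) using weights w_i = 1 / SE_i^2.

    Only meaningful if the estimates are statistically independent.
    """
    if not estimates or len(estimates) != len(std_errors):
        raise ValueError("need matching, non-empty lists")
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(s * w for s, w in zip(estimates, weights)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled, pooled_se


# Hypothetical independent slices (not results from this fleet).
sharpes = [0.04, 0.06, 0.10]
std_errors = [0.02, 0.03, 0.06]

pooled, pooled_se = pool_inverse_variance(sharpes, std_errors)
naive_mean = sum(sharpes) / len(sharpes)

print(f"naive mean : {naive_mean:.4f}")
print(f"pooled     : {pooled:.4f} (SE {pooled_se:.4f})")
```

In this toy case the pooled value sits closer to the most precise slice than the naive mean does, and the pooled SE is smaller than the smallest individual SE.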

When pooling is VALID

  • Datasets are statistically independent. No shared market events, no shared trades.
  • Per-dataset CIs (Confidence Intervals) overlap a common region. If they don't overlap, you're trying to average estimates that disagree — which means the underlying parameter isn't constant across datasets (the edge is venue-bound, time-bound, or symbol-bound). Pooling is inappropriate; report separate results instead.
  • The estimand is the same. All three datasets must be estimating the same underlying edge — same strategy logic, same instrument class.

When pooling is NOT valid for this fleet

  • Same-period data across venues. Orderly 2024 + Binance 2024 + Coinbase 2024 all reflect the same underlying market events, so their results are correlated. Pooling them overstates independence and produces a CI that is too tight.
  • Per-symbol pooling within a single strategy. BTC, ETH, and SOL move together during major events. Pooling per-symbol Sharpes for one rule (e.g. the 21/50 1h long signal in the example below) overstates independence, so the resulting CI comes out too tight to trust.
  • Heterogeneous estimates. If per-venue CIs don't overlap, pooling hides the disagreement. Report the divergence instead.

When pooling IS valid for this fleet

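  • Different-period cross-venue data. Binance 2018-2023 + Coinbase 2018-2023 + Orderly 2023-2026: the Orderly slice covers a different time period and different market events from the other two, so it is much closer to independent of them. The Binance and Coinbase slices share the 2018-2023 window, so by the same-period rule above they are correlated with each other and should not be counted as two independent looks. The pooled estimate across the two periods is the cleanest measure of regime-robust edge.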
  • Different non-overlapping windows of the same venue. Walk-forward chunks of Binance data, each ~6 months, are roughly independent. Pooling N chunks gives a regime-averaged Sharpe estimate.

Heterogeneity test

Before pooling, formally check whether per-dataset estimates agree. The standard test is Cochran's Q:

Q = Σ w_i × (S_i − S_pooled)²

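Compare Q against a χ² distribution with k-1 degrees of freedom. If Q exceeds the critical value at your chosen significance level (equivalently, the p-value falls below that level), the estimates are heterogeneous: do not pool. If Q is in the expected range (around k-1), the data do not contradict pooling. With only a handful of estimates the test has low power, so a non-significant Q is weak evidence of agreement, not proof.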

A less formal check: if all per-dataset CIs overlap a common point, the estimates are consistent enough to pool, provided they are also independent. If any pair fails to overlap, do not pool.
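
As a rough sketch of the test, the code below computes Q and its χ² p-value using only the standard library. The closed-form tail formula for integer degrees of freedom is swapped in here to avoid a dependency (`scipy.stats.chi2.sf` would do the same job where SciPy is installed). The input numbers and the helper names `chi2_sf` and `cochran_q` are illustrative assumptions.

```python
import math


def chi2_sf(x: float, dof: int) -> float:
    """Upper-tail probability P(chi2 > x) for integer dof >= 1."""
    if dof < 1:
        raise ValueError("dof must be >= 1")
    if x <= 0:
        return 1.0
    half = x / 2.0
    if dof % 2 == 0:
        # exp(-x/2) * sum_{r=0}^{dof/2 - 1} (x/2)^r / r!
        term, total = 1.0, 1.0
        for r in range(1, dof // 2):
            term *= half / r
            total += term
        return math.exp(-half) * total
    # Odd dof: erfc(sqrt(x/2)) + sqrt(2/pi) * exp(-x/2) * sum chi^(2r-1) / (2r-1)!!
    chi = math.sqrt(x)
    term, total = chi, 0.0
    for r in range(1, (dof - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return math.erfc(chi / math.sqrt(2.0)) + math.sqrt(2.0 / math.pi) * math.exp(-half) * total


def cochran_q(estimates, std_errors):
    """Return (Q, degrees of freedom, p-value) for k >= 2 estimates."""
    if len(estimates) < 2 or len(estimates) != len(std_errors):
        raise ValueError("need at least two estimates with matching SEs")
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(s * w for s, w in zip(estimates, weights)) / sum(weights)
    q = sum(w * (s - pooled) ** 2 for s, w in zip(estimates, weights))
    dof = len(estimates) - 1
    return q, dof, chi2_sf(q, dof)


# Hypothetical independent slices (not results from this fleet).
q, dof, p_value = cochran_q([0.04, 0.06, 0.10], [0.02, 0.03, 0.06])
print(f"Q = {q:.3f}, dof = {dof}, p = {p_value:.3f}")
```

A large p-value here only says the toy estimates are not detectably heterogeneous. It says nothing about whether they are independent, which has to be argued separately.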

Example relevant to the fleet — why pooling here is INVALID

Dossier #1 is the textbook case where pooling is tempting but wrong.

The same "21/50 cross, 1h, going long" rule was run on three symbols, and all three came out positive:

  • BTC (id511): Sharpe 0.020
  • ETH (id517): Sharpe 0.059
  • SOL (id523): Sharpe 0.110

You might be tempted to pool these three — inverse-variance-weight them — to gain confidence in "the 21/50 long signal." Don't. Pooling is only valid for independent results, and these three break independence on every axis at once:

  1. They are correlated variants of ONE signal — the same cross, just on three symbols. Not three separate ideas.
  2. The assets themselves are correlated. BTC, ETH, and SOL move together during major market events, so their per-trade returns share the same shocks.
  3. One window. All three were measured on the same in-sample period, so they saw the same regime.

Inverse-variance pooling assumes the three errors are independent. Here they are heavily shared. Pooling them would shrink the standard error as if you had three independent looks at the edge, when you really have something much closer to one — it would overstate your confidence. The honest move is to report the three numbers and their confidence intervals separately, exactly as above.
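
To see how much confidence shared errors can cost, the sketch below compares the naive pooled SE with the SE of the same weighted average when every pair of estimates has correlation rho. The per-symbol SE of 0.03 and the correlation of 0.8 are hypothetical; the dossier's actual standard errors and correlations are not given here.

```python
import math


def pooled_se_naive(std_errors):
    """Pooled SE assuming independent errors: sqrt(1 / sum(w_i))."""
    return math.sqrt(1.0 / sum(1.0 / se ** 2 for se in std_errors))


def pooled_se_correlated(std_errors, rho):
    """SE of the inverse-variance-weighted mean when each pair of
    estimation errors has correlation rho (an illustrative assumption)."""
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    variance = 0.0
    for i, (w_i, se_i) in enumerate(zip(weights, std_errors)):
        for j, (w_j, se_j) in enumerate(zip(weights, std_errors)):
            corr = 1.0 if i == j else rho
            variance += w_i * w_j * corr * se_i * se_j
    return math.sqrt(variance) / total_weight


# Hypothetical values, NOT taken from Dossier #1.
std_errors = [0.03, 0.03, 0.03]
naive = pooled_se_naive(std_errors)
honest = pooled_se_correlated(std_errors, rho=0.8)

print(f"naive pooled SE      : {naive:.4f}")
print(f"SE with rho = 0.8    : {honest:.4f}")
print(f"effective sample size: {(std_errors[0] / honest) ** 2:.2f} of 3")
```

With these made-up inputs, three correlated results carry roughly the information of 1.15 independent ones, which is the "much closer to one" situation described above.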

Practical use

For most analyses in this fleet, the simpler workflow is:

  1. Report per-stratum (per-venue, per-symbol, per-period) metrics with CIs.
  2. Eyeball whether they overlap.
  3. Pool only when overlap is clear and datasets are independent.
  4. When in doubt, present the unpooled results — three numbers with three CIs is more honest than one pooled number that hides disagreement.
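
The workflow above can be sketched as a small helper. It assumes normal-approximation 95% CIs (z = 1.96) and treats independence as a judgment the analyst supplies, since it cannot be read off the numbers. The stratum labels, values, and function names are hypothetical.

```python
def confidence_interval(estimate, se, z=1.96):
    """Two-sided normal-approximation CI; z = 1.96 gives roughly 95%."""
    return (estimate - z * se, estimate + z * se)


def common_overlap(intervals):
    """Return the region shared by all intervals, or None if there is none."""
    lower = max(lo for lo, _ in intervals)
    upper = min(hi for _, hi in intervals)
    return (lower, upper) if lower <= upper else None


def pooling_decision(strata, independent):
    """strata maps a label to (estimate, standard_error).

    `independent` is the analyst's call, not something computed here.
    """
    intervals = [confidence_interval(s, se) for s, se in strata.values()]
    if not independent:
        return "report separately: strata are not independent"
    if common_overlap(intervals) is None:
        return "report separately: CIs do not overlap"
    return "pool"


# Hypothetical non-overlapping walk-forward chunks.
chunks = {
    "chunk_a": (0.04, 0.02),
    "chunk_b": (0.06, 0.03),
    "chunk_c": (0.10, 0.06),
}

for label, (estimate, se) in chunks.items():
    lo, hi = confidence_interval(estimate, se)
    print(f"{label}: {estimate:.3f}  CI [{lo:.3f}, {hi:.3f}]")
print(pooling_decision(chunks, independent=True))
```

Passing `independent=False` for the same numbers returns a "report separately" verdict, which mirrors step 4: overlap alone is never enough to justify pooling.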

Sources

  • wiki/qa-sessions/2026-05-17-session.md#q7 (first asked here)
  • Borenstein et al. (2009), Introduction to Meta-Analysis
  • Cochran's Q-test for heterogeneity (standard biostatistics)
