Concept · The traps that fake an edge
When you pick the best-looking variants from a large pool, the winners' apparent performance is inflated — some of it is real edge, some of it is luck that happens to land on the strategies you picked. The more variants you test, the larger the inflation.
If you flip 1000 fair coins 10 times each, about 55 of them will come up heads 8 or more times by pure chance (for any single coin the probability is 56/1024, roughly 5.5%). If you then declare those "the lucky coins" and bet on them, you'll be disappointed: the next 10 flips will look like fair-coin tosses again. The 8-or-more-heads result was an artifact of having flipped so many coins, not a property of those specific coins.
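The coin example can be checked with an exact binomial calculation. This is a minimal sketch; the function name `prob_at_least` and the "8 or more heads" cutoff for a "lucky" coin are choices made here for illustration.

```python
from math import comb

def prob_at_least(k_min: int, n_flips: int = 10, p_heads: float = 0.5) -> float:
    """Exact probability of k_min or more heads in n_flips flips."""
    return sum(
        comb(n_flips, k) * p_heads**k * (1 - p_heads) ** (n_flips - k)
        for k in range(k_min, n_flips + 1)
    )

N_COINS = 1000
p_lucky = prob_at_least(8)          # chance one fair coin shows 8+ heads
expected_lucky = N_COINS * p_lucky  # how many such coins to expect in the pool

# A "lucky" coin is still fair, so its next 10 flips average 5 heads.
expected_heads_next_round = 10 * 0.5

print(f"P(8+ heads in 10 flips) = {p_lucky:.4f}")
print(f"Expected lucky coins out of {N_COINS}: {expected_lucky:.1f}")
print(f"Expected heads for a lucky coin next round: {expected_heads_next_round:.0f}")
```

The count of "lucky" coins scales with the pool size, while the expected result of the next round does not change at all.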
Strategy backtesting has the same problem. With 210 EMA-cross variants tested on one dataset, some will look amazing by chance even if none of them has true edge. The "best" variants by in-sample Sharpe (mean per-trade return divided by its volatility) are partly real winners and partly the luckiest variants. You cannot tell which is which without an out-of-sample test.
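A small simulation makes the inflation visible. The sketch below assumes 210 independent variants with zero true edge and 100 trades each; both the independence and the trade count are simplifying assumptions made here (the real fleet is correlated), and the returns are synthetic.

```python
import random
import statistics

def sharpe(returns):
    """Per-trade Sharpe: mean return divided by its standard deviation."""
    return statistics.mean(returns) / statistics.stdev(returns)

rng = random.Random(0)  # fixed seed so the run is repeatable
N_VARIANTS = 210        # pool size from the text
N_TRADES = 100          # hypothetical number of trades per variant

# Every variant draws from the same zero-mean distribution: no true edge anywhere.
sharpes = [
    sharpe([rng.gauss(0.0, 1.0) for _ in range(N_TRADES)])
    for _ in range(N_VARIANTS)
]

best = max(sharpes)
average = statistics.mean(sharpes)
print(f"average in-sample Sharpe: {average:+.3f}")
print(f"best in-sample Sharpe:    {best:+.3f}")
```

The pool average sits near zero, as it should, while the best variant shows a clearly positive Sharpe that comes entirely from having picked the maximum.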
The dossier #1 fleet is exactly the kind of large variant space that suffers from selection bias. It is 210 correlated variants of one idea, the EMA cross, scanned over a single in-sample window.
Here is the cleanest way to see the problem. Across those 210 variants, 5 rows clear the edge-significance bar (their measured edge is statistically distinguishable from luck on this data). That sounds like 5 winners. But across 210 correlated variants you would expect roughly 11 rows, about 5% of the pool, to pass an edge test by pure chance. That is the false-positive rate of testing a big pool. So 5 is actually below the luck baseline.
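The luck baseline is a one-line calculation. The sketch below assumes a per-test false-positive rate of 5%, which is consistent with the "roughly 11 of 210" figure but is not stated in the text. Correlation between variants does not change this expected count; it only makes the actual count swing more widely around it.

```python
N_VARIANTS = 210      # pool size from the text
ALPHA = 0.05          # assumed per-test false-positive rate
OBSERVED_PASSES = 5   # rows clearing the bar, from the text

# Expected number of rows that pass by chance alone when no variant has edge.
expected_by_luck = N_VARIANTS * ALPHA
below_baseline = OBSERVED_PASSES < expected_by_luck

print(f"expected passes by luck: {expected_by_luck:.1f}")
print(f"observed passes:         {OBSERVED_PASSES}")
print(f"observed below the luck baseline: {below_baseline}")
```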
In other words, the apparent "winners" sit comfortably inside the band of luck you would get from searching such a large pool. After the multiple-comparisons haircut (the penalty for having tested so many variants), no edge in the fleet is cleanly distinguishable from luck. Only an out-of-sample re-test — re-running the candidates on data the selection never touched — can separate the real signal from the lucky noise.
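One common form of the multiple-comparisons haircut is the Bonferroni correction, which divides the significance threshold by the number of tests. The text does not say which correction the engine applies, so treat this as an illustrative stand-in. The p-values below are hypothetical, and Bonferroni is known to be conservative when tests are correlated.

```python
def bonferroni_survivors(p_values, alpha=0.05):
    """Return indices whose p-value clears the Bonferroni-adjusted threshold."""
    threshold = alpha / len(p_values)  # stricter bar for every extra test
    return [i for i, p in enumerate(p_values) if p < threshold]

# Hypothetical p-values: 5 rows look significant on their own, 205 do not.
p_values = [0.004, 0.011, 0.019, 0.032, 0.047] + [0.5] * 205

naive = [i for i, p in enumerate(p_values) if p < 0.05]
adjusted = bonferroni_survivors(p_values)

print(f"pass at the unadjusted 5% level: {len(naive)}")
print(f"pass after the haircut:          {len(adjusted)}")
```

With 210 tests the adjusted threshold falls to about 0.00024, so rows that look significant in isolation no longer qualify.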
There is a second, more concrete tell. Of all 210 rows, only 6 beat simply buying and holding the asset over the window — and every one of those 6 is a 50× long. They were selected by leverage, not by skill: amplified exposure to an exceptional bull run, not a better signal. That is selection bias wearing a different hat — the survivors share a trait (high leverage) that has nothing to do with edge.
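The leverage point can be illustrated with a toy calculation: scaling every trade by a constant multiplies the total return but leaves the Sharpe ratio unchanged. The returns below are hypothetical, and the sketch assumes leverage scales each trade linearly, ignoring liquidation, fees, and funding costs.

```python
import statistics

def sharpe(returns):
    """Per-trade Sharpe: mean return divided by its standard deviation."""
    return statistics.mean(returns) / statistics.stdev(returns)

# Hypothetical per-trade returns for one signal at 1x exposure.
base_returns = [0.012, -0.004, 0.009, -0.007, 0.015, 0.003]
LEVERAGE = 50

# Simplification: leverage multiplies each trade's return by a constant.
levered_returns = [LEVERAGE * r for r in base_returns]

print(f"total return  1x: {sum(base_returns):+.3f}  50x: {sum(levered_returns):+.3f}")
print(f"Sharpe        1x: {sharpe(base_returns):.3f}  50x: {sharpe(levered_returns):.3f}")
```

A ranking by raw return therefore rewards the leverage setting, while a risk-adjusted measure sees the same signal in both rows.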
Selection bias multiplies with every additional dimension you scan. Every venue, symbol, leverage, or parameter sweep you add gives luck more places to hide a false winner. Adding dimensions does grow your data, but it also grows the number of comparisons. The net effect on actionable confidence depends on whether the data growth outpaces the comparison growth, and in a tightly correlated pool like 210 variants of one cross, it usually does not.
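The multiplication is literal: the number of comparisons is the product of the sizes of every dimension swept. The grid below is entirely hypothetical and is not the fleet's actual breakdown; it only shows how quickly the count grows.

```python
from itertools import product

# Hypothetical sweep dimensions; names and values are illustrative only.
grid = {
    "venue": ["venue_a", "venue_b"],
    "symbol": ["sym_x", "sym_y", "sym_z"],
    "leverage": [1, 5, 10, 50],
    "ema_pair": [(5, 20), (10, 50), (20, 100), (50, 200), (9, 21), (12, 26)],
}

n_comparisons = 1
for dimension, values in grid.items():
    n_comparisons *= len(values)  # each new dimension multiplies the pool
    print(f"after adding {dimension:<9}: {n_comparisons} variants")

# Cross-check against an explicit enumeration of every combination.
all_variants = list(product(*grid.values()))
print(f"enumerated variants: {len(all_variants)}")
```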
Out-of-sample (OOS) testing is the cure. The mechanism: if a variant looks great in-sample by luck, that luck is unlikely to repeat OOS, because the OOS data played no part in the selection. Real edge does repeat. So after OOS validation, the residual performance is much closer to the true edge.
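The mechanism can be sketched with synthetic data. The pool below is hypothetical: 209 variants with no edge plus one with a real edge, each with 200 in-sample and 200 out-of-sample trades drawn independently. Selection uses the in-sample Sharpe only.

```python
import random
import statistics

def sharpe(returns):
    """Per-trade Sharpe: mean return divided by its standard deviation."""
    return statistics.mean(returns) / statistics.stdev(returns)

def simulate(rng, true_mean, n_trades):
    """Synthetic per-trade returns with unit volatility."""
    return [rng.gauss(true_mean, 1.0) for _ in range(n_trades)]

rng = random.Random(42)
N_TRADES = 200
true_means = [0.0] * 209 + [0.5]  # hypothetical: one variant has real edge

results = []
for mean in true_means:
    in_sample = sharpe(simulate(rng, mean, N_TRADES))
    out_of_sample = sharpe(simulate(rng, mean, N_TRADES))
    results.append((in_sample, out_of_sample, mean))

# Pick the ten best variants by in-sample Sharpe, then look at them OOS.
top = sorted(results, key=lambda r: r[0], reverse=True)[:10]
lucky = [r for r in top if r[2] == 0.0]
real = [r for r in top if r[2] > 0.0]

lucky_is = statistics.mean(r[0] for r in lucky)
lucky_oos = statistics.mean(r[1] for r in lucky)
print(f"lucky picks: in-sample {lucky_is:+.3f}, out-of-sample {lucky_oos:+.3f}")
for in_s, out_s, _ in real:
    print(f"real edge:   in-sample {in_s:+.3f}, out-of-sample {out_s:+.3f}")
```

The zero-edge picks fall back toward zero out of sample, while the variant with real edge keeps most of its Sharpe.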
Without OOS, every in-sample winner in a large correlated scan is a noisy estimate that needs real shrinkage — and as the 5-vs-11 count above shows, the apparent winners may not even clear the luck baseline. With OOS, the validated number is the honest one.
wiki/qa-sessions/2026-05-17-session.md#q8 (first formal entry; referenced in earlier Qs)