Concept · Is it a real edge, or luck?
The number of independent observations (trades) used to compute a metric. Bigger sample = sharper estimate.
Every metric in a strategy report — win rate, profit factor, Sharpe ratio, expectancy — is an estimate of an underlying truth, based on the trades the strategy generated. The more trades, the closer the estimate is to the truth. The fewer trades, the wider the range of plausible truths.
Sample size is the single biggest determinant of how much you should trust the metrics.
Trade counts in the 210-strategy fleet range from N=3 (a 50/200 daily variant) to N=10,574 (a 9/21 scalp on 1-minute candles) — a 3,500× spread. The median is N=436. Naively ranking strategies by PnL or Sharpe without weighting by sample size produces a top list dominated by lucky low-N strategies, not robust ones.
A clean illustration sits at the two extremes: id478 (N=3, win rate could be anywhere from 20% to 94%) is pure anecdote, while id511 (N=469, win rate pinned to ±3.9pp) is a trustworthy measurement — even though id478's headline numbers look far better.
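To see how wide the plausible range is at N=3, here is a minimal sketch using a Wilson score interval. The page does not say which interval method it used, and the 2-wins-out-of-3 split is a hypothetical input chosen for illustration. Under those assumptions the interval comes out at roughly 21% to 94%, close to the range quoted for id478.

```python
import math

def wilson_interval(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a win rate (z=1.96 gives roughly 95% coverage)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = wins / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Hypothetical split: 2 winners out of 3 trades (not taken from the fleet data).
lo, hi = wilson_interval(wins=2, n=3)
print(f"N=3, 2 wins: {lo:.1%} to {hi:.1%}")
```

The Wilson interval is used here because the plain normal approximation behaves badly at very small N; other interval methods give somewhat different bounds.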
Standard error scales with 1/sqrt(N). Doubling your sample only reduces uncertainty by ~30%. To halve the confidence interval, you need 4× the trades.
| Trades | 95% CI half-width on win rate, in percentage points (true win rate ≈ 40%) |
|---|---|
| 20 | ±21.5pp |
| 50 | ±13.6pp |
| 100 | ±9.6pp |
| 200 | ±6.8pp |
| 500 | ±4.3pp |
| 1000 | ±3.0pp |
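The table values can be reproduced with the normal-approximation half-width z·sqrt(p(1−p)/N). This is a sketch assuming z = 1.96 and a true win rate of 40%; the function name `wr_half_width` is illustrative and not part of any fleet tooling.

```python
import math

def wr_half_width(n: int, p: float = 0.40, z: float = 1.96) -> float:
    """Normal-approximation 95% CI half-width for a win rate, as a fraction."""
    if n <= 0:
        raise ValueError("n must be positive")
    return z * math.sqrt(p * (1 - p) / n)

for n in (20, 50, 100, 200, 500, 1000):
    print(f"{n:>5} trades: ±{100 * wr_half_width(n):.1f}pp")

# Doubling N shrinks the half-width by 1 - 1/sqrt(2), about 29%.
shrink_on_doubling = 1 - wr_half_width(200) / wr_half_width(100)
# Quadrupling N halves it.
ratio_on_quadrupling = wr_half_width(400) / wr_half_width(100)
```

The normal approximation is reasonable at these sample sizes with p near 40%, but it degrades for very small N or win rates close to 0% or 100%.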
A subtle but critical refinement: raw trade count overstates effective sample size when observations are correlated. See statistical independence. The 210 variants in this dossier are not 210 independent tests — they share the same 21/50-long signal family across correlated assets (BTC/ETH/SOL) in one window, so they move together. Real sample-size growth comes from genuinely different markets and time periods, not from re-running correlated variants.
Practical implication: prioritize temporal extension (out-of-sample windows the strategy has never seen) over piling up more correlated variants.
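One rough way to quantify how correlation shrinks the sample is the equal-correlation formula N_eff = N / (1 + (N − 1)·ρ). The sketch below assumes every pair of observations shares the same correlation ρ, which is a simplification; the ρ values are hypothetical and are not measured from the fleet.

```python
def effective_sample_size(n: int, rho: float) -> float:
    """Effective number of independent observations when all n share
    the same pairwise correlation rho (equal-correlation assumption)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must be between 0 and 1 for this sketch")
    return n / (1 + (n - 1) * rho)

# Hypothetical correlations, for illustration only.
for rho in (0.0, 0.1, 0.5, 0.9):
    print(f"rho={rho:.1f}: 210 variants behave like "
          f"{effective_sample_size(210, rho):.1f} independent tests")
```

Even a modest ρ collapses 210 variants to a handful of effective tests, which is why adding more correlated variants contributes little compared with new markets or time periods.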
The "≥30 trades" threshold is the floor, not the target. For deployable confidence:
| Tier | N | Use |
|---|---|---|
| Floor | 30 | Anecdote only |
| Directional | 100 | Hypothesize edge |
| Deployable | 300 | Trust PF and Sharpe enough to risk capital |
| Robust | 1000+ | Near-bulletproof |
Three independent constraints justify the 300 mark:
N ≈ (1.96 / Sharpe)², where Sharpe is the per-trade Sharpe ratio. For a true per-trade Sharpe of 0.1, you need ~385 trades to distinguish it from zero.

Additional binding constraint: number of winners. For low-win-rate trend-followers (win rate = share of trades that close in profit), a 1000-trade strategy with only 15 winners has the same problem as a 15-trade strategy: only 15 samples of "what works." Use W ≥ 30 for the directional tier and W ≥ 100 for the deployable tier.
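The Sharpe rule and the winner-count rule can be sketched together as follows. The function names are illustrative, and the label returned for fewer than 30 winners is an assumption, since the text names only the two W thresholds.

```python
import math

def trades_to_detect(sharpe_per_trade: float, z: float = 1.96) -> int:
    """Trades needed for a per-trade Sharpe to sit z standard errors above zero,
    using N ≈ (z / Sharpe)² and rounding up."""
    if sharpe_per_trade <= 0:
        raise ValueError("sharpe_per_trade must be positive")
    return math.ceil((z / sharpe_per_trade) ** 2)

def winner_tier(winners: int) -> str:
    """Tier implied by the number of winning trades (W)."""
    if winners >= 100:
        return "deployable"
    if winners >= 30:
        return "directional"
    return "below directional"  # assumed label; the text gives no name for W < 30

print(trades_to_detect(0.1))  # 385
print(winner_tier(15))        # below directional, even inside a 1000-trade run
```

Because the required N grows with the inverse square of the Sharpe ratio, halving the per-trade edge quadruples the trades needed.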
Across this fleet the candidate pool is tiny on every reading: only 5 of 210 rows clear the edge test, and after the multiple-comparisons haircut (~11 would pass by pure chance across 210 correlated variants) none is distinguishable from luck. Sample size alone never rescues a fleet this correlated.
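The "~11 by pure chance" figure is consistent with a 5% per-test significance level, which is an assumption here since the page does not state the threshold. The sketch below also shows a Bonferroni-adjusted threshold as one possible haircut; the page does not name the correction it applies.

```python
def expected_false_positives(n_tests: int, alpha: float = 0.05) -> float:
    """Expected number of tests that pass by chance when no test has a real edge.
    Holds for correlated tests too, since expectations add regardless of dependence."""
    return n_tests * alpha

def bonferroni_alpha(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test threshold that keeps the family-wise error rate at or below alpha."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests

print(expected_false_positives(210))    # about 10.5 chance passes
print(f"{bonferroni_alpha(210):.5f}")   # per-test threshold under Bonferroni
```

With 5 observed passes against roughly 10.5 expected by chance, the observed count is below what luck alone would produce. Bonferroni is conservative for correlated tests, so it is a cautious bound rather than a precise estimate.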
wiki/qa-sessions/2026-05-17-session.md#q2 (first asked here)
wiki/qa-sessions/2026-05-17-session.md#q3 (refinement)
/api/analytics
Put it to the test
Spawn your variant, run it on the same engine, and read the edge-significance verdict — before you risk real money.