Concept · Is it a real edge, or luck?
A confidence interval is a range around a measured statistic that quantifies how uncertain the measurement is. A 95% confidence interval (CI) says: "we are 95% confident the true value lies within this range."
When you measure a strategy's win rate as 40% across 35 trades, you do not actually know the win rate is 40%. You know that in this sample of 35 trades it was 40%. The true underlying win rate could be higher or lower. The 95% CI gives you the range of plausible true values.
Reading a metric without its CI is like reading a thermometer that says "70°F" without knowing whether it is accurate to ±2°F or ±20°F. The number is meaningless without the uncertainty.
For a measured proportion p̂ (e.g. win rate as a fraction) over N trades:
SE = sqrt( p̂ × (1 − p̂) / N )
CI₉₅ = p̂ ± 1.96 × SE
The 1.96 multiplier comes from the normal distribution and corresponds to 95% confidence. For 99% confidence use 2.58; for 68% use 1.0.
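As a minimal sketch of the two formulas above, the following Python computes the interval for the earlier example of a 40% win rate over 35 trades (that is, 14 wins). The names `wald_ci` and `Z_BY_CONFIDENCE` are illustrative and not part of the simulator.

```python
import math

# z multipliers quoted in the text (from the normal distribution).
Z_BY_CONFIDENCE = {0.68: 1.0, 0.95: 1.96, 0.99: 2.58}


def wald_ci(wins, n, confidence=0.95):
    """Return (low, high) for a proportion: p_hat +/- z * SE."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = wins / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # standard error of the proportion
    half_width = Z_BY_CONFIDENCE[confidence] * se
    return p_hat - half_width, p_hat + half_width


low, high = wald_ci(14, 35)  # 14 wins in 35 trades = 40% win rate
print(f"95% CI: {low:.1%} to {high:.1%}")
```

For this sample the interval runs from roughly 24% to 56%, so a measured 40% win rate is compatible with a true win rate anywhere in that band.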
The formula above is the simple Wald approximation. The project's standard is the Wilson interval — a refinement that stays sensible at small N and near 0% or 100% (where Wald can run off the end of [0, 100%]). The CIs cited below are Wilson intervals, taken directly from the dossier.
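For reference, here is a sketch of the textbook Wilson score interval. It assumes the standard form without continuity correction; the project's own implementation may differ in such details, and the function name `wilson_ci` is illustrative.

```python
import math


def wilson_ci(wins, n, z=1.96):
    """Return (low, high) using the Wilson score interval (no continuity correction)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = wins / n
    z2 = z * z
    denom = 1 + z2 / n
    center = (p_hat + z2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z2 / (4 * n * n)) / denom
    # Clamp only to absorb floating-point rounding at the boundaries.
    return max(0.0, center - half_width), min(1.0, center + half_width)


print(wilson_ci(14, 35))  # 40% win rate over 35 trades
print(wilson_ci(0, 10))   # zero wins: the interval still has a non-zero width
```

With zero wins the Wald formula collapses to a zero-width interval at 0%, whereas the Wilson interval stays inside [0, 1] and keeps a non-zero upper bound, which is the behaviour the text describes at small N and near the extremes.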
Almost no displayed metric in the trading simulator is shown with its CI. This means you, the human reader, have to mentally apply the CI when interpreting low-N strategies. Failing to do so leads to ranking lucky strategies as "best."
| Trades | ± half-width |
|---|---|
| 20 | 21.5% |
| 30 | 17.5% |
| 50 | 13.6% |
| 100 | 9.6% |
| 200 | 6.8% |
| 500 | 4.3% |
| 1000 | 3.0% |
To halve the CI's width you need 4× the trades, because the standard error shrinks in proportion to 1/√N. There is no shortcut.
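The half-widths in the table can be reproduced with the Wald formula. The sketch below assumes a win rate of 40%; that value is inferred from the table's numbers rather than stated in the source, and `wald_half_width` is an illustrative name.

```python
import math


def wald_half_width(p_hat, n, z=1.96):
    """Half-width of the 95% Wald interval for a proportion."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)


P_HAT = 0.40  # assumption: inferred from the table values, not stated in the source

for n in (20, 30, 50, 100, 200, 500, 1000):
    print(f"{n:>5} trades: ±{wald_half_width(P_HAT, n):.1%}")

# Quadrupling the trade count halves the half-width (ratio is ~2).
ratio = wald_half_width(P_HAT, 100) / wald_half_width(P_HAT, 400)
print(f"half-width ratio, N=100 vs N=400: {ratio:.2f}")
```

Because N sits under a square root, every halving of the interval costs four times the sample: going from ±9.6% to about ±4.8% takes 400 trades rather than 100.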
The first has a flashier headline (66.7% win rate, profit factor 20.8), but the second has the only trustworthy estimate. Sample size buys precision.
Every backtest metric is an estimate with a CI. Sample size is the only knob that tightens it.
When comparing two strategies, ask: "Do their CIs overlap?"
These intervals overlap (both include the low-to-high-20s). Despite id478's headline win rate being more than double id511's, you cannot say id478 has a higher true win rate — its interval is so wide it reaches all the way down into id511's range. They might differ. Or they might not.
This is the formal way of saying "low-N strategies are not comparable to high-N strategies."
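The overlap question can be sketched as a simple check on two Wilson intervals. The trade counts below are hypothetical, chosen only for illustration; they are not the figures for id478 or id511, and the helper names are not part of the simulator.

```python
import math


def wilson_ci(wins, n, z=1.96):
    """Wilson score interval (no continuity correction), clamped to [0, 1]."""
    p_hat = wins / n
    z2 = z * z
    denom = 1 + z2 / n
    center = (p_hat + z2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z2 / (4 * n * n)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


def intervals_overlap(a, b):
    """True if closed intervals a=(low, high) and b=(low, high) share any point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical strategies: a low-N one with a high headline rate,
# and a high-N one with a modest rate.
small_sample = wilson_ci(5, 10)       # 50% win rate over 10 trades
large_sample = wilson_ci(250, 1000)   # 25% win rate over 1000 trades

print(small_sample, large_sample)
print("overlap:", intervals_overlap(small_sample, large_sample))
```

In this hypothetical case the 10-trade interval is wide enough to reach down into the 1000-trade interval, so the doubled headline win rate does not establish a real difference. Overlap leaves the question open; it does not prove the two rates are equal.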
wiki/qa-sessions/2026-05-17-session.md#q2 (first asked here)