A 60% win rate over 20 trades is statistically indistinguishable from a coin flip. Here's the math that proves it.
You ran a backtest (a simulation of a trading strategy against historical price data, as if you had traded it in the past). It won 12 of 20 trades, a 60% win rate (the percentage of trades that closed in profit). It feels like you found something.
Here is the uncomfortable question: would 20 coin flips have looked any different?
I built a backtester because I kept fooling myself on exactly this point. I would tweak a strategy, see a green number, and feel smart. Then the next batch of data would erase it. The fix was not a better strategy. It was a statistical test that says, for any backtest result, whether the win rate is distinguishable from luck. This post walks through that test in plain English, arithmetic included, so you can run it on your own results.
In statistics, a null hypothesis is the boring explanation you assume is true until the evidence forces you to drop it. For a trading strategy, the boring explanation is brutal: your strategy is a coin flip. Every trade is a 50/50 guess. Any win rate above 50% is noise.
This sounds harsh, but it is the honest starting point. Random entries on a drifting market produce streaks, and some streaks look like skill. The real question is never "did I win more than half my trades?" It is "did I win more than half by enough, over enough trades, that chance is an unreasonable explanation?" "By enough" is about the win rate. "Over enough trades" is about sample size (the number of trades in the backtest). A small sample is the whole problem.
Flip a fair coin 20 times. You expect 10 heads, but 12 would not surprise you. Getting 12 or more heads out of 20 happens about 25% of the time with a perfectly fair coin. One in four fair coins "wins 60% of its trades" over a 20-flip backtest.
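If you would rather check the 25% figure than take it on faith, it can be computed exactly. The snippet below is a minimal sketch using only the Python standard library; the function name `prob_at_least` is made up for this illustration.

```python
from math import comb

def prob_at_least(k: int, n: int, p: float = 0.5) -> float:
    """Exact binomial probability of k or more successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# 12 or more "wins" out of 20 fair coin flips
tail = prob_at_least(12, 20)
print(f"P(12+ wins in 20 coin-flip trades) = {tail:.1%}")
# prints: P(12+ wins in 20 coin-flip trades) = 25.2%
```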
So a 60% win rate over 20 trades is not evidence of skill. It is the kind of result a no-edge strategy produces routinely. The win rate happened, but it does not tell you the strategy's true win rate, because 20 trades is far too small a sample to pin that number down. What you need is a way to quantify how uncertain that number is. That is what a confidence interval does.
A confidence interval is a range that says: given the data you observed, here is where the true underlying value plausibly lives. A 95% confidence interval is built by a procedure that captures the true value 95% of the time when used repeatedly. Applied to win rate: the interval around your observed rate is the plausible range for the true rate you cannot see directly.
Here is the decision rule, and it is the entire test:
If the 95% confidence interval still contains 50%, your strategy is statistically indistinguishable from a coin flip.
Not "is a coin flip." Indistinguishable from one. You may have an edge (a real, repeatable advantage over random guessing). You may not. The backtest, as run, cannot tell, and a result that cannot exclude luck is not a result worth acting on.
The textbook way to build this interval is the normal approximation: take the observed win rate, add and subtract a margin of error. It is what most spreadsheets and blog posts use, and it has two failure modes that bite hardest exactly where traders live, on small samples.
First, it produces impossible answers at the extremes. A strategy that won 9 of 10 trades (90%) gets a normal-approximation interval of roughly 71% to 109%. A win rate above 100% does not exist. The formula does not know that.
Second, it is overconfident on small samples. It centers on your observed win rate and treats that number as more trustworthy than it is.
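The first failure mode is easy to reproduce. The sketch below assumes the plain normal-approximation formula (observed rate plus or minus z times the standard error); the function name `normal_approx_interval` is illustrative, not something from a library.

```python
from math import sqrt

def normal_approx_interval(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Textbook interval: observed rate +/- z * sqrt(p(1-p)/n). Not clipped to [0, 1]."""
    p = wins / n
    margin = z * sqrt(p * (1 - p) / n)
    return p - margin, p + margin

# 9 wins out of 10 trades
low, high = normal_approx_interval(9, 10)
print(f"{low:.1%} to {high:.1%}")
# prints: 71.4% to 108.6%
```

The upper bound lands above 100% because nothing in the formula knows that a win rate is a proportion.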
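The Wilson score interval (published by Edwin Wilson in 1927) fixes both. Instead of centering on your observed rate, it asks a sharper question: which true win rates could plausibly have produced this observation? Two practical consequences: the bounds always stay between 0% and 100%, and on small samples the center of the interval is pulled away from the observed rate toward 50%.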
This is the test PerpForge runs on every backtest result. Let's run it by hand twice.
Strategy A: 12 wins, 8 losses. Observed win rate 60%, sample size 20. We use z = 1.96, the standard multiplier for 95% confidence, so z² ≈ 3.84.
The Wilson interval works in three steps.
Step 1: compute the adjusted center. Add z²/2 ≈ 1.92 phantom "half-wins" to the wins and z² ≈ 3.84 phantom trades to the sample: (12 + 1.92) / (20 + 3.84) = 13.92 / 23.84 ≈ 58.4%. Notice the center already moved from 60% down toward 50%. That is the small-sample skepticism at work.
Step 2: compute the margin of error. The margin is z × √(p(1−p)/n + z²/(4n²)) / (1 + z²/n), where p is the observed rate (0.60) and n is the sample size (20). Inside the square root: 0.60 × 0.40 / 20 = 0.012, plus 3.84 / (4 × 20²) = 3.84 / 1600 ≈ 0.0024, for a total of ≈ 0.0144. Square root ≈ 0.12. Multiply by 1.96 and divide by 1 + 3.84/20 = 1.192: margin ≈ 19.7 percentage points.
Step 3: read the interval. 58.4% ± 19.7 gives roughly 39% to 78%.
Read that range again. The data is consistent with a true win rate of 39%, which loses money on symmetric bets, and with 78%, which would be exceptional. And 50% sits comfortably inside it. Twenty trades at 60% cannot rule out the coin flip. They cannot even rule out a losing strategy.
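The same three steps for Strategy A, written out as code so you can check the arithmetic. This is a sketch that mirrors the hand calculation, not PerpForge's actual implementation; the variable names are chosen for this example.

```python
from math import sqrt

wins, n = 12, 20          # Strategy A
z = 1.96                  # multiplier for 95% confidence
p = wins / n              # observed win rate, 0.60

# Step 1: adjusted center, pulled from the observed rate toward 50%
center = (wins + z**2 / 2) / (n + z**2)

# Step 2: margin of error
margin = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / (1 + z**2 / n)

# Step 3: the interval
lower, upper = center - margin, center + margin
print(f"center {center:.1%}, margin {margin:.1%}, interval {lower:.1%} to {upper:.1%}")
# prints: center 58.4%, margin 19.7%, interval 38.7% to 78.1%
```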
Strategy B: 220 wins, 180 losses. Observed win rate 55%, a less impressive headline number, but sample size 400.
Step 1: adjusted center. (220 + 1.92) / (400 + 3.84) = 221.92 / 403.84 ≈ 54.95%. With 400 trades, the shrinkage toward 50% is barely visible. Large samples earn the right to be taken nearly at face value.
Step 2: margin of error. Inside the square root: 0.55 × 0.45 / 400 ≈ 0.00062, plus 3.84 / 640,000 (negligible). Square root ≈ 0.025. Times 1.96, divided by 1.0096: margin ≈ 4.85 percentage points.
Step 3: the interval. 54.95% ± 4.85 gives roughly 50.1% to 59.8%.
The lower bound clears 50%. Barely, by a tenth of a percentage point, but it clears. This strategy's results are statistically distinguishable from a coin flip at 95% confidence.
Put the two side by side. The 60% trader has the better story. The 55% trader has the better evidence. A modest edge demonstrated across 400 trades beats an impressive one asserted across 20, and the Wilson interval is what makes that visible.
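To run the comparison on your own numbers, the calculation and the decision rule fit in two small functions. This is a minimal sketch: `wilson_interval` and `coin_flip_verdict` are hypothetical names for this post, and the verdict only checks whether 50% falls inside the interval.

```python
from math import sqrt

def wilson_interval(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a win rate; z = 1.96 gives 95% confidence."""
    if n <= 0:
        raise ValueError("need at least one trade")
    p = wins / n
    center = (wins + z**2 / 2) / (n + z**2)
    margin = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / (1 + z**2 / n)
    return center - margin, center + margin

def coin_flip_verdict(wins: int, n: int) -> str:
    """Apply the decision rule: does the 95% interval still contain 50%?"""
    lower, upper = wilson_interval(wins, n)
    if lower <= 0.5 <= upper:
        return "indistinguishable from a coin flip"
    # Note: this also fires when the whole interval sits below 50%.
    return "distinguishable from a coin flip"

for name, wins, n in [("A", 12, 20), ("B", 220, 400)]:
    lower, upper = wilson_interval(wins, n)
    print(f"Strategy {name}: {wins / n:.0%} over {n} trades, "
          f"95% interval {lower:.1%} to {upper:.1%}: {coin_flip_verdict(wins, n)}")
```

Strategy A comes back indistinguishable and Strategy B distinguishable, matching the hand calculations.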
A large share of strategies with a headline-positive win rate still have a Wilson interval that straddles the coin-flip line. The headline number says "winner"; the interval says "not yet distinguishable from chance." That is the whole reason we lead with the interval rather than the win rate.
There is a second trap, and it is worse because it survives large samples. It is called the multiple comparisons problem.
Imagine 1,000 monkeys, each flipping a coin to decide every trade. Pure chance guarantees a spread of outcomes: most monkeys land near 50%, but a few land far above it. Rank all 1,000 by win rate and the top monkey looks brilliant. It is not. It is the right tail of a random distribution, and a right tail always exists.
Now replace monkeys with strategy variants. Sweep an indicator's parameters (every lookback period, every threshold, every timeframe) and you have run hundreds of experiments, so the best one is partly "best by chance" no matter what. The more variants you test, the more impressive your top performer looks, with zero added skill anywhere in the system.
This is why a leaderboard of best performers is systematically misleading unless every entry carries a significance verdict. Sorting by performance is literally selecting for luck. PerpForge's fleet runs over 1,500 strategy variants, which means we manufacture this exact trap daily, on purpose. The per-result Wilson verdict is the countermeasure: the leaderboard shows you which top performers survive the coin-flip test and which are just the luckiest monkey this quarter. It is a filter rather than a full correction, because at 95% confidence a few percent of no-edge variants will still clear 50% by chance.
The pattern is easy to spot once you look for it: the variant that ranks at the top of a parameter sweep on raw win rate is routinely the one whose interval still contains 50%. It won the sweep by being the luckiest, not the most skilled, and the interval is what tells the two apart.
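The monkey experiment is cheap to simulate. The sketch below assumes 1,000 variants with 100 trades each, where every trade is a fair coin flip; the trade count, the seed, and the helper name `wilson_lower` are choices made for this illustration, not PerpForge settings.

```python
import random
from math import sqrt

def wilson_lower(wins: int, n: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for a win rate."""
    p = wins / n
    center = (wins + z**2 / 2) / (n + z**2)
    margin = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / (1 + z**2 / n)
    return center - margin

rng = random.Random(42)       # fixed seed so the run is repeatable
N_VARIANTS = 1000             # the 1,000 monkeys
N_TRADES = 100                # assumed number of trades per monkey

# Every trade is a fair coin flip, so no variant has any edge by construction.
wins_per_variant = [
    sum(rng.random() < 0.5 for _ in range(N_TRADES)) for _ in range(N_VARIANTS)
]

best = max(wins_per_variant)
cleared = sum(wilson_lower(w, N_TRADES) > 0.5 for w in wins_per_variant)

print(f"Top monkey win rate: {best / N_TRADES:.0%}")
print(f"Monkeys whose 95% lower bound clears 50%: {cleared} of {N_VARIANTS}")
```

The top win rate looks impressive even though no monkey has any skill. The second number is usually small but not zero, so the more variants a sweep contains, the more caution a single passing verdict deserves.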
One more thing, because skipping it would make this post the kind of half-truth it argues against.
A statistically significant win rate is not the same thing as a profitable strategy. Win rate ignores the size of wins and losses. A strategy that wins only 40% of the time but makes 3 units per win against 1 unit per loss (a 3:1 reward-to-risk ratio) averages 0.40 × 3 − 0.60 × 1 = +0.6 units per trade. Excellent, despite losing most of its trades. Meanwhile a 60% win rate paired with small wins and large losses bleeds money with full statistical significance.
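The per-trade average is a one-line calculation. In this sketch the function name `expectancy` and the payoff sizes in the second example (wins of 1 unit against losses of 2) are assumptions made for illustration; the first example uses the numbers from the paragraph above.

```python
def expectancy(win_rate: float, avg_win: float, avg_loss: float) -> float:
    """Average result per trade, in the same units as avg_win and avg_loss."""
    return win_rate * avg_win - (1 - win_rate) * avg_loss

low_win_rate = expectancy(0.40, avg_win=3.0, avg_loss=1.0)
high_win_rate = expectancy(0.60, avg_win=1.0, avg_loss=2.0)  # hypothetical payoff sizes

print(f"40% win rate, 3:1 payoff: {low_win_rate:+.2f} units per trade")
print(f"60% win rate, 1:2 payoff: {high_win_rate:+.2f} units per trade")
# prints:
# 40% win rate, 3:1 payoff: +0.60 units per trade
# 60% win rate, 1:2 payoff: -0.20 units per trade
```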
So the Wilson test on win rate is one lens, not a profit guarantee. It answers exactly one question: is the win/loss pattern distinguishable from chance? Sizing, costs, and payoff asymmetry are separate questions, and they all bite. That is why our simulations include trading fees (flat rate) and simulate liquidation (the forced close of a leveraged position when losses exhaust its margin): a backtest that ignores costs and blowups overstates itself in a different way.
And a standing invitation: this is the test we chose, and we do not claim it is the last word. There are other lenses (binomial exact tests, Bayesian approaches, tests on returns rather than win rate). If you know a better way to test whether a strategy beats chance, or you can show a flaw in this one, show us. We will adopt it, credit you, and publish the change.
Before trusting any backtest, yours or anyone else's, ask three questions. How many trades? What is the 95% confidence interval on the win rate? Does that interval still contain 50%? If the answer to the last one is yes, you do not have a result yet. You have a sample that is too small to mean anything, and the honest move is to gather more trades, not more conviction.
If you want to see this test running on real results, the PerpForge leaderboard is public and free to browse, no signup. Every backtest on it, run on real perp data (perpetual futures: crypto contracts with no expiry date that you can hold indefinitely), carries its significance verdict next to its win rate, coin flips and busts included. Looking is free. Testing your own variant is the product.
PerpForge is an educational simulator. No real money is traded, nothing here is financial advice, and it is intended for adults 18+.