PerpForge
Concept · Is it a real edge, or luck?

Sample Size

The number of independent observations (trades) used to compute a metric. Bigger sample = sharper estimate.

In plain English

Every metric in a strategy report — win rate, profit factor, Sharpe ratio, expectancy — is an estimate of an underlying truth, based on the trades the strategy generated. The more trades, the closer the estimate is to the truth. The fewer trades, the wider the range of plausible truths.

Sample size is the single biggest determinant of how much you should trust the metrics.

Why it matters for this fleet

Trade counts in the 210-strategy fleet range from N=3 (a 50/200 daily variant) to N=10,574 (a 9/21 scalp on 1-minute candles) — a 3,500× spread. The median is N=436. Naively ranking strategies by PnL or Sharpe without weighting by sample size produces a top list dominated by lucky low-N strategies, not robust ones.

A clean illustration sits at the two extremes: id478 (N=3, win rate could be anywhere from 20% to 94%) is pure anecdote, while id511 (N=469, win rate pinned to ±3.9pp) is a trustworthy measurement — even though id478's headline numbers look far better.

How sample size drives down uncertainty

Standard error scales with 1/sqrt(N). Doubling your sample only reduces uncertainty by ~30% (1 − 1/sqrt(2) ≈ 0.29). To halve the confidence interval, you need 4× the trades.

Trades    95% CI half-width on WR, in percentage points (true WR ≈ 40%)
20        ±21.5pp
50        ±13.6pp
100       ±9.6pp
200       ±6.8pp
500       ±4.3pp
1000      ±3.0pp
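The half-widths in the table can be reproduced with a short sketch. It assumes the usual normal approximation for a proportion (half-width = 1.96 · sqrt(p(1 − p)/N)), which matches the tabulated values but is not confirmed as the method the report engine uses; the function name `wr_half_width` is hypothetical.

```python
import math

Z_95 = 1.96  # two-sided 95% quantile of the standard normal


def wr_half_width(n_trades: int, true_wr: float = 0.40, z: float = Z_95) -> float:
    """Half-width of the normal-approximation CI on win rate, in percentage points.

    Assumption: trades are independent and the true win rate is `true_wr`.
    """
    return 100 * z * math.sqrt(true_wr * (1 - true_wr) / n_trades)


for n in (20, 50, 100, 200, 500, 1000):
    print(f"{n:>5}  ±{wr_half_width(n):.1f}pp")

# Quadrupling the sample halves the half-width (the 1/sqrt(N) rule).
print(wr_half_width(400) / wr_half_width(100))
```

The last line shows why halving the interval costs four times the trades: the ratio depends only on sqrt(100/400).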

What drives sample size in this fleet

  • Interval. Shorter candles → more crossings → more trades. The 50/200 macro pair on daily candles is the weakest corner of the fleet: daily crossings are structurally rare, so its median is only ≈28 trades, too few to prove anything (none of the variants on that pair is edge-significant). The 21/50 pair on 1h fires far more often.
  • Filter strictness. Volume-gated variants fire less than ungated ones (18 of the 210 are volume-gated).
  • Side. In this bullish backtest window, long-only variants generate more entries than short-only ones.
  • Symbol. SOL whipsaws more than BTC, so SOL trade counts are typically higher on noisy intervals.

Examples from the live fleet

  • id478 (EMA 50/200 · BTC · 1d · 2× · long): just N=3 trades. The profit factor of 20.8 and 66.7% win rate (the share of trades that close in profit) rest on three outcomes: two winners and one loser. Dropping a winner cuts the win rate to 50%; dropping the loser lifts it to 100% and leaves the profit factor undefined. This is sample-size fragility in raw form.
  • id511 (EMA 21/50 · BTC · 1h · 2× · long): N=469 trades. The win rate is pinned to ±3.9pp (percentage points). Removing a handful of trades barely moves the metric.
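To see how wide the plausible range is at N=3, here is a minimal sketch of a 95% Wilson score interval for id478's two wins out of three trades. The Wilson method is an assumption here: the source does not say how its 20% to 94% range was computed, but this interval comes out close to it. The function name `wilson_interval` is hypothetical.

```python
import math


def wilson_interval(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a win rate, returned as (low, high) fractions.

    Assumption: trades are independent Bernoulli outcomes.
    """
    p = wins / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


# id478: a 66.7% win rate on N=3 means 2 winners and 1 loser.
low, high = wilson_interval(wins=2, n=3)
print(f"{low:.1%} to {high:.1%}")
```

The Wilson form is used rather than the plain normal approximation because the latter behaves poorly at very small N and can produce bounds outside 0% to 100%.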

How to mitigate small samples

  • Pool related strategies. A "family" view pools variants that share the same signal. Beware: pooling across leverage adds nothing (id523 at 2× and id659 at 1× are the identical 436 trades — leverage scales PnL, not which trades fire). Pool across genuinely different conditions instead.
  • Extend the backtest window. More years = more trades.
  • Add more symbols. Spawning the rule on three symbols (Phase 125: explicit multi-select at spawn time) gives up to 3× the data, but only if the signal is symbol-independent (its edge does not depend on which symbol it's applied to) and the symbols do not move together. Correlated symbols add less than their raw trade count suggests.
  • Lower the interval thoughtfully. Going from 1d to 4h on the same logic typically generates ~5× more trades. But beware: noisier intervals introduce more whipsaws and may not preserve the original edge.

Refinement — independence matters more than count

A subtle but critical refinement: raw trade count overstates effective sample size when observations are correlated. See statistical independence. The 210 variants in this dossier are not 210 independent tests — they share the same 21/50-long signal family across correlated assets (BTC/ETH/SOL) in one window, so they move together. Real sample-size growth comes from genuinely different markets and time periods, not from re-running correlated variants.

Practical implication: prioritize temporal extension (out-of-sample windows the strategy has never seen) over piling up more correlated variants.
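One way to put a number on how much correlation shrinks a sample is the standard effective-sample-size formula for equally correlated observations, N_eff = N / (1 + (N − 1)ρ). The sketch below is illustrative only: the equal-correlation model and the ρ values are assumptions, not measurements from this fleet, and the function name is hypothetical.

```python
def effective_sample_size(n: int, rho: float) -> float:
    """Effective number of independent observations among n equally correlated ones.

    Assumption: every pair of observations shares the same correlation `rho`
    (0 <= rho <= 1). Real fleets will not be this tidy.
    """
    return n / (1 + (n - 1) * rho)


# Hypothetical correlation levels, chosen only to show the shape of the effect.
for rho in (0.0, 0.1, 0.5, 1.0):
    print(f"rho={rho:.1f}  N_eff={effective_sample_size(210, rho):.1f}")
```

Under this simplification, even modest correlation collapses 210 variants to a handful of effective tests, which is why new time periods and markets add more information than additional correlated variants.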

Refined 2026-05-17 — deployable thresholds

The "≥30 trades" threshold is the floor, not the target. For deployable confidence:

Tier          N        Use
Floor         30       Anecdote only
Directional   100      Hypothesize edge
Deployable    300      Trust PF and Sharpe enough to risk capital
Robust        1000+    Near-bulletproof

Three independent constraints justify the 300 mark:

  1. Win-rate confidence interval tightens to ±5pp (actionable precision) at N ≈ 400; at N = 300 it is already about ±5.5pp for a win rate near 40%.
  2. Sharpe-significance test: N ≈ (1.96 / Sharpe)². For true per-trade Sharpe 0.1, you need ~385 trades to distinguish from zero.
  3. Profit factor estimates stabilize at 300–500 trades for trend-following systems.
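The first two constraints above are simple enough to check by hand or in code. This sketch inverts the normal-approximation formulas to get a required trade count; it assumes independent trades, and the function names are hypothetical rather than part of any PerpForge API.

```python
import math


def trades_for_sharpe(per_trade_sharpe: float, z: float = 1.96) -> int:
    """Trades needed to distinguish a per-trade Sharpe from zero: N ≈ (z / Sharpe)²."""
    return math.ceil((z / per_trade_sharpe) ** 2)


def trades_for_wr_half_width(half_width: float, win_rate: float, z: float = 1.96) -> int:
    """Trades needed for a win-rate CI of ±half_width (both given as fractions)."""
    return math.ceil(z**2 * win_rate * (1 - win_rate) / half_width**2)


print(trades_for_sharpe(0.1))                # 385
print(trades_for_wr_half_width(0.05, 0.40))  # 369
print(trades_for_wr_half_width(0.05, 0.50))  # 385 (worst case: variance peaks at 50%)
```

Both constraints land in the high 300s, which is consistent with the "N ≈ 400" and "~385" figures in the list.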

Additional binding constraint: number of winners. For low-win-rate trend-followers, a 1000-trade strategy with only 15 winners has the same problem as a 15-trade strategy: only 15 samples of "what works." Use W ≥ 30 for the Directional tier and W ≥ 100 for the Deployable tier.
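The trade-count tiers and the winner-count constraint can be combined into one check. The sketch below is one possible reading, not the fleet's actual rule: it assumes the final tier is the lower of the two, that fewer than 30 winners caps a strategy at Floor, and that 100 or more winners imposes no cap. The "Below floor" label and all function names are hypothetical.

```python
TIER_NAMES = ["Below floor", "Floor", "Directional", "Deployable", "Robust"]


def tier_from_trades(n_trades: int) -> int:
    """Index into TIER_NAMES based on total trade count N (30 / 100 / 300 / 1000)."""
    for index, threshold in ((4, 1000), (3, 300), (2, 100), (1, 30)):
        if n_trades >= threshold:
            return index
    return 0


def cap_from_winners(n_winners: int) -> int:
    """Highest tier index the winner count W allows (assumed mapping, see lead-in)."""
    if n_winners >= 100:
        return 4  # assumption: no cap once W >= 100
    if n_winners >= 30:
        return 2  # Directional at most
    return 1      # assumption: Floor at most when W < 30


def confidence_tier(n_trades: int, n_winners: int) -> str:
    """Final tier is the lower of the trade-count tier and the winner-count cap."""
    return TIER_NAMES[min(tier_from_trades(n_trades), cap_from_winners(n_winners))]


print(confidence_tier(1000, 15))  # Floor: plenty of trades, too few winners
print(confidence_tier(500, 40))   # Directional: winners are the binding constraint
print(confidence_tier(469, 120))  # Deployable
```

Taking the minimum makes whichever count is scarcer the binding one, which is the point of the winner constraint.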

Across this fleet the candidate pool is tiny on every reading: only 5 of 210 rows clear the edge test, and after the multiple-comparisons haircut (~11 would pass by pure chance across 210 correlated variants) none is distinguishable from luck. Sample size alone never rescues a fleet this correlated.
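The "~11 by pure chance" figure is consistent with testing 210 variants at a 5% significance level, since 210 × 0.05 = 10.5. The sketch below shows that arithmetic plus a Bonferroni-adjusted threshold as one conservative example of a haircut. Both the 5% level and the Bonferroni choice are assumptions; the source does not state which correction it applies.

```python
def expected_false_positives(n_tests: int, alpha: float = 0.05) -> float:
    """Expected number of tests passing by chance when no test has a real edge."""
    return n_tests * alpha


def bonferroni_alpha(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test threshold that keeps the family-wise error rate at or below alpha."""
    return alpha / n_tests


n_variants = 210
print(round(expected_false_positives(n_variants), 1))  # 10.5, i.e. about 11
print(f"{bonferroni_alpha(n_variants):.6f}")           # per-variant p-value cutoff
```

With only 5 rows passing against roughly 11 expected by chance, the raw pass count by itself is no evidence of an edge.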

Related

Sources

  • wiki/qa-sessions/2026-05-17-session.md#q2 (first asked here)
  • wiki/qa-sessions/2026-05-17-session.md#q3 (refinement)
  • /api/analytics
