
Statistical Power Calculator

Calculate the required sample size or achieved statistical power for your study. Enter Cohen's d effect size, significance level, and desired power to plan your experiment.

Example inputs: Cohen's d = 0.5 (a standardized mean difference — Cohen's conventions are example values; enter your own), significance level α = 0.05 (two-tailed), and target power 0.80, for which the calculator uses z(α) ≈ 1.9604 and z(β) ≈ 0.8415. Results are based on the normal approximation, with the normal CDF computed via the Abramowitz & Stegun formula. For small samples, consider exact t-test power calculations.

Understanding Statistical Power: A Practical Guide

Statistical power is the probability that a study will correctly detect an effect when one truly exists. In formal terms, power equals one minus the probability of a Type II error (failing to reject a false null hypothesis). A study with 80% power has a 20% chance of missing a real effect. Power analysis is a fundamental step in experimental design, helping researchers balance the cost of data collection against the risk of inconclusive results.
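To make the definition concrete, power under the normal approximation can be computed directly. The sketch below is illustrative rather than the calculator's actual code; the function name is mine, and it covers a two-tailed, two-sample z-test using only the Python standard library.

```python
from statistics import NormalDist

def power_two_sample(d: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-tailed, two-sample z-test.

    Normal-approximation sketch: the negligible probability of
    rejecting in the wrong direction is ignored, as is standard
    in planning calculations.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)       # two-tailed critical value
    shift = d * (n_per_group / 2) ** 0.5     # test-statistic mean under H1
    return z.cdf(shift - z_alpha)
```

For example, `power_two_sample(0.5, 64)` returns about 0.81, matching the familiar rule of thumb that roughly 64 subjects per group give 80% power to detect a medium effect.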

The Four Components of Power Analysis

Four quantities are interconnected in any power analysis: effect size, significance level (alpha), statistical power, and sample size. Given any three of these, the fourth can be determined. Effect size measures the magnitude of the difference or relationship you expect to find. The significance level, typically set at 0.05, defines the threshold for rejecting the null hypothesis. Power, commonly targeted at 0.80 or higher, is the probability of detecting the effect. Sample size is the number of observations needed in each group.

This calculator allows you to fix three of these values and solve for the remaining one. In the most common use case, researchers specify their expected effect size, alpha, and desired power, then solve for the required sample size.
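Solving for a quantity other than sample size works the same way. As a hedged sketch (the function name is mine, not the calculator's), fixing n, alpha, and power yields the minimum detectable effect size of a two-tailed, two-sample test:

```python
from statistics import NormalDist

def min_detectable_d(n_per_group: int, alpha: float = 0.05,
                     power: float = 0.80) -> float:
    """Smallest Cohen's d detectable at the given per-group n
    (two-tailed, two-sample z-test, normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    # Inverting n = 2 * ((z_alpha + z_beta) / d)^2 for d:
    return (z_alpha + z_beta) * (2 / n_per_group) ** 0.5
```

With 64 subjects per group, the smallest reliably detectable effect is d ≈ 0.5, consistent with the example above.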

Cohen's d: Measuring Effect Size

Cohen's d is one of the most widely used measures of effect size for comparing means. It expresses the difference between two group means in units of the pooled standard deviation: d = (M1 - M2) / SD_pooled. Jacob Cohen proposed conventions for interpreting d values: 0.2 represents a small effect, 0.5 a medium effect, and 0.8 a large effect. These conventions are widely cited but should be used cautiously, as what constitutes a meaningful effect varies by field.
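The definition above translates directly into code. This is a minimal sketch (function name mine) that computes Cohen's d from two samples, pooling the two sample variances weighted by their degrees of freedom:

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(group1: list, group2: list) -> float:
    """Cohen's d: difference in means over the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    # Pooled variance weights each group's sample variance by its df.
    sd_pooled = sqrt(((n1 - 1) * variance(group1) +
                      (n2 - 1) * variance(group2)) / (n1 + n2 - 2))
    return (mean(group1) - mean(group2)) / sd_pooled
```

A positive d means the first group's mean is higher; the sign convention is arbitrary, and the magnitude is what the conventions above refer to.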

In medical research, a small effect size may be clinically important if the intervention is inexpensive and low-risk. In educational research, a medium effect might represent a substantial improvement in test scores. Researchers are encouraged to estimate expected effect sizes from pilot data or prior literature rather than relying solely on Cohen's conventions.

One-Tailed vs Two-Tailed Tests

A two-tailed test checks for an effect in either direction (the treatment could help or harm), while a one-tailed test only checks for an effect in one specified direction. One-tailed tests require smaller sample sizes for the same power because the critical region is concentrated on one side of the distribution. However, they are only appropriate when there is a strong theoretical reason to expect the effect in only one direction and when an effect in the opposite direction would be scientifically irrelevant.

Most research uses two-tailed tests as the default because they are more conservative and do not require pre-specifying the direction of the effect.
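The sample-size saving from a one-tailed test follows from the critical values. Since required n scales with (z_alpha + z_beta)^2, the two can be compared directly; this short sketch (variable names mine) quantifies the difference at alpha = 0.05 and power 0.80:

```python
from statistics import NormalDist

z = NormalDist()
alpha = 0.05
z_two = z.inv_cdf(1 - alpha / 2)   # two-tailed critical value, ~1.960
z_one = z.inv_cdf(1 - alpha)       # one-tailed critical value, ~1.645
z_beta = z.inv_cdf(0.80)           # ~0.842 for 80% power

# Required n is proportional to (z_alpha + z_beta)^2, so for a fixed
# effect size the one-tailed test needs this fraction of the subjects:
ratio = ((z_one + z_beta) / (z_two + z_beta)) ** 2
```

The ratio comes out to about 0.79, i.e. roughly 21% fewer subjects for the one-tailed test, which is the quantitative content of the claim above.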

Test Types: One-Sample, Two-Sample, and Paired

A one-sample test compares a single group's mean to a known or hypothesized value, and the sample size formula applies directly: n = ((z_alpha + z_beta) / d)^2, with z_alpha replaced by z_(alpha/2) for a two-tailed test. A two-sample (independent) test compares the means of two separate groups; because two sample means must each be estimated, the per-group requirement doubles to n = 2 * ((z_alpha + z_beta) / d)^2, making the total across both groups about four times the one-sample requirement. A paired test compares measurements from the same subjects under two conditions and uses the one-sample formula, with d representing the standardized mean of the within-pair differences.

Paired designs are often more powerful than independent designs because within-subject variability is typically smaller than between-subject variability, leading to larger effective effect sizes for the same underlying treatment effect.
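The three cases can be folded into one routine. This is a sketch of the standard two-tailed z-approximation formulas (the function name and the string-based test selector are mine, not the calculator's interface):

```python
from math import ceil
from statistics import NormalDist

def required_n(d: float, test: str = "two-sample",
               alpha: float = 0.05, power: float = 0.80) -> int:
    """Required sample size under the two-tailed normal approximation.

    "one-sample" / "paired": total subjects (for paired, d is computed
                             on the within-pair differences)
    "two-sample":            subjects per group, so recruit twice this
    """
    z = NormalDist()
    k = ((z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) / d) ** 2
    return ceil(2 * k) if test == "two-sample" else ceil(k)
```

For d = 0.5 this gives 32 subjects for a one-sample or paired design, but 63 per group (126 total) for an independent two-group design, illustrating why paired designs are often so much cheaper.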

The Normal Approximation

This calculator uses the normal (z) approximation for power calculations, which is standard for planning purposes and yields sample sizes that are accurate for most practical scenarios. The Abramowitz and Stegun approximation provides the normal cumulative distribution function and its inverse with high precision. For very small sample sizes (roughly n less than 30), the t-distribution provides more accurate power estimates, and researchers may want to verify results with specialized statistical software.
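The calculator's exact code is not shown, but the classic Abramowitz & Stegun approximation it cites (their formula 7.1.26 for the error function) can be implemented in a few lines:

```python
from math import exp, sqrt

def norm_cdf(x: float) -> float:
    """Normal CDF via the Abramowitz & Stegun erf approximation (7.1.26).

    Maximum absolute error about 1.5e-7 -- ample precision for
    sample-size planning.
    """
    z = x / sqrt(2.0)
    sign = 1.0 if z >= 0 else -1.0
    z = abs(z)
    t = 1.0 / (1.0 + 0.3275911 * z)
    poly = t * (0.254829592 + t * (-0.284496736 + t * (1.421413741
               + t * (-1.453152027 + t * 1.061405429))))
    erf = sign * (1.0 - poly * exp(-z * z))
    return 0.5 * (1.0 + erf)
```

For instance, `norm_cdf(1.96)` is 0.975 to within the stated error, which is where the critical values shown by the calculator come from.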

The z-approximation tends to slightly underestimate the required sample size compared to exact t-test calculations, so the results from this calculator can be considered a reasonable lower bound for planning purposes.

Practical Considerations

When planning a study, it is common practice to inflate the calculated sample size by 10–20% to account for anticipated dropout, missing data, or protocol deviations. The sample size from a power analysis represents the minimum number of analyzable observations needed, not the number of participants to recruit.
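One common way to apply that adjustment, sketched here with hypothetical names, is to divide by the expected retention rate rather than multiply by (1 + dropout), so the expected number of analyzable observations still meets the target:

```python
from math import ceil

def recruit_target(n_analyzable: int, dropout_rate: float = 0.15) -> int:
    """Inflate a computed sample size to cover anticipated dropout.

    Dividing by the retention rate (1 - dropout) guarantees the
    expected analyzable n, which a simple (1 + dropout) markup
    slightly undershoots.
    """
    if not 0 <= dropout_rate < 1:
        raise ValueError("dropout_rate must be in [0, 1)")
    return ceil(n_analyzable / (1 - dropout_rate))
```

With 15% anticipated dropout, a requirement of 64 analyzable subjects becomes a recruitment target of 76.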

If the required sample size is impractically large, researchers can consider increasing the anticipated effect size (by refining the intervention), relaxing the significance level (though this increases Type I error risk), accepting lower power (though this increases the risk of a null result), or switching to a paired or within-subjects design if feasible. Each of these trade-offs should be evaluated in the context of the specific research question and its practical implications.

Frequently Asked Questions

What is statistical power?

Statistical power is the probability that a test will correctly reject the null hypothesis when there is a true effect. A power of 0.80 means there is an 80% chance of detecting the effect if it exists. Higher power reduces the risk of a Type II error (a false negative), but typically requires larger sample sizes.

What effect size should I use?

The best approach is to estimate your expected effect size from pilot data or published studies in your field. If no prior data is available, Cohen's conventions provide a starting point: d = 0.2 for a small effect, d = 0.5 for a medium effect, and d = 0.8 for a large effect. However, these conventions are general guidelines and may not reflect what is meaningful in your specific context.

Why is 0.80 a common target for power?

The 0.80 convention was proposed by Jacob Cohen as a reasonable balance between the risk of a Type II error (20%) and the cost of collecting additional data. Some fields, particularly clinical trials, may require higher power (0.90 or 0.95) because the consequences of missing a real treatment effect are more serious.

What is the difference between one-tailed and two-tailed tests?

A two-tailed test checks for effects in both directions (the treatment could be better or worse), while a one-tailed test only checks one direction. One-tailed tests need fewer subjects to achieve the same power, but they assume you have no interest in detecting an effect in the opposite direction. Two-tailed tests are the default in most research.

How accurate is the normal approximation for power calculations?

The normal (z) approximation is standard for sample size planning and is accurate for moderate to large sample sizes. For very small expected sample sizes (roughly under 30 per group), the exact t-distribution yields slightly different results. The z-approximation tends to slightly underestimate the required sample size, making it a reasonable lower bound for planning.