Linear Regression Calculator
Enter (x, y) data pairs to find the least-squares best-fit line. The calculator returns the equation y = mx + b along with R², Pearson r, slope, and y-intercept.
Enter one x,y pair per line (comma or space separated)
Regression Statistics
Linear Regression Explained: Slope, R², and the Best-Fit Line
Linear regression is one of the most widely used statistical techniques in science, engineering, economics, and everyday data analysis. Given a set of (x, y) data pairs, it finds the straight line that best represents the underlying relationship between the two variables — minimizing the total squared vertical distance from each data point to the line. This calculator performs ordinary least-squares (OLS) regression and returns the equation, goodness-of-fit statistics, and correlation measures instantly.
What Is Linear Regression?
Linear regression models the relationship between a dependent variable y and an independent variable x using the equation y = mx + b. The parameter m is the slope, describing how much y changes on average for each one-unit increase in x. The parameter b is the y-intercept, representing the predicted value of y when x equals zero.
The method of ordinary least squares determines m and b by minimizing the sum of squared residuals — the squared differences between observed y values and the values predicted by the line. This criterion makes OLS optimal under the Gauss-Markov theorem when standard assumptions are met (linearity, constant variance, and independent errors).
Linear regression is descriptive when applied to existing data (it summarizes the relationship) and predictive when used to forecast y for new x values. In both cases, understanding the quality of the fit — through R² and Pearson r — is essential for interpreting the results responsibly.
The Formulas Behind Linear Regression
For n data pairs (x₁, y₁), (x₂, y₂), …, (xₙ, yₙ), the least-squares slope is calculated as m = (nΣxy − ΣxΣy) ÷ (nΣx² − (Σx)²), where Σ denotes summation over all n points. The y-intercept is then b = (Σy − mΣx) ÷ n, or equivalently b = ȳ − mx̄, where x̄ and ȳ are the sample means.
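These summation formulas translate directly into code. A minimal sketch in Python, using a small hypothetical data set:

```python
# Least-squares slope and intercept from the summation formulas.
# Sample data (hypothetical): five (x, y) pairs lying close to a line.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 9.0, 10.8]

n = len(xs)
sum_x = sum(xs)
sum_y = sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)

m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b = (sum_y - m * sum_x) / n  # equivalently: mean(ys) - m * mean(xs)

print(f"y = {m:.3f}x + {b:.3f}")  # prints: y = 1.950x + 1.150
```

Note that the denominator nΣx² − (Σx)² is zero exactly when all x values are identical, which is why the slope is undefined in that case.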
Once m and b are known, the predicted value for any x is ŷ = mx + b. The residual for each observation is eᵢ = yᵢ − ŷᵢ. The sum of squared residuals, SSR (elsewhere written SSE or RSS), is the quantity that least-squares minimizes. A companion quantity is the total sum of squares (SST), which measures overall variability in y around its mean: SST = Σ(yᵢ − ȳ)².
These two quantities give rise to R², computed as R² = 1 − SSR/SST. When the line fits perfectly, SSR = 0 and R² = 1. When the line does no better than simply predicting ȳ for every x, SSR = SST and R² = 0.
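The definitions of SSR, SST, and R² can be sketched directly from the formulas above, again on hypothetical data:

```python
# R² from the residual and total sums of squares, for hypothetical data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 9.0, 10.8]
n = len(xs)

# Least-squares fit (slope and intercept formulas from the section above).
m = (n * sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys)) / (
    n * sum(x * x for x in xs) - sum(xs) ** 2)
b = (sum(ys) - m * sum(xs)) / n

y_bar = sum(ys) / n
ssr = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))  # residual SS
sst = sum((y - y_bar) ** 2 for y in ys)                    # total SS
r_squared = 1 - ssr / sst
```

For this sample the residuals are small relative to the spread of y around its mean, so r_squared lands close to 1.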
Understanding R² (Coefficient of Determination)
R² quantifies the proportion of variance in y that is statistically accounted for by its linear relationship with x. An R² of 0.85 means that 85% of the variability in y can be attributed to the fitted linear model, while the remaining 15% reflects other factors or inherent randomness. R² is always between 0 and 1 for simple linear regression with an intercept.
There is no universal threshold for a 'good' R². In fields with controlled laboratory conditions, R² values above 0.99 may be expected. In behavioral and social sciences, values of 0.3 to 0.5 are sometimes considered meaningful. The appropriate benchmark depends entirely on the domain, the noise level inherent to the data, and the purpose of the analysis.
Importantly, a high R² does not confirm causation, nor does it guarantee that the linear model is the correct functional form. Residual plots — graphs of eᵢ against x or ŷ — are a valuable diagnostic tool for assessing whether the linearity assumption holds.
Pearson r: Correlation Coefficient
The Pearson correlation coefficient r measures the strength and direction of the linear association between x and y. It ranges from −1 to +1. A value of +1 indicates a perfect positive linear relationship (as x increases, y increases proportionally). A value of −1 indicates a perfect negative linear relationship. A value of 0 suggests no linear association, though nonlinear relationships may still exist.
The relationship between r and R² is straightforward: for simple linear regression, R² = r². This means r captures the sign (direction) of the relationship, while R² describes the magnitude of the fit. If r = 0.9, the slope is positive and R² = 0.81, meaning 81% of variance in y is explained. If r = −0.9, the slope is negative but R² is still 0.81.
The Pearson r formula used here is r = (nΣxy − ΣxΣy) ÷ √[(nΣx² − (Σx)²)(nΣy² − (Σy)²)]. This is equivalent to dividing the sample covariance by the product of the sample standard deviations of x and y.
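The same running sums used for the slope also yield r. A short sketch on the same hypothetical data, confirming that r² matches the R² computed from the sums of squares:

```python
import math

# Pearson r from the summation formula, on hypothetical data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 9.0, 10.8]
n = len(xs)

sum_x, sum_y = sum(xs), sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)
sum_y2 = sum(y * y for y in ys)

num = n * sum_xy - sum_x * sum_y
den = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = num / den  # sign of r matches the sign of the slope
```

Because the numerator is the same quantity that appears in the slope formula, r always carries the same sign as m.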
Practical Applications of Linear Regression
In economics, linear regression underpins a vast range of analyses — from estimating supply and demand elasticities to modeling the relationship between advertising spend and sales revenue. Analysts might regress monthly sales figures against marketing expenditure to estimate the return on each dollar invested.
In the natural sciences, linear regression is routinely applied in calibration curves — for instance, relating absorbance readings to known concentrations of a chemical solution (Beer–Lambert law). Physicists use it to fit experimental data to theoretical predictions, with R² indicating how well the theory describes the observations.
In health research, epidemiologists use regression to quantify associations such as the relationship between age and blood pressure, or between body weight and drug dosage. These analyses inform public health guidelines, though the observational nature of the data means correlation does not imply causation.
In machine learning, simple linear regression is the conceptual foundation for more complex models. Ridge regression, LASSO, and polynomial regression all build on the same least-squares objective. Understanding simple linear regression therefore provides a solid mental model for interpreting more advanced methods.
Assumptions and Limitations
Ordinary least-squares regression relies on several assumptions. First, the relationship between x and y should be approximately linear — if the true relationship is curved, the linear model will systematically misfit the data. Second, the residuals should be roughly normally distributed with constant variance (homoscedasticity) across the range of x. Third, observations should be independent of one another.
Outliers can exert disproportionate influence on the slope and intercept in OLS regression. A single extreme point can shift the line considerably, inflating or deflating R² in misleading ways. Robust regression methods or outlier diagnostics (such as Cook's distance or leverage statistics) are recommended when outliers may be present.
Extrapolation — predicting y for x values far outside the range of the observed data — carries additional risk. The linear relationship observed within the data range may not hold outside it. Predictions should be interpreted cautiously when x lies beyond the support of the training data.
Finally, linear regression describes association, not causation. Confounding variables, selection bias, and reverse causation can all produce strong linear correlations between x and y even when no direct causal mechanism exists. Causal conclusions require careful study design, not statistical fitting alone.
How to Use This Calculator
Enter your (x, y) data pairs in the text area, one pair per line. You can separate x and y values with a comma, space, or tab. For example, entering '1, 2' on the first line and '2, 4' on the second line represents two points: (1, 2) and (2, 4). Lines beginning with # are treated as comments and ignored.
The calculator requires at least two distinct x values to compute a valid regression line. If all x values are identical, the slope is undefined (a vertical line cannot be represented in y = mx + b form) and the calculator will display an error prompt.
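The parsing and validation rules described above can be sketched as follows. This is a minimal illustration, not the calculator's actual implementation; the function name parse_pairs is hypothetical:

```python
import re

def parse_pairs(text):
    """Parse 'x y' pairs, one per line; '#' lines are comments."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Accept comma, space, or tab as the separator.
        parts = re.split(r"[,\s]+", line)
        if len(parts) != 2:
            raise ValueError(f"expected two values per line, got: {line!r}")
        pairs.append((float(parts[0]), float(parts[1])))
    # At least two distinct x values are needed for a defined slope.
    if len({x for x, _ in pairs}) < 2:
        raise ValueError("need at least two distinct x values")
    return pairs

data = parse_pairs("# sample input\n1, 2\n2 4\n3\t5.5")
```

The distinct-x check mirrors the vertical-line case: with identical x values, the slope denominator is zero and no y = mx + b line exists.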
Once results appear, the equation y = mx + b is displayed prominently. The sign convention is explicit: if the intercept is negative, the equation reads y = mx − |b| to avoid double-negative confusion. Slope, intercept, R², and Pearson r are shown in the statistics panel. Use the share button to save and share your regression results.
Frequently Asked Questions
What does the slope m tell me in a linear regression?
The slope m represents the average change in y for each one-unit increase in x. If m = 2.5, then on average, y increases by 2.5 units whenever x increases by 1. A negative slope means y decreases as x increases. The slope is central to interpreting the direction and rate of the linear relationship.
What is R² and how should I interpret it?
R² (the coefficient of determination) measures the proportion of variance in y that is explained by the linear model. It ranges from 0 to 1. An R² of 0.80 means 80% of the variability in y is captured by the fitted line; the remaining 20% is due to other factors or random variation. The appropriate level of R² depends heavily on the field and the inherent noise in the data — there is no single threshold that defines a 'good' fit.
What is the difference between R² and Pearson r?
Pearson r measures the strength and direction of the linear correlation, ranging from −1 to +1. R² is simply r² (r squared), ranging from 0 to 1, and measures the proportion of variance explained. Pearson r carries directional information (positive or negative), while R² does not. For simple linear regression, both are computed from the same underlying sums, and R² = r².
Why does the calculator need at least 2 data points?
A straight line is uniquely defined by any two distinct points. With only one data point, infinitely many lines pass through it, so there is no unique best-fit line. With two points, the line passes exactly through both, giving R² = 1 by definition (provided the two y values differ; if they are equal, SST is zero and R² is undefined). Meaningful regression statistics — especially R² and Pearson r — become interpretable with more data points, as they then reflect genuine fit quality rather than perfect interpolation.
Can I use linear regression if my data is not perfectly linear?
Yes. Linear regression can still be applied to data that is approximately linear. The R² value will reflect how well the linear model fits — a lower R² indicates that a nonlinear model might describe the data better. You can also transform variables (for example, taking the logarithm of x or y) to linearize certain curved relationships before applying regression.
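As a sketch of the transformation idea: for data that grows roughly exponentially (y ≈ a·e^(bx)), taking the logarithm of y gives ln y = ln a + bx, which is linear in x. The data below is hypothetical, chosen to lie near y = eˣ:

```python
import math

# Hypothetical curved data, roughly y = e^x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.7, 7.4, 20.1, 54.6]

# Linearize: regress ln(y) on x instead of y on x.
log_ys = [math.log(y) for y in ys]

n = len(xs)
sum_x, sum_y = sum(xs), sum(log_ys)
sum_xy = sum(x * y for x, y in zip(xs, log_ys))
sum_x2 = sum(x * x for x in xs)
slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
intercept = (sum_y - slope * sum_x) / n
# For the original model: b ≈ slope, a ≈ exp(intercept).
```

Here the fitted slope comes out close to 1 and exp(intercept) close to 1, recovering y ≈ eˣ. The same trick with log x linearizes logarithmic growth, and log-log linearizes power laws.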
Does a high R² prove that x causes y?
No. R² measures statistical fit, not causation. Two variables can be highly correlated — and a regression line can fit them extremely well — without either variable causing the other. Confounding variables, coincidental trends, or reverse causation can all produce high R² values. Establishing causation requires experimental design, randomization, or rigorous causal inference methods beyond regression fitting.