Correlation & Regression
Community Medicine · Biostatistics · lean revision notes
Correlation and regression are the two pillars of bivariate analysis in biostatistics — they describe and quantify the relationship between two quantitative variables. Correlation tells you whether and how strongly two variables move together; regression lets you predict one variable from another. NEET PG repeatedly tests the difference between the two, the interpretation of the coefficient r, the choice between Pearson and Spearman, and the meaning of r².
Core concepts: what each method answers
A scatter (dot) diagram is the starting point for any bivariate analysis. Each point represents a paired observation (x, y) for one individual — e.g. height vs weight, age vs systolic BP, cigarettes/day vs FEV1. The shape and tilt of the cloud of points tells you the nature of the relationship.
- Correlation → measures the degree of linear association. It is symmetric: correlation of x with y equals correlation of y with x. It is unitless (dimensionless).
- Regression → produces an equation to predict the dependent (outcome) variable from the independent (predictor) variable. It is asymmetric: regression of y on x differs from regression of x on y. The slope has units (units of y per unit of x).
High-yield: Correlation quantifies strength and direction of a linear relationship; regression quantifies the form of the relationship and is used for prediction. Correlation = "how tight"; regression = "predict the value".
Correlation coefficient (r): definition and interpretation
The Pearson product-moment correlation coefficient (r) measures the strength and direction of a linear relationship between two quantitative, normally distributed variables.
Formula essence: r = covariance(x, y) ÷ (SDₓ × SD_y). It always lies between −1 and +1.
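As a minimal sketch of the formula above, the function below computes r as the covariance divided by the product of the two standard deviations. The name `pearson_r`, the choice of sample (n − 1) formulas, and the height/weight figures are illustrative assumptions, not data from any study.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson r = covariance(x, y) / (SD_x * SD_y), using sample (n - 1) formulas."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return cov / (sd_x * sd_y)  # raises ZeroDivisionError if either variable is constant

# Made-up illustrative data: taller people tend to be heavier.
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [52, 58, 69, 75, 86]

r = pearson_r(heights_cm, weights_kg)
print(round(r, 3))
```

Swapping the two arguments returns the same value, which is the symmetry property described earlier.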
| r value | Direction & strength (conventional cut-offs) |
|---|---|
| +1 | Perfect positive linear correlation (all points on an upward line) |
| +0.7 to +0.99 | Strong positive |
| +0.3 to <+0.7 | Moderate positive |
| >0 to <+0.3 | Weak positive |
| 0 | No linear correlation |
| >−0.3 to <0 | Weak negative |
| >−0.7 to −0.3 | Moderate negative |
| −0.99 to −0.7 | Strong negative |
| −1 | Perfect negative linear correlation (all points on a downward line) |
Interpreting the sign: a positive r means y increases as x increases (e.g. height and weight); a negative r means y decreases as x increases (e.g. age and renal function, or number of cigarettes and FEV1).
High-yield: The numerical magnitude (closeness to 1) reflects strength; the sign reflects direction. An r of −0.8 indicates a stronger relationship than an r of +0.5 — do not be fooled by the minus sign.
Crucial caveat — r = 0 does not mean "no relationship": Pearson's r only detects linear association. A perfect U-shaped (quadratic) relationship can give r ≈ 0 despite a strong, deterministic relationship. Always look at the scatter plot.
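The caveat can be checked numerically. The sketch below builds a perfectly deterministic U-shaped relationship (y = x², on made-up x values symmetric about zero) and computes Pearson's r for it; the helper name `pearson_r` is an assumption for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson r via sums of squares (the n - 1 terms cancel out)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# y is completely determined by x, but the relationship is U-shaped, not linear.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r_u_shape = pearson_r(xs, ys)
print(r_u_shape)  # essentially zero despite the perfect dependence of y on x
```

The downward left arm and the upward right arm cancel each other in the covariance, which is why the scatter plot has to be inspected before r is trusted.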
Assumptions for Pearson's r
- Both variables are continuous (interval/ratio) and quantitative.
- Relationship is linear.
- Variables are approximately normally distributed (bivariate normality); this matters most for significance tests and confidence intervals on r.
- No major outliers (Pearson is highly sensitive to outliers — a single extreme point can inflate or deflate r dramatically).
- Homoscedasticity (constant spread of y across the range of x).
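The outlier assumption is easy to illustrate. In this sketch, six made-up points with only a weak trend are analysed with and without one extreme pair added; the data and the helper name `pearson_r` are assumptions chosen purely for demonstration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson r via sums of squares (the n - 1 terms cancel out)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Six made-up points with a weak upward trend.
xs = [1, 2, 3, 4, 5, 6]
ys = [3, 1, 4, 2, 5, 3]
r_without = pearson_r(xs, ys)

# Add a single extreme observation far from the rest of the cloud.
r_with = pearson_r(xs + [20], ys + [20])

print(round(r_without, 2), round(r_with, 2))
```

One added point is enough to turn a weak correlation into an apparently strong one here, which is why the scatter plot should be screened for outliers before Pearson's r is reported.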
Spearman rank correlation (ρ / r_s)
The Spearman rank correlation coefficient (rho, ρ or r_s) is the non-parametric counterpart. It assesses monotonic association by ranking the data and applying Pearson's formula to the ranks.
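A minimal sketch of that procedure, assuming ties receive the average of the ranks they occupy (the usual convention): rank each variable, then apply Pearson's formula to the ranks. The function names and the cubic example data are illustrative assumptions.

```python
from math import sqrt

def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson_r(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def spearman_rho(xs, ys):
    """Spearman's rho = Pearson's r computed on the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))

# Monotonic but non-linear made-up data: y = x cubed.
xs = [1, 2, 3, 4, 5]
ys = [x ** 3 for x in xs]

rho = spearman_rho(xs, ys)
r = pearson_r(xs, ys)
print(round(rho, 3), round(r, 3))
```

Because y rises every time x rises, the ranks agree perfectly and ρ reaches 1, while Pearson's r on the raw values falls short of 1 because the points do not lie on a straight line.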
When to prefer Spearman over Pearson — a favourite MCQ:
| Situation | Use Spearman because… |
|---|---|
| Ordinal data (e.g. pain score 0–10, tumour grade, Likert scale) | Pearson requires interval/ratio data |
| Skewed / non-normal distribution | Pearson assumes bivariate normality |
| Presence of outliers | Ranks blunt the leverage of extreme values |
| Non-linear but monotonic relationship | Spearman detects any consistently increasing/decreasing trend |
| Small sample with unknown distribution | Distribution-free method |
High-yield: Choose Spearman when data are ordinal, non-normal/skewed, or contain outliers, OR when the relationship is monotonic but not strictly linear. Choose Pearson for two normally-distributed continuous variables with a linear relationship.
Both ρ and r range from −1 to +1 and are interpreted identically by sign and magnitude. Kendall's tau (τ) is another rank-based measure tested occasionally; it too is non-parametric and ranges −1 to +1.
Coefficient of determination (r²)
The coefficient of determination, r², is the square of r. In simple linear regression it equals the proportion of the variability (variance) in the dependent variable that is explained by the independent variable.
- If r = 0.8 → r² = 0.64 → 64% of the variation in y is explained by x; the remaining 36% is due to other/unmeasured factors and random error.
- r² ranges from 0 to 1 (or 0–100%) and is never negative (squaring removes the sign).
High-yield: r² = proportion of variance in the outcome explained by the predictor. An r of 0.5 explains only 25% of variance — looks moderate but explains relatively little. This "r → r² shrinkage" is a classic trap.
Approach to an r² question: Find r → square it → multiply by 100 → state "% of variation in y explained by x".
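The three-step approach can be written as a tiny helper. This is only a sketch; the function name `percent_variance_explained` is an assumption, and the r values are the ones already used in these notes.

```python
def percent_variance_explained(r):
    """Square r and express it as a percentage; r must lie in [-1, +1]."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must be between -1 and +1")
    return (r ** 2) * 100

for r in (0.8, 0.5, -0.6):
    pct = round(percent_variance_explained(r), 1)
    print(f"r = {r}: {pct}% of the variation in y is explained by x")
```

A negative r gives the same percentage as the positive r of equal magnitude, since squaring discards the sign.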
Linear regression: the equation
Simple linear regression fits the best-fitting straight line through the scatter using the method of least squares (minimising the sum of squared vertical distances of points from the line, i.e. minimising the residuals).
The regression equation:
y = a + bx
where:
- y = dependent / outcome / predicted variable
- x = independent / predictor / explanatory variable
- a = intercept (value of y when x = 0; where the line crosses the y-axis)
- b = regression coefficient / slope = the change in y for every one-unit increase in x
The slope b carries the units of (y-units per x-unit). Its sign matches the sign of r — both are positive or both negative.
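For readers who want to see where a and b come from, here is a minimal least-squares sketch. It uses the standard closed-form estimates, b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and a = ȳ − b·x̄; the function name `fit_line` and the age/SBP figures are made-up assumptions.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for the line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx              # units: y-units per x-unit
    a = mean_y - b * mean_x    # the fitted line passes through (mean_x, mean_y)
    return a, b

# Made-up illustrative data: age in years, systolic BP in mmHg.
ages = [30, 40, 50, 60, 70]
sbp = [124, 141, 154, 171, 185]

a, b = fit_line(ages, sbp)
residuals = [y - (a + b * x) for x, y in zip(ages, sbp)]
print(round(a, 2), round(b, 2), round(sum(residuals), 6))
```

The residuals of a least-squares line with an intercept sum to zero, which serves as a quick sanity check on any hand calculation.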
Worked prediction (the commonest numerical MCQ)
Suppose the regression of systolic BP (y, mmHg) on age (x, years) is y = 80 + 1.5x. Predicted SBP for a 40-year-old = 80 + (1.5 × 40) = 80 + 60 = 140 mmHg.
Stepwise: Identify a and b → substitute given x → compute y.
High-yield: In y = a + bx, b (slope) = change in y per unit change in x, and a (intercept) = y when x = 0. To predict, substitute the x value. Extrapolating far beyond the observed range of x is statistically unsound.
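The substitution step, together with the extrapolation warning, can be sketched as follows. The equation y = 80 + 1.5x is the worked example from these notes; the function name `predict` and the observed age range of 20 to 70 years are hypothetical assumptions added only to show the range check.

```python
def predict(a, b, x, observed_range=None):
    """Return (predicted y, extrapolated?) for the line y = a + b*x.

    observed_range is an optional (min_x, max_x) tuple describing the x values
    the line was fitted on; predictions outside it are flagged, not blocked.
    """
    y_hat = a + b * x
    extrapolated = (
        observed_range is not None
        and not (observed_range[0] <= x <= observed_range[1])
    )
    return y_hat, extrapolated

# Equation from the worked example; the age range is a hypothetical assumption.
y40, flag40 = predict(80, 1.5, 40, observed_range=(20, 70))
y95, flag95 = predict(80, 1.5, 95, observed_range=(20, 70))

print(y40, flag40)
print(y95, flag95)
```

The arithmetic works for any x, so the flag is the only thing that signals when a prediction falls outside the range the line was fitted on.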
Relationship between r and b
The slope and correlation coefficient are linked: b = r × (SD_y / SDₓ). So when SDₓ = SD_y, the slope equals r. They always share the same sign, but their magnitudes differ because b is scaled by the standard deviations and carries units, whereas r is unitless.
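The identity can be verified numerically. This sketch computes the slope twice on made-up age/SBP data, once by least squares and once as r × (SD_y / SDₓ), then repeats the fit on standardised (z-score) variables, where both SDs equal 1. All variable names and data are illustrative assumptions.

```python
from math import sqrt

# Made-up illustrative data: age in years, systolic BP in mmHg.
xs = [30, 40, 50, 60, 70]
ys = [124, 141, 154, 171, 185]

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)

sd_x = sqrt(sxx / (n - 1))
sd_y = sqrt(syy / (n - 1))
r = sxy / sqrt(sxx * syy)

b_least_squares = sxy / sxx      # slope from the least-squares formula
b_from_r = r * sd_y / sd_x       # slope from b = r * (SD_y / SD_x)

# Standardise both variables so that each has SD = 1, then refit the slope.
zx = [(x - mean_x) / sd_x for x in xs]
zy = [(y - mean_y) / sd_y for y in ys]
b_standardised = sum(u * v for u, v in zip(zx, zy)) / sum(u * u for u in zx)

print(round(b_least_squares, 4), round(b_from_r, 4))
print(round(b_standardised, 4), round(r, 4))
```

The first pair of numbers should agree with each other, and the standardised slope should match r, which is the equal-SD special case stated above.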
Correlation vs causation — the cardinal warning
A statistically significant correlation, however strong, does not prove that x causes y. This is one of the most-loved conceptual questions in Community Medicine.
Reasons a correlation may exist without causation:
- Confounding / lurking variable — a third factor drives both (e.g. ice-cream sales correlate with drowning deaths; the confounder is hot weather/summer).
- Reverse causation — y may actually cause x.
- Chance / spurious correlation — coincidental, especially with small samples or many comparisons.
- Bias in selection or measurement.
High-yield: "Correlation does not imply causation." Establishing causation requires the Bradford Hill criteria (strength, consistency, specificity, temporality, biological gradient/dose-response, plausibility, coherence, experiment, analogy). Temporality (cause precedes effect) is the only absolutely essential criterion, and the randomised controlled trial is the strongest design for causation.
Mnemonic for Bradford Hill criteria — "Cause-effect TROUBLES": Temporality, Reversibility/experiment, Outcome plausibility, Biological gradient, Strength, etc. A simpler classic: "Strength, Consistency, Specificity, Temporality, Dose-response, Plausibility, Coherence, Experiment, Analogy."
Multiple linear regression
When two or more independent variables are used to predict a single continuous outcome:
y = a + b₁x₁ + b₂x₂ + b₃x₃ + … + bₙxₙ
Each partial regression coefficient (b₁, b₂…) gives the effect of that predictor on y while holding all other predictors constant — this is how multiple regression adjusts for confounding.
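To make "holding the others constant" concrete, the sketch below fits a two-predictor model by solving the normal equations on centred data. The dataset is deliberately constructed (not real): the outcome depends only on x2, and x1 is merely correlated with x2, mimicking a confounded association. The function names are assumptions.

```python
def centred_sum(u, v):
    """Sum of cross-products of deviations from the mean."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))

def fit_two_predictors(x1, x2, y):
    """Least-squares a, b1, b2 for y = a + b1*x1 + b2*x2."""
    s11, s22, s12 = centred_sum(x1, x1), centred_sum(x2, x2), centred_sum(x1, x2)
    s1y, s2y = centred_sum(x1, y), centred_sum(x2, y)
    det = s11 * s22 - s12 ** 2
    if det == 0:
        raise ValueError("predictors are perfectly collinear")
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    n = len(y)
    a = sum(y) / n - b1 * sum(x1) / n - b2 * sum(x2) / n
    return a, b1, b2

# Constructed data: y is exactly 2 + 3*x2; x1 only tracks x2 (the confounder).
x2 = [1, 2, 3, 4, 5, 6]
x1 = [2, 1, 4, 3, 6, 5]
y = [2 + 3 * v for v in x2]

crude_b1 = centred_sum(x1, y) / centred_sum(x1, x1)   # simple regression of y on x1
a, b1, b2 = fit_two_predictors(x1, x2, y)             # adjusted for x2

print(round(crude_b1, 3), round(b1, 3), round(b2, 3), round(a, 3))
```

The crude slope for x1 is clearly non-zero, yet its partial coefficient collapses to about zero once x2 enters the model, which is the adjustment for confounding in miniature.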
| Outcome (dependent) variable | Appropriate regression |
|---|---|
| Continuous (e.g. BP, BMI, blood glucose) | Linear regression |
| Binary / dichotomous (disease yes/no) | Logistic regression (gives odds ratios) |
| Time-to-event (survival, with censoring) | Cox proportional hazards regression |
| Count data (number of events) | Poisson regression |
High-yield: Linear regression → continuous outcome. Logistic regression → binary outcome (yields odds ratio). Cox regression → survival/time-to-event data. This mapping is frequently asked.
In multiple linear regression, plain R² never falls when a predictor is added, even a useless one. Adjusted R² penalises such predictors, so it is preferred over plain R² when comparing models with different numbers of variables.
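The penalty can be sketched with the standard formula, adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the sample size and p the number of predictors. The function name and the example figures (R² = 0.64, n = 50) are illustrative assumptions.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for a model with p predictors fitted on n observations."""
    if n - p - 1 <= 0:
        raise ValueError("need n > p + 1")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Same plain R-squared, one extra predictor that adds nothing.
adj_3 = adjusted_r2(0.64, n=50, p=3)
adj_4 = adjusted_r2(0.64, n=50, p=4)

print(round(adj_3, 4), round(adj_4, 4))
```

With plain R² unchanged, the model carrying the extra predictor gets the lower adjusted value, so the additional variable is not rewarded.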
Diagnosis of relationship type — choosing the right test
A stepwise diagnostic approach to "which coefficient should be used?":
- Are both variables quantitative & continuous, normally distributed, linear relationship, no outliers? → Pearson's r.
- Is either variable ordinal, or data skewed, or are there outliers, or is the trend monotonic but non-linear? → Spearman's ρ.
- Do you need to predict the value of a continuous outcome? → Linear regression (y = a + bx).
- Is the outcome binary? → Logistic regression, not linear regression or Pearson correlation.
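The checklist above can be expressed as a small decision helper. This is only one possible ordering of the questions, written as a memory aid; the function name, parameter names, and returned labels are all assumptions.

```python
def choose_method(outcome_binary=False, need_prediction=False,
                  both_continuous=True, approx_normal=True,
                  linear=True, outliers=False):
    """Return the method suggested by the stepwise checklist (a revision aid only)."""
    if outcome_binary:
        return "Logistic regression"
    if need_prediction:
        return "Linear regression (y = a + bx)"
    if both_continuous and approx_normal and linear and not outliers:
        return "Pearson's r"
    # Ordinal or skewed data, outliers, or a monotonic non-linear trend.
    return "Spearman's rho"

print(choose_method())                                   # all Pearson conditions met
print(choose_method(both_continuous=False))              # e.g. an ordinal pain score
print(choose_method(outliers=True))                      # extreme values present
print(choose_method(need_prediction=True))               # predict a continuous outcome
print(choose_method(outcome_binary=True))                # disease yes/no
```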
Common complications & pitfalls (test traps)
- Outliers distort Pearson's r severely — one aberrant point can create or destroy an apparent correlation.
- Restricted range of x artificially lowers r (e.g. studying only tall people weakens the height–weight correlation).
- Ecological fallacy — a correlation observed at the group/population level (e.g. country-level data) cannot be assumed to hold for individuals.
- Non-linear relationships give misleadingly low r despite strong association.
- Extrapolation beyond the data range using the regression line is invalid.
- Spurious correlation from confounders — always suspect a lurking variable.
- A high r does not mean a steep slope; r reflects scatter tightness, b reflects steepness.
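The restricted-range pitfall from the list above can be demonstrated with a small sketch. The ten made-up points follow a clear upward trend with some zig-zag noise; r is computed on the full range of x and again on only the upper end (x ≥ 7). The data and helper name are assumptions.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson r via sums of squares (the n - 1 terms cancel out)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up data: an upward trend with alternating noise.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9]
r_full = pearson_r(xs, ys)

# Keep only the upper end of the x range (analogous to studying only tall people).
subset = [(x, y) for x, y in zip(xs, ys) if x >= 7]
r_restricted = pearson_r([x for x, _ in subset], [y for _, y in subset])

print(round(r_full, 3), round(r_restricted, 3))
```

The underlying relationship is identical in both analyses; only the spread of x has shrunk, and with it the correlation coefficient.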
Key differentials — correlation vs regression at a glance
| Feature | Correlation | Regression |
|---|---|---|
| Purpose | Strength & direction of association | Prediction / functional relationship |
| Symmetry | Symmetric (x↔y same) | Asymmetric (y on x ≠ x on y) |
| Statistic | r (Pearson) or ρ (Spearman) | a (intercept), b (slope) |
| Range | −1 to +1 | b can be any real number |
| Units | Unitless | b has units (y per x) |
| Key output for variance | r² | R² / adjusted R² |
| Dependence on which is x/y | No | Yes |
Recently asked / exam angle
- "Interpret this scatter plot / r value" — given a diagram or an r (e.g. r = −0.85), identify it as a strong negative linear correlation.
- Prediction numericals — given y = a + bx and a value of x, compute predicted y (e.g. SBP from age, weight from height).
- r² interpretation — "If r = 0.6, what percentage of variation in y is explained by x?" → 0.36 → 36%.
- Pearson vs Spearman choice — ordinal data (pain scores, tumour grade), skewed data, or outliers → Spearman.
- Correlation ≠ causation — vignette with a confounder (ice-cream and drowning; coffee and lung cancer confounded by smoking).
- Regression type by outcome — binary outcome → logistic, survival → Cox, continuous → linear.
- Meaning of slope b vs intercept a in y = a + bx.
- r = 0 with a clear non-linear (U-shaped) relationship — Pearson misses non-linear associations.
- Method of least squares as the basis of the regression line.
- b = r·(SD_y/SDₓ) linking slope and correlation — occasionally asked in tougher papers.
Rapid revision
- r ranges from −1 to +1; sign = direction, magnitude = strength; it is unitless and symmetric.
- Pearson r → two continuous, normal, linear variables; very sensitive to outliers.
- Spearman ρ → ordinal data, skewed distributions, outliers, or monotonic non-linear trends; rank-based, non-parametric.
- r = 0 means no linear association — a strong non-linear (U-shaped) relationship can still give r ≈ 0.
- r² (coefficient of determination) = proportion of variance in y explained by x; r = 0.5 explains only 25%.
- Regression equation: y = a + bx; a = intercept (y at x = 0), b = slope (Δy per unit Δx, has units).
- Regression line is fitted by the method of least squares (minimises squared residuals); regression is asymmetric (y-on-x ≠ x-on-y).
- Slope b and correlation r always share the same sign; b = r × (SD_y/SDₓ).
- Correlation does NOT imply causation — beware confounding, reverse causation, chance; temporality is the essential causal criterion.
- Linear regression → continuous outcome; logistic → binary outcome (odds ratio); Cox → time-to-event/survival.
- Multiple linear regression (y = a + b₁x₁ + b₂x₂ …) adjusts for confounders; each partial coefficient holds the others constant.
- Watch for the ecological fallacy — group-level correlation cannot be applied to individuals.