Hypothesis Testing & P-value

Community Medicine · Biostatistics · lean revision notes

Hypothesis testing is the formal framework biostatistics uses to decide whether an observed difference (between treatments, groups, or against a reference) is a real effect or merely the play of chance. For NEET PG, this is one of the most reliably tested Biostatistics areas: examiners love to make you distinguish type I from type II error, interpret a borderline p-value, and connect power to sample size. Get the definitions crisp and you bank these marks.

The core logic of significance testing

We never directly "prove" that a treatment works. Instead we set up a null hypothesis (H₀) — a statement of no difference / no association / no effect — and ask: if H₀ were true, how surprising is the data we actually observed? If the data would be very unlikely under H₀, we reject H₀ in favour of the alternative.

Null hypothesis (H₀): there is no difference between groups (e.g. mean BP in drug group = mean BP in placebo group; relative risk = 1; odds ratio = 1; difference in proportions = 0). It is the hypothesis of "status quo / chance alone."
Alternative hypothesis (H₁ or Hₐ): there is a difference. This is what the researcher usually hopes to demonstrate.
A two-tailed test asks whether there is a difference in either direction (drug is better OR worse). A one-tailed (one-sided) test pre-specifies a single direction. Two-tailed is the default and more conservative; one-tailed needs strong prior justification.

The stepwise flow of a hypothesis test:

State H₀ and H₁ → fix the significance level α (usually 0.05) → choose the correct statistical test → compute the test statistic → derive the p-value → compare p with α → reject or fail to reject H₀ → interpret in context.
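The flow above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: it assumes a large-sample two-tailed z-test for a difference in means, and the function name `two_sample_z_test` and the trial numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z_test(mean1, mean2, sd1, sd2, n1, n2, alpha=0.05):
    """Two-tailed z-test for a difference in means (large-sample approximation)."""
    # Step 1: H0 says mean1 == mean2; H1 says they differ (two-tailed).
    # Step 2: alpha is fixed before looking at the data.
    # Steps 3-4: test statistic = observed difference / its standard error.
    se = sqrt(sd1**2 / n1 + sd2**2 / n2)
    z = (mean1 - mean2) / se
    # Step 5: p-value = probability of a |z| at least this large if H0 is true.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Steps 6-7: compare p with alpha and state the decision.
    decision = "reject H0" if p_value < alpha else "fail to reject H0"
    return z, p_value, decision


# Hypothetical trial: mean systolic BP (mmHg) on drug vs placebo, 50 patients per arm.
z, p, decision = two_sample_z_test(132.0, 138.0, 12.0, 12.0, 50, 50)
print(f"z = {z:.2f}, p = {p:.4f} -> {decision}")
```

With these made-up numbers the standard error is 2.4 mmHg, so z = −2.5 and p is about 0.012, which is below α = 0.05. The last step, interpreting the result in context, is still a clinical judgement that the code cannot make.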

High-yield: We never "accept" the null hypothesis. We either reject H₀ or fail to reject H₀. Failing to reject is not the same as proving H₀ true — absence of evidence is not evidence of absence.

The p-value — what it really means

The p-value is the probability of obtaining a result as extreme as, or more extreme than, the one observed, assuming the null hypothesis is true. It quantifies how compatible your data are with H₀.

Small p-value → data are unlikely under H₀ → evidence against H₀.
Large p-value → data are quite compatible with H₀ → insufficient evidence to reject it.
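To make "as extreme as, or more extreme than" concrete, here is a small sketch using an exact binomial calculation. The scenario is hypothetical: 16 of 20 patients prefer drug A over drug B, and H₀ says there is no preference (probability 0.5). The helper name `binomial_two_sided_p` is an illustration, not a library function.

```python
from math import comb


def binomial_two_sided_p(k, n, p0=0.5):
    """Exact two-sided p-value: total probability, under H0, of every
    outcome that is no more likely than the observed count k."""
    def pmf(x):
        return comb(n, x) * p0**x * (1 - p0) ** (n - x)

    observed = pmf(k)
    # Small tolerance so outcomes exactly as likely as the observed one are counted.
    return sum(pmf(x) for x in range(n + 1) if pmf(x) <= observed * (1 + 1e-9))


p_value = binomial_two_sided_p(16, 20)
print(f"p = {p_value:.4f}")
```

Under H₀ the outcomes at least as extreme as 16 are 16 to 20 and, in the other tail, 0 to 4. Their combined probability is about 0.012, so the data would be unusual if there were truly no preference. The whole calculation is done assuming H₀ is true.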

The conventional cut-off (significance level, α) is 0.05. By convention:

p < 0.05 → "statistically significant" → reject H₀.
p ≥ 0.05 → "not statistically significant" → fail to reject H₀.

High-yield: A p-value is NOT the probability that the null hypothesis is true, and NOT the probability that the result occurred by chance. It is conditional on H₀ being true. This subtle misinterpretation is a classic exam trap.

Common misinterpretations examiners test:

Statement	Correct?	Why
"p = 0.04 means there is a 4% chance H₀ is true"	✗ Wrong	p is computed assuming H₀ true; it is not P(H₀ | data)
"p = 0.04 means a 96% chance the drug works"	✗ Wrong	Same fallacy in reverse
"p < 0.05 means the effect is large / clinically important"	✗ Wrong	Significance ≠ magnitude; large samples make tiny effects significant
"p = 0.06 means the treatment definitely does not work"	✗ Wrong	Non-significant ≠ no effect; may be underpowered
"Smaller p = stronger evidence against H₀"	✓ Correct	This is the only fully safe reading

Borderline p-values: p = 0.049 and p = 0.051 are almost identical in evidential strength, yet one is "significant" and the other "not" at α = 0.05. The 0.05 threshold is an arbitrary convention. Modern interpretation favours reporting the exact p-value plus the confidence interval and effect size, rather than a binary verdict.
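One reason to report the effect size alongside p is the point made in the misinterpretation table: large samples make tiny effects significant. The sketch below is illustrative only. It assumes a z approximation with equal SDs and equal group sizes, and the 1 mmHg difference, the SD of 15 mmHg and the helper name are made up.

```python
from math import sqrt
from statistics import NormalDist


def p_for_mean_difference(diff, sd, n_per_group):
    """Two-tailed p-value for a difference in means (z approximation,
    equal SD and equal group sizes assumed)."""
    se = sd * sqrt(2 / n_per_group)
    z = diff / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Same clinically trivial 1 mmHg difference, very different sample sizes.
p_small = p_for_mean_difference(diff=1.0, sd=15.0, n_per_group=100)
p_large = p_for_mean_difference(diff=1.0, sd=15.0, n_per_group=10_000)
print(f"n = 100 per group:    p = {p_small:.3f}")
print(f"n = 10000 per group:  p = {p_large:.2g}")
```

The effect is identical in both runs; only n changes. With 100 per group the result is nowhere near significant, while with 10,000 per group the p-value is far below 0.05, even though a 1 mmHg difference is unlikely to matter clinically.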

Type I and Type II errors

Because we decide based on a probabilistic sample, we can be wrong in two ways. This 2×2 table is the single most examined concept in the topic — memorise it cold.

	H₀ actually TRUE (no real effect)	H₀ actually FALSE (real effect exists)
We REJECT H₀ (declare significant)	Type I error (α) — false positive	Correct decision — Power (1 − β)
We FAIL TO REJECT H₀ (declare not significant)	Correct decision (1 − α)	Type II error (β) — false negative

Type I error (α): rejecting a true null hypothesis. You conclude there is an effect when there is none → a false positive. Probability = α (the significance level itself). Setting α = 0.05 means that, when H₀ is actually true, you accept a 5% chance of wrongly declaring an effect.
Type II error (β): failing to reject a false null hypothesis. You miss a real effect → a false negative. Probability = β. Conventionally set at 0.10–0.20.
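Both error rates can be seen by simulating many studies. The sketch below is a rough illustration under stated assumptions: a one-sample two-tailed z-test of H₀: mean = 0 with known SD = 1, n = 30 per study, and a true mean of 0.5 for the "real effect" scenario. All of these values, and the helper name `rejection_rate`, are hypothetical.

```python
import random
from math import sqrt


def rejection_rate(true_mean, n=30, trials=2000, z_crit=1.96, seed=0):
    """Fraction of simulated studies that reject H0: mean = 0 (known SD = 1)."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        sample_mean = sum(rng.gauss(true_mean, 1.0) for _ in range(n)) / n
        z = sample_mean / (1.0 / sqrt(n))
        if abs(z) > z_crit:
            rejections += 1
    return rejections / trials


# H0 true: every rejection is a false positive (Type I error).
type_1_rate = rejection_rate(true_mean=0.0)
# H0 false: every failure to reject is a false negative (Type II error).
type_2_rate = 1 - rejection_rate(true_mean=0.5)
print(f"Type I rate  ~ {type_1_rate:.3f}")
print(f"Type II rate ~ {type_2_rate:.3f}")
```

The simulated Type I rate should land close to the chosen α of 0.05. The Type II rate is not fixed by α alone; it depends on the true effect, the variability and n, and with these settings it comes out at roughly 0.2.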

High-yield: Type I = false positive = α = "convicting an innocent person." Type II = false negative = β = "letting a guilty person go free." This courtroom analogy is the fastest way to never confuse them.

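A clinical analogy NEET loves: treat the test like a diagnostic decision. α is the false-positive rate (1 − specificity) and β is the false-negative rate (1 − sensitivity), so power (1 − β) plays the role of sensitivity.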

Convicting an innocent man (Type I) is usually considered the more serious error, which is why α is kept small (0.05 or 0.01).
A drug declared effective when it is not (Type I) exposes patients to useless/harmful therapy; a truly effective drug missed (Type II) deprives patients of benefit.

Mnemonic: "αlpha = Accusing the innocent (false Alarm); βeta = Blind to the truth (missed it)." Or simply: read the Roman numeral in "Type I" as the pronoun "I": "I told you there was an effect when there wasn't."

Statistical power (1 − β)

Power = 1 − β = the probability that a test correctly rejects a false null hypothesis, i.e. the probability of detecting a real effect when one truly exists. Conventionally we aim for power ≥ 80% (0.80), i.e. β ≤ 0.20.

Determinants of power — what increases power:

Larger sample size (n) — the single most important and most modifiable factor. ↑ n → ↑ power.
Larger effect size — bigger true difference is easier to detect.
Lower variability (smaller SD) — less noise, clearer signal.
Higher α — a more lenient threshold (e.g. 0.05 vs 0.01) raises power but also raises Type I error risk.
One-tailed test has more power than two-tailed (for the specified direction).
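Each determinant in the list can be checked numerically. This is a sketch under assumptions, not a full power calculator: it uses a z approximation for comparing two means with equal SDs and equal group sizes, and the baseline numbers (difference 5, SD 10, 40 per group) and the name `power_two_means` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def power_two_means(diff, sd, n_per_group, alpha=0.05, two_tailed=True):
    """Approximate power of a z-test comparing two means."""
    norm = NormalDist()
    se = sd * sqrt(2 / n_per_group)
    tail = alpha / 2 if two_tailed else alpha
    z_crit = norm.inv_cdf(1 - tail)
    # Probability that the statistic falls beyond the critical value in the
    # direction of the true effect (the tiny opposite tail is ignored).
    return norm.cdf(abs(diff) / se - z_crit)


base = power_two_means(diff=5, sd=10, n_per_group=40)
scenarios = {
    "baseline": base,
    "larger n (80 per group)": power_two_means(5, 10, 80),
    "larger effect (7)": power_two_means(7, 10, 40),
    "smaller SD (7)": power_two_means(5, 7, 40),
    "higher alpha (0.10)": power_two_means(5, 10, 40, alpha=0.10),
    "one-tailed": power_two_means(5, 10, 40, two_tailed=False),
    "lower alpha (0.01)": power_two_means(5, 10, 40, alpha=0.01),
}
for label, value in scenarios.items():
    print(f"{label:26s} power = {value:.2f}")
```

Every change in the list raises power above the baseline of roughly 0.6. The final row shows the reverse: tightening α to 0.01 at the same n lowers power, which means β goes up.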

High-yield: To increase power without inflating Type I error, increase the sample size. Sample-size calculation is essentially solving for the n needed to achieve a desired power (usually 80–90%) at a fixed α for an anticipated effect size and variance.
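Solving for n can be sketched with the common normal-approximation formula for comparing two means, n per group = 2 × (Z for α + Z for power)² × SD² / difference². The formula is a standard textbook approximation rather than something stated in these notes, and the inputs below (difference 5, SD 10) are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(diff, sd, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-tailed comparison of two
    means (normal approximation, equal SD and equal group sizes assumed)."""
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_beta = norm.inv_cdf(power)           # 0.84 for power = 0.80
    return ceil(2 * (z_alpha + z_beta) ** 2 * sd**2 / diff**2)


n_80 = n_per_group(diff=5, sd=10)
n_90 = n_per_group(diff=5, sd=10, power=0.90)
n_half_effect = n_per_group(diff=2.5, sd=10)
print(n_80, n_90, n_half_effect)
```

With these inputs, 80% power needs 63 per group and 90% power needs 85. Halving the anticipated effect size roughly quadruples the required n, which is why the assumed effect size dominates sample-size planning.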

The α–β trade-off (fixed n): lowering α (say from 0.05 to 0.01) makes it harder to reject H₀, which reduces Type I error but increases β (Type II error) and therefore reduces power. You cannot minimise both errors simultaneously without increasing n.

Quantity	Symbol	Typical value	Plain meaning
Significance level	α	0.05	Max acceptable false-positive rate
Type I error rate	α	0.05	P(reject true H₀)
Type II error rate	β	0.20	P(fail to reject false H₀)
Power	1 − β	0.80	P(detect a true effect)
Confidence level	1 − α	0.95	Coverage of the CI

Confidence intervals (CI)

A confidence interval gives a range of plausible values for the true population parameter, with a stated confidence level (usually 95%). It conveys both the estimate and its precision, which a bare p-value cannot.

Interpretation of a 95% CI: if we repeated the study many times, 95% of the constructed intervals would contain the true population value. A wide CI = imprecise estimate (often small n); a narrow CI = precise.

General structure: CI = point estimate ± (critical value × standard error).

For a mean: x̄ ± Z(1−α/2) × (SD/√n). For 95% CI, Z = 1.96. (Use t instead of Z for small samples.)
For a proportion: p̂ ± 1.96 × √[p̂(1−p̂)/n], where p̂ is the sample proportion (not the p-value).
For a difference in means / proportions: estimate of the difference ± 1.96 × SE of the difference.

Linking CI to significance — the most exam-relevant rule:

For a difference (in means or proportions): if the 95% CI includes 0 (the null value), the result is NOT significant at α = 0.05. If it excludes 0, it is significant.
For a ratio (relative risk, odds ratio, hazard ratio): the null value is 1. If the 95% CI includes 1, the result is NOT significant. If the CI lies entirely above or entirely below 1, it is significant.

High-yield: RR or OR with 95% CI of (0.8 – 1.4) crosses 1 → not significant. RR 2.1 (95% CI 1.3 – 3.5) excludes 1 → significant and indicates increased risk. This single rule answers a large fraction of CI MCQs.
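The "does the CI cross the null value" rule can be sketched as follows. The interval for the relative risk uses the usual log-scale approximation, SE(ln RR) = √(1/a − 1/n₁ + 1/c − 1/n₂), which is a standard method not stated in these notes. The cohort counts (30/100 exposed vs 15/100 unexposed) and the function names are hypothetical.

```python
from math import exp, log, sqrt


def relative_risk_ci(events_exposed, n_exposed, events_unexposed, n_unexposed, z=1.96):
    """Relative risk with an approximate 95% CI computed on the log scale."""
    rr = (events_exposed / n_exposed) / (events_unexposed / n_unexposed)
    se_log_rr = sqrt(
        1 / events_exposed - 1 / n_exposed + 1 / events_unexposed - 1 / n_unexposed
    )
    lower = exp(log(rr) - z * se_log_rr)
    upper = exp(log(rr) + z * se_log_rr)
    return rr, lower, upper


def crosses_null(lower, upper, null_value=1.0):
    """True if the interval contains the null value (1 for ratios, 0 for differences)."""
    return lower <= null_value <= upper


rr, lower, upper = relative_risk_ci(30, 100, 15, 100)
print(f"RR = {rr:.2f} (95% CI {lower:.2f} to {upper:.2f})")
print("significant" if not crosses_null(lower, upper) else "not significant")

# The two intervals quoted in the notes:
print(crosses_null(0.8, 1.4))  # interval contains 1
print(crosses_null(1.3, 3.5))  # interval excludes 1
```

In the hypothetical cohort the RR is 2.0 with an interval of roughly 1.15 to 3.48, which lies entirely above 1. For a difference in means or proportions the same check applies with `null_value=0.0`.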

A 99% CI is wider than a 95% CI (more confidence demands a broader net), and corresponds to α = 0.01.

Choosing the correct statistical test (quick map)

Examiners may ask which test to apply. A rapid decision approach:

Two means, two groups, quantitative data, normally distributed → Student's t-test (unpaired/independent). Paired data (before–after in same subjects) → paired t-test.
More than two means → ANOVA (analysis of variance); post-hoc tests if significant.
Proportions / categorical data → Chi-square (χ²) test. Small expected cell counts (< 5) → Fisher's exact test.
Correlation between two quantitative variables → Pearson's r (parametric) or Spearman's rank (non-parametric).
Non-normal / ordinal data → non-parametric tests: Mann–Whitney U (≈ unpaired t), Wilcoxon signed-rank (≈ paired t), Kruskal–Wallis (≈ one-way ANOVA).
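The decision map can be written down as a small lookup function, which some readers find easier to rehearse than a list. This is only a memory aid under simplifying assumptions: the function name and its parameters are invented for illustration, correlation (Pearson/Spearman) is left out, and ordinal data should be entered with `normal=False`.

```python
def suggest_test(outcome, groups=2, paired=False, normal=True, small_expected_counts=False):
    """Return the test suggested by the quick map.

    outcome: "continuous" (means are compared) or "categorical" (proportions).
    """
    if outcome == "categorical":
        return "Fisher's exact test" if small_expected_counts else "Chi-square test"
    if outcome != "continuous":
        raise ValueError("outcome must be 'continuous' or 'categorical'")
    if groups < 2:
        raise ValueError("need at least two groups to compare")
    if groups == 2:
        if normal:
            return "paired t-test" if paired else "unpaired t-test"
        return "Wilcoxon signed-rank test" if paired else "Mann-Whitney U test"
    # More than two groups.
    return "ANOVA" if normal else "Kruskal-Wallis test"


print(suggest_test("continuous", groups=2))                # two independent means
print(suggest_test("continuous", groups=2, paired=True))   # before-after in same subjects
print(suggest_test("continuous", groups=3))                # more than two means
print(suggest_test("categorical", small_expected_counts=True))
print(suggest_test("continuous", groups=2, normal=False))  # skewed or ordinal data
```

Real test selection also depends on assumptions such as equal variances and independence, which this sketch does not check.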

High-yield: Chi-square is for categorical/qualitative data; t-test/ANOVA for continuous data comparisons of means. Mixing these up is a frequent error.

Multiple comparisons & related pitfalls

Multiple testing inflates Type I error. If you run 20 independent tests at α = 0.05 and every null hypothesis is true, you expect about 1 false positive by chance alone (20 × 0.05 = 1), and the chance of at least one is 1 − 0.95²⁰ ≈ 64%. The classic, conservative correction is Bonferroni: divide α by the number of tests (e.g. 0.05/20 = 0.0025).
Statistical vs clinical significance: a statistically significant result (p < 0.05) may be clinically trivial if the effect size is tiny — common with very large samples. Conversely, a clinically meaningful effect may miss significance in an underpowered study. Always interpret p alongside the effect size and CI.
Publication bias: significant results are published more, distorting the literature.
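The multiple-testing arithmetic is easy to verify. The sketch below assumes the tests are independent and that every null hypothesis is true; the function names are illustrative.

```python
def familywise_error_rate(alpha, n_tests):
    """Probability of at least one false positive across independent tests
    when every null hypothesis is true."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_alpha(alpha, n_tests):
    """Per-test threshold after Bonferroni correction."""
    return alpha / n_tests


alpha, n_tests = 0.05, 20
expected_false_positives = alpha * n_tests
uncorrected = familywise_error_rate(alpha, n_tests)
corrected_alpha = bonferroni_alpha(alpha, n_tests)
corrected = familywise_error_rate(corrected_alpha, n_tests)

print(f"Expected false positives:        {expected_false_positives:.1f}")
print(f"P(at least one), uncorrected:    {uncorrected:.3f}")
print(f"Bonferroni per-test alpha:       {corrected_alpha:.4f}")
print(f"P(at least one), with Bonferroni: {corrected:.3f}")
```

Without correction, the chance of at least one false positive across 20 tests is about 64%. Testing each at 0.0025 brings the overall rate back to just under 5%, at the cost of lower power for each individual test.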

Complications & consequences of getting it wrong

A Type I error in a trial → an ineffective/harmful drug enters practice; resources wasted; patients exposed to risk.
A Type II error → a genuinely useful intervention is discarded; often due to inadequate sample size (underpowered study).
Over-reliance on p < 0.05 as a binary → ignores magnitude and precision; the ASA (American Statistical Association) 2016 statement explicitly warns against this.

Key differentials / commonly confused pairs

Pair	Distinction
α vs p-value	α is the pre-set threshold; p is the computed result for that dataset
Type I vs Type II error	False positive (reject true H₀) vs false negative (miss false H₀)
Power vs significance	Power = detecting real effect (1−β); significance = α threshold
Confidence level vs CI width	Higher confidence (99%) → wider interval
Statistical vs clinical significance	p-value vs real-world importance of effect size
One-tailed vs two-tailed	Directional vs non-directional; one-tailed has more power but needs justification

Recently asked / exam angle

NEET PG and INI-CET have repeatedly probed this topic, usually in these flavours:

"Rejecting a true null hypothesis is which error?" → Type I (α). The mirror question asks the same for failing to reject a false null → Type II (β).
"What does power of a study mean / how to increase it?" → 1 − β; increase by raising sample size (preferred), effect size, or α.
"A 95% CI for relative risk is 0.9 – 1.6. Interpret." → crosses 1 → not statistically significant.
"p = 0.06 in a small trial of an effective drug — what error?" → likely Type II (false negative) due to inadequate power.
"Which is NOT a correct interpretation of the p-value?" → tests the "p ≠ probability H₀ is true" fallacy.
"Lowering α from 0.05 to 0.01 does what to β?" → increases β (and reduces power) for fixed n.
"Best way to reduce both Type I and Type II error?" → increase sample size.
Test-selection MCQs: comparing two means → t-test; categorical data → chi-square; >2 group means → ANOVA.

High-yield: When a question describes a drug that truly works but the study found "no significant difference," the answer is almost always Type II error / the study was underpowered.

Rapid revision

H₀ = no difference; H₁ = difference exists. We reject or fail to reject H₀ — never "accept" it.
p-value = P(data this extreme or more extreme | H₀ true). It is NOT the probability H₀ is true.
p < 0.05 → reject H₀ (significant); p ≥ 0.05 → fail to reject.
Type I error = α = false positive = rejecting a TRUE null (convicting the innocent).
Type II error = β = false negative = missing a REAL effect (acquitting the guilty).
Power = 1 − β; aim ≥ 0.80; it is the chance of detecting a true effect.
Best way to raise power without inflating α = increase sample size.
Lowering α raises β (lowers power) at fixed n — the α–β trade-off.
95% CI for a difference that includes 0 → not significant; CI for RR/OR that includes 1 → not significant.
Higher confidence level → wider CI; 99% CI is wider than 95% CI.
t-test/ANOVA for means (continuous); chi-square for proportions (categorical); Fisher's exact for small cells.
Statistical significance ≠ clinical importance; always read effect size and CI alongside the p-value.
