Hypothesis Testing & P-value

Community Medicine · Biostatistics · lean revision notes

Hypothesis testing is the formal framework biostatistics uses to decide whether an observed difference (between treatments, groups, or against a reference) is a real effect or merely the play of chance. For NEET PG, this is one of the most reliably tested Biostatistics areas: examiners love to make you distinguish type I from type II error, interpret a borderline p-value, and connect power to sample size. Get the definitions crisp and you bank these marks.

The core logic of significance testing

We never directly "prove" that a treatment works. Instead we set up a null hypothesis (H₀) — a statement of no difference / no association / no effect — and ask: if H₀ were true, how surprising is the data we actually observed? If the data would be very unlikely under H₀, we reject H₀ in favour of the alternative.

  • Null hypothesis (H₀): there is no difference between groups (e.g. mean BP in drug group = mean BP in placebo group; relative risk = 1; odds ratio = 1; difference in proportions = 0). It is the hypothesis of "status quo / chance alone."
  • Alternative hypothesis (H₁ or Hₐ): there is a difference. This is what the researcher usually hopes to demonstrate.
  • A two-tailed test asks whether there is a difference in either direction (drug is better OR worse). A one-tailed (one-sided) test pre-specifies a single direction. Two-tailed is the default and more conservative; one-tailed needs strong prior justification.

The stepwise flow of a hypothesis test:

State H₀ and H₁ → fix the significance level α (usually 0.05) → choose the correct statistical test → compute the test statistic → derive the p-value → compare p with α → reject or fail to reject H₀ → interpret in context.
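The flow above can be walked through on a small worked example. This is only a sketch: it assumes a large-sample two-proportion z-test is the appropriate test, the function name `two_proportion_z_test` is illustrative, and the trial numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(x1, n1, x2, n2, alpha=0.05):
    """Two-tailed z-test of H0: p1 == p2 (large-sample approximation)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)                  # common proportion assumed under H0
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se                              # test statistic
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))    # two-tailed p-value
    decision = "reject H0" if p_value < alpha else "fail to reject H0"
    return z, p_value, decision

# Hypothetical trial: 60/100 cured on the drug vs 45/100 on placebo.
z, p, decision = two_proportion_z_test(60, 100, 45, 100)
print(f"z = {z:.2f}, p = {p:.3f} -> {decision}")
```

With these made-up counts, p falls below 0.05, so H₀ is rejected. The final step, interpretation in context, still has to be done by the reader.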

High-yield: We never "accept" the null hypothesis. We either reject H₀ or fail to reject H₀. Failing to reject is not the same as proving H₀ true — absence of evidence is not evidence of absence.

The p-value — what it really means

The p-value is the probability of obtaining a result as extreme as, or more extreme than, the one observed, assuming the null hypothesis is true. It quantifies how compatible your data are with H₀.

  • Small p-value → data are unlikely under H₀ → evidence against H₀.
  • Large p-value → data are quite compatible with H₀ → insufficient evidence to reject it.

The conventional cut-off (significance level, α) is 0.05. By convention:

  • p < 0.05 → "statistically significant" → reject H₀.
  • p ≥ 0.05 → "not statistically significant" → fail to reject H₀.
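To make the definition concrete, here is a minimal sketch that converts a z statistic into a p-value. It assumes the test statistic follows a standard normal distribution under H₀; `p_from_z` is an illustrative name, not a library function.

```python
from statistics import NormalDist

def p_from_z(z, tails=2):
    """P(result at least this extreme | H0 true) for a standard normal statistic.

    tails=1 assumes the observed direction matches the pre-specified one.
    """
    upper_tail = 1 - NormalDist().cdf(abs(z))
    return tails * upper_tail

for z in (1.64, 1.96, 2.58):
    print(z, round(p_from_z(z), 3), round(p_from_z(z, tails=1), 3))
```

A z of 1.96 gives a two-tailed p of about 0.05, and a z of 2.58 gives about 0.01, which is where the familiar critical values come from.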

High-yield: A p-value is NOT the probability that the null hypothesis is true, and NOT the probability that the result occurred by chance. It is conditional on H₀ being true. This subtle misinterpretation is a classic exam trap.

Common misinterpretations examiners test:

Statement | Correct? | Why
"p = 0.04 means there is a 4% chance H₀ is true" | ✗ Wrong | p is computed assuming H₀ true; it is not P(H₀|data)
"p = 0.04 means a 96% chance the drug works" | ✗ Wrong | Same fallacy in reverse
"p < 0.05 means the effect is large / clinically important" | ✗ Wrong | Significance ≠ magnitude; large samples make tiny effects significant
"p = 0.06 means the treatment definitely does not work" | ✗ Wrong | Non-significant ≠ no effect; may be underpowered
"Smaller p = stronger evidence against H₀" | ✓ Correct | This is the only fully safe reading

Borderline p-values: Results of p = 0.049 and p = 0.051 are almost identical in evidential strength, yet one is "significant" and the other "not" at α = 0.05. The 0.05 threshold is an arbitrary convention. Modern interpretation favours reporting the exact p-value plus the confidence interval and effect size, rather than a binary verdict.

Type I and Type II errors

Because we decide based on a probabilistic sample, we can be wrong in two ways. This 2×2 table is the single most examined concept in the topic — memorise it cold.

Decision | H₀ actually TRUE (no real effect) | H₀ actually FALSE (real effect exists)
We REJECT H₀ (declare significant) | Type I error (α): false positive | Correct decision: power (1 − β)
We FAIL TO REJECT H₀ (declare not significant) | Correct decision (1 − α) | Type II error (β): false negative

  • Type I error (α): rejecting a true null hypothesis. You conclude there is an effect when there is none → a false positive. Probability = α (the significance level itself). Setting α = 0.05 means that, when H₀ is actually true, you accept a 5% chance of a false-positive conclusion.
  • Type II error (β): failing to reject a false null hypothesis. You miss a real effect → a false negative. Probability = β. Conventionally set at 0.10–0.20.

High-yield: Type I = false positive = α = "convicting an innocent person." Type II = false negative = β = "letting a guilty person go free." This courtroom analogy is the fastest way to never confuse them.
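The two error rates can be seen by simulating many trials. This is a rough sketch under stated assumptions: normally distributed outcomes, SD treated as known (so a z-test is used), and hypothetical values for n, SD and the true difference. `rejection_rate` is an illustrative helper.

```python
import random
from math import sqrt
from statistics import NormalDist, mean

def rejection_rate(true_diff, n=50, sd=1.0, alpha=0.05, trials=2000, seed=1):
    """Fraction of simulated two-group trials in which H0 (no difference) is rejected."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # two-tailed critical value
    se = sd * sqrt(2 / n)                          # SE of the difference in means
    rejected = 0
    for _ in range(trials):
        control = [rng.gauss(0.0, sd) for _ in range(n)]
        treated = [rng.gauss(true_diff, sd) for _ in range(n)]
        z = (mean(treated) - mean(control)) / se
        if abs(z) > z_crit:
            rejected += 1
    return rejected / trials

type_1_rate = rejection_rate(true_diff=0.0)   # H0 true: every rejection is a false positive
power = rejection_rate(true_diff=0.5)         # H0 false: every rejection is a correct call
type_2_rate = 1 - power                       # H0 false: every miss is a false negative
print(type_1_rate, type_2_rate, power)
```

When H₀ is true the rejection rate should hover near α. When a real difference exists, the misses are the Type II errors.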

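A clinical analogy NEET loves: treat the hypothesis test like a diagnostic decision. α is the false-positive rate (1 − specificity) and β is the false-negative rate (1 − sensitivity), so power (1 − β) plays the role of sensitivity.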

  • Convicting an innocent man (Type I) is usually considered the more serious error, which is why α is kept small (0.05 or 0.01).
  • A drug declared effective when it is not (Type I) exposes patients to useless/harmful therapy; a truly effective drug missed (Type II) deprives patients of benefit.

Mnemonic: "αlpha = Accusing the innocent (false Alarm); βeta = Blind to the truth (missed it)." Or simply: the Roman numeral in Type I looks like the letter "I", as in "I told you there was an effect when there wasn't."

Statistical power (1 − β)

Power = 1 − β = the probability that a test correctly rejects a false null hypothesis, i.e. the probability of detecting a real effect when one truly exists. Conventionally we aim for power ≥ 80% (0.80), i.e. β ≤ 0.20.

Determinants of power — what increases power:

  1. Larger sample size (n) — the single most important and most modifiable factor. ↑ n → ↑ power.
  2. Larger effect size — bigger true difference is easier to detect.
  3. Lower variability (smaller SD) — less noise, clearer signal.
  4. Higher α — a more lenient threshold (e.g. 0.05 vs 0.01) raises power but also raises Type I error risk.
  5. One-tailed test has more power than two-tailed (for the specified direction).

High-yield: To increase power without inflating Type I error, increase the sample size. Sample-size calculation is essentially solving for the n needed to achieve a desired power (usually 80–90%) at a fixed α for an anticipated effect size and variance.
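As a sketch of what "solving for n" looks like, the snippet below uses the standard normal-approximation formula for comparing two means with equal group sizes, n per group = 2 × [SD × (Z(1−α/2) + Z(1−β)) / difference]². The function name and the blood-pressure numbers are hypothetical.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Approximate n per group to detect a mean difference `delta` (two-tailed z-test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)            # 0.84 for 80% power
    return ceil(2 * (sd * (z_alpha + z_beta) / delta) ** 2)

# Hypothetical: detect a 5 mmHg difference when SD = 10 mmHg.
print(n_per_group(5, 10))              # 80% power
print(n_per_group(5, 10, power=0.90))  # 90% power needs more subjects
print(n_per_group(2.5, 10))            # halving the effect size roughly quadruples n
```

Under these assumptions the first call gives 63 per group and the second 85. Real protocols usually inflate the figure further for dropouts.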

The α–β trade-off (fixed n): lowering α (say from 0.05 to 0.01) makes it harder to reject H₀, which reduces Type I error but increases β (Type II error) and therefore reduces power. You cannot minimise both errors simultaneously without increasing n.
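The trade-off can be checked numerically. This is an approximate sketch for a two-tailed z-test comparing two means; it ignores the negligible chance of rejecting in the wrong direction. `power_two_means` and the input values are illustrative.

```python
from math import sqrt
from statistics import NormalDist

def power_two_means(delta, sd, n, alpha=0.05):
    """Approximate power of a two-tailed z-test for two means, n subjects per group."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    shift = delta / (sd * sqrt(2 / n))        # true difference in standard-error units
    return 1 - NormalDist().cdf(z_crit - shift)

# Same hypothetical study (difference 5, SD 10, n = 63 per group), two alpha levels.
power_05 = power_two_means(5, 10, 63, alpha=0.05)
power_01 = power_two_means(5, 10, 63, alpha=0.01)
print(round(power_05, 2), round(power_01, 2))
```

With n fixed, tightening α from 0.05 to 0.01 drops power from roughly 0.80 to roughly 0.59 in this example, so β rises from about 0.20 to about 0.41.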

Quantity | Symbol | Typical value | Plain meaning
Significance level | α | 0.05 | Max acceptable false-positive rate
Type I error rate | α | 0.05 | P(reject true H₀)
Type II error rate | β | 0.20 | P(fail to reject false H₀)
Power | 1 − β | 0.80 | P(detect a true effect)
Confidence level | 1 − α | 0.95 | Coverage of the CI

Confidence intervals (CI)

A confidence interval gives a range of plausible values for the true population parameter, with a stated confidence level (usually 95%). It conveys both the estimate and its precision, which a bare p-value cannot.

Interpretation of a 95% CI: if we repeated the study many times, 95% of the constructed intervals would contain the true population value. A wide CI = imprecise estimate (often small n); a narrow CI = precise.

General structure: CI = point estimate ± (critical value × standard error).

  • For a mean: x̄ ± Z(1−α/2) × (SD/√n). For 95% CI, Z = 1.96. (Use t instead of Z for small samples.)
  • For a proportion: p ± 1.96 × √[p(1−p)/n].
  • For a difference in means / proportions: estimate of the difference ± 1.96 × SE of the difference.
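The formulas above translate directly into code. A minimal sketch, assuming large samples so that Z (not t) is appropriate; the function names and the example figures are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def ci_mean(xbar, sd, n, level=0.95):
    """CI for a mean: xbar ± Z × SD/√n (large-sample, Z-based)."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)   # 1.96 when level = 0.95
    margin = z * sd / sqrt(n)
    return xbar - margin, xbar + margin

def ci_proportion(p, n, level=0.95):
    """CI for a proportion: p ± Z × √[p(1−p)/n]."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    margin = z * sqrt(p * (1 - p) / n)
    return p - margin, p + margin

# Hypothetical: mean systolic BP 120, SD 15, n = 100; prevalence 40% in n = 200.
print(ci_mean(120, 15, 100))              # roughly 117.1 to 122.9
print(ci_mean(120, 15, 100, level=0.99))  # same data, wider interval
print(ci_proportion(0.40, 200))           # roughly 0.33 to 0.47
```

Quadrupling n halves the margin of error, because the standard error shrinks with √n.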

Linking CI to significance — the most exam-relevant rule:

  • For a difference (in means or proportions): if the 95% CI includes 0 (the null value), the result is NOT significant at α = 0.05. If it excludes 0, it is significant.
  • For a ratio (relative risk, odds ratio, hazard ratio): the null value is 1. If the 95% CI includes 1, the result is NOT significant. If the CI lies entirely above or entirely below 1, it is significant.

High-yield: RR or OR with 95% CI of (0.8 – 1.4) crosses 1 → not significant. RR 2.1 (95% CI 1.3 – 3.5) excludes 1 → significant and indicates increased risk. This single rule answers a large fraction of CI MCQs.
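The "does the CI cross the null value" rule is easy to mechanise. The sketch below also shows one common way to get a CI for a relative risk (the log method, which is an assumption here since these notes do not specify a method). Function names and cohort counts are hypothetical.

```python
from math import exp, log, sqrt

def relative_risk_ci(a, n_exposed, c, n_unexposed, z=1.96):
    """RR with a log-method CI; a and c are event counts in exposed / unexposed."""
    rr = (a / n_exposed) / (c / n_unexposed)
    se_log_rr = sqrt(1 / a - 1 / n_exposed + 1 / c - 1 / n_unexposed)
    lower = exp(log(rr) - z * se_log_rr)
    upper = exp(log(rr) + z * se_log_rr)
    return rr, lower, upper

def is_significant(lower, upper, null_value):
    """Significant at the matching alpha only if the CI excludes the null value."""
    return not (lower <= null_value <= upper)

# The two intervals quoted above (null value for a ratio is 1).
print(is_significant(0.8, 1.4, null_value=1))   # crosses 1
print(is_significant(1.3, 3.5, null_value=1))   # excludes 1

# Hypothetical cohort: 30/100 events in exposed vs 15/100 in unexposed.
rr, lower, upper = relative_risk_ci(30, 100, 15, 100)
print(round(rr, 2), round(lower, 2), round(upper, 2), is_significant(lower, upper, 1))
```

For a difference in means or proportions, the same helper is called with `null_value=0`.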

A 99% CI is wider than a 95% CI (more confidence demands a broader net), and corresponds to α = 0.01.

Choosing the correct statistical test (quick map)

Examiners may ask which test to apply. A rapid decision approach:

  1. Two means, two groups, quantitative data, normally distributed → Student's t-test (unpaired/independent). Paired data (before–after in same subjects) → paired t-test.
  2. More than two means → ANOVA (analysis of variance); post-hoc tests if significant.
  3. Proportions / categorical data → Chi-square (χ²) test. Small expected cell counts (< 5) → Fisher's exact test.
  4. Correlation between two quantitative variables → Pearson's r (parametric) or Spearman's rank (non-parametric).
  5. Non-normal / ordinal data → non-parametric tests: Mann–Whitney U (≈ unpaired t), Wilcoxon signed-rank (≈ paired t), Kruskal–Wallis (≈ one-way ANOVA).

High-yield: Chi-square is for categorical/qualitative data; t-test/ANOVA for continuous data comparisons of means. Mixing these up is a frequent error.
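For categorical data in a 2×2 table, the chi-square statistic can be computed by hand. A minimal sketch without continuity correction; it relies on the fact that a chi-square with 1 degree of freedom is a squared standard normal, so the p-value is erfc(√(χ²/2)). The function name and the counts are hypothetical.

```python
from math import erfc, sqrt

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    rows, cols = (a + b, c + d), (a + c, b + d)
    observed = ((a, b), (c, d))
    chi2 = 0.0
    min_expected = float("inf")
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n          # expected count under H0
            min_expected = min(min_expected, expected)
            chi2 += (observed[i][j] - expected) ** 2 / expected
    p_value = erfc(sqrt(chi2 / 2))                    # upper tail, 1 degree of freedom
    return chi2, p_value, min_expected

# Hypothetical: drug 60 cured / 40 not cured; placebo 45 cured / 55 not cured.
chi2, p, min_expected = chi_square_2x2(60, 40, 45, 55)
print(round(chi2, 2), round(p, 3), min_expected)
```

If `min_expected` came out below 5, the rule in the list above says to switch to Fisher's exact test rather than trust this approximation.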

Multiple comparisons & related pitfalls

  • Multiple testing inflates Type I error. If you run 20 independent tests at α = 0.05 and every H₀ is true, you expect about 1 false positive by chance alone (20 × 0.05), and the chance of at least one is 1 − 0.95²⁰ ≈ 64%. The classic, conservative fix is the Bonferroni correction: divide α by the number of tests (e.g. 0.05/20 = 0.0025).
  • Statistical vs clinical significance: a statistically significant result (p < 0.05) may be clinically trivial if the effect size is tiny — common with very large samples. Conversely, a clinically meaningful effect may miss significance in an underpowered study. Always interpret p alongside the effect size and CI.
  • Publication bias: significant results are published more, distorting the literature.
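The multiple-testing arithmetic in the first bullet above can be verified in a few lines. This sketch assumes the tests are independent and that every null hypothesis is true; the function names are illustrative.

```python
def familywise_error_rate(m, alpha=0.05):
    """P(at least one false positive) across m independent tests with all H0 true."""
    return 1 - (1 - alpha) ** m

def bonferroni_alpha(m, alpha=0.05):
    """Per-test threshold after Bonferroni correction."""
    return alpha / m

m = 20
expected_false_positives = m * 0.05
print(expected_false_positives)                                       # about 1
print(round(familywise_error_rate(m), 2))                             # uncorrected
print(bonferroni_alpha(m))                                            # per-test threshold
print(round(familywise_error_rate(m, alpha=bonferroni_alpha(m)), 3))  # corrected
```

After correction the family-wise error rate falls back to just under 0.05, at the cost of lower power for each individual test.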

Complications & consequences of getting it wrong

  • A Type I error in a trial → an ineffective/harmful drug enters practice; resources wasted; patients exposed to risk.
  • A Type II error → a genuinely useful intervention is discarded; often due to inadequate sample size (underpowered study).
  • Over-reliance on p < 0.05 as a binary → ignores magnitude and precision; the ASA (American Statistical Association) 2016 statement explicitly warns against this.

Key differentials / commonly confused pairs

Pair | Distinction
α vs p-value | α is the pre-set threshold; p is the computed result for that dataset
Type I vs Type II error | False positive (reject true H₀) vs false negative (miss false H₀)
Power vs significance | Power = detecting real effect (1−β); significance = α threshold
Confidence level vs CI width | Higher confidence (99%) → wider interval
Statistical vs clinical significance | p-value vs real-world importance of effect size
One-tailed vs two-tailed | Directional vs non-directional; one-tailed has more power but needs justification

Recently asked / exam angle

NEET PG and INI-CET have repeatedly probed this topic, usually in these flavours:

  • "Rejecting a true null hypothesis is which error?" → Type I (α). The mirror question asks the same for failing to reject a false null → Type II (β).
  • "What does power of a study mean / how to increase it?" → 1 − β; increase by raising sample size (preferred), effect size, or α.
  • "A 95% CI for relative risk is 0.9 – 1.6. Interpret." → crosses 1 → not statistically significant.
  • "p = 0.06 in a small trial of an effective drug — what error?" → likely Type II (false negative) due to inadequate power.
  • "Which is NOT a correct interpretation of the p-value?" → tests the "p ≠ probability H₀ is true" fallacy.
  • "Lowering α from 0.05 to 0.01 does what to β?" → increases β (and reduces power) for fixed n.
  • "Best way to reduce both Type I and Type II error?" → increase sample size.
  • Test-selection MCQs: comparing two means → t-test; categorical data → chi-square; >2 group means → ANOVA.

High-yield: When a question describes a drug that truly works but the study found "no significant difference," the answer is almost always Type II error / the study was underpowered.

Rapid revision

  1. H₀ = no difference; H₁ = difference exists. We reject or fail to reject H₀ — never "accept" it.
  2. p-value = P(data this extreme | H₀ true). It is NOT the probability H₀ is true.
  3. p < 0.05 → reject H₀ (significant); p ≥ 0.05 → fail to reject.
  4. Type I error = α = false positive = rejecting a TRUE null (convicting the innocent).
  5. Type II error = β = false negative = missing a REAL effect (acquitting the guilty).
  6. Power = 1 − β; aim ≥ 0.80; it is the chance of detecting a true effect.
  7. Best way to raise power without inflating α = increase sample size.
  8. Lowering α raises β (lowers power) at fixed n — the α–β trade-off.
  9. 95% CI for a difference that includes 0 → not significant; CI for RR/OR that includes 1 → not significant.
  10. Higher confidence level → wider CI; 99% CI is wider than 95% CI.
  11. t-test/ANOVA for means (continuous); chi-square for proportions (categorical); Fisher's exact for small cells.
  12. Statistical significance ≠ clinical importance; always read effect size and CI alongside the p-value.