Tests of Hypotheses
Null and alternative hypotheses, Type I/II errors, P-values, statistical power, and tests for means, proportions, variances, and Goodness-of-Fit.
While estimation focuses on finding the value of a population parameter, hypothesis testing focuses on making decisions about a population parameter based on sample data. An engineer might ask: "Does this new steel alloy have a mean tensile strength greater than 400 MPa?" or "Is the variance in asphalt thickness less than 5 mm²?" Hypothesis testing provides a formal, objective framework to answer these yes-or-no questions.
The Framework of Hypothesis Testing
The formal steps required to set up and evaluate a statistical test.
1. Null Hypothesis (H₀)
The statement of the status quo, no effect, or no difference. It always contains an equality sign (=, ≤, or ≥). We assume H₀ is true until the sample data provides overwhelming evidence to the contrary.
Example: H₀: μ ≤ 400 MPa (The new alloy is no stronger than the old one).
2. Alternative Hypothesis (H₁ or Hₐ)
The statement we are trying to prove. It contradicts H₀ and never contains an equality sign (≠, <, or >). If the sample data strongly supports H₁, we "reject H₀."
Example: H₁: μ > 400 MPa (The new alloy is stronger).
3. Test Statistic
A standardized value calculated from the sample data (e.g., a Z-score, t-score, or χ² value) assuming H₀ is true. It measures how far our sample result is from the null hypothesis value, expressed in units of standard error.
4. P-Value
The probability of observing a test statistic as extreme as, or more extreme than, the one calculated from the sample, assuming the null hypothesis is true.
- A very small P-value (typically ≤ 0.05) indicates the observed data is highly unlikely under H₀, leading us to reject H₀.
- A large P-value indicates the data is consistent with H₀, so we "fail to reject H₀."
5. Significance Level (α)
The predetermined threshold for rejecting H₀. It is the maximum allowable probability of making a Type I Error. Common values are 0.05 (5%), 0.01 (1%), or 0.10 (10%).
Decision Rule: If P-value ≤ α, reject H₀. If P-value > α, fail to reject H₀.
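The five steps above can be sketched end to end. This is a minimal illustration using a one-sided Z-test with made-up numbers (H₀: μ ≤ 400 vs. H₁: μ > 400, σ assumed known); the data values are hypothetical, not from the text.

```python
import math

def norm_cdf(x):
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Assumed scenario: known sigma = 10 MPa, n = 25 specimens, sample mean 405 MPa.
mu0, sigma, n, xbar, alpha = 400.0, 10.0, 25, 405.0, 0.05

z = (xbar - mu0) / (sigma / math.sqrt(n))   # test statistic (standard errors from mu0)
p_value = 1.0 - norm_cdf(z)                 # upper-tail P-value for H1: mu > mu0
decision = "reject H0" if p_value <= alpha else "fail to reject H0"

print(z, round(p_value, 4), decision)       # z = 2.5 -> P ~ 0.006 -> reject H0
```

Here the sample mean sits 2.5 standard errors above 400 MPa, so the P-value falls well below α = 0.05.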
Errors in Decision Making and Statistical Power
The risks inherent in statistical inference.
Because we rely on partial information (a sample), we can make mistakes.
Type I Error (α)
Rejecting a true Null Hypothesis (a "false positive"). You conclude the new alloy is stronger when it actually isn't. The probability of a Type I error is precisely the significance level α.
Type II Error (β)
Failing to reject a false Null Hypothesis (a "false negative"). You conclude the new alloy is no better, but it actually is stronger. The probability of a Type II error is denoted β.
Statistical Power (1 − β)
The probability of correctly rejecting a false Null Hypothesis. A highly powerful test is very likely to detect a real difference if one exists.
- Power increases as the true difference (effect size) increases.
- Power increases as the significance level α increases (but this raises the risk of a Type I error).
- Power increases as the sample size n increases.
Sample Size Determination: Engineers often calculate the minimum sample size needed to achieve a specific power (e.g., 80%) before running an expensive test.
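For a one-sided Z-test these calculations have closed forms: Power = Φ(δ√n/σ − z₁₋α) and n ≥ ((z₁₋α + z_power)·σ/δ)². A minimal sketch with assumed values (effect size δ = 5 MPa, σ = 10, critical values hard-coded from standard normal tables):

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Assumed scenario: one-sided Z-test, sigma = 10, true effect delta = 5 MPa.
sigma, delta, alpha = 10.0, 5.0, 0.05
z_alpha = 1.6449   # one-sided critical value for alpha = 0.05 (table value)

# Power for n = 25: probability of rejecting H0 when the true mean is mu0 + delta.
n = 25
power = norm_cdf(delta * math.sqrt(n) / sigma - z_alpha)

# Minimum n for 80% power: n >= ((z_alpha + z_beta) * sigma / delta)^2
z_beta = 0.8416    # normal quantile for 80% power (table value)
n_required = math.ceil(((z_alpha + z_beta) * sigma / delta) ** 2)

print(round(power, 3), n_required)   # ~0.80 power already at n = 25
```

Note how power and sample size trade off: shrinking δ or σ, or tightening α, pushes n_required up quickly because of the squared term.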
Common Hypothesis Tests
1. Tests for a Single Population Mean (μ)
Testing claims about the center of a population.
- Z-Test (Variance Known): Rarely used in practice. Assumes the population variance σ² is known. Test statistic: Z = (x̄ − μ₀)/(σ/√n).
- t-Test (Variance Unknown): The standard test. Uses the sample standard deviation s. Test statistic: t = (x̄ − μ₀)/(s/√n) with df = n − 1.
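The t statistic is straightforward to compute by hand. A minimal sketch with a hypothetical tensile-strength sample (the data values are illustrative only):

```python
import math
from statistics import mean, stdev  # stdev uses the n - 1 (sample) definition

# Hypothetical tensile-strength measurements (MPa); testing H0: mu = 400.
data = [402, 398, 407, 410, 404, 399]
mu0 = 400.0

n = len(data)
xbar = mean(data)
s = stdev(data)                        # sample standard deviation

t = (xbar - mu0) / (s / math.sqrt(n))  # t statistic
df = n - 1                             # degrees of freedom

print(round(t, 3), df)                 # t ~ 1.76 with df = 5
```

The P-value would then come from a t table (or software) with df = 5; for a one-sided test at α = 0.05 the critical value is about 2.015, so this sample would not reject H₀.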
2. Tests for Two Population Means (μ₁ vs. μ₂)
Comparing two different groups (e.g., compressive strength of Mix A vs. Mix B).
- Independent Samples (Pooled t-Test): Assumes the two populations have equal (but unknown) variances. The sample variances are pooled to estimate a single standard error.
- Independent Samples (Welch's t-Test): Does not assume equal variances. More robust and generally preferred.
- Paired t-Test (Dependent Samples): Used when observations are naturally paired or matched (e.g., measuring the stiffness of the exact same beam before and after a retrofitting procedure). The test is performed on the differences between paired values, treating them as a single sample.
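Welch's test, the generally preferred independent-samples option above, can be sketched directly from its formulas. The two samples below are made-up numbers standing in for strength measurements of Mix A and Mix B:

```python
import math
from statistics import mean, variance  # variance uses the n - 1 (sample) definition

# Hypothetical compressive strengths (MPa) for two concrete mixes.
mix_a = [10, 12, 11, 13, 12]
mix_b = [9, 8, 10, 9, 9]

na, nb = len(mix_a), len(mix_b)
va, vb = variance(mix_a), variance(mix_b)

# Welch's t statistic: no equal-variance assumption, so each sample
# contributes its own variance to the standard error.
se = math.sqrt(va / na + vb / nb)
t = (mean(mix_a) - mean(mix_b)) / se

# Welch-Satterthwaite approximation for the degrees of freedom.
df = (va / na + vb / nb) ** 2 / (
    (va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1)
)

print(round(t, 3), round(df, 2))   # df is fractional, ~6.7 here
```

Note that the Welch df is generally a non-integer between min(n₁, n₂) − 1 and n₁ + n₂ − 2; a pooled t-test would instead use df = n₁ + n₂ − 2.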
3. Tests for a Single Proportion (p)
Testing categorical outcomes (e.g., percentage of defective items).
Uses the normal approximation (Z-test) if np₀ ≥ 5 and n(1 − p₀) ≥ 5. Test statistic: Z = (p̂ − p₀)/√(p₀(1 − p₀)/n).
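A minimal sketch of the proportion Z-test, with an assumed scenario (H₀: the defect rate is p₀ = 0.15; a hypothetical sample of 200 items contains 22 defectives):

```python
import math

# Assumed scenario: H0: p = 0.15; sample of n = 200 items with 22 defectives.
n, defects, p0 = 200, 22, 0.15
p_hat = defects / n

# Verify the normal-approximation conditions before using the Z-test.
assert n * p0 >= 5 and n * (1 - p0) >= 5

# Note the standard error uses p0 (the null value), not p_hat,
# because the statistic is computed assuming H0 is true.
z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)

print(round(z, 3))   # z ~ -1.58: not significant at alpha = 0.05 (two-sided)
```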
4. Tests for a Single Variance (σ²)
Testing claims about the variability or consistency of a process.
Uses the Chi-square (χ²) distribution with df = n − 1. Highly sensitive to departures from normality in the population. Test statistic: χ² = (n − 1)s²/σ₀².
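The statistic itself is a one-liner. A sketch with assumed numbers (H₀: σ² = 5 mm², hypothetical sample of 20 thickness measurements with s² = 4.2 mm²):

```python
# Assumed scenario: H0: sigma^2 = 5 mm^2; n = 20 measurements with s^2 = 4.2 mm^2.
n = 20
s_squared = 4.2
sigma0_squared = 5.0

chi2 = (n - 1) * s_squared / sigma0_squared  # chi-square test statistic
df = n - 1                                   # degrees of freedom

print(round(chi2, 2), df)                    # 15.96 on 19 df
```

The decision then compares χ² = 15.96 against the χ² critical values for df = 19 (unlike Z and t, the χ² distribution is not symmetric, so a two-sided test uses two different cutoffs).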
5. Goodness-of-Fit Tests
Checking if data follows a specific theoretical distribution.
Chi-Square Goodness-of-Fit Test
Used to determine whether a sample follows an expected probability distribution (e.g., "Is the arrival of cars at this intersection truly Poisson distributed?" or "Are these soil samples normally distributed?").
- H₀: The data follows the specified distribution.
- H₁: The data does not follow the specified distribution.
- Test Statistic: χ² = Σᵢ (Oᵢ − Eᵢ)²/Eᵢ, where Oᵢ are observed frequencies and Eᵢ are expected frequencies under H₀. The degrees of freedom are k − 1 − m, where k is the number of categories and m is the number of parameters estimated from the data.
- A large χ² value means the observed data deviates significantly from what was expected, leading to rejection of H₀.
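A minimal sketch of the statistic using a made-up die-fairness check (60 rolls; under H₀ each face is equally likely, so no parameters are estimated and df = k − 1):

```python
# Hypothetical counts of each die face over 60 rolls; H0: the die is fair.
observed = [8, 12, 9, 11, 6, 14]
expected = [sum(observed) / 6] * 6   # 10 per face under H0

# Chi-square goodness-of-fit statistic: sum of (O - E)^2 / E over categories.
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1               # no parameters estimated, so df = k - 1

# Chi-square critical value for alpha = 0.05, df = 5 (standard table value).
critical = 11.070
reject_h0 = chi2 > critical

print(round(chi2, 2), df, reject_h0)  # 4.2 on 5 df: consistent with a fair die
```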
The Connection Between CIs and Hypothesis Tests
There is a direct, mathematical duality between Confidence Intervals and two-sided Hypothesis Tests. If a 95% CI for the mean is [390, 410], then a two-sided hypothesis test (with α = 0.05) will:
- Fail to reject H₀: μ = 400 (because 400 is inside the interval).
- Reject H₀: μ = 380 (because 380 is outside the interval).
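The duality can be verified numerically. A sketch with assumed values (known σ, Z-based interval; the numbers are illustrative, not the [390, 410] interval above):

```python
import math

# Assumed: known sigma = 10, n = 4 observations, sample mean 400; alpha = 0.05.
xbar, sigma, n = 400.0, 10.0, 4
z_crit = 1.96                    # two-sided critical value for alpha = 0.05
se = sigma / math.sqrt(n)

# 95% confidence interval for the mean: x-bar +/- 1.96 * SE.
ci = (xbar - z_crit * se, xbar + z_crit * se)   # (390.2, 409.8)

def reject(mu0):
    """Two-sided Z-test decision at alpha = 0.05 for H0: mu = mu0."""
    return abs((xbar - mu0) / se) > z_crit

# Duality check: mu0 is rejected exactly when it falls outside the CI.
for mu0 in (380.0, 395.0, 400.0, 409.0, 415.0):
    outside_ci = not (ci[0] <= mu0 <= ci[1])
    assert reject(mu0) == outside_ci

print(ci)
```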
Key Takeaways
- H₀ and H₁: Formulate mutually exclusive hypotheses; H₀ contains equality.
- P-value: The probability of the sample data (or more extreme) assuming H₀ is true. Small P-values (typically ≤ α) trigger rejection of H₀.
- Type I Error (α): False positive (rejecting a true H₀).
- Type II Error (β): False negative (failing to reject a false H₀).
- Power (1 − β): The probability of correctly identifying a real effect. Highly dependent on sample size.
- Goodness-of-Fit (χ²): Tests whether observed categorical data matches an expected distribution.
- Duality: A 95% Confidence Interval contains all values of the parameter that would not be rejected by a two-sided test at α = 0.05.