Tests of Hypotheses
While estimation focuses on finding the value of a population parameter, hypothesis testing focuses on making decisions about a population parameter based on sample data. An engineer might ask: "Does this new steel alloy have a mean tensile strength greater than 400 MPa?" or "Is the variance in asphalt thickness less than 5 mm²?" Hypothesis testing provides a formal, objective framework to answer these yes-or-no questions.
The Framework of Hypothesis Testing
1. Null Hypothesis ($H_0$)
The statement of the status quo, no effect, or no difference. It always contains an equality sign ($=$, $\le$, or $\ge$). We assume $H_0$ is true until the sample data provides sufficiently strong evidence to the contrary.
Example: $H_0: \mu \le 400$ MPa (The new alloy is no stronger than the old one).
2. Alternative Hypothesis ($H_1$ or $H_a$)
The statement we are trying to prove. It contradicts $H_0$ and never contains an equality sign ($\ne$, $<$, or $>$). If the sample data strongly supports $H_1$, we "reject $H_0$."
Example: $H_1: \mu > 400$ MPa (The new alloy is stronger).
3. Test Statistic
A standardized value calculated from the sample data (e.g., a Z-score, t-score, or $\chi^2$ value) assuming $H_0$ is true. It measures how far our sample result is from the null hypothesis value, expressed in units of standard error.
4. P-Value
The probability of observing a test statistic as extreme as, or more extreme than, the one calculated from the sample, assuming the null hypothesis is true.
- A very small P-value (typically $< 0.05$) indicates the observed data is highly unlikely under $H_0$, leading us to reject $H_0$.
- A large P-value indicates the data is consistent with $H_0$, so we "fail to reject $H_0$."
5. Significance Level ($\alpha$)
The predetermined threshold for rejecting $H_0$. It is the maximum allowable probability of making a Type I Error. Common values are 0.05 (5%), 0.01 (1%), or 0.10 (10%).
Decision Rule: If P-value $\le \alpha$, reject $H_0$. If P-value $> \alpha$, fail to reject $H_0$.
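The five steps above can be sketched for the alloy example as an upper-tailed Z-test. The sample size, sample mean, and known standard deviation below are made-up illustration values, not data from the text, and a Z-test is used only because Python's standard library provides the normal distribution.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical values, for illustration only
mu_0 = 400.0      # null-hypothesis mean (MPa), H0: mu <= 400
sigma = 15.0      # assumed known population standard deviation (MPa)
n = 36            # sample size
x_bar = 405.0     # observed sample mean (MPa)
alpha = 0.05      # significance level, chosen before looking at the data

# Test statistic: distance of x_bar from mu_0 in standard-error units
z = (x_bar - mu_0) / (sigma / sqrt(n))

# Upper-tailed P-value: P(Z >= z) when H0 is true
p_value = 1.0 - NormalDist().cdf(z)

# Decision rule: reject H0 when the P-value is at most alpha
reject_h0 = p_value <= alpha
print(f"z = {z:.3f}, P-value = {p_value:.4f}, reject H0: {reject_h0}")
```

With these numbers the sample mean sits two standard errors above 400 MPa, so the P-value falls below 0.05 and $H_0$ is rejected.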
Errors in Decision Making and Statistical Power
Because we rely on partial information (a sample), we can make mistakes.
Type I Error ($\alpha$)
Rejecting a true Null Hypothesis (a "false positive"). You conclude the new alloy is stronger when it actually isn't. The maximum probability of a Type I error is the significance level $\alpha$.
Type II Error ($\beta$)
Failing to reject a false Null Hypothesis (a "false negative"). You conclude the new alloy is no better, but it actually is stronger.
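Both error rates can be estimated by simulation. The sketch below repeatedly draws normal samples and applies an upper-tailed Z-test; the function name `rejection_rate` and all parameter values (true means, standard deviation, sample size, trial count, seed) are assumptions chosen for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def rejection_rate(true_mean, mu_0=400.0, sigma=15.0, n=36,
                   alpha=0.05, trials=2000, seed=1):
    """Fraction of simulated samples in which an upper-tailed Z-test rejects H0."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1.0 - alpha)   # reject when z > z_crit
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        z = (fmean(sample) - mu_0) / (sigma / sqrt(n))
        if z > z_crit:
            rejections += 1
    return rejections / trials

# H0 true (mu = 400): every rejection is a false positive
type_1_rate = rejection_rate(true_mean=400.0)
# H0 false (mu = 405): every non-rejection is a false negative
type_2_rate = 1.0 - rejection_rate(true_mean=405.0)
print(f"estimated Type I rate: {type_1_rate:.3f}, "
      f"estimated Type II rate: {type_2_rate:.3f}")
```

The estimated Type I rate should land near the chosen significance level, while the Type II rate depends on how far the true mean is from 400 MPa and on the sample size.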
Statistical Power ($1 - \beta$)
The probability of correctly rejecting a false Null Hypothesis. A highly powerful test is very likely to detect a real difference if one exists.
- Power increases as the true difference (effect size) increases.
- Power increases as the significance level $\alpha$ increases (but this raises the risk of a Type I error).
- Power increases as the sample size increases.
Sample Size Determination: Engineers often calculate the minimum sample size needed to achieve a specific power (e.g., 80%) before running an expensive test.
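For an upper-tailed Z-test with known standard deviation, power and the required sample size have closed forms: power is $1 - \Phi(z_\alpha - \delta\sqrt{n}/\sigma)$ and the minimum sample size is $n = ((z_\alpha + z_\beta)\,\sigma/\delta)^2$, rounded up. The sketch below assumes that setting; the function names and the effect size of 5 MPa with $\sigma = 15$ MPa are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

_std = NormalDist()

def power_upper_z(delta, sigma, n, alpha=0.05):
    """Power of an upper-tailed Z-test when the true mean exceeds mu_0 by delta."""
    z_alpha = _std.inv_cdf(1.0 - alpha)
    return 1.0 - _std.cdf(z_alpha - delta * sqrt(n) / sigma)

def min_sample_size(delta, sigma, alpha=0.05, target_power=0.80):
    """Smallest n for which the upper-tailed Z-test reaches the target power."""
    z_alpha = _std.inv_cdf(1.0 - alpha)
    z_beta = _std.inv_cdf(target_power)
    return ceil(((z_alpha + z_beta) * sigma / delta) ** 2)

# Hypothetical planning values: detect a 5 MPa improvement when sigma = 15 MPa
n_needed = min_sample_size(delta=5.0, sigma=15.0)
print(n_needed, round(power_upper_z(5.0, 15.0, n_needed), 3))
```

Calling `power_upper_z` with a larger `delta`, `alpha`, or `n` reproduces the three trends listed above.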
Common Hypothesis Tests
1. Tests for a Single Population Mean ($\mu$)
- Z-Test (Variance Known): Rarely used in practice. Assumes population variance $\sigma^2$ is known. Test statistic: $Z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$.
- t-Test (Variance Unknown): The standard test. Uses the sample standard deviation $s$. Test statistic: $t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$ with $df = n - 1$.
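As a small worked sketch of the t-test, the code below computes the statistic for the alloy hypotheses ($H_0: \mu \le 400$ versus $H_1: \mu > 400$). The ten measurements are invented for illustration, and the critical value is taken from a standard t-table rather than computed, since the standard library has no t-distribution.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical tensile-strength measurements (MPa), for illustration only
sample = [412, 398, 405, 420, 401, 409, 415, 396, 407, 411]
mu_0 = 400.0

n = len(sample)
x_bar = mean(sample)
s = stdev(sample)                  # sample standard deviation (n - 1 denominator)
t_stat = (x_bar - mu_0) / (s / sqrt(n))
df = n - 1

# Upper-tail 5% critical value for 9 degrees of freedom, from a t-table
t_crit = 1.833
reject_h0 = t_stat > t_crit
print(f"t = {t_stat:.3f}, df = {df}, reject H0: {reject_h0}")
```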
2. Tests for Two Population Means ($\mu_1 - \mu_2$)
- Independent Samples (Pooled t-Test): Assumes the two populations have equal (but unknown) variances. The sample variances are pooled to estimate a single standard error.
- Independent Samples (Welch's t-Test): Does not assume equal variances. More robust and generally preferred.
- Paired t-Test (Dependent Samples): Used when observations are naturally paired or matched (e.g., measuring the stiffness of the exact same beam before and after a retrofitting procedure). The test is performed on the differences between paired values, treating them as a single sample.
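The Welch and paired statistics can be sketched with the standard library alone. The helper names `welch_t` and `paired_t` and the sample values are hypothetical; the Welch degrees of freedom use the usual Welch-Satterthwaite approximation, which is not stated in the text above.

```python
from math import sqrt
from statistics import mean, stdev, variance

def welch_t(sample_1, sample_2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n1, n2 = len(sample_1), len(sample_2)
    a = variance(sample_1) / n1    # squared standard error from sample 1
    b = variance(sample_2) / n2    # squared standard error from sample 2
    t_stat = (mean(sample_1) - mean(sample_2)) / sqrt(a + b)
    df = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return t_stat, df

def paired_t(before, after):
    """Paired t statistic: a one-sample t-test on the differences (after - before)."""
    diffs = [y - x for x, y in zip(before, after)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / sqrt(n)), n - 1

# Made-up numbers, for illustration only
t_w, df_w = welch_t([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
t_p, df_p = paired_t([10, 12, 11, 13], [11, 14, 12, 15])
print(f"Welch: t = {t_w:.3f}, df = {df_w:.1f}; paired: t = {t_p:.3f}, df = {df_p}")
```

Note how the paired version never compares the two group means directly: it reduces the data to one column of differences first, which removes beam-to-beam variability from the standard error.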
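3. Tests for a Single Proportion ($p$)
Uses the normal approximation (Z-test) if $np_0$ and $n(1 - p_0)$ are both large enough (a common rule of thumb is at least 5; some texts require 10). Test statistic: $Z = \frac{\hat{p} - p_0}{\sqrt{p_0 (1 - p_0) / n}}$.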
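A minimal sketch of the one-proportion Z-test follows. The function name `one_proportion_z` and the inspection counts are hypothetical, and the two-sided alternative is an assumption made for the example.

```python
from math import sqrt
from statistics import NormalDist

def one_proportion_z(successes, n, p_0):
    """Z statistic and two-sided P-value for H0: p = p_0 (normal approximation)."""
    p_hat = successes / n
    z = (p_hat - p_0) / sqrt(p_0 * (1.0 - p_0) / n)
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical inspection result: 28 defective parts out of 200, H0: p = 0.10
z, p_value = one_proportion_z(28, 200, 0.10)
print(f"z = {z:.3f}, P-value = {p_value:.4f}")
```

Here the P-value comes out slightly above 0.05, so at $\alpha = 0.05$ the test fails to reject $H_0$ even though the sample proportion (0.14) is higher than 0.10.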
4. Tests for a Single Variance ($\sigma^2$)
Uses the Chi-square ($\chi^2$) distribution. Highly sensitive to departures from normality in the population. Test statistic: $\chi^2 = \frac{(n - 1) s^2}{\sigma_0^2}$.
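Applied to the asphalt question from the introduction ("Is the variance less than 5 mm²?"), the statistic can be computed as below. The sample size and sample variance are invented, and the critical value is read from a chi-square table rather than computed.

```python
# Hypothetical asphalt-thickness sample, for illustration only
n = 20             # number of cores measured
s_squared = 2.6    # sample variance (mm^2)
sigma_0_sq = 5.0   # H0: sigma^2 >= 5 mm^2, H1: sigma^2 < 5 mm^2
df = n - 1

chi2_stat = df * s_squared / sigma_0_sq

# Lower-tail 5% critical value for 19 degrees of freedom, from a chi-square table
chi2_crit = 10.117
reject_h0 = chi2_stat < chi2_crit   # small values support H1: sigma^2 < 5
print(f"chi-square = {chi2_stat:.2f}, df = {df}, reject H0: {reject_h0}")
```

Because the alternative is "less than," the rejection region is in the lower tail, so the statistic is compared against the lower critical value.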
5. Goodness-of-Fit Tests
Chi-Square Goodness-of-Fit Test
Used to determine whether a sample follows an expected probability distribution (e.g., "Is the arrival of cars at this intersection truly Poisson distributed?" or "Are these soil samples normally distributed?").
- $H_0$: The data follows the specified distribution.
- $H_1$: The data does not follow the specified distribution.
- Test Statistic: $\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$, where $O_i$ are observed frequencies and $E_i$ are expected frequencies under $H_0$.
- A large $\chi^2$ value means the observed data deviates significantly from what was expected, leading to rejection of $H_0$.
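The statistic is simple to compute once the expected frequencies are known. The sketch below uses a fair-die example instead of the Poisson or normal cases mentioned above, purely to keep the expected counts trivial; the counts and the helper name `chi_square_gof` are hypothetical.

```python
def chi_square_gof(observed, expected):
    """Chi-square goodness-of-fit statistic: sum of (O - E)^2 / E over categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical example: 120 rolls of a die, H0: all six faces equally likely
observed = [18, 22, 16, 25, 19, 20]
expected = [sum(observed) / 6] * 6      # 20 per face under H0
chi2_stat = chi_square_gof(observed, expected)

# Upper-tail 5% critical value for df = 6 - 1 = 5, from a chi-square table
chi2_crit = 11.070
reject_h0 = chi2_stat > chi2_crit
print(f"chi-square = {chi2_stat:.2f}, reject H0: {reject_h0}")
```

When distribution parameters (such as a Poisson rate) are estimated from the same data, the degrees of freedom drop by one for each estimated parameter, so the critical value changes accordingly.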
The Connection Between CIs and Hypothesis Tests
There is a direct, mathematical duality between Confidence Intervals and two-sided Hypothesis Tests. If a 95% CI for the mean is [390, 410], then a two-sided hypothesis test (with $\alpha = 0.05$) will:
- Fail to reject $H_0: \mu = 400$ (because 400 is inside the interval).
- Reject $H_0: \mu = 380$ (because 380 is outside the interval).
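The duality can be checked numerically. The sketch below assumes a Z-based interval and picks a hypothetical sample mean and standard error so that the 95% CI is exactly [390, 410], then compares "inside the interval" with "not rejected" for the two null values above.

```python
from statistics import NormalDist

std = NormalDist()
alpha = 0.05
z_half = std.inv_cdf(1.0 - alpha / 2.0)   # about 1.96

# Hypothetical summary statistics chosen so the 95% CI is [390, 410]
x_bar = 400.0
std_err = 10.0 / z_half

ci_low = x_bar - z_half * std_err
ci_high = x_bar + z_half * std_err

def two_sided_p(mu_0):
    """Two-sided P-value for H0: mu = mu_0 using the Z statistic."""
    z = (x_bar - mu_0) / std_err
    return 2.0 * (1.0 - std.cdf(abs(z)))

for mu_0 in (400.0, 380.0):
    inside = ci_low <= mu_0 <= ci_high
    rejected = two_sided_p(mu_0) <= alpha
    print(f"mu_0 = {mu_0}: inside CI = {inside}, rejected = {rejected}")
```

For every candidate value of $\mu_0$, the two checks give opposite answers: a value is rejected exactly when it lies outside the interval.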
Conclusion
- $H_0$ and $H_1$: Formulate mutually exclusive hypotheses; $H_0$ contains equality.
- P-value: The probability of obtaining a result at least as extreme as the observed one, assuming $H_0$ is true. Small P-values (typically $< 0.05$) trigger rejection of $H_0$.
- Type I Error ($\alpha$): False positive (rejecting true $H_0$).
- Type II Error ($\beta$): False negative (failing to reject false $H_0$).
- Power ($1 - \beta$): The probability of correctly identifying a real effect. Highly dependent on sample size.
- Goodness-of-Fit ($\chi^2$): Tests whether observed categorical data matches an expected distribution.
- Duality: A 95% Confidence Interval contains all values of the parameter that would not be rejected by a two-sided test at $\alpha = 0.05$.