Regression and Correlation
Simple and multiple linear regression, correlation coefficients, least squares method, and residual analysis.
Engineers frequently need to predict the value of one variable based on the value of another. For example, predicting the compressive strength of concrete (y) based on its curing time (x), or estimating a river's peak flow rate (y) based on rainfall intensity (x).
While correlation measures the strength and direction of a linear relationship, regression provides the mathematical equation to make actual predictions.
Correlation
Quantifying the strength of the linear relationship.
Scatter Plots
A graphical tool used to plot two quantitative variables on a Cartesian coordinate system. Scatter plots provide the first visual indication of whether a linear relationship, a non-linear relationship, or no relationship exists between x and y. They are essential before proceeding with correlation or regression calculations.
Pearson Correlation Coefficient (r)
A unitless measure that ranges from -1 to +1, describing how closely data points fall along a straight line.
- r ≈ +1: Strong positive linear relationship (as x increases, y predictably increases).
- r ≈ −1: Strong negative linear relationship (as x increases, y predictably decreases).
- r ≈ 0: Weak or no linear relationship (the data appears as a random scatter). Note: A value of r = 0 does not mean there is no relationship at all; there could be a strong non-linear (e.g., quadratic) relationship.
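As a quick numerical sketch of r, the snippet below uses hypothetical curing-time/strength data invented for illustration (not taken from any real experiment):

```python
import numpy as np

# Hypothetical data: concrete curing time (days) vs. compressive strength (MPa)
curing_time = np.array([3, 7, 14, 21, 28, 35], dtype=float)
strength = np.array([18.5, 24.1, 30.2, 33.8, 36.0, 37.1])

# Pearson's r: the covariance of the two variables scaled by their standard deviations
r = np.corrcoef(curing_time, strength)[0, 1]
print(f"r = {r:.3f}")  # close to +1: strong positive linear relationship
```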
Correlation does not imply causation
Just because two variables are highly correlated (e.g., asphalt sales and ice cream sales) does not mean one causes the other. Both might be driven by a lurking third variable (summer weather).
Simple Linear Regression
Modeling the relationship with a straight line.
The Regression Model
We hypothesize that the true relationship is a straight line, plus some random, unobservable error (ε):

y = β₀ + β₁x + ε

- y: The dependent (response) variable we want to predict.
- x: The independent (predictor or explanatory) variable.
- β₀: The true y-intercept (the value of y when x = 0).
- β₁: The true slope (the change in y for a one-unit change in x).
- ε: The random error term. We assume these errors are normally distributed with a mean of 0 and a constant variance (σ²).
The Method of Least Squares
Finding the "line of best fit."
Because we only have a sample, we estimate the true parameters (β₀, β₁) with sample statistics (b₀, b₁).
Estimated Regression Equation
ŷ = b₀ + b₁x

The "hat" on ŷ indicates it is an estimated or predicted value, not an actual observed value.
Least Squares Principle
The line of best fit is the one that minimizes the Sum of Squared Errors, SSE = Σ(yᵢ − ŷᵢ)². The error (or residual) is the vertical distance between an observed data point (yᵢ) and the predicted value on the line (ŷᵢ).
Using calculus to minimize SSE yields the formulas for the slope (b₁) and intercept (b₀):

b₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²

b₀ = ȳ − b₁x̄
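The closed-form least-squares estimates translate directly into code. This sketch uses hypothetical curing-time/strength data invented for illustration:

```python
import numpy as np

# Hypothetical data: curing time (days) vs. compressive strength (MPa)
x = np.array([3, 7, 14, 21, 28, 35], dtype=float)
y = np.array([18.5, 24.1, 30.2, 33.8, 36.0, 37.1])

# Closed-form least-squares estimates of slope and intercept
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
print(f"y-hat = {b0:.2f} + {b1:.3f} x")
```

The same estimates can be checked against `np.polyfit(x, y, 1)`, which fits a degree-1 polynomial by least squares.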
Assessing the Model and Residual Analysis
How good is our prediction?
Before relying on a regression equation for engineering decisions, we must verify that the model is appropriate.
Coefficient of Determination (R²)
The proportion of the total variation in the dependent variable (y) that is explained by the regression model (the independent variable x).
- 0 ≤ R² ≤ 1.
- An R² = 0.85 means that 85% of the variation in concrete strength can be explained by the variation in curing time.
- For simple linear regression (one predictor), R² = r².
Standard Error of the Estimate (sₑ)
A measure of the typical distance that observed data points fall from the regression line. It estimates the standard deviation of the error term (σ): sₑ = √(SSE / (n − 2)).
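Both measures fall out of the fitted line in a few lines of code. The data below are hypothetical curing-time/strength values invented for illustration:

```python
import numpy as np

# Hypothetical data: curing time (days) vs. compressive strength (MPa)
x = np.array([3, 7, 14, 21, 28, 35], dtype=float)
y = np.array([18.5, 24.1, 30.2, 33.8, 36.0, 37.1])

# Fit the least-squares line
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sse = np.sum((y - y_hat) ** 2)        # unexplained variation
sst = np.sum((y - y.mean()) ** 2)     # total variation in y
r_squared = 1 - sse / sst             # proportion explained by the model
s_e = np.sqrt(sse / (len(x) - 2))     # standard error of the estimate
print(f"R^2 = {r_squared:.3f}, s_e = {s_e:.2f}")
```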
Residual Analysis (Validating Assumptions)
A critical step. We plot the residuals (eᵢ = yᵢ − ŷᵢ) against the predicted values (ŷᵢ) or the predictor (x). For the linear model to be valid, the residual plot should show a random, structureless horizontal band around zero.
- Non-linearity: If the residuals show a curved pattern (like a U-shape), a straight line is not appropriate; a polynomial regression is needed.
- Heteroscedasticity: If the spread of the residuals increases (a fan shape), the variance of the errors is not constant. This violates a key assumption and requires a data transformation (e.g., taking the log of y).
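A sketch of how non-linearity shows up in residuals, using a deliberately quadratic (hypothetical) data set fit with a straight line:

```python
import numpy as np

# Hypothetical data with a truly quadratic trend, fit by a straight line anyway
x = np.linspace(0, 10, 11)
y = 2 + 0.5 * x**2

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
residuals = y - (b0 + b1 * x)

# The signs reveal a U-shape: positive at both ends, negative in the middle,
# signalling that the straight-line model is inappropriate.
print(np.sign(residuals))
```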
Multiple Linear Regression
Using more than one predictor variable.
Engineers rarely predict an outcome based on a single variable. Concrete strength depends on water-cement ratio, curing time, temperature, and aggregate type.
Multiple Regression Model

ŷ = b₀ + b₁x₁ + b₂x₂ + ⋯ + bₖxₖ

- Each slope bᵢ represents the change in the estimated y for a one-unit change in xᵢ, holding all other predictor variables constant.
- Adjusted R²: Unlike regular R² (which always increases when you add a variable), Adjusted R² penalizes adding variables that do not significantly improve the model, preventing "overfitting."
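A minimal multiple-regression sketch, using a small hypothetical concrete data set (water-cement ratio and curing time as predictors) invented for illustration:

```python
import numpy as np

# Hypothetical concrete data: strength (MPa) vs. water-cement ratio and curing time (days)
wc_ratio = np.array([0.40, 0.45, 0.50, 0.55, 0.60, 0.45, 0.50])
cure_days = np.array([7.0, 7.0, 14.0, 14.0, 28.0, 28.0, 7.0])
strength = np.array([34.0, 30.5, 31.0, 27.5, 29.0, 33.5, 29.8])

# Design matrix with an intercept column; lstsq solves the least-squares problem
X = np.column_stack([np.ones_like(wc_ratio), wc_ratio, cure_days])
b, *_ = np.linalg.lstsq(X, strength, rcond=None)

y_hat = X @ b
sse = np.sum((strength - y_hat) ** 2)
sst = np.sum((strength - strength.mean()) ** 2)
n, k = len(strength), 2                 # n observations, k predictors
r2 = 1 - sse / sst
adj_r2 = 1 - (sse / (n - k - 1)) / (sst / (n - 1))
print(f"b = {np.round(b, 3)}, R^2 = {r2:.3f}, adj R^2 = {adj_r2:.3f}")
```

Note that the adjusted value is always at most the plain R², since it charges a penalty for each predictor used.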
Hypothesis Testing in Regression (ANOVA Approach)
Testing if the model is statistically significant.
Hypothesis Test for the Slope (β₁)
In simple linear regression, testing whether a linear relationship exists is equivalent to testing the null hypothesis H₀: β₁ = 0 (the true slope is zero). We use a t-test with n − 2 degrees of freedom. If we reject H₀, there is sufficient evidence that x provides information in predicting y.
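The test statistic is the estimated slope divided by its standard error. A sketch with hypothetical curing-time/strength data; the critical value 2.776 is the two-sided 5% point of the t distribution with 4 degrees of freedom:

```python
import numpy as np

# Hypothetical data: curing time (days) vs. compressive strength (MPa)
x = np.array([3, 7, 14, 21, 28, 35], dtype=float)
y = np.array([18.5, 24.1, 30.2, 33.8, 36.0, 37.1])
n = len(x)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
residuals = y - (b0 + b1 * x)

s_e = np.sqrt(np.sum(residuals**2) / (n - 2))        # standard error of the estimate
se_b1 = s_e / np.sqrt(np.sum((x - x.mean()) ** 2))   # standard error of the slope
t_stat = b1 / se_b1                                  # test statistic for H0: beta1 = 0

# Compare |t| to the critical value t(0.025, df = n - 2 = 4) = 2.776
print(f"t = {t_stat:.2f}; reject H0 at alpha = 0.05: {abs(t_stat) > 2.776}")
```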
F-Test for Overall Significance
Tests whether the regression model as a whole is better than simply predicting the mean of y (ȳ).
- H₀: β₁ = β₂ = ⋯ = βₖ = 0 (The model is useless).
- If the resulting P-value is very small, at least one predictor is significantly related to y.
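The F statistic is the mean square due to regression divided by the mean square error. A sketch for the one-predictor case with hypothetical data (where F must equal t² from the slope test):

```python
import numpy as np

# Hypothetical data: curing time (days) vs. compressive strength (MPa)
x = np.array([3, 7, 14, 21, 28, 35], dtype=float)
y = np.array([18.5, 24.1, 30.2, 33.8, 36.0, 37.1])
n = len(x)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
sse = np.sum((y - (b0 + b1 * x)) ** 2)   # unexplained variation
sst = np.sum((y - y.mean()) ** 2)        # total variation
ssr = sst - sse                          # variation explained by the regression

f_stat = (ssr / 1) / (sse / (n - 2))     # MSR / MSE with 1 and n - 2 df
print(f"F = {f_stat:.1f}")               # in simple regression, F equals t^2
```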
t-Tests for Individual Coefficients
If the overall model is significant, we run t-tests on each individual slope (bᵢ) to determine which specific variables are actually contributing.
- H₀: βᵢ = 0 (This specific predictor is useless, assuming all others are already in the model).
Key Takeaways
- Correlation (r): Measures the strength of a linear relationship (-1 to 1). Does not imply causation.
- Least Squares: The mathematical method used to find the line that minimizes the sum of squared errors (SSE).
- R²: The percentage of variation in y explained by the model.
- Residual Analysis: Crucial for validating the assumptions of linearity and constant variance. Residual plots should look like random noise.
- Multiple Regression: Uses multiple predictors. Each coefficient bᵢ represents the effect of xᵢ holding all other variables constant.
- ANOVA F-Test: Determines if the overall regression model is statistically significant.