Curve Fitting and Interpolation

Engineers often need to construct a continuous mathematical function that represents discrete data points. There are two fundamentally different approaches, depending on the nature of the data.

Regression vs. Interpolation

  • Regression (Curve Fitting): Used when the data exhibits significant scatter or error (e.g., experimental measurements with noise). The goal is to derive a single curve that represents the general trend of the data. Crucially, the regression curve does not necessarily pass through any of the specific data points.
  • Interpolation: Used when the data is known to be highly precise or exact (e.g., tabulated thermodynamic properties or computer-generated coordinates). The goal is to derive a curve that passes exactly through every single data point.

Least-Squares Regression

When fitting a trend line to noisy data, the most common method minimizes the sum of the squares of the residuals (errors) between the discrete points and the continuous curve.

Linear Regression

Fitting a straight line $y = a_0 + a_1 x$ to $n$ data points. The coefficients $a_0$ (intercept) and $a_1$ (slope) are found by minimizing the sum of the squared residuals.

Normal Equations

The coefficients are derived by taking the partial derivatives of the sum of squared residuals with respect to $a_0$ and $a_1$ and setting them to zero. This yields the "Normal Equations":
  1. $n a_0 + \left(\sum x_i\right) a_1 = \sum y_i$
  2. $\left(\sum x_i\right) a_0 + \left(\sum x_i^2\right) a_1 = \sum x_i y_i$
Solving this $2 \times 2$ linear system directly yields the formulas for $a_1$ and $a_0$:
$$a_1 = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2}$$
$$a_0 = \bar{y} - a_1 \bar{x}$$
To quantify the "goodness of fit", we use the coefficient of determination ($R^2$). It measures how much of the variance in the dependent variable is explained by the model:
$$R^2 = \frac{S_t - S_r}{S_t}$$
where $S_t = \sum (y_i - \bar{y})^2$ is the total sum of squares and $S_r = \sum (y_i - a_0 - a_1 x_i)^2$ is the sum of the squares of the residuals. An $R^2$ value close to 1 indicates an excellent fit. The standard error of the estimate is $s_{y/x} = \sqrt{S_r / (n - 2)}$.
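
As a concrete check, the formulas above can be applied directly. The sketch below (plain Python) uses the seven sample points tabulated later in this section and computes the slope, intercept, and $R^2$:

```python
# Least-squares line fit using the normal-equation formulas above.
# Data: the seven sample points from the worked example in this section.
xs = [1, 2, 3, 4, 5, 6, 7]
ys = [1.50, 2.20, 3.10, 4.80, 5.50, 6.90, 7.20]

n = len(xs)
sx, sy = sum(xs), sum(ys)
sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))

a1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)  # slope
a0 = sy / n - a1 * sx / n                       # intercept = ybar - a1 * xbar

ybar = sy / n
St = sum((y - ybar) ** 2 for y in ys)                      # total sum of squares
Sr = sum((y - a0 - a1 * x) ** 2 for x, y in zip(xs, ys))   # residual sum of squares
r2 = (St - Sr) / St

print(round(a0, 4), round(a1, 4), round(r2, 4))  # 0.3286 1.0321 0.982
```

The results match the tabulated regression statistics below.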

Least-Squares Linear Regression


Regression Statistics

For the seven sample points below, the fitted line is $y = 0.329 + 1.032x$.
  • Intercept ($a_0$): 0.3286
  • Slope ($a_1$): 1.0321
  • Coefficient of determination ($R^2$): 0.9820
  • Sum of squared residuals ($S_r$): 0.5482

Sample data:
  Point   x      y
  1       1.00   1.50
  2       2.00   2.20
  3       3.00   3.10
  4       4.00   4.80
  5       5.00   5.50
  6       6.00   6.90
  7       7.00   7.20

Linearization of Nonlinear Relationships

Many nonlinear relationships can be transformed to allow for linear regression by taking the natural logarithm or base-10 logarithm of the variables.

Common Linearizations

  • Exponential Equation: $y = \alpha e^{\beta x}$ becomes $\ln y = \ln \alpha + \beta x$. Let $y^* = \ln y$, $a_0 = \ln \alpha$, $a_1 = \beta$, $x^* = x$.
  • Power Equation: $y = \alpha x^{\beta}$ becomes $\log y = \log \alpha + \beta \log x$. Let $y^* = \log y$, $a_0 = \log \alpha$, $a_1 = \beta$, $x^* = \log x$.
  • Saturation-Growth-Rate: $y = \frac{\alpha x}{\beta + x}$ becomes $\frac{1}{y} = \frac{\beta}{\alpha} \frac{1}{x} + \frac{1}{\alpha}$. Let $y^* = \frac{1}{y}$, $a_0 = \frac{1}{\alpha}$, $a_1 = \frac{\beta}{\alpha}$, $x^* = \frac{1}{x}$.
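
A minimal sketch of the first transformation, with made-up values $\alpha = 2$ and $\beta = 0.5$: fitting a straight line to $(x, \ln y)$ and back-transforming the intercept recovers the original parameters.

```python
import math

# Fit y = alpha * exp(beta * x) by linear regression on (x, ln y).
# alpha = 2, beta = 0.5 are illustrative values, not from the text.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * math.exp(0.5 * x) for x in xs]       # noise-free synthetic data

zs = [math.log(y) for y in ys]                   # y* = ln(y)
n = len(xs)
sx, sz = sum(xs), sum(zs)
sxx = sum(x * x for x in xs)
sxz = sum(x * z for x, z in zip(xs, zs))

a1 = (n * sxz - sx * sz) / (n * sxx - sx * sx)   # slope     -> beta
a0 = sz / n - a1 * sx / n                        # intercept -> ln(alpha)
alpha, beta = math.exp(a0), a1                   # back-transform
```

Because the synthetic data is noise-free, the fit recovers $\alpha$ and $\beta$ essentially exactly; with noisy data, note that least squares is minimizing error in the transformed variable, not the original one.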

Multiple Linear Regression

When a dependent variable $y$ is a linear function of two or more independent variables (e.g., $y = a_0 + a_1 x_1 + a_2 x_2$), least-squares regression can be extended: the coefficients are found by solving the system of simultaneous linear equations obtained by setting the partial derivatives of the sum of squared residuals to zero.
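
A sketch with NumPy (assumed available; the data and the "true" coefficients $(1, 2, -3)$ are made up): the design matrix carries one column per term, and a least-squares solve recovers the coefficients.

```python
import numpy as np

# Multiple linear regression y = a0 + a1*x1 + a2*x2.
# The sample points and true coefficients are illustrative only.
x1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([1.0, 0.0, 2.0, 1.0, 3.0, 2.0])
y = 1.0 + 2.0 * x1 - 3.0 * x2

A = np.column_stack([np.ones_like(x1), x1, x2])  # design matrix [1, x1, x2]
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)   # minimizes ||A a - y||^2
```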

Polynomial Regression and Overfitting

Extending the least-squares concept to fit polynomials of degree $m$: $y = a_0 + a_1 x + a_2 x^2 + \dots + a_m x^m$. This involves solving an $(m+1) \times (m+1)$ system of simultaneous linear equations (the general Normal Equations) to find the coefficients.
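
The $(m+1) \times (m+1)$ system can be formed explicitly with a Vandermonde design matrix; a sketch (NumPy assumed, with made-up data sampled from an exact quadratic so the fit recovers it):

```python
import numpy as np

# Degree-m polynomial regression via the (m+1)x(m+1) normal equations
# (A^T A) a = A^T y, where A is the Vandermonde design matrix.
xs = np.linspace(0.0, 2.0, 9)
ys = 1.0 - 2.0 * xs + 0.5 * xs**2          # exact quadratic (illustrative)

m = 2
A = np.vander(xs, m + 1, increasing=True)  # columns: 1, x, x^2
coeffs = np.linalg.solve(A.T @ A, A.T @ ys)
```

In practice, forming $A^T A$ squares the condition number; a QR-based solver such as `np.linalg.lstsq` is numerically safer for high degrees.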

The Danger of Overfitting

While a higher-degree polynomial will technically yield a lower sum of squared residuals, arbitrarily increasing the degree $m$ simply to chase a higher $R^2$ value is a severe error known as overfitting. An overfitted polynomial weaves wildly to hit noise points rather than capturing the true underlying physics. The resulting function is highly unstable and nearly useless for making predictions between or beyond the data points. A general rule is to use the lowest-degree polynomial that adequately captures the physical trend, validating it against an independent test set.

Orthogonal Polynomials and Fourier Approximation

For continuous functions or extensive data sets, using orthogonal polynomials (like Legendre or Chebyshev polynomials) for regression minimizes numerical errors.

Fourier Series

Instead of fitting polynomials, Fourier approximation uses a series of sine and cosine functions to model periodic phenomena (e.g., tides or vibrations). It is a form of least-squares fitting using orthogonal trigonometric functions:
$$f(t) = a_0 + \sum_{k=1}^{n} \left( a_k \cos(k\omega_0 t) + b_k \sin(k\omega_0 t) \right)$$
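
A sketch of the idea (NumPy assumed; the sampled signal is made up): build a design matrix whose columns are the sines and cosines, then solve in the least-squares sense.

```python
import numpy as np

# Least-squares Fourier fit over one period T (illustrative signal:
# f(t) = 1 + 2*cos(w0*t) + 0.5*sin(2*w0*t)).
T = 2.0 * np.pi
w0 = 2.0 * np.pi / T
t = np.linspace(0.0, T, 64, endpoint=False)
f = 1.0 + 2.0 * np.cos(w0 * t) + 0.5 * np.sin(2.0 * w0 * t)

nterms = 2
cols = [np.ones_like(t)]
for k in range(1, nterms + 1):
    cols += [np.cos(k * w0 * t), np.sin(k * w0 * t)]
A = np.column_stack(cols)                        # trig design matrix
coeffs, *_ = np.linalg.lstsq(A, f, rcond=None)   # [a0, a1, b1, a2, b2]
```

Because the trigonometric basis is orthogonal on equally spaced samples over one period, the recovered coefficients match the signal's true amplitudes.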

The Dangers of Extrapolation

Interpolation refers to predicting values within the range of the given data set. Extrapolation refers to predicting values outside that range.

Caution

Extrapolating outside the bounds of the base data is highly dangerous. Polynomials, in particular, tend to oscillate wildly outside their fitting range. A trend that appears strictly linear or exponential within the given data may completely break down outside it, leading to grossly erroneous predictions.

Newton's Divided-Difference Interpolating Polynomials

Interpolation is used when data is known to be highly precise, and the fitted curve must pass directly through every data point. Newton's method is one of the most popular and useful forms.

Newton's Interpolating Polynomial

A polynomial of order $n$ that passes through $n+1$ points:
$$f_n(x) = b_0 + b_1(x - x_0) + b_2(x - x_0)(x - x_1) + \dots + b_n(x - x_0)(x - x_1)\cdots(x - x_{n-1})$$
The coefficients $b_i$ are evaluated using finite divided differences.
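
A minimal sketch of the divided-difference table and nested evaluation (plain Python; the three sample points on $f(x) = x^2$ are made up for illustration):

```python
# Newton divided-difference interpolation.
def divided_differences(xs, ys):
    """Return the coefficients b0..bn of Newton's interpolating polynomial."""
    coeffs = list(ys)
    n = len(xs)
    for j in range(1, n):
        # Update in place from the bottom up so earlier columns survive.
        for i in range(n - 1, j - 1, -1):
            coeffs[i] = (coeffs[i] - coeffs[i - 1]) / (xs[i] - xs[i - j])
    return coeffs

def newton_eval(xs, coeffs, x):
    """Evaluate the polynomial in nested (Horner-like) form."""
    result = coeffs[-1]
    for i in range(len(coeffs) - 2, -1, -1):
        result = result * (x - xs[i]) + coeffs[i]
    return result

# Example: three points on f(x) = x^2 are reproduced exactly.
xs = [0.0, 1.0, 3.0]
ys = [0.0, 1.0, 9.0]
b = divided_differences(xs, ys)
value = newton_eval(xs, b, 2.0)   # interpolate at x = 2
```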

Inverse Interpolation

In standard interpolation, you provide a value $x$ to find $f(x)$. In inverse interpolation, you are given a value of $f(x)$ and must find the corresponding $x$.

How to perform Inverse Interpolation

  • Swapping variables: The simplest approach is to swap $x$ and $y$ (making $x$ the dependent variable) and use a standard method like Newton's or Lagrange's. However, this only works well if the function is monotonic.
  • Root-finding: A more robust approach is to form a new function $g(x) = f(x) - y_{\text{target}}$ and apply root-finding methods (like the Secant Method) to the interpolating polynomial.
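
A sketch of the root-finding approach (plain Python; the interpolant and the target value are made up). Here the interpolant happens to reproduce $f(x) = x^2$ exactly, and we seek the $x$ where it equals 4.

```python
# Inverse interpolation by root-finding on the interpolant.
def interp(x):
    # Stand-in for an interpolating polynomial; reproduces f(x) = x^2.
    return x * x

def secant(g, x0, x1, tol=1e-10, maxit=50):
    """Secant iteration for g(x) = 0, starting from two guesses."""
    for _ in range(maxit):
        g0, g1 = g(x0), g(x1)
        x2 = x1 - g1 * (x1 - x0) / (g1 - g0)
        if abs(x2 - x1) < tol:
            return x2
        x0, x1 = x1, x2
    return x1

y_target = 4.0
x_found = secant(lambda x: interp(x) - y_target, 1.0, 3.0)  # expect x near 2
```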

Runge's Phenomenon and Chebyshev Nodes

When fitting a high-degree polynomial to a large set of equally spaced data points, severe oscillations can occur near the edges of the interval. This is known as Runge's Phenomenon.

Chebyshev Nodes

To mitigate Runge's Phenomenon, we can choose non-equally spaced points called Chebyshev nodes. These nodes are clustered near the ends of the interval and sparser in the center, which minimizes the maximum interpolation error across the domain:
$$x_k = \cos\left(\frac{2k-1}{2n}\pi\right), \quad k = 1, \dots, n$$
For an interval $[a, b]$, these points are linearly scaled from $[-1, 1]$. Using Chebyshev nodes with Newton's or Lagrange polynomials dramatically improves stability.
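
The node formula and the affine map to $[a, b]$ can be sketched directly (plain Python):

```python
import math

# Chebyshev nodes on [-1, 1], then linearly mapped onto [a, b].
def chebyshev_nodes(n, a=-1.0, b=1.0):
    nodes = [math.cos((2 * k - 1) * math.pi / (2 * n)) for k in range(1, n + 1)]
    # Affine map t in [-1, 1]  ->  x in [a, b]
    return [0.5 * (a + b) + 0.5 * (b - a) * t for t in nodes]

pts = chebyshev_nodes(5, 0.0, 10.0)  # clustered near 0 and 10, sparse mid-span
```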

Lagrange Interpolating Polynomials

A reformulation of Newton's polynomial that avoids the computation of divided differences. It is conceptually simpler but computationally more expensive if the order of the polynomial is unknown beforehand.
$$f_n(x) = \sum_{i=0}^{n} L_i(x) f(x_i)$$
$$L_i(x) = \prod_{j=0,\, j \neq i}^{n} \frac{x - x_j}{x_i - x_j}$$
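
The two formulas translate almost line for line into code (plain Python; the three sample points on $f(x) = x^2$ are made up for illustration):

```python
# Direct evaluation of the Lagrange form.
def lagrange_eval(xs, ys, x):
    total = 0.0
    n = len(xs)
    for i in range(n):
        Li = 1.0
        for j in range(n):
            if j != i:                        # product skips the j = i term
                Li *= (x - xs[j]) / (xs[i] - xs[j])
        total += Li * ys[i]
    return total

# Example: three points on f(x) = x^2 are reproduced exactly.
xs = [0.0, 1.0, 3.0]
ys = [0.0, 1.0, 9.0]
value = lagrange_eval(xs, ys, 2.0)
```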

Hermite Interpolation

Unlike standard interpolation (which only guarantees the curve passes through the given points), Hermite interpolation forces the polynomial to match both the function values and the first derivatives (slopes) at those points.

Hermite Polynomials

To interpolate $n$ points where both $f(x_i)$ and $f'(x_i)$ are specified, a polynomial of degree $2n-1$ is required. This ensures the resulting interpolating curve has smooth tangents at the data points, which is particularly useful in computer graphics and path planning.
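
For two points ($n = 2$, degree 3), the cubic Hermite basis gives a compact sketch (plain Python; the sample values and slopes, taken from $f(x) = x^2$, are illustrative):

```python
# Cubic Hermite interpolation on a single interval [x0, x1].
def hermite_cubic(x0, f0, d0, x1, f1, d1, x):
    h = x1 - x0
    t = (x - x0) / h                  # normalized coordinate in [0, 1]
    h00 = 2*t**3 - 3*t**2 + 1         # basis weight for f(x0)
    h10 = t**3 - 2*t**2 + t           # basis weight for f'(x0)
    h01 = -2*t**3 + 3*t**2            # basis weight for f(x1)
    h11 = t**3 - t**2                 # basis weight for f'(x1)
    return h00*f0 + h*h10*d0 + h01*f1 + h*h11*d1

# Values and slopes of f(x) = x^2 at x = 0 and x = 2; the cubic
# reproduces the quadratic exactly, so f(1) = 1.
val = hermite_cubic(0.0, 0.0, 0.0, 2.0, 4.0, 4.0, 1.0)
```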

Piecewise Linear Interpolation

Before moving to complex splines, the most basic form of spline interpolation is piecewise linear interpolation, which simply connects consecutive data points with straight lines (first-degree polynomials).
$$f_1(x) = f(x_{i-1}) + \frac{f(x_i) - f(x_{i-1})}{x_i - x_{i-1}} (x - x_{i-1})$$
While easy to implement, its primary drawback is that the first derivative (slope) is discontinuous at the knots, resulting in a curve with sharp corners rather than a smooth profile.
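
The formula above can be applied after locating the bracketing interval with a binary search (plain Python; the knot data is made up):

```python
import bisect

# Piecewise linear interpolation over sorted knots xs with values ys.
def pw_linear(xs, ys, x):
    i = bisect.bisect_right(xs, x)       # first knot strictly greater than x
    i = min(max(i, 1), len(xs) - 1)      # clamp to a valid interval [i-1, i]
    slope = (ys[i] - ys[i - 1]) / (xs[i] - xs[i - 1])
    return ys[i - 1] + slope * (x - xs[i - 1])

xs = [0.0, 1.0, 2.0, 4.0]                # illustrative knots
ys = [0.0, 2.0, 3.0, 3.0]
```

For array data, `numpy.interp` performs the same computation vectorized.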

Spline Interpolation

Using a single high-degree polynomial for a large number of points can cause severe oscillations (Runge's phenomenon). Spline interpolation applies lower-degree polynomials to subsets of data points, ensuring smooth transitions at the connection points (knots).

Cubic Splines

The most common type of spline. It fits a third-degree polynomial to each interval between data points. To ensure a smooth continuous curve across the entire data set, the cubic spline enforces matching conditions at the internal knots:
  • The function values must match at the knots (continuity).
  • The first derivatives must match at the knots (smooth slope).
  • The second derivatives must match at the knots (smooth curvature).
This results in a tridiagonal system of linear equations that can be solved very efficiently.
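
A sketch of that system for natural end conditions, where the unknowns are the second derivatives $M_i$ at the knots with $M_0 = M_n = 0$ (NumPy assumed; a dense solve stands in for the specialized tridiagonal Thomas algorithm, and the four knots are made up):

```python
import numpy as np

# Natural cubic spline: solve the tridiagonal system for the second
# derivatives M_i at the interior knots, with M_0 = M_n = 0.
def natural_spline_moments(xs, ys):
    n = len(xs) - 1                      # number of intervals
    h = np.diff(xs)                      # interval widths h_i
    A = np.zeros((n - 1, n - 1))
    rhs = np.zeros(n - 1)
    for i in range(1, n):                # one equation per interior knot
        if i > 1:
            A[i - 1, i - 2] = h[i - 1]   # sub-diagonal
        A[i - 1, i - 1] = 2.0 * (h[i - 1] + h[i])  # diagonal
        if i < n - 1:
            A[i - 1, i] = h[i]           # super-diagonal
        rhs[i - 1] = 6.0 * ((ys[i + 1] - ys[i]) / h[i]
                            - (ys[i] - ys[i - 1]) / h[i - 1])
    M = np.zeros(n + 1)                  # natural ends stay zero
    M[1:n] = np.linalg.solve(A, rhs)
    return M

xs = np.array([0.0, 1.0, 2.0, 3.0])      # illustrative knots
ys = np.array([0.0, 1.0, 0.0, 1.0])
M = natural_spline_moments(xs, ys)
```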

Spline Boundary Conditions

Because each of the $n$ intervals carries four unknown cubic coefficients while the interpolation and continuity conditions supply only $4n - 2$ equations, two extra boundary conditions at the first and last knots are required to close the system:
  • Natural Spline: The second derivative is set to zero at the endpoints ($f''(x_0) = f''(x_n) = 0$). This makes the curve "straight" at the ends.
  • Clamped Spline: The first derivatives (slopes) at the endpoints are specified by the user to match a known tangent.
  • Not-a-knot: Forces the third derivative to be continuous across the first and last internal knots, effectively treating the first two and last two intervals as single cubic functions.

Key Takeaways
  • Regression creates a general trend line that minimizes errors for noisy data, while Interpolation forces the curve to pass exactly through precise data points.
  • Regression uses Normal Equations to derive coefficients that minimize the sum of squared residuals.
  • The Coefficient of Determination ($R^2$) quantifies how well the regression line fits the data.
  • Linearization allows nonlinear models (exponential, power) to be fitted using linear regression techniques.
  • Increasing the polynomial degree too high leads to Overfitting, causing wild non-physical oscillations.
  • Extrapolation outside the data bounds is dangerous because fitted polynomials tend to diverge rapidly.
  • Runge's Phenomenon causes high-degree polynomials to oscillate on equally spaced points; using Chebyshev Nodes mitigates this.
  • Hermite Interpolation matches both the function values and their first derivatives at the data points.
  • Piecewise Linear Interpolation connects points with straight lines but has discontinuous slopes.
  • Orthogonal polynomials and Fourier series are powerful tools for continuous least-squares fitting, especially for periodic data.
  • Newton's and Lagrange polynomials provide global interpolation formulas but are susceptible to Runge's phenomenon for high degrees.
  • Inverse Interpolation finds the $x$-value for a given $f(x)$, often by swapping variables or using root-finding methods on the interpolant.
  • Cubic Splines provide local, piecewise continuous polynomials that maintain smooth slopes and curvatures across knots, avoiding large oscillations.
  • Spline systems require specific Boundary Conditions (like Natural, Clamped, or Not-a-knot) to fully solve the resulting tridiagonal matrix.