Descriptive Statistics

Measures of central tendency, dispersion, position, skewness, and kurtosis, including grouped data analysis.
Descriptive statistics summarize and organize the characteristics of a dataset. They provide simple quantitative summaries of a sample and its measurements, and they form the basis of virtually every quantitative analysis of data. For civil engineers, these statistics describe the fundamental properties of materials, environmental conditions, and structural behaviors.

Measures of Central Tendency

These measures indicate the "center" or typical value of a data set.

Mean (Arithmetic Average)

The sum of all values divided by the number of values. It incorporates every data point but is sensitive to extreme outliers (e.g., an unusually high compressive strength reading).
$$\bar{x} = \frac{\sum x_i}{n}$$

Median

The middle value when the data is sorted in ascending or descending order. If there is an even number of observations, it is the arithmetic average of the two middle values. The median is robust and not heavily influenced by extreme outliers, making it a better measure of center for skewed data (e.g., income, or highly variable soil permeability).

Mode

The value that appears most frequently in the data set. A distribution can be unimodal, bimodal (two distinct peaks), or multimodal. It is primarily useful for categorical data (nominal level).
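
As a minimal sketch of how a single outlier affects each measure, the snippet below uses Python's standard `statistics` module. The compressive strength readings and variable names are made-up assumptions for illustration, not data from this text.

```python
from statistics import mean, median, multimode

# Hypothetical compressive strength readings in MPa (illustrative values only).
strengths = [28.1, 29.4, 30.0, 30.0, 31.2]
with_outlier = strengths + [55.0]  # add one unusually high reading

mean_clean = mean(strengths)
mean_outlier = mean(with_outlier)

median_clean = median(strengths)
median_outlier = median(with_outlier)  # even count: average of the two middle values

modes = multimode(strengths)  # list of every most-frequent value

print(f"mean:    {mean_clean:.2f} -> {mean_outlier:.2f}")
print(f"median:  {median_clean:.2f} -> {median_outlier:.2f}")
print(f"mode(s): {modes}")
```

In this example the mean shifts by several MPa when the outlier is added, while the median does not move, which is the robustness property described above.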

Mean for Grouped Data

When a large dataset is available only as a frequency distribution table, the individual values are no longer known, so the exact mean cannot be calculated. Instead, we approximate it by treating every observation in a class as if it sat at the class midpoint.

Grouped Mean Formula

$$\bar{x}_{grouped} \approx \frac{\sum (f_i \cdot m_i)}{\sum f_i}$$
Where $f_i$ is the frequency of the $i^{th}$ class and $m_i$ is the midpoint of the $i^{th}$ class.
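
The grouped mean formula can be sketched in a few lines of Python. The frequency table below is hypothetical (class limits and counts are assumptions chosen for illustration), and `grouped_mean` is an illustrative helper name.

```python
# Hypothetical frequency table: (class lower limit, class upper limit, frequency).
classes = [
    (20, 25, 4),
    (25, 30, 10),
    (30, 35, 5),
    (35, 40, 1),
]

def grouped_mean(table):
    """Approximate the mean using class midpoints weighted by class frequency."""
    total_frequency = sum(f for _, _, f in table)
    weighted_sum = sum(f * (lower + upper) / 2 for lower, upper, f in table)
    return weighted_sum / total_frequency

x_bar_grouped = grouped_mean(classes)
print(f"grouped mean ~ {x_bar_grouped:.2f}")  # 565 / 20 = 28.25
```

The result is only an approximation because the true values inside each class are unknown; narrower classes generally reduce the error.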

Descriptive Statistics Explorer

Example dataset ($n = 5$): 5, 8, 12, 15, 20
  • Mean (Average): 12.00. The sum of all values divided by the number of values.
  • Median: 12.00. The middle value when the data is sorted.
  • Mode ($Mo$): None. The most frequently occurring value(s).
  • Range ($R$): 15.00
  • Std. Deviation ($s$): 5.87
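
The explorer values for the dataset 5, 8, 12, 15, 20 can be reproduced with Python's standard `statistics` module. This is a sketch, not the explorer's own implementation; it assumes the reported standard deviation is the sample form (denominator $n - 1$), which matches the value 5.87.

```python
from statistics import mean, median, multimode, stdev

data = [5, 8, 12, 15, 20]  # the explorer's dataset

summary = {
    "mean": mean(data),
    "median": median(data),
    # If every value occurs exactly once, report "no mode" as None.
    "mode": None if len(set(data)) == len(data) else multimode(data),
    "range": max(data) - min(data),
    "stdev": stdev(data),  # sample standard deviation (n - 1 denominator)
}

for name, value in summary.items():
    print(f"{name}: {value}")
```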
Key Takeaways
  • Mean: Arithmetic average, highly sensitive to extreme values or outliers.
  • Median: The exact middle value, robust against outliers, providing a better center for skewed data.
  • Mode: The most frequent value, useful for categorical analysis.

Measures of Dispersion (Variability)

These measures describe the spread, scatter, or variability of the data around the central value.
In engineering, variability is often synonymous with risk or uncertainty. High variance in concrete strength means a less reliable material.

Range

The difference between the maximum and minimum values in the dataset. It is a quick measure of total spread but is highly susceptible to extreme outliers.
$$R = x_{max} - x_{min}$$

Variance ($s^2$ or $\sigma^2$)

The average of the squared differences from the mean. Because the deviations are squared, variance is expressed in squared units (e.g., MPa²), and large deviations count disproportionately.
  • Population Variance ($\sigma^2$): Used when the dataset includes the entire population of interest.
$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$$
  • Sample Variance ($s^2$): Used when working with a sample to estimate the population variance. It uses $n-1$ (degrees of freedom) in the denominator to provide an unbiased estimate.
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$

Standard Deviation ($s$ or $\sigma$)

The positive square root of the variance. It is expressed in the same units as the original data (e.g., MPa, mm, seconds), making it far easier to interpret practically than variance.
$$s = \sqrt{s^2}$$
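
The variance and standard deviation formulas translate directly into code. The following is a minimal sketch written from the formulas above; the function names and the small dataset are illustrative assumptions.

```python
from math import sqrt

def sum_squared_deviations(xs):
    """Sum of squared differences between each value and the mean."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs)

def population_variance(xs):
    return sum_squared_deviations(xs) / len(xs)        # divide by N

def sample_variance(xs):
    return sum_squared_deviations(xs) / (len(xs) - 1)  # divide by n - 1

data = [5, 8, 12, 15, 20]  # small illustrative dataset

sigma2 = population_variance(data)
s2 = sample_variance(data)
s = sqrt(s2)

print(f"population variance = {sigma2:.2f}")  # 138 / 5
print(f"sample variance     = {s2:.2f}")      # 138 / 4
print(f"sample std. dev.    = {s:.2f}")
```

The sample variance is always the larger of the two for the same data, since the same sum of squares is divided by a smaller number.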

Coefficient of Variation ($CV$)

A measure of relative variability. It expresses the standard deviation as a percentage of the mean, allowing for the comparison of dispersion across datasets with different units or vastly different means.
$$CV = \frac{s}{\bar{x}} \times 100\%$$
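
A short sketch of comparing dispersion across different units with the coefficient of variation. Both datasets are hypothetical, and `coefficient_of_variation` is an illustrative helper; note that the measure is only meaningful when the mean is not close to zero.

```python
from statistics import mean, stdev

def coefficient_of_variation(xs):
    """Sample standard deviation expressed as a percentage of the mean."""
    return stdev(xs) / mean(xs) * 100

# Hypothetical measurements in different units (illustrative values only).
strength_mpa = [27.5, 30.1, 29.3, 31.8, 28.6, 30.7]  # concrete strength, MPa
cover_mm = [38, 42, 35, 45, 40, 47]                  # rebar cover depth, mm

cv_strength = coefficient_of_variation(strength_mpa)
cv_cover = coefficient_of_variation(cover_mm)

print(f"CV of strength: {cv_strength:.1f}%")
print(f"CV of cover:    {cv_cover:.1f}%")
```

Because the units cancel, converting the strengths from MPa to kPa would leave the CV unchanged, which is what makes the two values directly comparable.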

The Empirical Rule and Chebyshev's Theorem

Rules for interpreting standard deviation relative to the mean.

The Empirical Rule (Normal Distributions)

If the data distribution is approximately bell-shaped (normal):
  • Approximately 68% of the data falls within one standard deviation of the mean ($\bar{x} \pm 1s$).
  • Approximately 95% of the data falls within two standard deviations ($\bar{x} \pm 2s$).
  • Approximately 99.7% of the data falls within three standard deviations ($\bar{x} \pm 3s$).
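
The three percentages can be checked against the normal cumulative distribution function. This sketch uses `statistics.NormalDist` from the Python standard library; `share_within` is an illustrative helper name.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

def share_within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

shares = {k: share_within(k) for k in (1, 2, 3)}

for k, p in shares.items():
    print(f"within {k} standard deviation(s): {p:.4f}")
```

The exact values are slightly different from the rounded rule (for example, about 95.45% rather than 95% for two standard deviations), which is why the rule is stated with "approximately".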

Chebyshev's Theorem (Any Distribution)

For any set of data (regardless of the shape of the distribution), the proportion of values that lie within $k$ standard deviations of the mean is at least $1 - \frac{1}{k^2}$, where $k > 1$.
  • For $k=2$: At least 75% of the data falls within $\bar{x} \pm 2s$.
  • For $k=3$: At least 88.9% of the data falls within $\bar{x} \pm 3s$.
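
As a sketch, the Chebyshev bound can be compared with the proportion actually observed in a dataset. The right-skewed values below are hypothetical, and both function names are illustrative.

```python
from statistics import mean, stdev

def chebyshev_bound(k):
    """Minimum proportion of data within k standard deviations of the mean (k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's bound is only informative for k > 1")
    return 1 - 1 / k**2

def observed_share(xs, k):
    """Observed proportion of xs within k sample standard deviations of the mean."""
    m, s = mean(xs), stdev(xs)
    return sum(abs(x - m) <= k * s for x in xs) / len(xs)

# Hypothetical right-skewed data (illustrative values only).
skewed = [1, 1, 2, 2, 2, 3, 3, 4, 6, 25]

for k in (2, 3):
    print(f"k={k}: bound {chebyshev_bound(k):.3f}, observed {observed_share(skewed, k):.3f}")
```

The bound is a guaranteed minimum, so the observed share is typically well above it, even for strongly skewed data such as this example.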
Key Takeaways
  • Range: Quick measure of the total spread, easily skewed by outliers.
  • Variance: Average squared deviation from the mean (use $n-1$ for sample variance to correct for bias).
  • Standard Deviation: The most common measure of spread, expressed in original data units.
  • Empirical Rule: Useful heuristic for bell-shaped distributions (68-95-99.7 rule).

Measures of Position

These describe the relative location of a specific data value within the entire dataset.

Percentiles

Values that divide a sorted dataset into 100 equal parts. The $k^{th}$ percentile ($P_k$) is a value such that at least $k\%$ of the observations are less than or equal to this value, and at most $(100-k)\%$ are greater.
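
One simple way to compute a percentile that matches this definition is the nearest-rank method, sketched below. This is only one of several conventions (others interpolate between neighboring values), and `percentile_nearest_rank` is an illustrative name.

```python
from math import ceil

def percentile_nearest_rank(xs, k):
    """Smallest data value with at least k% of observations at or below it (0 < k <= 100)."""
    if not 0 < k <= 100:
        raise ValueError("k must be in the interval (0, 100]")
    ordered = sorted(xs)
    rank = ceil(k * len(ordered) / 100)  # 1-based position in the sorted data
    return ordered[rank - 1]

data = [5, 8, 12, 15, 20]  # small illustrative dataset

p50 = percentile_nearest_rank(data, 50)
p90 = percentile_nearest_rank(data, 90)

print(f"P50 = {p50}, P90 = {p90}")
```

With only five observations the percentiles are coarse; nearest-rank always returns an actual data value, whereas interpolating methods can return values between observations.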

Quartiles and the Five-Number Summary

Values that divide the sorted data into four equal parts. They form the basis of the Five-Number Summary (Min, $Q_1$, Median, $Q_3$, Max) and the Box Plot visualization.
  • $Q_1$ (First Quartile): 25th percentile ($P_{25}$)
  • $Q_2$ (Second Quartile): 50th percentile (Median, $P_{50}$)
  • $Q_3$ (Third Quartile): 75th percentile ($P_{75}$)

Interquartile Range (IQR) and Outliers

The range of the middle 50% of the sorted data. It is a robust measure of variability.
$$IQR = Q_3 - Q_1$$
Outlier Detection: Data points are typically considered outliers if they fall below $Q_1 - 1.5(IQR)$ or above $Q_3 + 1.5(IQR)$.
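
The five-number summary and the $1.5 \times IQR$ rule can be sketched with `statistics.quantiles` from the Python standard library. The readings are hypothetical, and the choice of the `"inclusive"` quartile method is an assumption: different quartile conventions can give slightly different values of $Q_1$ and $Q_3$.

```python
from statistics import quantiles

def five_number_summary(xs):
    """Min, Q1, median, Q3, max using the 'inclusive' quartile convention."""
    q1, q2, q3 = quantiles(xs, n=4, method="inclusive")
    return min(xs), q1, q2, q3, max(xs)

def iqr_outliers(xs):
    """Values that fall outside the 1.5 * IQR fences."""
    _, q1, _, q3, _ = five_number_summary(xs)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [x for x in xs if x < lower_fence or x > upper_fence]

# Hypothetical readings (illustrative values only).
readings = [28, 29, 30, 30, 31, 32, 33, 55]

summary = five_number_summary(readings)
outliers = iqr_outliers(readings)

print("five-number summary:", summary)
print("outliers:", outliers)
```

Here the fences work out to 26.0 and 36.0, so only the reading of 55 is flagged.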
Key Takeaways
  • Percentiles: Indicate relative standing (e.g., scoring in the 90th percentile).
  • Quartiles: Divide the dataset into quarters ($Q_1, Q_2, Q_3$).
  • IQR: The spread of the middle half of the data, critical for robust outlier detection using the $1.5 \times IQR$ rule.

Skewness and Kurtosis

These measures describe the shape of the data's distribution compared to a standard normal (bell-shaped) curve.

Skewness

A measure of the asymmetry of the probability distribution of a real-valued random variable about its mean.
  • Positive Skew (Right-Skewed): The right tail is longer or fatter; the mass of the distribution is concentrated on the left. The Mean is typically greater than the Median.
  • Negative Skew (Left-Skewed): The left tail is longer or fatter; the mass of the distribution is concentrated on the right. The Mean is typically less than the Median.
  • Zero Skew: The distribution is perfectly symmetric (e.g., normal distribution). Mean equals Median.
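
The text does not give a formula for skewness, so the sketch below assumes one common definition, the moment coefficient $m_3 / m_2^{3/2}$ (population form, with no small-sample bias correction). The datasets and function names are illustrative.

```python
from statistics import mean, median

def central_moment(xs, order):
    """Average of (x - mean) raised to the given power."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** order for x in xs) / len(xs)

def skewness(xs):
    """Moment coefficient of skewness: m3 / m2**1.5 (no bias correction)."""
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5

# Hypothetical datasets (illustrative values only).
symmetric = [2, 4, 6, 8, 10]
right_skewed = [1, 1, 2, 2, 2, 3, 3, 4, 6, 25]
left_skewed = [-x for x in right_skewed]  # mirror image of the right-skewed data

for name, xs in [("symmetric", symmetric), ("right", right_skewed), ("left", left_skewed)]:
    print(f"{name:>9}: skewness={skewness(xs):+.2f}, mean={mean(xs):.2f}, median={median(xs):.2f}")
```

In these examples the sign of the coefficient matches the direction of the longer tail, and the mean sits on the tail side of the median.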

Kurtosis

A measure of the "tailedness" (heavy or light tails) of the probability distribution. It describes how much of the data is clustered in the extreme tails versus the center, relative to a normal distribution.
  • Leptokurtic (High Kurtosis, $> 3$): Heavier tails than a normal distribution, often accompanied by a sharper, higher peak. Indicates a higher propensity for extreme outliers (critical in risk assessment for extreme loads).
  • Platykurtic (Low Kurtosis, $< 3$): Lighter tails than a normal distribution, often with a flatter peak. Fewer extreme outliers.
  • Mesokurtic (Kurtosis $\approx 3$): The kurtosis of a normal distribution.
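
As with skewness, the text gives no formula, so this sketch assumes the moment coefficient $m_4 / m_2^2$, for which a normal distribution gives about 3. Some software instead reports excess kurtosis (this value minus 3), so it is worth checking which convention a tool uses. The datasets and function names are illustrative.

```python
def central_moment(xs, order):
    """Average of (x - mean) raised to the given power."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** order for x in xs) / len(xs)

def kurtosis(xs):
    """Moment coefficient of kurtosis: m4 / m2**2 (not excess kurtosis)."""
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2

# Hypothetical datasets (illustrative values only).
flat = [1, 2, 3, 4, 5, 6, 7, 8, 9]                   # evenly spread, light tails
heavy_tailed = [-10, -1, -1, 0, 0, 0, 0, 1, 1, 10]   # mostly central, two extremes

k_flat = kurtosis(flat)
k_heavy = kurtosis(heavy_tailed)

print(f"flat data:         kurtosis = {k_flat:.2f}")
print(f"heavy-tailed data: kurtosis = {k_heavy:.2f}")
```

The evenly spread data falls below 3 (platykurtic) and the data with two extreme values rises above 3 (leptokurtic), driven almost entirely by the two points in the tails.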

Distribution Shape Visualizer

[Interactive chart: skewness presets (Negative Skew, Symmetric, Positive Skew) and kurtosis presets (Platykurtic, Mesokurtic, Leptokurtic).]
Key Takeaways
  • Skewness: Indicates whether data is asymmetric to the left or right of the mean.
  • Kurtosis: Measures the extremity of tails in a distribution, helping engineers predict the likelihood and severity of extreme, outlier events.