Descriptive Statistics

Measures of central tendency, dispersion, position, skewness, and kurtosis, including grouped data analysis.

Descriptive statistics summarize and organize the characteristics of a dataset. They provide simple quantitative summaries of a sample and its measurements, and they form the basis of virtually every quantitative analysis of data. For civil engineers, these statistics describe the fundamental properties of materials, environmental conditions, and structural behaviors.

Measures of Central Tendency

These measures indicate the "center" or typical value of a data set.

Mean (Arithmetic Average)

The sum of all values divided by the number of values. It incorporates every data point but is sensitive to extreme outliers (e.g., an unusually high compressive strength reading).

\bar{x} = \frac{\sum x_i}{n}

Median

The middle value when the data is sorted in ascending or descending order. If there is an even number of observations, it is the arithmetic average of the two middle values. The median is robust and not heavily influenced by extreme outliers, making it a better measure of center for skewed data (e.g., income, or highly variable soil permeability).

Mode

The value that appears most frequently in the data set. A distribution can be unimodal, bimodal (two distinct peaks), or multimodal. It is primarily useful for categorical data (nominal level).
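The three measures above can be compared on one small sample. The following is a minimal sketch using Python's standard `statistics` module; the strength readings are hypothetical values chosen so that a single high reading pulls the mean away from the median.

```python
import statistics

# Hypothetical compressive strength readings in MPa (illustrative values only).
# The last reading is an unusually high outlier.
strengths = [28.0, 30.0, 30.0, 31.0, 33.0, 52.0]

mean_strength = statistics.mean(strengths)      # sum of values / number of values
median_strength = statistics.median(strengths)  # average of the two middle values (even n)
modes = statistics.multimode(strengths)         # list of all most frequent values

print(mean_strength, median_strength, modes)
```

Here the outlier raises the mean to 34.0 MPa while the median stays at 30.5 MPa, which is closer to the bulk of the readings.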

Mean for Grouped Data

When a large dataset is available only as a frequency distribution table, the individual values are unknown, so the exact mean cannot be calculated. Instead, it is approximated by treating every observation in a class as if it sat at the class midpoint.

Grouped Mean Formula

\bar{x}_{grouped} \approx \frac{\sum (f_i \cdot m_i)}{\sum f_i}

where $f_i$ is the frequency of the $i^{th}$ class and $m_i$ is the midpoint of the $i^{th}$ class.
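As a rough illustration of the grouped mean, the sketch below weights each class midpoint by its frequency. The frequency table and the helper name `grouped_mean` are assumptions made for this example, not data from the text.

```python
# Hypothetical frequency table: (lower class limit, upper class limit, frequency).
classes = [(20, 25, 4), (25, 30, 10), (30, 35, 5), (35, 40, 1)]

def grouped_mean(table):
    """Approximate the mean from class midpoints m_i and frequencies f_i."""
    total_frequency = sum(f for _, _, f in table)
    weighted_sum = sum(f * (low + high) / 2 for low, high, f in table)
    return weighted_sum / total_frequency

print(grouped_mean(classes))
```

The midpoints are 22.5, 27.5, 32.5, and 37.5, so the weighted sum is 565 over 20 observations, giving an approximate mean of 28.25.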

Key Takeaways

Mean: Arithmetic average, highly sensitive to extreme values or outliers.
Median: The middle value of the sorted data, robust against outliers, and a better measure of center for skewed data.
Mode: The most frequent value, useful for categorical analysis.

Measures of Dispersion (Variability)

These measures describe the spread, scatter, or variability of the data around the central value.

In engineering, variability is often synonymous with risk or uncertainty. High variance in concrete strength means a less reliable material.

Range

The difference between the maximum and minimum values in the dataset. It is a quick measure of total spread but is highly susceptible to extreme outliers.

R = x_{max} - x_{min}

Variance ( $s^{2}$ or $\sigma^2$ )

The average of the squared deviations from the mean. Because the deviations are squared, variance is expressed in squared units (e.g., MPa²) and gives extra weight to points far from the center.

Population Variance ( $\sigma^2$ ): Used when the dataset includes the entire population of interest.

\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}

Sample Variance ( $s^2$ ): Used when working with a sample to estimate the population variance. It uses $n-1$ (degrees of freedom) in the denominator to provide an unbiased estimate.

s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}

Standard Deviation ( $s$ or $\sigma$ )

The positive square root of the variance. It is expressed in the same units as the original data (e.g., MPa, mm, seconds), making it far easier to interpret practically than variance.

s = \sqrt{s^2}
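The difference between the two variance formulas is easiest to see side by side. This is a small sketch on hypothetical measurements; the variable names are illustrative, and the result is cross-checked against `statistics.stdev`, which uses the $n-1$ denominator.

```python
import math
import statistics

# Hypothetical measurements (illustrative values only).
data = [5.0, 8.0, 12.0, 15.0, 20.0]

n = len(data)
mean = sum(data) / n
sum_sq_dev = sum((x - mean) ** 2 for x in data)  # sum of squared deviations

population_variance = sum_sq_dev / n        # sigma^2: divide by N
sample_variance = sum_sq_dev / (n - 1)      # s^2: divide by n - 1
sample_std = math.sqrt(sample_variance)     # s: same units as the data

print(population_variance, sample_variance, round(sample_std, 2))
# Cross-check against the standard library's sample standard deviation.
print(math.isclose(sample_std, statistics.stdev(data)))
```

For these five values the sum of squared deviations is 138, so the population formula gives 27.6 and the sample formula gives 34.5. The sample estimate is always the larger of the two for the same data.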

Coefficient of Variation ( $C V$ )

A measure of relative variability. It expresses the standard deviation as a percentage of the mean, allowing for the comparison of dispersion across datasets with different units or vastly different means.

CV = \frac{s}{\bar{x}} \times 100\%
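To illustrate why the coefficient of variation helps when units differ, the sketch below compares two hypothetical datasets, one in MPa and one in mm. The helper name `coefficient_of_variation` and both datasets are assumptions for this example; the function uses the sample standard deviation and presumes a nonzero mean.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation expressed as a percentage of the mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100

# Hypothetical data in different units (illustrative values only).
strength_mpa = [28.0, 30.0, 32.0]   # mean 30 MPa, s = 2 MPa
slump_mm = [70.0, 100.0, 130.0]     # mean 100 mm, s = 30 mm

cv_strength = coefficient_of_variation(strength_mpa)
cv_slump = coefficient_of_variation(slump_mm)

print(round(cv_strength, 2), round(cv_slump, 2))
```

The standard deviations (2 MPa and 30 mm) cannot be compared directly, but the CV values (about 6.67% and 30%) show that the slump readings are relatively more variable.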

The Empirical Rule and Chebyshev's Theorem

Rules for interpreting standard deviation relative to the mean.

The Empirical Rule (Normal Distributions)

If the data distribution is approximately bell-shaped (normal):

Approximately 68% of the data falls within one standard deviation of the mean ( $\bar{x} \pm 1s$ ).
Approximately 95% of the data falls within two standard deviations ( $\bar{x} \pm 2s$ ).
Approximately 99.7% of the data falls within three standard deviations ( $\bar{x} \pm 3s$ ).

Chebyshev's Theorem (Any Distribution)

For any set of data (regardless of the shape of the distribution), the proportion of values that lie within $k$ standard deviations of the mean is at least $1 - \frac{1}{k^2}$ , where $k > 1$ .

For $k=2$ : At least 75% of the data falls within $\bar{x} \pm 2s$ .
For $k=3$ : At least 88.9% of the data falls within $\bar{x} \pm 3s$ .
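Chebyshev's bound can be checked against an observed sample. The following is a sketch only: the readings are hypothetical and deliberately include one extreme value, and the function names are illustrative.

```python
import statistics

def chebyshev_lower_bound(k):
    """Minimum proportion of data within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k**2

def proportion_within(values, k):
    """Observed proportion of values within k sample standard deviations of the mean."""
    center = statistics.mean(values)
    spread = statistics.stdev(values)
    inside = [x for x in values if abs(x - center) <= k * spread]
    return len(inside) / len(values)

# Hypothetical readings with one extreme value (not bell-shaped).
readings = [10, 11, 12, 12, 13, 13, 14, 15, 16, 40]

for k in (2, 3):
    print(k, proportion_within(readings, k), chebyshev_lower_bound(k))
```

For this sample, 90% of the readings fall within two standard deviations, which satisfies the 75% guarantee but falls short of the 95% suggested by the Empirical Rule, as expected for data that are not bell-shaped.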

Key Takeaways

Range: Quick measure of the total spread, easily skewed by outliers.
Variance: Average squared deviation from the mean (use $n-1$ for sample variance to correct for bias).
Standard Deviation: The most common measure of spread, expressed in original data units.
Empirical Rule: Useful heuristic for bell-shaped distributions (68-95-99.7 rule).

Measures of Position

These describe the relative location of a specific data value within the entire dataset.

Percentiles

Values that divide a sorted dataset into 100 equal parts. The $k^{th}$ percentile ( $P_k$ ) is a value such that at least $k\%$ of the observations are less than or equal to it and at least $(100-k)\%$ are greater than or equal to it.
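Several conventions exist for computing percentiles from a finite sample, and software packages do not all agree. The sketch below uses the nearest-rank convention, chosen here as an assumption because it maps directly onto the definition above; the data and the function name are hypothetical.

```python
import math

def percentile_nearest_rank(values, k):
    """Smallest data value with at least k% of observations at or below it."""
    if not 0 < k <= 100:
        raise ValueError("k must be in the interval (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(k * len(ordered) / 100)  # 1-based position in the sorted data
    return ordered[rank - 1]

# Hypothetical readings (illustrative values only).
readings = [12, 15, 18, 21, 24, 27, 30, 33, 36, 39]

print(percentile_nearest_rank(readings, 25))  # rank ceil(2.5) = 3
print(percentile_nearest_rank(readings, 90))  # rank 9
```

Interpolating methods, which many libraries use by default, can return values that lie between data points, so results may differ slightly from this convention.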

Quartiles and the Five-Number Summary

Values that divide the sorted data into four equal parts. They form the basis of the Five-Number Summary (Min, $Q_1$ , Median, $Q_3$ , Max) and the Box Plot visualization.

$Q_1$ (First Quartile): 25th percentile ( $P_{25}$ )
$Q_2$ (Second Quartile): 50th percentile (Median, $P_{50}$ )
$Q_3$ (Third Quartile): 75th percentile ( $P_{75}$ )

Interquartile Range (IQR) and Outliers

The range of the middle 50% of the sorted data. It is a robust measure of variability.

IQR = Q_3 - Q_1

Outlier Detection: Data points are typically considered outliers if they fall below $Q_1 - 1.5(IQR)$ or above $Q_3 + 1.5(IQR)$ .
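The quartile and outlier rules can be sketched in a few lines. This example assumes the median-of-halves convention for $Q_1$ and $Q_3$ (other conventions give slightly different quartiles) and uses the eight values from this page's box plot (20 through 95); the function names are illustrative.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves convention."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For odd n, the overall median is excluded from both halves.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return (ordered[0], statistics.median(lower), statistics.median(ordered),
            statistics.median(upper), ordered[-1])

def iqr_outliers(values):
    """Values outside the fences Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in values if x < low_fence or x > high_fence]

data = [20, 35, 40, 50, 55, 60, 75, 95]
print(five_number_summary(data))   # min, Q1, median, Q3, max
print(iqr_outliers(data))          # no values beyond the fences
print(iqr_outliers(data + [200]))  # a hypothetical extreme reading is flagged
```

With these values the quartiles are 37.5, 52.5, and 67.5, the IQR is 30, and the fences are -7.5 and 112.5, so none of the original eight values is flagged.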

Key Takeaways

Percentiles: Indicate relative standing (e.g., scoring in the 90th percentile).
Quartiles: Divide the dataset into quarters ( $Q_1, Q_2, Q_3$ ).
IQR: The spread of the middle half of the data, critical for robust outlier detection using the $1.5 \times IQR$ rule.

Box and whisker plot example

Data values: 20, 35, 40, 50, 55, 60, 75, 95

Median ( $Q_2$ ): 52.5
$Q_1$ : 37.5
$Q_3$ : 67.5
IQR: 30.0

Outliers are values beyond the fences $[Q_1 - 1.5(IQR), Q_3 + 1.5(IQR)]$ . This dataset has no outliers.

Skewness and Kurtosis

These measures describe the shape of the data's distribution compared to a standard normal (bell-shaped) curve.

Skewness

A measure of the asymmetry of the probability distribution of a real-valued random variable about its mean.

Positive Skew (Right-Skewed): The right tail is longer or fatter; the mass of the distribution is concentrated on the left. The Mean is typically greater than the Median.
Negative Skew (Left-Skewed): The left tail is longer or fatter; the mass of the distribution is concentrated on the right. The Mean is typically less than the Median.
Zero Skew: The distribution is perfectly symmetric (e.g., normal distribution). Mean equals Median.

Kurtosis

A measure of the "tailedness" (heavy or light tails) of the probability distribution. It describes how much of the data is clustered in the extreme tails versus the center, relative to a normal distribution.

Leptokurtic (High Kurtosis, $>3$ ): Heavy tails and a sharper, higher peak compared to a normal distribution. Indicates a higher propensity for extreme outliers (critical in risk assessment for extreme loads).
Platykurtic (Low Kurtosis, $<3$ ): Light tails and a flatter peak. Fewer extreme outliers.
Mesokurtic (Kurtosis $\approx 3$ ): The kurtosis of any normal distribution, used as the reference value.
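The text gives no formulas for these shape measures, so the sketch below assumes the common moment-based definitions: skewness as $m_3 / m_2^{3/2}$ and kurtosis as $m_4 / m_2^2$, where $m_r$ is the $r^{th}$ central moment computed with a divisor of $n$. Kurtosis is reported on the scale where a normal distribution scores 3, matching the thresholds above. Other estimators (bias-corrected, or excess kurtosis with 3 subtracted) are also in use.

```python
def central_moment(values, order):
    """r-th central moment with divisor n (population-style estimator)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** order for x in values) / n

def skewness(values):
    """Moment-based skewness: m3 / m2^(3/2)."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def kurtosis(values):
    """Moment-based kurtosis (not excess): m4 / m2^2, equal to 3 for a normal curve."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2

# Hypothetical samples (illustrative values only).
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]

print(skewness(symmetric), kurtosis(symmetric))
print(skewness(right_skewed) > 0)
```

The evenly spaced symmetric sample has zero skewness and a kurtosis of 1.7, which is platykurtic, while the sample with one large value has positive skewness.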

Key Takeaways

Skewness: Indicates whether data is asymmetric to the left or right of the mean.
Kurtosis: Measures the extremity of tails in a distribution, helping engineers predict the likelihood and severity of extreme, outlier events.
