Descriptive Statistics
Measures of central tendency, dispersion, position, skewness, and kurtosis, including grouped data analysis.
Descriptive statistics summarize and organize the characteristics of a dataset. They provide simple quantitative summaries of a sample and its measurements, and they form the basis of virtually every quantitative analysis of data. For civil engineers, these statistics describe the fundamental properties of materials, environmental conditions, and structural behaviors.
Measures of Central Tendency
These measures indicate the "center" or typical value of a data set.
Mean (Arithmetic Average)
The sum of all values divided by the number of values. It incorporates every data point but is sensitive to extreme outliers (e.g., an unusually high compressive strength reading).
Median
The middle value when the data is sorted in ascending or descending order. If there is an even number of observations, it is the arithmetic average of the two middle values. The median is robust and not heavily influenced by extreme outliers, making it a better measure of center for skewed data (e.g., income, or highly variable soil permeability).
Mode
The value that appears most frequently in the data set. A distribution can be unimodal, bimodal (two distinct peaks), or multimodal. It is primarily useful for categorical data (nominal level).
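As a minimal sketch of the three measures above, the snippet below uses Python's standard `statistics` module. The strength readings are hypothetical values chosen only for illustration; the single high reading shows how an outlier pulls the mean upward while the median barely moves.

```python
import statistics

# Hypothetical compressive strength readings in MPa (illustrative values only).
# The last reading is an unusually high outlier.
strengths = [28.0, 30.0, 30.0, 31.0, 33.0, 58.0]

mean = statistics.mean(strengths)        # sum of all values / number of values
median = statistics.median(strengths)    # even n: average of the two middle values
modes = statistics.multimode(strengths)  # list of every most-frequent value

print(f"mean = {mean:.2f} MPa")      # pulled upward by the 58.0 reading
print(f"median = {median:.2f} MPa")  # stays near the bulk of the data
print(f"mode(s) = {modes}")
```

`statistics.multimode` returns a list, so a bimodal dataset would yield two values rather than an arbitrary one.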
Mean for Grouped Data
When a large dataset is available only as a frequency distribution table, the individual values are lost, so the exact mean cannot be calculated. Instead, we approximate it by treating every observation in a class as if it sat at the class midpoint.
Grouped Mean Formula
$$\bar{x} \approx \frac{\sum f_i x_i}{\sum f_i}$$
Where $f_i$ is the frequency of the $i$-th class and $x_i$ is the midpoint of that class.
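The midpoint approximation can be sketched as follows. The frequency table is hypothetical, and the names `classes` and `grouped_mean` are introduced here purely for illustration; each class contributes its frequency times its midpoint, and the total is divided by the total frequency.

```python
# Hypothetical frequency table: (lower class limit, upper class limit, frequency).
classes = [(20, 25, 4), (25, 30, 10), (30, 35, 5), (35, 40, 1)]

def grouped_mean(table):
    """Approximate the mean from class midpoints weighted by class frequency."""
    total_frequency = sum(f for _, _, f in table)
    weighted_sum = sum(f * (lower + upper) / 2 for lower, upper, f in table)
    return weighted_sum / total_frequency

approx_mean = grouped_mean(classes)
print(f"approximate grouped mean = {approx_mean:.2f}")
```

The result is only an approximation: it would match the exact mean only if the values in each class happened to average out to the class midpoint.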
Worked Example
Dataset (n = 5): 5, 8, 12, 15, 20
- Mean ($\bar{x}$): 12.00
- Median ($\tilde{x}$): 12.00
- Mode ($Mo$): None (no value repeats)
- Range ($R$): 15.00
- Sample Standard Deviation ($s$): 5.87
Key Takeaways
- Mean: Arithmetic average, highly sensitive to extreme values or outliers.
- Median: The exact middle value, robust against outliers, providing a better center for skewed data.
- Mode: The most frequent value, useful for categorical analysis.
Measures of Dispersion (Variability)
These measures describe the spread, scatter, or variability of the data around the central value.
In engineering, variability is often synonymous with risk or uncertainty. High variance in concrete strength means a less reliable material.
Range
The difference between the maximum and minimum values in the dataset. It is a quick measure of total spread but is highly susceptible to extreme outliers.
Variance ($\sigma^2$ or $s^2$)
The average of the squared differences from the mean. It quantifies the average squared distance of each data point from the center.
- Population Variance ($\sigma^2$): Used when the dataset includes the entire population of interest. $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$
- Sample Variance ($s^2$): Used when working with a sample to estimate the population variance. It uses $n - 1$ (degrees of freedom) in the denominator to provide an unbiased estimate. $s^2 = \frac{1}{n - 1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
Standard Deviation ($\sigma$ or $s$)
The positive square root of the variance. It is expressed in the same units as the original data (e.g., MPa, mm, seconds), making it far easier to interpret practically than variance.
Coefficient of Variation ($CV$)
A measure of relative variability. It expresses the standard deviation as a percentage of the mean, $CV = \frac{s}{\bar{x}} \times 100\%$, allowing for the comparison of dispersion across datasets with different units or vastly different means.
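The dispersion measures above can be computed by hand in a few lines. This is a sketch on a small illustrative dataset; the helper name `variance` and its `sample` flag are assumptions made for this example, not part of any library.

```python
import math

data = [5, 8, 12, 15, 20]  # small illustrative dataset

def variance(values, sample=True):
    """Mean squared deviation; divides by n - 1 for a sample, n for a population."""
    n = len(values)
    mean = sum(values) / n
    sum_of_squares = sum((x - mean) ** 2 for x in values)
    return sum_of_squares / (n - 1) if sample else sum_of_squares / n

data_range = max(data) - min(data)
sample_var = variance(data)                    # denominator n - 1
population_var = variance(data, sample=False)  # denominator n
sample_sd = math.sqrt(sample_var)              # same units as the data
cv_percent = sample_sd / (sum(data) / len(data)) * 100

print(f"range = {data_range}")
print(f"s^2 = {sample_var:.2f}, sigma^2 = {population_var:.2f}")
print(f"s = {sample_sd:.2f}, CV = {cv_percent:.1f}%")
```

For real work, `statistics.variance`, `statistics.pvariance`, and `statistics.stdev` in the standard library cover the same calculations.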
The Empirical Rule and Chebyshev's Theorem
Rules for interpreting standard deviation relative to the mean.
The Empirical Rule (Normal Distributions)
If the data distribution is approximately bell-shaped (normal):
- Approximately 68% of the data falls within one standard deviation of the mean ($\mu \pm \sigma$).
- Approximately 95% of the data falls within two standard deviations ($\mu \pm 2\sigma$).
- Approximately 99.7% of the data falls within three standard deviations ($\mu \pm 3\sigma$).
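The 68-95-99.7 figures can be checked against the exact normal distribution. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `coverage_within` is introduced here only for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def coverage_within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {coverage_within(k):.4f}")
```

The percentages quoted in the rule are rounded versions of these probabilities, and they apply only to the extent that the data really are close to normal.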
Chebyshev's Theorem (Any Distribution)
For any set of data (regardless of the shape of the distribution), the proportion of values that lie within $k$ standard deviations of the mean is at least $1 - \frac{1}{k^2}$, where $k > 1$.
- For $k = 2$: At least 75% of the data falls within $\mu \pm 2\sigma$.
- For $k = 3$: At least 88.9% of the data falls within $\mu \pm 3\sigma$.
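To illustrate that Chebyshev's bound holds even for badly skewed data, the sketch below compares the guaranteed minimum with the fraction actually observed. The dataset and both function names are hypothetical, and the population standard deviation of the dataset itself is used.

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of data within k standard deviations of the mean (k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's bound is only informative for k > 1")
    return 1 - 1 / k**2

def fraction_within(values, k):
    """Observed proportion of values within k population standard deviations of the mean."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((x - mean) ** 2 for x in values) / n) ** 0.5
    return sum(1 for x in values if abs(x - mean) <= k * sd) / n

# Hypothetical, strongly right-skewed dataset (illustrative values only).
skewed = [1, 1, 2, 2, 3, 3, 4, 5, 9, 40]

for k in (2, 3):
    bound = chebyshev_lower_bound(k)
    observed = fraction_within(skewed, k)
    print(f"k = {k}: guaranteed at least {bound:.3f}, observed {observed:.3f}")
```

The observed fraction is usually well above the bound; Chebyshev's theorem is a worst-case guarantee, not an estimate.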
Key Takeaways
- Range: Quick measure of the total spread, easily skewed by outliers.
- Variance: Average squared deviation from the mean (use $n - 1$ for sample variance to correct for bias).
- Standard Deviation: The most common measure of spread, expressed in original data units.
- Empirical Rule: Useful heuristic for bell-shaped distributions (68-95-99.7 rule).
Measures of Position
These describe the relative location of a specific data value within the entire dataset.
Percentiles
Values that divide a sorted dataset into 100 equal parts. The $p$-th percentile ($P_p$) is a value such that at least $p\%$ of the observations are less than or equal to this value, and at least $(100 - p)\%$ are greater than or equal to it.
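Several percentile conventions are in use, and software packages differ in how they interpolate. As one simple option, the sketch below implements the nearest-rank method, which always returns an actual data value and satisfies the "at least that share of observations at or below" definition. The function name is an assumption made for this example.

```python
import math

def percentile_nearest_rank(values, p):
    """Return the p-th percentile (0 < p <= 100) using the nearest-rank method."""
    if not 0 < p <= 100:
        raise ValueError("p must be in the interval (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based position in the sorted data
    return ordered[rank - 1]

data = [5, 8, 12, 15, 20]  # small illustrative dataset

p25 = percentile_nearest_rank(data, 25)
p50 = percentile_nearest_rank(data, 50)
p90 = percentile_nearest_rank(data, 90)
print(p25, p50, p90)
```

Interpolating methods, such as those used by spreadsheet software, can give slightly different answers for small datasets.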
Quartiles and the Five-Number Summary
Values that divide the sorted data into four equal parts. They form the basis of the Five-Number Summary (Min, $Q_1$, Median, $Q_3$, Max) and the Box Plot visualization.
- $Q_1$ (First Quartile): 25th percentile ($P_{25}$)
- $Q_2$ (Second Quartile): 50th percentile (Median, $P_{50}$)
- $Q_3$ (Third Quartile): 75th percentile ($P_{75}$)
Interquartile Range (IQR) and Outliers
The range of the middle 50% of the sorted data, $IQR = Q_3 - Q_1$. It is a robust measure of variability.
Outlier Detection: Data points are typically considered outliers if they fall below $Q_1 - 1.5 \times IQR$ or above $Q_3 + 1.5 \times IQR$.
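The fence rule can be sketched with `statistics.quantiles` from the Python standard library. The readings are hypothetical, and the quartiles come from that function's default ("exclusive") method; other quartile conventions may shift $Q_1$ and $Q_3$ slightly.

```python
import statistics

# Hypothetical readings with one suspiciously large value (illustrative only).
readings = [12, 14, 15, 15, 16, 17, 18, 19, 45]

# n=4 splits the data into quarters and returns the three cut points.
q1, q2, q3 = statistics.quantiles(readings, n=4)
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in readings if x < lower_fence or x > upper_fence]

print(f"five-number summary: {min(readings)}, {q1}, {q2}, {q3}, {max(readings)}")
print(f"IQR = {iqr}, fences = ({lower_fence}, {upper_fence})")
print(f"outliers = {outliers}")
```

Because the fences are built from quartiles rather than from the mean and standard deviation, the extreme value itself has little influence on where the fences fall.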
Key Takeaways
- Percentiles: Indicate relative standing (e.g., scoring in the 90th percentile).
- Quartiles: Divide the dataset into quarters ($Q_1$, $Q_2$, $Q_3$).
- IQR: The spread of the middle half of the data, critical for robust outlier detection using the $1.5 \times IQR$ rule.
Skewness and Kurtosis
These measures describe the shape of the data's distribution compared to a standard normal (bell-shaped) curve.
Skewness
A measure of the asymmetry of the probability distribution of a real-valued random variable about its mean.
- Positive Skew (Right-Skewed): The right tail is longer or fatter; the mass of the distribution is concentrated on the left. The Mean is typically greater than the Median.
- Negative Skew (Left-Skewed): The left tail is longer or fatter; the mass of the distribution is concentrated on the right. The Mean is typically less than the Median.
- Zero Skew: The distribution is perfectly symmetric (e.g., normal distribution). Mean equals Median.
Kurtosis
A measure of the "tailedness" (heavy or light tails) of the probability distribution. It describes how much of the data is clustered in the extreme tails versus the center, relative to a normal distribution.
- Leptokurtic (High Kurtosis, $> 3$): Heavy tails, often accompanied by a sharper, higher peak than a normal distribution. Indicates a higher propensity for extreme outliers (critical in risk assessment for extreme loads).
- Platykurtic (Low Kurtosis, $< 3$): Light tails, often with a flatter peak. Fewer extreme outliers.
- Mesokurtic (Kurtosis $= 3$): The kurtosis of a normal distribution.
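Both shape measures can be computed from central moments. The sketch below uses the simple population (moment-based) formulas and assumes the convention in which a normal distribution has kurtosis 3; statistical packages often apply small-sample corrections or report excess kurtosis (kurtosis minus 3) instead. All names and data values are illustrative.

```python
def central_moment(values, order):
    """Average of (x - mean) raised to the given power."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** order for x in values) / n

def skewness(values):
    """Third central moment divided by the variance raised to the power 3/2."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def kurtosis(values):
    """Fourth central moment divided by the squared variance (normal = 3)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2

symmetric = [1, 2, 3, 4, 5]        # evenly spaced, no tails
right_skewed = [1, 1, 2, 2, 3, 10]  # one large value stretches the right tail

print(f"symmetric: skew = {skewness(symmetric):.3f}, kurtosis = {kurtosis(symmetric):.3f}")
print(f"right-skewed: skew = {skewness(right_skewed):.3f}, kurtosis = {kurtosis(right_skewed):.3f}")
```

The evenly spaced dataset has zero skew and a kurtosis below 3 (platykurtic), while the dataset with a single large value shows positive skew.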
Key Takeaways
- Skewness: Indicates whether data is asymmetric to the left or right of the mean.
- Kurtosis: Measures the extremity of tails in a distribution, helping engineers predict the likelihood and severity of extreme, outlier events.