In the previous posts we looked at measures of central tendency and then extended the discussion to other measures of location. Measures of central tendency give us an idea of the middle point around which the data is spread, but they don’t tell us how much variability there is in the data. Though percentiles give us some idea of the spread, there are other measures meant specifically to gauge variability.
Look at the data below, for example.
Though both students averaged 70, the variability in their scores is clearly very different. Student 1 has been quite inconsistent, fluctuating all the way from 51 to 91. Student 2 has stayed close to 70 throughout, never scoring below 67 or above 73.
Just looking at the measure of central tendency, or “mean” in this case, doesn’t tell the complete story about your data. You need to qualify the data with a figure representing its variability as well.
Range is the simplest of the variability measures. It is just the difference between the maximum and minimum values.
The range for the given example is illustrated below.
For student 1 it is 91 – 51 = 40
For student 2 it is 73 – 67 = 6
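The same calculation can be sketched in Python. The two score lists here are hypothetical stand-ins, not the table from this post; they were chosen only so that the minimum, maximum and mean of 70 match the figures quoted above. `value_range` is an illustrative helper name.

```python
# Hypothetical monthly scores (NOT the post's original table).
# Only the min, max and mean of 70 match the figures quoted in the text.
student_1 = [51, 55, 60, 62, 65, 68, 72, 75, 78, 80, 83, 91]
student_2 = [67, 68, 68, 69, 69, 70, 70, 71, 71, 72, 72, 73]


def value_range(values):
    """Return the difference between the largest and smallest value."""
    return max(values) - min(values)


print(value_range(student_1))  # 40
print(value_range(student_2))  # 6
```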
Though range is simple and straightforward to calculate, it is sensitive to extreme values. Imagine a dataset of employee salaries in a 100-member company. 99 employees have salaries between 10,000 and 30,000 rupees per month, and the CEO takes home 20,000,000 rupees per month.
The range in this case will be 20,000,000 – 10,000 = 19,990,000 rupees. This doesn’t tell the reader that the salaries of the other 99 employees have a range of just 30,000 – 10,000 = 20,000 rupees. The figure is inflated enormously by one person’s salary.
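A quick sketch makes the distortion concrete. The payroll below is made up to fit the description (99 salaries between 10,000 and 30,000 rupees plus one CEO); the individual salary values are an assumption.

```python
# Hypothetical payroll: 99 employees between 10,000 and 30,000 rupees per month,
# plus one CEO at 20,000,000 rupees per month.
employee_salaries = [10_000] * 33 + [20_000] * 33 + [30_000] * 33
ceo_salary = 20_000_000
all_salaries = employee_salaries + [ceo_salary]

range_without_ceo = max(employee_salaries) - min(employee_salaries)
range_with_ceo = max(all_salaries) - min(all_salaries)

print(range_without_ceo)  # 20000
print(range_with_ceo)     # 19990000
```

A single value out of 100 changes the range by a factor of roughly a thousand.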
Range also doesn’t take the rest of the values into account. It only looks at the top and bottom values.
Inter quartile range (IQR) is the range of the middle 50% of values. It is the difference between the 75th percentile and the 25th percentile. The calculation leaves out the lowest 25% and the highest 25% of values, in the hope of getting rid of the influence of extreme values at either end.
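The IQR for the given example is shown below.
The “i” value for quartile 1 is (25/100) * 12 = 3, where 12 is the number of monthly marks.
If the “i” value is an integer, the result is calculated as the average of the ith and (i + 1)th values, counted after sorting the marks in ascending order. So, quartile 1 is the mean of the 3rd and 4th marks.
Similarly, the “i” value for quartile 3 is (75/100) * 12 = 9, so quartile 3 is the mean of the 9th and 10th marks.
IQR = Q3 – Q1 as shown in the table above.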
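The index rule above can be sketched as a small function. The scores are hypothetical (not the post's table), `percentile` is an illustrative name, and the branch for a non-integer “i” (round up and take that value) is an assumption, since only the integer case is described here.

```python
def percentile(values, p):
    """Percentile by the index rule above (p a whole number, 0 < p < 100)."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(values)
    # i = (p / 100) * n, kept as quotient and remainder to avoid float rounding.
    whole, remainder = divmod(p * len(ordered), 100)
    if remainder == 0:
        # i is an integer: average the i-th and (i + 1)-th values (1-based).
        return (ordered[whole - 1] + ordered[whole]) / 2
    # Assumption: when i is not an integer, round up and take that value.
    return ordered[whole]


# Hypothetical monthly scores (NOT the post's original table).
student_1 = [51, 55, 60, 62, 65, 68, 72, 75, 78, 80, 83, 91]

q1 = percentile(student_1, 25)  # mean of 3rd and 4th sorted values
q3 = percentile(student_1, 75)  # mean of 9th and 10th sorted values
iqr = q3 - q1

print(q1, q3, iqr)  # 61.0 79.0 18.0
```

Note that library routines such as NumPy's percentile functions use different interpolation rules by default, so their quartiles may not match this hand calculation exactly.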
The inter quartile range attempts to retain the simplicity of range calculation to some extent, and at the same time overcome the problem of extreme values.
But it still doesn’t take all the values in the dataset into account. It only uses the values around the 25th and 75th percentiles.
Mean Absolute Deviation:
One approach that does take all values into consideration is to calculate the deviation of each value from the mean and then find the average of those deviations. But the deviations from the mean are a mix of positive and negative values that always total zero.
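The reason the deviations always cancel follows directly from the definition of the mean. Writing the values as $x_1, \dots, x_N$ and their mean as $\mu$ (notation assumed here for illustration):

$$
\sum_{i=1}^{N} (x_i - \mu) = \sum_{i=1}^{N} x_i - N\mu = N\mu - N\mu = 0,
\qquad \text{since } \mu = \frac{1}{N} \sum_{i=1}^{N} x_i .
$$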
To get around the deviations totaling zero, we use Mean Absolute Deviation (MAD). This approach takes the absolute deviation of each value from the mean (ignoring the sign) and then finds the mean of those absolute deviations.
The image above shows the calculation of the deviation from the mean and the absolute deviation from the mean using the marks of Student 1. MAD for Student 1 is 12.17.
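The steps can be sketched in Python as follows. The scores are hypothetical (not the post's table), so the resulting MAD differs from the 12.17 quoted above; the point is only to show the mechanics.

```python
# Hypothetical monthly scores (NOT the post's original table).
student_1 = [51, 55, 60, 62, 65, 68, 72, 75, 78, 80, 83, 91]

mean = sum(student_1) / len(student_1)
deviations = [score - mean for score in student_1]
absolute_deviations = [abs(d) for d in deviations]

# Signed deviations cancel out; absolute deviations do not.
mad = sum(absolute_deviations) / len(student_1)

print(mean)             # 70.0
print(sum(deviations))  # 0.0
print(round(mad, 2))    # 9.83
```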
Variance is a more popular measure that also overcomes the problem of deviations from the mean totaling zero. This method squares the deviations from the mean and then averages the squared deviations.
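The formula for variance is

$$
\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}
$$

where $x_i$ are the individual values, $\mu$ is their mean and $N$ is the number of values. Variance is denoted by the “sigma squared” symbol, $\sigma^2$. The calculation of variance using Student 1’s marks is shown below.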
Variance for Student 1 marks across 12 months is 178.17.
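A minimal sketch of the same calculation, assuming the population form of variance (divide by the number of values, as the description of averaging suggests). The scores are hypothetical, so the result differs from 178.17.

```python
import statistics

# Hypothetical monthly scores (NOT the post's original table).
student_1 = [51, 55, 60, 62, 65, 68, 72, 75, 78, 80, 83, 91]

mean = statistics.mean(student_1)
squared_deviations = [(score - mean) ** 2 for score in student_1]

# Population variance: average of the squared deviations.
variance = sum(squared_deviations) / len(student_1)

print(round(variance, 2))  # 131.83

# The standard library's population variance gives the same figure.
print(round(statistics.pvariance(student_1), 2))  # 131.83
```

`statistics.variance` (without the `p`) divides by one less than the number of values instead, which is the sample form and gives a slightly larger number.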
The problem with variance is that it squares the deviations, so the unit of measure of the result is squared as well. This is difficult to interpret. For example, if the data deals with weight in kilograms or currency in dollars, the variance comes out in square kilograms or square dollars respectively.
Standard Deviation is simply the square root of the variance. It brings the variability measure back to the original unit of measurement, which is easier to interpret. At the same time, it retains the advantage of variance in that it takes every single data item into consideration while computing variability.
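The formula for Standard Deviation is

$$
\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}
$$

where $x_i$ are the individual values, $\mu$ is their mean and $N$ is the number of values.
In the example of Student 1 above, the standard deviation is SQRT(178.17) = 13.35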
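As a quick check, the square root of the variance quoted above can be computed directly, and the same measure can be obtained from raw scores with the standard library. The score list is hypothetical (not the post's table), so its standard deviation differs from 13.35.

```python
import math
import statistics

# Variance of Student 1's marks, as quoted in the post.
post_variance = 178.17
print(round(math.sqrt(post_variance), 2))  # 13.35

# Hypothetical monthly scores (NOT the post's original table).
student_1 = [51, 55, 60, 62, 65, 68, 72, 75, 78, 80, 83, 91]

# Population standard deviation: square root of the population variance.
std_dev = statistics.pstdev(student_1)
print(round(std_dev, 2))  # 11.48
```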
Coefficient of Variation:
Though Standard Deviation captures the variability of data in the same unit of measurement as the original data, it still doesn’t relate that variability to the magnitude of the data itself.
A standard deviation of 60 seconds is far more significant in the run times of a 100 meter race than it is in the run times of a 10 km race.
Coefficient of Variation relates the standard deviation to the mean, thereby putting the variability into perspective. The formula for coefficient of variation is

$$
CV = \frac{\sigma}{\mu} \times 100
$$

where $\sigma$ is the standard deviation and $\mu$ is the mean.
For the Student 1 example, the coefficient of variation is (13.35/70) * 100 = 19.07%
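A short sketch of the calculation, first with the figures quoted above and then from raw scores. The two score lists are hypothetical (not the post's table) and `coefficient_of_variation` is an illustrative helper that assumes the population standard deviation.

```python
import statistics


def coefficient_of_variation(values):
    """Population standard deviation expressed as a percentage of the mean."""
    return statistics.pstdev(values) / statistics.mean(values) * 100


# Using the standard deviation and mean quoted in the post for Student 1.
print(round(13.35 / 70 * 100, 2))  # 19.07

# Hypothetical monthly scores (NOT the post's original table).
student_1 = [51, 55, 60, 62, 65, 68, 72, 75, 78, 80, 83, 91]
student_2 = [67, 68, 68, 69, 69, 70, 70, 71, 71, 72, 72, 73]

cv_1 = coefficient_of_variation(student_1)
cv_2 = coefficient_of_variation(student_2)
print(cv_1 > cv_2)  # True: the inconsistent student has the larger CV
```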
Range is used more often in Quality Assurance applications, to plot control charts. It is not used much in complex Data Science models. MAD is sometimes used as a measure of accuracy in forecasting scenarios. Variance and Standard Deviation are used very often, and we will see many such instances in our future posts. They are most useful for comparing variables that have more or less the same mean. Coefficient of Variation is suited to comparing variables that differ in both standard deviation and mean.