Before getting into statistical modelling and more detailed analytics, it is important to understand the data and its distribution at a basic level. Below are some distributions and plots that help us understand the categorical variables in a data set. These summaries are called descriptive statistics.
Consider a hypothetical scenario in which a stationery shop records 50 sales, each containing just one item. All sales are spread across just 5 products: Pen, Pencil, Eraser, Sharpener and Scale. A frequency distribution for this example dataset would look like the one below.
A frequency distribution is a tabular summary showing the number of occurrences of each categorical value in the dataset. The stationery sales dataset, which contains 50 observations, comprises 18 pen sales, 13 pencil sales and so on.
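As a minimal sketch, the frequency table can be built by counting how often each label appears. Only the Pen, Pencil and Eraser counts and the total of 50 come from the text. The Sharpener/Scale split (5 and 4) is an assumption made purely so the example adds up, and the variable names are illustrative.

```python
from collections import Counter

# Pen (18), Pencil (13), Eraser (10) and the total of 50 come from the text.
# ASSUMPTION: the remaining 9 sales are split as Sharpener 5 and Scale 4.
sales = (["Pen"] * 18 + ["Pencil"] * 13 + ["Eraser"] * 10
         + ["Sharpener"] * 5 + ["Scale"] * 4)

# Counter tallies the occurrences of each categorical value.
frequency = Counter(sales)

# Print the table, most frequent item first, followed by the total.
for item, count in frequency.most_common():
    print(f"{item:<10}{count:>4}")
print(f"{'Total':<10}{sum(frequency.values()):>4}")
```

In a real dataset the `sales` list would be read from the sales records rather than constructed by hand.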
Relative Frequency and Percent Frequency Distribution:
Instead of summarizing the individual frequencies, the representation can be converted to relative proportions as below.
The proportion of pen sales out of the total sales is 18/50, which is 0.36. Similarly, the proportion of pencil sales is 13/50, which is 0.26. Notice that the relative frequencies sum to 1. The same information can also be represented as percentages, as below.
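The conversion from counts to proportions and percentages can be sketched as follows. The Sharpener and Scale counts are again assumed (5 and 4), since the text only states the first three counts and the total of 50.

```python
# Pen, Pencil and Eraser counts are from the text.
# ASSUMPTION: Sharpener 5 and Scale 4 (only their combined total of 9 is implied).
frequency = {"Pen": 18, "Pencil": 13, "Eraser": 10, "Sharpener": 5, "Scale": 4}

total = sum(frequency.values())

# Relative frequency = count / total; percent frequency = 100 * count / total.
relative = {item: count / total for item, count in frequency.items()}
percent = {item: 100 * count / total for item, count in frequency.items()}

for item in frequency:
    print(f"{item:<10}{relative[item]:>6.2f}{percent[item]:>7.0f}%")
```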
Cumulative Frequency Distribution:
A cumulative frequency distribution is a running total of frequencies. The first row starts with a frequency of 18, which corresponds to Pen sales. The second row has the value 31, which is the total of Pen and Pencil sales (18 + 13). The third row adds 10 more to this total to account for Eraser sales. At this point we know that Pen, Pencil and Eraser together account for 41 of the 50 sales. For the last item, the cumulative frequency equals the total number of observations.
Cumulative Frequency Distribution can also be represented in terms of Relative Frequency or Percent Frequency.
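A running total like this can be sketched with `itertools.accumulate`. The first three counts are from the text; the last two (5 and 4) are assumed for illustration.

```python
from itertools import accumulate

items = ["Pen", "Pencil", "Eraser", "Sharpener", "Scale"]
# ASSUMPTION: the last two counts (5 and 4) are illustrative; the text gives 18, 13, 10.
counts = [18, 13, 10, 5, 4]

# Running total of the frequencies.
cumulative = list(accumulate(counts))  # [18, 31, 41, 46, 50]
total = cumulative[-1]

# The same running total expressed as relative and percent frequencies.
cumulative_relative = [c / total for c in cumulative]
cumulative_percent = [100 * c / total for c in cumulative]

for item, c, pct in zip(items, cumulative, cumulative_percent):
    print(f"{item:<10}{c:>4}{pct:>7.0f}%")
```

The last cumulative value always equals the number of observations, which makes it a convenient sanity check.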
The descriptive statistics described above were tabular distributions. A bar chart is a graphical representation of the same information that allows a quick visual comparison.
The bars are separated with spaces in between to show that they are non-overlapping categories.
The data in this example happens to be in descending order of frequency. This is a coincidence; bar charts do not require any particular ordering. Bar charts can also be plotted using relative or percent frequencies.
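To illustrate the idea without a plotting library, here is a rough text-based sketch in which each bar is a row of `#` characters whose length equals the frequency. The helper name `text_bar_chart` is hypothetical, and the Sharpener and Scale counts are assumed. In practice a plotting library would draw the bars graphically.

```python
def text_bar_chart(frequency, symbol="#"):
    """Return one text bar per category; bar length equals the count."""
    width = max(len(label) for label in frequency)
    lines = []
    for label, count in frequency.items():
        # Each category gets its own line, so bars never overlap.
        lines.append(f"{label:<{width}} | {symbol * count} {count}")
    return "\n".join(lines)

# ASSUMPTION: Sharpener 5 and Scale 4 are illustrative values.
frequency = {"Pen": 18, "Pencil": 13, "Eraser": 10, "Sharpener": 5, "Scale": 4}
chart = text_bar_chart(frequency)
print(chart)
```

Passing relative or percent frequencies instead of counts would need a scaling factor for the bar length, since those values are not whole numbers of characters.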
A pie chart is another graphical way to view the same information.
Pie charts, however, are usually harder to compare visually unless the sector sizes differ substantially. Bar charts are simpler to read and compare.
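Each sector of a pie chart spans an angle proportional to the category's relative frequency, that is, 360 degrees times the proportion. The short sketch below computes those angles, again assuming the Sharpener and Scale counts.

```python
# ASSUMPTION: Sharpener 5 and Scale 4 are illustrative values.
frequency = {"Pen": 18, "Pencil": 13, "Eraser": 10, "Sharpener": 5, "Scale": 4}
total = sum(frequency.values())

# Sector angle in degrees = 360 * relative frequency.
angles = {item: 360 * count / total for item, count in frequency.items()}

for item, angle in angles.items():
    print(f"{item:<10}{angle:>7.1f} degrees")
```

Sectors for categories with similar counts differ by only a few degrees, which is why such slices are hard to tell apart by eye.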
Sometimes even simple descriptive statistics can yield very useful insights. A classic example is the Pareto chart, used primarily in quality control applications. The bar chart is drawn in descending order of bar height and, optionally, a cumulative line graph is overlaid on a secondary axis of the same chart. This shows the top few causes that contribute to the majority of the problems or defects.
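The data behind a Pareto chart can be sketched as a table: sort the categories by descending frequency, then attach the cumulative percentage that the line graph would plot. The helper name `pareto_table` is hypothetical. The example reuses the stationery counts (with the assumed Sharpener and Scale values) rather than real defect data, and enters them out of order to show the sorting step.

```python
def pareto_table(frequency):
    """Return (label, count, cumulative percent) rows, largest count first."""
    total = sum(frequency.values())
    ordered = sorted(frequency.items(), key=lambda pair: pair[1], reverse=True)
    rows = []
    running = 0
    for label, count in ordered:
        running += count
        rows.append((label, count, 100 * running / total))
    return rows

# Deliberately unordered input.
# ASSUMPTION: Sharpener 5 and Scale 4 are illustrative values.
frequency = {"Scale": 4, "Eraser": 10, "Pen": 18, "Sharpener": 5, "Pencil": 13}
rows = pareto_table(frequency)

for label, count, cum_pct in rows:
    print(f"{label:<10}{count:>4}{cum_pct:>7.0f}%")
```

Reading down the cumulative column shows how few categories are needed to reach a given share of the total; here the top three products reach 82% of sales.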