9 – Descriptive Statistics – Categorical Variable

Before getting into any statistical modelling and more detailed analytics, it is important for us to understand the data and its distribution at a more basic level. Below are some distributions and plots that will help us to understand the categorical variables in our data set. These are called Descriptive Statistics.

Frequency Distribution:

Assume a theoretical scenario where there are 50 sale entries in a stationery shop, with each sale containing just one item. All sales are across just 5 products – Pen, Pencil, Eraser, Sharpener & Scale. A frequency distribution for this example dataset would look like below.

[Image: Frequency Distribution table]

A frequency distribution is a tabular summary showing the number of occurrences of each categorical value in the dataset. The stationery sales dataset of 50 observations comprises 18 pen sales, 13 pencil sales and so on.
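As a rough sketch, the tally above could be produced in Python with `collections.Counter`. The Pen, Pencil and Eraser counts come from the text; the split of the remaining 9 sales between Sharpener and Scale (5 and 4) is an assumption made only for illustration, as is the `sales` list itself.

```python
from collections import Counter

# Hypothetical raw data: one product name per sale (50 sales in total).
# Pen, Pencil and Eraser counts are from the text; the Sharpener/Scale
# split (5 and 4) is an ASSUMPTION made for this sketch.
sales = (["Pen"] * 18 + ["Pencil"] * 13 + ["Eraser"] * 10
         + ["Sharpener"] * 5 + ["Scale"] * 4)

# Counter tallies how many times each categorical value occurs.
frequency = Counter(sales)

# Print one row per category, most frequent first, then the total.
for product, count in frequency.most_common():
    print(f"{product:<10} {count:>3}")
print(f"{'Total':<10} {sum(frequency.values()):>3}")
```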

Relative Frequency and Percent Frequency Distribution:

Instead of summarizing the individual frequencies, the representation can be converted to relative proportions as below.

[Image: Relative Frequency Distribution table]

The proportion of pen sales out of the total sales is 18/50, which is 0.36. Similarly, the proportion of pencil sales is 13/50, which is 0.26. Notice that the sum of all relative frequencies is 1. The same information can also be represented as percentages, as below.

[Image: Percent Frequency Distribution table]
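The conversion from counts to relative and percent frequencies is a single division by the total. A minimal sketch, again assuming a 5/4 split for Sharpener and Scale (the names `relative` and `percent` are just illustrative):

```python
# Counts for Pen, Pencil and Eraser are from the text; the Sharpener and
# Scale counts are ASSUMED for illustration.
frequency = {"Pen": 18, "Pencil": 13, "Eraser": 10, "Sharpener": 5, "Scale": 4}
total = sum(frequency.values())

# Relative frequency = count / total; percent frequency = 100 * count / total.
relative = {product: count / total for product, count in frequency.items()}
percent = {product: 100 * count / total for product, count in frequency.items()}

for product in frequency:
    print(f"{product:<10} {relative[product]:.2f} {percent[product]:5.1f}%")
```

The relative frequencies should add up to 1 and the percent frequencies to 100, which is a handy sanity check on any such table.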

Cumulative Frequency Distribution:

A cumulative frequency distribution is a running total of frequencies. The first row starts with a frequency of 18, which corresponds to Pen sales. The second row has the value 31, which is the total of Pen and Pencil sales. The third row adds 10 more to account for Eraser sales, giving 41. So Pen, Pencil and Eraser together account for 41 of the 50 sales. At the last item, the cumulative frequency equals the total number of observations.

[Image: Cumulative Frequency Distribution table]

Cumulative Frequency Distribution can also be represented in terms of Relative Frequency or Percent Frequency.
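A running total like this can be sketched with `itertools.accumulate`. The first three counts are from the text; the last two are assumed so that the total reaches 50.

```python
from itertools import accumulate

# Pen, Pencil and Eraser counts are from the text; Sharpener and Scale
# are ASSUMED values chosen so the total is 50.
frequency = {"Pen": 18, "Pencil": 13, "Eraser": 10, "Sharpener": 5, "Scale": 4}

products = list(frequency)
# accumulate() yields the running sum: 18, 18+13, 18+13+10, ...
cumulative = list(accumulate(frequency.values()))
total = cumulative[-1]  # the last running total equals the number of observations

for product, running in zip(products, cumulative):
    # Show cumulative frequency, cumulative relative and cumulative percent.
    print(f"{product:<10} {running:>3} {running / total:.2f} {100 * running / total:5.1f}%")
```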

Bar Chart:

The descriptive statistics described above were tabular distributions. A bar chart is a graphical representation of the same information that allows a quick visual comparison.

[Image: Bar Chart]

The bars are separated with spaces in between to show that they are non-overlapping categories.

The data in this example happens to be in descending order of frequencies. It is just a coincidence and need not always be so for bar charts. Bar charts can also be plotted using Relative or Percent Frequencies.
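A plotting library would be the usual choice for a bar chart; to stay dependency-free, the sketch below draws a rough horizontal text version instead. The Sharpener and Scale counts are assumed, and the `#` bar style is just one possible rendering.

```python
# Pen, Pencil and Eraser counts are from the text; Sharpener and Scale
# are ASSUMED values for illustration.
frequency = {"Pen": 18, "Pencil": 13, "Eraser": 10, "Sharpener": 5, "Scale": 4}

# One line per category: label, a bar of '#' whose length is the count,
# then the count itself.
lines = [f"{product:<10}| {'#' * count} {count}"
         for product, count in frequency.items()]

# A blank line between bars mirrors the gaps between non-overlapping categories.
print("\n\n".join(lines))
```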

Pie Chart:

A pie chart is another graphical way to view the same information.

[Image: Pie Chart]

Pie charts, however, are usually harder to compare visually unless the sector sizes differ substantially. Bar charts are much simpler to read and compare.
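Each sector of a pie chart spans an angle proportional to its relative frequency, so the angles follow directly from the counts. A small sketch (Sharpener and Scale counts are assumed):

```python
# Pen, Pencil and Eraser counts are from the text; Sharpener and Scale
# are ASSUMED values for illustration.
frequency = {"Pen": 18, "Pencil": 13, "Eraser": 10, "Sharpener": 5, "Scale": 4}
total = sum(frequency.values())

# Sector angle in degrees = relative frequency * 360.
angles = {product: 360 * count / total for product, count in frequency.items()}

for product, angle in angles.items():
    print(f"{product:<10} {angle:6.1f} degrees")
```

With these numbers, the Pencil and Eraser sectors differ by only about 22 degrees, which hints at why neighbouring categories can be hard to tell apart on a pie.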

Sometimes even simple descriptive statistics can yield very useful insights. A classic example is the Pareto chart, used primarily in quality control applications. A bar chart is drawn in descending order of bar heights, and optionally a cumulative line graph is added as a second plot on the same chart. This shows the top few causes contributing to the majority of the problems or defects.
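The data behind a Pareto chart is just a frequency table sorted in descending order with a cumulative percentage column. The sketch below reuses the stationery counts rather than real defect data (Sharpener and Scale counts are assumed), and `pareto_table` is a hypothetical helper name, not a library function.

```python
from itertools import accumulate

def pareto_table(counts):
    """Return (category, count, cumulative percent) rows, largest count first."""
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    total = sum(count for _, count in ordered)
    running = accumulate(count for _, count in ordered)
    return [(name, count, 100 * cum / total)
            for (name, count), cum in zip(ordered, running)]

# Deliberately unsorted input. Pen, Pencil and Eraser counts are from the
# text; Sharpener and Scale are ASSUMED values for illustration.
counts = {"Scale": 4, "Eraser": 10, "Pen": 18, "Sharpener": 5, "Pencil": 13}

rows = pareto_table(counts)
for name, count, cum_pct in rows:
    print(f"{name:<10} {count:>3} {cum_pct:6.1f}%")
```

The count column would drive the bars and the cumulative percent column the line graph; reading down the last column shows how quickly the top categories add up.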

Measures of Central Tendency, Measures of Location and Measures of Variability discussed in previous posts also come under the subject of Descriptive Statistics.
