In the previous post we saw the different distributions and charts available to summarize the categorical variables. There are similar distributions and charts available for Numeric variables which we will see in this post.
Similar to categorical variables, frequency distributions can be created for quantitative variables too. In the case of categorical variables it was straightforward to summarize the number of occurrences as frequencies against each categorical value. In the case of numeric variables the values are continuous numbers, so there is more work to be done to convert them into the form of a frequency distribution.
The steps for creating a frequency distribution for a numeric variable are outlined below.
- Choose the number of classes
- Arrive at the class width
- Choose class end points
Once these three steps are done, the classes can be treated like categories, and the occurrences of data points can be rolled up as frequencies against each class.
For example, let us consider the monthly average temperature data for a city shown in the table below.
The number of classes is up to the discretion of the analyst. A rough rule of thumb is to have between 5 and 15 classes (or a maximum of 20). There shouldn’t be so few classes that no inference can be made from the distribution. At the same time, there shouldn’t be so many that the distribution becomes difficult to read and certain classes are left with very few values. The right number depends on the size of the dataset.
In this case, the minimum temperature is 31.30 and the maximum is 66.50. The range for this dataset is 66.50 – 31.30 = 35.20. Let us choose to do 8 classes for this data.
To arrive at the class width, divide the range by the number of classes – 35.20 / 8 = 4.4. Generally this width is rounded up to the next higher value for ease of computations. So, the class width in this case is 5.
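The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the variable names are made up for this example, and rounding up to the next whole number is just one way of picking a convenient width.

```python
import math

# Minimum and maximum temperatures quoted in the text above.
minimum, maximum = 31.30, 66.50
num_classes = 8  # the analyst's choice

data_range = maximum - minimum        # 35.20
raw_width = data_range / num_classes  # about 4.4
class_width = math.ceil(raw_width)    # rounded up to the next whole number

print(f"range = {data_range:.2f}, raw width = {raw_width:.2f}, class width = {class_width}")
```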
We need to start with a value that is less than or equal to the minimum temperature. The last class should finish with a value that is greater than the maximum temperature. So, let us start with 30 as the left class end point for first class and form classes in increments of 5. The resulting frequency distribution will be as below.
The square bracket on the left indicates that the end point value is inclusive. The round bracket (parenthesis) on the right-hand side shows that the value is exclusive, as in [30, 35). This ensures that a value falling on a boundary is counted in only one class and not on both sides.
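As a minimal sketch of how values get rolled up into classes whose left end point is inclusive and right end point is exclusive, consider the helper below. The function name `frequency_distribution` and the sample temperatures are hypothetical; the sample is not the post's data and was chosen only so that it spans 31.3 to 66.5 and puts three values in the first class.

```python
def frequency_distribution(values, start, width, num_classes):
    """Count values into classes [start, start + width), [start + width, ...), and so on."""
    counts = [0] * num_classes
    for v in values:
        # Floor division puts a boundary value such as 35.0 into the class that starts at 35.
        index = int((v - start) // width)
        if not 0 <= index < num_classes:
            raise ValueError(f"{v} falls outside the classes")
        counts[index] += 1
    return [((start + i * width, start + (i + 1) * width), counts[i])
            for i in range(num_classes)]


# Hypothetical temperatures, NOT the table from the post.
sample = [31.3, 32.8, 34.1, 35.0, 39.9, 40.0, 44.2, 52.7, 58.1, 66.5]

table = frequency_distribution(sample, start=30, width=5, num_classes=8)
for (left, right), freq in table:
    print(f"{left} to under {right}: {freq}")
```

Note that 35.0 and 40.0 sit exactly on class boundaries and are each counted once, in the class that begins at that value.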
Once we have the frequency distribution in this format, we can compute relative frequencies, percent frequencies, cumulative frequencies, cumulative relative frequencies & cumulative percent frequencies similar to the categorical variable example in previous post.
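The derived columns are simple to compute once the class frequencies are known. The sketch below assumes a hypothetical list of eight class frequencies (not the post's table) and uses only the standard library.

```python
from itertools import accumulate

# Hypothetical class frequencies for the eight classes, NOT the post's table.
frequencies = [3, 2, 2, 0, 1, 1, 0, 1]
total = sum(frequencies)

relative = [f / total for f in frequencies]        # share of all observations
percent = [100 * f / total for f in frequencies]   # the same share as a percentage
cumulative = list(accumulate(frequencies))         # running total of frequencies
cumulative_relative = [c / total for c in cumulative]
cumulative_percent = [100 * c / total for c in cumulative]

for row in zip(frequencies, relative, percent, cumulative, cumulative_percent):
    print(row)
```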
Once we have the frequency distribution, a histogram can be drawn with class intervals on the x axis and frequencies on the y axis. Rectangular boxes are drawn connecting the left and right class end points and raising them to the height equal to the frequency of that class.
This diagram is similar to the bar chart in the categorical variable scenario and conveys almost the same meaning. The subtle difference is that consecutive bars are drawn without any space in between, implying that the values are continuous: any numeric value from the minimum to the maximum is possible. In the bar chart there was space between the bars, indicating that they represented separate, non-overlapping categories.
A frequency polygon is a line chart plotted using frequencies on the y axis and class midpoints on the x axis. For each class, the class midpoint is calculated as the average of the two class end points. For the first class, which has values from 30 to 35, the class midpoint is 32.5 and its frequency is 3.
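A small sketch of how the points of a frequency polygon could be prepared is given below. The class end points follow the text (30 to 70 in steps of 5); the frequencies are hypothetical apart from the first one, which matches the value of 3 mentioned above.

```python
# Class end points from the text: 30, 35, ..., 70.
edges = list(range(30, 75, 5))

# Hypothetical frequencies, NOT the post's table (only the first value, 3, is from the text).
frequencies = [3, 2, 2, 0, 1, 1, 0, 1]

# Midpoint of each class = average of its two end points.
midpoints = [(left + right) / 2 for left, right in zip(edges, edges[1:])]

# (x, y) pairs that a line chart would connect.
polygon_points = list(zip(midpoints, frequencies))
print(polygon_points)
```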
An ogive is a line chart similar to a frequency polygon, but the x axis consists of class end points instead of class midpoints, and the y axis is plotted using cumulative frequencies instead of plain frequencies.
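The points of an ogive can be prepared in a similar way. The sketch below assumes one common convention, which is to start the line at zero at the lower end point of the first class and then plot each cumulative frequency against the upper end point of its class. The frequencies are again hypothetical.

```python
from itertools import accumulate

# Class end points from the text: 30, 35, ..., 70.
edges = list(range(30, 75, 5))

# Hypothetical frequencies, NOT the post's table.
frequencies = [3, 2, 2, 0, 1, 1, 0, 1]

# Cumulative frequency reached at each end point; nothing has accumulated at 30.
cumulative = [0] + list(accumulate(frequencies))

# (x, y) pairs that a line chart would connect.
ogive_points = list(zip(edges, cumulative))
print(ogive_points)
```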
A dot plot is a one-dimensional plot with the actual values of the variable of interest plotted on the x axis. On the y axis, as many dots are stacked vertically as there are repetitions of a particular value.
The advantage of using a dot plot is that it gives the frequencies at the level of individual values. It works well when there are relatively few data points and the chart remains easy to interpret. Otherwise, the other summarized charts make better sense.
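To make the stacking idea concrete, here is a rough text-only sketch of a dot plot. The function name `text_dot_plot` and the whole-degree readings are hypothetical; a plotting library would normally be used for a real chart.

```python
from collections import Counter


def text_dot_plot(values):
    """Return a text dot plot: one column per distinct value, one '*' per repetition."""
    counts = Counter(values)
    xs = sorted(counts)
    width = max(len(str(x)) for x in xs)
    lines = []
    # Build the rows from the tallest stack down to the bottom row.
    for level in range(max(counts.values()), 0, -1):
        cells = [("*" if counts[x] >= level else "").rjust(width) for x in xs]
        lines.append(" ".join(cells).rstrip())
    # The last line acts as the x axis and shows the values themselves.
    lines.append(" ".join(str(x).rjust(width) for x in xs))
    return "\n".join(lines)


# Hypothetical whole-degree readings, NOT the post's data.
readings = [41, 42, 42, 44, 44, 44, 45]
plot = text_dot_plot(readings)
print(plot)
```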
Stem and Leaf Plot:
A stem and leaf plot is a numeric arrangement in which the leftmost digit (or multiple digits, as required for larger numbers) is chosen as the stem. For all the values starting with that stem, the remaining right-side digit (or digits) is written in sorted order to the right of the stem. These digits constitute the leaves.
In the temperature example let us choose the tens digit as the stem. The combination of the units digit and the decimal portion is considered the leaf. Because there are few stem values and many leaves against each stem, the plot can be extended by repeating each stem value 5 times, plotting leaves from 0 to 1.9 against the first occurrence of a stem value, leaves from 2 to 3.9 against the second occurrence of the same stem value, and so on. The resulting plot looks as below.
If rotated 90 degrees anti-clockwise, this would look similar to a histogram. Unlike a histogram, however, this plot retains the actual data values, which is useful if we are interested in that level of detail.
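The construction described above can be sketched as follows, assuming values with one decimal place, the tens digit as the stem, and five rows per stem. The function name `stem_and_leaf` and the sample temperatures are hypothetical and are not the post's data.

```python
from collections import defaultdict


def stem_and_leaf(values, rows_per_stem=5):
    """Tens digit as stem; units digit plus one decimal place as leaf.

    Assumes non-negative values with one decimal place, and that
    rows_per_stem divides 100 evenly (5 rows means each row spans 2.0).
    """
    band = 100 // rows_per_stem  # leaf span per row, measured in tenths
    rows = defaultdict(list)
    for v in sorted(values):
        tenths = round(v * 10)            # integer tenths avoid floating point drift
        stem, leaf = divmod(tenths, 100)  # e.g. 313 -> stem 3, leaf 13 (meaning 1.3)
        rows[(stem, leaf // band)].append(f"{leaf / 10:.1f}")
    stems = range(min(s for s, _ in rows), max(s for s, _ in rows) + 1)
    # Emit every row for every stem, including rows that have no leaves.
    return [f"{s} | {' '.join(rows.get((s, r), []))}".rstrip()
            for s in stems for r in range(rows_per_stem)]


# Hypothetical temperatures, NOT the table from the post.
sample = [31.3, 32.8, 34.1, 35.0, 39.9, 40.0, 44.2, 52.7, 58.1, 66.5]
plot = stem_and_leaf(sample)
print("\n".join(plot))
```

Empty rows are kept on purpose so that the shape of the plot, when rotated, still lines up with the corresponding histogram.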
These simple tabulations and plots help us to understand the data better. We get to know whether the data is concentrated in a particular range or skewed towards any one side or normally distributed. These descriptive statistics also tell us whether any transformations are required to make certain variables fit to be used as inputs into a chosen statistical model.