Analytical Concepts for Business: Skewness, Mean and a bit more
When and why should the mean or the median be used?
Perhaps the most widely used formula for analysing a set of numbers is the average: average sales per day, average expenses, and so on. This simple calculation gives quite good intuition about the financial data at hand by trying to answer the question, “What value to expect?”. To answer this same question, statistics offers two fundamental measures: the mean and the median. In this article, we first look at the difference between the mean and the median, and then at why one should be chosen over the other in a given situation.
Please note that the intention is not an in-depth statistical treatment but a head start, so that the concepts can be deployed quickly for more efficient analytics.
Mean = Sum of Observations / Number of Observations
Median = The middle reading when the data is arranged in ascending order. If there are two readings at the middle point, the simple average of the two.
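The two definitions above can be tried out with a minimal sketch using Python's standard `statistics` module. The sales figures here are hypothetical and chosen only for illustration; they are not the data set used in this article.

```python
from statistics import mean, median

# Hypothetical daily sales in INR (illustrative only, not the article's data)
daily_sales = [210_000, 240_000, 255_000, 300_000, 1_250_000]

# Mean = sum of observations / number of observations
print(mean(daily_sales))    # 451000

# Median = middle reading once the data is sorted in ascending order
print(median(daily_sales))  # 255000

# With an even number of readings, the median is the simple
# average of the two middle readings (240,000 and 255,000 here)
even_sales = [210_000, 240_000, 255_000, 300_000]
print(median(even_sales))   # 247500.0
```

Note how the single large reading lifts the mean well above the median, while the median stays at a value that actually occurs on a typical day.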
Understanding the histogram and skewness
Fundamental to understanding this concept is visualising the data using a histogram. In short, a histogram is a graphical representation of how the values of a numeric variable are distributed. We now illustrate the histogram using randomly generated sales data.
This data holds the sales figure per day for a 365-day period starting April 04, 2016. The minimum sales figure is INR 1,38,947 and the maximum is INR 13,90,490. If these sales figures are arranged in ascending order and allocated to bins of INR 20,000 (the bin ranges being 2,00,000 to 2,20,000 and so on up to 14,20,000), we get the distribution of the sales data, that is, the number of days falling in each bin. Plotting this table with the bins on the x axis and the counts on the y axis gives a histogram (one given below).
From the above chart, we can see that the values pile up most around the 2,00,000 to 2,60,000 region, then slowly fall off until the 7,60,000 region, followed by 3 relatively small pile-ups in the 12,40,000 to 14,00,000 range. As one moves to the right, the tail of the chart gets narrower and narrower, so the distribution is said to be right skewed. Similarly, had the chart been like the one below, it would have been left skewed.
Most data sets demonstrate skew, so it is quite important to appreciate this phenomenon. Some data, however, might not demonstrate skew at all, like the one given below.
The concept of skew is fundamental to understanding how to answer the question, “What value to expect?”.
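The binning step described above can be sketched roughly as follows. This is only an illustration: the sales figures are synthetic, generated from a log-normal distribution with arbitrarily chosen parameters and a fixed seed, and the helper names `bin_start` and `histogram` are made up for this example. Only the bin width of INR 20,000 comes from the text.

```python
import random
from collections import Counter

BIN_WIDTH = 20_000  # bin size in INR, as used in the article

def bin_start(value, width=BIN_WIDTH):
    """Return the lower edge of the bin that contains value."""
    return (value // width) * width

def histogram(values, width=BIN_WIDTH):
    """Count how many values fall in each bin, keyed by the bin's lower edge."""
    counts = Counter(bin_start(v, width) for v in values)
    return dict(sorted(counts.items()))

# Assumption: synthetic right-skewed daily sales for 365 days.
# These are NOT the article's figures; the parameters are arbitrary.
rng = random.Random(42)
sales = [int(rng.lognormvariate(12.6, 0.4)) for _ in range(365)]

# Text histogram: one '#' per day falling in the bin
for start, count in histogram(sales).items():
    print(f"{start:>9,} to {start + BIN_WIDTH:>9,} | {'#' * count}")
```

Each printed row corresponds to one bar of the histogram. A plotting library would draw the same counts as bars, with the bins on the x axis.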
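The direction of skew can also be checked numerically rather than by eye. One common measure, which this article does not otherwise rely on, is the moment coefficient of skewness: positive for a right-skewed data set, negative for a left-skewed one, and close to zero for a symmetric one. The sketch below uses the population form of that coefficient and small made-up data sets.

```python
from statistics import mean

def skewness(values):
    """Moment coefficient of skewness (population form): m3 / m2 ** 1.5.

    Assumes the values are not all identical (otherwise m2 is zero).
    """
    n = len(values)
    mu = mean(values)
    m2 = sum((x - mu) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mu) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Made-up data sets, for illustration only
right_tailed = [2, 2, 3, 3, 4, 12]          # long tail to the right
left_tailed = [-x for x in right_tailed]    # mirror image: tail to the left
symmetric = [1, 2, 3, 4, 5]                 # no skew

print(skewness(right_tailed) > 0)  # True
print(skewness(left_tailed) < 0)   # True
print(skewness(symmetric))         # 0.0
```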
Understanding the use of Median for analyzing data that demonstrates skewness
The data shown in the above histogram (the same histogram shown before) gives a mean of INR 3,48,612 and a median of INR 2,99,332. The mean is calculated by taking a simple average of the values for the 365-day period, and the median by first arranging the data in ascending order and then choosing the middle value (the 183rd of 365 readings, i.e. the 50th percentile). To the question, “What to expect?”, INR 2,99,332 seems to be a better guess than 3,48,612, as relatively few observations fall in the region around the mean.
In the above right-skewed histogram with the mean and median marked, the strong outliers towards the 14,20,000 side pull the simple average, denoted by the red line, towards the skew. If business decisions are taken using this mean, the overall expectation of sales per day will be relatively overstated. For instance, if the mean is used to fix the future expected daily sales, the resulting figure might be too lofty a target for the sales team to achieve. In these situations, it is advisable to rely on the median instead of the mean. The outliers, being the readings at the extreme right end, should not be overlooked though, as they may represent festival and seasonal sales.
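The pull that outliers exert on the mean can be seen in a small sketch. The figures below are hypothetical: seven ordinary trading days plus two festival days placed far out in the right tail. They are not taken from the data set discussed above.

```python
from statistics import mean, median

# Hypothetical ordinary trading days (INR); illustrative only
ordinary_days = [220_000, 240_000, 250_000, 260_000, 280_000, 300_000, 320_000]
# Hypothetical festival days sitting far out in the right tail
festival_days = [1_250_000, 1_390_000]

all_days = ordinary_days + festival_days

# Without the festival days, mean and median sit close together
print(round(mean(ordinary_days)), median(ordinary_days))

# With them, the mean is dragged towards the tail; the median barely moves
print(round(mean(all_days)), median(all_days))

mean_shift = mean(all_days) - mean(ordinary_days)
median_shift = median(all_days) - median(ordinary_days)
print(mean_shift > median_shift)  # True
```

In this made-up example, a daily target set at the mean of all nine days would be exceeded only on the two festival days, which is the overstatement described above.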
Conversely, in a left-skewed distribution such as the one shown above, the mean falls short of the median, so relying on the mean understates the expectation.
The mean and median were also marked for the data with no skewness; there, the mean and the median turned out to be not far apart.
Though the median is a very reliable estimate when the data is skewed in a single direction, it may not be so when the data demonstrates extreme volatility. For volatile data sets, it is best either to stick with the mean or to split the data using an acceptable logic (for example, festive period versus the rest) and then analyse the blocks separately.
The above discussion of mean and median cannot, however, be taken as the final word on answering the question “What to expect?”. It is certainly a robust approach to analysing data sets, but the choice of the best method should depend on the decision at hand and the nature of the data set.
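Splitting the data and analysing the blocks separately might look like the sketch below. The records, dates and the festive flag are all hypothetical, and how a day gets flagged as festive is a business decision that this example simply assumes has already been made.

```python
from statistics import mean, median

# Hypothetical records: (date, sales in INR, festive-period flag).
# Dates, amounts and flags are made up for illustration.
records = [
    ("2016-10-24", 250_000, False),
    ("2016-10-25", 270_000, False),
    ("2016-10-26", 290_000, False),
    ("2016-10-28", 1_240_000, True),
    ("2016-10-29", 1_320_000, True),
    ("2016-10-30", 1_390_000, True),
]

def summarise(values):
    """Return the day count, mean and median for one block of sales."""
    return {"days": len(values), "mean": mean(values), "median": median(values)}

def split_and_summarise(rows):
    """Split rows on the festive flag and summarise each block separately."""
    regular = [sales for _, sales, is_festive in rows if not is_festive]
    festive = [sales for _, sales, is_festive in rows if is_festive]
    return {"regular": summarise(regular), "festive": summarise(festive)}

result = split_and_summarise(records)
for block, stats in result.items():
    print(block, stats)
```

Within each block the mean and median sit close together, so either gives a sensible answer to “What to expect?” for that kind of day.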