In the first part of our study (https://antonio-catalano.github.io/NY_Airbnb_kaggle.html) we carried out an exploratory data analysis of the price distribution truncated at the 98th percentile. Moreover, we truncated it using the overall price distribution as the reference.
In this second part we are going to be more precise about the outlier prices of the distribution.
First of all, it's better to divide the overall distribution into 15 different sub-distributions, because a private-room listing in Manhattan will have a very different expected price from a shared-room listing in the Bronx.
Indeed, if we group the listings by room type and by borough we get 15 distributions, because there are 3 room types and 5 boroughs.
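As a minimal sketch of this grouping, assume a pandas DataFrame with the columns `room_type`, `neighbourhood_group` (the borough) and `price`. The column names, the toy rows and the use of the median as aggregation are assumptions, not taken from the original notebook. With the full data set the same `groupby` would yield 15 groups instead of the 3 shown here.

```python
import pandas as pd

# Toy listings; column names and values are assumed for illustration only.
listings = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Bronx", "Bronx",
                            "Brooklyn", "Brooklyn"],
    "room_type": ["Private room", "Private room", "Shared room", "Shared room",
                  "Entire home/apt", "Entire home/apt"],
    "price": [120, 180, 35, 45, 150, 250],
})

# One price distribution per (room_type, borough) pair.
groups = listings.groupby(["room_type", "neighbourhood_group"])["price"]
print(groups.ngroups)

# Median price per pair, laid out as a room_type x borough pivot table.
pivot = listings.pivot_table(
    index="room_type",
    columns="neighbourhood_group",
    values="price",
    aggfunc="median",
)
print(pivot)
```

Pairs that do not occur in the toy data show up as `NaN` cells in the pivot table.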
In the first part of our study we started to do that grouping, and the resulting pivot table was:
But in order to make a more precise outlier analysis we need more information about those 15 price distributions.
Which information?
Basically, we are going to do an extreme value analysis (EVA), which is only one of the possible kinds of outlier analysis.
The aim of an EVA is to find and describe which points lie at one of the ends of a probability or empirical distribution.
Because we have already seen that the empirical price distribution is right-skewed and cannot take negative values, the only possible outliers, if any, must lie in the right tail of the distribution.
The problem is to define a criterion to know when an outlier... well, is an outlier.
Indeed, outlier analysis is almost always an unsupervised problem: in other words, we have no past labels that tell us the threshold below/above which a point must be considered an outlier.
Generally, if the empirical distribution is symmetric and bears some similarity to a normal distribution, we use the sample average as the location metric and the sample standard deviation as the dispersion metric. For distributions of that type, those two statistics can define the threshold below/above which a point is an outlier.
In the field of statistical process control, for example, where the processes generally resemble one of the well-behaved distributions (normal, binomial, Poisson), the threshold is set at 3 sample standard deviations ($ \hat{\sigma} $) from the average value.
For example, the mean chart has two bounds:

$$ AverageValue \pm 3 \frac{\hat{\sigma}}{\sqrt{n}} $$
where AverageValue is calculated as the average of the sample averages (for instance, with 100 samples of 50 instances each, $n$ is 50 and AverageValue is the average of the 100 sample averages), and $ \hat{\sigma} $ is an estimate of the standard deviation of the random variable characterizing the industrial process (more details at https://en.wikipedia.org/wiki/Control_chart).
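The bounds can be sketched in a few lines of Python, assuming the usual three-sigma form $ AverageValue \pm 3\hat{\sigma}/\sqrt{n} $. The function name `mean_chart_bounds`, the toy samples and the choice of estimating $ \hat{\sigma} $ from the average within-sample variance are assumptions made for illustration; other estimators (for example ones based on sample ranges) are also common.

```python
import math
import statistics


def mean_chart_bounds(samples):
    """Return (lower, center, upper) three-sigma bounds for a mean chart.

    `samples` is a list of equally sized samples (lists of numbers).
    """
    n = len(samples[0])  # instances per sample
    sample_means = [statistics.mean(s) for s in samples]
    center = statistics.mean(sample_means)  # average of the sample averages
    # Assumption: sigma-hat is the square root of the mean within-sample variance.
    sigma_hat = math.sqrt(statistics.mean([statistics.variance(s) for s in samples]))
    half_width = 3 * sigma_hat / math.sqrt(n)
    return center - half_width, center, center + half_width


# Three toy samples of four instances each.
samples = [[9, 10, 11, 10], [10, 11, 12, 11], [8, 9, 10, 9]]
lower, center, upper = mean_chart_bounds(samples)
print(lower, center, upper)
```

A sample average falling outside `(lower, upper)` would be flagged as out of control.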
The reason for this threshold is that in a standard normal distribution 99.7% of the points lie in the range (-3, 3).
So if an empirical distribution reasonably follows a normal distribution and a normalized point (i.e. a point obtained by subtracting the average and dividing by the sample standard deviation) exceeds 3 in absolute value, a value that extreme would occur with a probability of only about 0.3% if the point came from the same generating process. This probability decreases quickly as the point moves further away from 3. For this reason the point is considered an outlier, in other words a point that is very different from most of the remaining data.
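The 99.7% figure and the normalization step can be checked with Python's standard library. This is only an illustrative sketch: the helper `z_scores` and the toy values are hypothetical.

```python
from statistics import NormalDist, mean, stdev

# Share of a standard normal distribution lying in (-3, 3).
std_normal = NormalDist()  # mean 0, standard deviation 1
inside = std_normal.cdf(3) - std_normal.cdf(-3)
print(round(inside, 4))  # 0.9973


def z_scores(values):
    """Normalize values: subtract the sample mean, divide by the sample std."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


# Toy data: twenty ordinary values plus one extreme value.
values = [9] * 6 + [10] * 8 + [11] * 6 + [40]
flagged = [v for v, z in zip(values, z_scores(values)) if abs(z) > 3]
print(flagged)
```

Note that the extreme value itself inflates the sample mean and standard deviation, which is one reason this rule works poorly on skewed data.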
"An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism" (D. Hawkins)
But our distribution is not symmetric, and we have already seen that extreme values are present: for this reason we preferred the median to the mean as the location metric, and the interquartile range to the standard deviation as the dispersion metric.
So we can't use the threshold of 3 standard deviations from the mean to spot an outlier.
A widely used threshold for the right tail of distributions that are not well-behaved is the value $ Q3 + 3\times IQR $ (sometimes the factor 1.5 is used instead of 3: let's say that the version with 3 is the severe threshold), where $ Q3 $ is the third quartile and $ IQR $ is the difference between the third quartile and the first quartile ($ Q3 - Q1 $).
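A minimal sketch of this threshold with NumPy follows. The function name `right_tail_threshold` and the toy prices are assumptions. Quartiles are computed with NumPy's default linear interpolation, so other quartile conventions may give slightly different values.

```python
import numpy as np


def right_tail_threshold(prices, factor=3.0):
    """Return Q3 + factor * IQR for a one-dimensional sample of prices."""
    q1, q3 = np.percentile(prices, [25, 75])
    return q3 + factor * (q3 - q1)


# Toy prices with one very expensive listing.
prices = [40, 50, 60, 70, 80, 90, 100, 110, 1000]

severe = right_tail_threshold(prices)            # factor 3
mild = right_tail_threshold(prices, factor=1.5)  # factor 1.5
print(severe, mild)
```

With these toy prices Q1 is 60 and Q3 is 100, so the listing priced at 1000 lies above both thresholds.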
Once we have calculated that threshold for each empirical distribution, we can calculate also the ratio between the maximum value of the distribution and that threshold.
If that ratio is greater than 1, then at least one point (the maximum) is an outlier.
Moreover, the value of this ratio tells us how extreme the most extreme outlier is.
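These quantities could be computed per group along the following lines. The sketch assumes pandas, hypothetical column names (`room_type`, `neighbourhood_group`, `price`) and a toy data set with only two groups; it is not the code used for the original analysis.

```python
import pandas as pd

# Toy data: two (room_type, borough) groups of five listings each.
listings = pd.DataFrame({
    "room_type": ["Private room"] * 5 + ["Shared room"] * 5,
    "neighbourhood_group": ["Queens"] * 5 + ["Brooklyn"] * 5,
    "price": [40, 50, 60, 70, 2000, 20, 25, 30, 35, 40],
})

grouped = listings.groupby(["room_type", "neighbourhood_group"])["price"]

# One row of summary statistics per group.
summary = pd.DataFrame({
    "count": grouped.size(),
    "median": grouped.median(),
    "q1": grouped.quantile(0.25),
    "q3": grouped.quantile(0.75),
    "max": grouped.max(),
})

# Severe right-tail threshold and how far the maximum exceeds it.
summary["threshold"] = summary["q3"] + 3 * (summary["q3"] - summary["q1"])
summary["max/threshold"] = summary["max"] / summary["threshold"]

# Most extreme groups first.
summary = summary.sort_values("max/threshold", ascending=False)
print(summary)
```

A `max/threshold` value above 1 means the group contains at least one outlier; a value below 1 means it contains none under this criterion.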
In the table below we report the main statistics discussed so far for each of the 15 empirical distributions.
The threshold column reports the $Q3 + 3 \times IQR $ value.
In the table above we have sorted the (room_type, borough) distributions by the max/threshold value in descending order.
For example, "Private room" is the room type with the most extreme outliers.
We can also see that Staten Island is the borough with the fewest listings and the least extreme outliers.
Besides, in the (Shared room, Staten Island) distribution no outlier is present (but we have only 9 listings there).
So now we have a clearer view of each distribution and we know that extreme values exist.
Through the last table we have measured the intensity of the most extreme outlier, with the max/threshold value.
But we don't know how many outliers exist for each (room_type, borough) distribution.
To find out, we wrote the following script (toggle the code button if you want to see it), and the result is reported:
As you can see, for each distribution we have reported the threshold and the number of listings with a price below the threshold (index = False) and the number of listings with a price above that threshold (index = True).
Then we reported the fraction of outliers for each distribution.
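One way such a count could be obtained with pandas is sketched below. This is an illustration, not the script referred to above: the column names and toy prices are assumptions, and the `below` and `above` columns play the role of the False and True index values.

```python
import pandas as pd

keys = ["room_type", "neighbourhood_group"]

# Toy data: two groups of nine listings each.
listings = pd.DataFrame({
    "room_type": ["Private room"] * 9 + ["Shared room"] * 9,
    "neighbourhood_group": ["Queens"] * 9 + ["Brooklyn"] * 9,
    "price": [40, 50, 60, 70, 80, 90, 100, 110, 2000,
              20, 22, 24, 26, 28, 30, 32, 300, 400],
})

# Per-group quartiles, broadcast back to every listing of the group.
grouped = listings.groupby(keys)["price"]
q1 = grouped.transform(lambda s: s.quantile(0.25))
q3 = grouped.transform(lambda s: s.quantile(0.75))
listings["threshold"] = q3 + 3 * (q3 - q1)
listings["is_outlier"] = listings["price"] > listings["threshold"]

# Threshold, listing count and outlier count for each group.
report = listings.groupby(keys).agg(
    threshold=("threshold", "first"),
    listings=("price", "count"),
    above=("is_outlier", "sum"),
)
report["below"] = report["listings"] - report["above"]
report["fraction"] = report["above"] / report["listings"]
print(report)
```

In this toy data the group with the most extreme outlier (Private room, Queens) is not the one with the largest fraction of outliers (Shared room, Brooklyn), which mirrors the distinction between the two measures.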
So while the distribution with the most extreme outlier is (Private room, Queens), with a max/threshold value of 62.8, the distribution with the largest fraction of outliers is (Shared room, Brooklyn), where 7% of listings are outliers.