In [1]:
from IPython.display import display
from IPython.display import HTML
import IPython.core.display as di # Example: di.display_html('<h3>%s:</h3>' % str, raw=True)

# This line will hide code by default when the notebook is exported as HTML
di.display_html('<script>jQuery(function() {if (jQuery("body.notebook_app").length == 0) { jQuery(".input_area").toggle(); jQuery(".prompt").toggle();}});</script>', raw=True)

# This line will add a button to toggle visibility of code blocks, for use with the HTML export version
di.display_html('''<button onclick="jQuery('.input_area').toggle(); jQuery('.prompt').toggle();">Toggle code</button>''', raw=True)
In [2]:
from IPython.core.display import display,HTML
display(HTML('<style>.prompt{width: 0px; min-width: 0px; visibility: collapse}</style>'))
In [3]:
%%javascript 
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}
In [4]:
import numpy as np, pandas as pd, matplotlib.pyplot as plt
plt.style.use('bmh')
import warnings
warnings.filterwarnings('ignore')

pd.options.display.float_format = '{:.6f}'.format
import seaborn as sns
import matplotlib
matplotlib.rcParams.update({'font.size': 12, 'font.family': 'Verdana'})
%matplotlib inline

%store -r df1
%store -r by_room_type

An Outlier Analysis of Airbnb listings prices


In the first part of our study ( https://antonio-catalano.github.io/NY_Airbnb_kaggle.html) we made an exploratory data analysis of a 98th truncated distribution of prices. Moreover we truncated the price distribution having as reference the overall price distribution.
In this second part we are going to be more precise about the outlier prices of the distribuion.
First of all, it's better to divide the overall distribution in 15 different sub-distributions, because a private-room listing in Manhattan will have an expected price very different compared to a shared-room listing in Bronx.
Indeed, if we group the listings by the room type and by the borough we have 15 distributions, because there are 3 room type and 5 boroughes.
In the first part of our study we started to do that grouping and the pivot table resulting was:

In [5]:
by_room_type
Out[5]:
median price IQR number listings
room_type neighbourhood_group
Entire home/apt Manhattan $ 191 $ 116 13198
Brooklyn $ 145 $ 94 9558
Queens $ 120 $ 78 2096
Bronx $ 100 $ 60 379
Staten Island $ 100 $ 76 176
Private room Manhattan $ 90 $ 52 7982
Brooklyn $ 65 $ 30 10126
Queens $ 60 $ 28 3372
Bronx $ 54 $ 30 651
Staten Island $ 50 $ 35 188
Shared room Manhattan $ 69 $ 40 480
Bronx $ 40 $ 28 60
Queens $ 37 $ 24 198
Brooklyn $ 36 $ 20 411
Staten Island $ 30 $ 46 9

But in order to make a more precise outlier analysis we need more informations about those 15 price distributions.
Which informations?
Basically, we are going to do an extreme value analysis (EVA), which is only one of the possible outlier analysis.
The aim of an EVA is to find and describe which points lie at one of the ends of a probability or empirical distribution.
Because we have already seen that the empirical price distribution is right-skewed and it can't be take negative values, the only possible outliers, if any, must lie in the right tail of the distribution.
The problem is to define a criterion to know when an outlier....well, is an outlier.
Indeed the outlier analysis is almost always an unsupervised problem: in other worlds, we haven't past labels which say to us the threshold below/above which a point must be considered an outlier point.
Generally, if the empirical distribution is simmetric and has some similarity with a normal distribution, as location metric we use the sample average, and as dispersion metric we use the sample standard deviation. Those two statistics, in those type of distributions, can define the threshold below/which a point is an outlier.
In the field of the statistical process control for example, where the processes generally resemble one of the well-behaved distributions (normal, binomial, poisson), the threshold is fixed to 3 sample standard deviation ($ \hat{\sigma} $) from the average value.
For example the mean chart has two bounds:

  • UpperBound = AverageValue + 3 $\times$ $\hat{\sigma} $ / $ \sqrt{n} $
  • LowerBound = AverageValue - 3 $\times$ $\hat{\sigma} $ / $ \sqrt{n} $

where AverageValue is calculated as average of averages (you can think that we have 100 samples and each sample has 50 instances: in this case $n$ is 50 and AverageValue is the average of each 100 averages), and $ \hat{\sigma} $ is an estimate of the standard deviation of the random variable characterizing the industrial process (more details in https://en.wikipedia.org/wiki/Control_chart).

The reason of this threshold is this: in a standard normal distribution the 99.7% of points are inside the value range (-3, 3).
So if in an empirical distribution that reasonably follows a normal distribution there is a normalized point (i.e. a point obtained subtracting the average and dividing by the sample standard deviation) that exceeds 3 in absolute value, there is only a 0.3% of probability that the point "comes" from the same diffusion process. And this probability decreases quickly as the point moves away from 3. For this reason this point is considered an outlier, in other words a point that it is very different from most of the remaing data.

"An outlier is an observation which deviates so much from the other observations as to arouse suspicious that it was generated by a different mechanism" (D. Hawkins)

But our distribution is not simmetric and we have already seen that extreme values are present: so we preferred the median to the mean as location metric, and the interquartile range to the standard deviation as dispersion metric.
So we can't use the threshold of 3 standard deviation from the mean to spot an outlier.

A largely used threshold for the right tail and for not well-behaved distributions is represented by the value: $ Q3 + 3\times IQR $ (sometimes we can find the factor 1.5 and not 3: let's say that the 3 version is the severe threshold), where $ Q3 $ is the third quartile and $ IQR $ is the difference between the third quartile and the first quartile ($ Q3 - Q1 $).
Once we have calculated that threshold for each empirical distribution, we can calculate also the ratio between the maximum value of the distribution and that threshold.
If that ratio is > 1 , so at least one point (the maximum) is an outlier.
Moreover, the value of this ratio says to us how extreme an outlier is.

In the table below we reported all the principal statistics we talked about for each of the 15 empirical distributions.
The threshold column reports the $Q3 + 3 \times IQR $ value.

In [6]:
from scipy.stats import iqr
In [7]:
t = lambda x: int(np.percentile(x,75) + 3*(np.percentile(x, 75) - np.percentile(x, 25)))
t.__name__ = 'threshold'
                                     
outs = pd.pivot_table(df1, values = ['price'], index = ['room_type','neighbourhood_group'], 
               aggfunc = ('min', lambda x: [np.percentile(x, 25), 
                                            np.percentile(x, 50), 
                                            np.percentile(x, 75),], 
                          'max', np.count_nonzero, iqr, t ))

outs.columns = outs.columns.droplevel(0)     
col = ['count_nonzero', 'min', '<lambda>', 'iqr', 'max','threshold']
outs = outs[col]
outs = outs.rename(columns = {'count_nonzero': 'number listings','<lambda>': 'Q1 Q2 Q3', 't': 'TRHESHOLD', 'iqr': 'iqr'.upper()})
outs.IQR = outs.IQR.values.astype(int)
outs['max/threshold'] = outs['max'] / outs['threshold']
outs.sort_values(by = 'max/threshold', ascending = False)
Out[7]:
number listings min Q1 Q2 Q3 IQR max threshold max/threshold
room_type neighbourhood_group
Private room Queens 3372 10 [47.0, 60.0, 75.0] 28 10000 159 62.893082
Brooklyn 10126 0 [50.0, 65.0, 80.0] 30 7500 170 44.117647
Manhattan 7982 10 [68.0, 90.0, 120.0] 52 9999 276 36.228261
Entire home/apt Brooklyn 9558 0 [105.0, 145.0, 199.0] 94 10000 481 20.790021
Manhattan 13198 0 [142.5, 191.0, 259.0] 116 10000 608 16.447368
Private room Bronx 651 0 [40.0, 53.5, 70.0] 30 2500 160 15.625000
Shared room Queens 198 11 [30.0, 37.0, 54.25] 24 1800 127 14.173228
Entire home/apt Staten Island 176 48 [75.0, 100.0, 151.25] 76 5000 380 13.157895
Shared room Brooklyn 411 0 [30.0, 36.0, 50.0] 20 725 110 6.590909
Entire home/apt Queens 2096 10 [90.0, 120.0, 168.25] 78 2600 403 6.451613
Shared room Bronx 60 20 [28.0, 40.0, 55.5] 27 800 138 5.797101
Manhattan 480 10 [49.0, 69.0, 89.25] 40 1000 210 4.761905
Entire home/apt Bronx 379 28 [80.0, 100.0, 140.0] 60 1000 320 3.125000
Private room Staten Island 188 20 [40.0, 50.0, 75.0] 35 300 180 1.666667
Shared room Staten Island 9 13 [29.0, 30.0, 75.0] 46 150 213 0.704225

In the table above we have sorted the (room_type, borough) distributions by the max/threshold value in descending order.
For example, the "Private room" type is the type where the outliers are the most extreme. We can also see that Staten Island is the borough with less listings and less extreme outliers. Besides, in the (Shared room, Staten island) no outlier is present (but we have only 9 listings there).

So now we have a clearer view of each distribution and we know that extreme values exist. Through the last table we have measured the intensity of the most extreme outlier, with the max/threshold value.
But we don't know how many outliers exist for each (room_type, borough) distribution.
In order to know that we made the following script (toogle the code button if you want to visualize it), and the result is reported:

In [8]:
from itertools import product
In [9]:
max_perc = 0
for apt, neigh in product(*outs.index.levels):
    cond = df1[(df1['room_type'] == apt) & (df1['neighbourhood_group'] == neigh)]
    print('(',apt, neigh,')', '--->', end = '  ')
    t = outs.loc[(apt, neigh)]['threshold']
    print('Q3 + 3*IQR =', t)  
    series = cond.price > t
    print(series.value_counts().to_frame().rename(columns = {'price': f'value_counts > {t}'}))
    try:
        perc_out = round(series.value_counts()[True] / series.value_counts().sum(),2)
        print('% of outliers:' , int(perc_out*100) ,'%')
    except:
        print('No outliers')
    if perc_out > max_perc:
        max_perc = perc_out
        max_combination = (apt, neigh)      
    print()
print('\x1b[0;37;40m' + ' Max % of outliers:', int(max_perc*100), '% in ', max_combination, '\x1b[0m')
( Entire home/apt Bronx ) --->  Q3 + 3*IQR = 320
       value_counts > 320
False                 362
True                   17
% of outliers: 4 %

( Entire home/apt Brooklyn ) --->  Q3 + 3*IQR = 481
       value_counts > 481
False                9307
True                  252
% of outliers: 3 %

( Entire home/apt Manhattan ) --->  Q3 + 3*IQR = 608
       value_counts > 608
False               12689
True                  510
% of outliers: 4 %

( Entire home/apt Queens ) --->  Q3 + 3*IQR = 403
       value_counts > 403
False                2058
True                   38
% of outliers: 2 %

( Entire home/apt Staten Island ) --->  Q3 + 3*IQR = 380
       value_counts > 380
False                 168
True                    8
% of outliers: 5 %

( Private room Bronx ) --->  Q3 + 3*IQR = 160
       value_counts > 160
False                 640
True                   12
% of outliers: 2 %

( Private room Brooklyn ) --->  Q3 + 3*IQR = 170
       value_counts > 170
False                9868
True                  264
% of outliers: 3 %

( Private room Manhattan ) --->  Q3 + 3*IQR = 276
       value_counts > 276
False                7649
True                  333
% of outliers: 4 %

( Private room Queens ) --->  Q3 + 3*IQR = 159
       value_counts > 159
False                3284
True                   88
% of outliers: 3 %

( Private room Staten Island ) --->  Q3 + 3*IQR = 180
       value_counts > 180
False                 184
True                    4
% of outliers: 2 %

( Shared room Bronx ) --->  Q3 + 3*IQR = 138
       value_counts > 138
False                  57
True                    3
% of outliers: 5 %

( Shared room Brooklyn ) --->  Q3 + 3*IQR = 110
       value_counts > 110
False                 385
True                   28
% of outliers: 7 %

( Shared room Manhattan ) --->  Q3 + 3*IQR = 210
       value_counts > 210
False                 453
True                   27
% of outliers: 6 %

( Shared room Queens ) --->  Q3 + 3*IQR = 127
       value_counts > 127
False                 189
True                    9
% of outliers: 5 %

( Shared room Staten Island ) --->  Q3 + 3*IQR = 213
       value_counts > 213
False                   9
No outliers

 Max % of outliers: 7 % in  ('Shared room', 'Brooklyn') 

As you can see, for each distribution we have reported the threshold and the number of listings with a price below the threshold (index = False) and the number of listings with a price above that threshold (index = True).
Then we reported the fraction of outliers for each distribution.
So while the distribution with the most extreme outlier is (Private room, Queens) with a max/threshold value of 62.8, the distribution with more outliers is the (Shared room, Brooklyn), where 7 % of listings are outliers.