Detect and Remove Outliers using Python
Outlier detection is the task of identifying observations that differ markedly from the rest of the data so that they can be examined or removed. Several techniques exist; this example uses the interquartile range (IQR), defined as IQR = Q3 - Q1, where Q1 and Q3 are the first and third quartiles. The IQR gives two limits: a lower limit, Q1 - 1.5 * IQR, and an upper limit, Q3 + 1.5 * IQR. Any value that lies below the lower limit or above the upper limit is treated as an outlier. Common outlier detection techniques include: 1. Interquartile range 2. z-score 3. DBSCAN, and many more. Outlier detection is also used in real-world applications such as brain tumor detection and cancer detection.
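To make the IQR rule concrete, here is a minimal sketch on a small made-up sample, with a z-score check alongside it for comparison. The function names `iqr_fences` and `zscore_outliers`, the sample values, and the z-score cut-off of 2.5 are assumptions chosen for illustration and are not part of the original notebook. A cut-off of 3 is a common convention, but a sample of only eight points cannot produce a z-score that large, so a lower value is used here.

```python
import numpy as np

def iqr_fences(values, k=1.5):
    """Return (lower, upper) limits: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def zscore_outliers(values, threshold=2.5):
    """Boolean mask of points whose absolute z-score exceeds the threshold."""
    arr = np.asarray(values, dtype=float)
    z = (arr - arr.mean()) / arr.std()
    return np.abs(z) > threshold

# Hypothetical sample: seven typical readings and one extreme value
sample = np.array([10, 12, 11, 13, 12, 11, 14, 95])

lower, upper = iqr_fences(sample)
iqr_mask = (sample < lower) | (sample > upper)
z_mask = zscore_outliers(sample)

print("IQR limits:", lower, upper)
print("IQR outliers:", sample[iqr_mask])
print("z-score outliers:", sample[z_mask])
```

Both checks flag the same extreme value on this sample. On real data the two methods can disagree, because the z-score depends on the mean and standard deviation, which are themselves pulled by the outliers.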
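import pandas as pd
import numpy as np

# Load the bike-sharing data (pandas is imported as pd, so use pd.read_csv)
BIKE = pd.read_csv("Bike.csv")

# Column groups used in the analysis
numeric_col = ['temp','hum','windspeed']
categorical_col = ['season', 'yr', 'mnth', 'holiday', 'weekday', 'workingday', 'weathersit']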
BIKE.boxplot(numeric_col)
<AxesSubplot:>
pip install matplotlib
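for x in ['windspeed']:
    # Third and first quartiles of the column, and the interquartile range
    q75, q25 = np.percentile(BIKE.loc[:, x], [75, 25])
    intr_qr = q75 - q25
    # Limits at 1.5 * IQR beyond each quartile
    # (named upper/lower to avoid shadowing the built-in max and min)
    upper = q75 + (1.5 * intr_qr)
    lower = q25 - (1.5 * intr_qr)
    # Replace values outside the limits with NaN so they can be dropped later
    BIKE.loc[BIKE[x] < lower, x] = np.nan
    BIKE.loc[BIKE[x] > upper, x] = np.nan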
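The same marking step can be wrapped in a reusable helper that handles several columns and leaves the input untouched. This is only a sketch: the function name `mask_iqr_outliers` and the small `demo` frame are hypothetical, and it assumes the columns contain no missing values beforehand, since `np.percentile` returns NaN when the input contains NaN.

```python
import numpy as np
import pandas as pd

def mask_iqr_outliers(df, columns, k=1.5):
    """Return a copy of df with IQR outliers in `columns` set to NaN."""
    out = df.copy()
    for col in columns:
        q25, q75 = np.percentile(out[col], [25, 75])
        iqr = q75 - q25
        lower, upper = q25 - k * iqr, q75 + k * iqr
        # True where the value falls outside the limits
        outside = (out[col] < lower) | (out[col] > upper)
        # Series.mask replaces the flagged entries with NaN
        out[col] = out[col].mask(outside)
    return out

# Hypothetical data: one extreme windspeed reading, no extreme humidity
demo = pd.DataFrame({
    "windspeed": [0.10, 0.12, 0.11, 0.13, 0.12, 0.11, 0.14, 0.95],
    "hum": [0.60, 0.62, 0.61, 0.63, 0.62, 0.61, 0.64, 0.65],
})

cleaned = mask_iqr_outliers(demo, ["windspeed", "hum"])
null_counts = cleaned.isnull().sum()
print(null_counts)
```

Working on a copy keeps the original frame available for comparison, which helps when checking how many values each column lost.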
BIKE.isnull().sum()
temp            0
hum             0
windspeed       3
cnt             0
season_1        0
season_2        0
season_3        0
season_4        0
yr_0            0
yr_1            0
mnth_1          0
mnth_10         0
mnth_11         0
mnth_12         0
mnth_2          0
mnth_3          0
mnth_4          0
mnth_5          0
mnth_6          0
mnth_7          0
mnth_8          0
mnth_9          0
weathersit_1    0
weathersit_2    0
weathersit_3    0
holiday_0       0
holiday_1       0
dtype: int64
BIKE = BIKE.dropna(axis = 0)
BIKE.isnull().sum()
temp            0
hum             0
windspeed       0
cnt             0
season_1        0
season_2        0
season_3        0
season_4        0
yr_0            0
yr_1            0
mnth_1          0
mnth_10         0
mnth_11         0
mnth_12         0
mnth_2          0
mnth_3          0
mnth_4          0
mnth_5          0
mnth_6          0
mnth_7          0
mnth_8          0
mnth_9          0
weathersit_1    0
weathersit_2    0
weathersit_3    0
holiday_0       0
holiday_1       0
dtype: int64
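As a variation on the mark-then-drop steps above, the rows can also be filtered in a single step with a boolean mask. The sketch below is one possible way to do it; the helper name `drop_iqr_outliers` and the `demo` data are hypothetical. It uses `Series.quantile` and `Series.between`, which keeps values equal to either limit, matching the strict `<` and `>` comparisons used when marking outliers.

```python
import pandas as pd

def drop_iqr_outliers(df, column, k=1.5):
    """Return only the rows whose `column` value lies within the IQR limits."""
    q25 = df[column].quantile(0.25)
    q75 = df[column].quantile(0.75)
    iqr = q75 - q25
    # between() is inclusive on both ends by default
    inside = df[column].between(q25 - k * iqr, q75 + k * iqr)
    return df[inside]

# Hypothetical data with one extreme windspeed reading in the last row
demo = pd.DataFrame({
    "windspeed": [0.10, 0.12, 0.11, 0.13, 0.12, 0.11, 0.14, 0.95],
    "cnt": [100, 120, 110, 130, 125, 115, 140, 90],
})

filtered = drop_iqr_outliers(demo, "windspeed")
rows_removed = len(demo) - len(filtered)
print("rows removed:", rows_removed)
```

Comparing the row counts before and after filtering is a quick check that only the expected number of rows was removed.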