Data cleaning identifies “dirty” data and fixes it. Before we can clean a dataset, we must know:

- What kind of data is in our dataset?

- What are the attributes and how are they related?

There are four types of attributes that we want to pay attention to:

- Nominal (labels): names of things, categories, tags, genres.

- Ordinal (ordered): Likert scales, high/medium/low, G/PG/PG-13/R/NC-17.

- Interval (ordered, with meaningful differences): dates, times, temperature in °C or °F.

- Ratio (ordered, with meaningful differences and a true zero): money, elapsed time, height/weight, age.
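
To make the distinction concrete, here is a minimal sketch of which operations make sense for each attribute type. It uses only the Python standard library; the sample values and the `LEVELS` ordering are made up for illustration.

```python
from datetime import date
from statistics import mode

# Nominal: only equality and counting are meaningful.
genres = ["drama", "comedy", "drama", "horror"]
most_common_genre = mode(genres)

# Ordinal: order is meaningful, distances between levels are not.
LEVELS = ["low", "medium", "high"]  # assumed ordering, lowest first
ratings = ["high", "low", "medium", "high", "low"]
ranked = sorted(ratings, key=LEVELS.index)
median_rating = ranked[len(ranked) // 2]

# Interval: differences are meaningful, ratios are not.
start, end = date(2024, 1, 1), date(2024, 3, 1)
days_between = (end - start).days

# Ratio: a true zero makes statements like "20% taller" meaningful.
heights_cm = [150.0, 180.0]
height_ratio = heights_cm[1] / heights_cm[0]

print(most_common_genre, median_rating, days_between, height_ratio)
```

Knowing the type up front tells you which cleaning steps are even valid: averaging a nominal column, for example, is meaningless, while taking its most frequent value is not.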

Common Data Cleaning Tasks

The following data cleaning tasks apply mainly to structured data:

- Import & export of datasets

- Naming or renaming variables

- Changing the type of variables (also known as explicit coercion)

- Sorting on one or more variables, and removing duplicate keys or entire duplicate records

- Selecting columns from input dataset to output dataset

- Filtering of rows based on one or more conditions

- Creating new variables through functions of existing variables

- Conditional processing of variables (i.e., the value of the new variable is based on the values of existing variables)

- Appending tables

- Joining tables (Inner Join, Left and Right Join, Full Outer Join)

- Transposing tables

- Summarizing a column, overall or by group

- Normalizing and standardizing columns (for continuous variables)

- Binning of continuous variables

- Imputing missing values in a variable

Data analysts and data scientists need to be familiar with all of these tasks.
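
Several of the tasks listed above can be sketched in a few lines. The example below is only an illustration, not a recommended pipeline: the records, the column names (`id`, `region`, `age`, `income`), the age cut-offs and the income threshold are all hypothetical. It sticks to the Python standard library; in practice a dataframe library would usually do the same work.

```python
from collections import defaultdict
from statistics import mean, median, pstdev

# Hypothetical raw records, as they might look right after a CSV import.
raw = [
    {"id": "1", "region": "north", "age": "34", "income": "5200"},
    {"id": "2", "region": "south", "age": "", "income": "3100"},
    {"id": "2", "region": "south", "age": "", "income": "3100"},  # duplicate
    {"id": "3", "region": "north", "age": "51", "income": "7800"},
    {"id": "4", "region": "south", "age": "22", "income": "2400"},
]

# Remove entire duplicate records, keeping the first occurrence.
seen, rows = set(), []
for rec in raw:
    key = tuple(sorted(rec.items()))
    if key not in seen:
        seen.add(key)
        rows.append(dict(rec))

# Change variable types (explicit coercion); empty strings become None.
for r in rows:
    r["id"] = int(r["id"])
    r["age"] = int(r["age"]) if r["age"] else None
    r["income"] = float(r["income"])

# Impute missing ages with the median of the observed ages.
age_fill = median([r["age"] for r in rows if r["age"] is not None])
for r in rows:
    if r["age"] is None:
        r["age"] = age_fill

# Standardize income (z-score) as a new variable.
incomes = [r["income"] for r in rows]
mu, sigma = mean(incomes), pstdev(incomes)
for r in rows:
    r["income_z"] = (r["income"] - mu) / sigma

# Bin a continuous variable; the cut-offs are arbitrary.
def age_group(age):
    if age < 30:
        return "young"
    if age < 50:
        return "middle"
    return "senior"

for r in rows:
    r["age_group"] = age_group(r["age"])

# Filter rows on a condition and select a subset of columns.
high_income = [
    {"id": r["id"], "income": r["income"]} for r in rows if r["income"] >= 3000
]

# Sort on one variable, highest income first.
by_income = sorted(rows, key=lambda r: r["income"], reverse=True)

# Summarize a column by group.
groups = defaultdict(list)
for r in rows:
    groups[r["region"]].append(r["income"])
mean_income_by_region = {region: mean(vals) for region, vals in groups.items()}

print(mean_income_by_region)
```

The median is used for imputation here only because it is robust to outliers; whether the mean, the median or a model-based estimate is appropriate depends on the data.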

Importance of Data Cleaning

- Data cleaning plays an essential role in training a model, and its importance is hard to overstate. No matter what algorithm you use, bad data tends to produce bad results. Professional data scientists know this and report that data cleaning takes up to 70% of the time spent on a data science project.

- Better, cleaner data often beats a better algorithm. A very simple algorithm run on clean data can give impressive results, whereas the best algorithm run on messy data will most likely not give the desired result.
