In this post I will five into topic: Pandas Data Cleaning And Preparation. You will learn how to clean and prepare your data for analysis in this part. You will learn how to deal with missing and duplicate data, as well as fundamental data transformations.
Introduction
Data Cleaning
The act of preparing data for analysis by finding and fixing flaws and inconsistencies is known as data cleaning. It is a critical stage in the data science pipeline that ensures the data is correct, consistent, and usable for analysis.
Tasks such as eliminating duplicate entries, filling in missing values, repairing data entry mistakes, and standardising data formats can all be part of data cleaning. Data cleaning seeks to make data as clean and trustworthy as possible before it is used for further analysis or machine learning.
Data cleaning can be a time-consuming procedure, but it is critical for guaranteeing that the outcomes of any analysis or modelling are correct and reliable.
Why Is Data Cleaning Important?
Data cleaning is necessary to assure the quality and accuracy of the data used for analysis and modelling. Poor data quality can result in erroneous or inconsistent findings, as well as concerns with machine learning models such as overfitting.
Here are a few reasons why data cleaning is essential:
- Improves data quality: Data cleaning aids in the identification and correction of flaws and inconsistencies in data, hence enhancing its overall quality.
- Increases data reliability: Data cleaning improves data dependability by eliminating mistakes and inconsistencies, making it more suitable for analysis and modelling.
- Enhances data usability: Data cleaning makes data more useful for analysis and modelling by standardising data formats and filling in missing values.
- Increases efficiency: Cleaning the data before analysing it can save time and money by making it more usable and decreasing the need for extra cleaning steps during the analysis.
- Enhances the performance of machine learning models: Clean data can help machine learning models perform better since they can learn better from clean and trustworthy data.
By cleaning the data we can ensure that it is correct, consistent, and usable for analysis and modelling, which can lead to better and more accurate findings.
Pandas Data Cleaning Tutorial
Pandas Data Cleaning Techniques
Python's Pandas package offers several techniques for cleaning (i.e., preprocessing) data. Among the most common approaches are:
- Dropping missing values: This entails eliminating rows or columns from the dataset that have missing or null values. The `dropna()` function may be used to do this.
- Filling missing values: This entails replacing missing or null data with a default value, such as the column's mean or median, or with values from another source. The `fillna()` function may be used to do this.
- Replacing values: This entails substituting new values for specified values in a column. The `replace()` function may be used to do this.
- Renaming columns: This entails renaming columns in a DataFrame to make them more relevant or consistent. The `rename()` function may be used to do this.
- Removing duplicate data: This entails locating and eliminating duplicate rows in the dataset. The `drop_duplicates()` function may be used to do this.
- Standardizing data: This entails converting all values in a column to a standardised format or scale. This may be accomplished by combining the `apply()` method with a custom function.
- Data validation: This entails checking whether the data is valid according to particular rules or constraints, and deleting or repairing erroneous data. This may be accomplished by combining the `apply()` method with a custom function.
- Data transformation: This entails performing mathematical or statistical operations on the data to alter its presentation or structure. This may be accomplished by using the `apply()` method with a built-in or custom function.
- Data normalization: This entails converting the data to a standard scale or range, such as 0 to 1. This may be accomplished using either the min-max scaling approach or the z-score method.
- Data aggregation: This entails grouping data by one or more columns and using a summary function to produce aggregate values, such as a sum or mean. This is possible with the `groupby()` and `aggregate()` methods.
- Data pivoting: This entails reshaping the data by constructing a new DataFrame with one or more columns serving as the index and one or more columns serving as the values. This may be accomplished with the `pivot()` or `pivot_table()` methods.
- Data filtering: This entails limiting the data to a subset depending on certain criteria, such as a value in a column or a range of values. This may be accomplished with the `query()` function or boolean indexing.
- Data merging: This entails combining data from multiple DataFrames or sources into a single DataFrame. This may be accomplished using the `merge()` or `join()` methods.
These are only a handful of the data cleaning procedures that the Pandas library can perform. The right approach depends on the specific needs and aims of the data cleaning task at hand.
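As a quick illustration, a few of these techniques can be chained together on one DataFrame. This is just a sketch: the column names and fill values below are made-up examples, not a fixed recipe.

```python
import pandas as pd
import numpy as np

# example DataFrame with a missing value and an inconsistent label
df = pd.DataFrame({
    "name": ["Tom", "Mike", "Paul"],
    "age": [25, np.nan, 35],
    "gender": ["M", "M", "Female"],
})

# filling missing values: use the column mean as the default
df["age"] = df["age"].fillna(df["age"].mean())

# replacing values: standardise an inconsistent label
df["gender"] = df["gender"].replace("Female", "F")

# renaming columns: make the headers consistent
df = df.rename(columns={"name": "Name", "age": "Age", "gender": "Gender"})

print(df)
```

Note that `fillna()`, `replace()`, and `rename()` all return a new DataFrame by default, so the result must be assigned back (or selected into a column) for the change to stick.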
Pandas Data Cleaning And Preparation
Pandas Handling Missing Data
Missing data can arise for a variety of reasons, including measurement mistakes, data entry problems, or data gathering constraints. To deal with missing data, use Pandas' `isnull()` and `sum()` methods to check for missing values, and the `dropna()` function to remove the rows or columns that contain them.
Let’s take a look at the following example:
```python
import pandas as pd
import numpy as np

# create a dataframe
data = {'Name': ['Tom', 'Mike', 'Paul', 'N/A'],
        'Age': [25, 30, np.nan, 35],
        'Gender': ['M', 'M', 'F', 'N/A']}
df = pd.DataFrame(data)
```
This produces the following DataFrame:
```
   Name   Age Gender
0   Tom  25.0      M
1  Mike  30.0      M
2  Paul   NaN      F
3   N/A  35.0    N/A
```
Pandas Check For Missing Values
To check how many missing values are in your data, you can simply chain the `isnull().sum()` methods. In the following example the `Age` column contains one null value, which shows up as the `1` next to `Age` in the results below:
```python
print(df.isnull().sum())
```

Results:

```
Name      0
Age       1
Gender    0
dtype: int64
```
Pandas Drop Rows With Missing Values
To drop all rows with missing values you can use the `dropna()` function. As you can see, the row with index 2 (`2  Paul  NaN  F`) was dropped:
```python
df.dropna(inplace=True)
print(df)
```

Results:

```
   Name   Age Gender
0   Tom  25.0      M
1  Mike  30.0      M
3   N/A  35.0    N/A
```
And when you check the Pandas DataFrame for missing values again after cleaning, you will get the following results:
```python
print(df.isnull().sum())
```

Results:

```
Name      0
Age       0
Gender    0
dtype: int64
```
As you can see, all the `sum()` results are now 0 (zero). This means that our Pandas DataFrame no longer contains any null values.
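One caveat worth noting: `dropna()` only detects true `NaN` values, so the literal string `'N/A'` in the `Name` and `Gender` columns above is not treated as missing. A small sketch of how you might handle both cases, using `replace()` to convert the placeholder strings into real `NaN`s and `fillna()` as an alternative to dropping rows (the `'Unknown'` fill values here are just examples):

```python
import pandas as pd
import numpy as np

data = {'Name': ['Tom', 'Mike', 'Paul', 'N/A'],
        'Age': [25, 30, np.nan, 35],
        'Gender': ['M', 'M', 'F', 'N/A']}
df = pd.DataFrame(data)

# the string 'N/A' is not a real NaN, so convert it first
df = df.replace('N/A', np.nan)
print(df.isnull().sum())  # now Name, Age, and Gender each report 1

# instead of dropping rows, fill the gaps with defaults:
# the column mean for Age, a placeholder for the text columns
df['Age'] = df['Age'].fillna(df['Age'].mean())
df = df.fillna({'Name': 'Unknown', 'Gender': 'Unknown'})
print(df)
```

This keeps all four rows, whereas `dropna()` would have discarded them; which strategy is right depends on how much data you can afford to lose.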
Pandas Handling Duplicate Data
Data entry mistakes, data collection faults, or data integration difficulties can all result in duplicate data. To manage duplicate data, use Pandas' `duplicated()` function to look for duplicate rows and the `drop_duplicates()` method to remove them.
In the following example the duplicated row is the last one: Name=Tom, Age=25, Gender=M.
```python
import pandas as pd

# create a dataframe
data = {'Name': ['Tom', 'Mike', 'Rachel', 'Tom'],
        'Age': [25, 30, 35, 25],
        'Gender': ['M', 'M', 'F', 'M']}
df = pd.DataFrame(data)

# check for duplicate rows
duplicates = df.duplicated().sum()
print(f"Found {duplicates} duplicates")

# drop duplicate rows (returns a new DataFrame, so assign it back)
df = df.drop_duplicates()
print(df)
```
When you execute the above code you will get the following results:
```
Found 1 duplicates
     Name  Age Gender
0     Tom   25      M
1    Mike   30      M
2  Rachel   35      F
```
As you can see, after deduplication the duplicated record for `Name=Tom, Age=25, Gender=M` was dropped.
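By default `duplicated()` and `drop_duplicates()` compare entire rows and keep the first occurrence. Both also accept `subset` and `keep` parameters for finer control; a short sketch (the data here is a variation on the example above, with the second Tom aged 26):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'Mike', 'Rachel', 'Tom'],
                   'Age': [25, 30, 35, 26],
                   'Gender': ['M', 'M', 'F', 'M']})

# treat rows as duplicates when only the Name column matches
by_name = df.drop_duplicates(subset=['Name'])
print(by_name)  # keeps the first Tom (Age 25)

# keep the last occurrence instead of the first
last = df.drop_duplicates(subset=['Name'], keep='last')
print(last)     # keeps the last Tom (Age 26)
```

Using `subset` is handy when a key column (such as an ID) should be unique even though other columns differ between the duplicated rows.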
Summary
The term "Pandas Handling Missing Data" refers to the process of discovering and addressing missing or null values in a dataset. Dropping missing values, filling in missing values with a default value or a value from another source, and displaying and analysing the missing data are all examples of actions that may be performed. When dealing with missing data, the objective is to make the data as complete and accurate as possible before using it for further analysis or modelling.
The term "Pandas Handling Duplicate Data" refers to the process of discovering and addressing duplicate values in a dataset. This includes tasks like deleting duplicate values, displaying and analysing duplicate data, and ensuring data integrity and consistency. The purpose of dealing with duplicate data is to ensure that the data is clean, accurate, and free of duplication before using it for further analysis or modelling.
Could You Please Share This Post?
I appreciate It And Thank YOU! :)
Have A Nice Day!