Introduction
Data in its raw form is rarely clean. It often comes with imperfections like missing values, duplicates, inconsistent formatting, and even outliers. If these issues are not handled properly, they can distort your analysis, lead to incorrect business decisions, and even mislead machine learning models. Just like you wouldn’t cook with dirty ingredients, you shouldn’t analyze data without cleaning it first.
That’s where data cleaning comes in. It’s the foundational step of every data project, ensuring your insights are accurate and trustworthy.
Python’s Pandas library is a go-to tool for data cleaning. In this blog, we’ll dive into practical techniques and helpful tricks that every analyst should master to clean data efficiently with Pandas.
Handling Missing Values
Missing data is one of the most common issues analysts face. It can occur due to system glitches, human error, or incomplete surveys. If not treated correctly, missing values can:
- Skew statistical results (e.g., average income drops if many entries are missing).
- Break machine learning models (some algorithms can’t handle missing data).
- Lead to biased decisions or conclusions.
That’s why identifying and treating missing values is a top priority.
Detecting Missing Values
To check for missing values in a dataset, you can use the isnull() function:
df.isnull().sum()
This tells you how many values are missing in each column, which is your first step toward understanding data quality.
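If you prefer percentages, a quick sketch of the same check (assuming df is your DataFrame) looks like this:
# Share of missing values per column, as a percentage
missing_pct = df.isnull().mean() * 100
print(missing_pct.sort_values(ascending=False))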
Filling Missing Values
One common way to handle missing data is to fill it with appropriate values. Depending on the situation, you can use different approaches.
- Using median is preferred for numerical fields like Age to reduce the influence of outliers.
- Filling with 0 is useful for fields like salary if a missing value means “no salary” (e.g., unpaid interns).
# Fill missing ages with median, salary with 0
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Salary'] = df['Salary'].fillna(0)
For time-series data or when previous/next values are logical replacements:
- Forward fill (ffill) takes the last known value and propagates it forward.
- Backward fill (bfill) does the opposite: it fills gaps using the next known value.
# Forward-fill and backward-fill remaining missing values
df = df.ffill()
df = df.bfill()
These techniques are helpful in log files, sensor data, or stock price data where values are often sequential.
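As a tiny standalone illustration (using a made-up sensor-reading series, not a column from this dataset):
import pandas as pd
import numpy as np

# Hypothetical sensor readings with gaps
readings = pd.Series([21.5, np.nan, np.nan, 22.1, np.nan])
print(readings.ffill())  # 21.5, 21.5, 21.5, 22.1, 22.1
print(readings.bfill())  # 21.5, 22.1, 22.1, 22.1, NaN (nothing comes after the last gap)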
Removing Duplicates
Duplicate records refer to repeated entries in a dataset that contain the same or nearly identical information. These can occur due to data entry errors, system glitches, or merging datasets from multiple sources without proper checks.
Duplication can:
- Overestimate your sales performance.
- Skew churn rates.
- Break assumptions in data models that expect unique entries.
Cleaning duplicates ensures data integrity and prevents analysis errors.
Detecting Duplicates
To identify duplicate rows:
df[df.duplicated()]
You can also check for duplicates in a specific column:
df[df.duplicated(subset='Name')]
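Before removing anything, it helps to know how many duplicates you are dealing with. A quick sketch:
# Count fully duplicated rows and duplicates by 'Name'
print(df.duplicated().sum())
print(df.duplicated(subset='Name').sum())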
Removing Duplicates
You can remove duplicates with a single line:
# Remove complete duplicate rows
df.drop_duplicates(inplace=True)
Or keep only the first instance based on a specific column:
# Remove duplicates based on 'Name' column, keeping first occurrence
df.drop_duplicates(subset=['Name'], keep='first', inplace=True)
This ensures your dataset contains only unique and relevant records, giving you a true reflection of your data.
Standardizing Data Formats
Let’s say you’re trying to group users by name, but your data has entries like:
“john”, “John ”, “ JOHN”
They all refer to the same person, but due to inconsistent formatting, your analysis treats them as different individuals. This can wreck grouping, filtering, and summarizing operations.
Cleaning Text
To standardize text data:
# Convert names to lowercase and strip extra spaces
df['Name'] = df['Name'].str.strip().str.lower()
This:
- Removes leading and trailing spaces using .strip()
- Converts everything to lowercase using .lower()
The result is clean, consistent text that’s ready for grouping or analysis.
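As a quick sanity check, here is a minimal sketch using the hypothetical “john” example from above:
import pandas as pd

# Three spellings of the same name collapse to one after cleaning
names = pd.DataFrame({'Name': ['john', 'John ', ' JOHN']})
names['Name'] = names['Name'].str.strip().str.lower()
print(names['Name'].nunique())  # 1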
Cleaning Dates
Dates often come in various formats like “12/01/2023” or “01-12-23”, and unless standardized, they cannot be sorted or filtered properly. Standardized dates are essential for trend analysis, filtering by time periods, and visualization.
# Standardize date column to datetime format
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
This command attempts to convert the Date column into a standard datetime64 format that Python can recognize and work with.
- errors='coerce' tells Pandas to replace invalid or unparseable dates with NaT (Not a Time), which can later be handled like a missing value.
Once converted, you can easily:
- Extract year, month, or day using:
df['Date'].dt.year
- Filter rows based on dates e.g.:
df[df['Date'] > '2023-01-01']
- Resample time-series data (e.g., weekly or monthly)
In short, a clean date format unlocks the power of time-based insights.
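For example, here is a minimal sketch of a monthly roll-up, assuming the dataset also has the numeric Salary column used later in this post:
# Average salary per month, using the cleaned Date column
monthly_avg = df.groupby(df['Date'].dt.to_period('M'))['Salary'].mean()
print(monthly_avg)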
Handling Outliers
Outliers are extreme values that lie far outside the range of the majority of your data. They can result from:
- Data entry errors (e.g., entering ₹1,000,000 instead of ₹10,000)
- Legitimate rare events (e.g., a CEO salary in employee data)
- Measurement issues
Outliers can distort averages, confuse machine learning models, and hide patterns in data.
Detecting Outliers Using IQR
The Interquartile Range (IQR) method is a robust way to detect outliers:
# Detecting rows where salary is an outlier
Q1 = df['Salary'].quantile(0.25)
Q3 = df['Salary'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Flag rows where salary falls outside the IQR bounds
outliers = df[(df['Salary'] < lower_bound) | (df['Salary'] > upper_bound)]
- Q1 (25th percentile) and Q3 (75th percentile) define the middle 50% of your data.
- IQR (Interquartile Range) is the spread between Q3 and Q1.
- Values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are considered outliers.
Removing Outliers
# Remove rows where salary is an outlier
df = df[(df['Salary'] >= lower_bound) & (df['Salary'] <= upper_bound)]
This keeps only the “reasonable” values in your salary column and removes extreme outliers that can distort your insights.
Note: Always review outliers before removing them. Some may be valid, like a real high-earning executive or an unusually large transaction.
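If you would rather inspect outliers before deciding, an alternative sketch is to flag them in a new (hypothetical) Salary_Outlier column instead of dropping them right away:
# Flag outliers for review instead of deleting them immediately
df['Salary_Outlier'] = (df['Salary'] < lower_bound) | (df['Salary'] > upper_bound)
print(df[df['Salary_Outlier']])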
Mapping and Replacing Values
Data inconsistencies, such as different labels for the same category (e.g., ‘M’ and ‘Male’), can cause incorrect aggregations, misleading visualizations, and bad modeling results.
Standardizing Categorical Values
# Standardize gender values
df['Gender'] = df['Gender'].replace({'M': 'Male', 'F': 'Female'})
This normalizes the values and ensures that inconsistent labels like ‘M’ and ‘Male’ are standardized to a single form.
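A quick way to confirm the mapping worked is to inspect the remaining categories:
# Only the standardized labels should remain
print(df['Gender'].value_counts())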
Creating New Columns
You can also categorize data dynamically:
# Create an Experience_Level column
df['Experience_Level'] = df['Years_of_Experience'].apply(
    lambda x: 'Senior' if x > 5 else 'Junior'
)
This is powerful for:
- Comparative analysis (e.g., average performance by experience level; see the sketch after this list).
- Targeted reporting.
- Creating input features for machine learning models.
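To make the first point concrete, here is a minimal sketch of a comparative analysis, assuming the numeric Salary column from earlier:
# Average salary by experience level
print(df.groupby('Experience_Level')['Salary'].mean())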
Splitting and Merging Columns
Sometimes data is combined into a single column but holds multiple types of information. A classic example: full names stored as “John Smith” in one field.
To personalize reports, match user records, or group by surname, you need separate first and last names.
# Split and Merge Names (Optional)
if 'Name' in df.columns:
    name_parts = df['Name'].str.split(' ', n=1, expand=True)
    df['First_Name'] = name_parts[0].str.capitalize()
    df['Last_Name'] = name_parts[1].str.capitalize().fillna('') if name_parts.shape[1] > 1 else ''
    df['Full_Name'] = (df['First_Name'] + ' ' + df['Last_Name']).str.strip()
This ensures:
- Clean, capitalized first and last names.
- No data loss even when only one name is present.
- Easier sorting, filtering, or merging based on names.
Merging columns back is also easy with string concatenation, as shown in the Full_Name example above.
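With the names split out, sorting or grouping by surname becomes a one-liner, for example:
# Sort records by last name, then first name
df = df.sort_values(['Last_Name', 'First_Name'])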
Filtering and Selecting Data Efficiently
When working with large datasets, filtering helps you focus on relevant rows and discard unnecessary ones. This speeds up performance and makes your insights sharper.
Filtering Based on Conditions
# Keep only rows where age is greater than 25
df = df[df['Age'] > 25]
You can apply any condition: filter by product category, customer location, date range, and more.
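Conditions can also be combined with & (and) and | (or). A small sketch, assuming the Date and Experience_Level columns created earlier:
# Senior employees with records after 1 Jan 2023
recent_seniors = df[(df['Experience_Level'] == 'Senior') & (df['Date'] > '2023-01-01')]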
Filtering before analysis is like setting up your chessboard before the match: you want only the pieces that matter.
Conclusion
Cleaning data is a vital step in ensuring accurate and meaningful insights. Pandas provides powerful tools to handle missing values, remove duplicates, standardize formats, detect outliers, and refine datasets.
By mastering these data cleaning techniques, analysts can improve efficiency and enhance the quality of their analysis.
Bonus: To help you practice everything covered in this blog, I’ve attached both the messy data CSV file and the Python code script. Feel free to download them and try cleaning the dataset yourself; it’s a fun way to master these tricks!