Data Science & Machine Learning
June 5, 2025 at 07:08 PM
Today, let's move to the next topic in the Data Science Learning Series:
📌 *Handling Duplicates in a Dataset*
🧠 *What Are Duplicates?*
Duplicates are rows in your dataset that appear identical — same values across all columns. They can distort analysis, especially in counting, aggregations, or machine learning models.
✅ *Why Remove Duplicates?*
• Skews statistical summaries
• Causes biased ML models
• Affects insights and visualizations
• Increases memory & computation unnecessarily
🛠 *How to Detect and Remove Duplicates (in Pandas)*
```python
import pandas as pd
df = pd.read_csv("data.csv")
# Check for duplicates (boolean Series, True for repeated rows)
print(df.duplicated())

# Count duplicates
print(df.duplicated().sum())

# Drop duplicates (keeps the first occurrence by default)
df_cleaned = df.drop_duplicates()

# Keep the last occurrence instead
df_cleaned = df.drop_duplicates(keep='last')
```
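To see what these calls return, here's a tiny runnable sketch with made-up data (the column names are just for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "CustomerID": [1, 2, 2, 3],
    "Amount":     [10, 20, 20, 30],
})

print(df.duplicated())        # False, False, True, False (row 2 repeats row 1)
print(df.duplicated().sum())  # 1
print(df.drop_duplicates())   # keeps only the first of the two identical rows
```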
🔍 *Real-Life Example*
Imagine a sales report where one transaction was accidentally entered twice — that would inflate revenue if not removed.
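A minimal sketch of that scenario, with hypothetical invoice numbers and amounts:

```python
import pandas as pd

# Hypothetical sales report; invoice INV-002 was entered twice by mistake
sales = pd.DataFrame({
    "InvoiceID": ["INV-001", "INV-002", "INV-002", "INV-003"],
    "Revenue":   [100, 500, 500, 200],
})

print(sales["Revenue"].sum())                    # 1300 - inflated total
print(sales.drop_duplicates()["Revenue"].sum())  # 800  - after removing the duplicate
```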
🧪 *Tip:*
You can also check duplicates in specific columns:
`df.duplicated(subset=["CustomerID", "InvoiceDate"])`
This helps when rows match on key columns but differ slightly in others.
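Here's a small sketch of that case (column and value names are hypothetical):

```python
import pandas as pd

# Hypothetical orders; the two rows for customer 101 differ only in the Notes column
orders = pd.DataFrame({
    "CustomerID":  [101, 101, 102],
    "InvoiceDate": ["2025-06-01", "2025-06-01", "2025-06-02"],
    "Notes":       ["first entry", "re-entered later", "ok"],
})

# A full-row check misses them, a subset check catches them
print(orders.duplicated().sum())                                      # 0
print(orders.duplicated(subset=["CustomerID", "InvoiceDate"]).sum())  # 1

orders_cleaned = orders.drop_duplicates(subset=["CustomerID", "InvoiceDate"])
```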
*React with ❤️ once you're ready for the quiz*
Data Science Learning Series: https://whatsapp.com/channel/0029Va8v3eo1NCrQfGMseL2D/998
Python Cheatsheet: https://whatsapp.com/channel/0029VaiM08SDuMRaGKd9Wv0L/1660