Data Science & Machine Learning
June 5, 2025 at 07:08 PM
Today, let's move to the next topic in the Data Science Learning Series:

📌 *Handling Duplicates in a Dataset*

🧠 *What Are Duplicates?*
Duplicates are rows in your dataset that are exactly identical: the same values across all columns. They can distort analysis, especially counts, aggregations, and machine learning models.

✅ *Why Remove Duplicates?*
• They skew statistical summaries
• They bias ML models
• They distort insights and visualizations
• They waste memory and computation

🛠 *How to Detect and Remove Duplicates (in Pandas)*
```python
import pandas as pd

df = pd.read_csv("data.csv")

# Flag duplicate rows (True for every repeat after the first)
print(df.duplicated())

# Count duplicate rows
print(df.duplicated().sum())

# Drop duplicates, keeping the first occurrence (the default)
df_cleaned = df.drop_duplicates()

# Or keep the last occurrence instead
df_cleaned = df.drop_duplicates(keep='last')
```

🔍 *Real-Life Example*
Imagine a sales report where one transaction was accidentally entered twice: if not removed, the repeat would inflate revenue.

🧪 *Tip:* You can also check for duplicates in specific columns only:
`df.duplicated(subset=["CustomerID", "InvoiceDate"])`
This helps when rows are duplicated in the key columns but differ slightly elsewhere.

*React with ❤️ once you're ready for the quiz*

Data Science Learning Series: https://whatsapp.com/channel/0029Va8v3eo1NCrQfGMseL2D/998
Python Cheatsheet: https://whatsapp.com/channel/0029VaiM08SDuMRaGKd9Wv0L/1660
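Here is a small end-to-end sketch of the steps above, using a made-up sales table (the column names and values are invented for illustration, matching the sales-report example):

```python
import pandas as pd

# Hypothetical sales data: row 2 repeats row 0 exactly
df = pd.DataFrame({
    "CustomerID":  [101, 102, 101, 103],
    "InvoiceDate": ["2025-06-01", "2025-06-01", "2025-06-01", "2025-06-02"],
    "Amount":      [250.0, 99.5, 250.0, 40.0],
})

# Count full-row duplicates
print(df.duplicated().sum())        # 1

# Drop them, keeping the first occurrence (the default)
df_cleaned = df.drop_duplicates()
print(len(df_cleaned))              # 3

# Check duplicates on a subset of key columns only
dupes = df.duplicated(subset=["CustomerID", "InvoiceDate"])
print(dupes.sum())                  # 1
```

Note that `duplicated()` marks only the *repeats*, not the first occurrence, which is why one row out of the pair is flagged.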
