Data Science & Machine Learning
June 11, 2025 at 11:05 AM
Today, let's move to the next topic in the Data Science Learning Series:

🔹 *Outliers (Conceptual + Pandas Code)*

🧠 *What Are Outliers?*
Outliers are data points that differ significantly from the rest of the dataset.

*For example:*
- A student scoring 100 while everyone else scored 40–60.
- A product priced at ₹1,000,000 in a range of ₹500–₹5,000.

🎯 *Why Outliers Matter*
- They can skew the mean and standard deviation
- They can mislead visualizations and trends
- They can hurt machine learning model performance

✅ *Detecting Outliers*

*1. Using Summary Stats*
df['column'].describe()
Check the mean, min, max, 25%, and 75% values. If the min or max sits far from the quartiles, that's a signal.

*2. Using IQR (Interquartile Range)*
Q1 = df['column'].quantile(0.25)
Q3 = df['column'].quantile(0.75)
IQR = Q3 - Q1
# Fences: anything below lower or above upper is treated as an outlier
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
# Filter out outliers
filtered_df = df[~((df['column'] < lower) | (df['column'] > upper))]

*3. Using Visualization*
import seaborn as sns
import matplotlib.pyplot as plt
sns.boxplot(x=df['column'])  # Outliers appear as dots beyond the whiskers
plt.show()

🔍 *Real-Life Example*
In banking: A transaction of ₹20,000 while others are between ₹200 and ₹5,000 might be fraud or an error.
In marketing: One campaign getting 10x more clicks than the others could be spam or a winning strategy.

✅ *What To Do With Outliers?*
- Investigate: Sometimes they're valid (e.g., a CEO's salary).
- Remove: If the value is clearly an error or a rare anomaly.
- Cap: Replace extreme values with upper/lower thresholds (winsorization).
- Use robust methods: For example, the median instead of the mean.

*React with ❤️ once you're ready for the quiz*

Data Science Learning Series: https://whatsapp.com/channel/0029Va8v3eo1NCrQfGMseL2D/998
Python Cheatsheet: https://whatsapp.com/channel/0029VaiM08SDuMRaGKd9Wv0L/1660
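
🧪 *Bonus: Flag, Cap, and Compare (sketch)*
The post lists "Investigate", "Cap", and "Use robust methods" without code, so here is a minimal sketch of how they might look in pandas. The column name `amount`, the helper `iqr_bounds`, and the sample values are made up for illustration. Using the 1.5 × IQR fences as the capping thresholds is an assumption; percentile-based caps (say, the 1st and 99th percentiles) are another common choice.

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up transaction amounts with one obviously extreme value
df = pd.DataFrame({"amount": [200, 450, 800, 1200, 3000, 5000, 20000]})

lower, upper = iqr_bounds(df["amount"])

# Investigate: flag rows instead of dropping them right away
df["is_outlier"] = ~df["amount"].between(lower, upper)

# Cap: pull values outside the fences back to the nearest fence
df["amount_capped"] = df["amount"].clip(lower=lower, upper=upper)

# Robust methods: the median barely moves, the mean gets dragged up
mean_amount = df["amount"].mean()
median_amount = df["amount"].median()

print(df)
print("mean:", round(mean_amount, 2), "| median:", median_amount)
```

Flagging first keeps the original rows around, so you can check whether the ₹20,000-style value is an error or a real event before deciding to remove or cap it.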