Data Science & Machine Learning
June 11, 2025 at 11:05 AM
Today, let's move to the next topic in the Data Science Learning Series:
🔹 *Outliers (Conceptual + Pandas Code)*
🧠 *What Are Outliers?*
Outliers are data points that are significantly different from the rest of the dataset.
*For example:*
- A student scoring 100 while everyone else scored 40–60.
- A product priced at ₹1,000,000 when the rest fall in the ₹500–₹5,000 range.
🎯 *Why Outliers Matter*
- They can skew the mean and standard deviation
- They can make visualizations and trends misleading
- They can affect machine learning model performance
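To see the first point in numbers, here is a small sketch using the exam-score example from above. The scores themselves are made up for illustration; only the "everyone at 40–60, one student at 100" shape comes from the post.

```python
import pandas as pd

# Hypothetical exam scores: everyone sits in the 40-60 range except one 100.
scores = pd.Series([40, 45, 50, 55, 60, 100])
without_outlier = scores[scores < 100]

# The mean and standard deviation react strongly to the single extreme value.
mean_with = scores.mean()
mean_without = without_outlier.mean()
std_with = scores.std()
std_without = without_outlier.std()

# The median barely moves, which is why it is called a robust statistic.
median_with = scores.median()
median_without = without_outlier.median()

print(f"mean:   {mean_without:.2f} -> {mean_with:.2f}")
print(f"std:    {std_without:.2f} -> {std_with:.2f}")
print(f"median: {median_without:.2f} -> {median_with:.2f}")
```

One extreme score pulls the mean up by more than 8 points and more than doubles the standard deviation, while the median shifts by only 2.5.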
✅ *Detecting Outliers*
*1. Using Summary Stats*
df['column'].describe()
Check the mean, min, max, 25%, and 75% values. If the min or max sits far away from the quartiles, it's a signal.
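Here is roughly what that check could look like on the product-price example. The `price` column and its values are assumptions made up for this sketch, not data from a real dataset.

```python
import pandas as pd

# Hypothetical prices: most fall between 500 and 5,000, one is 1,000,000.
df = pd.DataFrame({"price": [500, 1200, 2500, 3000, 5000, 1_000_000]})

summary = df["price"].describe()
print(summary)

# A quick sanity check: how far is the max from the 75th percentile?
ratio = summary["max"] / summary["75%"]
print(f"max is {ratio:.0f}x the 75th percentile")

# The mean landing far above the median (50%) is another hint.
print(f"mean = {summary['mean']:.0f}, median = {summary['50%']:.0f}")
```

There is no fixed rule for how large the ratio must be. It is only a prompt to look closer with the IQR method or a plot.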
*2. Using IQR (Interquartile Range)*
Q1 = df['column'].quantile(0.25)
Q3 = df['column'].quantile(0.75)
IQR = Q3 - Q1
# Filter out outliers: keep only rows within 1.5 * IQR of the quartiles
filtered_df = df[~((df['column'] < (Q1 - 1.5 * IQR)) |
                   (df['column'] > (Q3 + 1.5 * IQR)))]
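If you want to inspect the outliers before dropping them, the same IQR rule can be wrapped in a small helper. This is a sketch under assumptions: `iqr_bounds` is a hypothetical helper name, and the `amount` column and its values are invented for illustration.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical transaction amounts with one unusually large value.
df = pd.DataFrame({"amount": [200, 800, 1500, 3000, 5000, 20000]})

lower, upper = iqr_bounds(df["amount"])

# between() is inclusive on both ends, so values exactly on a fence are kept.
# Note: missing values are flagged as outliers here, because between()
# returns False for NaN.
is_outlier = ~df["amount"].between(lower, upper)

outliers = df[is_outlier]       # rows to investigate
filtered_df = df[~is_outlier]   # rows that pass the IQR rule

print(f"fences: [{lower}, {upper}]")
print(outliers)
```

Keeping `outliers` as its own DataFrame makes it easy to review the flagged rows first instead of silently discarding them.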
*3. Using Visualization*
import matplotlib.pyplot as plt
import seaborn as sns
sns.boxplot(x=df['column'])  # Outliers appear as dots beyond the whiskers
plt.show()
🔍 *Real-Life Example*
In banking: A transaction of ₹20,000 while the others fall between ₹200 and ₹5,000 might be fraud or an error.
In marketing: One campaign getting 10x more clicks than the others could be spam or a winning strategy.
✅ *What To Do With Outliers?*
- Investigate: Sometimes they're valid (e.g., CEO salary).
- Remove: If it's clearly an error or a rare anomaly.
- Cap: Replace extreme values with upper/lower thresholds (winsorization).
- Use robust methods: Like median instead of mean.
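As a rough sketch of the "Cap" option, pandas' `clip` can pull extreme values back to a threshold. The post doesn't say which thresholds to use, so the IQR fences below are just one possible choice (percentiles such as 1% and 99% are another common one), and the amounts are made up.

```python
import pandas as pd

# Hypothetical transaction amounts with one extreme value.
amounts = pd.Series([200, 800, 1500, 3000, 5000, 20000], dtype="float64")

# Assumed thresholds: the same IQR fences used for detection.
q1, q3 = amounts.quantile(0.25), amounts.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# clip() replaces anything below `lower` or above `upper` with the threshold,
# so no rows are dropped.
capped = amounts.clip(lower=lower, upper=upper)

print(capped.tolist())
print(f"mean before: {amounts.mean():.2f}, after capping: {capped.mean():.2f}")
```

Unlike removing rows, capping keeps the dataset the same size, which helps when every record still carries other useful columns.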
*React with ❤️ once you're ready for the quiz*
Data Science Learning Series: https://whatsapp.com/channel/0029Va8v3eo1NCrQfGMseL2D/998
Python Cheatsheet: https://whatsapp.com/channel/0029VaiM08SDuMRaGKd9Wv0L/1660