Perform data transformation and cleaning
17 Data Analysis and Visualization – Data Transformation & Cleaning
Why Clean Data? 📊
Imagine you’re making a smoothie. If you throw in a rotten fruit, the whole drink tastes bad. The same happens with data – dirty data can spoil your analysis. Cleaning ensures your results are trustworthy and your visualisations are accurate.
Common Data Issues 🧹
- Missing values (blank cells)
- Duplicate records
- Inconsistent formatting (e.g., dates as 01/02/2023 vs 2023-02-01)
- Outliers (extremely high or low numbers)
- Wrong data types (numbers stored as text)
- Typographical errors (e.g., “Micheal” vs “Michael”)
Cleaning Techniques 🔄
- Remove Duplicates – keep only one copy of each record.
- Handle Missing Data
- Delete rows with too many blanks.
- Replace with a placeholder (e.g., “Unknown”).
- Impute values (average, median, or mode).
- Standardise Formats – e.g., convert all dates to
YYYY-MM-DD. - Correct Data Types – ensure numbers are numeric, dates are date objects.
- Identify & Treat Outliers
- Use box‑plots or the
IQRmethod: values$Q1-1.5\times IQR$or >$Q3+1.5\times IQR$are flagged. - Decide to keep, adjust, or remove.
- Use box‑plots or the
- Correct Typos – use lookup tables or fuzzy matching.
Transformation Methods 🔧
Transforming data means changing its shape or scale so it’s easier to analyse or visualise. Think of it like turning a raw block of wood into a polished table: you cut, sand, and finish it.
- Normalisation – scale values to a 0‑1 range: $$x_{\text{norm}} = \frac{x - \min(x)}{\max(x) - \min(x)}$$
- Standardisation – centre at 0, unit variance: $$z = \frac{x - \mu}{\sigma}$$
- Log Transformation – useful for skewed data: $$y = \log(x + 1)$$
- Encoding Categorical Variables – one‑hot or label encoding.
- Feature Engineering – create new columns (e.g., age from birthdate).
Example Workflow 🚀
- Load raw data into a spreadsheet or Python pandas DataFrame.
- Run
df.info()to spot missing values and data types. - Apply
df.drop_duplicates()to remove repeats. - Impute missing ages with the median:
df['age'].fillna(df['age'].median(), inplace=True) - Convert dates:
df['signup'] = pd.to_datetime(df['signup']) - Normalise sales figures:
df['sales_norm'] = (df['sales'] - df['sales'].min()) / (df['sales'].max() - df['sales'].min()) - Visualise cleaned data with a bar chart or scatter plot.
Quick Reference Table 📋
| Issue | Solution |
|---|---|
| Missing numeric value | Impute with mean or median |
| Duplicate rows | Remove duplicates |
| Date format mismatch | Standardise to ISO YYYY-MM-DD |
| Outlier in sales | Cap at Q3 + 1.5×IQR or remove |
Key Takeaways ✨
- Clean data is the foundation of reliable analysis.
- Use systematic steps: detect, correct, transform.
- Keep a log of all cleaning actions for reproducibility.
- After cleaning, visualisations will truly reflect the underlying patterns.
Revision
Log in to practice.
0 views
0 suggestions