Notes | Information Technology IT

17 Data Analysis and Visualization – Data Transformation & Cleaning

Why Clean Data? 📊

Imagine you’re making a smoothie. If you throw in a rotten fruit, the whole drink tastes bad. The same happens with data – dirty data can spoil your analysis. Cleaning ensures your results are trustworthy and your visualisations are accurate.

Common Data Issues 🧹

Missing values (blank cells)
Duplicate records
Inconsistent formatting (e.g., dates as 01/02/2023 vs 2023-02-01)
Outliers (extremely high or low numbers)
Wrong data types (numbers stored as text)
Typographical errors (e.g., “Micheal” vs “Michael”)

Cleaning Techniques 🔄

Remove Duplicates – keep only one copy of each record.
Handle Missing Data
- Delete rows with too many blanks.
- Replace with a placeholder (e.g., “Unknown”).
- Impute values (average, median, or mode).
Standardise Formats – e.g., convert all dates to YYYY-MM-DD.
Correct Data Types – ensure numbers are numeric, dates are date objects.
Identify & Treat Outliers
- Use box‑plots or the IQR method: values $Q1-1.5\times IQR$ or > $Q3+1.5\times IQR$ are flagged.
- Decide to keep, adjust, or remove.
Correct Typos – use lookup tables or fuzzy matching.

Transformation Methods 🔧

Transforming data means changing its shape or scale so it’s easier to analyse or visualise. Think of it like turning a raw block of wood into a polished table: you cut, sand, and finish it.

Normalisation – scale values to a 0‑1 range: $$x_{\text{norm}} = \frac{x - \min(x)}{\max(x) - \min(x)}$$
Standardisation – centre at 0, unit variance: $$z = \frac{x - \mu}{\sigma}$$
Log Transformation – useful for skewed data: $$y = \log(x + 1)$$
Encoding Categorical Variables – one‑hot or label encoding.
Feature Engineering – create new columns (e.g., age from birthdate).

Example Workflow 🚀

Load raw data into a spreadsheet or Python pandas DataFrame.
Run df.info() to spot missing values and data types.
Apply df.drop_duplicates() to remove repeats.
Impute missing ages with the median: df['age'].fillna(df['age'].median(), inplace=True)
Convert dates: df['signup'] = pd.to_datetime(df['signup'])
Normalise sales figures: df['sales_norm'] = (df['sales'] - df['sales'].min()) / (df['sales'].max() - df['sales'].min())
Visualise cleaned data with a bar chart or scatter plot.

Quick Reference Table 📋

Issue	Solution
Missing numeric value	Impute with mean or median
Duplicate rows	Remove duplicates
Date format mismatch	Standardise to ISO `YYYY-MM-DD`
Outlier in sales	Cap at `Q3 + 1.5×IQR` or remove

Key Takeaways ✨

Clean data is the foundation of reliable analysis.
Use systematic steps: detect, correct, transform.
Keep a log of all cleaning actions for reproducibility.
After cleaning, visualisations will truly reflect the underlying patterns.

Perform data transformation and cleaning

Listen to this Lesson