Perform data transformation and cleaning

17 Data Analysis and Visualization – Data Transformation & Cleaning

Why Clean Data? 📊

Imagine you’re making a smoothie. If you throw in a rotten fruit, the whole drink tastes bad. The same happens with data – dirty data can spoil your analysis. Cleaning ensures your results are trustworthy and your visualisations are accurate.

Common Data Issues 🧹

  • Missing values (blank cells)
  • Duplicate records
  • Inconsistent formatting (e.g., dates as 01/02/2023 vs 2023-02-01)
  • Outliers (extremely high or low numbers)
  • Wrong data types (numbers stored as text)
  • Typographical errors (e.g., “Micheal” vs “Michael”)

Cleaning Techniques 🔄

  1. Remove Duplicates – keep only one copy of each record.
  2. Handle Missing Data
    • Delete rows with too many blanks.
    • Replace with a placeholder (e.g., “Unknown”).
    • Impute values (average, median, or mode).
  3. Standardise Formats – e.g., convert all dates to YYYY-MM-DD.
  4. Correct Data Types – ensure numbers are numeric, dates are date objects.
  5. Identify & Treat Outliers
    • Use box‑plots or the IQR method: values $Q1-1.5\times IQR$ or > $Q3+1.5\times IQR$ are flagged.
    • Decide to keep, adjust, or remove.
  6. Correct Typos – use lookup tables or fuzzy matching.

Transformation Methods 🔧

Transforming data means changing its shape or scale so it’s easier to analyse or visualise. Think of it like turning a raw block of wood into a polished table: you cut, sand, and finish it.

  • Normalisation – scale values to a 0‑1 range: $$x_{\text{norm}} = \frac{x - \min(x)}{\max(x) - \min(x)}$$
  • Standardisation – centre at 0, unit variance: $$z = \frac{x - \mu}{\sigma}$$
  • Log Transformation – useful for skewed data: $$y = \log(x + 1)$$
  • Encoding Categorical Variables – one‑hot or label encoding.
  • Feature Engineering – create new columns (e.g., age from birthdate).

Example Workflow 🚀

  1. Load raw data into a spreadsheet or Python pandas DataFrame.
  2. Run df.info() to spot missing values and data types.
  3. Apply df.drop_duplicates() to remove repeats.
  4. Impute missing ages with the median: df['age'].fillna(df['age'].median(), inplace=True)
  5. Convert dates: df['signup'] = pd.to_datetime(df['signup'])
  6. Normalise sales figures: df['sales_norm'] = (df['sales'] - df['sales'].min()) / (df['sales'].max() - df['sales'].min())
  7. Visualise cleaned data with a bar chart or scatter plot.

Quick Reference Table 📋

Issue Solution
Missing numeric value Impute with mean or median
Duplicate rows Remove duplicates
Date format mismatch Standardise to ISO YYYY-MM-DD
Outlier in sales Cap at Q3 + 1.5×IQR or remove

Key Takeaways ✨

  • Clean data is the foundation of reliable analysis.
  • Use systematic steps: detect, correct, transform.
  • Keep a log of all cleaning actions for reproducibility.
  • After cleaning, visualisations will truly reflect the underlying patterns.

Revision

Log in to practice.

0 views 0 suggestions