Data Cleaning in SQL: A Comprehensive Guide to Handling Missing Values and Duplicates

Introduction:

Data cleaning is a crucial step in data preparation: it helps ensure that your datasets are accurate and reliable. This article focuses on using SQL to address two common issues: missing values and duplicate rows. Follow along for essential SQL techniques that produce a cleaner, more robust dataset.


Handling Missing Values:

Identifying Missing Values:

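Before addressing missing values, identify where they occur in your dataset. In SQL, a missing value is represented by NULL. Use the IS NULL condition to select the rows where a column has no value, and IS NOT NULL to select the rows where a value is present. A comparison such as column_name = NULL does not work, because any comparison with NULL evaluates to unknown rather than true.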

-- To find rows with missing values in a particular column
SELECT *
FROM your_table
WHERE column_name IS NULL;

-- To find rows without missing values in a particular column
SELECT *
FROM your_table
WHERE column_name IS NOT NULL;
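To try the queries above end to end, here is a minimal sketch that runs them against an in-memory SQLite database through Python's standard sqlite3 module. The customers table, its columns, and the sample rows are hypothetical and exist only for illustration. Other database engines may behave differently in details such as whether an empty string counts as NULL.

```python
import sqlite3

# Hypothetical table and sample data, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "a@example.com"), (2, None), (3, ""), (4, None)],
)

# IS NULL matches the rows whose email is missing.
# In SQLite the empty string in row 3 is a value, not NULL.
missing_ids = [row[0] for row in conn.execute(
    "SELECT id FROM customers WHERE email IS NULL ORDER BY id")]

# '= NULL' evaluates to unknown for every row, so nothing is returned.
equals_null_ids = [row[0] for row in conn.execute(
    "SELECT id FROM customers WHERE email = NULL")]

# COUNT(*) counts all rows, while COUNT(email) skips NULLs,
# so the difference is the number of missing values in the column.
total, present = conn.execute(
    "SELECT COUNT(*), COUNT(email) FROM customers").fetchone()
missing_count = total - present

print(missing_ids, equals_null_ids, missing_count)
```

Comparing COUNT(*) with COUNT(column_name) is a quick way to gauge how many values are missing before deciding how to handle them.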


Replacing Missing Values:

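Once identified, missing values can be replaced with appropriate substitutes using the COALESCE function or a CASE expression. COALESCE returns its first non-NULL argument, which makes it the shorter option for a fixed default; CASE is useful when the replacement depends on other conditions. In either case, the substitute should have a type compatible with the column's type.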

-- Replace missing values with a default value
SELECT COALESCE(column_name, 'default_value') AS cleaned_column
FROM your_table;

-- Replace missing values based on a condition
SELECT
  CASE
    WHEN column_name IS NULL THEN 'default_value'
    ELSE column_name
  END AS cleaned_column
FROM your_table;
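The following sketch runs both replacement techniques on sample data, again using an in-memory SQLite database via Python's sqlite3 module. The orders table and its columns are hypothetical. The final UPDATE is an addition that the queries above do not cover: those queries only change what is returned, whereas an UPDATE writes the substitute back to the table, so treat it as one possible approach rather than a required step.

```python
import sqlite3

# Hypothetical table and sample data, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, discount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "north", 0.1), (2, None, None), (3, "south", None)],
)

# COALESCE returns the first non-NULL argument; each default matches
# the type of its column (text for region, a number for discount).
coalesced = conn.execute(
    "SELECT id, COALESCE(region, 'unknown'), COALESCE(discount, 0.0) "
    "FROM orders ORDER BY id").fetchall()

# The equivalent CASE expression for the region column.
case_rows = conn.execute(
    """
    SELECT id,
           CASE WHEN region IS NULL THEN 'unknown' ELSE region END
    FROM orders
    ORDER BY id
    """).fetchall()

# Optional: persist the default instead of only selecting it.
conn.execute("UPDATE orders SET discount = 0.0 WHERE discount IS NULL")
remaining_nulls = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE discount IS NULL").fetchone()[0]

print(coalesced, case_rows, remaining_nulls)
```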


Dealing with Duplicates:

Identifying Duplicates:

To detect duplicate rows, group by the columns that define a duplicate using GROUP BY, count the rows in each group with COUNT(*), and keep only the groups that contain more than one row using HAVING.

-- Find duplicate rows based on specific columns
SELECT column1, column2, COUNT(*)
FROM your_table
GROUP BY column1, column2
HAVING COUNT(*) > 1;
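As a runnable illustration of the grouping query, the sketch below uses an in-memory SQLite database through Python's sqlite3 module. The contacts table, its columns, and the rows are hypothetical. The second query is an assumption about a common follow-up need: it joins the duplicate groups back to the table so that each affected row can be inspected individually.

```python
import sqlite3

# Hypothetical table and sample data, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (id INTEGER, first_name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO contacts VALUES (?, ?, ?)",
    [
        (1, "Ana", "ana@example.com"),
        (2, "Ben", "ben@example.com"),
        (3, "Ana", "ana@example.com"),
        (4, "Ana", "ana@example.com"),
        (5, "Cy", "cy@example.com"),
    ],
)

# One row per duplicated (first_name, email) pair, with its row count.
duplicate_groups = conn.execute(
    """
    SELECT first_name, email, COUNT(*) AS duplicate_count
    FROM contacts
    GROUP BY first_name, email
    HAVING COUNT(*) > 1
    """).fetchall()

# Join the duplicated pairs back to the table to list every affected row.
duplicate_ids = [row[0] for row in conn.execute(
    """
    SELECT c.id
    FROM contacts AS c
    JOIN (
        SELECT first_name, email
        FROM contacts
        GROUP BY first_name, email
        HAVING COUNT(*) > 1
    ) AS d
      ON c.first_name = d.first_name AND c.email = d.email
    ORDER BY c.id
    """)]

print(duplicate_groups, duplicate_ids)
```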

Removing Duplicates:

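You can return de-duplicated results using the DISTINCT keyword or the ROW_NUMBER() window function. DISTINCT collapses rows that are identical across the selected columns, while ROW_NUMBER() numbers the rows within each group of duplicates so that you can keep exactly one, chosen by the ORDER BY clause. Both queries filter duplicates out of the result set; neither deletes rows from the underlying table.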

-- Remove duplicates using DISTINCT
SELECT DISTINCT column1, column2
FROM your_table;

-- Remove duplicates using ROW_NUMBER()
WITH ranked_rows AS (
  SELECT
    column1,
    column2,
    ROW_NUMBER() OVER (PARTITION BY column1, column2 ORDER BY some_order_column) AS row_num
  FROM your_table
)
SELECT column1, column2
FROM ranked_rows
WHERE row_num = 1;
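The sketch below contrasts the two approaches on sample data, using an in-memory SQLite database via Python's sqlite3 module. The contacts table and the updated_at column are hypothetical, with updated_at standing in for some_order_column. Window functions need SQLite 3.25 or later, which is assumed here. Ordering by updated_at DESC keeps the most recent row per email; that is one possible rule for choosing which duplicate survives, not the only one.

```python
import sqlite3

# Hypothetical table and sample data, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (id INTEGER, email TEXT, updated_at TEXT)")
conn.executemany(
    "INSERT INTO contacts VALUES (?, ?, ?)",
    [
        (1, "ana@example.com", "2024-01-01"),
        (2, "ben@example.com", "2024-01-05"),
        (3, "ana@example.com", "2024-03-01"),
    ],
)

# DISTINCT collapses identical values, but only for the selected columns.
distinct_emails = [row[0] for row in conn.execute(
    "SELECT DISTINCT email FROM contacts ORDER BY email")]

# ROW_NUMBER() keeps one full row per email: the most recently updated.
latest_rows = conn.execute(
    """
    WITH ranked_rows AS (
        SELECT id, email, updated_at,
               ROW_NUMBER() OVER (
                   PARTITION BY email
                   ORDER BY updated_at DESC
               ) AS row_num
        FROM contacts
    )
    SELECT id, email, updated_at
    FROM ranked_rows
    WHERE row_num = 1
    ORDER BY id
    """).fetchall()

print(distinct_emails, latest_rows)
```

DISTINCT is enough when only the duplicated columns are needed. ROW_NUMBER() is the better fit when the surviving row must also carry columns that differ between duplicates, such as id or updated_at in this example.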

Conclusion:

Effective data cleaning is essential for accurate analysis and decision-making. By leveraging SQL to handle missing values and duplicates, you can ensure your datasets are reliable and ready for insightful exploration.


