Data Quality
Real-world data is messy. Common data quality issues include:
- Missing Data
- Duplicate Data
- Invalid Data
- Noise
- Outliers
Missing Data

Duplicate Data

Invalid Data

Noise

Why Address Data Quality Issues?

Addressing Data Quality Issues

Removing Missing Data
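The simplest way to handle missing data is to drop every record that has a gap. The sketch below is a minimal illustration, assuming records are plain dictionaries and that `None` marks a missing value; the sample data and the helper name `drop_incomplete` are hypothetical.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"name": "Ann", "age": 34, "income": 52000},
    {"name": "Bob", "age": None, "income": 48000},
    {"name": "Cara", "age": 29, "income": None},
    {"name": "Dev", "age": 41, "income": 61000},
]


def drop_incomplete(rows):
    """Return only the rows in which every field has a value."""
    return [row for row in rows if all(v is not None for v in row.values())]


complete = drop_incomplete(records)
print(len(records), "->", len(complete))
```

Here half of the records are discarded, which shows the cost of this approach: removal is easy but can throw away a large share of the data.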

Imputing Missing Data
Replace missing values with something reasonable.

Ways to Impute Missing Data
Replace each missing value with one of the following:
- Mean
- Median
- Most Frequent
- Sensible value based on application
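
The four options above can be sketched with Python's standard `statistics` module. This is a minimal illustration, not a definitive implementation: the function name `impute`, the strategy labels, and the sample `ages` column are assumptions made for the example, and `None` is assumed to mark a missing value.

```python
from statistics import mean, median, mode


def impute(values, strategy="mean", fill_value=None):
    """Replace None entries in a column using the chosen strategy."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        replacement = mean(observed)
    elif strategy == "median":
        replacement = median(observed)
    elif strategy == "most_frequent":
        replacement = mode(observed)
    elif strategy == "constant":
        # A sensible value chosen for the application at hand.
        replacement = fill_value
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [replacement if v is None else v for v in values]


ages = [20, 30, 30, None, 100]  # hypothetical column with one gap

by_mean = impute(ages, "mean")             # gap becomes 45
by_median = impute(ages, "median")         # gap becomes 30
by_mode = impute(ages, "most_frequent")    # gap becomes 30
by_constant = impute(ages, "constant", 0)  # gap becomes 0
```

Note how the single large value (100) pulls the mean up to 45 while the median stays at 30, which is one reason the median is often preferred for skewed data.
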
Duplicate Data
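One common remedy, assumed here for illustration, is to keep the first record seen for each identifying key and discard later repeats. The helper `drop_duplicates`, the choice of key fields, and the sample rows are hypothetical.

```python
def drop_duplicates(rows, key_fields):
    """Keep the first row for each combination of key_fields, in order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical customer rows; the second "Ann" entry is a repeat.
customers = [
    {"name": "Ann", "email": "ann@example.com"},
    {"name": "Bob", "email": "bob@example.com"},
    {"name": "Ann", "email": "ann@example.com"},
]

deduplicated = drop_duplicates(customers, key_fields=("name", "email"))
```

Which fields identify a record, and which copy to keep when duplicates disagree, depends on the application.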

Invalid Data
- Use an external data source to get the correct value
- Apply reasoning and domain knowledge to come up with a reasonable value
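
Both ideas can be sketched briefly. In the example below, a small lookup table stands in for the external data source, and an age range check stands in for domain knowledge. The table `STATE_BY_ZIP`, the helper names, the 0 to 120 age bounds, and the sample records are all assumptions made for illustration.

```python
# Hypothetical reference table standing in for an external data source.
STATE_BY_ZIP = {"10001": "NY", "94105": "CA"}


def correct_state(record, reference=STATE_BY_ZIP):
    """Return a copy of the record whose state agrees with the reference."""
    expected = reference.get(record["zip"])
    if expected is None or record["state"] == expected:
        return dict(record)  # nothing to check against, or already valid
    return {**record, "state": expected}


def is_plausible_age(age):
    """Domain-knowledge check (assumed bounds): a human age from 0 to 120."""
    return age is not None and 0 <= age <= 120


customers = [
    {"zip": "10001", "state": "NY", "age": 34},
    {"zip": "94105", "state": "XX", "age": 29},  # invalid state code
    {"zip": "10001", "state": "NY", "age": -5},  # impossible age
]

cleaned = [correct_state(c) for c in customers]
suspect_ages = [c for c in cleaned if not is_plausible_age(c["age"])]
```

The invalid state is corrected from the reference table, while the impossible age is only flagged, since no external source is available here to supply the correct value.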

Noise
- Filter out the noise component
- Filtering may also remove part of the real data, so care must be taken
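
A moving average is one simple noise filter, used here only as an illustration; the notes do not prescribe a specific filter. The sketch below assumes a window of three samples and a hypothetical signal containing one genuine spike, to show how filtering can also remove real data.

```python
def moving_average(signal, window=3):
    """Smooth a sequence by averaging each point with its neighbors."""
    half = window // 2
    smoothed = []
    for i in range(len(signal)):
        lo = max(0, i - half)
        hi = min(len(signal), i + half + 1)
        neighborhood = signal[lo:hi]
        smoothed.append(sum(neighborhood) / len(neighborhood))
    return smoothed


# Hypothetical signal: flat except for one genuine spike of height 9.
signal = [0, 0, 0, 9, 0, 0, 0]
smoothed = moving_average(signal)
print(smoothed)  # [0.0, 0.0, 3.0, 3.0, 3.0, 0.0, 0.0]
```

The spike is flattened from 9 to 3 and smeared across three samples. If that spike were a real event rather than noise, the filter would have hidden it, which is why the window size needs to be chosen with care.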

Outliers
- Remove outliers if they are not the focus of the analysis
- Analyze them more closely if they are the focus of the analysis (e.g., fraud detection)
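
Either choice first requires identifying the outliers. One widely used rule of thumb, assumed here for illustration, flags values that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. The helper `split_outliers`, the multiplier `k`, and the sample amounts are hypothetical.

```python
from statistics import quantiles


def split_outliers(values, k=1.5):
    """Split values into (inliers, outliers) using the IQR rule."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    inliers = [v for v in values if lower <= v <= upper]
    outliers = [v for v in values if v < lower or v > upper]
    return inliers, outliers


# Hypothetical transaction amounts with one extreme value.
amounts = [52, 48, 50, 51, 49, 47, 53, 500]
inliers, outliers = split_outliers(amounts)
```

Returning both lists keeps the decision with the analyst: continue with `inliers` when outliers are a nuisance, or inspect `outliers` when they are the subject of the analysis, as in fraud detection.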

Domain Knowledge
Domain knowledge is required to address data quality issues effectively.