Data Quality
- Real-world data is messy
- Missing Data
- Duplicate Data
- Invalid Data
- Noise
- Outliers
Missing Data
Duplicate Data
Invalid Data
Noise
Why Address Data Quality Issues?
Addressing Data Quality Issues
Removing Missing Data
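Removing missing data can be sketched as dropping every record that has at least one empty field. This is a minimal illustration only: the `records` list, its field names, and the helper `drop_incomplete` are hypothetical, and `None` is assumed to mark a missing value.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"name": "Ana", "age": 34, "income": 52000},
    {"name": "Ben", "age": None, "income": 48000},
    {"name": "Cy", "age": 41, "income": None},
    {"name": "Dee", "age": 29, "income": 61000},
]


def drop_incomplete(rows):
    """Keep only the rows in which every field has a value."""
    return [row for row in rows if all(v is not None for v in row.values())]


complete = drop_incomplete(records)
print(len(records), "->", len(complete))
```

Dropping is simple, but it also throws away the valid fields of every removed record; in this toy data half of the rows are lost.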
Imputing Missing Data
Replace missing values with something reasonable
Ways to Impute Missing Data
Replace each missing value with one of:
- Mean
- Median
- Most Frequent
- Sensible value based on application
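The four options above can be compared side by side in a small sketch. The function `impute`, its `strategy` names, and the sample `ages` list are hypothetical choices for illustration, and `None` is assumed to mark a missing value.

```python
from statistics import mean, median, mode


def impute(values, strategy="mean", fill_value=None):
    """Return a copy of `values` with every None replaced.

    strategy: "mean", "median", "most_frequent", or "constant"
    (for "constant", `fill_value` is the application-specific value).
    """
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    elif strategy == "most_frequent":
        fill = mode(observed)
    elif strategy == "constant":
        fill = fill_value
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


ages = [20, 30, None, 30, 100]
by_mean = impute(ages, "mean")            # fills with the mean of 20, 30, 30, 100
by_median = impute(ages, "median")        # fills with the median of the same values
by_mode = impute(ages, "most_frequent")   # fills with the most common value
by_rule = impute(ages, "constant", fill_value=0)  # application-chosen value
```

With this sample the mean is pulled up by the single large value while the median and most frequent value are not, which is one reason the choice depends on the application.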
Duplicate Data
Invalid Data
- Use external data source to get correct value
- Apply reasoning and domain knowledge to come up with reasonable value
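Both approaches can be sketched together. Everything here is hypothetical: the `CITY_BY_POSTCODE` table stands in for an external data source, the field names are invented, and the 0 to 120 age range is an assumed domain rule rather than a recommendation.

```python
# Hypothetical reference table standing in for an external data source.
CITY_BY_POSTCODE = {"10001": "New York", "94105": "San Francisco"}


def fix_invalid(record):
    """Return a corrected copy of `record`; the input is left unchanged."""
    fixed = dict(record)

    # External source: trust the lookup table over the free-text city field.
    expected_city = CITY_BY_POSTCODE.get(fixed.get("postcode"))
    if expected_city is not None:
        fixed["city"] = expected_city

    # Domain knowledge (assumed rule): an age outside 0-120 cannot be right,
    # so mark it as missing instead of keeping a known-bad value.
    age = fixed.get("age")
    if age is not None and not 0 <= age <= 120:
        fixed["age"] = None

    return fixed


row = {"postcode": "94105", "city": "San Fransisco", "age": -3}
cleaned = fix_invalid(row)
```

Marking the impossible age as missing hands the decision over to whichever missing-data strategy is in use, rather than guessing a value on the spot.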
Noise
- Filter out noise component
- Filtering may also remove part of the real data, so care must be taken
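One common way to filter noise is a moving average, shown here as a minimal sketch. The technique is an assumption chosen for illustration (the slides do not name a specific filter), and `moving_average` and the `readings` list are hypothetical.

```python
def moving_average(signal, window=3):
    """Smooth a sequence with a centered moving average.

    `window` must be odd; near the edges the window is truncated.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    smoothed = []
    for i in range(len(signal)):
        lo = max(0, i - half)
        hi = min(len(signal), i + half + 1)
        neighbourhood = signal[lo:hi]
        smoothed.append(sum(neighbourhood) / len(neighbourhood))
    return smoothed


# Small jitter between 10 and 12, plus one large reading of 30.
readings = [10, 12, 10, 12, 10, 30, 10, 12]
smooth = moving_average(readings)
```

The jitter is reduced, but the reading of 30 is flattened as well. If that spike was a real event rather than noise, the filter has removed part of the data, which is the risk noted above.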
Outliers
- Remove outliers if they are not the focus of the analysis
- Analyze them more closely if they are the focus of the analysis (e.g., fraud detection)
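Either choice first requires separating outliers from the rest. The sketch below uses the interquartile-range (IQR) rule, which is one common convention and an assumption here, not a method named in the slides; `split_outliers`, the multiplier `k=1.5`, and the `amounts` list are hypothetical.

```python
from statistics import quantiles


def split_outliers(values, k=1.5):
    """Split values into (inliers, outliers) using the IQR rule.

    A value is treated as an outlier if it lies more than k * IQR
    below the first quartile or above the third quartile.
    """
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    inliers = [v for v in values if low <= v <= high]
    outliers = [v for v in values if v < low or v > high]
    return inliers, outliers


# Hypothetical transaction amounts with one unusually large payment.
amounts = [20, 22, 19, 21, 23, 20, 500]
inliers, outliers = split_outliers(amounts)
```

For an analysis of typical spending, `inliers` would be kept and `outliers` dropped; for fraud detection the roles reverse and `outliers` is the list to inspect.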
Domain Knowledge
Required for addressing Data Quality issues effectively