Data Quality

Real-world data is messy. Common data quality issues include:

  • Missing Data
  • Duplicate Data
  • Invalid Data
  • Noise
  • Outliers

Missing Data

Duplicate Data

Invalid Data

Noise

Why Address Data Quality Issues?

Addressing Data Quality Issues

Removing Missing Data
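
The simplest way to handle missing data is to drop every record that has a missing field. The sketch below shows this in plain Python. The `records` list and the `drop_incomplete` helper are hypothetical names chosen for illustration, and `None` is assumed to mark a missing value.

```python
# Hypothetical records; None is assumed to mark a missing value.
records = [
    {"name": "Ann", "age": 34},
    {"name": "Bob", "age": None},
    {"name": None, "age": 29},
    {"name": "Dee", "age": 41},
]


def drop_incomplete(rows):
    """Keep only the rows in which every field has a value."""
    return [
        row for row in rows
        if all(value is not None for value in row.values())
    ]


complete = drop_incomplete(records)
print(complete)
```

Dropping rows is easy, but it discards the valid fields of each removed record, so it tends to suit cases where only a small share of records is incomplete.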

Imputing Missing Data

Replace missing values with something reasonable.

Ways to Impute Missing Data

Replace a missing value with one of the following:

  • Mean
  • Median
  • Most frequent value
  • A sensible value based on the application
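
The four strategies above can be sketched with Python's standard `statistics` module. The `ages` list and the `impute` helper are hypothetical, `None` is assumed to mark a missing value, and the application-specific default of `0` is only a placeholder for whatever value makes sense in a given domain.

```python
from statistics import mean, median, mode

# Hypothetical column with two missing entries (None).
ages = [34, None, 29, 41, 29, None]


def impute(values, strategy):
    """Fill missing entries with a value computed from the observed ones."""
    observed = [v for v in values if v is not None]
    fill = strategy(observed)
    return [fill if v is None else v for v in values]


mean_filled = impute(ages, mean)        # fill with the mean of observed values
median_filled = impute(ages, median)    # fill with the median
mode_filled = impute(ages, mode)        # fill with the most frequent value
# Application-specific default; 0 is only a placeholder assumption.
default_filled = impute(ages, lambda observed: 0)

print(mean_filled)
print(median_filled)
print(mode_filled)
print(default_filled)
```

The median is usually less sensitive to extreme values than the mean, which can matter when the column also contains outliers.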

Duplicate Data
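
One common way to handle duplicate data is to keep the first occurrence of each record and drop the exact repeats. This is a minimal sketch under that assumption; the `rows` data is hypothetical, and real data may need a rule for near-duplicates (for example, differing capitalization) that this does not cover.

```python
# Hypothetical rows; the third is an exact duplicate of the first.
rows = [
    ("ann@example.com", "Ann"),
    ("bob@example.com", "Bob"),
    ("ann@example.com", "Ann"),
]

# dict.fromkeys keeps the first occurrence of each key and preserves
# insertion order, so the original row order is retained.
unique_rows = list(dict.fromkeys(rows))

print(unique_rows)
```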

Invalid Data

  • Use an external data source to get the correct value
  • Apply reasoning and domain knowledge to come up with a reasonable value
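
Both ideas can be combined in a small validation step. In this sketch the rule that a valid age lies between 0 and 120 is an assumed piece of domain knowledge, and `reference_ages` is a hypothetical lookup table standing in for an external data source.

```python
# Hypothetical external source of trusted values, keyed by user id.
reference_ages = {"u2": 45}


def fix_age(user_id, age):
    """Return a valid age, a corrected age, or None if no fix is known."""
    # Assumed domain rule: a plausible human age is between 0 and 120.
    if 0 <= age <= 120:
        return age
    # Invalid value: fall back to the external source if it has an entry.
    return reference_ages.get(user_id)


print(fix_age("u1", 30))   # valid, returned unchanged
print(fix_age("u2", -5))   # invalid, corrected from the reference table
print(fix_age("u3", 200))  # invalid and unknown, left as missing (None)
```

A value that cannot be corrected is returned as `None`, which turns the invalid entry into a missing one that can then be removed or imputed.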

Noise

  • Filter out the noise component
  • Filtering may also remove part of the real data, so care must be taken
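
A moving average is one simple noise filter, used here only as an illustration since the text does not name a specific technique. The `signal` values are hypothetical. The example also shows the caveat above: the spike at index 4 is flattened along with the noise, whether or not it was real.

```python
def moving_average(values, window=3):
    """Centered moving average; edge points use the neighbors available."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        neighborhood = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(neighborhood) / len(neighborhood))
    return smoothed


# Hypothetical readings: small fluctuations around 10, plus one spike of 30.
signal = [10, 11, 9, 10, 30, 10, 11, 9, 10]
smoothed = moving_average(signal)

print(smoothed)
print(max(signal), max(smoothed))  # the spike is much lower after smoothing
```

A larger `window` removes more noise but also blurs more of the underlying signal, so the window size is a trade-off to tune for each data set.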

Outliers

  • Remove outliers if they are not the focus of the analysis
  • Analyze them more closely if they are the focus of the analysis (e.g., fraud detection)
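
Either choice first requires finding the outliers. The sketch below uses the common 1.5 × IQR rule, which is an assumption here since the text does not prescribe a detection method. The `amounts` data and the `split_outliers` helper are hypothetical.

```python
from statistics import quantiles


def split_outliers(values, k=1.5):
    """Split values into (inliers, outliers) using the k * IQR rule."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    inliers = [v for v in values if low <= v <= high]
    outliers = [v for v in values if v < low or v > high]
    return inliers, outliers


# Hypothetical transaction amounts with one unusually large value.
amounts = [12, 13, 12, 14, 13, 15, 14, 98]
inliers, outliers = split_outliers(amounts)

print(inliers)   # use these when outliers are not the focus
print(outliers)  # inspect these when outliers are the focus
```

Returning both lists keeps the decision separate from the detection: the same function serves an analysis that discards outliers and one, such as fraud detection, that studies them.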

Domain Knowledge

Domain knowledge is required to address data quality issues effectively.
