Quality of Data
According to Gartner: Poor Quality data weakens an organization’s competitive standing and business objectives
data:image/s3,"s3://crabby-images/f0d53/f0d53667d6d930d162f3e68ec7066fc72672cc51" alt=""
Data Wrangling Process
Process of transforming and mapping data from one raw data into another form with the intent of making it more appropriate and valuable for various task
data:image/s3,"s3://crabby-images/b27a8/b27a83463874d9b8f8021718edb5cc1bb86b6b96" alt=""
Data Cleaning Workflow
A typical data cleaning workflow includes:
data:image/s3,"s3://crabby-images/60a28/60a28ddc92eb7cbae0807a504f08f4d4dd899012" alt=""
Inspection
It includes,
- Detecting issues and errors
- Validating against rules and constraints
- Profiling data to inspect source data
- Visualizing data using statistical methods
Inspection
Data Profiling
data:image/s3,"s3://crabby-images/7fc23/7fc2357da47734b6aa9968569d8df66157611c82" alt=""
Visualization
Visualizing the data using statistical methods can help to spot outliers
Cleaning
- Missing values can cause unexpected or biased results
- Duplicate data are data points that are repeated in your dataset
- Irrelevant data is data that is not contextual to the use case
- Data type conversion is needed to ensure that values in a field are stored as the data type of that field
- Syntax errors, such as white spaces, extra spaces , types and formats need to be fixed
- Outliers need to be examined for accuracy
Verification
It includes,
Inspecting results to establish effectiveness and accuracy achieved as a result of data cleaning
Handling Missing values
DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
Returns: DataFrame or None
DataFrame with NA entries dropped from it or None if inplace = True
axis : {0 or ‘index’, 1 or ‘columns’}, default 0
Determine if rows or columns which contain missing values are removed.
0, or ‘index’ : Drop rows which contain missing values.
1, or ‘columns’ : Drop columns which contain missing value.
how:{‘any’, ‘all’}, default ‘any’
Determine if row or column is removed from DataFrame, when we have at least one NA or all NA.
‘any’ : If any NA values are present, drop that row or column.
‘all’ : If all values are NA, drop that row or column.
DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None)
Returns: DataFrame or None
Object with missing values filled or None if inplace = True
Value: scalar, dict, Series, or DataFrame
Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). Values not in the dict/Series/DataFrame will not be filled. This value cannot be a list.
Method: {‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None
Method to use for filling holes in reindexed Series pad / ffill: propagate last valid observation forward to next valid backfill / bfill: use next valid observation to fill gap.
data:image/s3,"s3://crabby-images/f91cc/f91cc94ba505b1543842fe0a18214acc8598d1bf" alt=""
# Replacing all NaN with a given value
df1= dataset.fillna('E')
print(df1)
data:image/s3,"s3://crabby-images/e446d/e446d55b9c0f1c1dd563171df1416affade3d407" alt=""
data:image/s3,"s3://crabby-images/346dc/346dce57afc82674b0eac514c139d30924bad3f6" alt=""
#b) Replacing NaN present in one specific column
dataset['stay'] = dataset['stay'].fillna(0)
dataset
data:image/s3,"s3://crabby-images/be839/be839e9dd257a3d8205d225a44329cab1898013f" alt=""
data:image/s3,"s3://crabby-images/778af/778afa157d11b44dfa3a05dd43005efe80f1fa53" alt=""
#b) using method attribute
df['Height']=df['Height'].fillna(method='backfill')
df['Weight']=df['Weight'].fillna(method='bfill')
df['Number of days']=df['Number of days'].fillna(method='pad')
df['stay']=df['stay'].fillna(method='ffill') # chcek for bfill; in all the instances the value of stay will be 12.0
df
data:image/s3,"s3://crabby-images/755b8/755b8a4e84ed4f4d1f206c7beaea70aac04ac473" alt=""
data:image/s3,"s3://crabby-images/963e7/963e78dc4d2e9de56b2ee01aa46bf2b84b7f4046" alt=""
#c) using axis attribute
df3 = df3.fillna(method="backfill", axis=1)
print(df3)
data:image/s3,"s3://crabby-images/424d7/424d70cac7bf1bc2215b2c1ce8ce226dd76538f4" alt=""
data:image/s3,"s3://crabby-images/93813/938131c0e55233f2753eb7ab50b868efbc690873" alt=""
# Use of limit attribute ; if axis = 0--> in one column only one NaN value is replaced if limit = 1
if axis = 1--> in one row only one NaN value is replaced if limit = 1
df4 = df4.fillna(method="backfill", axis=0, limit = 1)
df4
data:image/s3,"s3://crabby-images/2b81b/2b81bd3a542df0f5ccd4cf7bdb47a91b41667424" alt=""
data:image/s3,"s3://crabby-images/87ed2/87ed2234f74b3f4e6c5443d921b3cccc744a700e" alt=""
# Use of limit attribute ; if axis = 0--> in one column only one NaN value is replaced if limit = 1
if axis = 1--> in one row only one NaN value is replaced if limit = 1
df5 = df5.fillna(method="backfill", axis=1, limit = 1)
df5
data:image/s3,"s3://crabby-images/b1d8d/b1d8dde8c189df7540c2ad9a7c2070ac04e3d7cb" alt=""
data:image/s3,"s3://crabby-images/0e84d/0e84d3438ff30382cd00aae1af038c1d7dbac3bd" alt=""
#f)Â Use of inplace. By default inplace = false. If inplace=true, then original dataframe is changed. else it remains unchanged
df6= pd.read_csv("F:/SRIHER/2021-2022/Quarter - 3/Advacned Python/Module - 1/Dataset/d1.csv")
print (df6)
df6.fillna(method="backfill",inplace = True)
df6
data:image/s3,"s3://crabby-images/e3ecb/e3ecbb2e5bf8c3597220fd8803c5a24d382c8be2" alt=""
data:image/s3,"s3://crabby-images/533fc/533fc8327f423f1d1c9e0f6b6f60363dfd3d1d00" alt=""
#f)Â Use of inplace. By default inplace = false. If inplace=true, then original dataframe is changed. else it remains unchanged
df6= pd.read_csv("F:/SRIHER/2021-2022/Quarter - 3/Advacned Python/Module - 1/Dataset/d1.csv")
print (df6)
df6.fillna(method="backfill",inplace = True)
df6
data:image/s3,"s3://crabby-images/11ef3/11ef368625dd10f6caf48b1757426b237292b242" alt=""
data:image/s3,"s3://crabby-images/70c45/70c457739cc3b80b360270ff2cb4bf8d2263f875" alt=""
#g)Â Use of dictionary in value attribute
df7= pd.read_csv("F:/SRIHER/2021-2022/Quarter - 3/Advacned Python/Module - 1/Dataset/d1.csv")
print (df7)
values = {"Height": 0, "Weight": 1, "Country": 2, "Place": 3, "Number of days":4, "stay":5}
df7.fillna(value=values,inplace=True)
df7
data:image/s3,"s3://crabby-images/37dab/37dab8727466908e6e53863537e1fe0eccba1ce9" alt=""
data:image/s3,"s3://crabby-images/6d7f5/6d7f5085ae2ff62bbdee68f2a2a1461dbab490e8" alt=""
#g)Â Use of dictionary in value attribute with limit attribute
df8= pd.read_csv("F:/SRIHER/2021-2022/Quarter - 3/Advacned Python/Module - 1/Dataset/d1.csv")
print (df8)
values = {"Height": 0, "Weight": 1, "Country": 2, "Place": 3, "Number of days":4, "stay":5}
df8.fillna(value=values,inplace=True, limit=1)
df8
data:image/s3,"s3://crabby-images/34592/34592aac0ed997ebd7f10a5ff937820a883c1eae" alt=""
data:image/s3,"s3://crabby-images/016db/016db2e28b63ba5e86b25af19b58bd6109786689" alt=""
h) use of mean to impute the missing values
Fill NaN values in one column with mean:
df['col1'] = df['col1'].fillna(df['col1'].mean())
Fill NaN values in more than one column with mean:
df[['col1', 'col2']] = df[['col1', 'col2']].fillna(df[['col1', 'col2']].mean())
Fill NaN values in all column with mean:
df = df.fillna(df.mean())
data:image/s3,"s3://crabby-images/330bb/330bb9923ec06ea9c9f1a59a1bb7d84fae6fa821" alt=""
h) use of median to impute the missing values
Fill NaN values in one column with median:
df['col1'] = df['col1'].fillna(df['col1'].median())
Fill NaN values in more than one column with median:
df[['col1', 'col2']] = df[['col1', 'col2']].fillna(df[['col1', 'col2']].median())
Fill NaN values in all column with median:
df = df.fillna(df.median())
data:image/s3,"s3://crabby-images/e6acc/e6acca9df158a4c4ac9446f12710390dd1df9a2d" alt=""
data:image/s3,"s3://crabby-images/ff8ea/ff8eaeada3a39c4653313e468857640362fd1dda" alt=""
g) use of median to impute the missing values
df10['Weight']=df10['Weight'].fillna(df10['Weight'].median())
df10
Average number is a-0 32 35 36 39 n = 4 Divide by 2 4/2 = 2 Get the number at index 2 i.e, 35
data:image/s3,"s3://crabby-images/ba347/ba347015dcd87ed62b7a3e36330fd2807b159286" alt=""
Get the next number and compute average (35+36) / 2
data:image/s3,"s3://crabby-images/016b3/016b3c80a441ed249dad391d6162143ae3d205d3" alt=""
g) use of median to impute the missing values
df11[['Height','Weight']]=df11[['Height','Weight']].fillna(df11[['Height','Weight']].median())
df11
data:image/s3,"s3://crabby-images/d820f/d820ffc29be6be8e92e41003073b13bcf52c202d" alt=""
data:image/s3,"s3://crabby-images/15a99/15a99b6ef1eb1b8cb2e6371e3809c2ef98fab28a" alt=""
g) use of median to impute the missing values
df12=df12.fillna(df12.median())
df12
data:image/s3,"s3://crabby-images/65d9e/65d9ecfe1cdd73250c6fc9779d07f88db5db51c3" alt=""
1, 2, 3, 4 n=4 4/2 = 2 Element at 2nd loc is 2 Next element is 4 = (2+4) = 6/2 = 3
- Mode is the most frequent observation (or observations) in a sample.
- EX: [5,2,3,3,4,6] – Mode is 3, because 3 appears two times in the sample whereas the other elements only appear once
- The mode does not have to be unique.
- Some samples have more than one mode
- Example: [5,2,3,3,4,6,5] Mode is 3 and 5 because they both appears more often and both appear same number of times
- The mode is commonly used for categorical data
- Boolean – take two values: true or false, male or female
- Nominal – Take more than two values : American, African, Asian
- Ordinal – Take more than two values but the values have a logical order: few- some – many
data:image/s3,"s3://crabby-images/2e324/2e3242581bd33e4acb3940196ca16218a576762e" alt=""
i) use of mode to impute the missing values
df13=df13.fillna(df13.mode())
df13
data:image/s3,"s3://crabby-images/351ae/351ae53f45abaf20528848a077e4ebdaff1fc595" alt=""
Views: 0