1. Home
  2. Docs
  3. Advanced Python
  4. Python Data Analysis
  5. Pandas

Pandas

Pandas

  • Provides data analysis tools 
  • It can connect to and interact with a database
  • It can also read and write Excel Files

Pandas Module

After importing the pandas module,

To read the excel file:

import pandas as pd
df = pd.read_excel(‘my_file.xlsx’)

To read multiple sheets in excel file:

import pandas as pd
Xlf = pd.ExcelFile(‘my_file.xlsx’)
Df= Xlf.parse(‘my_sheet’)

Operations

Pandas Operations

Slicing

  • loc gets DataFrame rows & columns by labels/names
    • For loc[], if the label is not present it gives a key error.
  • iloc[] gets by integer Index/position.
    • For iloc[], if the position is not present it gives an index error.
  • DataFrame.loc[ ] is label-based to select rows and / or columns in pandas
  • It accepts single labels, multiple labels from the list, indexes by a range ( between two indexes labels)
  • START is the name of the row / column label
  • STOP is the name of the last row / column label
  • STEP as the number of indices to advance after each extraction
  1. If start row/ column is not provided, loc[] selects from the beginning
  2. By not providing stop, loc[] selects all rows / columns from the start label
  3. Providing both start and stop, selects all rows/columns in between

DataFrame.iloc[ ] is a index-based to select rows and / or columns. It accepts a single index, multiple indexes from the list, indexes by a range

iloc : integer locate

Selecting a single row

import pandas as pd
d1 = pd.read_csv("F:/Advanced Python/Module - 2/Dataset/Dataset5.csv")
d1
d1.iloc[0]

Selecting a single row: (Alternate Way)

import pandas as pd
d1 = pd.read_csv("F:/Advanced Python/Module - 2/Dataset/Dataset5.csv")
d1
d1.iloc[0,:]
import pandas as pd
d1 = pd.read_csv("F:/Advanced Python/Module - 2/Dataset/Dataset5.csv")
d1
d1.iloc[0:1]

Retrieve the first column using iloc

import pandas as pd
d1 = pd.read_csv("F:/Advanced Python/Module - 2/Dataset/Dataset5.csv")
d1
d1.iloc[:,1] # same as d1.iloc[:,1:1]

Select a single cell

import pandas as pd
d1 = pd.read_csv("F:/Advanced Python/Module - 2/Dataset/Dataset5.csv")
d1
d1.iloc[0,2]

Selecting slice of rows

import pandas as pd
d1 = pd.read_csv("F:/Advanced Python/Module - 2/Dataset/Dataset5.csv")
d1
d1.iloc[0:2]

Selecting slice of columns

import pandas as pd
d1 = pd.read_csv("F:/Advanced Python/Module - 2/Dataset/Dataset5.csv")
d1
d1.iloc[:,0:2]

Retrieve subset of cells: retrieve rows from 0 to 3 (inclusive ) and columns till 2 (inclusive)

import pandas as pd
d1 = pd.read_csv("F:/Advanced Python/Module - 2/Dataset/Dataset5.csv")
d1
d1.iloc[0:4,0:3]
import pandas as pd
d1 = pd.read_csv("F:/Advanced Python/Module - 2/Dataset/Dataset5.csv")
d1
d1.iloc[-1:-3:-1,-1:-3:-1] # step size must be negative else cannot traverse

When step count is positive, proceed in right side, Else in left side

Select multiple rows

Select rows 1 and 3

import pandas as pd
d1 = pd.read_csv("F:/Advanced Python/Module - 2/Dataset/Dataset5.csv")
d1
d1.iloc[[1,3]]

Select Multiple Columns

Example: Select the column name and course using iloc

import pandas as pd
d1 = pd.read_csv("F:/Advanced Python/Module - 2/Dataset/Dataset5.csv")
d1
d1.iloc[:,[0,2]]

Selection based on condition

Select the students who are greater than 20

import pandas as pd
d1 = pd.read_csv("F:/Advanced Python/Module - 2/Dataset/Dataset5.csv")
d1
d1['Age']>20
d1[d1['Age']>20]

Slicing using loc

Concatenating

pd.concat() #returns a dataframe

Concatenate two or more data frames into a new one

import pandas as pd
data1 = pd.read_csv("F:/Advanced Python/Module - 2/Dataset/Restaurant - Week 1 Sales.csv")
data1
import pandas as pd
data2 = pd.read_csv("F:/Advanced Python/Module - 2/Dataset/Restaurant - Week 2 Sales.csv")
data2

data 1 has 250 rows and data 2 has 250 rows. But the index is still to 249. because it keeps separate indexing for data1 [0..249] and data 2 [0…249

import pandas as pd
# to concatenate data1 and data2 both have same column name
pd.concat(objs=[data1,data2])

pd.concat()

By default, it is set to False. If it is set to True then, the index is continued after the instances of data1 ends

import pandas as pd
d1=pd.concat(objs=[data1,data2],ignore_index=True)
d1

Dataframeobject.append()

import pandas as pd
data1.append(data2)
import pandas as pd
data1.append(data2,ignore_index=True)

Use of the Parameter keys

import pandas as pd
vd = pd.concat(objs=[data1,data2],keys=["Week-1","Week-2"])
vd
import pandas as pd
vd = pd.concat(objs=[data1,data2],keys=["Week-1","Week-2"])
print(vd.loc[('Week-1')]) # same as vd.loc[('Week-1',)]
import pandas as pd
vd = pd.concat(objs=[data1,data2],keys=["Week-1","Week-2"])
print(vd.loc[('Week-2',)])
# to display the index of 245 belongs to Week-2
print(vd.loc[('Week-2',245)])

Joining

Joining two data frames when keys are in common. Based on that types of joins are given as:

Inner Join

# Find the customers who came on both weeks
customer_both_weeks=data1.merge(data2, how="inner",on="Customer ID")
customer_both_weeks
print(data1[data1["Customer ID"]==155])
print("----------------")
print(data1["Customer ID"]==155)

Changing the default suffixes using the parameter suffixes

customer_both_weeks_cs=data1.merge(data2, how="inner",on="Customer ID",suffixes=[":Week - 1",":Week - 2"])
customer_both_weeks_cs

Find the Customers who came on both weeks and order same food

# Find those customers who came on both weeks and order same food
customer_both_weeks_same_food = data1.merge(data2,how="inner",on=["Customer ID", "Food ID"])
customer_both_weeks_same_food
All_customers_both_weeks = data1.merge(data2,how="outer",on=["Customer ID"],suffixes=[ " - Week 1", " - Week 2"])
All_customers_both_weeks
I = data1.merge(data2,how="outer",on=["Customer ID"],suffixes=[ " - Week 1", " - Week 2"],indicator=True)
I

Outer Join

print(I["_merge"].value_counts())

Displaying the instances either in left dataframe or in right data frame and not in both

masker = I['_merge'].isin(['left_only','right_only'])
I[masker]

Left Join

lj = data1.merge(data4, how="left",on="Food ID")
print(lj)
#to print the items in sorted order
lj = data1.merge(data4, how="left",on="Food ID",sort=True) #default value of sort is False
print(lj)

Right Join

Return all rows from the right dataframe, and any rows with matching keys from the left dataframe.

Considers instances from data4, for each matching food id in data 3, , it joins the data. Else keep the value as NaN

rj = data1.merge(data4, how="right",on="Food ID")
print(rj)

left_on and right_on parameters

  • When two columns actually refer to same value but with different column names
  • When the dataframes have different column names  as a key then there is a need to use left_on and right_on parameter
  • left_on parameter specifies: the column name that serves as a key in left data frame
  • right_on parameter specifies: the column name that serves as a key in right data frame
#left_on : represents column name in left data frame
#right_on : represents column name in rigth data frame
lr = data2.merge(data3,how="left",left_on="Customer ID", right_on="ID")
lr
#to drop one common column use drop()
lr1 = data2.merge(data3,how="left",left_on="Customer ID", right_on="ID").drop("ID",axis='columns')
lr1
# to sort the values: user sort parameter
lr2 = data2.merge(data3,how="left",left_on="Customer ID", right_on="ID",sort=True).drop("ID",axis='columns')
lr2

Merging by left_index and right_index parameters

data3.head(5)

To specify the user defined index, use index_col parameter while reading the csv file in read_csv

data3 = pd.read_csv("F:/Advanced Python/Module - 2/Dataset/Restaurant - Customers.csv",index_col="ID")
data3
data4 = pd.read_csv("F:/Advanced Python/Module - 2/Dataset/Restaurant - Foods.csv",index_col="Food ID")
data4
nd=data1.merge(data3, how="left",left_on="Customer ID",right_index=True)
nd
all_data=nd.merge(data4,how="left",left_on="Food ID",right_index=True)
all_data

In the restaurant example, Find the order in the which the customers arrive on week 1 and week

Solution:

The order is represented by index. In both data frames we should consider the index and based on that merging happens

order_ = data1.merge(data2,how="left", left_index=True, right_index=True, suffixes = [ " : Week 1 ", " : Week 2"])
order_

Merging: Summary

  • Two columns with same name: use on
  • Two columns with different name: use left_on, right_on
  • Merging column with index : left_on, right_index (vice versa)
  • Merging index with index: (left_index, right_index)

Joining: .join() method

When two data frames has to be joined vertically based on the index

data5 = pd.read_csv("F:/Advanced Python/Module - 2/Dataset/Restaurant - Week 1 Satisfaction.csv")
data5
sat = data1.merge(data5, how="left", left_index=True,right_index=True)
sat.head(5)
sat1 = data1.join(data5)
sat1.head(5)

Join two dataframes vertically, when two dataframes share exactly same index

Merging: pd.merge() method

  • So far, merge method has been called on pandas object
  • merge() method can be invoked on pandas library as pd.merge()
cus_info = pd.merge(data1,data3,how="left",left_on="Customer ID", right_on="ID")
cus_info.head(3)
cus_info1=data1.merge(data3,how="left",left_on="Customer ID", right_on="ID")
cus_info1.head(3)

Views: 1

How can we help?

0 0 votes
Article Rating
Subscribe
Notify of
guest

0 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments