1. Home
  2. Docs
  3. Advanced Python
  4. Python Data Analysis
  5. GroupBy

GroupBy

GroupBy

  • Pandas groupby is used for grouping the data according to the categories and apply a function to the categories
  • Helps to aggregate data efficiently
  • It is mostly used with aggregate functions (count, sum,min,max,mean) to get the statistics based on one or more column values
  • Also called Split-Apply-Combine process
    • The groupby() function splits the data based on some criteria
    • The aggregate function is applied to each of the groups
    • The groups are combines together to create the result DataFrame
import pandas as pd
data = {'item' :['tea','coffee','sugar','tea','coffee','sugar','tea','coffee','sugar','tea','sugar','tea'],
        'year':[2010,2010,2010,2011,2011,2011,2012,2012,2012,2011,2012,2011],
        'sales':[1200,1050,500,1500,1200,1000,1230,1300,1420,300,1420,300]}
df = pd.DataFrame(data,columns=['item','year','sales'])
print("Data:\n",df)
print(type(df))

Why groupby() function

  • To select the data based on groups and understand aggregated data on a group level.
  • Though pandas allows us to iterate over every row in a dataframe, it is generally very slow to do so.
  • Fortunately Pandas has a groupby() function to speed up such task.
  • The idea behind the groupby() function is that it takes some dataframe, splits it into chunks based on some key values, applies computation on those chunks, then combines the results back together into another dataframe.
  • In Pandas, this is referred as split-apply-combine pattern

groupby()

  • Split data into groups based on one column
  • Pandas object can be split into a group in many ways.

Syntax:

Dataframeobject.groupby(‘colname’);
  • groupby() actually returns a tuple; where the first value is the value of the key we are trying to group by ; and second one is projected dataframe
  • Split data into groups based on one column
Dataframeobject.groupby(‘colname’,’colname1’);
#number of instances in dataframe
print("Length of dataframe:",len(df))
print("Length of groups:",len(group1))
#to print number of unique groups
print("Number of unique items:\n",group1['item'].nunique())
#group by single column
group1=df.groupby(['item'])
group1
print(group1.size())
print("Using dataframe object :\n",df["item"].value_counts())
# to get first row in each group
print(group1.first())
# to get last row in each group
print(group1.last())

To get the dataframe of a particular group

#using get_group to select particular group
group1.get_group('sugar')

Methods on the groupby object and DataFrame Columns

# to print max
print("Maximum in each group:\n",group1.max()) 
# max() looks the nearest column of item based on which the group is created; here nearest column of item is
#year,so the maximum value from year is selected and displayed for each group
# max and min function can be applied over columns with string values 
print("Minimum in each group:\n",group1.min())
#sum function can be applied over solumns with numerical values
print("Sum across each group:\n",group1.sum())

To compute the sum for a particular column using dataframe object and groupby object

Getting sum values for specific group

# using the groupby object
print(group1["year"].sum()) # possible to pass list of strings separated by commas
print("--------------------------")
# to get the total number of sales for each group
print(group1["sales"].sum())

The return type of these methods are series
#find the max sales 
print("Max sales:\n",group1["sales"].max())
print("--------------------------")
#find the min sales
print("Min sales: \n",group1["sales"].min())
print("--------------------------")
# find the average sales 
print("Average sales:\n",group1["sales"].mean())
--------------------------
Max sales:
 item
coffee    1300
sugar     1420
tea       1500
Name: sales, dtype: int64
--------------------------
Min sales: 
 item
coffee    1050
sugar      500
tea        300
Name: sales, dtype: int64
--------------------------
Average sales:
 item
coffee    1183.333333
sugar     1085.000000
tea        906.000000
Name: sales, dtype: float64

Methods on the groupby object and DataFrame Columns

#using multiple columns
print(group1[["year","sales"]].sum()) # returns a dataframe

Grouping by Multiple Column

Index in the datframe

#group by multiple column
#create a dataframe
data2 = pd.DataFrame( {"Country": ['India','India','USA','USA','India'],
                      "Company":['TN','LA','TN','LA','TN'],
                      "Revenue":[4500,2500,6200,4200,8000],
                      "Profit":[6500,8500,2500,2000,2500]
                     })
group2 = data2.groupby(["Country","Company"])
group2.groups
{('India', 'LA'): Int64Index([1], dtype='int64'),
 ('India', 'TN'): Int64Index([0, 4], dtype='int64'),
 ('USA', 'LA'): Int64Index([3], dtype='int64'),
 ('USA', 'TN'): Int64Index([2], dtype='int64')}

Number of instances

print(group2.size())
group2.sum() # for each numerical column sum is found
#calculating sum for particular column
print(group2['Revenue'].sum())

The .agg() Method

It is called on groupby object

Column name passed in a dictionary as a key with value as a operation

group3 = data2.groupby('Country')
#performing collection of operations on each column
#agg function returns a dataframe
df1 = group3.agg( { "Revenue":"sum",
                   "Profit":"sum"})  
print(df1)

Passing list as a argument with the operations to be performed

#for each column find the size, sum and mean
df2 = group3.agg(['size','sum','mean'])
df2

Passing list of operations as a value for the key, column Revenue

df3 = group3.agg( { "Revenue":["sum","mean"],
                   "Profit":"sum"})  
print(df3)

Specifying group1 variable within the subscript specifying column name

Lets to apply the aggregate function over the specified column

print("For slaes:\n",group1['sales'].agg(['count','min']))

Iterating over the Groups

#print the group elements by iterating throgh
for itemname,groupname in group1:
    print('{}:'.format(itemname))
    print(groupname)

Display the highest revenue company in each country

#Extract the max profit of the company in each country
print(group3["Profit"].max())
# alternative way:
#Display the highest revenue company in each country and store it in dataframe
df_todisplay = pd.DataFrame(columns = data2.columns)
#df_todisplay
for gr,gr_df in group3:
    comp_high_revenue = gr_df.nlargest(1,"Revenue")
    df_todisplay = df_todisplay.append(comp_high_revenue)
df_todisplay

largest(attr1,attr2)- takes two arguments; the first argument represents the integer number which represents the number of largest elements

The second argument represents the column name on which to pick the largest value

#Display the highest revenue for each company
group4 = data2.groupby("Company")
df_high_revenue_country = pd.DataFrame(columns = data2.columns)
for gr,gr_df in group4:
    comp_high_revenue = gr_df.nlargest(1,"Revenue")
    df_high_revenue_country = df_high_revenue_country.append(comp_high_revenue)
df_high_revenue_country

Specify the column name to get mean

# to compute mean for each group
for itemname,groupname in group1:
    print('{}:'.format(itemname))
    print("The item,", itemname,"has average sales of:", groupname['sales'].mean())
    print()
coffee:
The item, coffee has average sales of: 1183.3333333333333

sugar:
The item, sugar has average sales of: 1085.0

tea:
The item, tea has average sales of: 906.0

Call the function max() to get max value in the couln specified

#to compute max sales for each item
for itemname,groupname in group1:
    print('{}:'.format(itemname))
    print("The item,", itemname,"has average sales of:", groupname['sales'].max())
    print()
coffee:
The item, coffee has average sales of: 1300

sugar:
The item, sugar has average sales of: 1420

tea:
The item, tea has average sales of: 1500

Qualitative and Quantitative Data Analysis

Quantitative vs. Qualitative

Examples

Views: 0

How can we help?

0 0 votes
Article Rating
Subscribe
Notify of
guest

0 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments