Pandas Data Structures

Series

A Series is a one-dimensional array-like object containing an array of data (of any NumPy data type) and an associated array of data labels, called its index.

The different ways to create a series:

  • Create an Empty series in pandas
  • Create a series from array without indexing
  • Create a series from array with indexing
  • Create a series from dictionary
  • Create a series from scalar value
  • Create a series from list in pandas
  • Create Series from multi list
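
The creation routes listed above can be sketched as follows. This is a minimal illustration with made-up values and variable names (none of them come from the original material). "Multi list" is read here as a list of lists, which is an assumption.

```python
import numpy as np
import pandas as pd

# Empty Series (dtype given explicitly so the result does not depend on a default)
s_empty = pd.Series(dtype="float64")

# From a NumPy array without an explicit index: labels default to 0..n-1
arr = np.array([10, 20, 30])
s_default = pd.Series(arr)

# From an array with an explicit index
s_labeled = pd.Series(arr, index=["a", "b", "c"])

# From a dictionary: the keys become the index
s_dict = pd.Series({"x": 1.5, "y": 2.5})

# From a scalar: the value is repeated for every index label
s_scalar = pd.Series(7, index=["p", "q", "r"])

# From a list, and from a list of lists (each inner list becomes one element)
s_list = pd.Series([1, 2, 3])
s_multi = pd.Series([[1, 2], [3, 4]])

print(s_labeled)
```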

How to access the elements in Series?

  • Accessing data from series with position
  • Accessing data from series with labels or index
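
A short sketch of both access styles, using a made-up series of marks. The explicit iloc (position) and loc (label) accessors are used here as one reasonable choice, because they stay unambiguous when the labels are themselves integers.

```python
import pandas as pd

# Hypothetical marks, labelled by student
s = pd.Series([85, 90, 78], index=["aaa", "bbb", "ccc"])

first = s.iloc[0]                # by position: first element
first_two = s.iloc[:2]           # positional slice, end position excluded

by_label = s.loc["bbb"]          # by label
subset = s.loc[["aaa", "ccc"]]   # several labels at once

print(first, by_label)
```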

To detect missing data

  • isnull()
  • notnull()
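
As a minimal sketch with made-up values, both methods return a Boolean series of the same length, which can then be used to filter the data:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

missing = s.isnull()    # True where the value is missing
present = s.notnull()   # True where the value is present

clean = s[present]      # keep only the non-missing values
print(missing)
```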

Both the Series object and its index have a name attribute, which integrates with other areas of pandas functionality

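s11.index = ['aaa','bbb','xxx','ccc'] # Assign values for the index (s11 must have four elements)
s11.index.name = 'student Names' # Assign a name to the index; done after replacing the index so the name is kept
s11.name = 'Marks' # Assign a name to the series (used as its column name)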

Use of method between()

Syntax: 

Series.between(left, right, inclusive='both')

Parameters:

  • left: A scalar value that defines the left boundary
  • right: A scalar value that defines the right boundary
  • inclusive: Which boundaries to include: 'both' (the default), 'neither', 'left' or 'right'. Older pandas releases took a Boolean here (True by default; False excluded both boundary values); current releases expect the string form.

Return Type: 

A Boolean series which is True for every element that lies between the argument values.

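  • This method is mainly used with numeric or date values; strings are compared lexicographically, which is rarely what is wanted
  • It is defined for Series (1D data) only, not for DataFrame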
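
To illustrate how the boundary handling changes the result, here is a small sketch with made-up salary values. It assumes a pandas release in which inclusive takes the strings 'both', 'neither', 'left' or 'right'.

```python
import pandas as pd

salary = pd.Series([40000, 20000, 10000, 12000])

both = salary.between(12000, 20000)                          # both ends included
neither = salary.between(12000, 20000, inclusive="neither")  # both ends excluded
left_only = salary.between(12000, 20000, inclusive="left")   # only 12000 included

print(both.tolist())
```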

Use of between() method

Find the employees whose salary is between 12,000 and 20,000, both inclusive
import pandas as pd
data_c = pd.read_csv("F:/Advanced Python/Module - 2/Dataset/Dataset6.csv")
data_c
# Find the employees whose salary is between 12,000 and 20,000, both inclusive
Emp = data_c['Salary'].between(12000, 20000, inclusive='both')
print(Emp) # Emp is a Series with boolean values
print(data_c[Emp]) # Keep only the rows where Emp is True
# Equivalent statement: print(data_c[data_c['Salary'].between(12000, 20000)])
0    False
1     True
2    False
3     True
4     True
5     True
Name: Salary, dtype: bool
    Name  Age   Designation  Salary
1  Rudra    23    Assistant   20000
3   Tony    24        Clerk   12000
4   John    23  Office Asst   13500
5  James    23        Steno   14000

Data Frame

  • A data frame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can be a different value type (numeric, string, Boolean)
  • The dataframe has both a row index and a column index
  • It can be thought of as a dict of series, all sharing the same index

Create a data frame using dictionary

When a dictionary is converted to a dataframe, by default all the keys of the dict object become columns, and the range of numbers 0, 1, 2, …, n is assigned as the row index
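
A minimal sketch of this conversion, using a made-up dictionary. The row labels 'r1' and 'r2' are hypothetical and only show how to override the default integer index.

```python
import pandas as pd

data = {"Name": ["Rudra", "Tony"], "Age": [23, 24]}

# Keys become column labels; rows get the default index 0, 1, ...
df = pd.DataFrame(data)

# The same data with explicit row labels
df_labeled = pd.DataFrame(data, index=["r1", "r2"])

print(df)
```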

Create a data frame using from_dict()

Syntax

DataFrame.from_dict(data, orient='columns', dtype=None, columns=None)

data:

It takes a dict of the form {field: array-like} or {field: dict} as input

Unlike the pd.DataFrame() constructor, which creates an empty dataframe when no data is given, from_dict() requires this argument. The resultant column order follows the insertion order of the dict

orient: (optional)

If the keys of the dict should be the rows of the DataFrame, set orient='index'; otherwise leave it as 'columns' (the default)

dtype: (optional)

Data type to force on the resulting DataFrame. Only a single data type is allowed. If not given, it is inferred from the data

columns: (optional)

  • Used only with orient='index', to specify the column labels of the resulting DataFrame.
  • Default column labels are a range of integers, i.e. 0, 1, 2, …, n
  • Using the columns parameter with orient='columns' raises a ValueError

Other ways to create a data frame:

  • Create a data frame using read_excel: refer code
  • Create a data frame using read_csv: refer code
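
The from_dict() parameters described above can be tried out with a small sketch. The dictionary keys and the column labels 'A' and 'B' are made up for illustration.

```python
import pandas as pd

data = {"row1": [1, 2], "row2": [3, 4]}

# Default orient='columns': the dict keys become column labels
by_columns = pd.DataFrame.from_dict(data)

# orient='index': the dict keys become row labels, and columns= names the columns
by_index = pd.DataFrame.from_dict(data, orient="index", columns=["A", "B"])

# Passing columns= together with orient='columns' is rejected
try:
    pd.DataFrame.from_dict(data, orient="columns", columns=["A", "B"])
    raised = False
except ValueError:
    raised = True

print(by_index)
```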

Basic Functions in data frame

  • info() method prints a summary of the dataframe
  • The summary contains the number of columns, the column labels, the column data types, the memory usage, the range index, and the number of non-null values in each column
  • It does not return a value; it only prints the summary
data_pt = pd.read_csv("F:/Advanced Python/Module - 2/Dataset/Dataset9.csv")
data_pt.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    5 non-null      object
 1   Gender  5 non-null      object
 2   Salary  5 non-null      int64 
 3   Age     5 non-null      int64 
dtypes: int64(2), object(2)
memory usage: 288.0+ bytes
  • describe() method returns a statistical description of the data in the Data Frame
  • If the dataframe contains numerical data, the description contains
    • count – the number of non-empty values
    • mean – the average (mean) value
    • std – the standard deviation
    • min – the minimum value
    • 25% – the 25th percentile
    • 50% – the 50th percentile (the median)
    • 75% – the 75th percentile
    • max – the maximum value
data_pt.describe()

             Salary        Age
count      5.000000   5.000000
mean   15400.000000  29.000000
std    12381.437719   7.582875
min     5000.000000  23.000000
25%     7000.000000  24.000000
50%    10000.000000  25.000000
75%    20000.000000  32.000000
max    35000.000000  41.000000
  • nunique() method returns the number of distinct values along the chosen axis
data = [[10, 20, 0], [10, 10, 10], [10, 20, 30]]
df = pd.DataFrame(data)
print(df.nunique()) # by default, counts the distinct values in each column
0    1
1    2
2    3
dtype: int64
print(df.nunique(axis="columns"))  # counts the distinct values in each row
0    3
1    1
2    3
dtype: int64

value_counts() returns a series containing the count of each unique value, sorted in descending order of count by default

Syntax

Series.value_counts(normalize=False, sort=True, ascending=False, bins=None, dropna=True)
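
A brief sketch of the most commonly used parameters, on a made-up series that contains one missing value:

```python
import pandas as pd

s = pd.Series(["M", "F", "M", None, "M"])

counts = s.value_counts()                 # missing values are dropped by default
shares = s.value_counts(normalize=True)   # relative frequencies instead of counts
with_nan = s.value_counts(dropna=False)   # also count the missing value

print(counts)
```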

where()

The pandas where() method checks a data frame against one or more conditions and keeps only the entries that satisfy them.

By default, the entries that do not satisfy the condition are replaced with NaN

DataFrame.where(cond, other=nan, inplace=False, axis=None)
  • cond: one or more conditions to check the data frame against
  • other: value that replaces the entries which do not satisfy the condition. Default is NaN
  • inplace: Boolean value. Makes the changes in the data frame itself if True
  • axis: axis to align along (rows or columns), if needed
data_f = pd.read_csv("F:/SRIHER/2021-2022/Quarter - 3/Advanced Python/Module - 2/Dataset/Dataset6.csv")
data_f
# Create a filter
# Using a single condition: keep only the rows whose designation is Assistant
f1 = data_f['Designation'] == 'Assistant'
data_f1 = data_f.where(f1, inplace=False) # a replacement value can also be passed, e.g. other="-"
print(data_f1)
    Name  Age  Designation   Salary
0    NaN   NaN         NaN      NaN
1  Rudra  23.0   Assistant  20000.0
2   Siva  21.0   Assistant  10000.0
3    NaN   NaN         NaN      NaN
4    NaN   NaN         NaN      NaN
5    NaN   NaN         NaN      NaN
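
The effect of the other parameter can be seen in a small sketch on a made-up numeric data frame. The condition here is element-wise, so individual cells rather than whole rows are replaced.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 5, 9], "b": [4, 8, 2]})

masked = df.where(df > 3)            # cells that are not > 3 become NaN
filled = df.where(df > 3, other=0)   # cells that are not > 3 become 0

print(filled)
```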

sort_values()

The pandas sort_values() function sorts a data frame in ascending or descending order of the passed column

DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')

Parameters

  • by: a single column name or a list of column names to sort the data frame by
  • axis: 0 or 'index' for rows and 1 or 'columns' for columns
  • ascending: Boolean value. Sorts the data frame in ascending order if True
  • inplace: Boolean value. Makes the changes in the passed data frame itself if True
  • kind: string naming the sorting algorithm: 'quicksort', 'mergesort', 'heapsort' or 'stable'
  • na_position: takes 'last' or 'first' to set the position of null values. Default is 'last'
data_f2 = data_f.sort_values('Name')
print(data_f2)
       Name  Age   Designation  Salary
0  Abhinesh    31  Tech Leader   40000
5     James    23        Steno   14000
4      John    23  Office Asst   13500
1     Rudra    23    Assistant   20000
2      Siva    21    Assistant   10000
3      Tony    24        Clerk   12000
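
Beyond a single ascending sort, the other parameters can be combined as in the following sketch. The data frame is made up, with one missing age to show na_position.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Tony", "John", "Siva", "James"],
    "Age": [24, 23, None, 23],
})

# Descending by Age, with the missing age placed first
by_age_desc = df.sort_values("Age", ascending=False, na_position="first")

# Sort by Age, then by Name to break ties
by_age_name = df.sort_values(["Age", "Name"])

print(by_age_name)
```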

corr()

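  • DataFrame.corr() method is used to find the pairwise correlation of all columns in the dataframe
  • Any NA values are automatically excluded
  • Non-numeric columns are left out of the calculation; in pandas 2.0 and later this requires passing numeric_only=True, otherwise a non-numeric column raises an error
DataFrame.corr(method='pearson', min_periods=1, numeric_only=False)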

Parameters:

  • method: the correlation measure to compute
    • pearson: standard correlation coefficient
    • kendall: Kendall Tau correlation coefficient
    • spearman: Spearman rank correlation

Returns:

  • DataFrame: the correlation matrix

Pearson coefficient

data_co= pd.read_csv("F:/Advanced Python/Module - 2/Dataset/Dataset7.csv")
data_co
data_co.corr()
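
Since the contents of Dataset7.csv are not shown, here is a self-contained sketch on made-up columns: y rises exactly with x and z falls exactly with x, so the expected coefficients are easy to check. It assumes a pandas release that accepts the numeric_only argument.

```python
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "y": [2, 4, 6, 8],           # perfectly positively correlated with x
    "z": [4, 3, 2, 1],           # perfectly negatively correlated with x
    "label": list("abcd"),       # non-numeric column, left out of the result
})

pearson = df.corr(numeric_only=True)
spearman = df.corr(method="spearman", numeric_only=True)

print(pearson)
```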

get_dummies()

The datasets often include categorical variables

data_co1= pd.read_csv("F:/Advanced Python/Module - 2/Dataset/Dataset8.csv")
data_co1
  Name  Age  Height  Weight Gender Studying
0  XXX   21     160    62.1      M        Y
1  YYY   19     158    65.4      F        N
2  ZZZ   21     141    72.1      M        Y
3  ABC   24     155    74.0      F        N
4  ACX   24     159    73.0      F        Y

Examples of categorical variables:

  • Marital status (“married”, “single”, “divorced”)
  • Smoking status (“smoker”, “non-smoker”)
  • Eye color (“blue”, “green”, “hazel”)
  • Level of education (“high school”, “Bachelor’s degree”, “Master’s degree”)

When working with machine learning algorithms, categorical variables often need to be converted to dummy variables, which are numeric variables used to represent categorical data

data_changed = pd.get_dummies(data_co1,columns=['Gender'],drop_first=False)
data_changed
data_changed1 = pd.get_dummies(data_co1,columns=['Gender'],drop_first=True)
data_changed1
  • data: The pandas DataFrame to convert
  • prefix: A string to prepend to the names of the new dummy variable columns
  • columns: The name of the column(s) to convert to dummy variables
  • drop_first: Whether or not to drop the first dummy variable column

Creating multiple dummy variables

data_changed2 = pd.get_dummies(data_co1,columns=['Gender','Studying'],drop_first=True)
data_changed2
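
To make the resulting column names visible without the CSV file, here is a sketch on a three-row data frame modelled on the table above. The prefix 'G' is a made-up value used only to show the prefix parameter.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["XXX", "YYY", "ZZZ"], "Gender": ["M", "F", "M"]})

# One dummy column per category: Gender_F and Gender_M
full = pd.get_dummies(df, columns=["Gender"])

# drop_first=True drops the first category (F), leaving only Gender_M
reduced = pd.get_dummies(df, columns=["Gender"], drop_first=True)

# A custom prefix replaces the column name in the new labels
prefixed = pd.get_dummies(df, columns=["Gender"], prefix="G")

print(list(full.columns))
```

Dropping the first category avoids a redundant column, since its value is implied by the remaining ones.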

pivot_table

pivot_table() summarizes data using aggregate statistics such as the mean, sum, median or count

To calculate the percentage of a category in a pivot table, we calculate the ratio of category count to the total count

  • pivot_table requires a data and an index parameter
  • data is the Pandas dataframe you pass to the function
  • index is the feature that allows you to group your data. The index feature will appear as an index in the resultant table

Syntax:

pd.pivot_table(dataframeobject, index=['col name'])
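data_pt = pd.read_csv("F:/Advanced Python/Module - 2/Dataset/Dataset9.csv")
data_pt
# Mean of the numeric columns per gender (mean is the default aggfunc)
pt = pd.pivot_table(data_pt, index=['Gender'], values=['Salary', 'Age'])
pt
# A different aggregation per column: total salary and median age per gender
pt1 = pd.pivot_table(data_pt, index=['Gender'], aggfunc={'Salary': 'sum', 'Age': 'median'})
pt1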
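
The percentage calculation described earlier (category count divided by total count) can be sketched as follows. The data frame is made up and stands in for Dataset9.csv, whose rows are not shown; only its column names are taken from the info() output.

```python
import pandas as pd

df = pd.DataFrame({
    "Gender": ["M", "F", "M", "F", "M"],
    "Salary": [5000, 7000, 10000, 20000, 35000],
    "Age": [23, 24, 25, 32, 41],
})

# Mean of the chosen numeric columns per gender (default aggfunc is the mean)
mean_table = pd.pivot_table(df, index="Gender", values=["Salary", "Age"])

# Number of rows per gender, then each gender's share of the total in percent
counts = pd.pivot_table(df, index="Gender", values="Salary", aggfunc="count")
percent = counts["Salary"] / counts["Salary"].sum() * 100

print(percent)
```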
