Pandas Tutorial Part-2

Using groupby(): Split-Apply-Combine

In data analysis workflows, operations like data loading, cleaning, and merging are usually followed by summarizations over some grouping variable(s): summary statistics for variables (or for groups within variables), within-group transformations (like variable standardization), pivot tables, and other by-group analyses. Pandas DataFrames have a .groupby() method that works much like the SQL GROUP BY clause.

This process involves three steps:

  • Splitting the data into groups based on the levels of a categorical variable. This is generally the simplest step.
     
  • Applying a function to each group individually. There are 3 classes of functions we might consider:
    • Aggregate – estimate summary statistics (like counts or means) for each group. This reduces the size of the data.
    • Transform – within-group standardization, or imputation using group values. The size of the data does not change.
    • Filter – discard the rows belonging to groups that fail some criterion. (Both .transform() and .filter() are sketched after the .apply() example below.)
    • A combination of these 3
       
  • Combining the results into a Series or DataFrame

We split by 'x', apply the function 'mean' to each group formed, and then append (combine) the results, as in the sketch below.

The Split-Apply-Combine Strategy
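
A minimal sketch of the same idea in code (the grouping column 'x' and the function 'mean' come from the original figure; the toy values are made up for illustration):

import pandas as pd

# Split the frame by 'x', apply 'mean' to each group, combine the results
toy = pd.DataFrame({'x': ['a', 'b', 'a', 'b'],
                    'y': [1.0, 3.0, 5.0, 7.0]})
print(toy.groupby('x')['y'].mean())
# x
# a    3.0
# b    5.0
# Name: y, dtype: float64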

With pandas, we can implement this strategy as

  • Split
    • A DataFrame can be split up by rows (axis=0) or columns (axis=1) into groups.
    • We call the .groupby() method on the DataFrame to create a GroupBy object
  • Apply
    • A function is applied to each group using .agg() or .apply()
  • Combine
    • The results of applying functions to the groups are put together into a new object
    • Note: the type of the returned object (Series or DataFrame) depends on the function applied; pandas handles this gracefully

We create a GroupBy object by calling the .groupby() method on a DataFrame, passing the name (or a list of names) of the column(s) we wish to group by. These objects (demonstrated briefly below),

  • have a .size() method, which returns the count of elements in each group.
  • can be subsetted using column names (or arrays of column names) to select variables for aggregation
  • have optimized methods for general aggregation operations like -
    • count, sum
    • mean, median, std, var
    • first, last
    • min, max
  • also support specialized methods like .describe()
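
For instance, continuing with the toy frame from the sketch above:

gb = toy.groupby('x')
print(gb.size())        # number of rows in each group
print(gb['y'].sum())    # subset one column, then aggregate it
print(gb.describe())    # count, mean, std and quartiles per group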

By far, the most important GroupBy methods are .agg(), .transform(), and .apply()

Syntax:

# Create a GroupBy object
gb_obj = my_df.groupby('col_x')

# Summarize each group
gb_obj.my_func()

Example:

import numpy as np
import pandas as pd
from pandas import DataFrame

# Create a toy dataset with 2 categorical and 2 numeric variables
df = DataFrame({'k1': list('abcd' * 25),
                'k2': list('xy' * 25 + 'yx' * 25),
                'v1': np.random.rand(100),
                'v2': np.random.rand(100)})
print(df.head(15))


  • Grouping by ONE key

This results in a summarized data frame indexed by levels of the key.

# Since k1 has 4 categories, this will return 4 rows
# (numeric_only=True is needed on recent pandas, which no longer silently drops non-numeric columns)
print(df.groupby('k1').mean(numeric_only=True))

# Since k2 has 2 categories, this will return 2 rows
print(df.groupby('k2').sum(numeric_only=True))
  • Grouping by TWO keys

This will result in a summarized data frame with a hierarchical index.

# A dataframe with a hierarchical index formed by a combination of the levels
print(df.groupby([df['k1'], df['k2']]).sum())
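
Equivalently, since both grouping Series come from df itself, we can simply pass the column names:

# Same result, passing column names instead of Series
print(df.groupby(['k1', 'k2']).sum())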
  • Column-wise aggregations – optimized statistical methods

For simple statistical aggregations of the numeric columns of a DataFrame, we can call optimized methods like mean and sum directly:

# Summing a Series
print(df['v1'].groupby(df['k1']).sum())

# Averaging all numeric Series of a DataFrame
print(df.groupby('k2').mean(numeric_only=True))

Or you can pass the name of a function as a string to the .agg() method:

# Summing a Series
df['v1'].groupby(df['k1']).agg('sum')

# Finding the mean of all grouped numeric Series of a DataFrame
print(df.groupby(df.k1)[['v1', 'v2']].agg('mean').add_prefix('mu_'))
  • The .agg() method


When we have a GroupBy object, we may choose to apply one or more functions to one or more columns, even different functions to individual columns. The .agg() method allows us to specify all of this easily and flexibly. It takes the following as arguments:

  1. a list of function names to be applied to all selected columns
  2. tuples of (name, function), where the name labels the resulting aggregated column
  3. a dict mapping each column name to the function(s) to apply to it
  4. Apply more than one function to the selected column(s) by passing the function names to agg() as a list
# Apply min, mean and max to v1 grouped by k1
df['v1'].groupby(df['k1']).agg(['min', 'mean', 'max'])

# Apply min and max to all numeric columns of df grouped by k2
# (a hierarchical column index will be created)
print(df[['v1', 'v2']].groupby(df['k2']).agg(['min', 'max']))

# We can call .stack() on the returned object!
print(df[['v1', 'v2']].groupby(df['k2']).agg(['min', 'max']).stack())

5. We can supply names for the columns of the (new) aggregated DataFrame by passing agg() a list of (name, function) tuples

# Provide names for the aggregated columns
df[['v1', 'v2']].groupby(df['k1']).agg([('smallest','min'), ('largest', 'max')])

6. We can supply a dictionary mapping DataFrame column names to the function(s) to apply to each

# Apply max and min to v1; and mean and sum to v2; all grouped by k1
df[['v1', 'v2']].groupby(df['k1']).agg({'v1': ['max', 'min'], 'v2': ['mean', 'sum']})

The .apply() method

This method takes the following arguments:

  • a general or user-defined function
  • any other parameters the function takes
# Retrieve the top N cases of a column from each group
def topN(data, col, N):
    return data.sort_values(by=col, ascending=False).loc[:, col].head(N)

print(df.groupby(df['k2']).apply(topN, col='v1', N=5))
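
Of the three key GroupBy methods, .transform() has not yet been shown: it returns an object the same size as its input, which makes it the natural tool for the within-group standardization mentioned earlier. A minimal sketch of .transform() and .filter() on the toy df (the 0.5 threshold is arbitrary):

# Standardize v1 within each k1 group; the result keeps df's index,
# so it can be assigned straight back as a new column
df['v1_z'] = df.groupby('k1')['v1'].transform(lambda s: (s - s.mean()) / s.std())

# .filter() drops entire groups: keep only groups whose mean v1 exceeds 0.5
print(df.groupby('k1').filter(lambda g: g['v1'].mean() > 0.5).head())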

Combining Multiple Datasets - merge()

The merge() function in pandas is similar to SQL join operations: it links the rows of tables using one or more keys.

Syntax:

merge(df1, df2,
      how='inner', on='key', left_on=None, right_on=None,
      left_index=False, right_index=False,
      sort=False, copy=True,
      suffixes=('_x', '_y'))

The syntax includes specifications of the following arguments:

  • Which column(s) to merge on;
    • on='key' if the same key is present in both DataFrames,
    • or left_on='lkey', right_on='rkey' if the keys have different names in the DataFrames

Note: To merge on multiple keys, pass a list of column names (see the sketch after this list)

  • The nature of the join;
    • the how= option, with 'left', 'right', 'outer' and 'inner'
    • By default, the merge is an inner join
  • Tuple of string values to append to overlapping column names to identify them in the merged dataset
    • the suffixes= option
    • defaults to ('_x', '_y')
  • If you wish to merge on the DataFrame index,
    • pass left_index=True or right_index=True or both.
  • Whether to sort the result DataFrame by the join keys in lexicographical order;
    • the sort= option
    • defaults to False in current pandas (early versions defaulted to True); leaving it off substantially improves performance on large merges
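
As a quick illustration of the multi-key case (these small frames and their column names are made up):

left = DataFrame({'k1': ['a', 'a', 'b'], 'k2': ['x', 'y', 'x'], 'lval': [1, 2, 3]})
right = DataFrame({'k1': ['a', 'b', 'b'], 'k2': ['x', 'x', 'y'], 'rval': [4, 5, 6]})

# Pass a list of column names to merge on multiple keys
print(pd.merge(left, right, on=['k1', 'k2'], how='inner'))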

Examples

# Let's define a few toy datasets to use as examples

df1 = DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                 'data1': np.random.randn(7)})
df2 = DataFrame({'key': ['a', 'b', 'd'],
                 'data2': np.random.randn(3)})

df3 = DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                 'data3': np.random.randn(7)})
df4 = DataFrame({'rkey': ['a', 'b', 'd'],
                 'data4': np.random.randn(3)})

print(df1, '\n\n', df2, '\n\n', df3, '\n\n', df4)

Default merge with no parameters

pd.merge(df1, df2)

We note that:

  • It is an inner join by default (output key is the intersection of input keys)
  • Merge happens on the column 'key' which is common to both datasets;
  • We could've written pd.merge(df1, df2, on='key') to the same effect

# Yields an error because there are no common column names to merge on
pd.merge(df1, df4)

Specifying which columns to merge on (if keys have different names in datasets)

pd.merge(df3, df4, left_on='lkey', right_on='rkey')
# still an inner join!

Specifying the type of join

pd.merge(df1, df2, how='outer')
# the merged dataset will have the union of the keys, with NaN filled in where values aren't found
pd.merge(df1, df2, how='left')
# the key 'c' is absent in df2, so its rows will have NaN in column data2

Specifying Suffixes

# Add a column with the same name to df1 and df2
df1['colx'] = np.random.randn(7)
df2['colx'] = np.random.randn(3)

# Specifying suffixes to identify columns with the same name
pd.merge(df1, df2, on='key', suffixes=['_l', '_r'])

Merging on columns and index

# Set lkey to be the index of df3
df3.set_index('lkey', inplace=True)
  • Note: Run this only once. Once 'lkey' becomes the index it is no longer a column, so re-running set_index('lkey') raises a KeyError.
  • You'll have to reset the index before you can set it again, as in the sketch below.
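
A short sketch of that reset-then-set cycle:

df3.reset_index(inplace=True)        # 'lkey' moves back to a regular column
df3.set_index('lkey', inplace=True)  # ...and can be made the index again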

We specify that

  • for df2, we will use the column 'key', and
  • for df3, we will use its index to merge
print(pd.merge(df2, df3, how='left', left_on='key', right_index=True))
