**Using groupby(): Split-Apply-Combine**

In data analysis workflows, operations like data loading, cleaning and merging are usually followed by summarization using some grouping variable(s). This includes computing summary statistics over variables (or groups within variables), within-group transformations (like variable standardization), pivot tables and by-group analyses. Pandas DataFrames have a .groupby() method that works much like SQL's GROUP BY.

This process involves three steps:

**Splitting** the data into groups based on the levels of a categorical variable. This is generally the simplest step.

**Applying** a function to each group individually. There are three classes of functions we might consider:

- Aggregate – estimate summary statistics (like counts and means) for each group. This reduces the size of the data.
- Transform – within-group standardization, or imputation using group values. The size of the data does not change.
- Filter – discard the rows that belong to particular groups.
- A combination of these three.

**Combining** the results into a Series or DataFrame.

The image below gives a graphical explanation of this process: we split by 'x', apply the function 'mean' to each group formed, and then append the results.

The Split-Apply-Combine Strategy

With pandas, we can implement this strategy as follows:

- **Split** - A DataFrame can be split up by *rows (axis=0)* or *columns (axis=1)* into groups. We use *df.groupby()* to create a GroupBy object.
- **Apply** - A function is applied to each group using *.agg()* or *.apply()*.
- **Combine** - The results of applying functions to groups are put together into an object. Data types of the returned objects are handled gracefully by pandas: aggregating a single column returns a Series, several columns a DataFrame.

We create a GroupBy object by calling the *groupby()* method on a DataFrame, passing the column name(s) we wish to use for grouping. These objects:

- have a *.size()* method, which returns the count of elements in each group
- can be subsetted using column names (or arrays of column names) to select variables for aggregation
- have optimized methods for common aggregation operations, such as:
  - count, sum
  - mean, median, std, var
  - first, last
  - min, max

- support specialized methods like *.describe()*
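As a quick illustration of these properties, here is a minimal, self-contained sketch (the toy frame and its column names are invented for this example):

```python
import pandas as pd

toy = pd.DataFrame({'key': ['a', 'a', 'b'], 'val': [1.0, 2.0, 3.0]})
gb = toy.groupby('key')

print(gb.size())           # element count per group: a -> 2, b -> 1
print(gb['val'].mean())    # subset one column, then aggregate it
print(gb.describe())       # summary statistics for every group
```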

By far, the most important GroupBy methods are *.agg()*, *.transform()* and *.apply()*.
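The sections below work through *.agg()* and *.apply()*. Since *.transform()* (and the related *.filter()*) are not demonstrated later, here is a brief sketch of both on an invented toy frame:

```python
import pandas as pd

toy = pd.DataFrame({'key': ['a', 'a', 'b', 'b'],
                    'val': [1.0, 3.0, 2.0, 6.0]})
gb = toy.groupby('key')

# Transform: within-group standardization; the result has the same length as toy
print(gb['val'].transform(lambda s: (s - s.mean()) / s.std()))

# Filter: keep only the rows of groups whose mean of val exceeds 3
print(gb.filter(lambda g: g['val'].mean() > 3))
```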

Syntax:

```python
# Create a GroupBy object
gb_obj = my_df.groupby('col_x')

# Summarize each group
gb_obj.my_func()
```

Example:

```python
import numpy as np
import pandas as pd
from pandas import DataFrame

# Create a toy dataset with 2 categorical and 2 numeric variables
df = DataFrame({'k1': list('abcd' * 25),
                'k2': list('xy' * 25 + 'yx' * 25),
                'v1': np.random.rand(100),
                'v2': np.random.rand(100)})
print(df.head(15))
```

**Grouping by ONE key**

This results in a summarized data frame indexed by levels of the key.

```python
# Since k1 has 4 categories, this returns 4 rows
# (numeric_only=True skips the string columns)
print(df.groupby('k1').mean(numeric_only=True))

# Since k2 has 2 categories, this returns 2 rows
print(df.groupby('k2').sum(numeric_only=True))
```

**Grouping by TWO keys**

This will result in a summarized data frame with a hierarchical index.

```python
# A DataFrame with a hierarchical index formed by the combinations of levels
print(df.groupby([df['k1'], df['k2']]).sum(numeric_only=True))
```

**Column-wise aggregations – optimized statistical methods**

For simple statistical aggregations of the numeric columns of a DataFrame, we can call methods like *mean* and *sum*:

```python
# Summing a Series, grouped by another column
print(df['v1'].groupby(df['k1']).sum())

# Averaging all numeric Series of a DataFrame
print(df.groupby('k2').mean(numeric_only=True))
```

Or you can pass the name of a function as a string to the *.agg()* method:

```python
# Summing a Series
print(df['v1'].groupby(df['k1']).agg('sum'))

# Finding the mean of each grouped numeric column of a DataFrame
print(df.groupby(df.k1)[['v1', 'v2']].agg('mean').add_prefix('mu_'))
```

**The .agg() method**


When we have a GroupBy object, we may choose to apply one or more functions to one or more columns, even different functions to individual columns. The *.agg()* method allows us to specify these details easily and flexibly. It takes as arguments the following:

1. a list of function names, to be applied to all selected columns
2. tuples of (colname, function), to be applied to all selected columns
3. a dict of {df.col: function} pairs, applying each function to its column

4. We can apply more than one function to the selected column(s) by passing the names of the functions to *agg()* as a list:

```python
# Apply min, mean and max to v1 grouped by k1
print(df['v1'].groupby(df['k1']).agg(['min', 'mean', 'max']))

# Apply min and max to all numeric columns of df grouped by k2
# A hierarchical column index is created
print(df[['v1', 'v2']].groupby(df['k2']).agg(['min', 'max']))

# We can call .stack() on the returned object!
print(df[['v1', 'v2']].groupby(df['k2']).agg(['min', 'max']).stack())
```

5. We can supply names for the columns of the (new) aggregated DataFrame by passing *agg()* a list of (name, function) tuples:

```python
# Provide names for the aggregated columns
df[['v1', 'v2']].groupby(df['k1']).agg([('smallest', 'min'), ('largest', 'max')])
```

6. We can supply DataFrame column names and the functions to apply to each, in a dictionary:

```python
# Apply max and min to v1, and mean and sum to v2, all grouped by k1
df[['v1', 'v2']].groupby(df['k1']).agg({'v1': ['max', 'min'],
                                        'v2': ['mean', 'sum']})
```

**The .apply() method**

This method takes the following arguments:

- a general or **user-defined function**
- any other **parameters** that the function would take

```python
# Retrieve the top N cases from each group
def topN(data, col, N):
    return data.sort_values(by=col, ascending=False).loc[:, col].head(N)

print(df.groupby(df['k2']).apply(topN, col='v1', N=5))
```

**Combining Multiple Datasets - merge()**

The **merge()** function in pandas is similar to the SQL join operations; it links the rows of tables using one or more keys.

Syntax:

```python
pd.merge(df1, df2, how='inner', on=None, left_on=None, right_on=None,
         left_index=False, right_index=False, sort=False,
         suffixes=('_x', '_y'), copy=True)
```

The signature includes the following arguments:

**Which column to merge on:**

- *on='key'* if the same key is present in the two DataFrames,
- or *left_on='lkey'*, *right_on='rkey'* if the keys have different names in the DataFrames

**Note:** To merge on multiple keys, pass a list of column names
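For instance, a minimal sketch of a two-key merge (the frames and key names here are invented for this example):

```python
import pandas as pd

left = pd.DataFrame({'k1': ['a', 'a', 'b'], 'k2': ['x', 'y', 'x'],
                     'lval': [1, 2, 3]})
right = pd.DataFrame({'k1': ['a', 'b', 'b'], 'k2': ['x', 'x', 'y'],
                      'rval': [4, 5, 6]})

# Rows match only when BOTH k1 and k2 agree
print(pd.merge(left, right, on=['k1', 'k2']))
```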

**The nature of the join:**

- the how= option, with values 'left', 'right', 'outer' and 'inner'
- By **default**, the merge is an __inner join__

**Overlapping column names:**

- the suffixes= option takes a tuple of string values to append to overlapping column names, to identify them in the merged dataset
- it **defaults** to ('_x', '_y')

- If you wish **to merge on the DataFrame index**, pass *left_index=True* or *right_index=True* (or both).

**Whether to sort the result DataFrame by the join keys** in lexicographical order:

- the sort= option
- defaults to False; sorting can be expensive on large datasets, so enable it only when ordered output matters
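A short sketch of the effect, on two invented frames:

```python
import pandas as pd

a = pd.DataFrame({'key': ['c', 'a', 'b'], 'v1': [1, 2, 3]})
b = pd.DataFrame({'key': ['b', 'c', 'a'], 'v2': [4, 5, 6]})

print(pd.merge(a, b))             # row order follows the left frame
print(pd.merge(a, b, sort=True))  # rows sorted by 'key': a, b, c
```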

Examples

```python
# Let's define a few toy datasets to use as examples
df1 = DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                 'data1': np.random.randn(7)})
df2 = DataFrame({'key': ['a', 'b', 'd'],
                 'data2': np.random.randn(3)})
df3 = DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                 'data3': np.random.randn(7)})
df4 = DataFrame({'rkey': ['a', 'b', 'd'],
                 'data4': np.random.randn(3)})
print(df1, '\n\n', df2, '\n\n', df3, '\n\n', df4)
```

**Default merge with no parameters**

```python
pd.merge(df1, df2)
```

We note that:

- It is an inner join by default (the output keys are the intersection of the input keys)
- The merge happens on the column 'key', which is common to both datasets
- We could have written *pd.merge(df1, df2, on='key')* to the same effect

```python
# Raises an error: there are no matching column names to merge on
pd.merge(df1, df4)
```

**Specifying which columns to merge on (if keys have different names in datasets)**

```python
pd.merge(df3, df4, left_on='lkey', right_on='rkey')  # still an inner join!
```

**Specifying the type of join**

```python
# The merged dataset will have the union of the keys, with NaNs where values aren't found
pd.merge(df1, df2, how='outer')

# The value 'c' is absent in df2, so there will be a NaN in column data2
pd.merge(df1, df2, how='left')
```

**Specifying Suffixes**

```python
# Add a column with the same name to df1 and df2
df1['colx'] = np.random.randn(7)
df2['colx'] = np.random.randn(3)

# Specify suffixes to identify the columns with the same name
pd.merge(df1, df2, on='key', suffixes=('_l', '_r'))
```

**Merging on columns and index**

```python
# Set lkey to be the index of df3
df3.set_index('lkey', inplace=True)
```

- Note: Do this only once. Re-running *set_index* will raise an error, because 'lkey' is no longer a column.
- You'll have to reset the index before you can set it again.
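If you do need to re-run the cell, a minimal sketch of the reset-then-set pattern:

```python
# Restore 'lkey' as an ordinary column, then set it as the index again
df3 = df3.reset_index().set_index('lkey')
```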

We specify that:

- for df2, we will use the column 'key', and
- for df3, we will use its index to merge

```python
pd.merge(df2, df3, how='left', left_on='key', right_index=True)
```