Forget about using operations such as

and **df["new_col"] =**

to create new columns. Here is why you should be using the **df.new_col =**

method — it returns you a DataFrame object, which allows you to continue your chaining operation to further manipulate your DataFrame. Unlike the **.assign()**

method, the two infamous operation above return you a **.assign()**`None`

which means you cannot possibly chain your operation further.

If you are not convinced, then let me bring back the old nemesis —

. Pretty sure each of us has bumped into this one at some point in time.**SettingWithCopyWarning**

Enough of the warning, I want to unsee ugly red boxes in my notebook!

Using

, let us add a few new columns such as **.assign()**

, **ratio_casual_registered**

, and **avg_temp****ratio_squared**

`(bike`

.assign(ratio_casual_registered = bike.casual.div(bike.registered),

avg_temp = bike.temp.add(bike.atemp).div(2),

ratio_squared = lambda df_: df_.ratio_casual_registered.pow(2))

)

In short, here’s what the method above does:

- We can create as many new columns as we want using the

method, separated by the delimiter comma.**.assign()** - The lambda function when creating the column

serves to get access to the most recent DataFrame after we added the column**ratio_squared**

. Say, we do not use a lambda function to get access to the most recent DataFrame**ratio_casual_registered**

, but instead continue with**df_**

, we would get an error as the original DataFrame does not have the column**bike.ratio_casual_registered.pow(2)**

, even after adding it in the**ratio_casual_registered**

method before creating**.assign()**

. If you can’t wrap your head around this concept to decide whether or not to use lambda function, my suggestion is just use one!**ratio_squared** - Bonus! I leave some not-so-common way to perform arithmetic operations using methods.

Well, the

method is not uncommonly used, but they are necessary to get us started before we delve deeper into the next methods. One thing that often goes unnoticed and left unspoken of is that the the **.groupby()**

method has a lazy nature. By that, it means that the method is lazily evaluated. In other words, it does not evaluate right away, that is why you often see **.groupby()**

right after calling the method **<pandas.core.groupby.generic.DataFrameGroupBy object at 0x14fdc3610>****.groupby()**

From Pandas DataFrame documentation², the value to feed in the parameter

could be a **by**** mapping, function, label, pd.Grouper or list of such. **Nonetheless, the most common one you probably encounter is to group by columns names (list of Series name separated by comma). After the

**.groupby()**

operation, we could perform operation such as **.mean()**

, **.median()**

, or applying custom function using **.apply()**

.The value of the specified columns that we feed into the

`parameters in the`

by`method would become the index of the result. If we specify grouping more than 1 column, then we will obtain a MultiIndex.`

.groupby()

`(bike`

.groupby(['season', 'weathersit'])

.mean(numeric_only=True) #alternative version: apply(lambda df_: df_.mean(numeric_only=True))

.atemp

)

Here, we grouped our DataFrame by the column

, and **season**

. Then, we calculate the mean value and subset only the column **weathersit**

.**atemp**

If you are meticulous enough to dig the Pandas documentation², you might encounter both methods

and **.agg()**

. You might be wondering what is the difference and when to use which? Save your time! They are the same, **.aggregate()**

is merely an alias for **.agg()**

.**.aggregate()**

has a parameter **.agg()**

, which literally takes in a function, string function name, or list of functions. By the way, you can aggregate different functions over the columns as well! Let’s continue our example above!**func**

`#Example 1: Aggregating using more than 1 function`

(bike

.groupby(['season'])

.agg(['mean', 'median'])

.atemp

)#Example 2: Aggregating using different function for different columns

(bike

.groupby(['season'])

.agg(Meann=('temp', 'mean'), Mediann=('atemp', np.median))

)

With

, the result we obtain is of reduced dimensionality as compared to the initial dataset. In simple terms, your data dimension shrinks with lesser number of rows and columns, containing the aggregate information. If what you want is to summarize the grouped data and obtain aggregated values, then **.agg()**

is the solution.**.groupby()**

With

, we also start with the intention of doing aggregation of information. However, instead of creating a summary of information, we want the output to have the same shape as the original DataFrame, without shrinking the size of the original DataFrame.**.transform()**

Those of you who have exposure to database systems like SQL may find the idea behind

similar to that of Window Function. Let’s see how **.transform()**

works on the above example!**.transform()**

`(bike`

.assign(mean_atemp_season = lambda df_: df_

.groupby(['season'])

.atemp

.transform(np.mean, numeric_only=True))

)

As seen above, we created a new column with column name—

where we fill in the column with the aggregate (mean) of the **mean_atemp_season**

column. Thus, whenever **atemp**

is 1, then we have the same value for **season**

. Notice the important observation here is that we retain the original dimension of the dataset plus one additional column!**mean_atemp_season**

Here’s a bonus for those obsessed with Microsoft Excel. You might be tempted to use

to create summary table. Well of course, this method works too! But here’s a two cent, **.pivot_table()**

is more versatile and used for a broader range of operations beyond just reshaping, such as filtering, transformation, or applying group-specific calculations.**.groupby()**

Here’s how to use

in short. You specify the column(s) you want to aggregate in the argument **.pivot_table()**

. Next, specify the **values****index** of the summary table you want to create using a subset of the original DataFrame. This can be more than one column and the summary table will be DataFrame of MultiIndex. Next, specify the **columns** of the summary table you want to create using a subset of the original DataFrame that has not been selected as the index. Last but not least, don’t forget to specify the

! Let’s take a quick look!**aggfunc**

`(bike`

.pivot_table(values=['temp', 'atemp'],

index=['season'],

columns=['workingday'],

aggfunc=np.mean)

)

Roughly speaking, the method

can be viewed as grouping and aggregation specifically for time-series data, where**.resample()**

The index of the DataFrame or Series is a datetime-like object.

This allows you to group and aggregate data based on different time frequencies, such as hourly, daily, weekly, monthly, etc. More generally,

can take in **.resample()****DateOffset, Timedelta or str **as the rule to perform resampling. Let’s apply this to our previous example.

`def tweak_bike(bike: pd.DataFrame) -> pd.DataFrame:`

return (bike

.drop(columns=['instant'])

.assign(dteday=lambda df_: pd.to_datetime(df_.dteday))

.set_index('dteday')

)

bike = tweak_bike(bike)

(bike

.resample('M')

.temp

.mean()

)

In short, what we do above is drop the column

, overwrite the **instant**

column with the **dteday**

column being converted from **dteday**`object`

type to

type, and finally setting this **datetime64[ns]**

column as the index of the DataFrame.**datetime64[ns]**

`(bike`

.resample('M')

.temp

.mean()

)

Here, we obtain a descriptive statistics summary (mean) of the feature

with monthy frequency. Try and play with the **temp**

method using differency frequency such as **.resample()**

, **Q**

, **2M**

and so on,**A**

We are nearing the end! Let me show you why

is both powerful and useful. But before that, let’s get back to one of the example above where we want to find the mean temperature across different season and weather situation by using **.unstack()**

and **.groupby()****.agg()**

`(bike`

.groupby(['season', 'weathersit'])

.agg('mean')

.temp

)

Now, let’s visualise this using a line chart produced minimally by chaining the methods

and **.plot**

to the code above. Behind the scene, Pandas leverages on Matplotlib plotting backend to do the plotting task. This gives the following result, which none of us wanted since the x-axis of the plot is grouped by the MultiIndex, making it more difficult to interpret and less meaningful.**.line()**

Compared the plot above and below after we introduce the

method.**.unstack()**

`(bike`

.groupby(['season', 'weathersit'])

.agg('mean')

.temp

.unstack()

.plot

.line()

)

In short, what the method

does is to unstack the inner most index of the MultiIndex DataFrame, which in this case, is **.unstack()**

. This so-called un-stacked index becomes the columns of the new DataFrame, which allows our plotting of line plot to give more meaningful outcome for comparison purposes.**weathersit**

You can also unstack the outer-most index instead of the inner-most index of the DataFrame, by specifying the argument

as part of the **level=0**

method. Let’s see how we can achieve this.**.unstack()**

`(bike`

.groupby(['season', 'weathersit'])

.agg('mean')

.temp

.unstack(level=0)

.plot

.line()

)

From my observation, you almost never see common folks implement this method in their Pandas code when you search online. For one reason,

somehow has its own mysterious unexplainable aura that makes it not friendly to beginners and intermediates-alike. When you go to Pandas documentation², the short explanation you will find is **.pipe()**** “Apply chainable functions that expect Series or DataFrames”**. I think this explanation is a little confusing and not really helpful, provided if you have never work with chaining before.

In short, what

offers you is the ability to continue your method chaining technique using a function, in the event where you can’t manage to find a straightforward solution to perform an operation to return a DataFrame.**.pipe()**

The method

takes in a function, by that, you can define a method outside the chain and then refer to the method as an argument to the **.pipe()**

method.**.pipe()**

With

`.pipe()`

, you can pass a DataFrame or Series as the first argument to a custom function, and the function will be applied to the object being passed, followed by any additional arguments specified afterwards.

Most of the time, you will see a one-liner lambda function inside the

method for the purpose of convenience (i.e. get access to the most recent DataFrame after some modification steps in the chaining process).**.pipe()**

Let me illustrate using a simplified example. Let’s say we want to get insights on the following question: *“For the year 2012, what is the proportion of working day per season, relative to the total working day of that year?”*

`(bike`

.loc[bike.index.year == 2012]

.groupby(['season'])

.workingday

.agg(sum)

.pipe(lambda x: x.div(x.sum()))

)

Here, we use

to inject function into our chaining method. Since after performing **.pipe()**

, we cannot just continue chaining with **.agg(sum)**

, the following code will not work since we lost access to the latest state of the DataFrame after some modification through the chaining process.**.div()**

`#Doesn't work out well!`

(bike

.loc[bike.index.year == 2012]

.groupby(['season'])

.workingday

.agg(sum)

.div(...)

)

Tips: If you can’t find a way to continue chaining your methods, try think of how

`.pipe()`

can help! Most of the time, it will!