The Underrated Gems Pt.1: 8 Pandas Methods That Will Make You a Pro | by Andreas Lukita | Jul, 2023


Forget about using operations such as df["new_col"] = and df.new_col = to create new columns. Here is why you should be using the .assign() method — it returns you a DataFrame object, which allows you to continue your chaining operation to further manipulate your DataFrame. Unlike the .assign() method, the two infamous operation above return you a None which means you cannot possibly chain your operation further.

If you are not convinced, then let me bring back the old nemesis — SettingWithCopyWarning. Pretty sure each of us has bumped into this one at some point in time.

Image by Author

Enough of the warning, I want to unsee ugly red boxes in my notebook!

Using .assign(), let us add a few new columns such as ratio_casual_registered, avg_temp, and ratio_squared

(bike
.assign(ratio_casual_registered = bike.casual.div(bike.registered),
avg_temp = bike.temp.add(bike.atemp).div(2),
ratio_squared = lambda df_: df_.ratio_casual_registered.pow(2))
)

In short, here’s what the method above does:

  1. We can create as many new columns as we want using the .assign() method, separated by the delimiter comma.
  2. The lambda function when creating the column ratio_squared serves to get access to the most recent DataFrame after we added the column ratio_casual_registered. Say, we do not use a lambda function to get access to the most recent DataFrame df_, but instead continue with bike.ratio_casual_registered.pow(2), we would get an error as the original DataFrame does not have the column ratio_casual_registered, even after adding it in the .assign() method before creating ratio_squared. If you can’t wrap your head around this concept to decide whether or not to use lambda function, my suggestion is just use one!
  3. Bonus! I leave some not-so-common way to perform arithmetic operations using methods.

Well, the .groupby() method is not uncommonly used, but they are necessary to get us started before we delve deeper into the next methods. One thing that often goes unnoticed and left unspoken of is that the the .groupby() method has a lazy nature. By that, it means that the method is lazily evaluated. In other words, it does not evaluate right away, that is why you often see <pandas.core.groupby.generic.DataFrameGroupBy object at 0x14fdc3610> right after calling the method .groupby()

From Pandas DataFrame documentation², the value to feed in the parameter by could be a mapping, function, label, pd.Grouper or list of such. Nonetheless, the most common one you probably encounter is to group by columns names (list of Series name separated by comma). After the .groupby() operation, we could perform operation such as .mean(), .median(), or applying custom function using .apply().

The value of the specified columns that we feed into the by parameters in the .groupby() method would become the index of the result. If we specify grouping more than 1 column, then we will obtain a MultiIndex.

(bike
.groupby(['season', 'weathersit'])
.mean(numeric_only=True) #alternative version: apply(lambda df_: df_.mean(numeric_only=True))
.atemp
)

Here, we grouped our DataFrame by the column season, and weathersit. Then, we calculate the mean value and subset only the column atemp.

Image by Author

If you are meticulous enough to dig the Pandas documentation², you might encounter both methods .agg() and .aggregate(). You might be wondering what is the difference and when to use which? Save your time! They are the same, .agg() is merely an alias for .aggregate().

.agg() has a parameter func, which literally takes in a function, string function name, or list of functions. By the way, you can aggregate different functions over the columns as well! Let’s continue our example above!

#Example 1: Aggregating using more than 1 function
(bike
.groupby(['season'])
.agg(['mean', 'median'])
.atemp
)

#Example 2: Aggregating using different function for different columns
(bike
.groupby(['season'])
.agg(Meann=('temp', 'mean'), Mediann=('atemp', np.median))
)

Image by Author

With .agg(), the result we obtain is of reduced dimensionality as compared to the initial dataset. In simple terms, your data dimension shrinks with lesser number of rows and columns, containing the aggregate information. If what you want is to summarize the grouped data and obtain aggregated values, then .groupby() is the solution.

With .transform(), we also start with the intention of doing aggregation of information. However, instead of creating a summary of information, we want the output to have the same shape as the original DataFrame, without shrinking the size of the original DataFrame.

Those of you who have exposure to database systems like SQL may find the idea behind .transform() similar to that of Window Function. Let’s see how .transform() works on the above example!

(bike
.assign(mean_atemp_season = lambda df_: df_
.groupby(['season'])
.atemp
.transform(np.mean, numeric_only=True))
)
Image by Author

As seen above, we created a new column with column name—mean_atemp_season where we fill in the column with the aggregate (mean) of the atemp column. Thus, whenever season is 1, then we have the same value for mean_atemp_season. Notice the important observation here is that we retain the original dimension of the dataset plus one additional column!

Here’s a bonus for those obsessed with Microsoft Excel. You might be tempted to use .pivot_table() to create summary table. Well of course, this method works too! But here’s a two cent, .groupby() is more versatile and used for a broader range of operations beyond just reshaping, such as filtering, transformation, or applying group-specific calculations.

Here’s how to use .pivot_table() in short. You specify the column(s) you want to aggregate in the argument values. Next, specify the index of the summary table you want to create using a subset of the original DataFrame. This can be more than one column and the summary table will be DataFrame of MultiIndex. Next, specify the columns of the summary table you want to create using a subset of the original DataFrame that has not been selected as the index. Last but not least, don’t forget to specify the aggfunc! Let’s take a quick look!

(bike
.pivot_table(values=['temp', 'atemp'],
index=['season'],
columns=['workingday'],
aggfunc=np.mean)
)
Image by Author

Roughly speaking, the method .resample() can be viewed as grouping and aggregation specifically for time-series data, where

The index of the DataFrame or Series is a datetime-like object.

This allows you to group and aggregate data based on different time frequencies, such as hourly, daily, weekly, monthly, etc. More generally, .resample() can take in DateOffset, Timedelta or str as the rule to perform resampling. Let’s apply this to our previous example.

def tweak_bike(bike: pd.DataFrame) -> pd.DataFrame:
return (bike
.drop(columns=['instant'])
.assign(dteday=lambda df_: pd.to_datetime(df_.dteday))
.set_index('dteday')
)
bike = tweak_bike(bike)
(bike
.resample('M')
.temp
.mean()
)

In short, what we do above is drop the column instant, overwrite the dteday column with the dteday column being converted from object type to datetime64[ns] type, and finally setting this datetime64[ns] column as the index of the DataFrame.

Image by Author
(bike
.resample('M')
.temp
.mean()
)
Image by Author

Here, we obtain a descriptive statistics summary (mean) of the feature temp with monthy frequency. Try and play with the .resample() method using differency frequency such as Q, 2M, A and so on,

We are nearing the end! Let me show you why .unstack() is both powerful and useful. But before that, let’s get back to one of the example above where we want to find the mean temperature across different season and weather situation by using .groupby() and .agg()

(bike
.groupby(['season', 'weathersit'])
.agg('mean')
.temp
)
Image by Author

Now, let’s visualise this using a line chart produced minimally by chaining the methods .plot and .line() to the code above. Behind the scene, Pandas leverages on Matplotlib plotting backend to do the plotting task. This gives the following result, which none of us wanted since the x-axis of the plot is grouped by the MultiIndex, making it more difficult to interpret and less meaningful.

Image by Author

Compared the plot above and below after we introduce the .unstack() method.

(bike
.groupby(['season', 'weathersit'])
.agg('mean')
.temp
.unstack()
.plot
.line()
)
Image by Author

In short, what the method .unstack() does is to unstack the inner most index of the MultiIndex DataFrame, which in this case, is weathersit. This so-called un-stacked index becomes the columns of the new DataFrame, which allows our plotting of line plot to give more meaningful outcome for comparison purposes.

Image by Author

You can also unstack the outer-most index instead of the inner-most index of the DataFrame, by specifying the argument level=0 as part of the .unstack() method. Let’s see how we can achieve this.

(bike
.groupby(['season', 'weathersit'])
.agg('mean')
.temp
.unstack(level=0)
.plot
.line()
)
Image by Author
Image by Author

From my observation, you almost never see common folks implement this method in their Pandas code when you search online. For one reason, .pipe() somehow has its own mysterious unexplainable aura that makes it not friendly to beginners and intermediates-alike. When you go to Pandas documentation², the short explanation you will find is “Apply chainable functions that expect Series or DataFrames”. I think this explanation is a little confusing and not really helpful, provided if you have never work with chaining before.

In short, what .pipe() offers you is the ability to continue your method chaining technique using a function, in the event where you can’t manage to find a straightforward solution to perform an operation to return a DataFrame.

The method .pipe() takes in a function, by that, you can define a method outside the chain and then refer to the method as an argument to the .pipe() method.

With .pipe(), you can pass a DataFrame or Series as the first argument to a custom function, and the function will be applied to the object being passed, followed by any additional arguments specified afterwards.

Most of the time, you will see a one-liner lambda function inside the .pipe() method for the purpose of convenience (i.e. get access to the most recent DataFrame after some modification steps in the chaining process).

Let me illustrate using a simplified example. Let’s say we want to get insights on the following question: “For the year 2012, what is the proportion of working day per season, relative to the total working day of that year?”

(bike
.loc[bike.index.year == 2012]
.groupby(['season'])
.workingday
.agg(sum)
.pipe(lambda x: x.div(x.sum()))
)

Here, we use .pipe() to inject function into our chaining method. Since after performing .agg(sum), we cannot just continue chaining with .div(), the following code will not work since we lost access to the latest state of the DataFrame after some modification through the chaining process.

#Doesn't work out well!
(bike
.loc[bike.index.year == 2012]
.groupby(['season'])
.workingday
.agg(sum)
.div(...)
)

Tips: If you can’t find a way to continue chaining your methods, try think of how .pipe() can help! Most of the time, it will!



Source link

Leave a Comment