How To Use the loc Pandas Method to Efficiently To Work With Your DataFrame | by Byron Dolon | Jun, 2023


Basically, the loc method in Pandas allows you to select a subset of rows or columns of the target DataFrame based on a given condition.

There are a few different inputs you can pass to loc. For example, when you want to select a slice of the DataFrame based on its index you can use the same syntax in Python when you’re working with a list like: [start:stop]. However, in this piece, we’ll primarily focus on using loc with a conditional statement. If you’ve used SQL before, this is similar to writing the WHERE part of a query to filter your data.

In general, using loc in this fashion will look like this:

df.loc[df["column"] == "condition"]

This will return a subset of your data in which the column is equal to the condition.

Next, let’s move on to some practical examples of using the loc method during exploratory data analysis to see what else you can do with it.

Which universities only offer in-person attendance?

First, let’s see how we can use loc to select part of your data to use in further analysis.

If the data was already clean, you would think that to answer the question, you can just use a groupby on the column to count the number of institutions that offer in-person attendance. Doing this in Pandas would look like this:

df.groupby("Distance / In-Person")["Institution"].count()

Unfortunately, the values for the “Distance / In-Person” column aren’t very clean. There are some issues with the whitespace and some institutions offer both distance and in-person attendance, although the way that’s recorded isn’t standardized.

The first thing we can do to clean this up also is to rename the column so it doesn’t have any spaces or special characters.

df = df.rename(columns={"Distance / In-Person": "distance_or_in_person"})

Next, we can verify that the change took place by selecting all the columns in the DataFrame.

df.columns

Now, all columns at least don’t have any spaces or special characters. You could further standardize this if you wanted to by changing all the other columns to lowercase, but we’ll skip that for now.

Earlier, we did a group by operation on the target column and counted the values for each institution. Another way to arrive at the same result is to use the value_counts method in Pandas. This returns a Series with the count of unique values of the target column you call it on.

df["distance_or_in_person"].value_counts()

You’ll notice in this case we didn’t have to call the “Institutions” column this time, but that’s because, in our original DataFrame, each row represents one institution.

Now, to clean up this column so that the values for institutions that offer both in-person and distance attendance are grouped into one value, we can make use of the loc column to filter the DataFrame on those values and assign the value of the distance_or_in_person column to a new value “Both”.

df.loc[
~df["distance_or_in_person"].isin(["In-Person", "Distance"]),
"distance_or_in_person"
] = "Both"

Here, we filter the current distance_or_in_person column that is not equal to “In-Person” or “Distance” using the ~ operator and then select the distance_or_in_person column. We then set it equal to “Both”, which updates the original DataFrame. We can verify the changes by checking the DataFrame again:

df.head()

Now, you’ll see the updated column will only contain three possible values, and we can again call value_counts to get the answer to our original question:

df["distance_or_in_person"].value_counts()

We now know that based on the cleaned data, 59 universities offer only in-person attendance.

With this new condition, if you wanted to know which specific institutions offered in-person attendance, we can again filter the DataFrame using the loc method and then use the tolist method to get all the values to a Python list:

df.loc[df["distance_or_in_person"] == "In-Person"]["Institution"].tolist()

We have a list of institutions now, but there are some special characters that we can remove. The “xa0” in Python represents a non-breaking space, which means we can get rid of it using the strip method in Pandas which gets rid of whitespace on either end of the string value.

We can edit our initial tolist code to clean up the final output like this:

df.loc[df["distance_or_in_person"] == "In-Person"]["Institution"].str.strip().tolist()

Now, we have a final list of universities that only offer in-person attendance!

What is the year range between the oldest and newest founded universities?

Next, let’s use the loc and a few other native Pandas methods to filter our DataFrame to answer a specific data analysis question.

We can first just take a look at that Founded column to see what we’re working with:

df["Founded"]

It looks like we have a column that’s just full of year values. Since we want to compare dates to each other, we could turn the column into a datetime type to make the analysis easier.

pd.to_datetime(df["Founded"])

However, using the to_datetime method on the column gives us a ParserError.

It looks like there’s a string that doesn’t match what we initially saw from the Founded column. We can check the row by using the loc method to filter the DataFrame on the value for founded that specifically equals to what we saw in the ParserError:

df.loc[df["Founded"] == "1948 and 2014"]

There’s one university apparently that has two different founded years. Also, now that we know the index of row (9), there’s also an example of using the loc method to filter the DataFrame on the index value specifically:

df.loc[9]

It looks like this is the only row in the DataFrame where the value for the “Founded” column has more than one year.

Depending on what you want to do with the data, you could try to clean the data by maybe picking one year (the first founded date) or maybe by creating two rows for this one institution so both founded dates or on separate rows.

In this case, since we only are working with this data to answer a simple question (what’s the range of the Founded date for the institutions in this data set), we can just remove this one row like this:

df.drop(9).head(10)  # removing the row

Checking the resulting DataFrame you can see the row with the index “9” that had multiple values for the Founded column is no longer in the table. You can get the drop to stick by reassigning the DataFrame after dropping the row:

df = df.drop(9)

Next, we can do the to_datetime method again on the Founded column and see what happens.

pd.to_datetime(df["Founded"], errors="coerce")

There’s actually another error that pops up here which is why I included errors="coerce" to ensure that if there were any other issues converting the string to a datetime type the value would just become null.

Finally, we can assign the datetime type version of the Founded column to a new column. Then, to check the earliest founded date of an institution, we can use the min method in Python:

df["founded_date"] = pd.to_datetime(df["Founded"], errors="coerce")
min(df["founded_date"])

You can see the earliest and latest timestamps of the founded_date column by using the min and max methods to get the year range between the oldest and newest universities.

It was here that I realized we can do this all much quicker if all we need to do is answer that one quick question. Instead of converting to a datetime type, we could just turn the column into an integer type and then subtract the max and min values from each other to get the range.

df["Founded"] = df["Founded"].astype("int")
max(df["Founded"]) - min(df["Founded"])

This outputs 719.

You shouldn’t always take the easy way and just convert the year column to an integer. In the case when you want to do some more complicated analysis or when you’re specifically working with time series date, there’s a lot of value-added if you clean your data properly and get the date column to a datetime type. However, if you do just need to do analysis quickly, it can save you the time and headache of looking up errors to simply find the quickest way of solving a problem instead of the “best” way to solve it.



Source link

Leave a Comment