Rust Polars: Unlocking High-Performance Data Analysis — Part 2 | by Mahmoud Harmouch | May, 2023


Drop

Removing unnecessary information from a data frame or series is essential in data analysis. Luckily, Polars offers multiple ways to do so effectively. One such method is drop, which removes a column from a DataFrame by name.

To use this method, pass the name of the target column to drop. Note that drop returns a new DataFrame with the specified column removed and leaves the original DataFrame intact. This is worth knowing for beginners, who may expect the original dataset to change after calling such a function. For instance, consider a DataFrame of fruits and colors where the “Color” column is no longer needed in further analysis:

let df: DataFrame = df!("Fruit" => &["Apple", "Apple", "Pear"],
"Color" => &["Red", "Yellow", "Green"]).unwrap();

We can use the drop function to remove the column with the label “Color” from this DataFrame:

let df_remain = df.drop("Color").unwrap(); 
println!("{}", df_remain);

// Output:

// shape: (3, 1)
// ┌───────┐
// │ Fruit │
// │ --- │
// │ str │
// ╞═══════╡
// │ Apple │
// │ Apple │
// │ Pear │
// └───────┘

A new DataFrame object, df_remain, now holds identical data to the original except for the “Color” column. Upon inspection of the initial data frame, we can confirm its information remains unaltered.

println!("{}", df); // the original DataFrame

// Output:

// shape: (3, 2)
// ┌───────┬────────┐
// │ Fruit ┆ Color │
// │ --- ┆ --- │
// │ str ┆ str │
// ╞═══════╪════════╡
// │ Apple ┆ Red │
// │ Apple ┆ Yellow │
// │ Pear ┆ Green │
// └───────┴────────┘

If you wish to modify the original DataFrame directly, use the drop_in_place function instead of drop. It removes the column from the data frame itself rather than building a new DataFrame, and it returns the removed column wrapped in a Result, which is why the DataFrame must be declared mut.

let mut df: DataFrame = df!("Fruit" => &["Apple", "Apple", "Pear"],
"Color" => &["Red", "Yellow", "Green"]).unwrap();
df.drop_in_place("Color").unwrap(); // remove the "Color" column from df in place
println!("{:?}", df);

// Output:

// shape: (3, 1)
// ┌───────┐
// │ Fruit │
// │ --- │
// │ str │
// ╞═══════╡
// │ Apple │
// │ Apple │
// │ Pear │
// └───────┘
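The difference between the two calls comes down to borrowing: drop only needs a shared reference, while drop_in_place needs a mutable one. The sketch below is a toy model in plain Rust, not Polars code; the ToyFrame type, its methods, and fruit_frame are hypothetical names invented for illustration.

```rust
// Toy model only: a "frame" is a list of (column name, values) pairs.
// `ToyFrame` and its methods are hypothetical and are not part of Polars.
#[derive(Clone, Debug, PartialEq)]
struct ToyFrame {
    columns: Vec<(String, Vec<String>)>,
}

impl ToyFrame {
    /// Mirrors `drop`: borrows immutably and returns a new frame without `name`.
    fn drop_column(&self, name: &str) -> Result<ToyFrame, String> {
        if !self.columns.iter().any(|(n, _)| n.as_str() == name) {
            return Err(format!("column not found: {}", name));
        }
        let columns = self
            .columns
            .iter()
            .filter(|(n, _)| n.as_str() != name)
            .cloned()
            .collect();
        Ok(ToyFrame { columns })
    }

    /// Mirrors `drop_in_place`: borrows mutably, removes the column from
    /// `self`, and hands the removed values back to the caller.
    fn drop_column_in_place(&mut self, name: &str) -> Result<Vec<String>, String> {
        let idx = self
            .columns
            .iter()
            .position(|(n, _)| n.as_str() == name)
            .ok_or_else(|| format!("column not found: {}", name))?;
        Ok(self.columns.remove(idx).1)
    }

    /// Column names in their current order.
    fn names(&self) -> Vec<&str> {
        self.columns.iter().map(|(n, _)| n.as_str()).collect()
    }
}

/// Builds a toy version of the Fruit/Color data used in this section.
fn fruit_frame() -> ToyFrame {
    let to_strings = |v: &[&str]| v.iter().map(|s| s.to_string()).collect::<Vec<_>>();
    ToyFrame {
        columns: vec![
            ("Fruit".to_string(), to_strings(&["Apple", "Apple", "Pear"])),
            ("Color".to_string(), to_strings(&["Red", "Yellow", "Green"])),
        ],
    }
}
```

Calling fruit_frame().drop_column("Color") leaves the source frame with both columns, whereas drop_column_in_place on a mut binding shrinks the frame itself and returns the removed values.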

You can also remove several columns at once by passing a slice of their names to the drop_many function. Unlike drop, it returns the new DataFrame directly rather than a Result:

let df_dropped_col = df.drop_many(&["Color", ""]);
println!("{:?}", df_dropped_col);

// Output:

// shape: (3, 1)
// ┌───────┐
// │ Fruit │
// │ --- │
// │ str │
// ╞═══════╡
// │ Apple │
// │ Apple │
// │ Pear │
// └───────┘

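Finally, we can use the drop_nulls function to remove any rows that contain null (missing) values. It takes an optional subset of column names; passing None checks every column, and the ::<String> annotation only tells the compiler which string type that absent subset would have used: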

let df: DataFrame = df!("Fruit" => &["Apple", "Apple", "Pear"],
"Color" => &[Some("Red"), None, None]).unwrap();
let df_clean = df.drop_nulls::<String>(None).unwrap();
println!("{:?}", df_clean);

// Output:

// shape: (1, 2)
// ┌───────┬───────┐
// │ Fruit ┆ Color │
// │ --- ┆ --- │
// │ str ┆ str │
// ╞═══════╪═══════╡
// │ Apple ┆ Red │
// └───────┴───────┘
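To make the row-dropping rule concrete, here is a toy model in plain Rust of what the optional subset argument means. It is a sketch of the semantics only, not Polars' implementation; drop_null_rows and fruit_rows are hypothetical helpers, and the subset is given as column positions rather than names to keep the model small.

```rust
// Toy model only: each row is a list of optional cells, one per column.
// `drop_null_rows` is a hypothetical helper, not a Polars function.

/// Keeps the rows that have no `None` in the inspected columns.
/// `subset = None` inspects every column; `Some(indices)` inspects only
/// the columns at those positions (an out-of-range index panics).
fn drop_null_rows<T: Clone>(
    rows: &[Vec<Option<T>>],
    subset: Option<&[usize]>,
) -> Vec<Vec<Option<T>>> {
    rows.iter()
        .filter(|row| match subset {
            None => row.iter().all(|cell| cell.is_some()),
            Some(cols) => cols.iter().all(|&i| row[i].is_some()),
        })
        .cloned()
        .collect()
}

/// The Fruit/Color rows from the example above: column 0 is Fruit, column 1 is Color.
fn fruit_rows() -> Vec<Vec<Option<&'static str>>> {
    vec![
        vec![Some("Apple"), Some("Red")],
        vec![Some("Apple"), None],
        vec![Some("Pear"), None],
    ]
}
```

With subset set to None only the Apple/Red row survives, matching the output above; restricting the check to the Fruit column alone keeps all three rows, because that column has no missing values.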

By using the is_not_null method, we can create a non-null mask for any column in our DataFrame. The method returns a boolean mask with one entry per row: true where the value is present and false where it is null. Passing this mask to filter keeps only the rows marked true, which removes the rows with missing data from the resulting DataFrame. For example, to create a non-null mask for the “Salary” column of a DataFrame, we can use the following code:

let df = df!("Name" => &["Mahmoud", "Ali", "ThePrimeagen"],
"Age" => &[22, 25, 36],
"Gender" => &["M", "M", "M"],
"Salary" => &[Some(50000), Some(60000), None]).unwrap();
let mask = df.column("Salary").expect("Salary must exist!").is_not_null();
println!("{:?}", mask.head(None));

// Output:

// shape: (3,)
// ChunkedArray: 'Salary' [bool]
// [
// true
// true
// false
// ]

The code snippet creates a non-null mask for the “Salary” column of the df DataFrame and prints its first values with head. The mask can then be passed to filter to keep only the rows that have a non-null entry in the “Salary” column.

let filtered_data = df.filter(&mask).unwrap();
println!("{:?}", filtered_data);

// Output:

// shape: (2, 4)
// ┌─────────┬─────┬────────┬────────┐
// │ Name ┆ Age ┆ Gender ┆ Salary │
// │ --- ┆ --- ┆ --- ┆ --- │
// │ str ┆ i32 ┆ str ┆ i32 │
// ╞═════════╪═════╪════════╪════════╡
// │ Mahmoud ┆ 22 ┆ M ┆ 50000 │
// │ Ali ┆ 25 ┆ M ┆ 60000 │
// └─────────┴─────┴────────┴────────┘

Using a null mask gives finer control over the filtering process. It is useful when we want to filter on several conditions or on combinations of null values across different columns. However, it requires more code than simply calling drop_nulls, and it may be less efficient on large datasets.
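As an illustration of combining null conditions across columns, the following plain-Rust sketch models masks as vectors of booleans. It is a toy model under stated assumptions, not Polars code: is_not_null here is a free function written for the sketch (not the Polars method), and and_masks and filter_by_mask are hypothetical helpers.

```rust
// Toy model only: a column is a slice of optional values, a mask is a Vec<bool>.
// All helper names here are hypothetical; none of them are Polars functions.

/// One flag per value: true where the value is present, false where it is null.
fn is_not_null<T>(column: &[Option<T>]) -> Vec<bool> {
    column.iter().map(|value| value.is_some()).collect()
}

/// Element-wise AND of two masks of equal length.
fn and_masks(a: &[bool], b: &[bool]) -> Vec<bool> {
    assert_eq!(a.len(), b.len(), "masks must have the same length");
    a.iter().zip(b).map(|(x, y)| *x && *y).collect()
}

/// Keeps the values whose mask entry is true.
fn filter_by_mask<T: Clone>(values: &[T], mask: &[bool]) -> Vec<T> {
    assert_eq!(values.len(), mask.len(), "mask must match the column length");
    values
        .iter()
        .zip(mask)
        .filter(|(_, keep)| **keep)
        .map(|(value, _)| value.clone())
        .collect()
}
```

For instance, AND-ing the non-null mask of the “Salary” column with the mask of a second nullable column keeps only the rows where both values are present, which a single-column mask cannot express.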

To summarize this section, dropping is a common and handy operation in Polars: drop, drop_in_place, and drop_many remove columns from a data frame, while drop_nulls and a null mask combined with filter remove rows that contain missing values.


