Data Aggregation in Python with Pandas: Analysing Geological Lithology Data | by Andy McDonald | Jun, 2023


Exploring lithology variations within the Zechstein Group of the Norwegian Continental Shelf

Image generated by the author using Midjourney (Paid subscription plan).

Using data aggregation techniques can help us transform an overwhelming and almost incomprehensible numeric dataset into something that is easily digestible and much more reader-friendly. The process of data aggregation involves summarising multiple data points into single metrics that can be used to provide a high-level overview of the data.

One way we can apply this process within petrophysics and geoscience is to summarise the lithological composition of geological formations that have been interpreted from well log measurements.

Within this short tutorial, we will see how we can take a large dataset consisting of 90 plus wells from the Norwegian Continental Shelf, and extract the lithology composition of the Zechstein Group.

To begin, we first need to import the pandas library, which will be used to load our datafile from CSV and carry out the aggregations.

import pandas as pd

Once the pandas library has been imported, we can then read the CSV file using pd.read_csv().

The data we are going to be using is from the combined XEEK and Force 2020 Machine Learning competition which aimed to predict lithology from well log measurements. The dataset we are using represents all of the available training data. Further details of this dataset can be found at the end of the article.

As this CSV file contains data that is separated by a semi-colon rather than by a comma, we need to pass in a colon to the sep parameter.

df = pd.read_csv('data/train.csv', sep=';')

We can then run this code to begin the loading process. As we have a large dataset (11 million + rows), this may take several seconds. But, once it has finished loading, we can view our dataframe by calling upon the df object. This will return our dataframe and show the first five and last five rows from it.

Dataframe containing well log data from the Norwegian Continental Shelf. Image by the author.

Within this dataset, the lithology data is stored within the FORCE_2020_LITHOFACIES_LITHOLOGY column. However, when we look closely at our data, we will see that the lithology values are encoded numerically. Unless you know the key, it will be hard to decipher what number represents what lithology.

Luckily for this dataset, we have the key and can create a dictionary with the key and lithology pairs.

lithology_numbers = {30000: 'Sandstone',
65030: 'Sandstone/Shale',
65000: 'Shale',
80000: 'Marl',
74000: 'Dolomite',
70000: 'Limestone',
70032: 'Chalk',
88000: 'Halite',
86000: 'Anhydrite',
99000: 'Tuff',
90000: 'Coal',
93000: 'Basement'}

To apply this to our dataset, we can use the pandas map() function, which will perform a lookup using our dictionary and then assign the correct lithology label to the numeric value.

df['LITH'] = df['FORCE_2020_LITHOFACIES_LITHOLOGY'].map(lithology_numbers)

Once this has run, we can view the dataframe again to make sure the mapping has been successful and a new LITH column has been added to the end of the dataframe.

Dataframe containing well log data from the Norwegian Continental Shelf. Image by the author.

As we have a rather large dataset with 11,705,511 rows, it would be good to focus on a particular geological group for our lithology compositional analysis.

In this case, we will subset the data and look at the Zechstein Group.

We can do this by using the query() method and passing in a simple string: GROUP == "ZECHSTEIN GP."

df_zechstein = df.query('GROUP == "ZECHSTEIN GP."')
df_zechstein.WELL.unique()

We can check how many wells we have in the subset by calling upon df_zechstein.WELL.unique(), which returns the following array containing 8 wells.

array(['15/9-13', '16/1-2', '16/10-1', '16/11-1 ST3', '16/2-16', '16/2-6',
'16/4-1', '17/11-1'], dtype=object)

As we are only interested in the lithologies, we can simply extract the Well name and the lithology columns. This will also make it easier to carry out the aggregation.

df_zechstein_liths = df_zechstein[['WELL', 'LITH']]
Well names and associated lithologies. Image by the author.

Now that we have the data in a format that can be worked with we can begin the aggregation process. For this, we are going to chain multiple pandas methods together in a single.

First, we will group the data by the WELL column using the groupby function. This essentially makes subsets of the dataframe for each unique well name within the WELL column

Next, we will count the occurrences of each lithology type within each group. The normalize=True part means it will give a proportion (between 0 and 1) rather than absolute counts. For example, if in one well (group), ‘Sandstone’ occurs 5 times and ‘Shale’ occurs 15 times, the function will return 0.25 for ‘Sandstone’ and 0.75 for ‘Shale’ instead of 5 and 15.

Finally, we need to rearrange our resulting dataframe so that the row index contains the well names and the columns contain the lithology names. If a well did not have any instances of a certain lithology then they are filled with zero, due to fill_value=0.

summary_df = df_zechstein_liths.groupby('WELL').value_counts(normalize=True).unstack(fill_value=0)
summary_df

What we get back is the following dataframe with the decimal proportions of each lithology within each of the wells.

Lithology composition of the Zechstein Group in decimal form. Image by the author.

If we want to view these as percentages, we can change how they are displayed using the following code:

summary_df.style.format('{:.2%}')

When we run the code, we get back the following dataframe which provides a more readable table and can be incorporated into a report.

Lithology composition of the Zechstein Group after aggregation and changing the numbers to display as percentages. Image by the author.

Applying this styling does not change the actual values. They will still be stored as their decimal equivalents.

If we do want to change the values permanently to percentages, we can do so by multiplying the dataframe by 100.

summary_df = summary_df * 100
summary_df
Lithology composition of the Zechstein Group after aggregation and converting to percentages. Image by the author.

Once the data is in this format, we can use it to create something similar to the infographic below, which shows the percentages for the lithologies for each of the wells.

Infographic created using matplotlib showing the percentage variation for each lithology within each well. Created by Andy McDonald
An infographic showing the percentage variation for each lithology within each well. Created with matplotlib by the author.

Within this short tutorial, we have seen how we can take a large collection of well log data (from 90 plus wells) and extract and summarise a particular geological group. This allows us to understand the lithological composition of the geological group in an easy-to-read and understandable format that can be incorporated into a report or presentation.



Source link

Leave a Comment