Can data science find Bigfoot?


An analytical approach to tracking down one of the world’s most elusive cryptids

Photo by Jon Sailer on Unsplash

Watching Bigfoot researchers chasing — sometimes quite literally — circumstantial evidence, noises, and half-glimpsed shadows through the forest at night certainly makes for some great TV.

I love killing time with a bit of TV, and a show called “Expedition Bigfoot” has kept me entertained on many a lunch break. However, this show isn’t the run-of-the-mill cryptid documentary, as it claims to use a “data algorithm” to first identify a target search zone before sending in human researchers to track down the hairy subject.

Obviously my ears prick up when I hear something like this, and short of packing an overnight bag, I start to wonder what I can do to help the search. So here we go, “helping”!

While this is going to be quite a lighthearted article, I’m going to take a look at some data, covering:

  1. Exploratory data analysis (EDA) with Pandas Profiling
  2. Exploratory data analysis the old-fashioned way
  3. Clustering with OPTICS

The data comes from Timothy Renner¹, who in turn sourced the observations from the Bigfoot Field Researchers Organization².

A quick peek at the data:

Image by author

There are loads of interesting fields, for example:

  • Location features such as state, county, and latitude and longitude of the sighting.
  • Weather features containing not only temperature, cloud cover, and precipitation but temperature ranges, precipitation details, and wind speed.
  • Esoteric information like phase of the lunar cycle.
  • A full text description of the events leading up to the sighting, and details of the sighting itself.

Pandas-Profiling is a great tool to explore data with. With just a few lines of code you can get a really informative interactive report which summarises important features of your data. Let’s see it in action.

import os
from pandas_profiling import ProfileReport

# create the report from the sightings dataframe (df) and
# save it as an HTML file in the output directory (folder)
profile = ProfileReport(df)
profile.to_file(os.path.join(folder, 'bfro_reports_eda.html'))

We can take a look at the classification feature in more depth:

Image by author

Quite a detailed breakdown of what will turn out to be an important feature — more on that later.

We can also visualise interactions — here’s a visual showing the relationship between wind speed and air pressure:

Image by author

Pandas-Profiling also includes functionality to help us understand our data through various measures of correlation:

Image by author

Of course, no EDA would be complete without a summary of missingness. Here we have a few great exhibits — first up, a chart of missingness by column.

Image by author

… as well as correlation between missing values:

Image by author

So much great info, with such little effort.

Sometimes we need to get our hands dirty to understand our data a bit more. Let’s do that, and discuss as we go along.

A changing trend

I’m interested to see how Bigfoot sightings trend over time. Before we do, let’s have a quick look at the types of reports contained in the database.

Summarising the BFRO website³, Class A reports involve clear sightings in circumstances where misinterpretation or misidentification of other animals can be ruled out with greater confidence.

Sightings where a possible Bigfoot was observed in any circumstance which did not afford a clear view of the subject are considered to be Class B sightings. These include sightings at a large distance and sightings in poor lighting conditions.

Class C reports are those with a high potential for inaccuracy. Most indirect reports, or stories with an untraceable source, fall into this class.

Importantly, Class B reports are not considered less credible or less important than Class A reports — both types are deemed credible enough by the BFRO to show to the public. However, Class C reports are kept in BFRO archives but are very rarely listed publicly; there are exceptions for published or locally documented incidents from before 1958 and sightings mentioned in non-tabloid newspapers or magazines.

Let’s take a look at annual sightings over time, and separate out Class A and Class B sightings.
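
For reference, here’s a minimal sketch of how a chart like this can be put together, assuming the classification column holds values like 'Class A' and 'Class B' and that a year can be derived from the date column:

import pandas as pd
import matplotlib.pyplot as plt

# keep only Class A and Class B reports and derive a year for each sighting
plot_df = df.loc[df['classification'].isin(['Class A', 'Class B'])].copy()
plot_df['year'] = pd.to_datetime(plot_df['date'], errors = 'coerce').dt.year

# count sightings per year and class, then plot as a stacked bar chart
annual = plot_df.groupby(['year', 'classification']).size().unstack(fill_value = 0)
annual.plot(kind = 'bar', stacked = True, figsize = (20, 7.5))
plt.ylabel('Number of reported sightings')
plt.show()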

Image by author

A strong downward trend over time, with the mix of Class A and Class B sightings remaining relatively consistent. The drivers of the general downward trend could be difficult to identify:

  1. Changing societal norms may stop witnesses reporting potential sightings — in this day and age, how many people believe Bigfoot is out there?
  2. In a similar vein, how many people know about the Bigfoot Field Researchers Organization and its database? Of those people, how many would spend time completing a submission? I’d wager not too many, and a number which would perhaps decrease over time.
  3. There may be pragmatic reasons for seeing fewer and fewer reports in the database. A big driver of this is the ease with which a report can be made: in a digital world optimised for ease, how long would you spend filling in a detailed form with accurate information?

Back to the data. It’s worth remembering that we’re talking about absolute numbers here. Ideally we’d factor in some sort of measure of exposure to give us better insight: a feature which gives an indication of the likelihood of sighting a Bigfoot. For example, expressing the number of annual sightings as a fraction of the number of annual excursions into forested areas would be a more interesting measure to investigate.

As it stands, the absolute number of sightings is a direct function of how often people venture out and could be unreliable: seeing more Bigfoot out and about may be due to more people being out and about rather than any changes in Bigfoot population or people’s ability to spot the cryptid.

Seasonal mix

In time series data like this, we can expect some variation at regular intervals; this is known as seasonality⁴.

Any seasonal behaviour of Bigfoot will naturally create a seasonal impact in the data. For instance, winter hibernation would cause fewer winter observations. Seasonal effects like snow storms, rain, and the number of daylight hours will affect our ability to actually see Bigfoot. The biggest driver of seasonality of observation is likely the impact the seasons have on potential observers. Sightings will probably be correlated with summer camping trips, seasonal hunting periods, and romantic autumnal walks in the woods.

Now we probably don’t have enough data to visualise monthly seasonality over time without noise obscuring any effect, so we’ll look at the overall seasonal distribution of sightings.
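
A minimal sketch of that summary, assuming each report carries a single season label:

import matplotlib.pyplot as plt

# count sightings by season and plot the overall distribution
df['season'].value_counts().plot(kind = 'bar', figsize = (10, 5))
plt.ylabel('Number of reported sightings')
plt.show()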

Image by author

We see that most sightings occur in summer. We should expect this, as good weather and loads of daylight allow people to venture further and wider into the great outdoors. As a fan of a warm blanket and a hot cup of tea, it comes as no surprise to me that the fewest sightings occur in the coldest and darkest months. Perhaps Bigfoot is of the same opinion when it comes to tea and blankets.

We see this seasonal trend is consistent over time:
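
Roughly speaking, a view like this can be produced with a crosstab of year against season (again assuming the date and season columns used above):

import pandas as pd
import matplotlib.pyplot as plt

# share of each year's sightings falling in each season
years = pd.to_datetime(df['date'], errors = 'coerce').dt.year
season_mix = pd.crosstab(years, df['season'], normalize = 'index')
season_mix.plot(kind = 'area', stacked = True, figsize = (20, 7.5))
plt.ylabel('Share of sightings')
plt.show()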

Image by author

Unfortunately, we don’t have much data around the time of the observation so can’t dig into that. I would imagine that things start to get a bit wild after the sun sets — that’s what happens on TV anyway.

So far our EDA has helped us understand that the number of Bigfoot sightings reported to the BFRO has declined over time, and that the vast majority of sightings occur in summer and autumn. These insights are largely related to the time element of each report — let’s move on to looking at location.

Photo by Tabea Schimpf on Unsplash

Let’s use the latitude and longitude co-ordinate pairs to understand where the Bigfoot action is. Folium⁵ is an excellent Python package for creating map visualisations.

Interactive visualisation with Folium

First up, a heat map, where areas of high activity are highlighted in reds and yellows:

import folium
from folium.plugins import HeatMap, MiniMap

# data for plotting: keep only rows with valid coordinates
filter_map = (df['latitude'].notna()) & (df['longitude'].notna())
df_map = df.loc[filter_map, ['latitude', 'longitude', 'observed']].copy()

# create the map, centred roughly on the continental US
heat_map = folium.Map(
    location = [42, -97.37],
    tiles = 'OpenStreetMap',
    zoom_start = 4,
    control_scale = True
)

# add the heat to the map — pass latitude and longitude only,
# as the free-text 'observed' column isn't a valid numeric weight
HeatMap(
    data = df_map[['latitude', 'longitude']],
    min_opacity = 0.1
).add_to(heat_map)

# add mini map
heat_map.add_child(MiniMap(position = 'bottomleft', toggle_display = True))

Image by author

Like all good science fiction films, the action seems to be limited to the US (the lower 48 states mainly).

Washington state and Oregon look to be absolute hotbeds of activity, with Ohio and Florida representing the most active eastern areas. There seems to be very little activity in the middle of the country.

My geography isn’t great, but I do know that Washington and parts of Oregon are highly forested, and large areas of Florida are wild (home to both the Everglades⁶ and Florida Man⁷).

Now if I were to guess, I’d say that Bigfoot would prefer forested or swampy habitats, so Washington and Florida seem like reasonable observations. So far, so sensible. Let’s move on to something more scientific — clustering data to identify Bigfoot hot spots.

Clustering with OPTICS

Clustering is an exercise which aims to divide a dataset into subsets, with observations in each subset being alike in some way.

In this case, we’re going to use latitude and longitude to identify clusters — or hot spots — of Bigfoot activity. We’re going to choose our clustering algorithm based on a variety of desired characteristics:

  • The algorithm needs to be able to reflect physical distances implied by co-ordinate pairs — i.e. it must be able to use the haversine metric.
  • We’ll need the algorithm to filter out “noise”: we expect that some sightings will be random and so should fall outside of a cluster. The algorithm needs to be able to sift out these points and exclude them from the clusters.
  • I have no idea how many Bigfoot hot spots there are; making a prior assumption could bias the analysis. So the algorithm should be able to determine the “best” number of clusters on its own — i.e. it should be unsupervised.
  • Ideally, we’d also be able to specify the minimum cluster size — or minimum number of sightings in the hot spot — in order for said cluster to be considered an area of high activity.

Ordering Points To Identify Clustering Structure (OPTICS) is a density-based clustering algorithm able to extract clusters of varying densities and shapes⁸. It is similar in nature to the DBSCAN algorithm but is designed to overcome the problem of detecting meaningful clusters in data of varying density⁹.

The scikit-learn documentation¹⁰ sums the concepts up well, assuming you’re familiar with DBSCAN that is:

The OPTICS algorithm shares many similarities with the DBSCAN algorithm, and can be considered a generalization of DBSCAN¹⁰

The key difference between DBSCAN and OPTICS is that the OPTICS algorithm builds a reachability graph, which assigns each sample both a reachability_ distance, and a spot within the cluster ordering_ attribute; these two attributes are assigned when the model is fitted, and are used to determine cluster membership¹⁰.

The reachability distances generated by OPTICS allow for variable density extraction of clusters within a single data set¹⁰.
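
As a quick, hedged illustration of those attributes (a toy fit on random points, not the clustering we run below):

import numpy as np
from sklearn.cluster import OPTICS

# toy example: random lat/lon pairs converted to radians for the haversine metric
rng = np.random.default_rng(42)
coords = np.radians(rng.uniform(low = [25, -125], high = [49, -67], size = (200, 2)))

optics = OPTICS(min_samples = 5, metric = 'haversine', algorithm = 'ball_tree').fit(coords)

# reachability_ and ordering_ are populated when the model is fitted,
# and together they determine cluster membership (labels_)
print(optics.reachability_[:5])
print(optics.ordering_[:5])
print(optics.labels_[:5])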

Now if you’re not familiar with DBSCAN, I detail its workings here (along with explaining the haversine distance):

We’ll whip up a function to help us with the clustering. Nothing too fancy here:

  1. The function will take in our (filtered) dataset, an argument for minimum cluster size, and an argument for season.
  2. The function does some filtering based on the season.
  3. We initialise the OPTICS cluster, and do the clustering (remembering to convert to radians as we’re using the haversine distance metric). In this case, we only vary the minimum number of samples in the cluster, min_cluster_size.
  4. The function returns a dataset containing the centroids of the resulting clusters. We calculate the centroid rather crudely by taking the mean latitude and mean longitude of all points in the cluster, remembering not to include cluster -1, as this is the label that OPTICS assigns to “noise” observations.
import numpy as np
from sklearn.cluster import OPTICS

# function to do clustering
def get_cluster(data, cluster_size = 20, season = None):

    # get data, optionally filtered to a single season
    if season is not None:
        d = data.loc[data['season'] == season, :].copy()
    else:
        d = data.copy()

    # cluster using the haversine distance on (latitude, longitude) in radians
    cluster = OPTICS(
        min_cluster_size = cluster_size,
        metric = 'haversine',
        algorithm = 'ball_tree',
        n_jobs = -1
    )

    cluster.fit(
        np.radians([x for x in zip(d['latitude'], d['longitude'])])
    )

    # get cluster centroids; assign labels_ directly so they align positionally,
    # as d's index may not start at zero after filtering
    d['cluster'] = cluster.labels_
    fields = ['latitude', 'longitude']
    d_agg = d.loc[d['cluster'] >= 0, :].groupby('cluster')[fields].mean()

    return d_agg

We’re just about ready to run the clustering. Before we do, let’s do some rough calculations to help us set the minimum cluster size.

I’m going to focus on the last 20 years of data and consider an area to be a hot spot if there is consistent reported activity. I think one report per year is a good starting point if we weren’t going to stratify by season — i.e. set cluster_size above to 20, with season = None. If we’re going to factor in seasonality, then we could scale that to reflect the seasonal distribution of sightings we investigated above.

So that means:

  • Any clustering done on the “overall” data set will have a minimum cluster size of 20.
  • Summer, autumn, winter, and spring will have minimum cluster sizes of 7, 6, 3, and 3 respectively, reflecting the 37%, 30%, 16%, and 15% of observations occurring in each season (based on the EDA above); see the quick sketch below.
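
A quick sketch of that arithmetic; note that the season labels used here are an assumption, so check the actual values in the data:

# baseline: roughly one report per year over the last 20 years
baseline = 20

# approximate seasonal shares of sightings from the EDA above
seasonal_share = {'Summer': 0.37, 'Fall': 0.30, 'Winter': 0.16, 'Spring': 0.15}

# scale the baseline by each season's share to get a minimum cluster size
seasonal_min_cluster_size = {
    season: max(1, round(baseline * share))
    for season, share in seasonal_share.items()
}

print(seasonal_min_cluster_size)  # {'Summer': 7, 'Fall': 6, 'Winter': 3, 'Spring': 3}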

Finally, time to cluster and visualise! First up is the “overall” clustering result.
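
For the overall run, something along these lines does the job, reusing df_map from the heat map snippet and dropping each centroid onto a Folium map (the marker styling is just an illustrative choice):

import folium

# overall clustering: minimum of 20 sightings per cluster
centroids = get_cluster(df_map, cluster_size = 20)

# plot each cluster centroid as a simple circle marker
cluster_map = folium.Map(location = [42, -97.37], tiles = 'OpenStreetMap', zoom_start = 4)
for _, row in centroids.iterrows():
    folium.CircleMarker(
        location = [row['latitude'], row['longitude']],
        radius = 5,
        color = 'red',
        fill = True
    ).add_to(cluster_map)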

Image by author

Here we see clusters in the Washington State and Portland area as suggested by the heat map. Interestingly there aren’t many hot spots in Ohio, and none at all in Florida.

Let’s see how the seasonal clusters look:

Image by author

Interesting!

It seems that in the western USA, Bigfoot is mainly seen in summer (red dots) and autumn (orange). Central areas look to have some spring time hot spots (green). If you live in the eastern USA, you might even be able to see Bigfoot in winter (blue).

If we zoom in a little, we can home in on the area where you’re most likely to run into Bigfoot year-round:

Image by author

For now we’ll ignore the fact that fairly urbanised areas appear to attract Bigfoot and move on to identifying the landmarks Bigfoot tends to be spotted near.

Word cloud

Our data contains a field called location_details. No prizes for guessing exactly what this is.

A word cloud will allow us to use this field to understand the locations and landmarks most common in Bigfoot sightings without plotting frequencies or proportions — refreshing!

A word cloud is simply an image containing a set of words, with the size of each word indicating its importance or frequency. Word clouds are very simple exhibits and can provide a lot of insight with minimal effort. Let’s produce one using the wordcloud Python package.

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# get text: join all location descriptions into a single lower-case string
text = df.loc[:, 'location_details'].str.cat(sep = ' ').lower()

# cloud
wordcloud = WordCloud(scale = 5, background_color = 'white').generate(text)

# show
plt.figure(figsize = (20, 7.5))
plt.imshow(wordcloud, interpolation = 'bilinear')
plt.axis('off')
plt.show()

The only data manipulation we need to do is convert the column of descriptions into a single text field, and convert all the letters to lower case. That’s easily done with the cat and lower string methods. wordcloud takes care of most of the rest — producing a great-looking cloud:

Image by author

It looks like the majority of sightings involved a highway or lake of some sort. Not so many mountains, forests, and woods apparently.

We covered quite a bit of ground in this article. Let’s wrap up and summarise.

Data exploration

We saw how we can use Pandas-Profiling to cut through a lot of the manual tasks of exploratory data analysis, but also saw how sometimes getting our hands dirty is necessary if we want to understand certain elements of our data in greater detail.

Next time round, I would perhaps focus on doing some light preprocessing before running the profiling, as the report ended up treating the date field as a string rather than a date-time. A case of mistaken identity — thematically consistent.
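
For instance, a light touch like this before generating the profile would probably be enough, assuming the column is simply named date:

import pandas as pd

# parse the date column up front so the profiling report treats it as a datetime
df['date'] = pd.to_datetime(df['date'], errors = 'coerce')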

I do wonder how the run time and interpretability of the report would scale with wider, high-dimensional data frames: more summaries, interactions and correlation calculations would surely come with a computational price; larger correlation visuals might make interpretation more difficult.

Visualisation

Folium allowed us to create great interactive visualisations fairly easily: a heat map, and plots of the results of our various OPTICS clustering runs.

Although I didn’t demonstrate it here, Folium can also create and visualise clusters with very little effort. If I were going to take another run at this analysis — or something similar — taking advantage of this functionality before doing a more formal clustering exercise might be advantageous.
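
For the curious, that functionality lives in folium.plugins.MarkerCluster; a minimal sketch, again reusing df_map:

import folium
from folium.plugins import MarkerCluster

# Folium groups nearby markers into clusters automatically as you zoom in and out
quick_map = folium.Map(location = [42, -97.37], zoom_start = 4)
marker_cluster = MarkerCluster().add_to(quick_map)

for _, row in df_map.iterrows():
    folium.Marker(location = [row['latitude'], row['longitude']]).add_to(marker_cluster)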

Clustering with OPTICS

Although I didn’t show it explicitly, I filtered out all sightings with a missing latitude or missing longitude. Without doing any analysis on the missingness in the coordinate pairs I can’t be sure if this introduced any bias, something we’d need to check if we were doing this analysis in anger.

This actually leads me on to quite an important nuance in the data — the accuracy of coordinate pairs. I would imagine that getting an exact GPS reading when you’ve just spotted a Bigfoot could be challenging, and so many of the latitude and longitude readings would be best estimates with reference to a landmark. There is likely little we could do to correct this, as any form of intervention would likely lead to an introduction of bias. This form of location “inaccuracy” is actually more common than you might imagine — e.g. the location of a road traffic accident might be taken from the nearest safe place.

Now it’s pretty clear that the clustering needs some refinement, not least because some of the cluster centroids landed up in quite urbanised areas. Not only would we want to vary the min_cluster_size in OPTICS, but we’d also want some control over the cluster area if we wanted to arrive at some “searchable” zones. I’m not aware of any parameter in the OPTICS algorithm which allows us to control cluster area, so we’d need to introduce some form of post-processing — something similar is covered in the article I linked to above.

If you’ve made it this far, thanks for reading! I hope you’ve enjoyed reading as much as I have enjoyed writing.

While this article centres on a data science approach to finding a mythical creature, please take it in the spirit in which I wrote it — a fun bit of analysis. Do I think that Bigfoot exists and is prowling around somewhere? No. Do I think it would be amazing if scientists did in fact find a new species of animal out there? Yes. Go science!

Brad


