A Beginner-Friendly Introduction to Applied Science | by Emma Boudreau | Jun, 2023

Now that we have a basic grasp on data and the different types of features we might encounter on our travels, we can finally start analyzing some data. To make this as comprehensible as possible, I will work through a real data set to demonstrate these concepts. The data I will be using is the trees.csv data distributed by Florida State University, available here (GNU LGPL license). I will be using the Julia programming language, but the process is generally the same across most tools; the method calls or syntax might vary depending on what is used, but the techniques and the ultimate changes we are making to the data are essentially identical. We will start this process by reading in the data and doing some basic processing. This will require some packages in Julia:

using Pkg; Pkg.activate("ds")

Now let’s import, and I will use a sink argument in order to read in our DataFrame. A sink argument just allows our data-format module to infer how to read some data into a given type; in our case, a DataFrame.

using CSV
using DataFrames
using Statistics
using HypothesisTests

df = CSV.read("trees.csv", DataFrame)

In most cases, this would be where we would clean our data and remove any missing values. Luckily, this data is incredibly clean and does not even have missing values. The shape remains the same after a drop:
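A minimal sketch of that check, assuming the df read in above (dropmissing! is the DataFrames call that removes rows containing missing values):

```julia
# shape before and after dropping rows with missing values
println(size(df))
dropmissing!(df)
println(size(df))
```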

(31, 4)

(31, 4)

That being said, we can go straight to looking at our features.

We have four features: Index, Girth, Height, and Volume. Girth, Height, and Volume are all continuous features. Index is just a representation of the row, a label feature. In some cases, it can be confusing whether we are looking at a label or a feature. A great way to figure this out is to count the unique values; when that count equals the length of the data, it is a strong indicator that the column is a label.

println(length(Set(df[!, :Index])))
println(length(df[!, :Index]))

There are no categorical features in this data, but this gives us a great opportunity to create one! As stated prior, categorical features are often just a measurement of continuous features. For example, we could classify our trees by how tall they are. Using a simple comprehension, I will create a new feature which assigns a category based on the mean height:

hmu = mean(df[!, Symbol("Height (ft)")])
df[!, Symbol("Height Class")] = [begin
if x > hmu
    "taller"
else
    "shorter"
end end for x in df[!, Symbol("Height (ft)")]]

And now we have a new feature!

31-element Vector{String}:


Now we have a basic relationship with our data where we could actually find some insights. For example, it might be interesting to see if height correlates to width. For this, we now need to structure a test.

hypothesis testing

Now that we have established our features and recognize our data in the format it is presented, we might begin to formulate a hypothesis.

Considering our features, we will form the following hypothesis.

“If the height of a tree is higher, then it is likely girth will also be higher.”

This is an interesting hypothesis because the answer is not immediately obvious. There is a lot of genetic diversity between trees: some trees grow tall while some grow wide. This data also has only 31 observations, which probably means we do not have enough data to generalize well. That being said, we can still perform a test! To start, we will create the sample we discussed earlier, using our new category to separate the taller trees from the shorter ones. With Julia, I like to make my own neat little dispatch for this that makes conditional masking a lot easier. Of course, it can still be done with filter!.

import Base: getindex

function getindex(df::DataFrame, bv::Vector{Bool})
    # collect the indices where the mask is false, then drop those rows
    points::Vector{Int64} = findall(x -> x == 0, bv)
    dfcopy::DataFrame = copy(df)
    delete!(dfcopy, points)
    dfcopy
end

Now I will separate out the taller df:

mask = [x == "taller" for x in df[!, Symbol("Height Class")]]
tallerdf = df[mask]
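For comparison, the same split can be done with the standard filter the article mentions, without the custom getindex (a sketch, assuming the Height Class column created earlier):

```julia
# keep only the rows whose Height Class is "taller"
tallerdf = filter(row -> row[Symbol("Height Class")] == "taller", df)
```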

Now we can perform a test to check whether the girth inside of this new separated sample is statistically significant. If it is, we can reject or fail to reject our null hypothesis and receive an answer. There is more to this; every test carries a certain level of confidence, but for this brief overview we will only touch lightly on that. Now I will get our two samples out. Note that they will need to be of the same length, so we will randomly subsample the population down to the length of our sample.

grow_samp = tallerdf[!, Symbol("Girth (in)")]
girth = df[!, Symbol("Girth (in)")]
samples = [girth[rand(1:length(girth))] for n in 1:length(grow_samp)]

Our grow_samp is the group categorized as greater in height. Our samples are random samples from the whole population. Using these two, we can determine whether the trees labeled as taller differ from the rest in terms of girth. For this, I will use a One Sample T-test (note that this is distinct from the independent, two-sample T-test). This is a standard hypothesis test and likely the easiest to learn. In statistics, everything starts with a bell curve. The majority of the population lies at the center of this curve. Things which are statistically significant lie in abnormality, away from the rest of the data, at the sides, or tails, of the curve. In the example of the normal distribution, our mean sits at the center, and observations roughly two standard deviations from the mean, in either direction, sit at the tails.

The bell curve itself is called a distribution, as it describes how our data is distributed. In most cases, given that we live in 2023, we are likely going to be working with software that has already implemented the formulas for us. As a beginner, I would recommend getting familiar with probability density functions (PDFs) and cumulative distribution functions (CDFs), but not necessarily memorizing their formulas. We will go into some detail here, but not much; there are plenty of articles that cover these topics in more depth.

One notable PDF one might want to get familiar with is that of the normal distribution. This is a simple formula that makes a lot of sense. In order to encourage thinking, let’s work our way backwards; what do we want in return?

We want a distribution function that lays our data out by probability, where each input is scaled to its standard deviations from the mean. If we were to take a value from some data, how would we find out how many standard deviations from the mean it is? Suppose it takes 5 packing peanuts to ship each box, and we had 200 peanuts but used 125; how many more boxes can we fill? This is a similar problem: first we get the difference between our value and the mean, then we divide by the standard deviation, just as we subtract the used peanuts and divide by the peanuts per box to see how many boxes we might fill.

z = (x̄ - μ) / σ
  • xbar is an observation from our sample,
  • mu is the sample mean,
  • and lowercase sigma is the standard deviation.
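Put as code, the standardization the bullets describe (and, for reference, the textbook normal PDF built on it) looks something like this sketch:

```julia
# standard deviations of an observation xbar from the mean mu
zscore(xbar, mu, sigma) = (xbar - mu) / sigma

# the packing-peanut arithmetic from above: 200 - 125 leftover peanuts,
# 5 peanuts per box -> 15 more boxes to fill
@assert zscore(200, 125, 5) == 15.0

# for reference, the textbook normal PDF, using the same standardization
normpdf(x, mu, sigma) = exp(-zscore(x, mu, sigma)^2 / 2) / (sigma * sqrt(2pi))
```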

If you would like to learn more about what these symbols mean in the context of statistics, here is an article I wrote all about it:

What we are testing for whenever we test for correlation is whether or not our data is abnormal; whether or not it lies in those tails that are two standard deviations from the mean. It is important to remember exactly how these tests work; we are not testing for cause, we are testing for correlation. This does not mean that these trees are this wide because they are tall, they just are wider when they are taller. Another thing to consider here is our hypothesis:

If a given tree is taller, then it also likely has more girth.

Whenever we perform this test, we are not testing to see if our hypothesis is true. We are testing whether we can reject its opposite; our real question here is entirely different.

If a given tree is taller, then it still has normal girth.

Whenever we fail to reject this opposite statement, we have not proven our original hypothesis; we have only failed to show that the opposite is false. This opposite statement is called the null hypothesis, and it should be in the back of your mind during an experiment. Now back to our experiment. As I briefly mentioned, there is a lot of math involved in the different functions for different distributions. The most approachable of the testing distributions is probably the T distribution, which is typically used for this type of one-sample test, though it still has a rather complicated CDF. That being said, we can use software libraries and the Julia ecosystem to our advantage. In languages like Python or R, there are similar packages capable of performing this sort of test. For this, I will use the OneSampleTTest from HypothesisTests.
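Under the hood, the statistic this test computes can be sketched in a few lines of plain Julia (a paired one-sample t on the differences between two equal-length vectors; a sketch, not the library's actual implementation):

```julia
using Statistics

# t statistic for paired differences: mean difference over its standard error
function tstat(a::Vector, b::Vector)
    d = a .- b
    mean(d) / (std(d) / sqrt(length(d)))
end
```

The library then compares this statistic against the T distribution with length(d) - 1 degrees of freedom to produce the p-value.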

OneSampleTTest(samples, grow_samp)

One sample t-test
Population details:
parameter of interest: Mean
value under h_0: 0
point estimate: -1.24
95% confidence interval: (-3.338, 0.8581)

Test summary:
outcome with 95% confidence: fail to reject h_0
two-sided p-value: 0.2256

number of observations: 15
t-statistic: -1.2675959523435136
degrees of freedom: 14
empirical standard error: 0.9782296935450965

While it is unfortunate that we do not have many observations to work with, this data shows little statistical significance for our hypothesis. We have found that the opposite could even potentially be true, though we would certainly need more data to find out. Our P value came out to roughly 0.226. Generally speaking, a P value below .05 is considered statistically significant; our value indicates that this is not the case, at least not in this data.
