Let's investigate what sampling and bias analysis look like!
The code presented here is written in Julia, but don't worry about memorizing the code: what we want to understand here is the mechanism behind the concepts.
First, let's create a fictional population. As you remember from class, a population is the complete set of observations in a study group. Imagine you could assess every single tree in your neighbourhood's park and count how many individuals of each species are there. The total number of individual trees in your park is your population.
Notice that the word "population" here does not necessarily mean the same thing as in Ecology. In Ecology, a population is sometimes not equal to the total number of individuals in an area, but in statistics it is the absolute total of observations we can get (all individuals, groups, families, etc.).
To start our calculations, let's set a "seed". A "seed" in code is the initial state of the (pseudo)random generator algorithm. Setting it allows us to always get the same results in random simulations, which guarantees the reproducibility of our calculations. Read more about this here.
# Run this code chunk to make sure the functions and packages used here are of the same version.
using Pkg;
Pkg.activate();
Pkg.instantiate();
using Random # loads the package that manages the random number generator
Random.seed!(1234); # sets the seed for our random number generator
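To see what a seed does, here is a minimal sketch. It uses separate generator objects (`MersenneTwister`, from the `Random` standard library) so it does not disturb the seed we just set. The seed values 42 and 43 and the variable names are made up for illustration.

```julia
using Random

# Two generators started from the same seed...
rng_a = MersenneTwister(42)
rng_b = MersenneTwister(42)

# ...and one started from a different seed
rng_c = MersenneTwister(43)

draws_a = randn(rng_a, 5)
draws_b = randn(rng_b, 5)
draws_c = randn(rng_c, 5)

println("Same seed, same numbers? ", draws_a == draws_b)
println("Different seed, same numbers? ", draws_a == draws_c)
```

Generators that start from the same seed produce the same sequence of "random" numbers, which is why anyone who runs this lesson with the same seed and the same package versions should see the same results.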
Nice! Now we can create our population and start digging into what "bias" means. For the purpose of this lesson, let's create a random set of numbers that will describe our population. Think of them as, for example, the height of each tree in your park, or the beak length of each individual bird of the species you are studying.
our_population = randn(10^3); # draw 10^3 random numbers from a standard normal distribution
# Plotting our population
using Plots, Statistics # Load packages
bins = collect(minimum(our_population):0.5:maximum(our_population)) # Define bins for histogram
h1 = histogram(our_population, # Create a histogram for our_population...
    color=:pink, linecolor=:match, # ... make it pink!...
    label="Population") # ... and add a label to our plot.
vline!([mean(our_population)], # Add a line where the mean of our population is...
    linewidth=5, # ... tweak the vertical line to be thicker...
    linecolor="#0ABAB5", # ... make it tiffany!...
    label="mean") # ... and add the label "mean".
Ok, so our population can be described in terms of its mean (the tiffany line). Other metrics are:
using StatsBase
summarystats(our_population)
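If you are curious where numbers like these come from, here is a small sketch that computes a few common summary statistics one by one with the `Statistics` standard library. The `heights` vector is made-up data, small enough to check by hand.

```julia
using Statistics

# Made-up measurements (for example, tree heights in metres)
heights = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mean_height = mean(heights)      # arithmetic mean
median_height = median(heights)  # middle value of the sorted data
std_height = std(heights)        # sample standard deviation (divides by n - 1)
q25, q75 = quantile(heights, [0.25, 0.75])  # first and third quartiles

println("mean = ", mean_height, ", median = ", median_height)
println("standard deviation = ", std_height)
println("quartiles = ", q25, " and ", q75)
```

The mean and the median describe where the data are centred, while the standard deviation and the quartiles describe how spread out they are.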
Now let's suppose we are conducting a study on this population and we need to measure a sample of it. A sample would be, for example, the set of wing lengths you measure for the birds in a single location. Each measurement would be an observation.
But we failed to notice that our capture instrument was more likely to capture larger individuals, and we could only sample 10% of our population. So we ended up with samples that looked like this:
# Define the mean of our_population
mean_pop = mean(our_population)
# Define the number of elements in our_population
n = length(our_population)
# Define the maximum probability value
max_prob = 1.0
# Initialize an array of probabilities
probs = zeros(n)
# Loop over the elements of our_population
for i in 1:n
    # If the value of our_population is lower than its mean, set the probability to 0
    if our_population[i] < mean_pop
        probs[i] = 0.0
    # Otherwise, set the probability to a gradually increasing value
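The idea behind the loop above is that individuals below the mean are never captured, while larger individuals are captured more and more easily. Here is a minimal, self-contained sketch of that idea from start to finish. The straight-line increase in probability, the local generator `rng`, and the names `population`, `largest`, `candidates` and `biased_sample` are assumptions made for illustration, and are not necessarily what the lesson's own code does.

```julia
using Random, Statistics, StatsBase

rng = MersenneTwister(1234)        # local generator (assumption), leaves the global seed alone
population = randn(rng, 10^3)      # stand-in for our_population

mean_pop = mean(population)
max_prob = 1.0
largest = maximum(population)

# Assumed rule: probability 0 below the mean, then a straight-line rise
# that reaches max_prob at the largest individual
probs = [x < mean_pop ? 0.0 : max_prob * (x - mean_pop) / (largest - mean_pop)
         for x in population]

# Keep only the individuals that can be captured at all
candidates = findall(>(0), probs)

# Capture 10% of the population, without catching the same individual twice
sample_size = length(population) ÷ 10
biased_sample = sample(rng, population[candidates], Weights(probs[candidates]),
                       sample_size; replace=false)

println("Population mean:    ", mean(population))
println("Biased sample mean: ", mean(biased_sample))
```

Because every captured individual is larger than the population mean, the sample mean ends up above the population mean no matter which individuals happen to be drawn.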
Apparently, our sample got way more measurements on the right side of our plot, while our population has a higher frequency of measurements in the middle. What do you think has happened?
What will happen if we keep sampling using the same method we did this time? Let's find out!
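One way to explore this question is to repeat the sampling many times and look at the mean of each sample. The sketch below does that for a biased method and, for comparison, for a simple random sample. The helper `biased_sample_mean`, the weighting rule (capture chance proportional to the distance above the mean), and the choice of 200 repetitions are all assumptions made for illustration.

```julia
using Random, Statistics, StatsBase

rng = MersenneTwister(1234)        # local generator (assumption)
population = randn(rng, 10^3)      # stand-in for our_population
mean_pop = mean(population)

# Hypothetical helper: draw one biased sample and return its mean.
# Only individuals above the mean can be captured, and the capture
# weight grows with the distance from the mean (assumed rule).
function biased_sample_mean(rng, population, sample_size)
    pop_mean = mean(population)
    candidates = filter(x -> x > pop_mean, population)
    capture_weights = Weights(candidates .- pop_mean)
    return mean(sample(rng, candidates, capture_weights, sample_size; replace=false))
end

n_repeats = 200
sample_size = 100

biased_means = [biased_sample_mean(rng, population, sample_size) for _ in 1:n_repeats]
random_means = [mean(sample(rng, population, sample_size; replace=false)) for _ in 1:n_repeats]

println("Population mean:                ", mean_pop)
println("Average of biased sample means: ", mean(biased_means))
println("Average of random sample means: ", mean(random_means))
```

Repeating a biased method does not make the bias go away: the biased sample means stay above the population mean every time, while the means of the simple random samples scatter around it.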
Quiz time! Prepare your Plicker cards, scan the image below or click here to go to the quiz page.