Sampling and bias

Let's investigate what sampling and bias analysis look like!

The code presented here is written in Julia, but don't worry about memorizing the code: what we want to understand here is the mechanism behind the concepts.

First, let's create a fictional population. As you remember from class, a population is a complete set of samples in a study group. It's like you can assess every single tree in your neighbourhood's park and count how many individuals of each species are there. The total of trees individuals in your park is your population.

Notice that the word "population" here does not necessarily mean the same thing in Ecology. In Ecology, sometimes a population is not equal to the total of individuals in an area, but in statistics, it is the absolute total of observations we can get (all individuals, groups, families, etc...).

To start our calculations, let's set a "seed". A "seed" in code is the initial state of the (pseudo)random generator algorithm that allows us to get always the same results in random simulations (which guarantees the reproducibility of our calculations). Read more about this here.

using Random           # calls a package to manage the random number generator
Random.seed!(1234);    # sets the seed for our random number generator
0.0s

Nice! Now we can create our population and start digging into what "bias" means. For the purpose of this lesson, let's create a random set of numbers that will describe our population. Think of them as, for example, the height measure for each tree in your park, or the length of the beak of each individual bird species you are studying.

our_population = randn(10^3);            # generate a normal distribution of 10ˆ3 random numbers
# Plotting our population
using Plots, Statistics                  # Load packages
bins = collect(minimum(our_population):0.5:maximum(our_population))     # Define bins for histogram
h1 = histogram(our_population,           # Create a histogram for our_population...            
     color=:pink, linecolor=:match,      # ... make it pink!...
     label="Population")                 # ... and add a label to our plot.
vline!([mean(our_population)],           # Add a line where the mean of our population is...
  linewidth=5,                           # ... tweak the vertical line to be thicker...
  linecolor="#0ABAB5",                   # ... make it tiffany!...
  label = "mean")                        # ... and add the label "mean".
33.9s

Ok, so our population can be described in terms of its mean (the tiffany line). Other metrics are:

Now let's suppose we are conducting a study on this population and we need to measure a sample of it. A sample would be, for example, each time you measure the wing length of the birds in a single location. Each measure would be an observation.

But we didn't really pay attention that our capture instrument had a higher chance to capture individuals that were too big, and we could only sample 10% of our population. So we ended up with samples that looked like this:

# Plotting our sample
histogram(our_sample,bins = bins,      
  color="#0ABAB5",linecolor=:match,      # ... make it tiffany!...
  label="Sample",                        # ... add a label to our plot.
  xlims=xlims(h1),                       # Determine the limits of the X axis...
  ylims=ylims(h1))                       # ... and the limits of the Y axis.
vline!([mean(our_sample)],               # Add a line where the mean of our population is...
  linewidth=5,                           # ...tweak the vertical line to be thicker...
  linecolor=:pink,                       # ... make it pink!...
  label = "mean")                        # ... and add the label "mean".
1.5s

Do you think our small sample represents our population? Let's plot them together to investigate:

Apparently, our sample got way more measurements on the right side of our plot, while our population has a higher frequency of measurements in the middle. What do you think has happened?

What will happen if we keep sampling using the same method we did this time? Let's find out!

Quiz time! Prepare your Plicker cards, scan the image below or click here to go to the quiz page.
p = plot(layout = (3, 3), size = (800, 500), legend = false)
# Plot the data for each value of n
for (i, n) in enumerate(n_values)
    results = Float64[]
    for j in 1:n
        n_sample = sample(our_population, AnalyticWeights(probs), n_values[i])
        push!(results, mean(n_sample))
    end
    histogram!(p[i], results, color="#0ABAB5", alpha = 0.4, bins=50, linecolor=:match, ylabel="Frequency")
    vline!(p[i], [mean(our_population)], linewidth=2, linecolor=:red, label="Population Mean")
    vline!(p[i], [mean(results)], linewidth=2, linecolor=:blue, linestyle=:dash, label="Sample Mean")
end
# Display the plot
p
2.6s

But what if we can get more and more observations in each sample? Will the bias disappear or be reinforced?

Quiz time! Prepare your Plicker cards, scan the image below or click here to go to the quiz page.
n_values = [100, 200, 500, 1000]
p = plot(layout = (2, 2), size = (800, 500), legend = false)
# Plot the data for each value of n
for (i, n) in enumerate(n_values)
    results = Float64[]
    for j in 1:n
        n_sample = sample(our_population, AnalyticWeights(probs), n_values[i])
        push!(results, mean(n_sample))
    end
    histogram!(p[i], results, color="#0ABAB5", alpha = 0.4, bins=50, linecolor=:match, xlabel="Mean", ylabel="Frequency")
    vline!(p[i], [mean(our_population)], linewidth=2, linecolor=:red, label="Population Mean")
    vline!(p[i], [mean(results)], linewidth=2, linecolor=:blue, linestyle=:dash, label="Sample Mean")
end
# Display the plot
p
0.8s

What's the difference between the two situations?

Precision (do the values vary too much around the mean?)

vs

Accuracy (are the means too different from the population mean?)

Now, what should normally happen if we had true random samples?

# Increase number of observations in each sample
n_values = [10, 200, 200, 200, 500, 10000]
p = plot(layout = (2, 3), size = (800, 500), legend = false)
# Plot the data for each value of n
for (i, n) in enumerate(n_values)
    results = Float64[]
    for j in 1:n
        n_sample = sample(our_population, n_values[i])
        push!(results, mean(n_sample))
    end
    histogram!(p[i], results, color="#0ABAB5", alpha = 0.4, bins=50, linecolor=:match, xlabel="Mean", ylabel="Frequency")
    vline!(p[i], [mean(our_population)], linewidth=2, linecolor=:red, label="Population Mean")
    vline!(p[i], [mean(results)], linewidth=2, linecolor=:blue, linestyle=:dash, label="Sample Mean")
end
# Display the plot
p
2.5s
Quiz time! Prepare your Plicker cards, scan the image below or click here to go to the quiz page.
Runtimes (1)