Gracielle Higino / Feb 22 2023

Sampling and bias

Let's investigate what sampling and bias analysis look like!

The code presented here is written in Julia, but don't worry about memorizing the code: what we want to understand here is the mechanism behind the concepts.

First, let's create a fictional population. As you remember from class, a population is the complete set of observations in a study group. Imagine you could assess every single tree in your neighbourhood's park and count how many individuals of each species are there. The total number of individual trees in your park is your population.

Notice that the word "population" here does not necessarily mean the same thing as it does in Ecology. In Ecology, a population is sometimes not equal to the total number of individuals in an area, but in statistics it is the absolute total of observations we can get (all individuals, groups, families, etc.).

To start our calculations, let's set a "seed". A "seed" in code is the initial state of the (pseudo)random number generator. Setting it lets us always get the same results in random simulations, which guarantees the reproducibility of our calculations. Read more about this here.

# Run this code chunk to make sure the functions and packages used here are of the same version.
using Pkg
Pkg.activate()
Pkg.instantiate()

using Random           # calls a package to manage the random number generator
Random.seed!(1234);    # sets the seed for our random number generator
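To see what a seed does in practice, here is a minimal sketch. It assumes a local `MersenneTwister` generator so that the global seed set above is left untouched; `draw_with_seed` is a hypothetical helper written only for this illustration.

```julia
using Random

# Hypothetical helper for this illustration: build a generator from a seed,
# then draw three random numbers from it.
function draw_with_seed(seed)
    rng = MersenneTwister(seed)   # local generator; the global one is not affected
    return randn(rng, 3)
end

first_run  = draw_with_seed(1234)
second_run = draw_with_seed(1234)   # same seed as first_run
other_run  = draw_with_seed(4321)   # a different seed

println(first_run == second_run)    # the same seed reproduces the same numbers
println(first_run == other_run)     # a different seed gives a different sequence
```

The two runs that share a seed produce identical numbers, which is what makes a random simulation reproducible.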

Nice! Now we can create our population and start digging into what "bias" means. For the purpose of this lesson, let's create a random set of numbers that will describe our population. Think of them as, for example, the height of each tree in your park, or the beak length of each individual bird of the species you are studying.

our_population = randn(10^3);            # generate a normal distribution of 10^3 random numbers
# Plotting our population
using Plots, Statistics                  # Load packages
bins = collect(minimum(our_population):0.5:maximum(our_population))     # Define bins for histogram
h1 = histogram(our_population,           # Create a histogram for our_population...
  color=:pink, linecolor=:match,         # ... make it pink!...
  label="Population")                    # ... and add a label to our plot.
vline!([mean(our_population)],           # Add a line where the mean of our population is...
  linewidth=5,                           # ... tweak the vertical line to be thicker...
  linecolor="#0ABAB5",                   # ... make it tiffany!...
  label = "mean")                        # ... and add the label "mean".

Ok, so our population can be described in terms of its mean (the tiffany line). Other metrics are:

using StatsBase
summarystats(our_population)

Now let's suppose we are conducting a study on this population and we need to measure a sample of it. A sample would be, for example, the set of wing lengths you measure on the birds at a single location. Each measurement would be an observation.

But we failed to notice that our capture instrument was more likely to catch larger individuals, and we could only sample 10% of our population. So we ended up with a sample that looked like this:

# Define the mean of our_population
mean_pop = mean(our_population)
# Define the number of elements in our_population
n = length(our_population)
# Define the maximum probability value
max_prob = 1.0
# Initialize an array of probabilities
probs = zeros(n)
# Loop over the elements of our_population
for i in 1:n
    # If the value of our_population is lower than its mean, set the probability to 0
    if our_population[i] < mean_pop
        probs[i] = 0.0
    # Otherwise, set the probability to a gradually increasing value
    else
        probs[i] = (max_prob / (mean_pop * n)) * our_population[i] + (1.0 + (max_prob / mean_pop))
    end
end
# Normalize the probabilities
probs /= sum(probs)
# Create the weighted sample
our_sample = sample(our_population, AnalyticWeights(probs), 100)
# Displaying summary stats of the biased sample
summarystats(our_sample)

# Plotting our sample
histogram(our_sample, bins = bins,
  color="#0ABAB5", linecolor=:match,     # ... make it tiffany!...
  label="Sample",                        # ... add a label to our plot.
  xlims=xlims(h1),                       # Determine the limits of the X axis...
  ylims=ylims(h1))                       # ... and the limits of the Y axis.
vline!([mean(our_sample)],               # Add a line where the mean of our sample is...
  linewidth=5,                           # ... tweak the vertical line to be thicker...
  linecolor=:pink,                       # ... make it pink!...
  label = "mean")                        # ... and add the label "mean".
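If you are curious about the mechanism behind a weighted draw, it can be sketched with the standard library alone. This is only an illustration of the idea, not how StatsBase implements `sample`; `weighted_draw` and the toy data are hypothetical names made up for this example.

```julia
using Random

# Hypothetical helper (not part of StatsBase): draw `k` values from `values`,
# with replacement, where item i is picked with probability weights[i] / sum(weights).
function weighted_draw(rng, values, weights, k)
    cumulative = cumsum(weights ./ sum(weights))    # running total of probabilities
    draws = similar(values, k)
    for j in 1:k
        u = rand(rng)                               # uniform number in [0, 1)
        i = searchsortedlast(cumulative, u) + 1     # first item whose running total exceeds u
        draws[j] = values[min(i, length(values))]   # guard against rounding at the top end
    end
    return draws
end

rng = MersenneTwister(1234)
toy_heights = [1.0, 2.0, 3.0, 4.0]
toy_weights = [0.0, 0.0, 1.0, 3.0]   # the two smallest individuals can never be captured
toy_sample = weighted_draw(rng, toy_heights, toy_weights, 1000)

println(count(==(4.0), toy_sample) / 1000)   # share of the largest individual, close to 0.75
println(any(x -> x < 3.0, toy_sample))       # false: zero-weight individuals never appear
```

Individuals with a weight of zero never show up in the sample, and the others appear roughly in proportion to their weights, which is how the capture instrument in our story distorts what we observe.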

Do you think our small sample represents our population? Let's plot them together to investigate:

histogram(vec(our_population), bins = bins,
  color=:pink, linecolor=:match, alpha = 0.3,
  label="Population")
histogram!(vec(our_sample), bins = bins,
  color="#0ABAB5", linecolor=:match, alpha = 0.4,
  label="Sample",
  xlims=xlims(h1),
  ylims=ylims(h1))
vline!([mean(our_population) mean(our_sample)],
  linewidth=5,
  linecolor=["#0ABAB5" :pink], linestyle = [:solid :dash],
  label = ["population mean" "sample mean"])

Our sample has far more measurements on the right side of the plot, while our population has its highest frequency of measurements in the middle. What do you think has happened?

What will happen if we keep sampling using the same method we did this time? Let's find out!

Quiz time! Prepare your Plicker cards, scan the image below or click here to go to the quiz page.

# keep sampling 100 individuals at a time
n_values = repeat([100],9);

p = plot(layout = (3, 3), size = (800, 500), legend = false)
# Plot the data for each value of n
for (i, n) in enumerate(n_values)
    results = Float64[]
    for j in 1:n
        n_sample = sample(our_population, AnalyticWeights(probs), n_values[i])
        push!(results, mean(n_sample))
    end
    histogram!(p[i], results, color="#0ABAB5", alpha = 0.4, bins=50, linecolor=:match, ylabel="Frequency")
    vline!(p[i], [mean(our_population)], linewidth=2, linecolor=:red, label="Population Mean")
    vline!(p[i], [mean(results)], linewidth=2, linecolor=:blue, linestyle=:dash, label="Sample Mean")
end
# Display the plot
p

But what if we can get more and more observations in each sample? Will the bias disappear or be reinforced?

Quiz time! Prepare your Plicker cards, scan the image below or click here to go to the quiz page.

n_values = [100, 200, 500, 1000]
p = plot(layout = (2, 2), size = (800, 500), legend = false)
# Plot the data for each value of n
for (i, n) in enumerate(n_values)
    results = Float64[]
    for j in 1:n
        n_sample = sample(our_population, AnalyticWeights(probs), n_values[i])
        push!(results, mean(n_sample))
    end
    histogram!(p[i], results, color="#0ABAB5", alpha = 0.4, bins=50, linecolor=:match, xlabel="Mean", ylabel="Frequency")
    vline!(p[i], [mean(our_population)], linewidth=2, linecolor=:red, label="Population Mean")
    vline!(p[i], [mean(results)], linewidth=2, linecolor=:blue, linestyle=:dash, label="Sample Mean")
end
# Display the plot
p

What's the difference between the two situations?

Precision (do the values vary too much around the mean?)
vs
Accuracy (are the means too different from the population mean?)
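Both ideas can be put into numbers. The sketch below is a simplified illustration, not the code used for the plots: it assumes a cruder capture rule (only individuals at or above the population mean can be caught, all with equal chance), and `toy_population`, `catchable` and `sample_means` are hypothetical names introduced here.

```julia
using Random, Statistics

rng = MersenneTwister(1234)
toy_population = randn(rng, 1000)
pop_mean = mean(toy_population)

# Assumption for this sketch: the biased method can only catch individuals
# at or above the population mean (a cruder rule than the weights used earlier).
catchable = filter(x -> x >= pop_mean, toy_population)

# Hypothetical helper: means of `reps` samples of size `n`, drawn with replacement from `pool`.
sample_means(rng, pool, n, reps) = [mean(rand(rng, pool, n)) for _ in 1:reps]

means_small = sample_means(rng, catchable, 100, 500)    # 500 biased samples of 100 observations
means_large = sample_means(rng, catchable, 1000, 500)   # 500 biased samples of 1000 observations

# Accuracy: distance between the average sample mean and the population mean (the bias).
bias_small = mean(means_small) - pop_mean
bias_large = mean(means_large) - pop_mean

# Precision: how much the sample means vary from one sample to the next.
spread_small = std(means_small)
spread_large = std(means_large)

println("n = 100:  bias = ", round(bias_small, digits = 2), ", spread = ", round(spread_small, digits = 3))
println("n = 1000: bias = ", round(bias_large, digits = 2), ", spread = ", round(spread_large, digits = 3))
```

Under this assumed capture rule, larger samples should make the sample means cluster more tightly (better precision) while leaving them about as far from the population mean as before (no gain in accuracy).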

Now, what should normally happen if we had true random samples?

# Increase number of observations in each sample
n_values = [10, 200, 200, 200, 500, 10000]
p = plot(layout = (2, 3), size = (800, 500), legend = false)
# Plot the data for each value of n
for (i, n) in enumerate(n_values)
    results = Float64[]
    for j in 1:n
        n_sample = sample(our_population, n_values[i])
        push!(results, mean(n_sample))
    end
    histogram!(p[i], results, color="#0ABAB5", alpha = 0.4, bins=50, linecolor=:match, xlabel="Mean", ylabel="Frequency")
    vline!(p[i], [mean(our_population)], linewidth=2, linecolor=:red, label="Population Mean")
    vline!(p[i], [mean(results)], linewidth=2, linecolor=:blue, linestyle=:dash, label="Sample Mean")
end
# Display the plot
p
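For unbiased samples drawn with replacement, the spread of the sample means is expected to shrink as the population standard deviation divided by the square root of the sample size. The sketch below checks this on a toy population; the names `toy_population`, `sizes`, `observed` and `expected`, and the choice of 2000 repetitions, are assumptions made for this illustration.

```julia
using Random, Statistics

rng = MersenneTwister(1234)
toy_population = randn(rng, 1000)
sigma = std(toy_population; corrected = false)   # standard deviation of the whole population

sizes = [10, 100, 1000]

# For each sample size, take 2000 unbiased samples (with replacement)
# and measure how much their means vary.
observed = [std([mean(rand(rng, toy_population, n)) for _ in 1:2000]) for n in sizes]

# What theory predicts for the spread of the sample means.
expected = sigma ./ sqrt.(sizes)

for (n, o, e) in zip(sizes, observed, expected)
    println("n = ", n, "  observed spread = ", round(o, digits = 3),
            "  sigma/sqrt(n) = ", round(e, digits = 3))
end
```

The observed and predicted values should land close to each other, and both should drop as the sample size grows.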

Quiz time! Prepare your Plicker cards, scan the image below or click here to go to the quiz page.
