Sampling and bias [R]

Let's investigate what sampling and bias analysis look like! The code presented here is written in Julia. Click here for code in R.

First, let's create a fictional population. As you remember from class, a population is a complete set of samples in a study group. It's like you can assess every single tree in your neighbourhood's park and count how many individuals of each species are there. The total of trees individuals in your park is your population.

Notice that the word "population" here does not necessarily mean the same thing in Ecology. In Ecology, sometimes a population is not equal to the total of individuals in an area.

To start our calculations, let's set a "seed". A "seed" in code is the initial state of the (pseudo)random generator algorithm that allows us to get always the same results in random simulations (which guarantees the reproducibility of our calculations). Read more about this here.

set.seed(1234)    # sets the seed for our random number generator
0.3s

Nice! Now we can create our population and start digging into what "bias" means. For the purpose of this lesson, let's create a random set of numbers that will describe our population. Think of them as, for example, the height measure for each tree in your park, or the length of the beak of each individual bird species you are studying.

our_population <- rnorm(1000)            # generate a normal distribution of 10ˆ3 random numbers
# Load packages                   
library(ggplot2)
# Create a histogram for our_population
h1 <- ggplot() +
geom_histogram(aes(x=our_population), fill = "pink", color = "pink", alpha = .6) +
geom_vline(xintercept = mean(our_population), color = "#0ABAB5", size = 1.5, linetype = "dashed", show.legend = T) +
xlab("Values") + ylab("Frequency") +
ggtitle("Population")
h1
0.4s

Ok, so our population can be described in terms of its mean (the tiffany line). Other metrics are:

summary(our_population)
0.3s

Now let's suppose we are conducting a study on this population and we need to measure a sample of it. But we didn't really pay attention that our capture instrument had a higher chance to capture individuals that were too big (or too tall), and we could only sample 10% of our population. So we ended up with samples that looked like this:

# Define the mean of our_population
mean_pop <- mean(our_population)
# Define the maximum probability value
max_prob <- 1.0
# Initialize an array of probabilities
probs <- rep(0, length(our_population))
#Loop over the elements of our_population
for (i in 1:length(our_population)) {
    # If the value of our_population is lower than its mean, set the probability to 0
      if (our_population[i] < mean_pop) {
          probs[i] <- 0
    #Otherwise, set the probability to a gradually increasing value
      } else {
        probs[i] <- (1 + ((our_population[i] - mean_pop) / sd(our_population))^2)^(-1)
              }
       }
# Normalize the probabilities
probs <- probs/sum(probs)
# Create the weighted sample
our_sample <- sample(our_population, 100, prob=probs, replace=TRUE)
# Displaying summary stats of the biased sample
summary(our_sample)
0.5s
# Plotting our sample
ggplot() +
geom_histogram(aes(x=our_sample), fill = "#0ABAB5", color = "#0ABAB5", alpha = .6) +
geom_vline(xintercept = mean(our_sample), color = "pink", size = 1.5, linetype = "solid", show.legend = T) +
xlab("Values") + ylab("Frequency") +
ggtitle("Sample") +
scale_x_continuous(limits = c(min(our_population), max(our_population))) +
scale_y_continuous(limits = c(0,100))
0.4s

Do you think our small sample represents our population? Let's plot them together to investigate:

Apparently, our sample got way more measurements on the right side of our plot, while our population has a higher frequency of measurements in the middle. What do you think has happened?

What will happen if we keep sampling using the same method we did this time? Let's find out!

What will happen if we keep sampling using the same method we did this time? Let's find out!

What's the difference between the two situations?

Now, what should normally happen if we had true random samples?

Runtimes (1)