Sampling and bias [R]
Let's investigate what sampling and bias analysis look like! The code presented here is written in Julia. Click here for code in R.
First, let's create a fictional population. As you remember from class, a population is a complete set of samples in a study group. It's like you can assess every single tree in your neighbourhood's park and count how many individuals of each species are there. The total of trees individuals in your park is your population.
Notice that the word "population" here does not necessarily mean the same thing in Ecology. In Ecology, sometimes a population is not equal to the total of individuals in an area.
To start our calculations, let's set a "seed". A "seed" in code is the initial state of the (pseudo)random generator algorithm that allows us to get always the same results in random simulations (which guarantees the reproducibility of our calculations). Read more about this here.
set.seed(1234) # sets the seed for our random number generatorNice! Now we can create our population and start digging into what "bias" means. For the purpose of this lesson, let's create a random set of numbers that will describe our population. Think of them as, for example, the height measure for each tree in your park, or the length of the beak of each individual bird species you are studying.
our_population <- rnorm(1000) # generate a normal distribution of 10ˆ3 random numbers# Load packages library(ggplot2)# Create a histogram for our_populationh1 <- ggplot() +geom_histogram(aes(x=our_population), fill = "pink", color = "pink", alpha = .6) +geom_vline(xintercept = mean(our_population), color = "#0ABAB5", size = 1.5, linetype = "dashed", show.legend = T) +xlab("Values") + ylab("Frequency") +ggtitle("Population")h1Ok, so our population can be described in terms of its mean (the tiffany line). Other metrics are:
summary(our_population)Now let's suppose we are conducting a study on this population and we need to measure a sample of it. But we didn't really pay attention that our capture instrument had a higher chance to capture individuals that were too big (or too tall), and we could only sample 10% of our population. So we ended up with samples that looked like this:
# Define the mean of our_populationmean_pop <- mean(our_population)# Define the maximum probability valuemax_prob <- 1.0# Initialize an array of probabilitiesprobs <- rep(0, length(our_population))#Loop over the elements of our_populationfor (i in 1:length(our_population)) { # If the value of our_population is lower than its mean, set the probability to 0 if (our_population[i] < mean_pop) { probs[i] <- 0 #Otherwise, set the probability to a gradually increasing value } else { probs[i] <- (1 + ((our_population[i] - mean_pop) / sd(our_population))^2)^(-1) } }# Normalize the probabilitiesprobs <- probs/sum(probs)# Create the weighted sampleour_sample <- sample(our_population, 100, prob=probs, replace=TRUE)# Displaying summary stats of the biased samplesummary(our_sample)# Plotting our sampleggplot() +geom_histogram(aes(x=our_sample), fill = "#0ABAB5", color = "#0ABAB5", alpha = .6) +geom_vline(xintercept = mean(our_sample), color = "pink", size = 1.5, linetype = "solid", show.legend = T) +xlab("Values") + ylab("Frequency") +ggtitle("Sample") +scale_x_continuous(limits = c(min(our_population), max(our_population))) +scale_y_continuous(limits = c(0,100))Do you think our small sample represents our population? Let's plot them together to investigate:
Apparently, our sample got way more measurements on the right side of our plot, while our population has a higher frequency of measurements in the middle. What do you think has happened?
What will happen if we keep sampling using the same method we did this time? Let's find out!
What will happen if we keep sampling using the same method we did this time? Let's find out!
What's the difference between the two situations?
Now, what should normally happen if we had true random samples?