Multilevel Regression for Sockeye Salmon Spawner Abundance Across Years and Rivers
Introduction
Let the data speak for itself... but all the data must speak. - John P. Cunningham (2012)
Estimates of the number of sockeye salmon spawners that occupy a river (or a similar geographic category of spawning) involve repeated surveys, typically from September to November. Techniques vary: counting spawners while walking beside a river, from a helicopter or small airplane, while snorkeling down the river, and by analyzing video from drones. There are many other methods to estimate abundance: mark-recapture, conductivity sensors, counting fences, and eDNA. Other salmon species may be counted in a survey for sockeye, and each salmon species has a diversity of enumeration methods.
The analysis of repeated surveys of sockeye spawners is conventionally an area under the curve (AUC) estimate, where the "curve" is sometimes just straight lines joining the observations. The data may be supplemented with prior knowledge or subjective judgements about first, peak, and last presence of spawners. One improvement, with liberal assumptions about salmon behaviour, fits a Gaussian curve to the observations.
The intention of this exercise is to show, using simulated data, that AUC estimates can be improved by considering multiple rivers and multiple years in a multilevel regression. The ideas are:
- there will be a year effect, common to a group of rivers, on the temporal pattern of abundance. Marine migration might be early or late; rainfall sufficient to raise water levels for salmon to enter the rivers may be early or late.
- a river can have a consistently different temporal pattern from other rivers. Spawners that home to that river can arrive earlier or later, and be more or less spread in time.
- a river can have consistently more or fewer salmon compared to other rivers. Factors such as the length of accessible river and the abundance of gravel suitable for spawning (for redds) affect salmon abundance; and
- the abundance of spawners across all rivers varies by year. Implicit is that these spawning groups in rivers (salmon tribes?) are related within a population or conservation unit (metapopulation).
These ideas can be assembled into an equation for the expected abundance on day $d$ in river $r$ in year $y$:

$$N_{d,r,y} = \mathrm{run}_y \cdot p_r \cdot \mathrm{Normal}\!\left(d \mid \mu + a_y + a_r,\; \sigma + b_y + b_r\right)$$

where $\mathrm{run}_y$ is the total abundance in year $y$, $p_r$ is the proportion of the run that enters river $r$, $\mu$ and $\sigma$ are the background timing and spread, $a_y$ and $a_r$ are year and river effects on timing, and $b_y$ and $b_r$ are year and river effects on spread.
A Gaussian might be a sufficient description for the temporal abundance of spawners despite being symmetrical and unbounded. Reality might be better described by a truncated and skewed Gaussian distribution, or by a distribution that is inherently bounded. The beta distribution, defined on a bounded interval and flexibly skewed, is one such candidate.
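As a quick illustration (a sketch with hypothetical bounds and shape parameters, not fitted to anything), a beta density rescaled from (0, 1) to a window of days is bounded and can be skewed, unlike the Gaussian:

# sketch: a scaled beta density as a bounded, skewable alternative to
# the Gaussian. first and last are hypothetical limits of spawner presence.
first <- 250; last <- 310
day <- first:last
u <- (day - first)/(last - first)                   # map days onto (0, 1)
rel <- dbeta(u, shape1=3, shape2=5)/(last - first)  # density on the day scale
plot(day, rel, type="l", xlab="Day of Year", ylab="Relative abundance")
lines(day, dnorm(day, 273, 10), lty="dotted")       # Gaussian for comparison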
# Local Functions

# add dimension names to an array. There has to be a better way.
ArrayDimNames <- function(a, names){
  # a is an array
  # names has length(dim(a)), one name for each dimension
  dims <- dim(a); ndims <- length(dims)
  if (ndims > 4) stop("ArrayDimNames: only defined for 3 or 4 dimensions")
  result <- list(rep(names[1], dims[1]), rep(names[2], dims[2]),
                 rep(names[3], dims[3]))
  if (ndims == 4){
    result <- list(rep(names[1], dims[1]), rep(names[2], dims[2]),
                   rep(names[3], dims[3]), rep(names[4], dims[4]))
  }
  return(result)
}
# example:
# ArrayDimNames(a=array(dim=c(3,3,3)), names=c("dog","fox","bat"))
# [[1]] [1] "dog" "dog" "dog"
# [[2]] [1] "fox" "fox" "fox"
# [[3]] [1] "bat" "bat" "bat"

# day of the year at the end of each month, if not a leap year
doy <- cumsum(c(0,31,28,31,30,31,30,31,31,30,31,30,31))
# 0 31 59 90 120 151 181 212 243 273 304 334 364
# used as doy[month] where month is 1:12.

# truncated Normal distribution (tails trimmed). This begs to be recursive.
RtruncNorm <- function(n=1, m=0, s=1, t=2){
  # t is truncation: how many sd tolerable? default is 2
  a <- rnorm(n)
  r <- max(abs(a))
  while (r > t){
    j <- abs(a) > t        # where either tail exceeds truncation
    a[j] <- rnorm(sum(j))  # try again where exceeded
    r <- max(abs(a))       # OK yet?
  }
  result <- a*s + m
  return(result)
}
Examples of Samples
Temporal patterns from year and river effects.
The simulated data applies the normal distribution for the temporal abundance of spawners, with timing (mean, $\mu$) and spread (standard deviation, $\sigma$) shifted by year and river effects.
The following code creates a three-dimensional array of nday observations in each of nriver rivers in each of nyear years.
# mean and SD of salmon abundance after removing year
# and river effects. Day 273 is September 30 (not leap year)
background <- c(273, 7)
# storage for simulated data
nday <- nriver <- nyear <- 5
# remember: arrays are stored by first index
trueAbundance <- array(dim=c(nday, nriver, nyear))
dimnames(trueAbundance) <- list(day=1:nday, river=LETTERS[1:nriver],
                                year=1991:(1990+nyear))
# river size effect: proportion, across all years, of spawners in that river
river_size <- runif(nriver, 1, 10)        # random uniform
river_size <- river_size/sum(river_size)  # proportion
# temporal effects
yearSD  <- c(5, 1.5)  # effect on timing and spread
riverSD <- c(5, 1.5)
yearEffect  <- cbind(rnorm(nyear, 0, yearSD[1]),
                     rnorm(nyear, 0, yearSD[2]))   # 2 columns
riverEffect <- cbind(rnorm(nriver, 0, riverSD[1]),
                     rnorm(nriver, 0, riverSD[2]),
                     river_size)                   # 3 columns
dimnames(yearEffect)  <- list(year=1991:(1990+nyear),
                              effect=c("timing", "spread"))
dimnames(riverEffect) <- list(river=LETTERS[1:nriver],
                              effect=c("timing", "spread", "size"))
cat(' Year Effect \n');    print(yearEffect)
cat('\n River Effect \n'); print(riverEffect)
Examine total effect.
The random, additive effects of rivers (A, B, C, ...) and years (1991, 1992, ...) are calculated for timing and spread.
b1 <- matrix(nrow=nriver, ncol=nyear)
dimnames(b1) <- list(river=LETTERS[1:nriver], year=1991:(1990+nyear))
b4 <- b3 <- b2 <- b1
cat("riverEffect+yearEffect. Timing (mean) \n")
for (year in 1:nyear){
  for (river in 1:nriver){
    b1[river, year] <- riverEffect[river,1] + yearEffect[year,1]
  }
}
print(b1)
cat("\n riverEffect+yearEffect. Spread (SD) \n")
for (river in 1:nriver){
  for (year in 1:nyear){
    b2[river, year] <- riverEffect[river,2] + yearEffect[year,2]
  }
}
print(b2)
# earliest and latest abundance across all rivers and years,
# needed to determine sampling days
b3 <- b1 - 1.5*(b2 + background[2])  # earliest case revealed
cat("\n Early due to river and year effects \n"); print(b3)
b4 <- b1 + 1.5*(b2 + background[2])  # latest case revealed
cat("\n Late due to river and year effects \n"); print(b4)
Generate sampling days
As a starting point, sampling days are the same for every river in every year. The range of sampling days has to cover the range of days when salmon are abundant in any river. That is determined from -1.5 sd of the abundance distribution for the earliest combination of river and year effects, and +1.5 sd for the most delayed combination.
# sampling days. Cover expected range across all years and rivers.
j  <- which.min(b3)          # earliest case
a1 <- b3[j] + background[1]  # value for this case
cat('earliest case', a1, 'timing effect', b1[j], 'spread effect', b2[j], '\n')
j  <- which.max(b4)          # most delayed case
a2 <- b4[j] + background[1]
cat('last case', a2, 'timing effect', b1[j], 'spread effect', b2[j], '\n')
days <- seq(a1, a2, length.out=nday)
days <- floor(days)          # integer days
cat("\n Days of the year for observations:", days, "\n")
Abundance by day, one river, many years
The relative abundance of salmon in one river is taken as fixed and independent of years. This is true only when considering the potential abundance of spawners ("capacity" refers to some measure of the output of salmon, not the input of spawners). The spawning populations in a river may thrive or dwindle for many reasons: fishing, predators, disease, and so on. These random effects within rivers across years will fall into the unexplained variance.
# total abundance by year
totalYearly <- rnorm(nyear, 1e5, 0.25e5)  # abundance by year
# abundance by day by river by year
for (river in 1:nriver){
  for (year in 1:nyear){
    trueAbundance[, river, year] <- totalYearly[year] * riverEffect[river,3] *
      dnorm(days,
            background[1] + riverEffect[river,1] + yearEffect[year,1],  # mean
            background[2] + riverEffect[river,2] + yearEffect[year,2])  # SD
  }
}
j <- is.na(trueAbundance) | trueAbundance < 0.1
trueAbundance[j] <- 0
par(tck=0.02, las=1, yaxs="i")
color <- c("black","blue","green","orange","red")
plot(1, 1, type="n", xlim=c(-5,5)+range(days),
     ylim=c(0.9,1.1)*range(trueAbundance),
     xlab="Day of Year", ylab="Abundance")
axis(3, labels=F); axis(4, labels=F)
for (river in 1:nriver){
  for (year in 1:nyear){
    lines(days, trueAbundance[, river, year], col=color[river])
  }
}
abline(v=days, lty="dotted", col="grey50")
Area under the curve: conventional
To establish a contrast to AUC by river by year, we calculate the integral of total salmon abundance by the linear interpolation method and by fitting a normal distribution. Adjusting the linear interpolation with considerations such as a date for unobserved peak spawning is not included (yet).
A spawning salmon may be observed in more than one survey, so a key parameter for translating AUC into salmon abundance is the duration of a spawner in a river. While that duration may be a haphazard estimate, it directly affects abundance estimates and needs to be included in these models. Potentially all of the spawners observed after peak abundance were also observed at peak abundance; this has led to simplifying survey analyses to the maximum number of spawners observed. That value is included in the results because a relative abundance of spawners from some uncalibrated index, such as maximum spawners observed across multiple surveys in one river and one year, may be sufficient for fisheries management.
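For instance (hypothetical numbers, assuming a single known residence time), dividing the AUC in fish-days by the residence time of a spawner gives the number of spawners:

# sketch: translate AUC (fish-days) into spawners via residence time
auc_fish_days  <- 120000  # hypothetical AUC: sum of daily abundances
residence_days <- 12      # assumed duration of a spawner in the river
spawners <- auc_fish_days/residence_days
spawners                  # 10000 spawners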
Calculations without observation error
Knowing how the precision of estimates of the total changes with increasing observation error seems valuable. The baseline is zero observation error. The following code provides functions for exploring the effects of observation error, which I have not yet done (perhaps Gaussian Process regression?).
function for linear AUC
# linear interpolation with irregular intervals.
# each observation applies to half the preceding interval and half the following.
AUC <- function(days, dat){  # x, y vectors
  n <- length(dat)
  if (length(days) != n) stop("AUC: lengths differ")
  interval <- days[-1] - days[-n]        # second minus first, etc. length n-1
  weight <- interval/2                   # n-1 of these
  weight <- c(0, weight) + c(weight, 0)  # for non-uniform intervals
  auc <- sum(weight * dat)
  # cat('weight', weight, '\n')
  return(auc)
}
test_dat <- c(1, 3, 5, 2, 1)   # not symmetrical
test_day <- c(2, 5, 8, 11, 14)
AUC(test_day, test_dat)        # 33
linear function to fit Gaussian distribution
The log of a Gaussian distribution scaled by area $A$ is a parabola:

$$\log\!\big(A\,\mathrm{Normal}(x \mid \mu, \sigma)\big) = a + b\,x + c\,x^2$$

where

$$c = -\frac{1}{2\sigma^2}, \qquad b = \frac{\mu}{\sigma^2}, \qquad a = \log A - \log(\sigma\sqrt{2\pi}) - \frac{\mu^2}{2\sigma^2}.$$

After linear regression of log abundance on day and day squared, the Gaussian parameters are recovered as $\mu = -b/(2c)$ and $\sigma = \sqrt{-1/(2c)}$, with $c < 0$ required.
The Gaussian integral (Laplace 1782, Gauss 1809) is the area under the curve of the function $e^{-x^2}$:

$$\int_{-\infty}^{\infty} e^{-x^2}\,dx = \sqrt{\pi}.$$

There is an extension applicable to the parabola obtained by regression with the log of abundance data. The antilog of the preceding fitted parabola (parameters $a$, $b$, $c$ with $c < 0$) has the integral

$$\int_{-\infty}^{\infty} e^{\,a + bx + cx^2}\,dx = \sqrt{\frac{\pi}{-c}}\; \exp\!\left(\frac{b^2}{4(-c)} + a\right).$$

That integral is the total abundance, aka area under the curve, aka the Gaussian scaling factor.
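A quick numerical check of that closed form, with arbitrary coefficients (c0 < 0):

# verify the closed-form integral against numerical integration
a0 <- 1.0; b0 <- 0.5; c0 <- -0.1
closed <- sqrt(pi/(-c0)) * exp(b0^2/(4*(-c0)) + a0)
numerical <- integrate(function(x) exp(a0 + b0*x + c0*x^2), -Inf, Inf)$value
c(closed=closed, numerical=numerical)  # agree to numerical tolerance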
Fitting to log abundance emphasizes the contribution to the sum of squared residuals from small abundance estimates compared to large ones. Are small abundance estimates more or less precise than large ones? Large estimates are less precise when there are too many salmon to count individually. If the precision of each abundance estimate can be quantified, then a weighted regression is a trivial extension, as sketched below. This is deferred, because it is more likely that relative precision is known (a ranking, ordinal), which means fitting weights instead of knowing weights (Akenhead et al. 2016).
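A minimal sketch of that extension, with hypothetical weights (larger = more precise) applied to the toy data used below:

# sketch: weighted regression of the log-abundance parabola.
# w is hypothetical relative precision, not from any survey.
test_dat <- c(1, 3, 5, 2, 1)
test_day <- c(2, 5, 8, 11, 14)
w <- c(4, 2, 1, 2, 4)  # e.g. small counts more precise
a_w <- lm(log(test_dat) ~ test_day + I(test_day^2), weights=w)
coefficients(a_w)      # compare to the unweighted fit below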
# the log of a normal distribution is a parabola
AUCnormal <- function(dat, day){
  day2 <- day^2
  y <- log(dat)
  a <- lm(y ~ day + day2)
  return(a)
}
test_dat <- c(1, 3, 5, 2, 1)   # not symmetrical
test_day <- c(2, 5, 8, 11, 14)
a <- AUCnormal(test_dat, test_day)
p <- coefficients(a)
print(summary(a)); print(p)
y <- predict(a)  # at test_day by default, predicted by regression
par(tcl=0.2)
plot(log(test_dat) ~ test_day, pch=20, yaxs="i",
     xlim=c(0, 1.2*max(test_day)), ylim=range(y)*c(0.8, 1.2))
points(test_day, y)
x <- seq(min(test_day), max(test_day), length.out=20)
y <- p[1] + p[2]*x + p[3]*x^2
lines(x, y)
# extract Gaussian parameters from regression parameter estimates
mu <- -0.5*p[2]/p[3]
sigma <- sqrt(-0.5/p[3])
# be careful: the regression y = a + bx + cx^2 fits a negative c,
# so the formulas use -c. So,
auc <- sqrt(pi/(-p[3])) * exp((p[2]^2)/(4*(-p[3])) + p[1])
cat('\n mu =', mu, 'sigma =', sigma, 'auc =', auc, '\n')
abline(v=mu + sigma*c(-2,-1,0,1,2), lty='dotted', col='blue')
# not logged
plot(test_dat ~ test_day, pch=20, yaxs="i",
     xlim=c(0, 1.2*max(test_day)), ylim=c(0, 1.2*max(test_dat)))
y1 <- exp(y)  # antilog
lines(x, y1)
mean0 <- sum(test_dat*test_day)/sum(test_dat)
sd0 <- sqrt(sum(test_dat*(test_day - mean0)^2)/(sum(test_dat) - 1))
# cat('\n simple weighted mean =', mean0, ' with sd =', sd0, '\n')
y2 <- dnorm(x, mean0, sd0) * auc  # crude normal scaled to total abundance
lines(x, y2, col="blue")
abline(v=mean0 + sd0*c(-2,-1,0,1,2), lty='dotted', col='blue')
Nonlinear regression (search) for normal distribution
This model minimizes the sum of squared residuals over trial values of the parameters, estimating timing (mean), spread (standard deviation), and total abundance (the Gaussian scale factor) directly.
# the model and the optimization in one function
fitGaussian <- function(x, y, mu, sigma, scale){
  f <- function(p){  # p is c(mu, sigma, scale)
    d <- p[3]*dnorm(x, mean=p[1], sd=p[2])  # predicted from trial values in p
    sum((d - y)^2)  # minimize sum of squared deviations (or another distance)
  }
  # search for minimum, starting from guesses. hessian=T returns the
  # curvature of the SSD manifold at the best fit for parameters
  optim(c(mu, sigma, scale), f, hessian=T)
}
a <- fitGaussian(test_day, test_dat, mu=7.8, sigma=3.5, scale=36)
mean1 <- a$par[1]; sd1 <- a$par[2]; auc1 <- a$par[3]
SSres <- a$value
se <- sqrt(diag(solve(a$hessian)))
r2 <- 1 - SSres/sum((test_dat - mean(test_dat))^2)  # 1 - SSres/SStotal
cat(sep='', 'mean=', round(mean1,2), '(', round(se[1],2),
    '), sd=', round(sd1,2), '(', round(se[2],2),
    '), auc=', round(auc1,2), '(', round(se[3],2),
    '), r2=', round(100*r2,0), '% \n')
par(tcl=0.2)
plot(test_dat ~ test_day, pch=20, yaxs="i",
     xlim=c(0, 1.2*max(test_day)), ylim=c(0, 1.2*max(test_dat)))
lines(x, auc1*dnorm(x, mean1, sd1))
abline(v=mean1 + sd1*c(-2,-1,0,1,2), lty='dotted', col='blue')
Monte Carlo fit
As the first step toward a Bayes-Laplace approach to fitting a multilevel regression, we fit the conventional one-river, one-year model with Hamiltonian Monte Carlo. This uses the Stan language for Bayesian statistics via the R package rstan.
Overfitting is a problem for prediction in ecological systems with a superabundance of habitat indicators. Leave-one-out statistics, where each data point is predicted from a model fitted with that point omitted, are a safeguard against overfitting. The latest development of this is "PSIS-LOO" (Vehtari et al., 2017), available via the R package loo.
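A sketch of that workflow with the loo package, assuming a Stan fit whose generated quantities block saves a log_lik vector (not run here):

# library(loo)
# log_lik <- extract_log_lik(fit, parameter_name="log_lik")  # draws x data
# psis <- loo(log_lik)  # PSIS-LOO estimate of out-of-sample prediction
# print(psis)           # elpd_loo, p_loo, and Pareto-k diagnostics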
Experience: scheduling, downloading, and booting rstan took ~70 seconds. The following test of rstan completed in 76 seconds, mainly compiling. Sampling took less than 1/10 second.
# library(rstan)
# stancode <- '
#   data {real y_mean;}
#   parameters {real y;}
#   model {y ~ normal(y_mean,1);}'
# from R help: stan_model
# mod <- stan_model(model_code = stancode)
# fit <- sampling(mod, data = list(y_mean = 0))
# print(fit)
# fit2 <- sampling(mod, data = list(y_mean = 5))
# print(fit2)
library(rstan)
stan_Gauss_code <- '
functions {
  // Gaussian to predict temporal abundance of salmon.
  // faster than (1/(sigma*sqrt(2*pi)))* exp(-(1/2)*((x-mu)/sigma)^2)
  // sqrt(2) = 1.4142135623731, sqrt(2*pi) = 2.506628274631
  vector Gauss(vector x, real mu, real sigma) {
    return (1/(sigma*2.506628274631)) *
           exp(-square((x-mu)/(1.4142135623731*sigma)));
  }
}
data {
  int<lower=0> np;    // count of priors (8)
  vector [np] prior;  // timing: m,s; spread: m,s; auc: m,s; sigma: shape,scale
  int<lower=0> n;     // count of data
  vector [n] x;       // day of year for observation
  vector [n] y;       // estimate of abundance for that day
}
parameters {
  real <lower=0> timing;  // mean of Gaussian
  real <lower=0> spread;  // standard deviation of Gaussian
  real <lower=0> auc;     // area under curve, total abundance
  real <lower=0> sigma;   // sd of residuals
}
model {
  vector [n] ypred;
  // prior knowledge
  timing ~ normal(prior[1], prior[2]);
  spread ~ normal(prior[3], prior[4]);
  auc    ~ normal(prior[5], prior[6]);
  sigma  ~ gamma(prior[7], prior[8]);
  // prediction from trial parameters
  ypred = auc * Gauss(x, timing, spread);  // Gauss returns a vector
  // likelihood of data given prediction
  y ~ normal(ypred, sigma);
}
'
stanc_ret <- stanc(model_code=stan_Gauss_code)  # check fussy syntax, compile
stan_Gauss <- stan_model(stanc_ret=stanc_ret)   # create model
# create test data and initial guesses
# background temporal distribution
timing <- 280.0; spread <- 15.0; auc <- 2000.0
n <- 8  # number of days observed
# x is day of observation (day of Julian year)
x <- floor(seq(from=timing - 2*spread, to=timing + 2*spread, length.out=n))
# y is estimates of abundance
y0 <- auc*dnorm(x, timing, spread)   # without observation error
y  <- round(y0*rnorm(n, 1, 0.2), 0)  # add 20% error, proportional to y0
cat(' x =', x, '\n y =', y, '\n')
# ----- priors
# guess for auc is linear AUC
auc <- round(AUC(x, y), 0)
# guess for timing is day weighted by count
ptm <- round(sum(x*y)/sum(y), 0)  # ptm abbreviates prior timing mean
# guess for spread is weighted variance
psm <- round(sqrt(sum(y*(x - ptm)^2)/sum(y)), 0)  # ignore supposed bias
plot(x, y0, pch=20, ylim=c(0, 1.2*max(y)))
points(x, y)
lines(x, auc*dnorm(x, ptm, psm))
cat(sep='', 'calculated priors: timing=', ptm, ', spread=', psm,
    ', AUC =', auc, '\n')
# sigma (sd of normal residuals) has prior gamma(shape, scale)
k <- exp(seq(-4, 3, 0.25))  # interested in behaviour near zero
plot(k, dgamma(k, shape=5, rate=1), type="l", ylim=c(0, 0.25))
lines(k, dgamma(k, shape=10, rate=1), col="red")
prior <- c(
  timing_m    = ptm,       # 1. about 280 (September 30 is day 273)
  timing_s    = 7.0,       # 2. range, 4 sd is 1 moon
  spread_m    = psm,       # 3. about 2 moons
  spread_s    = 3,         # 4. range 1 to 4 moons. Avoid negative!
  auc_m       = auc,       # 5. simplest estimate
  auc_s       = 0.25*auc,  # 6. sd is 1/4 of the estimate
  sigma_shape = 10.0,      # 7. high and wide. Y should be scaled: y/max(y)
  sigma_scale = 1.0)       # 8.
print(prior)  # shows names
dat <- list(np=8, prior=prior, n=8, x=x, y=y)
# might need to initialize HMC
# one chain for testing
# init1 <- list(
#   chain1=list(timing=ptm, spread=psm, auc=auc, sigma=10.0))
# print(as.data.frame(init1))
# four chains for fitting
# init4 <- list(
#   chain1=list(timing=ptm*1.10, spread=psm*1.10, auc=auc*1.10, sigma=12.0),
#   chain2=list(timing=ptm*1.05, spread=psm*1.05, auc=auc*1.05, sigma=10.0),
#   chain3=list(timing=ptm*0.95, spread=psm*0.95, auc=auc*0.95, sigma=8.0),
#   chain4=list(timing=ptm*0.90, spread=psm*0.90, auc=auc*0.90, sigma=6.0))
# print(as.data.frame(init4))
# I used these print statements in the model block to debug Stan code
# with (iter=1, chains=1):
# // print("timing =",timing);  // to check Stan code
# // print("spread =",spread);
# // print("auc =",auc);
# // print("sigma =",sigma);
# // print("x =",x);
# // print("y =",y);
# // print("ypred =",ypred);
## test my function in the Stan function block
# expose_stan_functions(stan_Gauss)
# Gauss(c(2,3,4), 0, 1)
# [1] 0.0539909665 0.0044318484 0.0001338302
# dnorm(c(2,3,4), 0, 1)
# [1] 0.0539909665 0.0044318484 0.0001338302
fit <- sampling(stan_Gauss, data=dat, chains=4, iter=2000)  # init=init1
print(fit)
pairs(fit, condition=0.2)
Experience
Stan is "strongly typed" to enforce rigor in coding, an attempt to avoid subtle bugs. Which takes some getting used to, a contrast to R. I found vector [n] x confusing compared to real x [n], the latter being an array, so you could have real [m] x [n] which would be n arrays of m values. And then row_vectors which are not vectors or arrays, nor rows of a matrix. I see the point: protect domain experts (salmon ecologists) from their vague notions of linear algebra.
This model does not work with the default improper priors; their use causes the sampling to get stuck, so explicit priors are supplied for all four parameters.
Parameters were recovered accurately, but that is a trivial result because the true model was known. In practice, the true model is unknown (infinitely complicated) and a serviceable while simplistic description must be discovered by exploring real data.
Many Rivers, Many Years
Multilevel Regression
A simple, linear, multivariate, multilevel regression model would be

$$y_i = X_i\,\beta_{j[i]} + \epsilon_i$$

where $X$ is the matrix of independent variables and the fitted parameters $\beta_j$ vary by group $j$ (here, river or year), with the group-level coefficients drawn from a common distribution.
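A sketch of the three pooling schemes, written out for contrast, for observations $i$ in groups $j[i]$:

$$\text{complete pooling:}\quad y_i = X_i\beta + \epsilon_i$$
$$\text{no pooling:}\quad y_i = X_i\beta_{j[i]} + \epsilon_i,\quad \beta_j \text{ estimated independently}$$
$$\text{partial pooling:}\quad y_i = X_i\beta_{j[i]} + \epsilon_i,\quad \beta_j \sim \mathrm{Normal}(\mu_\beta, \Sigma_\beta)$$

Partial pooling estimates the group-level distribution ($\mu_\beta$, $\Sigma_\beta$) and the group coefficients simultaneously, which shrinks each $\beta_j$ toward the common mean.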
Some quotes, about how "partial pooling" shares information among groups, might be helpful at this point:
"An extreme approach would be to completely pool all the data and estimate a common vector of regression coefficients β. At the other extreme, an approach with no pooling assigns each group its own coefficient vector
"While complete pooling or no pooling of data across groups is sometimes called for, models that ignore the grouping structures in the data tend to underfit or overfit" (Gelman et al., 2013). Hierarchical [multilevel] modeling provides a compromise by allowing parameters to vary by group at lower levels of the hierarchy while estimating common parameters at higher levels. Inference for each group-level parameter is informed not only by the group-specific information contained in the data but also by the data for other groups as well. This is commonly referred to as borrowing strength or shrinkage." - http://mc-stan.org/rstanarm/articles/glmer.html
River and Year Effects on Timing and Spread
The background model for temporal distribution of salmon spawner abundance is a distribution (not yet identified, Gaussian by default) with two parameters: timing and spread. This is applied to all nyear years and all nriver rivers. The other part of the background is the overall abundance of salmon for each year ("run size"). The extent to which the rivers share a single run of salmon is a matter of subsequent ecological refinement; this analysis starts with the assumption of one run each year shared unevenly across rivers.
The departure from that background for any one river, the river effect on timing and spread, is two parameters. If the data for all rivers were pooled, there would be only the two background parameters; with no pooling, each river would have its own independent pair. Partial pooling instead treats each river's departures as draws from a shared distribution centred on zero, here a double exponential (Laplace) distribution, illustrated below.
# example of the double exponential (Laplace) probability density
LaPlace <- function(x, mu, sigma){
  (1/(2*sigma)) * exp(-abs(x - mu)/sigma)
}
x <- seq(-5, 5, 0.5)
mu <- 0.0
sigma <- 2.0
y <- LaPlace(x, mu, sigma)
par(las=1, tcl=0.2, pch=20, yaxs="i")
plot(y ~ x, type='o', ylim=c(0.0, 1.1/(2*sigma)), ylab='Pr(x)')
River effects are modeled the same way. This first model ignores any year × river effects (interactions between year and river).
Those examples indicate there will be habitat indicators that are covariates (predictors) for the stream and year effects on temporal distributions. It is very clear that spawner behaviour on the West Coast of Vancouver Island is affected by stream flows: spawners assemble in the ocean near a river (estuary) and enter after a rainfall event raises water levels in the river; spawners wait in pools in the river before spawning in response to a rainfall/flow event. This model does not include such behaviour and the attendant habitat indicators, but it lays the foundation for such extensions. In fact, the objective of this exercise is to develop a framework that uses what we know about salmon ecology to improve estimates of salmon abundance.
River and Year Effects on Abundance
The area under the curve of spawner abundance is not the same as the number of spawners because a spawner will reside in the spawning ground for many days. That residence time may be longer for females guarding redds than for males that drift downriver after spawning. A model that accumulated identified spawners would involve a cumulative distribution function, possibly preferable. In this case, the total annual abundance of spawners over all rivers in a year is not the run size considered as the number of salmon independent of multiple observations of the same fish. Determining the number of fish from the sum of daily abundances requires knowing the duration of a fish in a river. That will be an important calibration parameter, but probably complicated by the need for habitat indicators: perhaps warmer river temperatures mean shorter spawning-ground residence.
Abundance is primarily a year effect. There is little prior knowledge in this model, other than non-negativity, but that will not be typical. Autocorrelations apply, such as known long trends in salmon abundance and four-year generation cycles. The application of this model is likely to involve known abundances for all but the last year.
Independent of years (or averaged over all years) there is a river effect for abundance: the proportion of the total that goes to each river. Perhaps proportions vary due to spawning ground area and/or quality, or river length. The existence of sub-populations (or populations within a metapopulation) of salmon that home to one stream and not to others is an important complication. This model assumes that any such sub-populations vary identically among years. The proportions by river sum to the entire abundance of salmon, so the fitted values for that vector of parameters (with length nriver) are not independent; they must sum to unity. Such a vector is a simplex, and a specific probability model is required to meet that constraint: the Dirichlet distribution (Gelman et al., 2013).
After considering temporal distributions and abundance by year and river proportion, the abundance on day $d$ in year $y$ in river $r$ is estimated as

$$N_{d,r,y} = \mathrm{run}_y \cdot p_r \cdot \mathrm{Normal}\!\left(d \mid \mu + a_y + a_r,\; \sigma + b_y + b_r\right)$$

where the timing $\mu$ and spread $\sigma$ are background parameters shared by all rivers and years, $a_y$ and $a_r$ are the year and river effects on timing, $b_y$ and $b_r$ are the effects on spread, $\mathrm{run}_y$ is the annual run size, and $p_r$ is the proportion of the run entering river $r$. The AUC for river $r$ in year $y$ is then $\mathrm{run}_y \cdot p_r$.
River proportion
Describing the relative abundance of spawners between rivers as the proportion of salmon that enter each river means the fitted estimates for those proportions across all rivers must sum to 1. That vector of non-negative parameters is a unit simplex; the Dirichlet distribution (Stan Functions Reference v2.19 §23.1 p126) describes the multinomial probabilities for values within a simplex. The Dirichlet parameter describes how even the probabilities are, with large values (> 100) creating an even distribution and small values (< 1/100) creating extremes (one proportion near 1 and all others near zero). Using 1 as the default looks uniformly random.
data {
  int <lower=1> nriver;
  real <lower=0> alpha;  // large alpha means even proportions
}
parameters {
  simplex [nriver] river_prop;  // constrained 0 < x < 1 and sum = 1
}
model {
  river_prop ~ dirichlet(rep_vector(alpha, nriver));  // sampling run proportions
}
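To see the effect of the concentration parameter, a minimal R sketch using the standard gamma-draw construction of a Dirichlet sample (base R has no rdirichlet):

# one Dirichlet draw of length n via normalized gamma draws
rdirichlet1 <- function(n, alpha){
  g <- rgamma(n, shape=alpha, rate=1)
  g/sum(g)
}
set.seed(1)
round(rdirichlet1(5, 100), 3)   # large alpha: nearly even proportions
round(rdirichlet1(5, 1), 3)     # alpha = 1: uniform on the simplex
round(rdirichlet1(5, 0.01), 3)  # tiny alpha: one proportion near 1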
Model Overview
"In theory, a Bayesian model should include all relevant substantive knowledge and subsume all possible theories. In practice, it won’t." - Jeffrey B. Arnold https://jrnold.github.io/bayesian_notes/model-checking.html
Any model is one of many possible variants, so we need a basis for comparing models. An indicator of model fit based on predictive ability uses cross-validation: leave one observation out (LOO) when fitting, then predict that excluded observation. Conceptually, the fitting is repeated for every data point, but there is a mathematical shortcut. The required information is the log likelihood of each data point, produced and saved in the generated quantities block of a Stan model. http://mc-stan.org/loo/reference/loo-package.html
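Until the generated quantities block is sorted out (see the note in the Stan code below), the pointwise log likelihood can be computed in R from the posterior draws; a sketch for the single-river fit above (not run here), recomputing ypred per draw:

# post <- rstan::extract(fit)  # draws of timing, spread, auc, sigma
# ndraw <- length(post$sigma)
# log_lik <- matrix(NA, ndraw, length(x))
# for (s in 1:ndraw){
#   ypred <- post$auc[s] * dnorm(x, post$timing[s], post$spread[s])
#   log_lik[s, ] <- dnorm(y, ypred, post$sigma[s], log=TRUE)
# }
# loo::loo(log_lik)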
Stan code
This code uses an overly simplified version of salmon abundance patterns, but tries to get the surrounding statistical model right.
library(rstan)
stan_MultiGauss_code <- '
functions {
  // Gaussian to predict temporal abundance of salmon.
  // faster than (1/(sigma*sqrt(2*pi)))* exp(-(1/2)*((x-mu)/sigma)^2)
  // sqrt(2) = 1.4142135623731, sqrt(2*pi) = 2.506628274631
  // NOT the vector version
  real Gauss(real x, real mu, real sigma) {
    return (1/(sigma*2.506628274631)) *
           exp(-square((x-mu)/(1.4142135623731*sigma)));
  }
}
data {
  int<lower=0> np;       // length of vector of priors,
  vector [np] prior;     // does not include river_prop_prior
  int<lower=0> nriver;   // how many rivers
  int<lower=0> nyear;    // how many years
  int<lower=0> ndata;    // how many data
  vector [ndata] x;      // day of year for each observation
  vector [ndata] y;      // estimate of abundance for each day in x
  int river_id [ndata];  // id of river in data. integer array
  int year_id [ndata];   // id of year
  vector [nriver] river_prop_prior;  // Dirichlet, vector of ones, even
}
parameters {
  vector <lower=0> [nyear] run;  // annual spawner abundance
  real <lower=0> timing;         // all years and rivers
  real <lower=0> spread;         // all years and rivers
  vector [nyear] year_timing;    // by year, all rivers
  vector [nyear] year_spread;    // by year, all rivers
  vector [nriver] river_timing;  // by river, all years
  vector [nriver] river_spread;  // by river, all years
  simplex [nriver] river_prop;   // fraction of run
  real <lower=0> sigma;          // stdev of Gaussian residuals
}
model {
  // local variables
  vector [ndata] ypred;  // predicted
  int year;              // index
  int river;             // index
  real timing_pred;      // background+yearEffect+riverEffect
  real spread_pred;      // background+yearEffect+riverEffect
  // prior knowledge of probability distributions of parameters
  timing ~ normal(prior[1], prior[2]);  // background
  spread ~ normal(prior[3], prior[4]);  // background
  run ~ normal(prior[5], prior[6]);
  year_timing ~ double_exponential(0, prior[7]);   // shared, all years
  year_spread ~ double_exponential(0, prior[8]);
  river_timing ~ double_exponential(0, prior[9]);  // shared, all rivers
  river_spread ~ double_exponential(0, prior[10]);
  river_prop ~ dirichlet(river_prop_prior);        // separate data vector
  sigma ~ gamma(prior[11], prior[12]);             // wide
  // prediction from a sampled point in parameter space
  for (j in 1:ndata){
    year = year_id[j];
    river = river_id[j];
    timing_pred = timing + year_timing[year] + river_timing[river];
    spread_pred = spread + year_spread[year] + river_spread[river];
    ypred[j] = run[year] * river_prop[river] *
               Gauss(x[j], timing_pred, spread_pred);
  }
  // accumulate log likelihood of data given this prediction
  y ~ normal(ypred, sigma);
}
'
# generated quantities {
#   vector[ndata] log_lik;
#   log_lik = normal_lpdf(ypred | y, sigma);
# }
# YPRED DOES NOT EXIST in generated quantities; it would have to be
# recomputed there (or the log likelihood computed in R, as sketched above).
stanc_ret <- stanc(model_code=stan_MultiGauss_code)  # parse
stan_MultiGauss <- stan_model(stanc_ret=stanc_ret)   # create model
Test Data
If the statistical model cannot recover the parameters (within sampling variance) behind simulated data that exactly matches the model, then something is wrong. I started with 5 rivers and 5 years but moved to 10 of both; the application will likely be more like 20 of both. In reality, timing and spread will be responses to river flow events, clearly so for WCVI, less so for the Fraser River with its protracted freshwater migration. Big rivers still have habitat events that control salmon migrations, such as high water temperature blocking the Okanogan River for Osoyoos Lake sockeye waiting in the slightly cooler Columbia River.
The most worrisome bit is the proportion of the run that goes to each river, averaged across all years. That involves Dirichlet priors for proportions (a simplex, summing to 1). And it seems unrealistic: the proportions will not be constant, and it assumes one population spread across rivers rather than one population per river, in which case the number of spawners in each river would be independent of the other rivers.
# data and initial guesses for testing stan_MultiGauss
# dat[] will have 10 items:
#   int<lower=0> np;       // length of vector of priors,
#   vector [np] prior;     // does not include river_prop_prior
#   int<lower=0> nriver;   // how many rivers
#   int<lower=0> nyear;    // how many years
#   int<lower=0> ndata;    // how many data
#   vector [ndata] x;      // day of year for each observation
#   vector [ndata] y;      // estimate of abundance for each day in x
#   int river_id [ndata];  // id of river in data. integer array
#   int year_id [ndata];   // id of year
#   vector [nriver] river_prop_prior;  // Dirichlet, even when vector of ones

# setup to simulate salmon abundance (before observing)
nriver <- 10
nyear <- 10
timing <- 280.0  # day, background timing (mean of Gaussian)
spread <- 10.0   # days, background spread (SD of Gaussian)
year_timing_sd <- 4
year_spread_sd <- 2
river_timing_sd <- 4
river_spread_sd <- 2
year_timing <- sort(rnorm(nyear, 0, year_timing_sd))
year_spread <- rnorm(nyear, 0, year_spread_sd)
river_timing <- sort(rnorm(nriver, 0, river_timing_sd), decreasing=T)
river_spread <- rnorm(nriver, 0, river_spread_sd)
# run size: 3k to 7k, so observations are > 1
run <- 1000.0 * seq(3, 7, length.out=nyear)  # normal-ish, mean 5k
# river proportion of run (over all years)
river_prop <- c(1:nriver) + 3
river_prop <- river_prop/sum(river_prop)  # 10 rivers: 5% to 15%
river_prop_prior <- rep(1, nriver)
cat('background timing =', timing, 'background spread =', spread,
    '\n year_timing = \n ', round(year_timing, 2),
    '\n year_spread = \n ', round(year_spread, 2),
    '\n river_timing = \n ', round(river_timing, 2),
    '\n river_spread = \n ', round(river_spread, 2),
    '\n run = \n ', run,
    '\n river_prop = \n ', round(river_prop, 3), '\n')
# set days of observation in each river, always the same.
nday <- 6  # surveys in one river in one year
from <- timing - 2*spread; to <- timing + 2*spread
x1 <- floor(seq(from, to, length.out=nday))  # days of year for obs.
x <- rep(x1, nyear*nriver)  # in every river every year
# id: index into x and y for river and year
ndata <- nday*nriver*nyear
year_id <- rep(1:nyear, each=nriver*nday)  # 60 of 1, 60 of 2, ... length 600
river_id <- rep(rep(1:nriver, each=nday), nyear)
# 1,1,1,1,1,1, 2,2,2, ... 10,10,10, 1,1,1, ... for 600
# y is simulated observations. order is year, river, day. an array may be better.
y <- numeric(nyear*nriver*nday)
jn <- 0  # index into data
for (jy in 1:nyear){
  for (jr in 1:nriver){
    t <- timing + year_timing[jy] + river_timing[jr]
    s <- spread + year_spread[jy] + river_spread[jr]  # was year_spread[jr], a bug
    r <- run[jy] * river_prop[jr]
    for (jd in 1:nday){
      jn <- jn + 1
      y[jn] <- r * dnorm(x1[jd], t, s)  # true abundance
    }
  }
}
yobs <- y * rnorm(length(y), 1, 0.1)  # 10% observation error, proportional
a <- cbind(year_id, river_id, x, round(y,2), round(yobs,2))
print(a[c(1:35, 565:600), ])
plot some simulated years
par(las=1, tcl=0.2, pch=20, yaxs="i", mgp=c(0,.5,0), mar=c(0,0,.2,0),
    mfrow=c(5,1))
colr <- rep(c("black","blue","green","orange","red"), 2)  # recycled, 10 rivers
years <- c(1, 4, 6, 8, 10)
for (year in years){
  b <- a[(a[,1] == year), ]
  ymax <- max(b[,4:5])
  plot(y ~ x, type='n', ylim=c(0, ymax), xaxt="n", ylab='', xlab='')
  axis(1, labels=F); axis(3, labels=F); axis(4, labels=F)
  k <- 0
  for (j in seq(0, nriver*nday - 1, nday)){  # 0, 6, 12, ..., 54
    k <- k + 1
    lines( b[(j+1):(j+nday), 3], b[(j+1):(j+nday), 4], col=colr[k])  # nday obs.
    points(b[(j+1):(j+nday), 3], b[(j+1):(j+nday), 5], col=colr[k])
  }
}
plot some simulated rivers
par(las=1, tcl=0.2, pch=20, yaxs="i", mgp=c(0,.5,0), mar=c(0,0,.2,0),
    mfrow=c(5,1))
colr <- rep(c("black","blue","green","orange","red"), 2)  # recycled, 10 years
for (river in c(1, 4, 6, 8, 10)){
  b <- a[(a[,2] == river), ]
  plot(y ~ x, type='n', ylim=c(0.0, max(b[,4:5])), xaxt="n", ylab='', xlab='')
  axis(1, labels=F); axis(3, labels=F); axis(4, labels=F)
  k <- 0
  for (j in seq(0, nyear*nday - 1, nday)){  # 0, 6, 12, ..., 54
    k <- k + 1
    lines( b[(j+1):(j+nday), 3], b[(j+1):(j+nday), 4], col=colr[k])  # nday obs.
    points(b[(j+1):(j+nday), 3], b[(j+1):(j+nday), 5], col=colr[k])
  }
}
Define prior knowledge
# ----- priors --------
# prior knowledge of probability distributions of parameters
prior <- numeric(12)
# timing ~ normal(prior[1], prior[2]);  // background
prior[1:2] <- c(timing, 7)   # too tight, loosen later
# spread ~ normal(prior[3], prior[4]);  // background
prior[3:4] <- c(spread, 3)   # too tight, loosen later
# run ~ normal(prior[5], prior[6]);
prior[5:6] <- c(5000, 1000)  # too tight, loosen later
# year_timing ~ double_exponential(0, prior[7]);   // shared, all years
# year_spread ~ double_exponential(0, prior[8]);
# river_timing ~ double_exponential(0, prior[9]);  // shared, all rivers
# river_spread ~ double_exponential(0, prior[10]);
prior[7:10] <- c(year_timing_sd, year_spread_sd,
                 river_timing_sd, river_spread_sd)
# sigma ~ gamma(prior[11], prior[12]);  // wide
prior[11:12] <- c(10, 1)
# river_prop ~ dirichlet(river_prop_prior);  // separate data vector
river_prop_prior <- rep(1.0, nriver)
# see the preceding MC fit for priors of parameters' probability distributions
names(prior) <- c('timing_m', 'timing_sd', 'spread_m', 'spread_sd',
                  'run_m', 'run_sd',
                  'year_timing_exp', 'year_spread_exp',
                  'river_timing_exp', 'river_spread_exp',
                  'sigma_gamma_shape', 'sigma_gamma_scale')
print("prior"); print(prior)
print("river_prop_prior"); print(river_prop_prior)
# sigma (sd of normal residuals) has prior gamma(shape=10, scale=1)
k <- exp(seq(-4, 3, 0.25))  # interested in behaviour near zero
plot(k, dgamma(k, shape=5, rate=1), type="l", ylim=c(0, 0.25),
     xlab='', ylab='Probability Density', main='shape = 5, 10; rate = 1')
lines(k, dgamma(k, shape=10, rate=1), col="red")
dat <- list(np=12, prior=prior, nyear=nyear, nriver=nriver, ndata=ndata,
            x=x, y=yobs, year_id=year_id, river_id=river_id,
            river_prop_prior=river_prop_prior)
print("This is dat:"); print(dat)
# test run
library(rstan)
s1 <- sampling(stan_MultiGauss, data=dat, chains=4, iter=4000,
               cores=4, verbose=F)
print(s1)
plot(s1, pars=c('year_timing','year_spread','river_timing','river_spread'))
References
Akenhead SA, Irvine JR, Hyatt KD, Johnson SC, Grant SCH. Stock-recruit analyses of Fraser River sockeye salmon. N. Pac. Anadr. Fish Comm. Bull. 2016;6: 363–390. doi: 10.23849/npafcb6/363.390.
Betancourt M. A Conceptual Introduction to Hamiltonian Monte Carlo. arXiv:1701.02434 [Preprint]. 2018.
Cunningham JP. Gaussian Processes for machine learning. Invited lecture at the Machine Learning Summer School 2012, La Palma, Spain, Apr. 18, 2012. Video: https://www.youtube.com/watch?v=BS4Wd5rwNwE orcid.org/0000-0002-1805-9743
Gelman A, Hill J. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge, UK: Cambridge University Press; 2007.
Gelman A, Carlin JB, Stern HS, Dunson DB, Vehtari A, Rubin DB. Bayesian Data Analysis. 3rd edition. London: Chapman & Hall/CRC Press; 2013.
Carpenter B, Gelman A, Hoffman MD, Lee D, Goodrich B, Betancourt M, Brubaker M, Guo J, Li P, Riddell A. Stan: A Probabilistic Programming Language. Journal of Statistical Software. 2017;76(1): 1–32. doi: 10.18637/jss.v076.i01. ISSN 1548-7660.
Vehtari A, Gelman A, Gabry J. Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing. 2017;27(5): 1413–1432. doi: 10.1007/s11222-016-9696-4. arXiv:1507.04544 [Preprint].