6. Building a data object in R
2026-05-11
Source:vignettes/data-without-excel.Rmd
data-without-excel.RmdRceattle normally reads input data from an Excel workbook via
read_data(). The same data object can be built entirely in
R as a named list — useful when:
- Simulating data for testing or MSE operating models
- Generating inputs programmatically (e.g., multi-scenario loops or sensitivity analyses)
- Integrating Rceattle into pipelines that don’t use Excel
Everything passed to fit_mod() lives in this list, so
understanding its structure also helps you modify existing data objects
read from Excel. This vignette walks through each component. The result
can be validated with data_check() and fitted with
fit_mod(), or exported back to Excel with
write_data().
Required vs. optional fields
The list are optional and only needed when a particular feature is
turned on. Rceattle fills missing optional fields with safe
defaults and stops model runs under the conditions that actually require
them:
| Field | Required when |
|---|---|
comp_data |
(always optional) |
caal_data |
any(growth_model > 0) (estimating growth) |
emp_sel |
any fleet has Selectivity = "Fixed"
|
NByageFixed |
any(estDynamics > 0) |
env_data |
env linkages |
diet_data |
msmMode > 0 |
ration_data |
msmMode > 0 |
Bioenergetics scalars (Ceq, CA,
CB, Tco, …) |
msmMode > 0 |
In default single-species mode (msmMode = 0) you can
omit any of these by simply not setting the field on the list. The
single-species example below shows the full machinery, but feel free to
skip sections you don’t need.
Setup
library(Rceattle)
# Define core dimensions up front — these drive the size of all downstream arrays.
nspp <- 2 # number of species (or stocks)
nages <- 10 # maximum age classes (same for all species here)
nlengths <- 20 # number of length bins (for age-length keys / CAAL data)
nyrs <- 30 # number of hindcast years
years <- 1:nyrs
simData <- list()1. Species / model controls
These scalars and vectors describe the biological structure of each species and control which population model is used.
simData$nspp <- nspp
simData$styr <- 1 # first year of hindcast
simData$endyr <- nyrs # last year of hindcast
simData$projyr <- nyrs + 10 # last year of the projection period
# Character labels shown in plots and output tables
simData$spnames <- paste0("Species", 1:nspp)
# nsex: 1 = combined-sex model (male and female treated as one group),
# 2 = sex-structured model (separate selectivity and mortality by sex)
simData$nsex <- rep(1, nspp)
# Calendar month of spawning — used to compute the fraction-of-year weight
# applied to spawning stock biomass (e.g., 6 = mid-year spawning)
simData$spawn_month <- rep(6, nspp)
simData$nages <- rep(nages, nspp) # number of age classes per species
simData$minage <- rep(1, nspp) # minimum age (usually 1)
simData$nlengths <- rep(nlengths, nspp)
# estDynamics controls how population abundance is estimated:
# 0 = estimate population dynamics (standard statistical catch-at-age)
# 1 = fix N-at-age exactly to values in NByageFixed (e.g., external estimates)
# 2 = scale NByageFixed by a single estimated scalar per species
simData$estDynamics <- rep(0, nspp)
# Indices linking each species to its weight and growth datasets.
# Multiple weight schedules can exist (e.g., fleet-specific); these point to
# which one is used for total biomass (pop) and spawning biomass (ssb).
simData$pop_wt_index <- 1:nspp
simData$ssb_wt_index <- 1:nspp
simData$pop_age_transition_index <- rep(1, nspp)
# Length-weight relationship coefficients: W = alpha * L ^ beta.
# Used internally to convert length-based selectivity to weight when estimating growth.
simData$alpha_wt_len <- rep(0.00001, nspp)
simData$beta_wt_len <- rep(3, nspp)
# Prior or initial standard deviation for log-recruitment deviations. Controls the
# strength of the recruitment penalty in penalized-likelihood mode.
simData$sigma_rec_prior <- rep(1, nspp)
# Non-modelled ("other") prey biomass available to predators. Used in
# multi-species mode to prevent unrealistically high predation mortality when
# modelled prey are scarce. Setting this too low forces predators to over-eat
# modelled prey; too high dilutes the predation signal.
simData$other_food <- rep(1e5, nspp)2. Fleet control table
This is the most important configuration table. Each row defines one fleet (fishery or survey) and controls how it is modelled. Multiple fleets can target the same species.
# Fleet_type:
# "Fishery" -- contributes catch and catch-at-age/length composition data
# "Survey" -- contributes an abundance index and composition data
# "Off" -- fleet is present in the data but excluded from the likelihood
# Selectivity options (Selectivity column) — use the string name or integer code:
# "Fixed" (0) -- fixed at values in emp_sel; not estimated
# "Logistic" (1) -- two-parameter ascending logistic (a50, slope)
# "NonParametric" (2) -- freely estimated selectivity-at-age/length with smoothness penalty
# "DoubleLogistic" (3) -- ascending + descending dome-shaped logistic
# "DescendingLogistic" (4) -- descending logistic only (dome-shaped for oldest ages)
# "Hake" (5) -- age-specific logistic with spawning migration pattern
# "2DAR1" (6) -- 2D autoregressive random effects (age × year)
# "3DAR1" (7) -- 3D autoregressive random effects (age × year × fleet)
#
# Selectivity_dimension: "Age" or "Length"
# Length-based selectivity requires an age-to-length transition matrix (see below).
# Catchability options (Catchability column) — use the string name or integer code:
# "Fixed" (0) -- Q fixed at Q_prior; not estimated
# "Estimated" (1) -- freely estimated on log scale
# "Estimated-with-prior" (2) -- estimated with a lognormal prior (Q_prior, Q_sd_prior)
# "Analytical" (3) -- Ludwig & Walters / Martell (1994) analytical Q
# "PowerEquation" (4) -- power-equation Q (habitat area scaling)
# "Environmental" (5) -- mu_q + X * beta, where X is a covariate in env_data
# "AR1" (6) -- auto-regressive catchability (Rogers et al. 2024)
# NA -- not applicable (use for fisheries without CPUE)
# Comp_loglike / CAAL_loglike options:
# "Multinomial" -- standard multinomial
# "DirichletMultinomial" -- overdispersed; estimates a concentration parameter
# "MultinomialAFSC" -- AFSC variant of multinomial (marginal age + length)
simData$fleet_control <- data.frame(
Fleet_name = paste0(c("Survey", "Fishery"),
rep(paste(" Species", 1:nspp), each = 2)),
Fleet_code = 1:(nspp * 2), # unique integer ID per fleet
Fleet_type = rep(c("Survey", "Fishery"), nspp),
Species = rep(1:nspp, each = 2),
Month = 0, # 0 = mid-year; calendar month otherwise
# Selectivity_index links fleets that share the same selectivity curve.
# Fleets with the same index share all selectivity parameters.
Selectivity_index = 1:(nspp * 2),
Selectivity = "Logistic",
Selectivity_dimension = "Age",
N_sel_bins = NA, # used for NonParametric only
Sel_curve_pen1 = NA, # smoothness penalty (NonParametric)
Sel_curve_pen2 = NA,
# Time_varying_sel:
# 0 / "Off" = time-invariant (default)
# 1 / "IID" = penalized log-normal deviates (SD fixed at Time_varying_sel_sd_prior
# unless random_sel = TRUE in fit_mod)
# 2 / "AR1" = AR(1) (not yet implemented)
# 3 / "Block" = time blocks (specify via R-side mapping)
# 4 / "RandomWalk" = random walk
# 5 / "RandomWalkAscending" = random walk on ascending portion of double logistic
Time_varying_sel = 0,
Time_varying_sel_sd_prior = 1,
Bin_first_selected = 1, # youngest fully-selected age or length bin
Sel_norm_bin1 = NA, # bin to normalise selectivity from (optional)
Sel_norm_bin2 = NA,
# Composition likelihoods
Comp_loglike = "Multinomial",
Comp_weights = 1, # input effective sample size multiplier
CAAL_loglike = "Multinomial",
CAAL_weights = 1,
# Index units and weight/growth linkage
Weight1_Numbers2 = 1, # 1 = weight (mt), 2 = numbers
Weight_index = rep(1:nspp, each = 2),
Age_transition_index = 1,
# Q_index links fleets that share the same catchability parameter.
Q_index = 1:(nspp * 2),
Catchability = rep(c("Estimated", NA), nspp),
Q_prior = rep(c(1, NA), nspp),
Q_sd_prior = rep(c(0.2, NA), nspp),
# Time_varying_q:
# 0 / "Off" = time-invariant
# 1 / "IID" = penalized log-normal deviates (SD fixed at Time_varying_q_sd_prior
# unless random_q = TRUE in fit_mod)
# 2 / "AR1" = AR(1) (not yet implemented)
# 3 / "Block" = time blocks
# 4 / "RandomWalk" = random walk
Time_varying_q = rep(c(0, NA), nspp),
Time_varying_q_sd_prior = rep(c(1, NA), nspp),
# Estimate_index_sd:
# 0 = fix at Log_sd from index_data
# 1 = estimate a single SD
# 2 = analytical sigma (Ludwig & Walters 1994)
Estimate_index_sd = rep(c(0, NA), nspp),
Index_sd_prior = rep(c(1, NA), nspp),
# Estimate_catch_sd:
# 0 = fix at Log_sd from catch_data (treats catch as known)
# 1 = estimate
Estimate_catch_sd = rep(c(NA, 0), nspp),
Catch_sd_prior = rep(c(NA, 1), nspp),
# proj_F_prop: how future F is distributed across fishery fleets per species.
# Must sum to 1 across fishery fleets for each species. NA for surveys.
proj_F_prop = rep(c(NA, 1), nspp)
)3. Survey index data
One row per fleet-year combination for survey fleets.
Observation is the log-scale biomass index (or numbers if
Weight1_Numbers2 = 2). A negative Year value
tells Rceattle to predict the index without including it in the
likelihood, which is useful for model checking.
simData$index_data <- data.frame(
Fleet_name = rep(paste("Survey Species", 1:nspp), each = nyrs),
Fleet_code = rep(seq(1, nspp * 2, by = 2), each = nyrs),
Species = rep(1:nspp, each = nyrs),
Year = rep(years, nspp),
Month = 0,
# Selectivity_block: integer index into the selectivity/Q parameter block.
# Used when Time_varying_sel or Time_varying_q = 3 (time blocks).
Selectivity_block = 1,
Observation = rnorm(nyrs * nspp), # replace with real log-index values
Log_sd = 0.1 # observation SE on the log scale
)4. Catch data
One row per fishery fleet-year combination. Log_sd is
typically set small (0.01–0.05) when catch is treated as known; increase
it if catch is highly uncertain.
simData$catch_data <- data.frame(
Fleet_name = rep(paste("Fishery Species", 1:nspp), each = nyrs),
Fleet_code = rep(seq(2, nspp * 2, by = 2), each = nyrs),
Species = rep(1:nspp, each = nyrs),
Year = rep(years, nspp),
Month = 0,
Selectivity_block = 1,
Catch = abs(rnorm(nyrs * nspp)), # replace with real catch (mt)
Log_sd = 0.05
)5. Composition data (optional)
Age or length composition data. Age0_Length1 = 0 for age
compositions, 1 for length compositions.
Sex = 0 for combined sex, 1 for females,
2 for males, 3 for joint female+male.
Composition columns (Comp_1, Comp_2, …) can be
raw counts or proportions — Rceattle normalises internally.
Skip this section entirely (don’t set simData$comp_data)
if no composition data are available — clean_data() will
fill it with an empty data.frame.
comp_matrix <- matrix(abs(rnorm(nyrs * nspp * nages * 2)),
nrow = nyrs * nspp * 2, ncol = nages)
colnames(comp_matrix) <- paste0("Comp_", 1:nages)
simData$comp_data <- cbind(
data.frame(
Fleet_name = c(rep(paste("Survey Species", 1:nspp), each = nyrs),
rep(paste("Fishery Species", 1:nspp), each = nyrs)),
Fleet_code = rep(c(seq(1, nspp * 2, 2), seq(2, nspp * 2, 2)), each = nyrs),
Species = c(rep(1:nspp, each = nyrs), rep(1:nspp, each = nyrs)),
Sex = 0,
Age0_Length1 = 0,
Year = rep(years, 2 * nspp),
Month = 0,
Sample_size = 200
),
comp_matrix
)6. Conditional age-at-length (CAAL) data (optional)
CAAL data provide observed age frequencies within each length bin and
allow the model to jointly fit length and age compositions, and
optionally estimate the growth curve internally (see
vignette("model-parameterizations") and
?build_growth). One row per fleet-sex-year-length-bin
combination.
caal_data becomes required when growth
is being estimated internally (any(growth_model > 0));
otherwise it can be omitted.
caal_matrix <- matrix(abs(rnorm(nyrs * nspp * nages * 2 * nlengths)),
nrow = nyrs * nspp * 2 * nlengths, ncol = nages)
colnames(caal_matrix) <- paste0("CAAL_", 1:nages)
simData$caal_data <- cbind(
data.frame(
Fleet_name = c(rep(paste("Survey Species", 1:nspp), each = nyrs * nlengths),
rep(paste("Fishery Species", 1:nspp), each = nyrs * nlengths)),
Fleet_code = rep(c(seq(1, nspp * 2, 2), seq(2, nspp * 2, 2)),
each = nyrs * nlengths),
Species = c(rep(1:nspp, each = nyrs * nlengths),
rep(1:nspp, each = nyrs * nlengths)),
Sex = 0,
Year = rep(years, 2 * nspp * nlengths),
Length = rep(1:nlengths, 2 * nspp * nyrs),
Sample_size = 200
),
caal_matrix
)7. Biological inputs
Age-to-length transition matrix
The probability that a fish of age a is observed in length bin l. Required when fitting length compositions or estimating growth internally. Each row must sum to 1. The placeholder below uses an identity matrix (age a maps exactly to length bin a); replace with empirical or model-based growth matrices.
age_trans_list <- lapply(1:nspp, function(sp) {
tmp <- matrix(0, nrow = nages, ncol = nlengths)
diag(tmp[1:min(nages, nlengths), 1:min(nages, nlengths)]) <- 1
colnames(tmp) <- paste0("Length_", 1:nlengths)
cbind(data.frame(Age_transition_name = paste("Species", sp),
Age_transition_index = sp,
Species = sp,
Sex = 0,
Age = 1:nages), tmp)
})
simData$age_trans_matrix <- do.call(rbind, age_trans_list)Ageing error matrix
The probability that a fish of true age t is read as observed age o. An identity matrix (no error) is a conservative starting point. Ageing error matrices can be generated with standard tools (e.g., the AgeingError package) and substituted here.
age_error_list <- lapply(1:nspp, function(sp) {
tmp <- as.data.frame(diag(1, nages))
colnames(tmp) <- paste0("Obs_age", 1:nages)
cbind(data.frame(Species = sp, True_age = 1:nages), tmp)
})
simData$age_error <- do.call(rbind, age_error_list)Weight-at-age
Mean weight (metric tonnes) at each age. Setting
Year = 0 applies the same schedule to all years
(time-invariant). Providing multiple rows with different years enables
time-varying weight-at-age, which is used in empirical weight-at-age
models.
weight_matrix <- matrix((1:nages / nages)^3 * 0.01,
nrow = nspp, ncol = nages, byrow = TRUE)
colnames(weight_matrix) <- paste0("Age", 1:nages)
simData$weight <- cbind(
data.frame(Wt_name = paste("Species", 1:nspp),
Wt_index = 1:nspp,
Species = 1:nspp,
Sex = 0,
Year = 0), # Year = 0 → time-invariant
weight_matrix
)Maturity-at-age
Proportion mature at each age, used to compute spawning stock
biomass. Values should range from 0 to 1. For sex-structured models
(nsex = 2) this represents the proportion of females that
are mature.
mat_matrix <- matrix(1, nrow = nspp, ncol = nages)
colnames(mat_matrix) <- paste0("Age", 1:nages)
simData$maturity <- cbind(data.frame(Species = 1:nspp), mat_matrix)Sex ratio at age
Proportion female at each age. Used in combined-sex models
(nsex = 1) to weight spawning biomass contributions. In
two-sex models (nsex = 2) this is the proportion female at
recruitment.
sex_matrix <- matrix(0.5, nrow = nspp, ncol = nages)
colnames(sex_matrix) <- paste0("Age", 1:nages)
simData$sex_ratio <- cbind(data.frame(Species = 1:nspp), sex_matrix)Fixed selectivity (optional)
Pre-specified selectivity-at-age values for fleets whose selectivity
is fixed (not estimated). Required when any fleet has
Selectivity = "Fixed". Skip if all fleets estimate their
own selectivity (as in this example).
When you do need it, the table looks like:
# Example for one fleet whose selectivity is fixed at the supplied row of values
emp_sel <- data.frame(
Fleet_name = "Survey Species 1",
Fleet_code = 1,
Species = 1,
Sex = 0,
Year = 0, # 0 = applied to every year
t(matrix(c(0.1, 0.5, 1, rep(1, nages - 3)), ncol = 1,
dimnames = list(paste0("Comp_", 1:nages), NULL)))
)
simData$emp_sel <- NULLFixed numbers-at-age (optional)
Used when estDynamics > 0:
-
1= numbers-at-age fixed exactly to these values -
2= numbers-at-age scaled by a single estimated scalar per species -
3= numbers-at-age scaled by age-specific estimated scalars per species
Required when any(estDynamics > 0).
Skip if you’re estimating population dynamics from scratch (the default
estDynamics = 0).
When you do need it, the table has columns Species_name,
Species, Sex, Year, then
Age1 ... Age<nages> for the fixed numbers-at-age.
Natural mortality (M1 base)
Residual natural mortality-at-age before any predation component. In
single-species mode, total mortality M = M1. In multi-species mode, M =
M1 + M2 where M2 is predation mortality estimated from the bioenergetics
model. When switching to multi-species mode it is important to reduce M1
accordingly (see vignette("single-vs-multispecies")).
m_matrix <- matrix(0.2, nrow = nspp, ncol = nages)
colnames(m_matrix) <- paste0("Age", 1:nages)
simData$M1_base <- cbind(data.frame(Species = 1:nspp, Sex = 0), m_matrix)M1 can also be estimated rather than fixed. See
?build_M1 for options including estimating a single scalar,
an age-specific vector, or a temperature-dependent function.
Weight-at-age
Used for biomass calculations. Linked to species or fleets. Set Year = 0 to apply the same weight-at-age to all years (time-invariant). Age_1 … Age_nages columns hold mean weight (usually kg) at each age.
weight_matrix <- matrix(
(1:nages / nages)^3 * 0.01, # Simple power curve placeholder
nrow = nspp, ncol = nages, byrow = TRUE
)
colnames(weight_matrix) <- paste0("Age", 1:nages)
simData$weight <- cbind(
data.frame(
Wt_name = paste("Species", 1:nspp),
Wt_index = 1:nspp,
Species = 1:nspp,
Sex = 0,
Year = 0 # Year = 0 applies to all years (time-invariant)
),
weight_matrix
)Environmental covariate data (optional)
An optional time series used in environment-recruitment, -M1,
-catchability, or -growth relationships. Missing years are filled with
the column mean. Column names after Year can be anything
and are referenced by column order.
Skip this if no environmental covariates are used.
clean_data() will fall back to a Year-only data.frame with
no indices, and data_check() will only complain when a
feature actually needs an index (env-q catchability,
Ceq[sp] > 1, or any env linkage).
simData$env_data <- data.frame(
Year = 1:nyrs,
EnvData = 1 # constant = no environmental effect (placeholder)
)8. Multi-species bioenergetics parameters (optional in single-species mode)
These parameters govern the Wisconsin bioenergetics model used to
compute predation mortality (M2) in multi-species mode. They are unused
in single-species mode and can be omitted entirely —
switch_check() fills any missing scalars with safe
sentinels (Ceq = 1, Pvalue = 1,
fday = 365, others 0) so TMB’s
length-nspp DATA_VECTOR requirements are
satisfied.
In multi-species mode (msmMode > 0)
all of Ceq, Cindex,
Pvalue, fday, CA,
CB, Qc, Tco, Tcm,
Tcl, CK1, CK4 must be supplied as
length-nspp vectors; data_check() will list
any that are missing or wrong-length. Likewise diet_data
and ration_data are required.
See ?build_M1 and the CEATTLE technical documentation
for full details on each parameter.
# Ceq: consumption equation form following Hanson et al. (1997)
# 1 = Kitchell et al. (1977)
# 2 = Stewart et al. (1983) -- most common for gadids
# 3 = Kitchell et al. (1977) with temperature dependence on both limbs
# 4 = Thornton and Lessman (1978)
simData$Ceq <- rep(4, nspp)
simData$Cindex <- rep(1, nspp) # temperature index type
simData$Pvalue <- rep(1, nspp) # proportion of maximum consumption realised
simData$fday <- rep(1, nspp) # foraging days per year
# Wisconsin model coefficients (CA, CB) and temperature dependence parameters
simData$CA <- rep(1, nspp)
simData$CB <- rep(-1, nspp)
simData$Qc <- rep(1, nspp)
simData$Tco <- rep(1, nspp)
simData$Tcm <- rep(1, nspp)
simData$Tcl <- rep(1, nspp)
simData$CK1 <- rep(1, nspp)
simData$CK4 <- rep(1, nspp)
# Diet composition likelihood for each predator species
# 0 = Multinomial, 1 = Dirichlet-multinomial
simData$Diet_loglike <- rep(0, nspp)
simData$Diet_comp_weights <- rep(1, nspp)
# Ration data (optional in single-species mode): observed mean ration-at-age from
# bioenergetics studies. Required when msmMode > 0. Skip in single-species mode.
#
# When provided, columns are: Species, Sex, Year, Age1, ..., Age<nages>.
# Diet composition data: observed stomach content proportions (predator-prey pairs).
# Required when msmMode > 0; skip in single-species mode.
#
# When provided, columns are: Pred, Prey, Pred_sex, Prey_sex, Pred_age,
# Prey_age, Year, Sample_size, Stomach_proportion_by_weight.9. Validate and fit
plot_data() prints a summary of the data list.
plot_data(simData)
# Key fit_mod() arguments:
#
# estimateMode: 0 = full fit (hindcast + projection)
# 1 = hindcast only
# 2 = projection only (fix parameters, run HCR)
# 3 = evaluate only (MakeADFun, no optimisation) -- useful for debugging
#
# msmMode: 0 = single-species, 1 = Type II MSVPA, 2 = Type III MSVPA
#
# initMode: 0 = freely estimated initial N-at-age
# 1 = equilibrium, F_init = 0, no init deviations
# 2 = equilibrium, F_init = 0, with init deviations (default)
# 3 = non-equilibrium: F_init estimated, init devs included
# 4 = non-equilibrium: F_init scales R0
#
# phase: TRUE = use parameter phasing for improved convergence (recommended)
sim_run <- fit_mod(
data_list = simData,
estimateMode = 3, # change to 0 for a real fit
msmMode = 0,
initMode = 2,
random_rec = FALSE,
fit_control = fit_control(
phase = TRUE,
verbose = 1))10. Export back to Excel (optional)
write_data() exports the data list to an Excel workbook
in the standard Rceattle template format, which can then be edited and
re-read with read_data(). This makes it easy to share data
objects or move between R-based and spreadsheet-based workflows.
write_data(simData, file = tempfile(fileext = ".xlsx"))