---
title: "Dynamic Factor Model"
author: "Greg Veramendi"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Dynamic Factor Model}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment  = "#>"
)
```

## Overview

An example of how to estimate the dynamics of a latent factor. Assume
that indicators are observed at two time points and you want to
estimate how the period-2 value depends on the period-1 value. You
need two things:

1. **A measurement system** that is *invariant* across time (same
   loadings, thresholds / residual SDs across the two periods) so the
   factor scale is comparable, and
2. **A structural equation** linking the period-2 factor to the
   period-1 factor: `f₂ = α + β·f₁ + ε`.

**factorana** supports this with a two-stage workflow, and two
helper functions make the setup a few lines:

- **`define_dynamic_measurement()`** builds the Stage 1 measurement
  model: a 2-factor independent system (or k-factor, for k periods)
  with loadings and residual sigmas tied across periods via
  equality constraints, but measurement intercepts left
  period-specific.
- **`build_dynamic_previous_stage()`** converts the Stage 1
  estimation result into a `previous_stage` object for Stage 2,
  carrying the anchor period's intercepts into every factor slot.
  This anchors the measurement level under `E[f₁] = 0` and lets the
  observed mean shift between periods identify the structural
  intercept α.

This vignette walks through a simulated example where `α`, `β`, and
`σ²_ε` are known, shows the recovery, and explains one subtlety of
the workflow (why the anchor-period intercepts are the right ones to
carry into Stage 2).

## Simulate data

One latent construct, three linear indicators per period, two periods.
The DGP:

$$
f_1 \sim N(0,\sigma^2_{f_1}), \qquad
f_2 = \alpha + \beta\,f_1 + \varepsilon, \quad \varepsilon \sim N(0,\sigma^2_\varepsilon)
$$

$$
Y_{t,m} = \tau_m + \lambda_m\,f_t + u_{t,m}, \quad u_{t,m} \sim N(0,\sigma_m^2)
$$

with measurement parameters `τ_m`, `λ_m`, `σ_m` shared across the two
periods.

```{r}
library(factorana)

set.seed(41)
n <- 1500

# Structural parameters (what we want to recover)
true_var_f1   <- 1.0
true_alpha    <- 0.4
true_beta     <- 0.6
true_sigma_e2 <- 0.5

# Shared measurement parameters
item_int   <- c(1.5, 1.0, 0.8)
item_load  <- c(1.0, 0.9, 1.1)   # first loading fixed to 1 in the model
item_sigma <- c(0.7, 0.75, 0.65)

f1  <- rnorm(n, 0, sqrt(true_var_f1))
eps <- rnorm(n, 0, sqrt(true_sigma_e2))
f2  <- true_alpha + true_beta * f1 + eps

gen_Y <- function(f, i) {
  item_int[i] + item_load[i] * f + rnorm(length(f), 0, item_sigma[i])
}

dat <- data.frame(
  intercept = 1,
  eval      = 1L,
  Y_t1_m1 = gen_Y(f1, 1), Y_t1_m2 = gen_Y(f1, 2), Y_t1_m3 = gen_Y(f1, 3),
  Y_t2_m1 = gen_Y(f2, 1), Y_t2_m2 = gen_Y(f2, 2), Y_t2_m3 = gen_Y(f2, 3)
)
```

The wide-format data has columns named `<prefix><item>`: `Y_t1_m1`,
`Y_t1_m2`, `Y_t1_m3` for wave 1, and `Y_t2_m1`, ... for wave 2.

## Stage 1: `define_dynamic_measurement()`

One function call builds the measurement model: a 2-factor
independent system (one factor per period) with loadings and sigmas
tied across periods, intercepts left free per period.

```{r}
dyn <- define_dynamic_measurement(
  data                 = dat,
  items                = c("m1", "m2", "m3"),
  period_prefixes      = c("Y_t1_", "Y_t2_"),
  model_type           = "linear",
  evaluation_indicator = "eval"
)
# The wrapper generates 5 equality constraints: 2 for loadings (items
# m2, m3; item m1's loading is fixed to 1 on its factor slot) and 3
# for sigmas.
length(dyn$equality_constraints)
```

Estimate Stage 1 with `estimate_model_rcpp()` exactly as for any
model system:

```{r}
ctrl <- define_estimation_control(n_quad_points = 8, num_cores = 1)
result_stage1 <- estimate_model_rcpp(
  model_system = dyn$model_system,
  data         = dat,
  control      = ctrl,
  optimizer    = "nlminb",
  parallel     = FALSE,
  verbose      = FALSE
)
result_stage1$convergence
```

Inspect the intercepts: the wave-1 intercepts should match the DGP
`τ_m` (because `E[f₁] = 0` by convention). The wave-2 intercepts
are shifted by `λ_m · E[f₂]`, which is the mean-drift artefact we
are about to discard.

```{r}
est <- result_stage1$estimates
tab <- data.frame(
  m        = 1:3,
  DGP_tau  = item_int,
  wave_1   = round(c(est["Y_t1_m1_intercept"],
                     est["Y_t1_m2_intercept"],
                     est["Y_t1_m3_intercept"]), 3),
  wave_2   = round(c(est["Y_t2_m1_intercept"],
                     est["Y_t2_m2_intercept"],
                     est["Y_t2_m3_intercept"]), 3)
)
knitr::kable(tab, row.names = FALSE)
```

## Why we do not pool the intercepts

A natural alternative is to stack period-1 and period-2 rows into one
long-format dataset and fit a single-factor CFA. The resulting
intercepts would be the *pooled* means of `Y` across both periods,
biased by `λ_m · E[f₂]/2` (because the pooled mean averages the
period-1 and period-2 factor means). Plugging those into Stage 2
under-estimates `α` by roughly `E[f₂]/2`.

The wrapper avoids that: Stage 1 estimates period-specific intercepts,
then `build_dynamic_previous_stage()` carries only the anchor
period's intercepts into Stage 2.

## Stage 2: `build_dynamic_previous_stage()` + `SE_linear`

```{r}
dummy <- build_dynamic_previous_stage(
  dyn           = dyn,
  stage1_result = result_stage1,
  data          = dat,
  anchor_period = 1L
)

fm_stage2 <- define_factor_model(
  n_factors        = 2,
  n_types          = 1,
  factor_structure = "SE_linear"
)
ms_stage2 <- define_model_system(
  components     = list(),       # measurement components prepended from previous_stage
  factor         = fm_stage2,
  previous_stage = dummy
)

init_s2 <- initialize_parameters(ms_stage2, dat, verbose = FALSE)
init_s2$init_params["factor_var_1"]    <- unname(dummy$estimates["factor_var_1"])
init_s2$init_params["se_intercept"]    <- 0.0
init_s2$init_params["se_linear_1"]     <- 0.5
init_s2$init_params["se_residual_var"] <- 0.5

result_stage2 <- estimate_model_rcpp(
  model_system = ms_stage2,
  data         = dat,
  init_params  = init_s2$init_params,
  control      = ctrl,
  optimizer    = "nlminb",
  parallel     = FALSE,
  verbose      = FALSE
)
result_stage2$convergence
```

## Recovery

```{r}
est <- result_stage2$estimates
se  <- result_stage2$std_errors
ps  <- c("factor_var_1", "se_intercept", "se_linear_1", "se_residual_var")

tab <- data.frame(
  param = ps,
  true  = c(true_var_f1, true_alpha, true_beta, true_sigma_e2),
  est   = round(unname(est[ps]), 4),
  se    = round(unname(se[ps]),  4)
)
tab$z <- round((tab$est - tab$true) / tab$se, 2)
knitr::kable(tab, row.names = FALSE)
```

`se_intercept` (the structural α) recovers essentially exactly, which
is the whole point of the workflow. The slope β, residual variance
σ²_ε, and input factor variance σ²_f₁ also recover within their
standard errors.

## Cluster-robust standard errors

The standard errors above are model-based (inverse observed
information). When individuals are grouped (for example pupils within
schools, or siblings within households) and that grouping induces
dependence the model does not capture, `vcov_factorana()` provides
cluster-robust standard errors. Pass `cluster` as the name of a
grouping column in the data (or a vector of length `nrow(dat)`).

Here we attach a synthetic school id (50 schools) and cluster on it:

```{r}
set.seed(7)
dat$school <- sample(1:50, n, replace = TRUE)

se_cluster <- robust_se(result_stage2, dat, type = "cluster", cluster = "school")

ps <- c("factor_var_1", "se_intercept", "se_linear_1", "se_residual_var")
data.frame(
  param      = ps,
  se_model   = round(unname(result_stage2$std_errors[ps]), 4),
  se_cluster = round(unname(se_cluster[ps]),                4),
  row.names  = NULL
)
```

Because the schools were assigned at random here, the cluster-robust
and model-based standard errors are close; with genuine within-cluster
dependence the cluster-robust values would typically be larger.

**Two-stage caveat.** These standard errors are computed on the Stage 2
fit and treat the Stage 1 measurement parameters (loadings, sigmas,
intercepts) as known. They therefore understate the total uncertainty
(the generated-regressor problem). For honest two-stage inference,
resample clusters and re-run *both* stages in a bootstrap; a
single-stage robust or cluster-robust covariance is exact only when the
measurement system is not itself estimated.

## Bootstrap for honest two-stage standard errors

A cluster bootstrap that re-runs *both* stages on each resample accounts
for the first-stage uncertainty. `bootstrap_factorana_multistage()`
drives this in one call: it resamples clusters, and for each replicate
fits Stage 1 and then Stage 2 chained on that replicate's own Stage 1
fit. Each `(stage, sample)` fit is saved to disk and skipped if it
already exists, so the run is restartable and the per-sample work can be
scattered across a compute cluster with `bootstrap_fit_sample()`.

```{r, eval = FALSE}
stage_builders <- list(
  # Stage 1: the measurement system (prev_fits is empty for the first stage)
  function(prev_fits, data) dyn$model_system,
  # Stage 2: SE_linear, chaining on THIS replicate's Stage 1 fit
  function(prev_fits, data) {
    prev <- build_dynamic_previous_stage(
      dyn = dyn, stage1_result = prev_fits[[1]], data = data, anchor_period = 1L)
    fm2 <- define_factor_model(n_factors = 2, factor_structure = "SE_linear")
    define_model_system(components = list(), factor = fm2, previous_stage = prev)
  }
)

boot <- bootstrap_factorana_multistage(
  stage_builders, dat, R = 500, cluster = "person",
  dir = "boot_run", control = ctrl,
  optimizer = "nlminb", parallel = FALSE, verbose = FALSE
)

boot$final$boot_se   # Stage 2 bootstrap standard errors (include Stage 1 uncertainty)
boot$final$ci        # percentile intervals
```

For a multi-node run, generate the samples once with
`generate_bootstrap_samples()` and call `bootstrap_fit_sample()` from an
array job (one task per sample, per stage), then summarise with
`collect_bootstrap()`.

## Notes on generalisation

- **More than two periods.** `period_prefixes` can have length
  3, 4, or more. The wrapper builds a k-factor independent Stage 1
  with all `k · (n_items - 1)` loadings and `k · n_items` sigmas tied
  across periods. For Stage 2 SE modelling, pick any two factors
  (e.g., periods *t-1* and *t*) and build a 2-factor SE model with
  `previous_stage`.
- **Ordered probit measures.** Pass `model_type = "oprobit"` and
  `n_categories = K`. The wrapper ties the `K-1` cutpoints of each
  item across periods (in place of sigma).
- **Additional covariates.** Pass `covariates = c("intercept", "x1", ...)`.
  Covariate coefficients on items are estimated per-period in Stage 1
  (not tied by this wrapper); the anchor-period values are carried
  into Stage 2.

## Where to go next

- `?define_factor_model`: `factor_structure = "SE_quadratic"` adds a
  quadratic term `γ·f_1²` to the structural equation (trap/threshold
  dynamics).
- `?define_model_system`: the underlying `equality_constraints`
  argument, used by the wrapper.
- `vignette("measurement_system", package = "factorana")`: basics
  of measurement-system estimation.
- `vignette("roy_model", package = "factorana")`: a different
  structural setting (discrete sector choice plus continuous outcomes
  sharing a latent ability).
