Pooled SEM Analysis Across Cohorts

  • Nov 19
  • 3 min read

Why Did We Choose Structural Equation Modeling (SEM)?

In the EQUAL-LIFE project, we face a complex challenge: understanding how multiple urban environmental factors influence children's health, using data from different European cohort studies. Structural Equation Modeling (SEM) is well suited to this analysis because it lets us model hypothesized causal relationships between observed and latent variables simultaneously, accounting for the interconnections among environmental exposures, socio-economic factors, and health outcomes.

Unlike traditional regressions that analyze one relationship at a time, SEM allows us to build an integrated theoretical model where, for example, maternal education influences both environmental exposures and pregnancy behaviors, which in turn influence birth weight and ultimately the risk of ADHD. This approach more faithfully reflects the biological and social reality in which children grow up.


The Federated Approach: Working with Correlation Matrices

A distinctive feature of the EQUAL-LIFE project is the use of correlation matrices instead of individual-level data. This choice arises from the need to respect stringent privacy regulations for health data: each cohort study independently calculates its own correlation matrix among all harmonized variables and shares only this statistical summary, without ever transferring information about individual participants.
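As a rough illustration of what a cohort might compute locally, the sketch below builds such a summary with NumPy. The function name, the variable names, and the simulated data are all hypothetical, and it assumes complete cases (no missing values); it is not the project's actual pipeline.

```python
import numpy as np

def summarize_cohort(data, variable_names):
    """Build an aggregate summary; no participant-level rows are included."""
    data = np.asarray(data, dtype=float)
    if data.ndim != 2 or data.shape[1] != len(variable_names):
        raise ValueError("expected one column per variable name")
    # rowvar=False: each column is a variable, each row a participant.
    corr = np.corrcoef(data, rowvar=False)
    return {"variables": list(variable_names), "n": data.shape[0], "corr": corr}

# Simulated stand-in for one cohort's harmonized data (hypothetical).
rng = np.random.default_rng(0)
fake_cohort = rng.normal(size=(200, 3))
summary = summarize_cohort(
    fake_cohort, ["maternal_education", "air_pollution", "birth_weight"]
)
```

Only the `summary` dictionary would leave the research center in this sketch; the sample size is included because pooled estimation needs it alongside the correlations.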

SEM is particularly suited to this federated approach because, under the assumption of multivariate normality, each cohort's correlation matrix, together with its sample size, carries the information needed to estimate the model's standardized parameters. This allows us to combine evidence from thousands of children distributed across Europe, while keeping raw data protected within each research center.
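To make this concrete, the standard maximum-likelihood discrepancy function used in SEM depends on the data only through the sample matrix S. The sketch below evaluates it for a given model-implied matrix; the function name and the toy matrices are assumptions for illustration, and treating a correlation matrix as S is itself a simplification.

```python
import numpy as np

def ml_discrepancy(S, Sigma):
    """F_ML = log|Sigma| + tr(S Sigma^-1) - log|S| - p."""
    S = np.asarray(S, dtype=float)
    Sigma = np.asarray(Sigma, dtype=float)
    p = S.shape[0]
    sign_s, logdet_s = np.linalg.slogdet(S)
    sign_m, logdet_m = np.linalg.slogdet(Sigma)
    # Guard against non-positive determinants, where the logs are undefined.
    if sign_s <= 0 or sign_m <= 0:
        raise ValueError("both matrices need a positive determinant")
    return logdet_m + np.trace(S @ np.linalg.inv(Sigma)) - logdet_s - p

# Toy 2x2 example (hypothetical values).
S = np.array([[1.0, 0.3], [0.3, 1.0]])
perfect_fit = ml_discrepancy(S, S)      # model reproduces S exactly
misfit = ml_discrepancy(S, np.eye(2))   # model wrongly implies zero correlation
```

The discrepancy is zero when the model-implied matrix equals S and grows as the two diverge; no individual records enter the calculation.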


Multi-Group Pooling: Combining Heterogeneous Studies

Our analytical strategy uses multi-group SEM, a technique that allows us to simultaneously analyze all studies while maintaining their separate identities. Instead of simply "adding up" the data, we explicitly test whether relationships between variables are homogeneous (equal across studies) or heterogeneous (different by geographical or cultural context).

We start with a completely flexible model where each study has its own parameters. Then we progressively test equality constraints: if regression coefficients do not differ significantly between studies, we constrain them to be equal to obtain more precise pooled estimates. When we detect significant heterogeneity, we model it explicitly, allowing specific parameters to vary across studies. This approach allows us to maximize statistical power where appropriate, without forcing artificial homogeneity that would hide important differences between populations.
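The logic of pooling versus heterogeneity can be illustrated with a simplified stand-in: inverse-variance (fixed-effect) pooling of one coefficient across studies, with Cochran's Q and I² as heterogeneity measures. This is not the multi-group SEM constraint test itself, and the estimates and standard errors below are made up.

```python
import math

def pool_fixed_effect(estimates, standard_errors):
    """Inverse-variance pooled estimate with Cochran's Q and I-squared."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    total = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, estimates)) / total
    pooled_se = math.sqrt(1.0 / total)
    # Q sums the weighted squared deviations from the pooled value.
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, estimates))
    df = len(estimates) - 1
    # I-squared: share of variation beyond what sampling error would explain.
    i_squared = max(0.0, (q - df) / q) if q > 0 else 0.0
    return pooled, pooled_se, q, i_squared

# Hypothetical per-study coefficients for one path.
pooled, pooled_se, q, i_squared = pool_fixed_effect(
    [0.10, 0.12, 0.08], [0.02, 0.02, 0.02]
)
```

Under homogeneity, Q approximately follows a chi-square distribution with (number of studies minus 1) degrees of freedom, so a Q close to that value, as in this toy case, gives no reason to relax the equality constraint.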


Managing Complexity: Methodological Challenges

Pooled analysis of European studies presents inevitable practical complexities. Missing variables are the main challenge: not all studies have collected the same information (for example, some lack data on maternal age at birth, others lack air pollution measures). Our solution is to build study-specific models that include all variables available for that study, while still allowing parameters shared across studies to be constrained to be equal.
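One way to picture the study-specific models is as a filter over a common list of paths: each study keeps only the paths whose variables it actually measured. The path list, the variable names, and the helper below are hypothetical and only sketch the idea.

```python
def study_specific_paths(paths, available):
    """Keep only the (predictor, outcome) paths a study can estimate."""
    available = set(available)
    return [(x, y) for (x, y) in paths if x in available and y in available]

# Hypothetical full model, as (predictor, outcome) pairs.
PATHS = [
    ("maternal_education", "air_pollution"),
    ("maternal_age", "birth_weight"),
    ("air_pollution", "birth_weight"),
    ("birth_weight", "adhd"),
]

# Cohort A lacks maternal age; cohort B lacks air pollution.
cohort_a = study_specific_paths(
    PATHS, ["maternal_education", "air_pollution", "birth_weight", "adhd"]
)
cohort_b = study_specific_paths(
    PATHS, ["maternal_education", "maternal_age", "birth_weight", "adhd"]
)

# Paths present in both models are candidates for equality constraints.
shared = [p for p in cohort_a if p in cohort_b]
```

In this toy setup only the path from birth weight to ADHD is estimable in both cohorts, so it is the only one that could be constrained across them.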

Another technical challenge is handling non-positive definite correlation matrices, which can arise when correlations are computed with pairwise deletion of missing data. We manage these with correction algorithms that modify the original correlations as little as possible. Additionally, we work iteratively to fix residual variances so that they are consistent with the observed R² values, ensuring the model accurately reproduces the structure of the original data.
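The text does not name the correction algorithm, so the sketch below shows one simple option as an assumption: clip negative eigenvalues to a small positive floor, rebuild the matrix, and rescale it back to a unit diagonal. The function name, the floor value, and the example matrix are illustrative; other methods change the correlations less.

```python
import numpy as np

def repair_correlation(corr, min_eigenvalue=1e-6):
    """Eigenvalue clipping followed by rescaling to a unit diagonal."""
    corr = np.asarray(corr, dtype=float)
    sym = (corr + corr.T) / 2.0                # enforce exact symmetry
    eigvals, eigvecs = np.linalg.eigh(sym)
    clipped = np.clip(eigvals, min_eigenvalue, None)
    rebuilt = (eigvecs * clipped) @ eigvecs.T  # V diag(clipped) V^T
    scale = 1.0 / np.sqrt(np.diag(rebuilt))
    fixed = rebuilt * np.outer(scale, scale)   # restore ones on the diagonal
    return (fixed + fixed.T) / 2.0

# An inconsistent set of pairwise correlations (hypothetical values).
bad = np.array([
    [1.0, 0.9, -0.9],
    [0.9, 1.0, 0.9],
    [-0.9, 0.9, 1.0],
])
fixed = repair_correlation(bad)
min_before = np.linalg.eigvalsh(bad).min()
min_after = np.linalg.eigvalsh(fixed).min()
```

Comparing `bad` and `fixed` element by element shows how far the repair moved each correlation, which is worth checking before fitting any model to the repaired matrix.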


From Theory to Interpretable Results

Each outcome-specific analysis (ADHD, autism, depression, anxiety, cognition, wellbeing) follows a theoretically motivated causal structure that reflects current epidemiological knowledge. We create a latent variable "Physical Environment" that synthesizes multiple correlated exposures (air pollution, noise, urban density, availability of green spaces), reducing dimensional complexity and the risk of collinearity.
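A single latent factor accounts for the correlations among its indicators through their loadings: with standardized indicators and a factor variance of 1, the implied correlation between two indicators is the product of their loadings. The sketch below illustrates this; the loadings are invented for the example and are not project estimates.

```python
import numpy as np

def one_factor_implied_corr(loadings):
    """Correlations implied by a one-factor model with standardized indicators."""
    lam = np.asarray(loadings, dtype=float)
    if np.any(np.abs(lam) >= 1.0):
        raise ValueError("standardized loadings must lie strictly between -1 and 1")
    implied = np.outer(lam, lam)
    # Unique variance 1 - lambda^2 brings each diagonal entry back to 1.
    np.fill_diagonal(implied, 1.0)
    return implied

# Hypothetical loadings on "Physical Environment":
# air pollution, noise, urban density, green space (negative loading).
implied = one_factor_implied_corr([0.7, 0.6, 0.5, -0.4])
```

Four loadings summarize six pairwise correlations here, which is the sense in which the latent variable reduces dimensionality.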

Results are presented through standardized coefficients interpretable as effect sizes, forest plots that visualize any heterogeneity across studies, and path diagrams that graphically illustrate the network of causal relationships. This transparent approach allows us to identify not only which environmental factors are relevant, but also through which mechanisms they exert their effects on child health, providing a solid foundation for evidence-based public policy interventions.
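In a path model, the effect transmitted along one mechanism is conventionally computed as the product of the standardized coefficients on that path. The sketch below shows the arithmetic; the coefficients and variable names are placeholders, not results from the project.

```python
import math

def indirect_effect(coefficients, chain):
    """Multiply standardized path coefficients along a chain of variables."""
    steps = zip(chain[:-1], chain[1:])
    return math.prod(coefficients[(x, y)] for x, y in steps)

# Placeholder standardized coefficients, keyed by (predictor, outcome).
coefficients = {
    ("maternal_education", "air_pollution"): -0.2,
    ("air_pollution", "birth_weight"): -0.1,
    ("birth_weight", "adhd"): -0.3,
}

effect = indirect_effect(
    coefficients,
    ["maternal_education", "air_pollution", "birth_weight", "adhd"],
)
```

Because each standardized coefficient is typically well below 1 in absolute value, long chains yield small indirect effects, which is worth keeping in mind when reading a path diagram.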





