Models

Overview

Statistical models are mathematical frameworks that represent how data is generated, enabling us to estimate relationships between variables, make predictions, and understand underlying mechanisms. Unlike simple summary statistics, models can account for confounding factors, quantify uncertainty, and test hypotheses about complex processes. This category encompasses a broad range of modeling approaches from basic linear regression to advanced mixed-effects and survival models.

Foundation and Purpose

At their core, statistical models specify assumptions about how observed data relates to unobserved parameters. The process typically involves specifying a probability model, choosing appropriate estimation methods, and assessing model fit. Python implementations of statistical models primarily rely on statsmodels, which provides a comprehensive suite of modeling tools, alongside NumPy for numerical computation and SciPy for statistical functions.
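The three steps named above (specify a probability model, estimate its parameters, assess fit) can be sketched on simulated data. This is a minimal illustration, assuming counts follow a Poisson distribution with a single unknown rate. The data, the seed, and the names `neg_log_lik`, `lam_hat`, and `dispersion_ratio` are illustrative and do not correspond to any tool in this category.

```python
import numpy as np
from scipy import optimize, stats

# Simulated counts (illustrative data, not from the source).
rng = np.random.default_rng(3)
y = rng.poisson(lam=4.0, size=300)

# Step 1: probability model. y_i ~ Poisson(lam), so the negative
# log-likelihood is minus the sum of the log-pmf values.
def neg_log_lik(lam):
    return -np.sum(stats.poisson.logpmf(y, lam))

# Step 2: estimation by maximum likelihood (bounded numerical search).
result = optimize.minimize_scalar(
    neg_log_lik, bounds=(1e-6, 50.0), method="bounded"
)
lam_hat = result.x

# Step 3: a simple fit check. Under a Poisson model the variance
# should be close to the mean, so this ratio should be near 1.
dispersion_ratio = y.var(ddof=1) / y.mean()

print(f"MLE of rate: {lam_hat:.3f}, sample mean: {y.mean():.3f}")
print(f"variance / mean: {dispersion_ratio:.3f}")
```

For the Poisson model the maximum likelihood estimate of the rate has a closed form (the sample mean), so the numerical optimum should agree with `y.mean()`; the search is shown only to make the estimation step explicit.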

Regression Models

The Regression subcategory includes linear models that form the foundation of statistical inference. Ordinary Least Squares (OLS) regression minimizes squared deviations between observed and predicted values, suitable for continuous outcomes with approximately normal errors. Weighted Least Squares (WLS) and Generalized Least Squares (GLS) extend OLS to handle heteroscedasticity and correlation structures. Quantile regression models relationships at different points of the conditional distribution, useful when the effect of predictors varies across quantiles. Beyond standard linear models, robust regression uses M-estimators that downweight outliers, and specification tests help identify model misspecification before relying on results.

Generalized Linear Models

The Generalized Linear subcategory extends linear models to non-normal response distributions through link functions. Generalized Linear Models (GLM) support several families: binomial for binary or proportion data, Poisson for counts, Gamma and Inverse Gaussian for positive right-skewed data, Negative Binomial for overdispersed counts, and Tweedie, which for variance powers between 1 and 2 accommodates non-negative data that mixes exact zeros with positive continuous values. These models are fitted by maximum likelihood and are fundamental for handling outcomes beyond continuous, normally distributed responses.

Discrete Choice Models

When the response is categorical, the Discrete Choice subcategory provides specialized approaches. Logistic regression models binary outcomes, multinomial logit extends this to multiple unordered categories, ordered logit respects ordinal structure in categorical outcomes, and probit models offer an alternative parameterization using the normal cumulative distribution function. These models are essential for classification problems and understanding how predictors influence categorical choices.

Count Data Models

The Count subcategory addresses challenges in modeling non-negative integer responses. Standard Poisson models assume the mean equals the variance, but data often exhibits overdispersion (variance exceeds mean) or excess zeros. Zero-Inflated Poisson (ZIP) and Zero-Inflated Negative Binomial (ZINB) models accommodate excess zeros through mixture processes. Hurdle models explicitly separate the process determining whether a count is zero from the distribution of positive counts, providing interpretable two-stage modeling.

Mixed Effects Models

Clustered or hierarchical data requires accounting for within-group correlation. The Mixed Effects subcategory includes Linear Mixed Models (LMM) with random intercepts and slopes for continuous outcomes, Generalized Linear Mixed Models (GLMM) extending GLM to hierarchical structures with binomial or Poisson families, and Generalized Estimating Equations (GEE) for population-averaged inference in correlated data. Mixed models simultaneously estimate fixed population effects and random group-specific deviations, whereas GEE estimates population-averaged effects and handles within-group correlation through a working correlation structure.

Survival Analysis

The Survival subcategory handles time-to-event data with censoring. The Kaplan-Meier estimator provides a non-parametric estimate of the survival function over time, which can be computed separately for each group. Cox Proportional Hazards models extend this to regression by modeling how covariates affect the instantaneous risk of events while leaving the baseline hazard unspecified. Parametric survival models (e.g., exponential) assume a specific distribution for event times; they are useful when that distributional assumption is plausible, though the exponential model itself implies a constant hazard and does not relax the proportional hazards assumption.
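To make the handling of censoring concrete, the Kaplan-Meier product-limit calculation can be written out by hand. This is a small self-contained sketch, not the implementation behind the KAPLAN_MEIER tool; the function name `kaplan_meier` and the six-subject dataset are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(event_time, survival_probability), ...].

    times  : observed follow-up times
    events : 1 if the event occurred at that time, 0 if censored
    """
    # The survival curve only steps down at times where an event occurred.
    event_times = sorted({t for t, e in zip(times, events) if e})
    curve = []
    survival = 1.0
    for t in event_times:
        # Subjects still under observation just before time t,
        # including those censored at exactly t.
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e)
        # Product-limit update: multiply by the conditional
        # probability of surviving past t.
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical data: six subjects, two of them censored (event = 0).
times = [2, 3, 3, 5, 7, 8]
events = [1, 1, 0, 1, 0, 1]

curve = kaplan_meier(times, events)
for t, s in curve:
    print(f"t = {t}: S(t) = {s:.4f}")
```

Censored subjects never trigger a step in the curve, but they remain in the risk set until their censoring time, which is how the estimator uses partial follow-up information instead of discarding it.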

Figure 1: Statistical Model Landscapes: (A) Linear Regression showing OLS fitting a continuous response (Y) to a predictor (X) with prediction intervals. (B) Logistic Regression showing the probability of binary outcomes (0/1) across a range of predictor values, with the characteristic S-shaped sigmoid curve.

Count

Tool	Description
HURDLE_COUNT_MODEL	Fits a Hurdle model for count data with a two-stage process (zero vs. positive counts).
ZINB_MODEL	Fits a Zero-Inflated Negative Binomial (ZINB) model for overdispersed count data with excess zeros.
ZIP_MODEL	Fits a Zero-Inflated Poisson (ZIP) model for count data with excess zeros.

Discrete Choice

Tool	Description
LOGIT_MODEL	Fits a binary logistic regression model to predict binary outcomes using maximum likelihood estimation.
MULTINOMIAL_LOGIT	Fits a multinomial logistic regression model for multi-category outcomes.
ORDERED_LOGIT	Fits an ordered logistic regression model for ordinal outcomes.
PROBIT_MODEL	Fits a binary probit regression model using maximum likelihood estimation.

Generalized Linear

Tool	Description
GLM_BINOMIAL	Fits a Generalized Linear Model with binomial family for binary or proportion data.
GLM_GAMMA	Fits a Generalized Linear Model with Gamma family for positive continuous data.
GLM_INV_GAUSS	Fits a Generalized Linear Model with Inverse Gaussian family for right-skewed positive data.
GLM_NEG_BINOM	Fits a Generalized Linear Model with Negative Binomial family for overdispersed count data.
GLM_POISSON	Fits a Generalized Linear Model with Poisson family for count data.
GLM_TWEEDIE	Fits a Generalized Linear Model with Tweedie family for flexible distribution modeling.

Mixed Effects

Tool	Description
GEE_MODEL	Fits a Generalized Estimating Equations (GEE) model for correlated data.
GLMM_BINOMIAL	Fits a Generalized Linear Mixed Model (GLMM) with binomial family for binary clustered data.
GLMM_POISSON	Fits a Generalized Linear Mixed Model (GLMM) with Poisson family for count clustered data.
MIXED_LINEAR_MODEL	Fits a Linear Mixed Effects Model (LMM) with random intercepts and slopes.

Regression

Tool	Description
GLS_REGRESSION	Fits a Generalized Least Squares (GLS) regression model.
INFLUENCE_DIAG	Computes regression influence diagnostics for identifying influential observations.
OLS_DIAGNOSTICS	Performs diagnostic tests on OLS regression residuals.
OLS_REGRESSION	Fits an Ordinary Least Squares (OLS) regression model.
QUANTILE_REGRESSION	Fits a quantile regression model to estimate conditional quantiles of the response distribution.
REGRESS_DIAG	Performs comprehensive regression diagnostic tests.
ROBUST_LINEAR_MODEL	Fits a robust linear regression model using M-estimators.
SPECIFICATION_TESTS	Performs regression specification tests to detect model misspecification.
WLS_REGRESSION	Fits a Weighted Least Squares (WLS) regression model.

Survival

Tool	Description
COX_HAZARDS	Fits a Cox Proportional Hazards regression model for survival data.
EXP_SURVIVAL_REG	Fits a parametric exponential survival regression model.
KAPLAN_MEIER	Computes the Kaplan-Meier survival function estimate for time-to-event data.