Models

Overview

Statistical models are mathematical frameworks that represent how data is generated, enabling us to estimate relationships between variables, make predictions, and understand underlying mechanisms. Unlike simple summary statistics, models can account for confounding factors, quantify uncertainty, and test hypotheses about complex processes. This category encompasses a broad range of modeling approaches from basic linear regression to advanced mixed-effects and survival models.

Foundation and Purpose

At their core, statistical models specify assumptions about how observed data relates to unobserved parameters. The process typically involves specifying a probability model, choosing appropriate estimation methods, and assessing model fit. Python implementations of statistical models primarily rely on statsmodels, which provides a comprehensive suite of modeling tools, alongside NumPy for numerical computation and SciPy for statistical functions.
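The workflow described above (specify a probability model, estimate its parameters, assess the fit) can be sketched with NumPy and SciPy alone. This is an illustrative sketch only: the simulated data, the variable names, and the choice of a one-parameter Poisson model are assumptions made for the example, not part of any tool in this category.

```python
import numpy as np
from scipy import optimize, stats

# Simulated counts; the true rate of 4.0 is an assumption for illustration.
rng = np.random.default_rng(42)
y = rng.poisson(lam=4.0, size=500)

# Step 1: specify a probability model, here y_i ~ Poisson(lam).
def neg_log_lik(lam):
    return -np.sum(stats.poisson.logpmf(y, lam))

# Step 2: estimate the parameter by maximum likelihood.
result = optimize.minimize_scalar(
    neg_log_lik, bounds=(1e-6, 50.0), method="bounded"
)
lam_hat = result.x

# Step 3: assess fit. A Poisson model implies variance close to the mean,
# so a ratio far from 1 would suggest the model is misspecified.
dispersion = y.var(ddof=1) / lam_hat

print(f"lam_hat={lam_hat:.3f}  sample mean={y.mean():.3f}  "
      f"dispersion={dispersion:.2f}")
```

For the Poisson model the maximum likelihood estimate has a closed form (the sample mean), so the numerical optimum should agree with `y.mean()`; the optimizer is used here only to show the general pattern.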

Regression Models

The Regression subcategory includes linear models that form the foundation of statistical inference. Ordinary Least Squares (OLS) regression minimizes squared deviations between observed and predicted values, suitable for continuous outcomes with approximately normal errors. Weighted Least Squares (WLS) and Generalized Least Squares (GLS) extend OLS to handle heteroscedasticity and correlation structures. Quantile regression models relationships at different points of the conditional distribution, useful when the effect of predictors varies across quantiles. Beyond standard linear models, robust regression uses M-estimators that downweight outliers, and specification tests help identify model misspecification before relying on results.

Generalized Linear Models

The Generalized Linear subcategory extends linear models to non-normal response distributions through link functions, which relate the mean of the response to a linear combination of predictors. Generalized Linear Models (GLM) are available with several families: binomial for binary or proportion data, Poisson for counts, Gamma and Inverse Gaussian for positive right-skewed data, Negative Binomial for overdispersed counts, and Tweedie, whose variance power parameter spans a range of distributions that includes Poisson and Gamma as special cases. These models are estimated by maximum likelihood and are fundamental for handling outcomes beyond continuous, normally distributed responses.

Discrete Choice Models

When the response is categorical, the Discrete Choice subcategory provides specialized approaches. Logistic regression models binary outcomes, multinomial logit extends this to multiple unordered categories, ordered logit respects ordinal structure in categorical outcomes, and probit models offer an alternative parameterization using the normal cumulative distribution function. These models are essential for classification problems and understanding how predictors influence categorical choices.

Count Data Models

The Count subcategory addresses challenges in modeling non-negative integer responses. Standard Poisson models assume the mean equals the variance, but data often exhibits overdispersion (variance exceeds mean) or excess zeros. Zero-Inflated Poisson (ZIP) and Zero-Inflated Negative Binomial (ZINB) models accommodate excess zeros through mixture processes. Hurdle models explicitly separate the process determining whether a count is zero from the distribution of positive counts, providing interpretable two-stage modeling.
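Before reaching for a zero-inflated or hurdle model, it can help to check whether the data actually show overdispersion or excess zeros. The sketch below computes two simple diagnostics with NumPy on simulated zero-inflated counts. The mixing probability, the Poisson rate, and the variable names are assumptions for the example.

```python
import numpy as np

# Simulate zero-inflated counts: with probability pi_zero the count is a
# structural zero, otherwise it is drawn from Poisson(lam).
rng = np.random.default_rng(3)
n = 5000
pi_zero = 0.3   # assumed probability of a structural zero
lam = 3.0       # assumed Poisson rate for the count process
is_structural_zero = rng.random(n) < pi_zero
y = np.where(is_structural_zero, 0, rng.poisson(lam, size=n))

mean_y = y.mean()

# Diagnostic 1: variance-to-mean ratio (about 1 under a Poisson model).
dispersion_index = y.var(ddof=1) / mean_y

# Diagnostic 2: observed share of zeros vs. the share a Poisson model
# with the same mean would imply, exp(-mean).
observed_zero_frac = np.mean(y == 0)
poisson_zero_frac = np.exp(-mean_y)

print(f"dispersion index: {dispersion_index:.2f}")
print(f"zeros observed: {observed_zero_frac:.3f}, "
      f"Poisson-implied: {poisson_zero_frac:.3f}")
```

A dispersion index well above 1 together with many more zeros than the Poisson-implied share points toward a zero-inflated or hurdle specification; overdispersion without excess zeros points toward a Negative Binomial model instead.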

Mixed Effects Models

Clustered or hierarchical data requires accounting for within-group correlation. The Mixed Effects subcategory includes Linear Mixed Models (LMM) with random intercepts and slopes for continuous outcomes, Generalized Linear Mixed Models (GLMM) extending GLM to hierarchical structures with binomial or Poisson families, and Generalized Estimating Equations (GEE) for population-averaged inference in correlated data. Mixed models simultaneously estimate fixed population-level effects and random group-specific deviations, whereas GEE estimates population-averaged effects only and handles within-group correlation through a working correlation structure.

Survival Analysis

The Survival subcategory handles time-to-event data with censoring. Kaplan-Meier estimators provide non-parametric estimates of the survival function at each observed event time. Cox Proportional Hazards models extend this to regression by modeling how covariates affect the instantaneous risk of events while making minimal distributional assumptions about the baseline hazard. Parametric survival models (e.g., exponential, which assumes a constant hazard) impose a specific distribution on event times and are useful when that distributional form is plausible.
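The Kaplan-Meier estimate is a running product over event times: at each time where an event occurs, the current survival probability is multiplied by one minus the ratio of events to subjects still at risk. The sketch below implements that product in plain Python. The function name `kaplan_meier` and the toy data are hypothetical, and this is not the implementation behind the KAPLAN_MEIER tool.

```python
def kaplan_meier(times, events):
    """Return [(event_time, survival_probability), ...].

    times  : observed follow-up times
    events : 1 if the event occurred at that time, 0 if censored
    """
    # Survival only drops at times where at least one event occurred.
    event_times = sorted({t for t, e in zip(times, events) if e})
    survival = 1.0
    curve = []
    for t in event_times:
        # Subjects still under observation just before time t.
        at_risk = sum(1 for u in times if u >= t)
        # Events occurring exactly at time t.
        n_events = sum(1 for u, e in zip(times, events) if u == t and e)
        survival *= 1.0 - n_events / at_risk
        curve.append((t, survival))
    return curve


# Toy data (hypothetical): six subjects, two of them censored.
times = [2, 3, 3, 5, 7, 8]
events = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(times, events)

for t, s in curve:
    print(f"t={t}: S(t)={s:.4f}")
```

Censored subjects contribute to the at-risk count until their censoring time but never trigger a drop in the curve, which is how the estimator uses partial follow-up information.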

Figure 1: Statistical Model Landscapes: (A) Linear Regression showing OLS fitting a continuous response (Y) to a predictor (X) with prediction intervals. (B) Logistic Regression showing the probability of binary outcomes (0/1) across a range of predictor values, with the characteristic S-shaped sigmoid curve.

Count

Tool Description
HURDLE_COUNT_MODEL Fits a Hurdle model for count data with a two-stage process (zero vs. positive counts).
ZINB_MODEL Fits a Zero-Inflated Negative Binomial (ZINB) model for overdispersed count data with excess zeros.
ZIP_MODEL Fits a Zero-Inflated Poisson (ZIP) model for count data with excess zeros.

Discrete Choice

Tool Description
LOGIT_MODEL Fits a binary logistic regression model to predict binary outcomes using maximum likelihood estimation.
MULTINOMIAL_LOGIT Fits a multinomial logistic regression model for multi-category outcomes.
ORDERED_LOGIT Fits an ordered logistic regression model for ordinal outcomes.
PROBIT_MODEL Fits a binary probit regression model using maximum likelihood estimation.

Generalized Linear

Tool Description
GLM_BINOMIAL Fits a Generalized Linear Model with binomial family for binary or proportion data.
GLM_GAMMA Fits a Generalized Linear Model with Gamma family for positive continuous data.
GLM_INV_GAUSS Fits a Generalized Linear Model with Inverse Gaussian family for right-skewed positive data.
GLM_NEG_BINOM Fits a Generalized Linear Model with Negative Binomial family for overdispersed count data.
GLM_POISSON Fits a Generalized Linear Model with Poisson family for count data.
GLM_TWEEDIE Fits a Generalized Linear Model with Tweedie family for flexible distribution modeling.

Mixed Effects

Tool Description
GEE_MODEL Fits a Generalized Estimating Equations (GEE) model for correlated data.
GLMM_BINOMIAL Fits a Generalized Linear Mixed Model (GLMM) with binomial family for binary clustered data.
GLMM_POISSON Fits a Generalized Linear Mixed Model (GLMM) with Poisson family for count clustered data.
MIXED_LINEAR_MODEL Fits a Linear Mixed Effects Model (LMM) with random intercepts and slopes.

Regression

Tool Description
GLS_REGRESSION Fits a Generalized Least Squares (GLS) regression model.
INFLUENCE_DIAG Computes regression influence diagnostics for identifying influential observations.
OLS_DIAGNOSTICS Performs diagnostic tests on OLS regression residuals.
OLS_REGRESSION Fits an Ordinary Least Squares (OLS) regression model.
QUANTILE_REGRESSION Fits a quantile regression model to estimate conditional quantiles of the response distribution.
REGRESS_DIAG Performs comprehensive regression diagnostic tests.
ROBUST_LINEAR_MODEL Fits a robust linear regression model using M-estimators.
SPECIFICATION_TESTS Performs regression specification tests to detect model misspecification.
WLS_REGRESSION Fits a Weighted Least Squares (WLS) regression model.

Survival

Tool Description
COX_HAZARDS Fits a Cox Proportional Hazards regression model for survival data.
EXP_SURVIVAL_REG Fits a parametric exponential survival regression model.
KAPLAN_MEIER Computes the Kaplan-Meier survival function estimate for time-to-event data.