Models
Overview
Statistical models are mathematical frameworks that represent how data is generated, enabling us to estimate relationships between variables, make predictions, and understand underlying mechanisms. Unlike simple summary statistics, models can account for confounding factors, quantify uncertainty, and test hypotheses about complex processes. This category encompasses a broad range of modeling approaches from basic linear regression to advanced mixed-effects and survival models.
Foundation and Purpose
At their core, statistical models specify assumptions about how observed data relates to unobserved parameters. The process typically involves specifying a probability model, choosing appropriate estimation methods, and assessing model fit. Python implementations of statistical models primarily rely on statsmodels, which provides a comprehensive suite of modeling tools, alongside NumPy for numerical computation and SciPy for statistical functions.
Regression Models
The Regression subcategory includes linear models that form the foundation of statistical inference. Ordinary Least Squares (OLS) regression minimizes the sum of squared deviations between observed and predicted values and suits continuous outcomes with approximately normal, constant-variance errors. Weighted Least Squares (WLS) extends OLS to heteroscedastic errors through observation weights, and Generalized Least Squares (GLS) additionally handles correlated errors. Quantile regression models relationships at different points of the conditional distribution, which is useful when the effect of predictors varies across quantiles. Beyond standard linear models, robust regression uses M-estimators that downweight outliers, and specification tests help identify model misspecification before relying on results.
Generalized Linear Models
The Generalized Linear subcategory extends linear models to non-normal response distributions through link functions. Generalized Linear Models (GLM) support several families: binomial for binary or proportion data, Poisson for counts, Negative Binomial for overdispersed counts, Gamma and Inverse Gaussian for positive right-skewed data, and Tweedie for a flexible family that can accommodate, for example, non-negative data with exact zeros. These models are estimated by maximum likelihood and are fundamental for outcomes that are not continuous and normally distributed.
Discrete Choice Models
When the response is categorical, the Discrete Choice subcategory provides specialized approaches. Logistic regression models binary outcomes, multinomial logit extends this to multiple unordered categories, ordered logit respects ordinal structure in categorical outcomes, and probit models offer an alternative parameterization using the normal cumulative distribution function. These models are essential for classification problems and understanding how predictors influence categorical choices.
Count Data Models
The Count subcategory addresses challenges in modeling non-negative integer responses. Standard Poisson models assume the mean equals the variance, but data often exhibits overdispersion (variance exceeds mean) or excess zeros. Zero-Inflated Poisson (ZIP) and Zero-Inflated Negative Binomial (ZINB) models accommodate excess zeros through mixture processes. Hurdle models explicitly separate the process determining whether a count is zero from the distribution of positive counts, providing interpretable two-stage modeling.
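The difference between the zero-inflated and hurdle formulations is easiest to see in their probability mass functions. The sketch below writes both out in plain Python for a Poisson base distribution; the function names and the parameter values (`lam = 2.0`, `pi = 0.3`) are hypothetical and chosen only for illustration.

```python
import math

def poisson_pmf(k, lam):
    """Standard Poisson probability of observing count k."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def zip_pmf(k, lam, pi):
    """Zero-inflated Poisson: a structural zero with probability pi,
    otherwise a draw from Poisson(lam), which can also be zero."""
    base = (1.0 - pi) * poisson_pmf(k, lam)
    return pi + base if k == 0 else base

def hurdle_pmf(k, lam, p_zero):
    """Hurdle Poisson: zero with probability p_zero; positive counts
    come from a Poisson(lam) truncated to exclude zero."""
    if k == 0:
        return p_zero
    return (1.0 - p_zero) * poisson_pmf(k, lam) / (1.0 - math.exp(-lam))

lam, pi = 2.0, 0.3
support = range(50)  # the tail beyond 50 is negligible for lam = 2

# Mean and variance of the ZIP distribution by direct summation
zip_mean = sum(k * zip_pmf(k, lam, pi) for k in support)
zip_var = sum(k**2 * zip_pmf(k, lam, pi) for k in support) - zip_mean**2
```

The ZIP variance exceeds its mean even though the base distribution is Poisson, which is why excess zeros show up as overdispersion in a standard Poisson fit. In the hurdle model every zero comes from the first stage, whereas in the ZIP model zeros can come from either component of the mixture.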
Mixed Effects Models
Clustered or hierarchical data requires accounting for within-group correlation. The Mixed Effects subcategory includes Linear Mixed Models (LMM) with random intercepts and slopes for continuous outcomes, Generalized Linear Mixed Models (GLMM) extending GLM to hierarchical structures with binomial or Poisson families, and Generalized Estimating Equations (GEE) for population-averaged inference in correlated data. LMM and GLMM simultaneously model fixed population effects and random group-specific deviations, while GEE models the population-averaged mean and handles within-group correlation through a working correlation structure.
Survival Analysis
The Survival subcategory handles time-to-event data with censoring. Kaplan-Meier estimators provide non-parametric estimates of the survival function at each observed event time. Cox Proportional Hazards models extend this to regression by modeling how covariates affect the instantaneous risk of events while making minimal distributional assumptions about the baseline hazard. Parametric survival models (e.g., exponential) assume a specific distribution for event times, which can give more efficient estimates and allows extrapolation when the assumed distribution is appropriate.
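The Kaplan-Meier estimate is simple enough to compute by hand, which helps show how censoring enters the calculation. The following is a minimal pure-Python sketch, not the implementation used by any tool on this page; the function name `kaplan_meier` and the toy data are hypothetical.

```python
def kaplan_meier(durations, events):
    """Return a list of (time, survival probability) at each distinct event time.

    events[i] is True if the event was observed at durations[i],
    and False if the observation was censored at that time.
    """
    data = sorted(zip(durations, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Collect everyone whose follow-up ends at time t
        while i < len(data) and data[i][0] == t:
            deaths += int(data[i][1])
            removed += 1
            i += 1
        # Survival only drops at times with at least one observed event
        if deaths > 0:
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        # Censored subjects leave the risk set without lowering survival
        n_at_risk -= removed
    return curve

# Toy data (assumed): six subjects, two of them censored
durations = [2, 3, 3, 5, 6, 8]
events = [True, True, False, True, False, True]
curve = kaplan_meier(durations, events)
```

Censored subjects count toward the number at risk up to their censoring time, so they still contribute information even though their event time is never observed.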
Count
| Tool | Description |
|---|---|
| HURDLE_COUNT_MODEL | Fits a Hurdle model for count data with a two-stage process (zero vs. positive counts). |
| ZINB_MODEL | Fits a Zero-Inflated Negative Binomial (ZINB) model for overdispersed count data with excess zeros. |
| ZIP_MODEL | Fits a Zero-Inflated Poisson (ZIP) model for count data with excess zeros. |
Discrete Choice
| Tool | Description |
|---|---|
| LOGIT_MODEL | Fits a binary logistic regression model to predict binary outcomes using maximum likelihood estimation. |
| MULTINOMIAL_LOGIT | Fits a multinomial logistic regression model for multi-category outcomes. |
| ORDERED_LOGIT | Fits an ordered logistic regression model for ordinal outcomes. |
| PROBIT_MODEL | Fits a binary probit regression model using maximum likelihood estimation. |
Generalized Linear
| Tool | Description |
|---|---|
| GLM_BINOMIAL | Fits a Generalized Linear Model with binomial family for binary or proportion data. |
| GLM_GAMMA | Fits a Generalized Linear Model with Gamma family for positive continuous data. |
| GLM_INV_GAUSS | Fits a Generalized Linear Model with Inverse Gaussian family for right-skewed positive data. |
| GLM_NEG_BINOM | Fits a Generalized Linear Model with Negative Binomial family for overdispersed count data. |
| GLM_POISSON | Fits a Generalized Linear Model with Poisson family for count data. |
| GLM_TWEEDIE | Fits a Generalized Linear Model with Tweedie family for flexible distribution modeling. |
Mixed Effects
| Tool | Description |
|---|---|
| GEE_MODEL | Fits a Generalized Estimating Equations (GEE) model for correlated data. |
| GLMM_BINOMIAL | Fits a Generalized Linear Mixed Model (GLMM) with binomial family for binary clustered data. |
| GLMM_POISSON | Fits a Generalized Linear Mixed Model (GLMM) with Poisson family for count clustered data. |
| MIXED_LINEAR_MODEL | Fits a Linear Mixed Effects Model (LMM) with random intercepts and slopes. |
Regression
| Tool | Description |
|---|---|
| GLS_REGRESSION | Fits a Generalized Least Squares (GLS) regression model. |
| INFLUENCE_DIAG | Computes regression influence diagnostics for identifying influential observations. |
| OLS_DIAGNOSTICS | Performs diagnostic tests on OLS regression residuals. |
| OLS_REGRESSION | Fits an Ordinary Least Squares (OLS) regression model. |
| QUANTILE_REGRESSION | Fits a quantile regression model to estimate conditional quantiles of the response distribution. |
| REGRESS_DIAG | Performs comprehensive regression diagnostic tests. |
| ROBUST_LINEAR_MODEL | Fits a robust linear regression model using M-estimators. |
| SPECIFICATION_TESTS | Performs regression specification tests to detect model misspecification. |
| WLS_REGRESSION | Fits a Weighted Least Squares (WLS) regression model. |
Survival
| Tool | Description |
|---|---|
| COX_HAZARDS | Fits a Cox Proportional Hazards regression model for survival data. |
| EXP_SURVIVAL_REG | Fits a parametric exponential survival regression model. |
| KAPLAN_MEIER | Computes the Kaplan-Meier survival function estimate for time-to-event data. |