GEE_MODEL

Overview

The GEE_MODEL function fits a Generalized Estimating Equations (GEE) model for analyzing correlated or clustered data, such as longitudinal studies, repeated measures, or hierarchical designs. GEE extends generalized linear models (GLMs) to handle situations where observations within clusters may be correlated, while observations between clusters are assumed to be independent.

GEE was introduced by Liang and Zeger (1986) as a method for estimating regression parameters in marginal models when the data have a grouped structure. Unlike mixed-effects models that estimate subject-specific effects, GEE focuses on estimating population-averaged (marginal) effects, making it particularly suitable for epidemiological and clinical studies where population-level interpretations are desired.

This implementation uses the statsmodels library. For detailed documentation, see the GEE class reference and the GEE user guide.

The function supports multiple distribution families (Gaussian, Binomial, Poisson, Gamma) and working correlation structures:

Exchangeable: Assumes constant correlation between all pairs of observations within a cluster
AR(1): First-order autoregressive structure, appropriate for equally-spaced longitudinal data
Independence: Assumes no within-cluster correlation (equivalent to GLM with robust standard errors)
Unstructured: Estimates a separate correlation for each pair of observations

GEE uses an iterative algorithm that alternates between estimating regression coefficients (using weighted least squares) and updating the working correlation structure. A key advantage is that parameter estimates remain consistent even if the working correlation structure is misspecified, though efficiency improves with a correctly specified structure. The function returns robust “sandwich” standard errors following Liang and Zeger (1986), which provide valid inference regardless of the true correlation structure.

For more information on the theoretical foundations, see the original paper: Liang, K-Y. and Zeger, S.L. (1986). “Longitudinal data analysis using generalized linear models.” Biometrika, 73(1): 13-22. Additional background is available on Wikipedia.

This example function is provided as-is without any representation of accuracy.

Excel Usage

=GEE_MODEL(y, x, groups, family, cov_struct, fit_intercept, alpha)

y (list[list], required): Dependent variable as a column vector.
x (list[list], required): Independent variables (predictors) as a matrix where each column is a predictor.
groups (list[list], required): Group or cluster identifiers as a column vector with integer group codes.
family (str, optional, default: “gaussian”): Distribution family for the response variable.
cov_struct (str, optional, default: “exchangeable”): Within-group covariance structure.
fit_intercept (bool, optional, default: true): If True, adds an intercept term to the model.
alpha (float, optional, default: 0.05): Significance level for confidence intervals.

Returns (list[list]): 2D list with GEE results, or error message string.

Examples

Example 1: Gaussian with exchangeable correlation

Inputs:

y	x	groups
1.2	1	1
1.8	2	1
2.3	3	2
2.9	4	2
3.4	5	3
4.1	6	3

Excel formula:

=GEE_MODEL({1.2;1.8;2.3;2.9;3.4;4.1}, {1;2;3;4;5;6}, {1;1;2;2;3;3})

Expected output:

parameter	coefficient	std_error	z_statistic	p_value	ci_lower	ci_upper
intercept	0.6541	0.008694	75.24	0	0.6371	0.6712
x1	0.5607	0.002461	227.8	0	0.5559	0.5655

Example 2: Poisson with AR1 correlation and multiple predictors

Inputs:

y	x		groups	family	cov_struct
1	1	0.5	1	poisson	ar1
2	2	0.6	1
3	3	0.7	1
2	2	0.8	2
4	4	0.9	2
5	5	1	2

Excel formula:

=GEE_MODEL({1;2;3;2;4;5}, {1,0.5;2,0.6;3,0.7;2,0.8;4,0.9;5,1}, {1;1;1;2;2;2}, "poisson", "ar1")

Expected output:

parameter	coefficient	std_error	z_statistic	p_value	ci_lower	ci_upper
intercept	-0.07104	0.0547	-1.299	0.1941	-0.1783	0.03618
x1	0.3376	0.06257	5.395	6.833e-8	0.2149	0.4602
x2	0.06159	0.3222	0.1911	0.8484	-0.5699	0.6931

Example 3: Binomial with independence correlation (no intercept)

Inputs:

y	x	groups	family	cov_struct	fit_intercept
0	1	1	binomial	independence	false
1	2	1
0	1.5	2
1	2.5	2
1	3	3
1	3.5	3

Excel formula:

=GEE_MODEL({0;1;0;1;1;1}, {1;2;1.5;2.5;3;3.5}, {1;1;2;2;3;3}, "binomial", "independence", FALSE)

Expected output:

parameter	coefficient	std_error	z_statistic	p_value	ci_lower	ci_upper
x1	0.6031	0.2233	2.701	0.006908	0.1655	1.041

Example 4: Gaussian with custom alpha for 90% confidence intervals

Inputs:

y	x	groups	family	cov_struct	fit_intercept	alpha
5.2	2	1	gaussian	exchangeable	true	0.1
7.1	3	1
6.3	2.5	2
8.2	4	2
9.1	4.5	3
10.8	5	3

Excel formula:

=GEE_MODEL({5.2;7.1;6.3;8.2;9.1;10.8}, {2;3;2.5;4;4.5;5}, {1;1;2;2;3;3}, "gaussian", "exchangeable", TRUE, 0.1)

Expected output:

parameter	coefficient	std_error	z_statistic	p_value	ci_lower	ci_upper
intercept	1.779	0.2431	7.319	2.506e-13	1.379	2.179
x1	1.715	0.05682	30.19	2.919e-200	1.622	1.809

Python Code

import statsmodels.api as sm
from statsmodels.genmod.generalized_estimating_equations import GEE as statsmodels_gee
from statsmodels.genmod import families

def gee_model(y, x, groups, family='gaussian', cov_struct='exchangeable', fit_intercept=True, alpha=0.05):
    """
    Fits a Generalized Estimating Equations (GEE) model for correlated data.

    See: https://www.statsmodels.org/stable/generated/statsmodels.genmod.generalized_estimating_equations.GEE.html

    This example function is provided as-is without any representation of accuracy.

    Args:
        y (list[list]): Dependent variable as a column vector.
        x (list[list]): Independent variables (predictors) as a matrix where each column is a predictor.
        groups (list[list]): Group or cluster identifiers as a column vector with integer group codes.
        family (str, optional): Distribution family for the response variable. Valid options: Gaussian, Binomial, Poisson, Gamma. Default is 'gaussian'.
        cov_struct (str, optional): Within-group covariance structure. Valid options: Exchangeable, AR1, Independence, Unstructured. Default is 'exchangeable'.
        fit_intercept (bool, optional): If True, adds an intercept term to the model. Default is True.
        alpha (float, optional): Significance level for confidence intervals. Default is 0.05.

    Returns:
        list[list]: 2D list with GEE results, or error message string.
    """
    def to2d(val):
        return [[val]] if not isinstance(val, list) else val

    def validate_float(val, name):
        try:
            converted = float(val)
        except Exception:
            return f"Error: Invalid input: {name} must be a number."
        if converted != converted or converted == float('inf') or converted == float('-inf'):
            return f"Error: Invalid input: {name} must be finite."
        return converted

    def validate_2d_list(data, name):
        if not isinstance(data, list):
            return f"Error: Invalid input: {name} must be a 2D list."
        if len(data) == 0:
            return f"Error: Invalid input: {name} cannot be empty."
        for row in data:
            if not isinstance(row, list):
                return f"Error: Invalid input: {name} must be a 2D list."
        return None
    try:
        # Normalize 2D list inputs
        y = to2d(y)
        x = to2d(x)
        groups = to2d(groups)

        # Validate 2D list structures
        err = validate_2d_list(y, "y")
        if err:
            return err
        err = validate_2d_list(x, "x")
        if err:
            return err
        err = validate_2d_list(groups, "groups")
        if err:
            return err

        # Extract column vectors
        y_flat = []
        for row in y:
            if len(row) != 1:
                return "Error: Invalid input: y must be a column vector (single column)."
            val = validate_float(row[0], "y element")
            if isinstance(val, str):
                return val
            y_flat.append(val)

        groups_flat = []
        for row in groups:
            if len(row) != 1:
                return "Error: Invalid input: groups must be a column vector (single column)."
            val = validate_float(row[0], "groups element")
            if isinstance(val, str):
                return val
            groups_flat.append(int(val))

        # Extract x matrix
        n_rows = len(x)
        if n_rows != len(y_flat):
            return "Error: Invalid input: x and y must have the same number of rows."
        if n_rows != len(groups_flat):
            return "Error: Invalid input: x and groups must have the same number of rows."

        n_cols = len(x[0])
        x_flat = []
        for row in x:
            if len(row) != n_cols:
                return "Error: Invalid input: x must have consistent number of columns."
            row_vals = []
            for val in row:
                converted = validate_float(val, "x element")
                if isinstance(converted, str):
                    return converted
                row_vals.append(converted)
            x_flat.append(row_vals)

        # Validate alpha
        alpha_val = validate_float(alpha, "alpha")
        if isinstance(alpha_val, str):
            return alpha_val
        if alpha_val <= 0 or alpha_val >= 1:
            return "Error: Invalid input: alpha must be between 0 and 1."

        # Validate family
        family_map = {
            'gaussian': families.Gaussian(),
            'binomial': families.Binomial(),
            'poisson': families.Poisson(),
            'gamma': families.Gamma()
        }
        if family.lower() not in family_map:
            return f"Error: Invalid input: family must be one of {list(family_map.keys())}."
        family_obj = family_map[family.lower()]

        # Validate covariance structure
        cov_struct_map = {
            'exchangeable': sm.cov_struct.Exchangeable(),
            'ar1': sm.cov_struct.Autoregressive(),
            'independence': sm.cov_struct.Independence(),
            'unstructured': sm.cov_struct.Unstructured()
        }
        if cov_struct.lower() not in cov_struct_map:
            return f"Error: Invalid input: cov_struct must be one of {list(cov_struct_map.keys())}."
        cov_struct_obj = cov_struct_map[cov_struct.lower()]

        # Add intercept if requested
        if fit_intercept:
            x_flat = [[1.0] + row for row in x_flat]

        # Fit GEE model
        try:
            model = statsmodels_gee(y_flat, x_flat, groups=groups_flat, family=family_obj, cov_struct=cov_struct_obj)
            results = model.fit()
        except Exception as exc:
            return f"Error: GEE fitting error: {exc}"

        # Extract results
        params = results.params
        std_errors = results.bse
        z_stats = results.tvalues
        p_values = results.pvalues
        conf_int = results.conf_int(alpha=alpha_val)

        # Build output table
        output = [['parameter', 'coefficient', 'std_error', 'z_statistic', 'p_value', 'ci_lower', 'ci_upper']]

        # Add parameter rows
        for i, (param, se, z, p, ci) in enumerate(zip(params, std_errors, z_stats, p_values, conf_int)):
            if fit_intercept and i == 0:
                param_name = 'intercept'
            else:
                param_idx = i if not fit_intercept else i - 1
                param_name = f'x{param_idx + 1}'
            output.append([param_name, float(param), float(se), float(z), float(p), float(ci[0]), float(ci[1])])

        return output
    except Exception as e:
        return f"Error: {str(e)}"

Overview

Excel Usage

Examples

Python Code

Online Calculator