GLM_BINOMIAL

Overview

The GLM_BINOMIAL function fits a Generalized Linear Model (GLM) with a binomial family distribution, designed for modeling binary outcomes (0/1) or proportion data (values between 0 and 1). This type of model is fundamental in fields such as epidemiology, marketing, and social sciences where the response variable represents a probability or binary classification.

GLMs extend ordinary linear regression by allowing the response variable to follow a distribution from the exponential family, such as the binomial, Poisson, or gamma distribution. For binomial data, the model relates the expected probability \mu to the linear predictor \eta = X\beta through a link function g:

g(\mu) = X\beta \quad \text{or equivalently} \quad \mu = g^{-1}(X\beta)

This implementation supports multiple link functions. The default logit link is the most common choice for binomial regression (logistic regression):

\text{logit}(\mu) = \log\left(\frac{\mu}{1-\mu}\right)

Other supported links are probit (the inverse of the standard normal CDF), cloglog (complementary log-log), log, and cauchy. Each maps the probability scale to the linear predictor differently: probit and cauchy are symmetric around 0.5 like logit (cauchy with heavier tails), cloglog is asymmetric, and log gives coefficients that read as log relative risks but does not by itself keep fitted values below 1.
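To make the difference between the links concrete, the sketch below evaluates the inverse of each link at a given linear predictor using only the Python standard library. The dictionary `INVERSE_LINKS` and the helper `compare_links` are names chosen for this illustration; they are not part of GLM_BINOMIAL or statsmodels.

```python
import math
from statistics import NormalDist

# Inverse link functions: each maps a linear predictor eta to a mean mu.
INVERSE_LINKS = {
    "logit": lambda eta: 1.0 / (1.0 + math.exp(-eta)),
    "probit": lambda eta: NormalDist().cdf(eta),
    "cloglog": lambda eta: 1.0 - math.exp(-math.exp(eta)),
    "log": lambda eta: math.exp(eta),
    "cauchy": lambda eta: 0.5 + math.atan(eta) / math.pi,
}

def compare_links(eta):
    """Return the mean implied by each link for one linear predictor value."""
    return {name: inverse(eta) for name, inverse in INVERSE_LINKS.items()}

at_zero = compare_links(0.0)        # symmetric links give 0.5 here
at_minus_two = compare_links(-2.0)  # the links disagree most in the tails

for name in INVERSE_LINKS:
    print(f"{name:8s} eta=0: {at_zero[name]:.4f}   eta=-2: {at_minus_two[name]:.4f}")
```

At a linear predictor of 0 the logit, probit, and cauchy links all give a probability of 0.5, while cloglog gives 1 - exp(-1) and the log link gives 1, which shows why the log link needs a linear predictor at or below zero to stay on the probability scale.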

Model parameters are estimated by maximum likelihood using Iteratively Reweighted Least Squares (IRLS). The function returns coefficient estimates, standard errors, z-statistics, p-values, and confidence intervals. For the logit link it also reports odds ratios, computed as exp(coefficient), which give the multiplicative change in the odds for a one-unit increase in each predictor; for other links the odds_ratio column is left blank. Model fit statistics include deviance, Pearson chi-squared, AIC, BIC (log-likelihood based), and log-likelihood.
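The following is a minimal sketch of the IRLS idea for the logit link, restricted to an intercept plus one predictor so that the weighted least-squares step reduces to a 2x2 system solvable in plain Python. It is not how statsmodels implements the fit, and the function names `_weighted_sums` and `irls_logit` are hypothetical. The data are the eight observations used in the first example on this page.

```python
import math

def _weighted_sums(x, y, b0, b1):
    """Return the entries of X'WX and X'Wz for the current coefficients."""
    s_w = s_wx = s_wxx = s_wz = s_wxz = 0.0
    for xi, yi in zip(x, y):
        eta = b0 + b1 * xi
        mu = 1.0 / (1.0 + math.exp(-eta))
        w = mu * (1.0 - mu)          # IRLS weight for the logit link
        z = eta + (yi - mu) / w      # working response
        s_w += w
        s_wx += w * xi
        s_wxx += w * xi * xi
        s_wz += w * z
        s_wxz += w * xi * z
    return s_w, s_wx, s_wxx, s_wz, s_wxz

def irls_logit(x, y, max_iter=50, tol=1e-10):
    """Fit logit(mu) = b0 + b1 * x; return (b0, b1, se_b0, se_b1)."""
    b0 = b1 = 0.0
    for _ in range(max_iter):
        s_w, s_wx, s_wxx, s_wz, s_wxz = _weighted_sums(x, y, b0, b1)
        det = s_w * s_wxx - s_wx * s_wx
        # Solve the 2x2 weighted least-squares system (X'WX) beta = X'Wz
        new_b0 = (s_wxx * s_wz - s_wx * s_wxz) / det
        new_b1 = (s_w * s_wxz - s_wx * s_wz) / det
        converged = abs(new_b0 - b0) + abs(new_b1 - b1) < tol
        b0, b1 = new_b0, new_b1
        if converged:
            break
    # Standard errors from the diagonal of the inverse of X'WX at the final estimates
    s_w, s_wx, s_wxx, _, _ = _weighted_sums(x, y, b0, b1)
    det = s_w * s_wxx - s_wx * s_wx
    return b0, b1, math.sqrt(s_wxx / det), math.sqrt(s_w / det)

x = [1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5]
y = [0, 0, 0, 1, 0, 1, 1, 1]
b0, b1, se0, se1 = irls_logit(x, y)
odds_ratio = math.exp(b1)  # multiplicative change in odds per unit of x
print(b0, b1, se0, se1, odds_ratio)
```

Each pass recomputes the weights and working response from the current coefficients and then solves a weighted least-squares problem, which is where the name of the method comes from. The sketch has no safeguards for perfectly separated data, where the weights approach zero and the estimates diverge.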

This implementation uses the statsmodels library. For more details, see the GLM documentation and binomial family reference. The statsmodels GitHub repository provides source code and additional examples.

This example function is provided as-is without any representation of accuracy.

Excel Usage

=GLM_BINOMIAL(y, x, glm_binomial_link, fit_intercept, alpha)
  • y (list[list], required): Dependent variable as a column vector. For binary data, values should be 0 or 1. For proportion data, values should be between 0 and 1.
  • x (list[list], required): Independent variables (predictors) as a matrix. Each column is a predictor variable, and each row corresponds to an observation.
  • glm_binomial_link (str, optional, default: “logit”): Link function to use for the binomial GLM. Valid options (lowercase, case-sensitive): logit, probit, cloglog, log, cauchy.
  • fit_intercept (bool, optional, default: true): If True, includes an intercept term in the model.
  • alpha (float, optional, default: 0.05): Significance level for confidence intervals; must be strictly between 0 and 1. Intervals are reported at the 100 × (1 − alpha)% confidence level.

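Returns (list[list]): 2D list containing a header row, one row per fitted parameter, and one row each for deviance, pearson_chi2, aic, bic, and log_likelihood; or an error string if the inputs are invalid or the fit fails.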
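Because the result mixes coefficient rows and fit-statistic rows in one 2D list, callers working in Python may want to separate them. The sketch below shows one way to do that; `split_glm_result` is a hypothetical helper, not something the function provides, and the sample table reuses the values from the fourth example on this page.

```python
def split_glm_result(table):
    """Split a GLM_BINOMIAL-style 2D list into coefficient rows and fit statistics."""
    if isinstance(table, str):
        # Failures are reported as a plain error string rather than a table
        raise ValueError(table)
    header, *rows = table
    stat_names = {"deviance", "pearson_chi2", "aic", "bic", "log_likelihood"}
    coefficients, statistics = {}, {}
    for row in rows:
        name = row[0]
        if name in stat_names:
            statistics[name] = row[1]
        else:
            coefficients[name] = dict(zip(header[1:], row[1:]))
    return coefficients, statistics

table = [
    ["parameter", "coefficient", "std_error", "z_statistic", "p_value",
     "ci_lower", "ci_upper", "odds_ratio"],
    ["x1", 0.2065, 0.2543, 0.8117, 0.4169, -0.2119, 0.6248, 1.2293],
    ["deviance", 10.3844, "", "", "", "", "", ""],
    ["pearson_chi2", 7.8764, "", "", "", "", "", ""],
    ["aic", 12.3844, "", "", "", "", "", ""],
    ["bic", 12.4638, "", "", "", "", "", ""],
    ["log_likelihood", -5.1922, "", "", "", "", "", ""],
]

coefficients, statistics = split_glm_result(table)
print(coefficients["x1"]["p_value"], statistics["aic"])
```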

Examples

Example 1: Logit with single predictor

Inputs:

y x
0 1
0 1.5
0 2
1 2.5
0 3
1 3.5
1 4
1 4.5

Excel formula:

=GLM_BINOMIAL({0;0;0;1;0;1;1;1}, {1;1.5;2;2.5;3;3.5;4;4.5})

Expected output:

parameter coefficient std_error z_statistic p_value ci_lower ci_upper odds_ratio
intercept -7.0526 4.8673 -1.449 0.1473 -16.5924 2.4872 0.0009
x1 2.5646 1.7208 1.4903 0.1361 -0.8082 5.9373 12.9953
deviance 5.0061
pearson_chi2 4.1931
aic 9.0061
bic 9.165
log_likelihood -2.503
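
The figures in this output are tied together by a few standard identities, which the short check below recomputes from the rounded values shown above. It is only an arithmetic illustration: the variable names are chosen here, and small differences from the table come from rounding. The BIC formula used is the log-likelihood based one.

```python
import math

n_obs, n_params = 8, 2             # eight rows; intercept plus x1
log_likelihood = -2.503
coef, std_error = 2.5646, 1.7208   # the x1 row

aic = -2 * log_likelihood + 2 * n_params
bic = -2 * log_likelihood + n_params * math.log(n_obs)
z_statistic = coef / std_error
odds_ratio = math.exp(coef)

print(round(aic, 4), round(bic, 4), round(z_statistic, 4), round(odds_ratio, 4))
```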

Example 2: Logit with proportions and two predictors

Inputs:

y x1 x2
0.1 1 2
0.2 1.5 2.5
0.35 2 3
0.45 2.5 2.5
0.55 3 3.5
0.65 3.5 3
0.75 4 4
0.85 4.5 4.5

Excel formula:

=GLM_BINOMIAL({0.1;0.2;0.35;0.45;0.55;0.65;0.75;0.85}, {1,2;1.5,2.5;2,3;2.5,2.5;3,3.5;3.5,3;4,4;4.5,4.5})

Expected output:

parameter coefficient std_error z_statistic p_value ci_lower ci_upper odds_ratio
intercept -2.9587 4.1871 -0.7066 0.4798 -11.1652 5.2478 0.0519
x1 0.9781 1.598 0.612 0.5405 -2.154 4.1101 2.6593
x2 0.0656 2.3438 0.028 0.9777 -4.5282 4.6593 1.0678
deviance 0.0287
pearson_chi2 0.028
aic 12.0468
bic 12.2852
log_likelihood -3.0234
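
To see what these coefficients imply, the sketch below pushes each input row through the inverse logit and compares the fitted probabilities with the observed proportions. It uses the rounded coefficients from the table above, so the results are approximate; `predict` is a hypothetical helper written for this illustration.

```python
import math

intercept, b1, b2 = -2.9587, 0.9781, 0.0656
rows = [(1, 2), (1.5, 2.5), (2, 3), (2.5, 2.5), (3, 3.5), (3.5, 3), (4, 4), (4.5, 4.5)]
observed = [0.1, 0.2, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85]

def predict(x1, x2):
    """Fitted probability: inverse logit of the linear predictor."""
    eta = intercept + b1 * x1 + b2 * x2
    return 1.0 / (1.0 + math.exp(-eta))

fitted = [predict(x1, x2) for x1, x2 in rows]
residuals = [obs - fit for obs, fit in zip(observed, fitted)]

for obs, fit in zip(observed, fitted):
    print(f"observed {obs:.2f}  fitted {fit:.4f}")
```

With an intercept and the logit link, the fitted probabilities sum to approximately the same total as the observed proportions, which is a quick sanity check on a reported fit.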

Example 3: Probit link with single predictor

Inputs:

y x glm_binomial_link
0 1 probit
0 1.5
0 2
1 2.5
0 3
1 3.5
1 4
1 4.5

Excel formula:

=GLM_BINOMIAL({0;0;0;1;0;1;1;1}, {1;1.5;2;2.5;3;3.5;4;4.5}, "probit")

Expected output:

parameter coefficient std_error z_statistic p_value ci_lower ci_upper odds_ratio
intercept -4.2877 2.67 -1.6059 0.1083 -9.5209 0.9455
x1 1.5592 0.941 1.657 0.0975 -0.2851 3.4035
deviance 4.8514
pearson_chi2 4.0761
aic 8.8514
bic 9.0103
log_likelihood -2.4257
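
The p_value column is a two-sided test of each z-statistic against the standard normal distribution. The sketch below recomputes the two p-values above from the rounded z-statistics with the standard library; `two_sided_p_value` is a name chosen for this illustration.

```python
from statistics import NormalDist

def two_sided_p_value(z):
    """Probability of a standard normal value at least as extreme as z."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

p_intercept = two_sided_p_value(-1.6059)
p_x1 = two_sided_p_value(1.657)
print(round(p_intercept, 4), round(p_x1, 4))
```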

Example 4: Logit without intercept and custom alpha

Inputs:

y x fit_intercept alpha
0 1 false 0.1
0 1.5
0 2
1 2.5
0 3
1 3.5
1 4
1 4.5

Excel formula:

=GLM_BINOMIAL({0;0;0;1;0;1;1;1}, {1;1.5;2;2.5;3;3.5;4;4.5}, "logit", FALSE, 0.1)

Expected output:

parameter coefficient std_error z_statistic p_value ci_lower ci_upper odds_ratio
x1 0.2065 0.2543 0.8117 0.4169 -0.2119 0.6248 1.2293
deviance 10.3844
pearson_chi2 7.8764
aic 12.3844
bic 12.4638
log_likelihood -5.1922
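
With alpha set to 0.1, the interval bounds correspond to a 90% Wald interval: the coefficient plus or minus the standard normal critical value times the standard error. The sketch below recomputes the bounds from the rounded values above; `wald_interval` is a hypothetical helper, and the last digit can differ slightly because of that rounding.

```python
from statistics import NormalDist

def wald_interval(coef, std_error, alpha=0.05):
    """Two-sided normal-approximation interval at the 100 * (1 - alpha)% level."""
    critical = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return coef - critical * std_error, coef + critical * std_error

ci_lower, ci_upper = wald_interval(0.2065, 0.2543, alpha=0.1)
print(round(ci_lower, 4), round(ci_upper, 4))
```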

Python Code

import math
import statsmodels.api as sm
from statsmodels.genmod.families import Binomial as sm_Binomial
from statsmodels.genmod.generalized_linear_model import SET_USE_BIC_LLF

SET_USE_BIC_LLF(True)

def glm_binomial(y, x, glm_binomial_link='logit', fit_intercept=True, alpha=0.05):
    """
    Fits a Generalized Linear Model with binomial family for binary or proportion data.

    See: https://www.statsmodels.org/stable/generated/statsmodels.genmod.generalized_linear_model.GLM.html

    This example function is provided as-is without any representation of accuracy.

    Args:
        y (list[list]): Dependent variable as a column vector. For binary data, values should be 0 or 1. For proportion data, values should be between 0 and 1.
        x (list[list]): Independent variables (predictors) as a matrix. Each column is a predictor variable, and each row corresponds to an observation.
        glm_binomial_link (str, optional): Link function to use for the binomial GLM. Valid options (lowercase, case-sensitive): 'logit', 'probit', 'cloglog', 'log', 'cauchy'. Default is 'logit'.
        fit_intercept (bool, optional): If True, includes an intercept term in the model. Default is True.
        alpha (float, optional): Significance level for confidence intervals (between 0 and 1). Default is 0.05.

    Returns:
        list[list]: 2D list with GLM results and statistics, or error string.
    """
    def to2d(val):
        return [[val]] if not isinstance(val, list) else val

    def validate_numeric_2d(data, name):
        if not isinstance(data, list):
            return f"Error: Invalid input: {name} must be a 2D list."
        if len(data) == 0:
            return f"Error: Invalid input: {name} cannot be empty."
        for i, row in enumerate(data):
            if not isinstance(row, list):
                return f"Error: Invalid input: {name} must be a 2D list."
            if len(row) == 0:
                return f"Error: Invalid input: {name} rows cannot be empty."
            for j, val in enumerate(row):
                if not isinstance(val, (int, float)):
                    return f"Error: Invalid input: {name}[{i}][{j}] must be numeric."
                if math.isnan(val) or math.isinf(val):
                    return f"Error: Invalid input: {name}[{i}][{j}] must be finite."
        return None

    try:
        # Normalize inputs
        y = to2d(y)
        x = to2d(x)

        # Validate inputs
        err = validate_numeric_2d(y, 'y')
        if err:
            return err
        err = validate_numeric_2d(x, 'x')
        if err:
            return err

        # Check y is a column vector
        if len(y[0]) != 1:
            return "Error: Invalid input: y must be a column vector (single column)."

        # Check dimensions match
        n_obs_y = len(y)
        n_obs_x = len(x)
        if n_obs_y != n_obs_x:
            return "Error: Invalid input: y and x must have the same number of rows."

        # Validate alpha
        if not isinstance(alpha, (int, float)):
            return "Error: Invalid input: alpha must be numeric."
        if math.isnan(alpha) or math.isinf(alpha):
            return "Error: Invalid input: alpha must be finite."
        if alpha <= 0 or alpha >= 1:
            return "Error: Invalid input: alpha must be between 0 and 1."

        # Validate link function
        valid_links = ['logit', 'probit', 'cloglog', 'log', 'cauchy']
        if not isinstance(glm_binomial_link, str):
            return "Error: Invalid input: glm_binomial_link must be a string."
        if glm_binomial_link not in valid_links:
            return f"Error: Invalid input: glm_binomial_link must be one of {valid_links}."

        # Convert to flat list for y
        y_flat = [row[0] for row in y]

        # Check y values are in valid range [0, 1]
        for i, val in enumerate(y_flat):
            if val < 0 or val > 1:
                return f"Error: Invalid input: y[{i}] = {val} must be between 0 and 1."

        # Get number of columns in x
        n_cols = len(x[0])
        for i, row in enumerate(x):
            if len(row) != n_cols:
                return "Error: Invalid input: all rows in x must have the same number of columns."

        # Convert x to list of columns
        x_data = []
        for col_idx in range(n_cols):
            x_data.append([x[row_idx][col_idx] for row_idx in range(n_obs_x)])

        # Add intercept if requested
        if fit_intercept:
            x_data.insert(0, [1.0] * n_obs_x)

        # Transpose to get design matrix
        design_matrix = []
        for row_idx in range(n_obs_x):
            design_matrix.append([col[row_idx] for col in x_data])

        # Create link object
        try:
            if glm_binomial_link == 'logit':
                link = sm.families.links.Logit()
            elif glm_binomial_link == 'probit':
                link = sm.families.links.Probit()
            elif glm_binomial_link == 'cloglog':
                link = sm.families.links.CLogLog()
            elif glm_binomial_link == 'log':
                link = sm.families.links.Log()
            elif glm_binomial_link == 'cauchy':
                link = sm.families.links.Cauchy()
        except Exception as e:
            return f"Error: Invalid input: unable to create link function: {e}"

        # Fit GLM
        try:
            family = sm_Binomial(link=link)
            model = sm.GLM(y_flat, design_matrix, family=family)
            result = model.fit()
        except Exception as e:
            return f"Error: statsmodels.GLM error: {e}"

        # Extract results
        try:
            params = result.params
            bse = result.bse
            tvalues = result.tvalues
            pvalues = result.pvalues
            conf_int = result.conf_int(alpha=alpha)

            # Build parameter names
            param_names = []
            if fit_intercept:
                param_names.append('intercept')
            for i in range(n_cols):
                param_names.append(f'x{i+1}')

            # Build results table
            results = [['parameter', 'coefficient', 'std_error', 'z_statistic', 'p_value', 'ci_lower', 'ci_upper', 'odds_ratio']]

            for i, name in enumerate(param_names):
                coef = float(params[i])
                stderr = float(bse[i])
                zstat = float(tvalues[i])
                pval = float(pvalues[i])
                ci_low = float(conf_int[i][0])
                ci_high = float(conf_int[i][1])
                odds_ratio = math.exp(coef) if glm_binomial_link == 'logit' else None

                results.append([
                    name,
                    coef,
                    stderr,
                    zstat,
                    pval,
                    ci_low,
                    ci_high,
                    odds_ratio if odds_ratio is not None else ''
                ])

            # Add model statistics
            results.append(['deviance', float(result.deviance), '', '', '', '', '', ''])
            results.append(['pearson_chi2', float(result.pearson_chi2), '', '', '', '', '', ''])
            results.append(['aic', float(result.aic), '', '', '', '', '', ''])
            results.append(['bic', float(result.bic_llf), '', '', '', '', '', ''])
            results.append(['log_likelihood', float(result.llf), '', '', '', '', '', ''])

            return results

        except Exception as e:
            return f"Error: statsmodels.GLM error: unable to extract results: {e}"
    except Exception as e:
        return f"Error: {str(e)}"
