ROBUST_LINEAR_MODEL

Overview

The ROBUST_LINEAR_MODEL function fits a robust linear regression model using M-estimators, which are designed to be less sensitive to outliers than ordinary least squares (OLS) regression. This approach is particularly valuable when working with real-world data that may contain anomalous observations that would otherwise disproportionately influence regression coefficients.

The function implements the Robust Linear Model (RLM) class from the statsmodels library. For detailed documentation, see the RLM class reference and the Robust Linear Models guide. The statsmodels source code is available on GitHub.

M-estimators generalize maximum likelihood estimation by replacing the squared residuals used in OLS with a robust loss function \rho. The objective is to minimize:

\sum_{i=1}^{n} \rho\left(\frac{r_i}{s}\right)

where r_i are the residuals and s is a robust scale estimate. The model is fitted using Iteratively Reweighted Least Squares (IRLS), which assigns weights to observations based on their residuals—downweighting outliers while preserving the influence of typical observations.

This implementation supports four M-estimator norm functions:

  • Huber (default): Combines squared loss for small residuals with linear loss for large residuals, providing a balance between efficiency and robustness. See HuberT norm.
  • Bisquare (Tukey): Completely rejects extreme outliers by assigning zero weight beyond a threshold. See TukeyBiweight norm.
  • Andrew (Andrew’s Wave): Uses a sinusoidal weighting function that smoothly downweights outliers. See AndrewWave norm.
  • Hampel: A three-part redescending function that provides a smooth transition from full weight to zero. See Hampel norm.

The function returns coefficient estimates, standard errors, t-statistics, p-values, and 95% confidence intervals for each predictor, along with a robust scale estimate. The theoretical foundations of robust regression are detailed in Huber’s seminal work Robust Statistics (John Wiley and Sons, 1981).

This example function is provided as-is without any representation of accuracy.

Excel Usage

=ROBUST_LINEAR_MODEL(y, x, rlm_m_estimator, fit_intercept)
  • y (list[list], required): Dependent variable as a column vector (n_obs, 1).
  • x (list[list], required): Independent variables matrix with dimensions (n_obs, n_predictors). Each column is a predictor.
  • rlm_m_estimator (str, optional, default: “huber”): M-estimator norm function to use for robust regression.
  • fit_intercept (bool, optional, default: true): If True, adds an intercept term to the model.

Returns (list[list]): 2D list with robust regression results, or error string.

Examples

Example 1: Basic linear regression with default Huber estimator

Inputs:

y x
1 1
2 2
3 3
4 4
5 5

Excel formula:

=ROBUST_LINEAR_MODEL({1;2;3;4;5}, {1;2;3;4;5})

Expected output:

parameter coefficient std_error t_statistic p_value ci_lower ci_upper
intercept 0 0 -1.41842 0.15607 0 0
x1 1 0 1412441541641955.2 0 1 1
scale 0

Example 2: Regression with outlier using Huber estimator

Inputs:

y x rlm_m_estimator
1 1 huber
2 2
3 3
4 4
100 5

Excel formula:

=ROBUST_LINEAR_MODEL({1;2;3;4;100}, {1;2;3;4;5}, "huber")

Expected output:

parameter coefficient std_error t_statistic p_value ci_lower ci_upper
intercept -37.37772 75.42349 -0.49557 0.6202 -185.20504 110.44959
x1 19.69388 22.74104 0.86601 0.38649 -24.87773 64.26549
scale 27.73046

Example 3: Multiple predictors with Bisquare estimator

Inputs:

y x rlm_m_estimator fit_intercept
2 1 0.5 bisquare true
4 2 1.2
6 3 1.8
8 4 2.5
10 5 3.1
12 6 3.9

Excel formula:

=ROBUST_LINEAR_MODEL({2;4;6;8;10;12}, {1,0.5;2,1.2;3,1.8;4,2.5;5,3.1;6,3.9}, "bisquare", TRUE)

Expected output:

parameter coefficient std_error t_statistic p_value ci_lower ci_upper
intercept 0 0 -0.08582 0.93161 0 0
x1 2 0 11498802301654.041 0 2 2
x2 0 0 -0.10933 0.91294 0 0
scale 0

Example 4: No intercept model with Hampel estimator

Inputs:

y x rlm_m_estimator fit_intercept
3.1 1 hampel false
5.9 2
9.2 3
11.8 4
15.1 5
17.9 6

Excel formula:

=ROBUST_LINEAR_MODEL({3.1;5.9;9.2;11.8;15.1;17.9}, {1;2;3;4;5;6}, "hampel", FALSE)

Expected output:

parameter coefficient std_error t_statistic p_value ci_lower ci_upper
x1 2.9956 0.01612 185.82525 0 2.96401 3.0272
scale 0.16781

Python Code

import math
import warnings
from statsmodels.robust.robust_linear_model import RLM as statsmodels_RLM
from statsmodels.robust import norms as statsmodels_norms
import numpy as np

def robust_linear_model(y, x, rlm_m_estimator='huber', fit_intercept=True):
    """
    Fits a robust linear regression model using M-estimators.

    See: https://www.statsmodels.org/stable/generated/statsmodels.robust.robust_linear_model.RLM.html

    This example function is provided as-is without any representation of accuracy.

    Args:
        y (list[list]): Dependent variable as a column vector (n_obs, 1).
        x (list[list]): Independent variables matrix with dimensions (n_obs, n_predictors). Each column is a predictor.
        rlm_m_estimator (str, optional): M-estimator norm function to use for robust regression. Valid options: Huber, Andrew, Bisquare, Hampel. Default is 'huber'.
        fit_intercept (bool, optional): If True, adds an intercept term to the model. Default is True.

    Returns:
        list[list]: 2D list with robust regression results, or error string.
    """
    def to2d(val):
        return [[val]] if not isinstance(val, list) else val

    def validate_2d_array(arr, name):
        if not isinstance(arr, list):
            return f"Error: {name} must be a 2D list."
        if len(arr) == 0:
            return f"Error: {name} must not be empty."
        for i, row in enumerate(arr):
            if not isinstance(row, list):
                return f"Error: {name} row {i} must be a list."
            if len(row) == 0:
                return f"Error: {name} row {i} must not be empty."
            for j, val in enumerate(row):
                if not isinstance(val, (int, float)):
                    return f"Error: {name}[{i}][{j}] must be a number."
                if math.isnan(val) or math.isinf(val):
                    return f"Error: {name}[{i}][{j}] must be finite."
        return None
    try:
        # Normalize inputs to 2D lists
        y = to2d(y)
        x = to2d(x)

        # Validate inputs
        y_error = validate_2d_array(y, "y")
        if y_error:
            return y_error

        x_error = validate_2d_array(x, "x")
        if x_error:
            return x_error

        # Check y is a column vector
        if len(y[0]) != 1:
            return "Error: y must be a column vector (single column)."

        # Check dimensions match
        n_obs_y = len(y)
        n_obs_x = len(x)
        if n_obs_y != n_obs_x:
            return f"Error: y has {n_obs_y} observations but x has {n_obs_x} observations."

        # Check all rows in x have the same number of columns
        n_predictors = len(x[0])
        for i, row in enumerate(x):
            if len(row) != n_predictors:
                return f"Error: x row {i} has {len(row)} columns but expected {n_predictors}."

        # Check we have enough observations
        min_obs = n_predictors + (1 if fit_intercept else 0)
        if n_obs_y <= min_obs:
            return f"Error: need more than {min_obs} observations for {n_predictors} predictors."

        # Validate rlm_m_estimator
        valid_estimators = ['huber', 'andrew', 'bisquare', 'hampel']
        if not isinstance(rlm_m_estimator, str):
            return "Error: rlm_m_estimator must be a string."
        if rlm_m_estimator.lower() not in valid_estimators:
            return f"Error: rlm_m_estimator must be one of {valid_estimators}."

        # Validate fit_intercept
        if not isinstance(fit_intercept, bool):
            return "Error: fit_intercept must be a boolean."

        # Convert to numpy arrays
        y_array = np.array([row[0] for row in y], dtype=float)
        x_array = np.array(x, dtype=float)

        # Add intercept column if needed
        if fit_intercept:
            x_array = np.column_stack([np.ones(n_obs_y), x_array])

        # Map rlm_m_estimator to norm class
        estimator_lower = rlm_m_estimator.lower()
        if estimator_lower == 'huber':
            norm = statsmodels_norms.HuberT()
        elif estimator_lower == 'andrew':
            norm = statsmodels_norms.AndrewWave()
        elif estimator_lower == 'bisquare':
            norm = statsmodels_norms.TukeyBiweight()
        elif estimator_lower == 'hampel':
            norm = statsmodels_norms.Hampel()

        # Fit the model (suppress warnings for numerical edge cases)
        with warnings.catch_warnings():
            warnings.simplefilter("ignore")
            model = statsmodels_RLM(y_array, x_array, M=norm)
            results = model.fit()

        # Extract results
        params = results.params
        std_errors = results.bse
        t_stats = results.tvalues
        p_values = results.pvalues
        conf_int = results.conf_int()
        scale = results.scale

        # Build output
        output = [['parameter', 'coefficient', 'std_error', 't_statistic', 'p_value', 'ci_lower', 'ci_upper']]

        # Add parameter rows
        for i in range(len(params)):
            if fit_intercept and i == 0:
                param_name = 'intercept'
            else:
                predictor_idx = i if not fit_intercept else i - 1
                param_name = f'x{predictor_idx + 1}'

            output.append([
                param_name,
                float(params[i]),
                float(std_errors[i]),
                float(t_stats[i]),
                float(p_values[i]),
                float(conf_int[i, 0]),
                float(conf_int[i, 1])
            ])

        # Add scale estimate
        output.append(['scale', float(scale), '', '', '', '', ''])

        # Validate output values
        for i, row in enumerate(output[1:], 1):  # Skip header row
            for j, val in enumerate(row):
                if val != '' and isinstance(val, float):
                    if math.isnan(val) or math.isinf(val):
                        return f"Error: result contains non-finite value at row {i}, column {j}."

        return output

    except Exception as e:
        return f"Error: {e}"

Online Calculator