OLS_REGRESSION

Overview

The OLS_REGRESSION function fits an Ordinary Least Squares (OLS) regression model to estimate the relationship between a dependent variable and one or more independent variables (predictors). OLS is the most widely used method in linear regression analysis and is foundational to econometrics, statistics, and predictive modeling.

OLS works by finding the coefficient values that minimize the sum of squared residuals (the differences between observed and predicted values). Given a linear model of the form:

y = X\beta + \varepsilon

where y is the vector of observed responses, X is the matrix of predictors, \beta is the vector of unknown coefficients, and \varepsilon represents random errors, the OLS estimator is computed as:

\hat{\beta} = (X^T X)^{-1} X^T y

This closed-form solution yields the best linear unbiased estimator (BLUE) under the assumptions of the Gauss-Markov theorem: linearity, exogeneity, homoscedasticity (constant variance of errors), and no perfect multicollinearity among predictors. For more theoretical background, see the Wikipedia article on Ordinary Least Squares.
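
The closed-form estimator can be checked directly. The sketch below (using NumPy, which is not otherwise required by this page's implementation) solves the normal equations for the data of Example 1 below and recovers the documented intercept and slope:

```python
import numpy as np

# Data from Example 1: y regressed on x = 1..5, with a leading
# column of ones for the intercept.
X = np.column_stack([np.ones(5), np.arange(1.0, 6.0)])
y = np.array([1.2, 2.1, 2.9, 4.2, 4.8])

# Normal equations: beta_hat = (X'X)^{-1} X'y.
# (In practice a QR- or SVD-based solver such as np.linalg.lstsq is
# numerically safer than forming X'X explicitly.)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # intercept ~ 0.25, slope ~ 0.93
```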

This implementation uses the statsmodels Python library, which provides comprehensive statistical modeling capabilities. The statsmodels OLS documentation describes the underlying implementation, and the source code is available on GitHub.

The function returns a structured table containing coefficient estimates, standard errors, t-statistics, p-values, and confidence intervals for each parameter. It also provides key model diagnostics including:

  • R-squared and Adjusted R-squared: measures of how well the model explains variance in the response
  • F-statistic and its p-value: tests whether the model has overall explanatory power
  • AIC/BIC: information criteria for model comparison
  • Condition number: indicates potential multicollinearity issues
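
For intuition, the headline diagnostics can all be reproduced from the residual and total sums of squares. The following sketch plugs in Example 1's fitted coefficients (intercept 0.25, slope 0.93) and recovers the diagnostic values shown in that example's expected output; the formulas are the standard OLS ones, with the log-likelihood assuming Gaussian errors as statsmodels does:

```python
import numpy as np

x = np.arange(1.0, 6.0)
y = np.array([1.2, 2.1, 2.9, 4.2, 4.8])

# Residuals from Example 1's fitted model (intercept 0.25, slope 0.93).
resid = y - (0.25 + 0.93 * x)
ssr = np.sum(resid ** 2)              # residual sum of squares
sst = np.sum((y - y.mean()) ** 2)     # total sum of squares

n, p = len(y), 1                      # observations, slope coefficients
k = p + 1                             # estimated parameters incl. intercept

r_squared = 1 - ssr / sst
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - p - 1)
f_statistic = ((sst - ssr) / p) / (ssr / (n - p - 1))

# Gaussian log-likelihood, and the information criteria derived from it.
llf = -n / 2 * (np.log(2 * np.pi) + np.log(ssr / n) + 1)
aic = 2 * k - 2 * llf
bic = k * np.log(n) - 2 * llf

print(round(r_squared, 5), round(adj_r_squared, 5), round(f_statistic, 5))
# 0.99049 0.98733 312.61446
print(round(llf, 5), round(aic, 5), round(bic, 5))
# 3.15119 -2.30238 -3.0835
```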

When fit_intercept is set to True, a constant term is automatically added to the model. The alpha parameter controls the significance level for confidence intervals (default 0.05 for 95% confidence).
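
The confidence intervals follow the standard t-based formula, coefficient ± t(1 − alpha/2, df) × standard error, where df is the residual degrees of freedom. A sketch (using scipy.stats, a statsmodels dependency) reproducing Example 1's interval for the slope:

```python
from scipy import stats

# Example 1 slope: estimate 0.93, residual df = n - 2 = 3.
ssr, s_xx, df = 0.083, 10.0, 3           # residual SS, sum of squared x-deviations
se = (ssr / df / s_xx) ** 0.5            # standard error of the slope

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df)  # two-sided critical value
ci_lower = 0.93 - t_crit * se
ci_upper = 0.93 + t_crit * se
print(round(ci_lower, 5), round(ci_upper, 5))  # 0.76261 1.09739
```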

This example function is provided as-is without any representation of accuracy.

Excel Usage

=OLS_REGRESSION(y, x, fit_intercept, alpha)
  • y (list[list], required): Dependent variable as a column vector (N rows x 1 column) of float values.
  • x (list[list], required): Independent variables (predictors) as a matrix (N rows x P columns) of float values.
  • fit_intercept (bool, optional, default: true): If true, adds an intercept (constant) term to the model.
  • alpha (float, optional, default: 0.05): Significance level for confidence intervals, between 0 and 1.

Returns (list[list]): 2D list with OLS results, or error message string.

Examples

Example 1: Simple regression with one predictor

Inputs:

y x
1.2 1
2.1 2
2.9 3
4.2 4
4.8 5

Excel formula:

=OLS_REGRESSION({1.2;2.1;2.9;4.2;4.8}, {1;2;3;4;5})

Expected output:

parameter coefficient std_error t_statistic p_value ci_lower ci_upper
const 0.25 0.17445 1.43306 0.24728 -0.30518 0.80518
x1 0.93 0.0526 17.68091 0.00039 0.76261 1.09739
r_squared 0.99049
adj_r_squared 0.98733
f_statistic 312.61446
f_pvalue 0.00039
aic -2.30238
bic -3.0835
log_likelihood 3.15119
condition_number 8.36575

Example 2: Multiple regression with two predictors

Inputs:

y x1 x2
2.5 1 0.5
3.7 2 1
4.2 3 1.5
5.8 4 2
6.1 5 2.5

Excel formula:

=OLS_REGRESSION({2.5;3.7;4.2;5.8;6.1}, {1,0.5;2,1;3,1.5;4,2;5,2.5})

Expected output:

parameter coefficient std_error t_statistic p_value ci_lower ci_upper
const 1.67 0.34414 4.85266 0.0167 0.57479 2.76521
x1 0.744 0.08301 8.96277 0.00293 0.47983 1.00817
x2 0.372 0.04151 8.96277 0.00293 0.23991 0.50409
r_squared 0.964
adj_r_squared 0.952
f_statistic 80.33127
f_pvalue 0.00293
aic 4.49168
bic 3.71056
log_likelihood -0.24584
condition_number 12676737680377076

(Note: in this example x2 is exactly 0.5 × x1, so the predictors are perfectly collinear. The extreme condition number flags this; statsmodels still returns coefficients because it solves the least-squares problem via a pseudoinverse, but the individual x1/x2 estimates are not uniquely determined.)

Example 3: Regression without an intercept (fit_intercept = FALSE)

Inputs:

y x fit_intercept
2.1 1 false
3.9 2
6.2 3
7.8 4
10.1 5

Excel formula:

=OLS_REGRESSION({2.1;3.9;6.2;7.8;10.1}, {1;2;3;4;5}, FALSE)

Expected output:

parameter coefficient std_error t_statistic p_value ci_lower ci_upper
x1 2.00364 0.02229 89.90304 0 1.94176 2.06551
r_squared 0.99951
adj_r_squared 0.99938
f_statistic 8082.55574
f_pvalue 0
aic -2.92735
bic -3.31791
log_likelihood 2.46367
condition_number 1
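
With no intercept and a single predictor, the OLS slope reduces to the simple through-the-origin formula sum(x*y) / sum(x^2), which reproduces the coefficient above:

```python
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

# Through-the-origin OLS: slope = sum(x*y) / sum(x^2).
slope = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi ** 2 for xi in x)
print(round(slope, 5))  # 2.00364
```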

Example 4: Custom significance level (alpha = 0.1, 90% confidence intervals)

Inputs:

y x1 x2 alpha
2.5 1 0.5 0.1
3.7 2 1
4.2 3 1.5
5.8 4 2
6.1 5 2.5

Excel formula:

=OLS_REGRESSION({2.5;3.7;4.2;5.8;6.1}, {1,0.5;2,1;3,1.5;4,2;5,2.5}, 0.1)

Expected output:

parameter coefficient std_error t_statistic p_value ci_lower ci_upper
const 1.67 0.34414 4.85266 0.0167 0.86011 2.47989
x1 0.744 0.08301 8.96277 0.00293 0.54865 0.93935
x2 0.372 0.04151 8.96277 0.00293 0.27432 0.46968
r_squared 0.964
adj_r_squared 0.952
f_statistic 80.33127
f_pvalue 0.00293
aic 4.49168
bic 3.71056
log_likelihood -0.24584
condition_number 12676737680377076

Python Code

import math
import pandas as pd
import statsmodels.api as sm

def ols_regression(y, x, fit_intercept=True, alpha=0.05):
    """
    Fits an Ordinary Least Squares (OLS) regression model.

    See: https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.OLS.html

    This example function is provided as-is without any representation of accuracy.

    Args:
        y (list[list]): Dependent variable as a column vector (N rows x 1 column) of float values.
        x (list[list]): Independent variables (predictors) as a matrix (N rows x P columns) of float values.
        fit_intercept (bool, optional): If true, adds an intercept (constant) term to the model. Default is True.
        alpha (float, optional): Significance level for confidence intervals, between 0 and 1. Default is 0.05.

    Returns:
        list[list]: 2D list with OLS results, or error message string.
    """
    def to2d(val):
        return [[val]] if not isinstance(val, list) else val

    def validate_numeric_2d(arr, name):
        if not isinstance(arr, list):
            return f"Error: {name} must be a 2D list."
        if not arr:
            return f"Error: {name} cannot be empty."
        for i, row in enumerate(arr):
            if not isinstance(row, list):
                return f"Error: {name} must be a 2D list."
            if not row:
                return f"Error: {name} cannot have empty rows."
            for j, val in enumerate(row):
                if not isinstance(val, (int, float)):
                    return f"Error: {name}[{i}][{j}] must be numeric."
                if math.isnan(val) or math.isinf(val):
                    return f"Error: {name}[{i}][{j}] must be finite."
        return None

    def validate_scalar(val, name, val_type, check_range=None):
        if not isinstance(val, val_type):
            type_name = val_type.__name__ if hasattr(val_type, '__name__') else str(val_type)
            return f"Error: {name} must be {type_name}."
        # Check if val_type includes numeric types
        is_numeric = (val_type in (int, float)) or (isinstance(val_type, tuple) and any(t in (int, float) for t in val_type))
        if is_numeric and isinstance(val, (int, float)):
            if math.isnan(val) or math.isinf(val):
                return f"Error: {name} must be finite."
            if check_range and not (check_range[0] <= val <= check_range[1]):
                return f"Error: {name} must be in range {check_range}."
        return None

    try:
        # Normalize inputs
        y = to2d(y)
        x = to2d(x)

        # Validate inputs
        err = validate_numeric_2d(y, "y")
        if err:
            return err
        err = validate_numeric_2d(x, "x")
        if err:
            return err
        err = validate_scalar(fit_intercept, "fit_intercept", bool)
        if err:
            return err
        err = validate_scalar(alpha, "alpha", (int, float), (0.0, 1.0))
        if err:
            return err

        # Check dimensions
        n_obs_y = len(y)
        n_cols_y = len(y[0])
        if n_cols_y != 1:
            return "Error: y must be a column vector (single column)."

        # Check consistent column length in y
        for i, row in enumerate(y):
            if len(row) != n_cols_y:
                return f"Error: y row {i} has inconsistent number of columns."

        n_obs_x = len(x)
        n_predictors = len(x[0])

        # Check consistent column length in x
        for i, row in enumerate(x):
            if len(row) != n_predictors:
                return f"Error: x row {i} has inconsistent number of columns."

        if n_obs_y != n_obs_x:
            return f"Error: y and x must have the same number of observations ({n_obs_y} vs {n_obs_x})."

        # Check sufficient observations
        min_obs = n_predictors + (1 if fit_intercept else 0)
        if n_obs_y <= min_obs:
            return f"Error: insufficient observations ({n_obs_y}) for {min_obs} parameters."

        # Convert to pandas for proper parameter names
        y_series = pd.Series([row[0] for row in y])

        # Create column names for X
        x_col_names = [f'x{i+1}' for i in range(n_predictors)]
        x_df = pd.DataFrame(x, columns=x_col_names)

        # Prepare X matrix
        if fit_intercept:
            x_with_const = sm.add_constant(x_df, has_constant='add')
        else:
            x_with_const = x_df

        # Fit OLS model
        model = sm.OLS(y_series, x_with_const)
        results = model.fit()

        # Get confidence intervals
        conf_int = results.conf_int(alpha=alpha)

        # Build output table
        output = []

        # Header row
        output.append(['parameter', 'coefficient', 'std_error', 't_statistic', 'p_value', 'ci_lower', 'ci_upper'])

        # Parameter rows
        param_names = results.params.index.tolist()
        for i, param_name in enumerate(param_names):
            row = [
                param_name,
                float(results.params.iloc[i]),
                float(results.bse.iloc[i]),
                float(results.tvalues.iloc[i]),
                float(results.pvalues.iloc[i]),
                float(conf_int.iloc[i, 0]),
                float(conf_int.iloc[i, 1])
            ]
            output.append(row)

        # Model statistics rows (pad to 7 columns for consistency with header)
        output.append(['r_squared', float(results.rsquared), '', '', '', '', ''])
        output.append(['adj_r_squared', float(results.rsquared_adj), '', '', '', '', ''])
        output.append(['f_statistic', float(results.fvalue), '', '', '', '', ''])
        output.append(['f_pvalue', float(results.f_pvalue), '', '', '', '', ''])
        output.append(['aic', float(results.aic), '', '', '', '', ''])
        output.append(['bic', float(results.bic), '', '', '', '', ''])
        output.append(['log_likelihood', float(results.llf), '', '', '', '', ''])
        output.append(['condition_number', float(results.condition_number), '', '', '', '', ''])

        return output

    except Exception as e:
        return f"Error: {e}"
