CANCORR

Overview

The CANCORR function performs Canonical Correlation Analysis (CCA), a multivariate statistical technique that identifies and measures the linear relationships between two sets of variables. First introduced by Harold Hotelling in 1936, CCA finds linear combinations of each variable set that maximize the correlation between them, producing canonical variates — pairs of composite variables with the highest possible correlation.

Given two sets of variables X = (x_1, \ldots, x_p) and Y = (y_1, \ldots, y_q), CCA seeks weight vectors a and b such that the correlation between U = a^T X and V = b^T Y is maximized. Subsequent pairs of canonical variates are derived under the constraint that they are uncorrelated with all previous pairs. The number of canonical correlation pairs equals \min(p, q).
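
As an illustration of this definition, the sketch below computes the canonical correlations directly with NumPy by centering each variable set, orthonormalizing its columns with a QR decomposition, and taking the singular values of the cross-product of the two orthonormal bases. The helper name canonical_correlations is chosen for this illustration only; the data reuse the inputs of Example 1 below.

import numpy as np

def canonical_correlations(X, Y):
    """Canonical correlations between two column-wise variable sets (illustrative sketch)."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    # Center each variable set
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Orthonormalize the columns of each centered set
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    # The singular values of Qx'Qy are the canonical correlations
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(s, 0.0, 1.0)

# Toy data: 5 observations, 2 variables per set (same values as Example 1 below)
X = [[1, 2.5], [2, 3.2], [3, 4.1], [4, 5.3], [5, 6.0]]
Y = [[2.1, 1.5], [3.0, 2.8], [4.2, 3.1], [5.1, 4.5], [6.0, 5.2]]
print(canonical_correlations(X, Y))  # two values, largest first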

This implementation uses the statsmodels CanCorr class, which computes the canonical correlations and weight vectors via singular value decomposition (SVD) rather than by explicitly solving the classical eigenvalue problem on the cross-covariance structure of the variables. The function returns:

  • Canonical correlations: Values ranging from 0 to 1 indicating the strength of each canonical relationship
  • Eigenvalues: Computed from canonical correlations as \lambda = r^2 / (1 - r^2)
  • Wilks’ lambda: A multivariate test statistic for the null hypothesis of no correlation
  • Chi-square statistics: Bartlett’s approximation for hypothesis testing:

\chi^2 = -\left(n - 1 - \frac{p + q + 1}{2}\right) \ln(\Lambda)

where n is the number of observations, p and q are the number of variables in each set, and \Lambda is Wilks’ lambda.
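
The derived statistics follow directly from the canonical correlations. The sketch below reproduces the eigenvalue, Wilks' lambda, and Bartlett chi-square computations described above; the helper name bartlett_stats and the use of scipy.stats.chi2 for p-values are choices made for this illustration. Note that the p-values reported by CANCORR itself come from the F approximation in statsmodels, so they can differ from the chi-square p-values computed here.

import math
from scipy.stats import chi2

def bartlett_stats(cancorrs, n_obs, p, q):
    """Eigenvalues, Wilks' lambda, and Bartlett chi-square tests from canonical correlations."""
    results = []
    for k, r in enumerate(cancorrs):
        eigenvalue = r**2 / (1.0 - r**2)
        # Wilks' lambda for the k-th test: product of (1 - r_i^2) over the remaining correlations
        wilks = math.prod(1.0 - c**2 for c in cancorrs[k:])
        # Bartlett's chi-square approximation, using the multiplier from the formula above
        chi_sq = -(n_obs - 1.0 - (p + q + 1.0) / 2.0) * math.log(wilks)
        # Degrees of freedom for testing the k-th and all later canonical correlations
        df = (p - k) * (q - k)
        results.append((k + 1, eigenvalue, wilks, chi_sq, df, chi2.sf(chi_sq, df)))
    return results

# Correlations from Example 1 below: n = 5 observations, p = q = 2 variables per set
for row in bartlett_stats([0.99964, 0.15727], n_obs=5, p=2, q=2):
    print(row)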

CCA is widely used in psychology, ecology, marketing research, and bioinformatics to explore relationships between measurement batteries, such as comparing personality inventories or linking gene expression data to phenotypic outcomes. For additional background, see the Wikipedia article on Canonical Correlation and the statsmodels multivariate documentation.

This example function is provided as-is without any representation of accuracy.

Excel Usage

=CANCORR(x_vars, y_vars, standardize)
  • x_vars (list[list], required): First set of variables where rows are observations and columns are variables.
  • y_vars (list[list], required): Second set of variables where rows are observations and columns are variables.
  • standardize (bool, optional, default: true): Whether to standardize variables (mean=0, std=1) before analysis.

Returns (list[list]): 2D list containing the canonical correlations, the associated test statistics (eigenvalue, Wilks' lambda, chi-square, degrees of freedom, p-value), and the X and Y canonical coefficients, or an error message string.

Examples

Example 1: Two variables per set, default standardization

Inputs:

x_vars        y_vars
1    2.5      2.1   1.5
2    3.2      3     2.8
3    4.1      4.2   3.1
4    5.3      5.1   4.5
5    6        6     5.2

Excel formula:

=CANCORR({1,2.5;2,3.2;3,4.1;4,5.3;5,6}, {2.1,1.5;3,2.8;4.2,3.1;5.1,4.5;6,5.2})

Expected output:

canonical_variate correlation eigenvalue wilks_lambda chi_square df p_value
1 0.99964 1401.98353 0.0007 10.9071 4 0.05204
2 0.15727 0.02536 0.97527 0.03757 1 0.84273
X Coefficients
Variable CV1 CV2
X1 -0.24893 1.54403
X2 -0.07654 -1.67918
Y Coefficients
Variable CV1 CV2
Y1 -0.27798 3.51887
Y2 -0.04201 -3.86311

Example 2: Three variables per set, standardize passed explicitly as TRUE

Inputs:

x_vars              y_vars              standardize
1.2   2.8   1.9     3.4   2.1   1.6     true
2.3   3.5   2.4     4.2   3.3   2.5
3.1   4.2   3.7     5.3   4.5   3.2
4.5   5.1   4.2     6.2   5.2   4.7
5.3   6.4   5.6     7.1   6.5   5.3
6.7   7.3   6.1     8.3   7.2   6.8
7.2   8.1   7.4     9.1   8.4   7.2

Excel formula:

=CANCORR({1.2,2.8,1.9;2.3,3.5,2.4;3.1,4.2,3.7;4.5,5.1,4.2;5.3,6.4,5.6;6.7,7.3,6.1;7.2,8.1,7.4}, {3.4,2.1,1.6;4.2,3.3,2.5;5.3,4.5,3.2;6.2,5.2,4.7;7.1,6.5,5.3;8.3,7.2,6.8;9.1,8.4,7.2}, TRUE)

Expected output:

canonical_variate correlation eigenvalue wilks_lambda chi_square df p_value
1 0.99997 15525.13258 0.00001 29.25827 9 0.01232
2 0.93194 6.60553 0.12835 5.13257 4 0.29308
3 0.15448 0.02445 0.97614 0.06038 1 0.8041
X Coefficients
Variable CV1 CV2 CV3
X1 0.00807 -1.89357 -3.97215
X2 -0.0645 -0.03864 2.38972
X3 -0.13273 1.89581 1.42223
Y Coefficients
Variable CV1 CV2 CV3
Y1 -0.16296 0.6095 1.51665
Y2 0.01371 0.71955 -2.83903
Y3 -0.0352 -1.41293 1.12238

Example 3: Two variables per set over seven observations

Inputs:

x_vars        y_vars
1    1.5      1.8   2.1
2    2.2      2.7   3.3
3    3.8      3.5   4.2
4    4.5      4.6   5.1
5    5.7      5.4   6.5
6    6.3      6.9   7.2
7    7.9      7.3   8.4

Excel formula:

=CANCORR({1,1.5;2,2.2;3,3.8;4,4.5;5,5.7;6,6.3;7,7.9}, {1.8,2.1;2.7,3.3;3.5,4.2;4.6,5.1;5.4,6.5;6.9,7.2;7.3,8.4})

Expected output:

canonical_variate correlation eigenvalue wilks_lambda chi_square df p_value
1 0.99975 2028.36488 0.00028 28.60577 4 0.00002
2 0.65377 0.74648 0.57258 1.9516 1 0.15906
X Coefficients
Variable CV1 CV2
X1 -0.09984 1.44725
X2 -0.09006 -1.34131
Y Coefficients
Variable CV1 CV2
Y1 -0.24708 1.79612
Y2 0.05561 -1.72225

Example 4: Same data as Example 1 with standardize set to FALSE

Inputs:

x_vars        y_vars        standardize
1    2.5      2.1   1.5     false
2    3.2      3     2.8
3    4.1      4.2   3.1
4    5.3      5.1   4.5
5    6        6     5.2

Excel formula:

=CANCORR({1,2.5;2,3.2;3,4.1;4,5.3;5,6}, {2.1,1.5;3,2.8;4.2,3.1;5.1,4.5;6,5.2}, FALSE)

Expected output:

canonical_variate correlation eigenvalue wilks_lambda chi_square df p_value
1 0.99964 1401.98353 0.0007 10.9071 4 0.05204
2 0.15727 0.02536 0.97527 0.03757 1 0.84273
X Coefficients
Variable CV1 CV2
X1 -0.24893 1.54403
X2 -0.07654 -1.67918
Y Coefficients
Variable CV1 CV2
Y1 -0.27798 3.51887
Y2 -0.04201 -3.86311

Python Code

import math
from statsmodels.multivariate.cancorr import CanCorr as statsmodels_cancorr

def cancorr(x_vars, y_vars, standardize=True):
    """
    Performs Canonical Correlation Analysis (CCA) between two sets of variables.

    See: https://www.statsmodels.org/stable/generated/statsmodels.multivariate.cancorr.CanCorr.html

    This example function is provided as-is without any representation of accuracy.

    Args:
        x_vars (list[list]): First set of variables where rows are observations and columns are variables.
        y_vars (list[list]): Second set of variables where rows are observations and columns are variables.
        standardize (bool, optional): Whether to standardize variables (mean=0, std=1) before analysis. Default is True.

    Returns:
        list[list]: 2D list containing the canonical correlations, test statistics (eigenvalue, Wilks' lambda, chi-square, degrees of freedom, p-value), and the X and Y canonical coefficients, or an error message string.
    """
    def to2d(x):
      # Wrap a scalar value in a 2D list so single-cell inputs are handled uniformly
      return [[x]] if not isinstance(x, list) else x

    def validate_2d_array(arr, name):
      # Validate that arr is a 2D list of numeric values
      if not isinstance(arr, list):
        return f"Error: Invalid input: {name} must be a 2D list."
      if len(arr) == 0:
        return f"Error: Invalid input: {name} must not be empty."
      for i, row in enumerate(arr):
        if not isinstance(row, list):
          return f"Error: Invalid input: {name} must be a 2D list."
        if len(row) == 0:
          return f"Error: Invalid input: {name} rows must not be empty."
        for j, val in enumerate(row):
          if not isinstance(val, (int, float, bool)):
            return f"Error: Invalid input: {name}[{i}][{j}] must be numeric."
          num_val = float(val)
          if math.isnan(num_val) or math.isinf(num_val):
            return f"Error: Invalid input: {name}[{i}][{j}] must be finite."
      # Check that all rows have the same length
      row_lengths = [len(row) for row in arr]
      if len(set(row_lengths)) > 1:
        return f"Error: Invalid input: {name} must have consistent row lengths."
      return None

    try:
      # Normalize inputs
      x_vars = to2d(x_vars)
      y_vars = to2d(y_vars)

      # Validate inputs
      error = validate_2d_array(x_vars, "x_vars")
      if error:
        return error
      error = validate_2d_array(y_vars, "y_vars")
      if error:
        return error

      # Validate standardize
      if not isinstance(standardize, bool):
        return "Error: Invalid input: standardize must be a boolean."

      # Check that x_vars and y_vars have the same number of rows
      if len(x_vars) != len(y_vars):
        return "Error: Invalid input: x_vars and y_vars must have the same number of observations (rows)."

      # Check minimum number of observations
      n_obs = len(x_vars)
      n_x_vars = len(x_vars[0])
      n_y_vars = len(y_vars[0])

      if n_obs < max(n_x_vars, n_y_vars) + 1:
        return "Error: Invalid input: number of observations must be greater than the number of variables."

      # Perform canonical correlation analysis
      cca = statsmodels_cancorr(x_vars, y_vars, standardize=standardize)

      # Get test results
      corr_test = cca.corr_test()

      # Build output table
      output = []

      # Header row
      output.append([
        'canonical_variate',
        'correlation',
        'eigenvalue',
        'wilks_lambda',
        'chi_square',
        'df',
        'p_value'
      ])

      # Results for each canonical correlation
      n_cv = len(cca.cancorr)
      for i in range(n_cv):
        # Calculate eigenvalue from canonical correlation
        r = float(cca.cancorr[i])
        eigenval = (r * r) / (1.0 - r * r) if r < 1.0 else float('inf')

        # Get Wilks' lambda from test results
        wilks = float(corr_test.stats.loc[i, "Wilks' lambda"])

        # Calculate chi-square using Bartlett's approximation
        chi_sq = -(n_obs - 1.0 - (n_x_vars + n_y_vars + 1.0) / 2.0) * math.log(wilks) if wilks > 0 else float('inf')

        # Get degrees of freedom and p-value
        df = float(corr_test.stats.loc[i, 'Num DF'])
        pval = float(corr_test.stats.loc[i, 'Pr > F'])

        row = [
          i + 1,  # canonical variate number
          r,  # canonical correlation
          eigenval,  # eigenvalue
          wilks,  # Wilks' lambda
          chi_sq,  # chi-square
          df,  # degrees of freedom
          pval  # p-value
        ]
        output.append(row)

      # Add blank row separator
      output.append([''] * 7)

      # Add X coefficients section
      output.append(['X Coefficients'] + [''] * 6)
      x_coef_header = ['Variable'] + [f'CV{j+1}' for j in range(n_cv)] + [''] * (7 - n_cv - 1)
      output.append(x_coef_header[:7])

      for i in range(n_x_vars):
        row = [f'X{i+1}'] + [float(cca.x_cancoef[i, j]) for j in range(n_cv)] + [''] * (7 - n_cv - 1)
        output.append(row[:7])

      # Add blank row separator
      output.append([''] * 7)

      # Add Y coefficients section
      output.append(['Y Coefficients'] + [''] * 6)
      y_coef_header = ['Variable'] + [f'CV{j+1}' for j in range(n_cv)] + [''] * (7 - n_cv - 1)
      output.append(y_coef_header[:7])

      for i in range(n_y_vars):
        row = [f'Y{i+1}'] + [float(cca.y_cancoef[i, j]) for j in range(n_cv)] + [''] * (7 - n_cv - 1)
        output.append(row[:7])

      return output
    except Exception as e:
      return f"Error: {str(e)}"

Online Calculator