Association and Correlation

Overview

Association tests determine whether two variables are related or independent, answering fundamental research questions across statistics, epidemiology, social sciences, and engineering. Understanding associations and correlations is essential for hypothesis testing, exploratory data analysis, and building predictive models. This category encompasses two complementary approaches: testing for associations between categorical variables in contingency tables, and measuring correlations between continuous or ordinal variables.

Background and Importance: Determining if variables are related is one of the most common statistical tasks. Whether investigating whether smoking status is associated with disease risk, whether study time correlates with exam performance, or whether treatment type affects patient outcomes, association tests provide rigorous, evidence-based methods to answer these questions. The distinction between association and causation remains critical—a significant association indicates a relationship but does not establish causal mechanisms.

Contingency Table Tests: When both variables are categorical, contingency tables organize frequency counts into rows and columns. Tests in this family evaluate the null hypothesis that the row and column variables are independent. The CHI2_CONTINGENCY test is the most widely used classical method; it compares observed frequencies with the frequencies expected under independence and relies on a large-sample chi-square approximation. When sample sizes are small or cells contain few observations (a common rule of thumb is an expected count below 5), that approximation becomes unreliable and exact tests give more trustworthy p-values. The FISHER_EXACT test is the standard choice for 2×2 tables with small samples; it conditions on both the row and the column totals. BARNARD_EXACT and BOSCHLOO_EXACT are unconditional alternatives for 2×2 tables that do not treat both margins as fixed and are often more powerful than Fisher’s test. All three exact methods obtain p-values by enumerating the possible tables rather than relying on asymptotic approximations.
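The comparison of observed and expected counts, and the difference between the approximate and exact tests, can be sketched with the underlying scipy.stats functions. The table below is made-up illustration data, and the calls go to SciPy directly because the signatures of the uppercase tools are not shown here; barnard_exact and boschloo_exact assume SciPy 1.7 or later.

```python
import numpy as np
from scipy import stats

# Illustrative 2x2 table of counts (made-up data):
# rows = two groups, columns = outcome present / absent.
table = np.array([[8, 2],
                  [1, 5]])

# Expected counts under independence:
# (row total * column total) / grand total, for every cell.
row_totals = table.sum(axis=1, keepdims=True)
col_totals = table.sum(axis=0, keepdims=True)
expected_manual = row_totals * col_totals / table.sum()

# Chi-square test of independence. For 2x2 tables SciPy applies
# Yates' continuity correction by default (correction=True).
chi2_stat, chi2_p, dof, expected = stats.chi2_contingency(table)

# Fisher's exact test conditions on both margins.
odds_ratio, fisher_p = stats.fisher_exact(table)

# Unconditional exact tests for 2x2 tables.
barnard_p = stats.barnard_exact(table).pvalue
boschloo_p = stats.boschloo_exact(table).pvalue

print("expected counts under independence:")
print(expected_manual)
print(f"chi-square p = {chi2_p:.4f} (dof = {dof})")
print(f"Fisher     p = {fisher_p:.4f} (sample odds ratio = {odds_ratio:.1f})")
print(f"Barnard    p = {barnard_p:.4f}")
print(f"Boschloo   p = {boschloo_p:.4f}")
```

Several expected counts in this table fall below 5, which is the situation in which the exact p-values are usually preferred over the chi-square approximation.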

Correlation Coefficients for Continuous Data: Measuring linear association between continuous variables is a cornerstone of statistics. The PEARSONR correlation coefficient quantifies the strength and direction of a linear relationship, ranging from -1 (perfect negative linear relationship) to +1 (perfect positive linear relationship). The coefficient can be computed for any paired numeric data, but its p-value assumes approximately bivariate normal data, and the coefficient is a meaningful summary only when the relationship is linear and the scatter is roughly homoscedastic (similar variance of one variable across the range of the other). When outliers or heavy-tailed distributions violate these assumptions, robust alternatives become essential. The THEILSLOPES and SIEGELSLOPES methods estimate a linear regression slope from medians of pairwise slopes, so a few extreme values have little effect on the fitted line. They still describe a linear trend; for monotonic but non-linear relationships, rank-based measures such as SPEARMANR or KENDALLTAU are the better fit.
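A minimal sketch of why the median-based slopes are considered robust, using synthetic data: an exact line y = 2x + 1 with a single value replaced by a gross outlier. The data and variable names are assumptions made for illustration. Note that theilslopes and siegelslopes take y first and x second.

```python
import numpy as np
from scipy import stats

# Synthetic data: an exact line y = 2x + 1 ...
x = np.arange(20, dtype=float)
y = 2.0 * x + 1.0
# ... with one gross outlier injected at x = 5.
y[5] = 200.0

# Pearson correlation is pulled down by the single outlier.
r, r_p = stats.pearsonr(x, y)

# Ordinary least squares slope, for comparison (not robust).
ols_slope, ols_intercept = np.polyfit(x, y, 1)

# Theil-Sen: median of the slopes over all pairs of points.
ts = stats.theilslopes(y, x)
ts_slope, ts_intercept = ts[0], ts[1]

# Siegel repeated medians: for each point, the median slope to all
# other points; then the median of those per-point medians.
sg = stats.siegelslopes(y, x)
sg_slope, sg_intercept = sg[0], sg[1]

print(f"Pearson r       = {r:.3f}")
print(f"OLS slope       = {ols_slope:.3f}")
print(f"Theil-Sen slope = {ts_slope:.3f}")
print(f"Siegel slope    = {sg_slope:.3f}")
```

Because 171 of the 190 pairwise slopes are unaffected by the outlier, both median-based estimators recover the slope of the underlying line, while the least-squares slope is distorted.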

Ordinal Association Measures: When data are ranked or ordinal rather than continuous, rank-based coefficients that measure monotonic association are appropriate. The SPEARMANR correlation coefficient applies the Pearson formula to ranks, measuring whether variables tend to move together monotonically. KENDALLTAU is another rank-based measure, built from counts of concordant and discordant pairs; its tau-b form adjusts for ties. For asymmetric relationships in which one ordinal variable is treated as the predictor of another, SOMERSD quantifies directional association. The WEIGHTEDTAU variant weights each pairwise exchange by the rank of the elements involved, so disagreements among high-ranked (more important) items count more than disagreements among low-ranked ones.
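The rank-based measures can be compared side by side with the underlying scipy.stats functions. This is a sketch on made-up data: the first part checks that Spearman's coefficient equals Pearson's coefficient computed on ranks, the second shows that Somers' D changes when the roles of the two variables are swapped, and the third shows weightedtau penalizing a swap among top-ranked items more than a swap among bottom-ranked ones. somersd assumes SciPy 1.7 or later.

```python
import numpy as np
from scipy import stats

# --- Spearman and Kendall on made-up data without ties ---
x = np.array([1, 2, 3, 4, 5, 6, 7, 8])
y = np.array([2, 1, 4, 3, 7, 8, 5, 6])

rho, rho_p = stats.spearmanr(x, y)
# Spearman's coefficient is Pearson's formula applied to the ranks.
rho_from_ranks, _ = stats.pearsonr(stats.rankdata(x), stats.rankdata(y))
tau, tau_p = stats.kendalltau(x, y)  # tau-b by default

# --- Somers' D is asymmetric when the tie patterns differ ---
xt = [1, 1, 2, 2, 3, 3]   # predictor with ties
yt = [1, 3, 2, 4, 5, 6]   # response without ties
d_y_given_x = stats.somersd(xt, yt).statistic  # first argument is the predictor
d_x_given_y = stats.somersd(yt, xt).statistic

# --- Weighted tau: swaps near the top matter more ---
scores = np.arange(10, 0, -1)          # 10, 9, ..., 1 (high = important)
swap_top = scores.copy()
swap_top[[0, 1]] = swap_top[[1, 0]]    # exchange the two highest items
swap_bottom = scores.copy()
swap_bottom[[8, 9]] = swap_bottom[[9, 8]]  # exchange the two lowest items

kt_top = stats.kendalltau(scores, swap_top)[0]
kt_bottom = stats.kendalltau(scores, swap_bottom)[0]
wt_top = stats.weightedtau(scores, swap_top)[0]
wt_bottom = stats.weightedtau(scores, swap_bottom)[0]

print(f"Spearman rho = {rho:.4f}, Pearson on ranks = {rho_from_ranks:.4f}")
print(f"Kendall tau  = {tau:.4f}")
print(f"Somers' D(y|x) = {d_y_given_x:.4f}, D(x|y) = {d_x_given_y:.4f}")
print(f"Kendall tau, top swap vs bottom swap:  {kt_top:.4f} vs {kt_bottom:.4f}")
print(f"Weighted tau, top swap vs bottom swap: {wt_top:.4f} vs {wt_bottom:.4f}")
```

Plain Kendall's tau cannot tell the two swaps apart, since each produces exactly one discordant pair; the weighted version scores the top swap as the larger disagreement.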

Mixed Measurement Scales: When one variable is continuous and the other is binary (dichotomous), the POINTBISERIALR correlation coefficient bridges measurement scales. It is mathematically equivalent to the Pearson correlation computed with the binary variable coded as 0 and 1, which makes it well suited to relating a two-level grouping variable to a continuous outcome. For detecting trends across ordered groups or treatments, the PAGE_TREND_TEST evaluates whether measurements increase monotonically across treatments in a predicted order. It works on the within-subject ranks of repeated measurements (each subject observed under every treatment), not on a contingency table of counts.
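Both ideas can be sketched with scipy.stats on made-up data: the first part checks that the point-biserial coefficient matches Pearson's coefficient on a 0/1-coded group variable, and the second runs Page's test on a small repeated-measures layout (rows are subjects, columns are treatments listed in the predicted increasing order). page_trend_test assumes SciPy 1.7 or later.

```python
import numpy as np
from scipy import stats

# --- Point-biserial: binary group vs. continuous outcome (made-up data) ---
group = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
score = np.array([61.0, 64.0, 58.0, 70.0, 66.0,
                  72.0, 75.0, 69.0, 80.0, 77.0])

r_pb, p_pb = stats.pointbiserialr(group, score)
r_pearson, p_pearson = stats.pearsonr(group, score)  # same coefficient

# --- Page's L: 6 subjects, each measured under 4 ordered treatments ---
# Columns are arranged in the order the measurements are predicted to rise.
data = np.array([
    [5.1, 6.0, 6.8, 7.9],
    [4.8, 5.5, 6.1, 7.2],
    [5.3, 5.2, 6.5, 7.0],   # one inversion between treatments 1 and 2
    [4.9, 5.8, 6.6, 7.5],
    [5.0, 5.9, 6.2, 7.8],
    [5.2, 6.1, 6.0, 7.4],   # one inversion between treatments 2 and 3
])
res = stats.page_trend_test(data)

print(f"point-biserial r = {r_pb:.4f}, Pearson r = {r_pearson:.4f}")
print(f"Page's L = {res.statistic}, p = {res.pvalue:.2e}")
```

Page's statistic is the sum over treatments of the predicted rank times the column sum of within-subject ranks, so larger values indicate closer agreement with the predicted order.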

Implementation and Libraries: These tools build on SciPy’s statistics module (scipy.stats), which provides the exact tests and the correlation measures along with their p-value calculations. Most functions return both a point estimate (a test statistic or coefficient) and a p-value for hypothesis testing at a chosen significance level (conventionally α = 0.05). SciPy’s robust slope estimators, theilslopes and siegelslopes, are the exception: they return slope and intercept estimates rather than a p-value.

Choosing the Right Test: Selection depends on the measurement scales of both variables. For categorical-categorical relationships, choose among the chi-square, Fisher’s exact, Barnard’s exact, and Boschloo’s exact tests based on sample size and table structure; the three exact tests listed here apply to 2×2 tables. For continuous-continuous relationships, use Pearson correlation if its assumptions are met, the robust slope estimators when outliers are present, and a rank-based coefficient when the relationship is monotonic but not linear. For ordinal data, prefer Spearman’s rho or Kendall’s tau, or Somers’ D when one ordinal variable is treated as the predictor. For a continuous variable paired with a binary one, use point-biserial correlation. For a predicted trend across ordered treatments, use Page’s trend test.
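For the categorical case, the choice between the chi-square approximation and an exact test can be sketched as a small helper. The function name independence_test and the threshold of 5 on expected counts are assumptions for illustration (the threshold is a widely used rule of thumb, not a rule stated by these tools), and the helper falls back to the chi-square test for tables larger than 2×2.

```python
import numpy as np
from scipy import stats
from scipy.stats.contingency import expected_freq


def independence_test(table, min_expected=5.0):
    """Hypothetical helper: pick a test of independence for a table of counts.

    Uses Fisher's exact test for a 2x2 table with any expected count below
    `min_expected`; otherwise uses the chi-square test.
    Returns (name of the SciPy function used, p-value).
    """
    table = np.asarray(table)
    expected = expected_freq(table)
    if table.shape == (2, 2) and (expected < min_expected).any():
        _, p = stats.fisher_exact(table)
        return "fisher_exact", p
    _, p, _, _ = stats.chi2_contingency(table)
    return "chi2_contingency", p


# Small counts: some expected frequencies fall below 5.
small_choice, small_p = independence_test([[8, 2], [1, 5]])

# Large counts: every expected frequency is well above 5.
large_choice, large_p = independence_test([[80, 20], [30, 70]])

print(small_choice, round(small_p, 4))
print(large_choice, round(large_p, 6))
```

A fuller version might offer barnard_exact or boschloo_exact as the exact option; which of the three is preferable depends on the study design and is left open here.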

Tools

Tool              Description
BARNARD_EXACT     Perform Barnard’s exact test on a 2x2 contingency table.
BOSCHLOO_EXACT    Perform Boschloo’s exact test on a 2x2 contingency table.
CHI2_CONTINGENCY  Perform the chi-square test of independence for variables in a contingency table.
FISHER_EXACT      Perform Fisher’s exact test on a 2x2 contingency table.
KENDALLTAU        Calculate Kendall’s tau, a correlation measure for ordinal data.
PAGE_TREND_TEST   Perform Page’s L trend test for monotonic trends across treatments.
PEARSONR          Calculate the Pearson correlation coefficient and p-value for two datasets.
POINTBISERIALR    Calculate a point biserial correlation coefficient and its p-value.
SIEGELSLOPES      Compute the Siegel repeated medians estimator for robust linear regression using scipy.stats.siegelslopes.
SOMERSD           Calculate Somers’ D, an asymmetric measure of ordinal association between two variables.
SPEARMANR         Calculate a Spearman rank-order correlation coefficient with associated p-value.
THEILSLOPES       Compute the Theil-Sen estimator for a set of points (robust linear regression).
WEIGHTEDTAU       Compute a weighted version of Kendall’s tau correlation coefficient.