Motivation

Imagine you have a cloud of data points in 2D.

Euclidean Distance Limitation: If you use standard Euclidean distance ($d(\mathbf{x}, \mathbf{y}) = \lVert \mathbf{x} - \mathbf{y} \rVert$), you're measuring the straight-line distance. This works well if your data cloud is roughly circular and dimensions are equally scaled. However, what if the cloud is stretched out like an ellipse, meaning the variables are correlated or have vastly different variances?

Consider two points, A and B, that are the same Euclidean distance from the center of the ellipse. If point A lies along the short axis of the ellipse (the low-variance direction) and point B lies along the long axis (the high-variance direction), then intuitively point A is more of an "outlier" relative to the typical spread of the data than point B is. Euclidean distance doesn't capture this.

Mahalanobis Distance Idea: Mahalanobis distance accounts for the shape (correlation) and spread (variance) of the data distribution. It effectively asks: "How many standard deviations away is this point from the mean of the distribution, considering the correlations between variables?"

  • It "transforms" the data space so that the correlations are removed and the variance is equalized in all directions (making the elliptical cloud look like a circle).
  • Then, it calculates the standard Euclidean distance in this transformed space (see the sketch after this list).
  • Therefore, a point far away along a direction of low variance will have a larger Mahalanobis distance than a point equally far (in Euclidean terms) along a direction of high variance. It measures distance relative to the covariance structure of the data.
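
To make the transform-then-measure idea concrete, here is a minimal NumPy sketch; the covariance matrix and the test point are made up for illustration. Whitening the data with the inverse Cholesky factor of the covariance turns the elliptical cloud into a roughly circular one, and the ordinary Euclidean norm in the whitened space reproduces the Mahalanobis distance defined in the next section.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical correlated 2D cloud: a stretched ellipse centered at the origin.
cov = np.array([[4.0, 1.8],
                [1.8, 1.0]])
X = rng.multivariate_normal(np.zeros(2), cov, size=1000)

# Whitening: cov = L @ L.T (Cholesky), so applying L^{-1} to each point
# removes the correlation and equalizes the variance in all directions.
L = np.linalg.cholesky(cov)
L_inv = np.linalg.inv(L)
X_white = X @ L_inv.T  # each row x is mapped to L^{-1} x
print(np.cov(X_white, rowvar=False).round(2))  # approximately the identity

# Euclidean distance in the whitened space equals the Mahalanobis distance
# in the original space.
x = np.array([2.0, -1.0])
d_white = np.linalg.norm(L_inv @ x)
d_mahal = np.sqrt(x @ np.linalg.inv(cov) @ x)
print(d_white, d_mahal)  # the two values agree up to floating-point rounding
```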

Formulation

Let:

  • $\mathbf{x}$ be the vector representing the point whose distance we want to measure.
  • $\boldsymbol{\mu}$ be the vector representing the mean (centroid) of the distribution (e.g., a dataset or a class).
  • $\Sigma$ be the covariance matrix of the distribution. The diagonal elements contain the variances of each variable, and the off-diagonal elements contain the covariances between pairs of variables.

The Mahalanobis distance $D_M(\mathbf{x})$ from the point $\mathbf{x}$ to the distribution with mean $\boldsymbol{\mu}$ and covariance $\Sigma$ is defined as:

$$D_M(\mathbf{x}) = \sqrt{(\mathbf{x} - \boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})}$$

where

  1. $\Sigma^{-1}$: This is the inverse of the covariance matrix, also known as the precision matrix. This is the key part that accounts for variance and correlation.
    • Multiplying by $\Sigma^{-1}$ effectively rescales the space. Directions with high variance (large values in $\Sigma$) are shrunk (small values in $\Sigma^{-1}$), and directions with low variance are stretched. It also rotates the space to align with the principal components, effectively decorrelating the variables.
  2. $(\mathbf{x} - \boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})$: This quadratic form calculates the squared distance in the transformed space. The result is a non-negative scalar (the sketch after this list evaluates it directly).
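
As a concrete sketch of the formula (the dataset and the query point below are hypothetical), the quadratic form can be evaluated directly with NumPy, or via scipy.spatial.distance.mahalanobis, which takes the precision matrix $\Sigma^{-1}$ as its third argument:

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

# Hypothetical sample; in practice mu and Sigma are estimated from the data.
rng = np.random.default_rng(1)
X = rng.multivariate_normal([0.0, 0.0], [[4.0, 1.8], [1.8, 1.0]], size=500)

mu = X.mean(axis=0)               # mean vector
Sigma = np.cov(X, rowvar=False)   # covariance matrix
Sigma_inv = np.linalg.inv(Sigma)  # precision matrix

x = np.array([2.0, -1.0])
diff = x - mu

# Direct evaluation of the quadratic form, then the square root.
d_manual = np.sqrt(diff @ Sigma_inv @ diff)

# The same computation via SciPy.
d_scipy = mahalanobis(x, mu, Sigma_inv)

print(d_manual, d_scipy)  # identical up to floating-point rounding
```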

Special Case: If the variables are uncorrelated and have unit variance (i.e., $\Sigma$ is the identity matrix $I$), then $\Sigma^{-1} = I$. The formula becomes $D_M(\mathbf{x}) = \sqrt{(\mathbf{x} - \boldsymbol{\mu})^\top (\mathbf{x} - \boldsymbol{\mu})} = \lVert \mathbf{x} - \boldsymbol{\mu} \rVert$, which is exactly the Euclidean distance between $\mathbf{x}$ and $\boldsymbol{\mu}$. This shows Mahalanobis distance is a generalization of Euclidean distance.
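
A quick numerical check of this special case, using an arbitrary point:

```python
import numpy as np
from scipy.spatial.distance import euclidean, mahalanobis

x = np.array([3.0, 4.0])
mu = np.zeros(2)

# With Sigma = I the precision matrix is also I, and the two distances coincide.
print(mahalanobis(x, mu, np.eye(2)))  # 5.0
print(euclidean(x, mu))               # 5.0
```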

Considerations

  • Requires estimating the mean vector $\boldsymbol{\mu}$ and the covariance matrix $\Sigma$ (or its inverse $\Sigma^{-1}$) from the data, which requires a sufficient number of data points relative to the dimensionality.
  • The covariance matrix must be invertible (non-singular). This can be an issue with high-dimensional data or collinear features; regularization techniques (e.g., shrinkage) might be needed.
  • Sensitive to outliers when estimating $\boldsymbol{\mu}$ and $\Sigma$; robust estimation methods might be necessary (see the sketch after this list).
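
As one way to address the last two points, here is a sketch using scikit-learn's covariance estimators; the contaminated dataset is fabricated for illustration. LedoitWolf applies shrinkage regularization, which keeps the estimated $\Sigma$ well-conditioned and invertible, while MinCovDet computes robust estimates of $\boldsymbol{\mu}$ and $\Sigma$ that resist the outliers.

```python
import numpy as np
from sklearn.covariance import LedoitWolf, MinCovDet

# Hypothetical data with two gross outliers mixed in.
rng = np.random.default_rng(2)
X = rng.multivariate_normal([0.0, 0.0], [[4.0, 1.8], [1.8, 1.0]], size=300)
X = np.vstack([X, [[25.0, -25.0], [30.0, 30.0]]])  # contaminate the sample

lw = LedoitWolf().fit(X)                # shrinkage-regularized Sigma
mcd = MinCovDet(random_state=0).fit(X)  # robust mu and Sigma

x = np.array([2.0, -1.0])
# .mahalanobis() returns *squared* distances, hence the square root.
for name, est in [("Ledoit-Wolf", lw), ("MCD", mcd)]:
    d = np.sqrt(est.mahalanobis(x.reshape(1, -1)))[0]
    print(f"{name}: {d:.3f}")
```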