Data Depth Background
Table of Contents
A data depth measures how deep (or central) a given point x in d-dimensional space is relative to F, a probability distribution in d-dimensional space, or relative to a given data cloud.
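As a concrete illustration, the sketch below estimates one particular depth function, the halfspace depth, for points in a two-dimensional data cloud. The halfspace depth of x is the smallest number of data points contained in any closed halfplane whose boundary passes through x. The function names, the direction grid, and the five-point cloud are assumptions made for this example. Scanning a finite set of directions only approximates the depth (it can overestimate it, never underestimate it), so this is a minimal sketch rather than an exact algorithm.

```python
import numpy as np


def halfspace_depth_2d(x, cloud, n_directions=360):
    """Approximate halfspace depth of x relative to a 2-D data cloud.

    For every direction u on a regular grid of angles, count the cloud
    points p in the closed halfplane u . (p - x) >= 0. The estimate is
    the smallest count found. (Hypothetical helper, not from the source.)
    """
    x = np.asarray(x, dtype=float)
    cloud = np.asarray(cloud, dtype=float)
    angles = np.linspace(0.0, 2.0 * np.pi, n_directions, endpoint=False)
    directions = np.column_stack((np.cos(angles), np.sin(angles)))
    # projections[i, j] = directions[i] . (cloud[j] - x)
    projections = directions @ (cloud - x).T
    counts = (projections >= 0.0).sum(axis=1)
    return int(counts.min())


def depths_of_cloud(cloud, n_directions=360):
    """Depth of every cloud point relative to the cloud itself."""
    cloud = np.asarray(cloud, dtype=float)
    return np.array(
        [halfspace_depth_2d(p, cloud, n_directions) for p in cloud]
    )


# Illustrative data: four outer points surrounding one central point.
cloud = [[1, 0], [0, 1], [0, 0], [-1, 0], [0, -1]]
depths = depths_of_cloud(cloud)
# Indices sorted from deepest to shallowest (ties keep input order).
order = np.argsort(-depths, kind="stable")
print(depths, order)
```

In this example the central point (0, 0) receives depth 3 and each of the four outer points receives depth 1, so sorting by depth places the central point first.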
The following concepts apply to the data depth methodology and distinguish it from other statistical methods.
- Non-parametric methodology:
Scientific measurements can be viewed as sample points drawn from some unknown probability distribution, where the analysis of the measurements involves computation of quantitative characteristics of the probability distribution (estimators), based on the data set.
If the underlying distribution is known (for example, a normal, log-normal, or Cauchy distribution), the characteristics of the data can be computed using methods from classical statistics. However, in most real-life experiments the underlying distribution is not known.
The concept of data depth requires no assumption about the underlying distribution and data is analyzed according to the relative position of the data points.
- Center-outward ordering of points:
The data depth concept allows the creation of a multivariate analog to the univariate statistical analysis tool of rank statistics. Rank statistics is based on the ordering of one-dimensional observations, where the order reflects extremeness, contiguity, variability or the effect of external contamination and provides a parameter estimation method [barnett76].
In higher dimensions there is no natural ordering of multivariate data, and several ordering methods have been suggested. The data depth concept provides a method of extending order statistics to any dimension by ordering the points according to their depth values.
- Application to multivariate (high-dimensional) data sets:
The concept of data depth is defined with respect to points in Euclidean space in any dimension, thus enabling the derivation of multivariate distributional characteristics of a data set.
The methodology enables the exploration of high dimensional data sets using simple two-dimensional graphs that are easy to visualize and interpret, and using quantitative estimators.
- Robustness:
In the statistical analysis of data sets, observations that deviate from the main part of the data (outliers) can have an undesirable influence on the analysis.
Many depth functions are "robust against the possibility of one or several unannounced outliers that may occur in the data and yield reasonable results even if several unannounced outliers occur in the data".
For example, "by adding k bad data points to a data-set one can corrupt at most the k-outermost (half-space) depth contours while the ones inside must still reflect the shape of the good data".
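The robustness property can be seen in a small one-dimensional sketch, where the halfspace depth of x reduces to the smaller of the number of observations at or below x and the number at or above x. Adding k points can raise the depth of any existing point by at most k and can never lower it, so the deepest point stays within the good data while the mean does not. The function names and the sample values, including the two outliers at 1000, are assumptions chosen for illustration.

```python
def halfspace_depth_1d(x, sample):
    """Halfspace depth of x on the real line: the smaller of the counts
    of observations <= x and >= x. (Hypothetical helper.)"""
    at_or_below = sum(1 for v in sample if v <= x)
    at_or_above = sum(1 for v in sample if v >= x)
    return min(at_or_below, at_or_above)


def deepest_point(sample):
    """Sample point with the largest depth (first one wins on ties)."""
    return max(sample, key=lambda v: halfspace_depth_1d(v, sample))


def mean(sample):
    return sum(sample) / len(sample)


clean = [1.0, 2.0, 3.0, 4.0, 5.0]
k = 2
contaminated = clean + [1000.0] * k  # k made-up outliers

# How much each good point's depth changes once the outliers are added.
depth_shift = [
    halfspace_depth_1d(v, contaminated) - halfspace_depth_1d(v, clean)
    for v in clean
]

print(deepest_point(clean), deepest_point(contaminated))
print(mean(clean), mean(contaminated))
print(depth_shift)
```

Here the deepest point moves only from 3 to 4, which is still inside the range of the good data, whereas the mean jumps from 3 to roughly 288. No depth value changes by more than k = 2.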
© 2006 John Hugg, Tufts University