fairlens.metrics.EarthMoversDistance
- class EarthMoversDistance(bin_edges=None)
Bases: fairlens.metrics.distance.CategoricalDistanceMetric
Earth mover's distance (EMD), also known as the Wasserstein 1-distance, for categorical data.
Using EarthMoversDistance on the raw data is faster and recommended.
Methods
__init__(bin_edges=None) – Initialize categorical distance metric.
check_input(x, y) – Check whether the input is valid.
distance(x, y) – Distance between the distribution of numerical data in x and y.
distance_pdf(p, q, bin_edges) – Distance between 2 aligned normalized histograms.
p_value(x, y) – Returns a p-value for the test that x and y are sampled from the same distribution.
- Parameters
bin_edges (Optional[numpy.ndarray]) – Bin edges used to bin continuous data or to describe pre-binned data. Defaults to None.
- __call__(x, y)
Calculate the distance between two distributions.
- Parameters
x (pd.Series) – The data in the column representing the first group.
y (pd.Series) – The data in the column representing the second group.
- Returns
The computed distance.
- Return type
Optional[float]
- __init__(bin_edges=None)
Initialize categorical distance metric.
- Parameters
bin_edges (Optional[np.ndarray], optional) – A NumPy array of bin edges used to bin continuous data, or to indicate the bins of pre-binned data for metrics that take the distance space into account. For example, for bins [0-5, 5-10, 10-15, 15-20], bin_edges would be [0, 5, 10, 15, 20]. See numpy.histogram_bin_edges() for more information.
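As a rough illustration of the bin-edge convention described above, the sketch below calls numpy.histogram directly (it does not use fairlens) to bin two samples with the edges [0, 5, 10, 15, 20] and normalize the counts. The sample values and variable names are invented for the example.

```python
import numpy as np

# Edges for the four bins 0-5, 5-10, 10-15, 15-20 (n bins need n + 1 edges).
bin_edges = np.array([0, 5, 10, 15, 20])

# Hypothetical samples for two groups.
x = np.array([1.0, 2.5, 6.0, 7.5, 12.0, 18.0])
y = np.array([3.0, 11.0, 13.0, 16.0, 17.0, 19.0])

# Count how many values fall in each bin. Both groups share the same edges,
# so the resulting histograms are aligned bin by bin.
x_counts, _ = np.histogram(x, bins=bin_edges)
y_counts, _ = np.histogram(y, bins=bin_edges)

# Normalize the counts so that each histogram sums to 1.
p = x_counts / x_counts.sum()
q = y_counts / y_counts.sum()

print(x_counts)  # [2 2 1 1]
print(y_counts)  # [1 0 2 3]
```

Because both histograms are built from one shared set of edges, they can be compared bin by bin.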
- check_input(x, y)
Check whether the input is valid. By default, returns False if x and y have different dtypes.
- Parameters
x (pd.Series) – The data in the column representing the first group.
y (pd.Series) – The data in the column representing the second group.
- Returns
Whether or not the input is valid.
- Return type
bool
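The default dtype rule described for check_input can be pictured with plain pandas. The helper same_dtype below is hypothetical and only mirrors the documented behaviour; it is not the library's implementation.

```python
import pandas as pd


def same_dtype(x: pd.Series, y: pd.Series) -> bool:
    """Hypothetical helper: True only if both columns share a dtype."""
    return bool(x.dtype == y.dtype)


ints = pd.Series([1, 2, 3])
more_ints = pd.Series([4, 5, 6])
floats = pd.Series([1.0, 2.0, 3.0])

print(same_dtype(ints, more_ints))  # True
print(same_dtype(ints, floats))     # False
```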
- distance(x, y)
Distance between the distribution of numerical data in x and y. Derived classes must implement this.
- Parameters
x (pd.Series) – Numerical data in a column.
y (pd.Series) – Numerical data in a column.
- Returns
The computed distance.
- Return type
float
- distance_pdf(p, q, bin_edges)
Distance between 2 aligned normalized histograms. Derived classes must implement this.
- Parameters
p (pd.Series) – A normalized histogram.
q (pd.Series) – A normalized histogram.
bin_edges (Optional[np.ndarray]) – bin_edges for binned continuous data. Used by metrics such as Earth Mover’s Distance to compute the distance metric space.
- Returns
The computed distance.
- Return type
float
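To make the idea of a distance between two aligned normalized histograms concrete, here is a minimal NumPy sketch of two textbook forms of the EMD. It is not the fairlens implementation, and the function names emd_categorical and emd_binned are hypothetical. The first assumes every pair of distinct categories is at distance 1. The second assumes one-dimensional bins whose mass sits at the bin centers.

```python
import numpy as np


def emd_categorical(p, q):
    """EMD when every pair of distinct categories is at distance 1.

    With this 0/1 ground distance the EMD reduces to the total variation
    distance, i.e. half the L1 distance between the histograms.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(0.5 * np.abs(p - q).sum())


def emd_binned(p, q, bin_edges):
    """EMD for 1-D binned data, placing each bin's mass at its center."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    edges = np.asarray(bin_edges, dtype=float)
    centers = (edges[:-1] + edges[1:]) / 2
    # In one dimension, the Wasserstein 1-distance is the area between the
    # two cumulative distributions.
    cdf_gap = np.abs(np.cumsum(p - q))[:-1]
    return float((cdf_gap * np.diff(centers)).sum())


p = [0.5, 0.5, 0.0, 0.0]
q = [0.0, 0.0, 0.5, 0.5]
bin_edges = [0, 5, 10, 15, 20]

print(emd_categorical(p, q))        # 1.0
print(emd_binned(p, q, bin_edges))  # 10.0
```

The two results differ because the binned version accounts for how far apart the bins are, while the categorical version only registers that mass changed category.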
- property id: str
A string identifier for the method. Used by fairlens.metrics.stat_distance(). Derived classes must implement this.
- p_value(x, y)
Returns a p-value for the test that x and y are sampled from the same distribution.
- Parameters
x (pd.Series) – Numerical data in a column.
y (pd.Series) – Numerical data in a column.
- Returns
The computed p-value.
- Return type
float
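One common way to obtain a p-value for the hypothesis that x and y come from the same distribution is a permutation test. The sketch below is a generic version under that assumption, not a description of how this class computes its p-value. The helper permutation_p_value and the placeholder statistic mean_gap are hypothetical; in practice the statistic would be the distance metric itself.

```python
import numpy as np


def permutation_p_value(x, y, statistic, n_perm=1000, seed=0):
    """Estimate a p-value by shuffling group labels (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = statistic(x, y)
    pooled = np.concatenate([x, y])
    count = 0
    for _ in range(n_perm):
        shuffled = rng.permutation(pooled)
        # Re-split into groups of the original sizes and recompute.
        if statistic(shuffled[: len(x)], shuffled[len(x):]) >= observed:
            count += 1
    # The +1 terms count the observed split itself, so the estimate is never 0.
    return (count + 1) / (n_perm + 1)


def mean_gap(a, b):
    """Placeholder statistic: absolute difference of the sample means."""
    return abs(a.mean() - b.mean())


rng = np.random.default_rng(42)
same = permutation_p_value(rng.normal(0, 1, 200), rng.normal(0, 1, 200), mean_gap)
shifted = permutation_p_value(rng.normal(0, 1, 200), rng.normal(3, 1, 200), mean_gap)
print(same, shifted)
```

A small p-value, as expected for the shifted sample, suggests the two groups are unlikely to share a distribution.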