fairlens.metrics.CategoricalDistanceMetric#

class CategoricalDistanceMetric(bin_edges=None)[source]#

Bases: fairlens.metrics.distance.DistanceMetric

Base class for distance metrics on categorical data.

Continuous data is automatically binned to create histograms. Bin edges can be provided as an argument and will be used to bin continuous data. If the data has been pre-binned and consists of, for instance, pd.Interval objects, the histograms are computed from the counts in each bin, and bin_edges, if given, is used by metrics such as EarthMoversDistanceCategorical to compute the distance space.

Subclasses must implement a distance_pdf method.

Methods

__init__

Initialize categorical distance metric.

check_input

Check whether the input is valid.

distance

Distance between the distributions of numerical data in x and y.

distance_pdf

Distance between two aligned normalized histograms.

p_value

Returns a p-value for the test that x and y are sampled from the same distribution.

Parameters

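bin_edges (Optional[numpy.ndarray]) – Bin edges used to bin continuous data or to describe the bins of pre-binned data. See __init__ for details.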

__call__(x, y)#

Calculate the distance between two distributions.

Parameters
  • x (pd.Series) – The data in the column representing the first group.

  • y (pd.Series) – The data in the column representing the second group.

Returns

The computed distance.

Return type

Optional[float]

__init__(bin_edges=None)[source]#

Initialize categorical distance metric.

Parameters

bin_edges (Optional[np.ndarray], optional) – A numpy array of bin edges used to bin continuous data, or to indicate the bins of pre-binned data to metrics which take the distance space into account. For example, for bins [0-5, 5-10, 10-15, 15-20], bin_edges would be [0, 5, 10, 15, 20]. See numpy.histogram_bin_edges() for more information.
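To make the role of bin_edges concrete, the following is a minimal, dependency-free sketch of how two columns might be binned against a shared set of edges to obtain aligned normalized histograms. The helper normalized_histogram is hypothetical and is not part of fairlens; it mirrors the half-open binning convention of numpy.histogram, and the library's own binning may differ in its details.

```python
from bisect import bisect_right


def normalized_histogram(values, bin_edges):
    """Bin values into len(bin_edges) - 1 bins and normalize the counts to sum to 1.

    Hypothetical helper for illustration only; not a fairlens function.
    """
    n_bins = len(bin_edges) - 1
    counts = [0] * n_bins
    for v in values:
        if v < bin_edges[0] or v > bin_edges[-1]:
            continue  # ignore values outside the binned range
        # Bins are half-open [left, right), except the last bin,
        # which also includes its right edge.
        idx = min(bisect_right(bin_edges, v) - 1, n_bins - 1)
        counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * n_bins


# Bins [0-5, 5-10, 10-15, 15-20] are described by five edges.
bin_edges = [0, 5, 10, 15, 20]
x = [1, 2, 6, 7, 11, 19]
y = [3, 8, 9, 12, 13, 14, 16, 20]

# Using the same edges for both groups keeps the histograms aligned bin by bin.
p = normalized_histogram(x, bin_edges)
q = normalized_histogram(y, bin_edges)
```

Because both histograms share the same edges, the i-th entry of p and the i-th entry of q always refer to the same interval, which is the form of input that distance_pdf expects.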

check_input(x, y)[source]#

Check whether the input is valid. Returns False if x and y have different dtypes by default.

Parameters
  • x (pd.Series) – The data in the column representing the first group.

  • y (pd.Series) – The data in the column representing the second group.

Returns

Whether or not the input is valid.

Return type

bool

distance(x, y)[source]#

Distance between the distributions of numerical data in x and y. Derived classes must implement this.

Parameters
  • x (pd.Series) – Numerical data in a column.

  • y (pd.Series) – Numerical data in a column.

Returns

The computed distance.

Return type

float

abstract distance_pdf(p, q, bin_edges)[source]#

Distance between two aligned normalized histograms. Derived classes must implement this.

Parameters
  • p (pd.Series) – A normalized histogram.

  • q (pd.Series) – A normalized histogram.

  • bin_edges (Optional[np.ndarray]) – Bin edges for binned continuous data. Used by metrics such as Earth Mover’s Distance to compute the distance metric space.

Returns

The computed distance.

Return type

float
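As an illustration of what a distance_pdf implementation might compute, the sketch below contrasts a metric that ignores bin_edges with one that uses them to define the distance space. Both functions are hypothetical stand-alone helpers operating on plain lists, not fairlens methods, and the one-dimensional Earth Mover's Distance shown here (mass placed at bin centers) is one common formulation rather than necessarily the one the library uses.

```python
def total_variation(p, q, bin_edges=None):
    """Total variation distance between two aligned normalized histograms.

    bin_edges is accepted but unused: only per-bin differences matter.
    """
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def earth_movers_1d(p, q, bin_edges):
    """1-D Earth Mover's Distance, treating each bin's mass as sitting at its center.

    The cost of moving mass between neighboring bins is the gap between
    their centers, so the result depends on bin_edges.
    """
    centers = [(lo + hi) / 2 for lo, hi in zip(bin_edges[:-1], bin_edges[1:])]
    distance = 0.0
    cdf_gap = 0.0  # running difference between the two cumulative distributions
    for i in range(len(centers) - 1):
        cdf_gap += p[i] - q[i]
        distance += abs(cdf_gap) * (centers[i + 1] - centers[i])
    return distance


bin_edges = [0, 5, 10, 15, 20]
p = [1.0, 0.0, 0.0, 0.0]       # all mass in the first bin
q_near = [0.0, 1.0, 0.0, 0.0]  # all mass in the adjacent bin
q_far = [0.0, 0.0, 0.0, 1.0]   # all mass in the last bin

tv_near = total_variation(p, q_near)
tv_far = total_variation(p, q_far)
emd_near = earth_movers_1d(p, q_near, bin_edges)
emd_far = earth_movers_1d(p, q_far, bin_edges)
```

In this example the total variation distance is the same for q_near and q_far, while the Earth Mover's Distance grows with how far the mass has to travel, which is why such metrics need the bin edges.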

abstract property id: str#

A string identifier for the method. Used by fairlens.metrics.stat_distance(). Derived classes must implement this.

p_value(x, y)[source]#

Returns a p-value for the test that x and y are sampled from the same distribution.

Parameters
  • x (pd.Series) – Numerical data in a column.

  • y (pd.Series) – Numerical data in a column.

Returns

The computed p-value.

Return type

float
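One standard way to obtain a p-value for the hypothesis that x and y are sampled from the same distribution is a permutation test. The sketch below shows the idea under stated assumptions: permutation_p_value and mean_gap are hypothetical helpers, the choice of test statistic is arbitrary, and nothing here is claimed to match how fairlens computes p_value internally.

```python
import random


def permutation_p_value(x, y, distance, n_permutations=200, seed=0):
    """Estimate a p-value by shuffling group labels.

    Hypothetical helper for illustration only. `distance` is any callable
    taking two lists and returning a float.
    """
    rng = random.Random(seed)
    observed = distance(x, y)
    pooled = list(x) + list(y)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        # Split the shuffled pool back into groups of the original sizes.
        perm_x, perm_y = pooled[:len(x)], pooled[len(x):]
        if distance(perm_x, perm_y) >= observed:
            at_least_as_extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


def mean_gap(a, b):
    """Absolute difference of means, used here as a simple test statistic."""
    return abs(sum(a) / len(a) - sum(b) / len(b))


# Clearly separated groups should give a small p-value ...
p_separated = permutation_p_value(list(range(20)), list(range(100, 120)), mean_gap)
# ... while identical groups give a p-value of 1.
p_identical = permutation_p_value([1, 2, 3, 4], [1, 2, 3, 4], mean_gap)
```

A small p-value suggests that the observed distance would be unlikely if both groups came from the same distribution; the resolution of the estimate is limited by n_permutations.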