fairlens.metrics.CategoricalDistanceMetric
- class CategoricalDistanceMetric(bin_edges=None)
Bases: fairlens.metrics.distance.DistanceMetric
Base class for distance metrics on categorical data.
Continuous data is automatically binned to create histograms. Bin edges can be provided as an argument and will be used to bin continuous data. If the data has been pre-binned and consists of pd.Intervals, for instance, the histograms are computed from the counts of each bin, and bin_edges, if given, is used by metrics such as EarthMoversDistanceCategorical to compute the distance space.
Subclasses must implement a distance_pdf method.
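The binning step described above can be sketched with plain NumPy. This is a minimal illustration under stated assumptions, not the fairlens implementation: aligned_histograms is a hypothetical helper, and pooling both samples to derive shared edges when bin_edges is None is an assumption made for the example.

```python
import numpy as np


def aligned_histograms(x, y, bin_edges=None):
    """Bin two samples on shared edges and return normalized histograms.

    Hypothetical helper for illustration only; not part of fairlens.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if bin_edges is None:
        # Assumption: derive shared edges from the pooled data so that
        # both histograms are defined over the same bins.
        bin_edges = np.histogram_bin_edges(np.concatenate([x, y]), bins="auto")
    p, _ = np.histogram(x, bins=bin_edges)
    q, _ = np.histogram(y, bins=bin_edges)
    # Convert counts to probabilities (assumes each sample has data in range).
    return p / p.sum(), q / q.sum(), np.asarray(bin_edges)


p, q, edges = aligned_histograms(
    [1, 2, 3, 7, 8], [2, 9, 12, 18], bin_edges=np.array([0, 5, 10, 15, 20])
)
```

Because both histograms share the same edges, p and q can be compared bin by bin, which is the form of input a distance_pdf method expects.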
Methods
__init__([bin_edges]) – Initialize categorical distance metric.
check_input(x, y) – Check whether the input is valid.
distance(x, y) – Distance between the distribution of numerical data in x and y.
distance_pdf(p, q, bin_edges) – Distance between 2 aligned normalized histograms.
p_value(x, y) – Returns a p-value for the test that x and y are sampled from the same distribution.
- Parameters
bin_edges (Optional[numpy.ndarray]) –
- __call__(x, y)
Calculate the distance between two distributions.
- Parameters
x (pd.Series) – The data in the column representing the first group.
y (pd.Series) – The data in the column representing the second group.
- Returns
The computed distance.
- Return type
Optional[float]
- __init__(bin_edges=None)
Initialize categorical distance metric.
- Parameters
bin_edges (Optional[np.ndarray], optional) – A numpy array of bin edges used to bin continuous data, or to describe the bins of pre-binned data for metrics that take the distance space into account. For example, for bins [0-5, 5-10, 10-15, 15-20], bin_edges would be [0, 5, 10, 15, 20]. See numpy.histogram_bin_edges() for more information.
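To make the relationship between bins and edges concrete, the sketch below builds pre-binned pd.Interval data from the edges in the example above using pandas.cut. The variable names and sample values are illustrative, and the choice of half-open bins (right=False) is an assumption; the documentation does not say which side of each bin is closed.

```python
import numpy as np
import pandas as pd

# Edges for the bins 0-5, 5-10, 10-15 and 15-20: one more edge than bins.
bin_edges = np.array([0, 5, 10, 15, 20])
ages = pd.Series([1, 4, 6, 11, 12, 19])  # illustrative values

# Assumption: half-open bins [0, 5), [5, 10), [10, 15), [15, 20).
binned = pd.cut(ages, bins=bin_edges, right=False)

# Each element is now a pd.Interval; count how many values fall in each bin.
counts = binned.value_counts().sort_index()
```

A series like binned is an instance of the pre-binned case: the intervals carry the bin membership, while bin_edges still records where the bins sit on the number line.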
- check_input(x, y)
Check whether the input is valid. By default, returns False if x and y have different dtypes.
- Parameters
x (pd.Series) – The data in the column representing the first group.
y (pd.Series) – The data in the column representing the second group.
- Returns
Whether or not the input is valid.
- Return type
bool
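The default behaviour described for check_input can be approximated as a plain dtype comparison. The function below is a hypothetical stand-in written for illustration; the actual method may perform additional checks.

```python
import pandas as pd


def check_input_sketch(x: pd.Series, y: pd.Series) -> bool:
    """Stand-in for the documented default: valid only if dtypes match."""
    return bool(x.dtype == y.dtype)


a = pd.Series([1.0, 2.0])
b = pd.Series([3.0, 4.0])
c = pd.Series(["u", "v"])

same = check_input_sketch(a, b)   # both float64
mixed = check_input_sketch(a, c)  # float64 vs. a string/object dtype
```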
- distance(x, y)
Distance between the distribution of numerical data in x and y. Derived classes must implement this.
- Parameters
x (pd.Series) – Numerical data in a column.
y (pd.Series) – Numerical data in a column.
- Returns
The computed distance.
- Return type
float
- abstract distance_pdf(p, q, bin_edges)
Distance between 2 aligned normalized histograms. Derived classes must implement this.
- Parameters
p (pd.Series) – A normalized histogram.
q (pd.Series) – A normalized histogram.
bin_edges (Optional[np.ndarray]) – Bin edges for binned continuous data. Used by metrics such as Earth Mover’s Distance to compute the distance metric space.
- Returns
The computed distance.
- Return type
float
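As an illustration of how bin_edges can define the distance space, the sketch below computes a one-dimensional Earth Mover's Distance between two aligned normalized histograms. It is a standalone function written for this example, not the fairlens implementation of EarthMoversDistanceCategorical. Placing each bin's mass at its center, and falling back to unit spacing when bin_edges is None, are both assumptions.

```python
import numpy as np
import pandas as pd


def emd_distance_pdf(p, q, bin_edges=None):
    """1-D Earth Mover's Distance between aligned normalized histograms.

    Hypothetical sketch; assumes p and q each sum to 1 over the same bins.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    if bin_edges is None:
        # Assumption: without edges, treat bins as unit-spaced positions.
        centers = np.arange(len(p), dtype=float)
    else:
        edges = np.asarray(bin_edges, dtype=float)
        # Assumption: the mass of each bin sits at the bin center.
        centers = (edges[:-1] + edges[1:]) / 2
    # Mass that has to cross each gap between neighbouring bins.
    flow = np.abs(np.cumsum(p - q))[:-1]
    # Cost is the mass moved times the distance between bin centers.
    return float(np.sum(flow * np.diff(centers)))


p = pd.Series([1.0, 0.0])
q = pd.Series([0.0, 1.0])
d_edges = emd_distance_pdf(p, q, bin_edges=np.array([0, 5, 10]))
d_unit = emd_distance_pdf(p, q)
```

With edges [0, 5, 10] the bin centers are 5 apart, so moving all of the mass from the first bin to the second costs 5.0; without edges the same move costs 1.0. This is why a metric that takes the distance space into account needs bin_edges.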
- abstract property id: str
A string identifier for the method. Used by fairlens.metrics.stat_distance(). Derived classes must implement this.