Statistical Distances

FairLens provides a range of statistical distance metrics for measuring how the distribution of a variable differs between two potentially sensitive sub-groups of data. The metrics can be used individually from the fairlens.metrics package, or called through the stat_distance wrapper function.

Let’s import the package and load the COMPAS dataset.

In [1]: import numpy as np
   ...: import pandas as pd

In [2]: import fairlens as fl

In [3]: df = pd.read_csv("../datasets/compas.csv")

In [4]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20281 entries, 0 to 20280
Data columns (total 22 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   PersonID                 20281 non-null  int64  
 1   AssessmentID             20281 non-null  int64  
 2   CaseID                   20281 non-null  int64  
 3   Agency                   20281 non-null  object 
 4   LastName                 20281 non-null  object 
 5   FirstName                20281 non-null  object 
 6   MiddleName               5216 non-null   object 
 7   Sex                      20281 non-null  object 
 8   Ethnicity                20281 non-null  object 
 9   DateOfBirth              20281 non-null  object 
 10  ScaleSet                 20281 non-null  object 
 11  AssessmentReason         20281 non-null  object 
 12  Language                 20281 non-null  object 
 13  LegalStatus              20281 non-null  object 
 14  CustodyStatus            20281 non-null  object 
 15  MaritalStatus            20281 non-null  object 
 16  ScreeningDate            20281 non-null  object 
 17  RecSupervisionLevelText  20281 non-null  object 
 18  RawScore                 20281 non-null  float64
 19  DecileScore              20281 non-null  int64  
 20  ScoreText                20245 non-null  object 
 21  AssessmentType           20281 non-null  object 
dtypes: float64(1), int64(4), object(17)
memory usage: 3.4+ MB

Each distance metric in fairlens.metrics is a class: construct an instance, then call it with two columns of data. The call (the __call__ method) returns the distance, or None if it cannot be computed. Metric-specific keyword arguments are passed to the constructor.

In [5]: target_attr = "RawScore"

In [6]: x = df[df["Ethnicity"] == "African-American"][target_attr]

In [7]: y = df[df["Ethnicity"] == "Caucasian"][target_attr]

In [8]: fl.metrics.KolmogorovSmirnovDistance()(x, y)
Out[8]: 0.26075238442125354

In [9]: _, bin_edges = np.histogram(df[target_attr], bins="auto")

In [10]: fl.metrics.EarthMoversDistance(bin_edges)(x, y)
Out[10]: 0.526609942555413

In [11]: ord = 1

In [12]: fl.metrics.Norm(ord=ord)(x, y)
Out[12]: 0.5231267237105415
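
For intuition about what the first of these numbers measures, here is a minimal pure-Python sketch of the two-sample Kolmogorov-Smirnov statistic: the largest vertical gap between the empirical CDFs of the two samples. The function name ks_distance is hypothetical and this is not FairLens's implementation; it only illustrates the quantity being reported.

```python
from bisect import bisect_right


def ks_distance(x, y):
    """Illustrative two-sample KS statistic (hypothetical helper, not part of FairLens)."""
    xs, ys = sorted(x), sorted(y)
    if not xs or not ys:
        # Assumption: follow the "return None if it cannot be computed" convention.
        return None

    def ecdf(sorted_sample, value):
        # Fraction of the sample that is <= value.
        return bisect_right(sorted_sample, value) / len(sorted_sample)

    # The gap between the two step functions can only change at observed values,
    # so it is enough to evaluate it there.
    return max(abs(ecdf(xs, v) - ecdf(ys, v)) for v in set(xs) | set(ys))


print(ks_distance([1, 2, 3, 4], [1, 2, 3, 4]))  # identical samples -> 0.0
print(ks_distance([0, 0, 0], [1, 1, 1]))        # fully separated samples -> 1.0
```

The statistic always lies between 0 and 1, so the value of about 0.26 above means the two empirical CDFs of RawScore differ by at most about 26 percentage points at any score.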

The stat_distance function is a simplified wrapper around the distance metrics. The sub-groups to compare can be specified either with a dictionary notation (column name mapped to a list of values) or directly as a predicate (a boolean mask over the rows).

In [13]: group1 = {"Ethnicity": ["African-American"]}

In [14]: group2 = df["Ethnicity"] == "Caucasian"
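
The two notations describe the same kind of object: a selection of rows. The sketch below shows one plausible way a dictionary group could be turned into a boolean mask with pandas. The helper group_to_mask and the toy data are hypothetical, and combining several columns with a logical AND is an assumption, not a statement about FairLens internals.

```python
import pandas as pd


def group_to_mask(df, group):
    """Hypothetical helper: turn {"column": [allowed values]} into a boolean mask."""
    mask = pd.Series(True, index=df.index)
    for column, values in group.items():
        # Assumption: conditions on different columns are combined with AND.
        mask &= df[column].isin(values)
    return mask


# Toy data invented for illustration only.
toy = pd.DataFrame({
    "Ethnicity": ["African-American", "Caucasian", "African-American", "Hispanic"],
    "RawScore": [0.1, -0.4, 0.3, -0.2],
})

mask = group_to_mask(toy, {"Ethnicity": ["African-American"]})
predicate = toy["Ethnicity"] == "African-American"

print(mask.tolist())                        # [True, False, True, False]
print(mask.tolist() == predicate.tolist())  # True
```

Under these assumptions the dictionary form is shorthand for simple membership tests, while a predicate can express arbitrary conditions such as numeric ranges.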

We can now call stat_distance. The mode parameter selects the statistical distance metric and corresponds to the id function of the distance metric classes. If it is set to “auto”, a suitable distance metric is chosen based on the distribution of the target variable.

In [15]: fl.metrics.stat_distance(df, target_attr, group1, group2, mode="auto")
Out[15]: (0.26075238442125354,)

The distance is the same as the KolmogorovSmirnovDistance value computed above, here returned as a one-element tuple. stat_distance chose KolmogorovSmirnovDistance because the target column is continuous.

Setting the p_value flag returns a p-value alongside the distance.

In [16]: fl.metrics.stat_distance(df, target_attr, group1, group2, mode="auto", p_value=True)
Out[16]: (0.26075238442125354, 6.76556519475455e-241)
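
To make the meaning of such a p-value concrete, the following sketch estimates one with a generic permutation test: shuffle the pooled data, re-split it into two groups of the original sizes, and count how often the shuffled distance is at least as large as the observed one. This is a standard resampling technique and not necessarily how FairLens computes the value above; the names permutation_p_value and mean_gap are hypothetical.

```python
import random


def mean_gap(a, b):
    """Toy distance for the demo: absolute difference of the sample means."""
    return abs(sum(a) / len(a) - sum(b) / len(b))


def permutation_p_value(x, y, distance, n_permutations=500, seed=0):
    """Hypothetical helper: permutation-test p-value for any two-sample distance."""
    rng = random.Random(seed)
    observed = distance(x, y)
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        # Re-split the shuffled pool into groups of the original sizes.
        if distance(pooled[:len(x)], pooled[len(x):]) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


p_separated = permutation_p_value(list(range(10)), list(range(100, 110)), mean_gap)
p_identical = permutation_p_value([1, 2, 3], [1, 2, 3], mean_gap)
print(p_separated, p_identical)
```

A small p-value means that a distance this large would rarely arise if group membership were unrelated to the target variable. Note that a permutation estimate can never fall below 1 / (n_permutations + 1).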

Any additional keyword arguments are forwarded to the __init__ method of the chosen distance metric. For example, bin_edges can be passed to categorical distance metrics.

In [17]: _, bin_edges = np.histogram(df[target_attr], bins="auto")

In [18]: fl.metrics.stat_distance(df, target_attr, group1, group2, mode="emd", bin_edges=bin_edges)
Out[18]: (0.526609942555413,)
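
The role of bin_edges can be illustrated with a small pure-Python sketch of an earth mover's distance between two histograms that share the same bins. Both samples are binned with the same edges, and the cost is the amount of probability mass that has to be carried from one bin to the next, weighted by the distance between bin centers. The helpers histogram and binned_emd are hypothetical; FairLens may use a different ground distance or normalisation, so this sketch is not expected to reproduce the value above.

```python
from bisect import bisect_right


def histogram(sample, bin_edges):
    """Normalised counts per bin; the last bin includes its right edge."""
    counts = [0] * (len(bin_edges) - 1)
    for v in sample:
        if bin_edges[0] <= v <= bin_edges[-1]:
            i = min(bisect_right(bin_edges, v) - 1, len(counts) - 1)
            counts[i] += 1
    total = sum(counts)
    if total == 0:
        raise ValueError("no sample values fall inside the bin range")
    return [c / total for c in counts]


def binned_emd(x, y, bin_edges):
    """Hypothetical helper: 1D earth mover's distance between two binned samples."""
    p, q = histogram(x, bin_edges), histogram(y, bin_edges)
    centers = [(lo + hi) / 2 for lo, hi in zip(bin_edges, bin_edges[1:])]
    emd, carried = 0.0, 0.0
    for i in range(len(centers) - 1):
        # Surplus mass accumulated so far must be moved on to the next bin.
        carried += p[i] - q[i]
        emd += abs(carried) * (centers[i + 1] - centers[i])
    return emd


edges = [0, 1, 2, 3]
print(binned_emd([0.5, 0.5], [2.5, 2.5], edges))  # all mass moves two bins -> 2.0
print(binned_emd([0.5, 1.5], [0.5, 1.5], edges))  # identical histograms -> 0.0
```

Because both histograms must be defined over the same bins to be comparable, the edges are computed once from the whole column and shared by both groups, as in the call above.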