Statistical Distances
FairLens provides a wide range of statistical distance metrics for measuring the difference
between the distributions of a variable in two potentially sensitive sub-groups of data. These metrics
can be used individually from the fairlens.metrics package, or called through the stat_distance wrapper function.
Let's import FairLens and load the COMPAS dataset.
In [1]: import numpy as np
   ...: import pandas as pd
In [2]: import fairlens as fl
In [3]: df = pd.read_csv("../datasets/compas.csv")
In [4]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20281 entries, 0 to 20280
Data columns (total 22 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PersonID 20281 non-null int64
1 AssessmentID 20281 non-null int64
2 CaseID 20281 non-null int64
3 Agency 20281 non-null object
4 LastName 20281 non-null object
5 FirstName 20281 non-null object
6 MiddleName 5216 non-null object
7 Sex 20281 non-null object
8 Ethnicity 20281 non-null object
9 DateOfBirth 20281 non-null object
10 ScaleSet 20281 non-null object
11 AssessmentReason 20281 non-null object
12 Language 20281 non-null object
13 LegalStatus 20281 non-null object
14 CustodyStatus 20281 non-null object
15 MaritalStatus 20281 non-null object
16 ScreeningDate 20281 non-null object
17 RecSupervisionLevelText 20281 non-null object
18 RawScore 20281 non-null float64
19 DecileScore 20281 non-null int64
20 ScoreText 20245 non-null object
21 AssessmentType 20281 non-null object
dtypes: float64(1), int64(4), object(17)
memory usage: 3.4+ MB
Distance metrics from fairlens.metrics can be used by passing two columns of data
to their __call__ method, which returns the respective metric, or None if it
cannot be computed. Different metrics take different keyword arguments, which can
be passed to their constructor.
In [5]: target_attr = "RawScore"
In [6]: x = df[df["Ethnicity"] == "African-American"][target_attr]
In [7]: y = df[df["Ethnicity"] == "Caucasian"][target_attr]
In [8]: fl.metrics.KolmogorovSmirnovDistance()(x, y)
Out[8]: 0.26075238442125354
In [9]: _, bin_edges = np.histogram(df[target_attr], bins="auto")
In [10]: fl.metrics.EarthMoversDistance(bin_edges)(x, y)
Out[10]: 0.5266093392666343
In [11]: norm_order = 1  # avoids shadowing the built-in ord()
In [12]: fl.metrics.Norm(ord=norm_order)(x, y)
Out[12]: 0.5231267237105415
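To make the first number above more concrete, the Kolmogorov-Smirnov distance is the largest absolute gap between the empirical cumulative distribution functions of the two samples. The following is a minimal, dependency-free sketch of that definition, not FairLens's implementation; the function name ks_distance and the sample values are made up for illustration.

```python
from bisect import bisect_right


def ks_distance(a, b):
    """Largest absolute gap between the empirical CDFs of two samples."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    # The gap can only change at observed values, so checking those is enough.
    points = sorted(set(a_sorted) | set(b_sorted))
    return max(
        abs(
            bisect_right(a_sorted, v) / len(a_sorted)
            - bisect_right(b_sorted, v) / len(b_sorted)
        )
        for v in points
    )


# Made-up scores, for illustration only.
group_a = [0.1, 0.4, 0.5, 0.9]
group_b = [0.2, 0.3, 0.35, 0.45]
print(ks_distance(group_a, group_b))  # 0.5
```

A value of 0 means the two empirical distributions coincide, and a value of 1 means the samples do not overlap at all.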
The stat_distance function provides a simplified wrapper for the distance metrics.
It lets us specify which sub-groups of data to measure the distance between,
using either a simplified dictionary notation or a predicate (a boolean pandas Series) directly.
In [13]: group1 = {"Ethnicity": ["African-American"]}
In [14]: group2 = df["Ethnicity"] == "Caucasian"
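One way to read the dictionary notation is as a compact description of a row filter. The sketch below assumes that the values listed for a column are alternatives (any of them matches) and that several columns must all match; this is an assumption about the semantics, and the helper name matches and the rows are hypothetical, not part of FairLens.

```python
def matches(row, group):
    """Return True if `row` satisfies every column constraint in `group`."""
    return all(row[column] in allowed for column, allowed in group.items())


# Made-up rows standing in for DataFrame records.
rows = [
    {"Ethnicity": "African-American", "Sex": "Male", "RawScore": -0.5},
    {"Ethnicity": "Caucasian", "Sex": "Male", "RawScore": -1.2},
    {"Ethnicity": "African-American", "Sex": "Female", "RawScore": -0.9},
    {"Ethnicity": "Hispanic", "Sex": "Male", "RawScore": -1.0},
]

# Hypothetical group: columns are AND-ed, values within a column are OR-ed.
group = {"Ethnicity": ["African-American", "Hispanic"], "Sex": ["Male"]}

# The equivalent predicate is one boolean per row.
mask = [matches(row, group) for row in rows]
selected_scores = [row["RawScore"] for row, keep in zip(rows, mask) if keep]
print(mask)
print(selected_scores)
```

Under that reading, a dictionary and a hand-written boolean predicate are two ways of expressing the same selection, which is why either form can describe a sub-group.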
We can now make a call to stat_distance. The mode parameter selects the
statistical distance metric and corresponds to the identifier returned by the id
function of the distance metric classes. If it is set to "auto", a suitable distance metric is
chosen based on the distribution of the target variable.
In [15]: fl.metrics.stat_distance(df, target_attr, group1, group2, mode="auto")
Out[15]: (0.26075238442125354,)
We can see that the distance between the groups is the same as the KolmogorovSmirnovDistance
value computed above: stat_distance has chosen KolmogorovSmirnovDistance
as the most suitable metric since the target column is continuous.
It is possible to get a p-value back with the distance by using the p_value flag.
In [16]: fl.metrics.stat_distance(df, target_attr, group1, group2, mode="auto", p_value=True)
Out[16]: (0.26075238442125354, 0.9817864673203285)
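For intuition about what a p-value attached to a distance can mean, a generic permutation test is sketched below. This is a standard resampling technique shown under stated assumptions, and it is not necessarily how FairLens computes the value above. The names permutation_p_value and mean_gap, the default of 2000 resamples, and the sample data are all hypothetical.

```python
import random


def mean_gap(a, b):
    """A simple stand-in distance: absolute difference of the sample means."""
    return abs(sum(a) / len(a) - sum(b) / len(b))


def permutation_p_value(a, b, distance, n_resamples=2000, seed=0):
    """Estimate how often randomly relabelled groups are at least as far apart."""
    rng = random.Random(seed)
    observed = distance(a, b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        # Shuffling the pooled data simulates "group membership does not matter".
        rng.shuffle(pooled)
        if distance(pooled[: len(a)], pooled[len(a):]) >= observed:
            hits += 1
    # The add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_resamples + 1)


# Made-up data: two clearly separated groups.
low = list(range(10))
high = list(range(100, 110))
p_separated = permutation_p_value(low, high, mean_gap)
p_identical = permutation_p_value(low, low, mean_gap)
print(p_separated, p_identical)
```

In this scheme a small p-value indicates that a distance as large as the observed one rarely arises when group labels are assigned at random.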
Additional keyword arguments are passed to the __init__ function of the distance metric. This can
be used to pass arguments such as bin_edges to categorical distance metrics.
In [17]: _, bin_edges = np.histogram(df[target_attr], bins="auto")
In [18]: fl.metrics.stat_distance(df, target_attr, group1, group2, mode="emd", bin_edges=bin_edges)
Out[18]: (0.5266093392666343,)
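The role of bin_edges can be illustrated with a small, dependency-free sketch of an earth mover's distance between two binned samples. It assumes normalised histograms and uses the spacing between bin centres as the cost of moving mass. FairLens's EarthMoversDistance may use a different ground distance or normalisation, so this sketch is not expected to reproduce the value above. The function names and data are made up for illustration.

```python
from bisect import bisect_right


def histogram(values, bin_edges):
    """Normalised counts per bin; the last bin includes its right edge."""
    counts = [0] * (len(bin_edges) - 1)
    for v in values:
        if bin_edges[0] <= v <= bin_edges[-1]:
            i = min(bisect_right(bin_edges, v) - 1, len(counts) - 1)
            counts[i] += 1
    total = sum(counts)  # assumes at least one value falls inside the bins
    return [c / total for c in counts]


def binned_emd(a, b, bin_edges):
    """1-D earth mover's distance between two samples binned on shared edges."""
    p, q = histogram(a, bin_edges), histogram(b, bin_edges)
    centers = [(lo + hi) / 2 for lo, hi in zip(bin_edges, bin_edges[1:])]
    emd, carried = 0.0, 0.0
    for i in range(len(centers) - 1):
        # Surplus mass in bins 0..i has to be carried on to the next bin centre.
        carried += p[i] - q[i]
        emd += abs(carried) * (centers[i + 1] - centers[i])
    return emd


# Made-up samples and edges, for illustration only.
edges = [0, 1, 2, 3, 4]
sample_a = [0.5, 0.5, 1.5, 1.5]
sample_b = [2.5, 2.5, 3.5, 3.5]
print(binned_emd(sample_a, sample_b, edges))  # 2.0
```

Because both samples are binned on the same edges, the choice of edges directly affects the result, which is why computing them once from the full column, as in the session above, keeps the two groups comparable.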