Significance Tests

When measuring statistical distances and correlations, users often want to test whether the resulting values are statistically significant.

Resampling

Say we want to test the hypothesis that the distribution of a variable in a subgroup is different to that in the remaining population. For arbitrary metrics and statistics, we can use the data to tell us about their distribution. This is done by using methods such as bootstrapping or permutation tests to resample the data multiple times and recompute the statistic on each sample, thereby providing an estimate for its distribution. The distribution can then be used to compute a p-value or a confidence interval for the metric.
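
As a rough illustration of the resampling idea described above, the sketch below implements a permutation resampler and a bootstrap resampler in plain NumPy. The helper names (`permutation_distribution`, `bootstrap_distribution`, `mean_difference`) and the synthetic groups are assumptions made for this example; they are not part of fairlens and may differ from how the library resamples internally.

```python
import numpy as np


def permutation_distribution(x, y, statistic, n_perm=100, seed=0):
    """Recompute `statistic` after randomly reassigning group membership."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([x, y])
    out = np.empty(n_perm)
    for i in range(n_perm):
        shuffled = rng.permutation(pooled)
        # The first len(x) values play the role of group 1, the rest group 2.
        out[i] = statistic(shuffled[: len(x)], shuffled[len(x):])
    return out


def bootstrap_distribution(x, y, statistic, n_samples=100, seed=0):
    """Recompute `statistic` on samples drawn with replacement from each group."""
    rng = np.random.default_rng(seed)
    out = np.empty(n_samples)
    for i in range(n_samples):
        xb = rng.choice(x, size=len(x), replace=True)
        yb = rng.choice(y, size=len(y), replace=True)
        out[i] = statistic(xb, yb)
    return out


def mean_difference(a, b):
    return a.mean() - b.mean()


# Synthetic stand-ins for two subgroups (hypothetical data, not the COMPAS dataset).
data_rng = np.random.default_rng(42)
x = data_rng.normal(loc=0.3, scale=1.0, size=500)
y = data_rng.normal(loc=0.0, scale=1.0, size=400)

perm = permutation_distribution(x, y, mean_difference, n_perm=200)
boot = bootstrap_distribution(x, y, mean_difference, n_samples=200)
```

In this sketch the permutation values scatter around zero, because shuffling removes any association between group membership and the variable, while the bootstrap values scatter around the observed difference between the groups.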

Users can resample their data using methods from fairlens.metrics to estimate the distribution of their test statistic. For instance, to estimate the distribution of the difference between the means of two subgroups using permutation testing or bootstrapping, we can do the following.

In [1]: import pandas as pd

In [2]: import fairlens as fl

In [3]: df = pd.read_csv("../datasets/compas.csv")

In [4]: group1 = df[df["Sex"] == "Male"]["RawScore"]

In [5]: group2 = df[df["Sex"] == "Female"]["RawScore"]

In [6]: test_statistic = lambda x, y: x.mean() - y.mean()

In [7]: t_distribution = fl.metrics.permutation_statistic(group1, group2, test_statistic, n_perm=100)

In [8]: t_distribution
Out[8]: 
array([ 0.01286838,  0.02142541,  0.00337466, -0.00592307, -0.01974862,
        0.00136294, -0.02096199,  0.00688509, -0.00781374, -0.00074677,
       -0.00604124, -0.00607006,  0.01387136,  0.002046  , -0.01778013,
       -0.02228201,  0.00164539, -0.0173017 , -0.0083527 , -0.01675121,
       -0.00727479,  0.00819934, -0.00377012, -0.00122232, -0.00846222,
        0.02124383, -0.01278541, -0.00315911,  0.01935028, -0.0212185 ,
       -0.01974286,  0.00667758,  0.00044354, -0.01154898, -0.00408139,
        0.00321326,  0.0047235 , -0.00330322, -0.02104558, -0.0024357 ,
        0.01809079, -0.01953246, -0.00320811, -0.03821727,  0.01253117,
        0.00227946,  0.00497712,  0.01709069, -0.00225989, -0.01580587,
       -0.00381047, -0.02887919, -0.00848816,  0.01967596,  0.01674196,
       -0.01973709,  0.00951647,  0.01056268,  0.01489163, -0.03951999,
       -0.01598457,  0.0342451 , -0.01465302, -0.00018764,  0.0074961 ,
        0.01457172,  0.02149746,  0.01532395, -0.00795497, -0.00861209,
        0.00012075,  0.03148979,  0.00441223,  0.01936469, -0.01926731,
       -0.01379992,  0.03159931,  0.001386  ,  0.00608386, -0.00013   ,
        0.00081822, -0.00458288,  0.00835786,  0.00339196, -0.00727479,
       -0.01179684, -0.02681271, -0.00374707, -0.0078858 , -0.01147116,
        0.01035229,  0.00496271,  0.00631155,  0.00661129, -0.0190108 ,
       -0.02351844, -0.01281711, -0.01940277,  0.0152029 ,  0.00487625])

In [9]: t_distribution = fl.metrics.bootstrap_statistic(group1, group2, test_statistic, n_samples=100)

In [10]: t_distribution
Out[10]: 
array([0.26608373, 0.29933828, 0.25179608, 0.26880486, 0.2818501 ,
       0.26238803, 0.2441391 , 0.26988521, 0.28785505, 0.26253883,
       0.27808238, 0.28851902, 0.24522395, 0.29042539, 0.28939455,
       0.2489354 , 0.28331758, 0.24391852, 0.27556381, 0.27597119,
       0.30116363, 0.27623678, 0.27054918, 0.27244655, 0.26559082,
       0.24680846, 0.28877335, 0.27703579, 0.26148098, 0.25946658,
       0.29509115, 0.27383975, 0.25932703, 0.25119739, 0.2379856 ,
       0.28228899, 0.29041864, 0.27453747, 0.27057619, 0.27520594,
       0.27039388, 0.26960612, 0.27967815, 0.26098132, 0.2813752 ,
       0.24898942, 0.2419919 , 0.26435742, 0.25946658, 0.29125366,
       0.25728112, 0.28851002, 0.28760972, 0.26834346, 0.30146072,
       0.28900518, 0.26641008, 0.28576637, 0.2567657 , 0.27849201,
       0.30372496, 0.25273914, 0.28678821, 0.29401756, 0.25266487,
       0.26247355, 0.25901418, 0.29288769, 0.26712357, 0.26867432,
       0.28620077, 0.28834796, 0.25640558, 0.27189287, 0.25450822,
       0.26054918, 0.2773734 , 0.28823543, 0.2396804 , 0.28382624,
       0.23843349, 0.26538825, 0.30276165, 0.27771326, 0.22208418,
       0.24261535, 0.2838465 , 0.30166779, 0.26745217, 0.25525096,
       0.28066622, 0.27659239, 0.28871933, 0.26612199, 0.28482782,
       0.28491785, 0.25397704, 0.25208868, 0.26918073, 0.2763223 ])

Intervals and P-Values

The distribution of the test statistic produced by resampling can be used to compute a confidence interval or a p-value. We can use our bootstrapped distribution from above to do so using the following methods.

In [11]: t_observed = test_statistic(group1, group2)

In [12]: fl.metrics.resampling_interval(t_observed, t_distribution, cl=0.95)
Out[12]: (0.23902577087553453, 0.30156943506639666)

In [13]: fl.metrics.resampling_p_value(t_observed, t_distribution, alternative="two-sided")
Out[13]: 0.49
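
To make the two quantities concrete, here is a minimal NumPy sketch of a percentile interval and a two-sided p-value computed from a resampled distribution. The function names and the synthetic distribution are assumptions for illustration only; this is not necessarily how fairlens computes these values. The p-value helper assumes its input approximates the distribution of the statistic under the null hypothesis, as a permutation distribution does.

```python
import numpy as np


def percentile_interval(t_distribution, cl=0.95):
    """Return the central `cl` fraction of the resampled statistics."""
    alpha = 1.0 - cl
    lower, upper = np.quantile(t_distribution, [alpha / 2, 1 - alpha / 2])
    return float(lower), float(upper)


def two_sided_p_value(t_observed, null_distribution):
    """Fraction of null statistics at least as extreme as the observed one."""
    null_distribution = np.asarray(null_distribution)
    return float(np.mean(np.abs(null_distribution) >= abs(t_observed)))


# Hypothetical null distribution, standing in for the output of a permutation test.
rng = np.random.default_rng(0)
null_distribution = rng.normal(loc=0.0, scale=1.0, size=1000)

lower, upper = percentile_interval(null_distribution, cl=0.95)
p_value = two_sided_p_value(3.0, null_distribution)
```

Here an observed statistic of 3.0 lies far in the tail of the synthetic null distribution, so the resulting p-value is small.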
