Hidden Correlations#

Fairlens offers a range of correlation metrics that analyze associations between two numerical series, two categorical series or between categorical and numerical ones. These metrics are used both to generate correlation heatmaps of datasets to provide a general overview and to detect columns that act as proxies for the sensitive attributes of datasets, which pose the risk of training biased models.

Sensitive Proxy Detection#

In some datasets, it is possible that some apparently insensitive attributes are correlated highly enough with a sensitive column that they effectively become proxies for them and posing the danger to make a biased machine learning model if the dataset is used for training.

As such, the sensitive package provides utilities for scanning dataframes and detecting insensitive columns that are correlated with a protected category. For a dataframe, a user can choose to scan the whole dataframe and its columns or to provide an exterior Pandas series of interest that will be tested against the sensitive columns of the data.

In a similar fashion to the detection function, users have the possibility to provide their own custom string distance function and a threshold, as well as specify the correlation cutoff, which is a number representing the minimum correlation coefficient needed to consider two columns to be correlated.

Let’s first look at how we would go about detecting correlations inside a dataframe:

In [1]: import pandas as pd

In [2]: import fairlens as fl

In [3]: columns = ["gender", "random", "score"]

In [4]: data = [["male", 10, 50], ["female", 20, 80], ["male", 20, 60], ["female", 10, 90]]

In [5]: df = pd.DataFrame(data, columns=columns)

Here the score seems to be correlated with gender, with females leaning towards somewhat higher scores. This is picked up by the function, specifying both the insensitive and sensitive columns, as well as the protected category of the sensitive one:

In [6]: fl.sensitive.find_sensitive_correlations(df)
Out[6]: {}

In this example, the two scores are both correlated with sensitive columns, the first one with gender and the second with nationality:

In [7]: col_names = ["gender", "nationality", "random", "corr1", "corr2"]

In [8]: data = [
   ...:     ["woman", "spanish", 715, 10, 20],
   ...:     ["man", "spanish", 1008, 20, 20],
   ...:     ["man", "french", 932, 20, 10],
   ...:     ["woman", "french", 1300, 10, 10],
   ...: ]
   ...: 

In [9]: df = pd.DataFrame(data, columns=col_names)

In [10]: fl.sensitive.find_sensitive_correlations(df)
Out[10]: {'corr1': [('gender', 'Gender')], 'corr2': [('nationality', 'Nationality')]}

Correlation Heatmaps#

The plot module allows users to generate a correlation heatmap of any dataset by simply passing the dataframe to the two_column_heatmap() function, which will plot a heatmap from the matrix of the correlation coefficients computed by using the Pearson Coefficient, the Kruskal-Wallis Test and Cramer’s V between each two of the columns (for numerical-numerical, categorical-numerical and categorical-categorical associations, respectively).

To offer the possibility for extensibility, users are allowed to provide some or all of the correlation metrics themselves instead of just using the defaults.

Note

The fairlens.metrics package provides a number of correlation metrics for any type of association.

Alternatively, users can opt to implement their own metrics provided that they have two pandas.Series objects as input and return a float that will be used as the correlation coefficient in the heatmap.

Let’s look at an example by generating a correlation heatmap using the COMPAS dataset. First, we will load the data and check what columns in contains.

In [11]: df = pd.read_csv("../datasets/german_credit_data.csv")

In [12]: df
Out[12]: 
     Age     Sex  Job  ... Credit amount Duration              Purpose
0     67    male    2  ...          1169        6             radio/TV
1     22  female    2  ...          5951       48             radio/TV
2     49    male    1  ...          2096       12            education
3     45    male    2  ...          7882       42  furniture/equipment
4     53    male    2  ...          4870       24                  car
..   ...     ...  ...  ...           ...      ...                  ...
995   31  female    1  ...          1736       12  furniture/equipment
996   40    male    3  ...          3857       30                  car
997   38    male    2  ...           804       12             radio/TV
998   23    male    2  ...          1845       45             radio/TV
999   27    male    2  ...          4576       45                  car

[1000 rows x 9 columns]

We can generate a correlation heatmap to get a rough idea of any potentially hidden correlations. This will automatically choose different methods for different types of data, however, these are configurable.

In [13]: fl.plot.two_column_heatmap(df)
../_images/corr_heatmap_1.png

Let’s try generating a heatmap of the same dataset, but using some non-linear metrics for numerical-numerical and numerical-categorical associations for added precision.

In [14]: from fairlens.metrics import distance_nn_correlation, distance_cn_correlation, cramers_v

In [15]: fl.plot.two_column_heatmap(df, distance_nn_correlation, distance_cn_correlation, cramers_v)
../_images/corr_heatmap_2.png