Sensitive Attribute Detection#
Fairlens contains tools that allow users to analyse their datasets in order to detect columns that are sensitive or that act as proxies for other protected attributes, based on customisable configurations in its sensitive module.
Sensitive Attribute Detection#
Detecting sensitive columns in a dataframe is the main functionality of detection.py. This is done using detect_names_df(), a function which finds sensitive columns and builds a dictionary mapping the attribute names to the corresponding protected category (read from the configuration).
Let us take a look at an example dataframe, based on the default configuration config_engb.json:
In [1]: import pandas as pd
In [2]: import fairlens as fl
In [3]: columns = ["native", "location", "house", "subscription", "salary", "religion", "score"]
In [4]: df = pd.DataFrame(columns=columns)
Calling the function on this dataframe yields:
In [5]: fl.sensitive.detect_names_df(df)
Out[5]:
{'native': 'Nationality',
'location': 'Nationality',
'house': 'Family Status',
'religion': 'Religion'}
In some cases, the names of the dataframe columns alone might not be conclusive enough to decide on their sensitivity. For those cases, detect_names_df() has a deep_search flag that makes it look at the actual data entries as well. For example, let's assume we have a dataframe containing data referring to protected attributes, but where the column names are unrelated, and try using detection as in the previous example:
In [6]: columns = ["A", "B", "C", "Salary", "D", "Score"]
In [7]: data = [
...: ["male", "hearing impairment", "heterosexual", "50000", "christianity", "10"],
...: ["female", "obsessive compulsive disorder", "asexual", "60000", "daoism", "10"],
...: ]
...:
In [8]: df = pd.DataFrame(data, columns=columns)
In [9]: fl.sensitive.detect_names_df(df)
Out[9]: {}
As we can see, since the column names carry little meaning, shallow search does not suffice. However, if we turn deep_search on:
In [10]: fl.sensitive.detect_names_df(df, deep_search=True)
Out[10]: {'A': 'Gender', 'B': 'Disability', 'C': 'Sexual Orientation', 'D': 'Religion'}
It is also possible for users to implement their own string distance functions to be used by the detection algorithm. By default, the Ratcliff-Obershelp algorithm is used, but any function with type Callable[[Optional[str], Optional[str]], float] can be used. The detection threshold can also be changed to modify the strictness of the fuzzy matching.
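As an illustration, a distance function with this signature can be written with Python's built-in difflib, whose SequenceMatcher implements the Ratcliff-Obershelp algorithm. This is a sketch; the str_distance and threshold keyword names in the commented call are assumptions about the detect_names_df() API and should be checked against the installed version of fairlens:

```python
from difflib import SequenceMatcher
from typing import Optional

def ro_distance(a: Optional[str], b: Optional[str]) -> float:
    """Ratcliff-Obershelp distance: 0.0 for identical strings, up to 1.0 for disjoint ones."""
    if a is None or b is None:
        return 1.0
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()

print(ro_distance("Gender", "gender"))  # 0.0 (case-insensitive match)

# Hypothetical usage, assuming these keyword names exist:
# fl.sensitive.detect_names_df(df, str_distance=ro_distance, threshold=0.1)
```

Lowering the threshold makes matching stricter (only near-identical names are flagged), while raising it makes the fuzzy matching more permissive.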
Let us try applying the detection functionality in a more practical scenario, using the COMPAS dataset:
In [11]: df = pd.read_csv("../datasets/compas.csv")
In [12]: df.head()
Out[12]:
PersonID AssessmentID CaseID ... DecileScore ScoreText AssessmentType
0 50844 57167 51950 ... 2 Low New
1 50848 57174 51956 ... 1 Low New
2 50855 57181 51963 ... 8 High New
3 50850 57176 51958 ... 6 Medium New
4 50839 57162 51945 ... 2 Low New
[5 rows x 22 columns]
# Apply shallow detection algorithm.
In [13]: fl.sensitive.detect_names_df(df)
Out[13]:
{'Sex': 'Gender',
'Ethnicity': 'Ethnicity',
'DateOfBirth': 'Age',
'Language': 'Nationality',
'MaritalStatus': 'Family Status'}
As we can see, the shallow search has picked out the sensitive categories from the dataframe. Let's now see what happens with deep search; to make the task a bit more difficult, we first rename the sensitive columns to random names.
In [14]: df_deep = pd.read_csv("../datasets/compas.csv")
In [15]: df_deep = df_deep.rename(columns={"Ethnicity": "A", "Language": "Random", "MaritalStatus": "B", "Sex": "C"})
# Apply deep detection algorithm.
In [16]: fl.sensitive.detect_names_df(df_deep, deep_search=True)
Out[16]:
{'C': 'Gender',
'A': 'Ethnicity',
'DateOfBirth': 'Age',
'Random': 'Nationality',
'B': 'Family Status'}
The same sensitive columns have been picked out, but based solely on their content, as the column names themselves have become non-suggestive.
Custom Configurations#
The sensitive or protected group attribute detection algorithm is based on an underlying configuration: a JSON file containing the sensitive categories, each with a list of synonyms and possible values attached to it. Since the detection algorithm is currently based on fuzzy string matching, different languages and scopes will require new, comprehensive configurations.
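For illustration, a minimal configuration fragment along these lines might look as follows; the exact schema (key names and nesting) is an assumption and should be checked against the bundled config_engb.json:

```json
{
    "Gender": {
        "synonyms": ["gender", "sex"],
        "values": ["male", "female", "non-binary"]
    },
    "Religion": {
        "synonyms": ["religion", "faith", "belief"],
        "values": ["christianity", "islam", "judaism", "daoism"]
    }
}
```

Synonyms are matched against column names during shallow search, while values are matched against data entries during deep search.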
The default configuration is in English and in accordance with the UK Government's official list of protected groups and categories. The configuration can be changed through API functions from detection.py. For example, to change it to a new configuration config_custom.json placed in the configs folder of the sensitive module:
In [17]: from fairlens.sensitive import detection as dt
In [18]: dt.change_config("./configs/config_custom.json")
Any subsequent operations performed on dataframes using functions from detection.py will assume that the contents of the new configuration are the objects of interest and use them for inference.