Sensitive Attribute Detection#
Fairlens contains tools that allow users to analyse their datasets in order to detect columns that are sensitive or that act as proxies for other protected attributes, based on customisable configurations in its sensitive module.
Sensitive Attribute Detection#
Detecting sensitive columns in a dataframe is the main functionality of detection.py. This is done using detect_names_df(), a function which finds sensitive columns and builds a dictionary mapping the attribute names to the corresponding protected category (read from the configuration).
Let us take a look at an example dataframe, based on the default configuration config_engb.json:
In [1]: import pandas as pd
In [2]: import fairlens as fl
In [3]: columns = ["native", "location", "house", "subscription", "salary", "religion", "score"]
In [4]: df = pd.DataFrame(columns=columns)
Calling the function on this dataframe yields:
In [5]: fl.sensitive.detect_names_df(df)
Out[5]:
{'native': 'Nationality',
'location': 'Nationality',
'house': 'Family Status',
'religion': 'Religion'}
In some cases, the names of the dataframe columns alone might not be conclusive enough to decide on their sensitivity. For those cases, detect_names_df() has a deep_search flag that makes it look at the actual data entries as well. For example, let's assume we have a dataframe containing data referring to protected attributes, but where the column names are unrelated, and try using detection as in the previous example:
In [6]: columns = ["A", "B", "C", "Salary", "D", "Score"]
In [7]: data = [
...: ["male", "hearing impairment", "heterosexual", "50000", "christianity", "10"],
...: ["female", "obsessive compulsive disorder", "asexual", "60000", "daoism", "10"],
...: ]
...:
In [8]: df = pd.DataFrame(data, columns=columns)
In [9]: fl.sensitive.detect_names_df(df)
Out[9]: {}
As we can see, since the column names carry little meaning, shallow search does not suffice. However, if we turn deep_search on:
In [10]: fl.sensitive.detect_names_df(df, deep_search=True)
Out[10]: {'A': 'Gender', 'B': 'Disability', 'C': 'Sexual Orientation', 'D': 'Religion'}
It is also possible for users to implement their own string distance functions to be used by the detection algorithm. By default, the Ratcliff-Obershelp algorithm is used, but any function with type Callable[[Optional[str], Optional[str]], float] can be used. The detection threshold can also be changed to modify the strictness of the fuzzy matching.
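As an illustration, a distance function with this signature can be written with Python's built-in difflib, whose SequenceMatcher implements the Ratcliff-Obershelp algorithm. This is a sketch; the str_distance and threshold keyword names in the commented call are assumptions about the detect_names_df() API and should be checked against the installed version of fairlens:

```python
from difflib import SequenceMatcher
from typing import Optional

def ro_distance(a: Optional[str], b: Optional[str]) -> float:
    """Ratcliff-Obershelp distance: 0.0 for identical strings, up to 1.0 for disjoint ones."""
    if a is None or b is None:
        return 1.0
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()

print(ro_distance("Gender", "gender"))  # 0.0 (case-insensitive match)

# Hypothetical usage, assuming these keyword names exist:
# fl.sensitive.detect_names_df(df, str_distance=ro_distance, threshold=0.1)
```

Lowering the threshold makes matching stricter (only near-identical names are flagged), while raising it makes the fuzzy matching more permissive.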
Let us try applying the detection functionality in a more practical scenario, using the COMPAS dataset:
In [11]: df = pd.read_csv("../datasets/compas.csv")
In [12]: df.head()
Out[12]:
PersonID AssessmentID CaseID ... DecileScore ScoreText AssessmentType
0 50844 57167 51950 ... 2 Low New
1 50848 57174 51956 ... 1 Low New
2 50855 57181 51963 ... 8 High New
3 50850 57176 51958 ... 6 Medium New
4 50839 57162 51945 ... 2 Low New
[5 rows x 22 columns]
# Apply shallow detection algorithm.
In [13]: fl.sensitive.detect_names_df(df)
Out[13]:
{'Sex': 'Gender',
'Ethnicity': 'Ethnicity',
'DateOfBirth': 'Age',
'Language': 'Nationality',
'MaritalStatus': 'Family Status'}
As we can see, the shallow search has picked out the sensitive categories from the dataframe. Let's now see what happens with deep search; to make the task a bit more difficult, we first rename the sensitive columns to random names.
In [14]: df_deep = pd.read_csv("../datasets/compas.csv")
In [15]: df_deep = df_deep.rename(columns={"Ethnicity": "A", "Language": "Random", "MaritalStatus": "B", "Sex": "C"})
# Apply deep detection algorithm.
In [16]: fl.sensitive.detect_names_df(df_deep, deep_search=True)
Out[16]:
{'C': 'Gender',
'A': 'Ethnicity',
'DateOfBirth': 'Age',
'Random': 'Nationality',
'B': 'Family Status'}
The same sensitive columns have been picked out, but based solely on their content, as the column names themselves have become non-suggestive.
Custom Configurations#
The sensitive or protected group attribute detection algorithm is based on an underlying configuration: a JSON file containing the sensitive categories, each with a list of synonyms and possible values attached to it. Since the detection algorithm is currently based on fuzzy string matching, different languages and scopes will require new, comprehensive configurations.
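For illustration, a minimal configuration fragment along these lines might look as follows; the exact schema (key names and nesting) is an assumption and should be checked against the bundled config_engb.json:

```json
{
    "Gender": {
        "synonyms": ["gender", "sex"],
        "values": ["male", "female", "non-binary"]
    },
    "Religion": {
        "synonyms": ["religion", "faith", "belief"],
        "values": ["christianity", "islam", "judaism", "daoism"]
    }
}
```

Synonyms are matched against column names during shallow search, while values are matched against data entries during deep search.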
The default configuration is in English and in accordance with the UK Government's official list of protected groups and categories. The configuration can be changed through API functions from detection.py. For example, to change it to a new configuration config_custom.json placed in the configs folder of the sensitive module:
In [17]: from fairlens.sensitive import detection as dt
In [18]: dt.change_config("./configs/config_custom.json")
Any subsequent operations performed on dataframes using functions from detection.py will assume that the contents of the new configuration are the objects of interest and use them for inference.