fairlens.sensitive.find_sensitive_correlations

find_sensitive_correlations(df, threshold=0.1, str_distance=None, corr_cutoff=0.75, p_cutoff=0.1, config_path=None)

Examines the columns that are not themselves detected as sensitive and checks whether any of them is strongly correlated with a sensitive column. For each such column, it reports both the name of the sensitive column and the sensitive category that column belongs to.
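
As a rough illustration of what "strongly correlated" can mean for two numeric columns, the sketch below computes Pearson's correlation by hand and compares it with a cutoff of 0.75, the default of corr_cutoff. It assumes that the absolute value of the coefficient is what gets compared with the cutoff; the helper names and the toy data are hypothetical and are not part of fairlens.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def exceeds_cutoff(xs, ys, corr_cutoff=0.75):
    # Assumption: the sign of the correlation is ignored, so a strong
    # negative correlation also counts as "strongly correlated".
    return abs(pearson_r(xs, ys)) > corr_cutoff


# Hypothetical toy columns: "age" plays the role of the sensitive column.
age = [23, 35, 47, 52, 61]
years_of_experience = [1, 12, 22, 30, 38]
shoe_size = [42, 38, 44, 39, 41]

print(exceeds_cutoff(age, years_of_experience))
print(exceeds_cutoff(age, shoe_size))
```

In this toy data, years_of_experience tracks age closely and would be flagged as a proxy, while shoe_size would not.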

Parameters
  • df (pd.DataFrame) – Pandas dataframe that will be analyzed.

  • threshold (float, optional) – The threshold for the string distance function that will be used for detecting sensitive columns. Defaults to 0.1.

  • str_distance (Callable[[Optional[str], Optional[str]], float], optional) – The string distance function that will be used for detecting sensitive columns. Defaults to the Ratcliff-Obershelp algorithm.

  • corr_cutoff (float, optional) – The cutoff on Pearson’s correlation above which a column is considered to be correlated with a sensitive attribute. Defaults to 0.75.

  • p_cutoff (float, optional) – The p-value cutoff to be used when checking if a categorical column is correlated with a numeric column using the Kruskal-Wallis H Test. Defaults to 0.1.

  • config_path (Union[str, pathlib.Path], optional) – The path of the JSON configuration file in which the dictionaries used for detecting sensitive attributes are defined. By default, the configuration is the one describing protected attributes and groups according to the UK Government.
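
To make the str_distance and threshold parameters more concrete, here is a minimal sketch of a callable with the documented signature. It is built on the standard library's difflib.SequenceMatcher, whose matching is closely related to Ratcliff-Obershelp. The function names, the lower-casing, the handling of None, and the strict "<" comparison against the threshold are assumptions made for illustration; they are not taken from the fairlens implementation.

```python
from difflib import SequenceMatcher
from typing import Optional


def ratcliff_obershelp_distance(a: Optional[str], b: Optional[str]) -> float:
    """Return a distance in [0, 1], where 0 means the strings are identical."""
    if a is None or b is None:
        # Assumption: a missing name is treated as maximally distant.
        return 1.0
    # SequenceMatcher.ratio() is a similarity in [0, 1], so invert it.
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()


def looks_sensitive(column: str, keyword: str, threshold: float = 0.1) -> bool:
    # Assumption: a column name matches when its distance to a known
    # sensitive keyword falls below the threshold.
    return ratcliff_obershelp_distance(column, keyword) < threshold


print(looks_sensitive("Gender", "gender"))
print(looks_sensitive("genders", "gender"))
print(looks_sensitive("salary", "gender"))
```

A function like ratcliff_obershelp_distance could be passed as str_distance. With a distance of this kind, a larger threshold admits looser name matches and a smaller one requires near-exact names.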

Returns

A dictionary that maps each non-sensitive column to a tuple, where the first entry is the name of the corresponding sensitive column in the dataframe and the second entry is the sensitive attribute category.

Return type

Dict[str, Tuple[Optional[str]]]
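
The sketch below shows one way a result of this shape might be consumed, following the description under Returns (non-sensitive column as key, a two-entry tuple as value). The dictionary is hand-written for illustration: the column names and category labels are hypothetical, and in practice the value would come from calling find_sensitive_correlations(df).

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical result, written by hand to mirror the documented structure:
# non-sensitive column -> (sensitive column name, sensitive category).
result: Dict[str, Tuple[Optional[str], Optional[str]]] = {
    "years_of_experience": ("age", "Age"),
    "first_name": ("sex", "Gender"),
}


def describe(correlations: Dict[str, Tuple[Optional[str], Optional[str]]]) -> List[str]:
    """Turn each entry into a human-readable line."""
    lines = []
    for column, (sensitive_column, category) in correlations.items():
        lines.append(
            f"{column!r} is correlated with {sensitive_column!r} (category: {category})"
        )
    return lines


for line in describe(result):
    print(line)
```

The keys of such a dictionary are the candidate proxy columns, so they are a natural starting point when deciding which features to review or drop before modelling.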