find_column_correlation(col, df, threshold=0.1, str_distance=None, corr_cutoff=0.75, p_cutoff=0.1, config_path=None)

This function takes a series, or the name of a column in the given dataframe, and checks whether any of the sensitive attribute columns detected in the dataframe are strongly correlated with it. If matches are found, a list of the correlated column names and their associated sensitive categories is returned.

  • col (Union[str, pd.Series]) – Pandas series or dataframe column name that will be analyzed.

  • df (pd.DataFrame) – Dataframe supporting the search, possibly already containing a column with the input name.

  • threshold (float, optional) – The threshold for the string distance function that will be used for detecting sensitive columns. Defaults to 0.1.

  • str_distance (Callable[[Optional[str], Optional[str]], float], optional) – The string distance function that will be used for detecting sensitive columns. Defaults to the Ratcliff-Obershelp algorithm.

  • corr_cutoff (float, optional) – The cutoff for considering a column to be correlated with a sensitive attribute, using Pearson’s correlation. Defaults to 0.75.

  • p_cutoff (float, optional) – The p-value cutoff to be used when checking if a categorical column is correlated with a numeric column using the Kruskal-Wallis H Test. Defaults to 0.1.

  • config_path (Union[str, pathlib.Path], optional) – The path of the JSON configuration file in which the dictionaries used for detecting sensitive attributes are defined. By default, the configuration is the one describing protected attributes and groups according to the UK Government.
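
As an illustration of the str_distance parameter, the following is a minimal sketch of a callable matching the documented signature Callable[[Optional[str], Optional[str]], float]. It relies on the standard library's difflib.SequenceMatcher, whose matching is closely related to the Ratcliff-Obershelp algorithm. The name ro_distance, the handling of None, and the convention that 0 means identical and 1 means no overlap are assumptions of this sketch, not a description of the library's own default.

```python
from difflib import SequenceMatcher
from typing import Optional


def ro_distance(s1: Optional[str], s2: Optional[str]) -> float:
    """Hypothetical string distance in [0, 1]; 0 means identical strings."""
    # Assumption of this sketch: a missing string is maximally distant.
    if s1 is None or s2 is None:
        return 1.0
    # SequenceMatcher.ratio() is a similarity in [0, 1], so invert it
    # to obtain a distance. Comparison is case-insensitive here.
    return 1.0 - SequenceMatcher(None, s1.lower(), s2.lower()).ratio()
```

Under these assumptions, such a callable could be passed as str_distance=ro_distance, with threshold acting as the largest distance still treated as a match.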


The returned value is a list of tuples, each pairing a correlated sensitive column that was found with its associated sensitive category label.

Return type

List[Tuple[str, Optional[str]]]
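
To show how a value of this return type might be consumed, here is a small sketch that groups column names by category label. The sample list is written by hand purely for illustration and is not real output of the function; the helper name group_by_category and the column and category names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical result with the documented shape: (column name, category label).
result: List[Tuple[str, Optional[str]]] = [("age", "Age"), ("sex", "Gender")]


def group_by_category(
    pairs: List[Tuple[str, Optional[str]]]
) -> Dict[Optional[str], List[str]]:
    """Group correlated column names under their sensitive category label."""
    grouped: Dict[Optional[str], List[str]] = {}
    for column, category in pairs:
        # The category is Optional, so None is kept as a valid key.
        grouped.setdefault(category, []).append(column)
    return grouped


grouped = group_by_category(result)
```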