fairlens.sensitive.detect_names_df

detect_names_df(df, threshold=0.1, str_distance=None, deep_search=False, n_samples=20, config_path=None)

Detects the sensitive columns in a dataframe or list of strings and returns a dictionary that maps each attribute name to the corresponding sensitive category name (such as Gender or Religion). For dataframes, deep search can be enabled; it also inspects the values in each column and infers the sensitive category even when the column name is inconclusive.

Parameters
  • df (Union[pd.DataFrame, List[str]]) – Pandas dataframe or string list that will be analysed.

  • threshold (float, optional) – The threshold for the string distance function. Defaults to 0.1.

  • str_distance (Callable[[Optional[str], Optional[str]], float], optional) – The string distance function. Defaults to the Ratcliff-Obershelp algorithm.

  • deep_search (bool, optional) – The boolean flag that enables deep search when set to True. Deep search also uses the content of each column to check whether it is sensitive. Defaults to False.

  • n_samples (int, optional) – The number of values to sample from each series of a large dataset when using the deep search algorithm. A low sample number greatly improves speed and still produces accurate results, provided that the underlying dictionaries are comprehensive. Defaults to 20.

  • config_path (Union[str, pathlib.Path], optional) – The path of the JSON configuration file in which the dictionaries used for detecting sensitive attributes are defined. By default, the configuration is the one describing protected attributes and groups according to the UK Government.
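
As an illustration of the callable expected by str_distance, the sketch below builds a distance from the standard library's difflib.SequenceMatcher, which implements an algorithm closely related to Ratcliff-Obershelp. The function name is hypothetical, and two points are assumptions rather than documented behaviour: that lower values mean closer strings (so threshold acts as a maximum distance), and that a None argument should count as maximally distant.

```python
from difflib import SequenceMatcher
from typing import Optional


def ratcliff_obershelp_distance(a: Optional[str], b: Optional[str]) -> float:
    """Return a distance in [0, 1]: 0.0 for identical strings, 1.0 for no overlap."""
    # Assumption of this sketch: a missing string is maximally distant.
    if a is None or b is None:
        return 1.0
    # SequenceMatcher.ratio() is a similarity in [0, 1], so invert it.
    # Lower-casing makes the comparison case-insensitive (also an assumption).
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()
```

Under these assumptions, "gender" and "genders" are about 0.077 apart and would fall within the default threshold of 0.1, while "gender" and "age" would not. A function with this shape could then be passed as str_distance=ratcliff_obershelp_distance.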

Returns

A dictionary mapping each attribute name to a string naming the corresponding sensitive attribute category, or to None.

Return type

Dict[str, Optional[str]]

Examples

>>> import pandas as pd
>>> from fairlens.sensitive import detect_names_df
>>> detect_names_df(["age", "gender", "legality", "risk"])
{"age": "Age", "gender": "Gender"}
>>> col_names = ["native", "location", "house", "subscription", "salary", "religion", "score"]
>>> df = pd.DataFrame(columns=col_names)
>>> detect_names_df(df)
{"native": "Nationality", "location": "Nationality", "house": "Family Status", "religion": "Religion"}
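
The idea behind deep search and n_samples can be sketched roughly as follows: sample a limited number of values from a column and check how many appear in a dictionary of known values for each category. This is not the library's implementation. The dictionaries, the infer_category helper, and the min_ratio cut-off are all hypothetical and serve only to show why a small sample can suffice when the dictionaries are comprehensive.

```python
import random
from typing import Dict, Iterable, Optional, Set

# Hypothetical value dictionaries; the real ones come from the JSON configuration.
KNOWN_VALUES: Dict[str, Set[str]] = {
    "Gender": {"male", "female", "non-binary"},
    "Religion": {"christian", "muslim", "hindu", "buddhist", "jewish"},
}


def infer_category(
    values: Iterable[object],
    n_samples: int = 20,
    min_ratio: float = 0.8,  # hypothetical cut-off, not a library parameter
    seed: int = 0,
) -> Optional[str]:
    """Guess a sensitive category from column values, or return None."""
    # Normalise and deduplicate, skipping missing entries.
    unique = sorted({str(v).strip().lower() for v in values if v is not None})
    if not unique:
        return None
    # Sample only when the column has more distinct values than n_samples.
    if len(unique) > n_samples:
        unique = random.Random(seed).sample(unique, n_samples)
    for category, known in KNOWN_VALUES.items():
        hits = sum(value in known for value in unique)
        if hits / len(unique) >= min_ratio:
            return category
    return None
```

With these hypothetical dictionaries, a column holding "Male", "Female" and a missing entry is classified as Gender regardless of its name, while a numeric column yields None.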