fairlens.FairnessScorer

class FairnessScorer(df, target_attr, sensitive_attrs=None, detect_sensitive=False, distr_type=None, sensitive_distr_types=None)

Bases: object

This class analyzes a given DataFrame, looks for biases and quantifies fairness.

Methods

__init__

Fairness Scorer constructor

compare_group_statistics

Generate a report of statistical measures (mean, variance) of the target distributions with respect to each combination of the sensitive attributes by default, or with respect to the groups passed as input if group_mode is set to "manual".

demographic_report

Generate a report on the fairness of different groups of sensitive attributes.

distribution_score

Returns a dataframe consisting of all unique sub-groups and their statistical distance to the rest of the population w.r.t. the target variable.

plot_distributions

Plot the distributions of the target variable with respect to all sensitive values.

Parameters
  • df (pandas.core.frame.DataFrame) –

  • target_attr (str) –

  • sensitive_attrs (Optional[Sequence[str]]) –

  • detect_sensitive (bool) –

  • distr_type (Optional[str]) –

  • sensitive_distr_types (Optional[Mapping[str, str]]) –

__init__(df, target_attr, sensitive_attrs=None, detect_sensitive=False, distr_type=None, sensitive_distr_types=None)

Fairness Scorer constructor

Parameters
  • df (pd.DataFrame) – Input DataFrame to be scored.

  • target_attr (str) – The target attribute name.

  • sensitive_attrs (Optional[Sequence[str]], optional) – The sensitive attribute names. Defaults to None.

  • detect_sensitive (bool, optional) – Whether to try to detect sensitive attributes from the column names. Defaults to False.

  • distr_type (Optional[str], optional) – The type of distribution of the target attribute. Can take values from [“categorical”, “continuous”, “binary”, “datetime”]. If None, the type of distribution is inferred based on the data in the column. Defaults to None.

  • sensitive_distr_types (Optional[Mapping[str, str]], optional) – The type of distribution of the sensitive attributes. Passed as a mapping from sensitive attribute name to corresponding distribution type. Can take values from [“categorical”, “continuous”, “binary”, “datetime”]. If None, the distribution type of each sensitive attribute is inferred from the data in the respective column. Defaults to None.
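The inference mentioned for distr_type and sensitive_distr_types can be pictured with a small standard-library sketch. This is not the FairLens implementation: the helper infer_distr_type, its max_categories threshold, and the order of the checks are hypothetical choices made only to show how the four documented labels might be assigned from column data.

```python
from datetime import date, datetime


def infer_distr_type(values, max_categories=10):
    """Guess one of "datetime", "binary", "continuous" or "categorical".

    Hypothetical heuristic for illustration only; the threshold
    `max_categories` is an assumption, not a FairLens parameter.
    """
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("cannot infer a distribution type from an empty column")

    unique = set(present)
    # Columns made up entirely of date-like objects are treated as datetime.
    if all(isinstance(v, (date, datetime)) for v in present):
        return "datetime"
    # Two or fewer distinct values: treat as binary.
    if len(unique) <= 2:
        return "binary"
    # Numeric columns with many distinct values: treat as continuous.
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    if numeric and len(unique) > max_categories:
        return "continuous"
    # Everything else falls back to categorical.
    return "categorical"


binary_type = infer_distr_type([0, 1, 1, 0])
continuous_type = infer_distr_type([i * 0.5 for i in range(50)])
categorical_type = infer_distr_type(["a", "b", "c"])
datetime_type = infer_distr_type([date(2020, 1, 1), date(2021, 1, 1), date(2022, 1, 1)])
```

When the inferred type is not the one intended, passing distr_type or an entry in sensitive_distr_types explicitly overrides the inference, as described above.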

compare_group_statistics(group_mode='auto', categorical_mode='entropy', groups=None, max_comb=4)

Generate a report of statistical measures (mean, variance) of the target distributions with respect to each combination of the sensitive attributes by default, or with respect to the groups passed as input if group_mode is set to “manual”. Combinations of sensitive attributes or input groups are limited to a maximum depth of max_comb.

Parameters
  • group_mode (str, optional) – If set to “auto”, the function will consider combinations of pre-detected sensitive attributes, similar to distribution_score. If set to “manual”, the groups have to be provided by the user. Defaults to “auto”.

  • categorical_mode (str, optional) – Decides which measures to be used if the target attribute is categorical. Defaults to “entropy”.

  • groups (List[Union[Mapping[str, List[Any]], pd.Series]], optional) – List of groups to be compared. Ignored if group_mode is set to “auto”. Defaults to None.

  • max_comb (int) – The maximum depth of the group combinations for which the statistics are generated. Defaults to 4.

Returns

Dataframe containing data on the first two central moments of the target distributions, by group.

Return type

pd.DataFrame
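As a rough picture of what "the first two central moments of the target distributions, by group" refers to, the following standard-library sketch computes a mean and a variance per group. It is not FairLens code: the helper group_moments, the sample rows, and the use of population variance are assumptions made for this example.

```python
from statistics import mean, pvariance

# Hypothetical records: one sensitive attribute ("sex") and a numeric target ("score").
rows = [
    {"sex": "F", "score": 4.0},
    {"sex": "F", "score": 6.0},
    {"sex": "M", "score": 1.0},
    {"sex": "M", "score": 3.0},
]


def group_moments(rows, attr, target):
    """Return {group value: (mean, population variance)} of the target."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attr], []).append(row[target])
    return {name: (mean(vals), pvariance(vals)) for name, vals in groups.items()}


moments = group_moments(rows, "sex", "score")
```

For a categorical target a mean and variance are not meaningful, which is where categorical_mode (for example “entropy”) comes in.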

demographic_report(metric='auto', method='dist_to_all', alpha=0.05, max_comb=4, min_count=100, max_rows=10, hide_positive=False)

Generate a report on the fairness of different groups of sensitive attributes.

Parameters
  • metric (str, optional) – Choose a custom metric to use. Defaults to automatically chosen metric depending on the distribution of the target variable. See

  • method (str, optional) – The method used to apply the metric to the sub-group. Can take values [“dist_to_all”, “dist_to_rest”] which correspond to measuring the distance between the subgroup distribution and the overall distribution, or the overall distribution without the subgroup, respectively. Defaults to “dist_to_all”.

  • alpha (Optional[float], optional) – The maximum p-value to accept a bias. Defaults to 0.05.

  • max_comb (Optional[int], optional) – Max number of combinations of sensitive attributes to be considered. If None all combinations are considered. Defaults to 4.

  • min_count (Optional[int], optional) – If set, sub-groups with fewer samples than min_count will be ignored. Defaults to 100.

  • max_rows (int, optional) – Maximum number of biased demographics to display. Defaults to 10.

  • hide_positive (bool, optional) – Hides positive distances if set to True. This may be useful when using metrics which can return negative distances (binomial distance), in order to inspect a skew in only one direction. Alternatively, changing the method may yield more significant results. Defaults to False.
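The difference between the two method values can be shown with a minimal sketch. A signed difference of means stands in for the metric here, purely as an assumption for illustration; the function subgroup_distance and the sample data are hypothetical and do not reflect how FairLens computes its distances.

```python
from statistics import mean


def subgroup_distance(values, in_group, method="dist_to_all"):
    """Signed mean difference between a sub-group and a reference population.

    "dist_to_all":  reference is the whole population (sub-group included).
    "dist_to_rest": reference is the population without the sub-group.
    """
    group = [v for v, flag in zip(values, in_group) if flag]
    if method == "dist_to_all":
        reference = list(values)
    elif method == "dist_to_rest":
        reference = [v for v, flag in zip(values, in_group) if not flag]
    else:
        raise ValueError(f"unknown method: {method!r}")
    return mean(group) - mean(reference)


# Binary target for six individuals; the first two belong to the sub-group.
values = [1, 1, 0, 0, 0, 0]
in_group = [True, True, False, False, False, False]

to_all = subgroup_distance(values, in_group, "dist_to_all")
to_rest = subgroup_distance(values, in_group, "dist_to_rest")
```

In this toy case the distance to the rest is larger than the distance to all, because under “dist_to_all” the sub-group also contributes to the reference distribution.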

distribution_score(metric='auto', method='dist_to_all', p_value=False, max_comb=None)

Returns a dataframe consisting of all unique sub-groups and their statistical distance to the rest of the population w.r.t. the target variable.

Parameters
  • metric (str, optional) – Choose a metric to use. Defaults to automatically chosen metric depending on the distribution of the target variable.

  • method (str, optional) – The method used to apply the metric to the sub-group. Can take values [“dist_to_all”, “dist_to_rest”] which correspond to measuring the distance between the subgroup distribution and the overall distribution, or the overall distribution without the subgroup, respectively. Defaults to “dist_to_all”.

  • p_value (bool, optional) – Whether or not to compute a p-value for the distances. Defaults to False.

  • max_comb (Optional[int], optional) – Max number of combinations of sensitive attributes to be considered. If None all combinations are considered. Defaults to None.

Return type

pandas.core.frame.DataFrame
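One way to read max_comb is as a cap on how many sensitive attributes are combined when forming sub-groups. The sketch below enumerates attribute combinations under that reading using itertools; the helper attribute_combinations and the attribute names are hypothetical, and the actual sub-group construction in FairLens may differ.

```python
from itertools import combinations


def attribute_combinations(sensitive_attrs, max_comb=None):
    """List every combination of attributes of size 1 up to max_comb.

    With max_comb=None, combinations of every size are returned.
    """
    limit = len(sensitive_attrs)
    if max_comb is not None:
        limit = min(max_comb, limit)
    return [
        combo
        for size in range(1, limit + 1)
        for combo in combinations(sensitive_attrs, size)
    ]


attrs = ["sex", "race", "age"]
up_to_pairs = attribute_combinations(attrs, max_comb=2)
everything = attribute_combinations(attrs)
```

Since the number of combinations grows quickly with the number of sensitive attributes, a small max_comb keeps the resulting report manageable.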

plot_distributions(figsize=None, max_width=3, max_quantiles=8, show_hist=None, show_curve=None, shade=True, normalize=False, cmap=None)

Plot the distributions of the target variable with respect to all sensitive values.

Parameters
  • figsize (Optional[Tuple[int, int]], optional) – The size of each figure. Defaults to (6, 4).

  • max_width (int, optional) – The maximum number of figures. Defaults to 3.

  • max_quantiles (int, optional) – The maximum number of quantiles to use for continuous data. Defaults to 8.

  • show_hist (Optional[bool], optional) – Shows the histogram if True. Defaults to True if the data is categorical or binary.

  • show_curve (Optional[bool], optional) – Shows a KDE if True. Defaults to True if the data is continuous or a date.

  • shade (bool, optional) – Shades the curve if True. Defaults to True.

  • normalize (bool, optional) – Normalizes the counts so the sum of the bar heights is 1. Defaults to False.

  • cmap (Optional[Sequence[Tuple[float, float, float]]], optional) – A sequence of RGB tuples used to colour the histograms. If None, seaborn’s default palette will be used. Defaults to None.
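The effect of normalize on histogram bars can be sketched without any plotting. The helper bar_heights below is hypothetical and only shows the arithmetic described for that parameter: raw counts by default, or counts rescaled so the bar heights sum to 1.

```python
from collections import Counter


def bar_heights(values, normalize=False):
    """Return the height of each histogram bar for categorical values."""
    counts = Counter(values)
    if not normalize:
        return dict(counts)
    total = sum(counts.values())
    # Divide each count by the total so the heights sum to 1.
    return {key: count / total for key, count in counts.items()}


sample = ["a", "a", "b", "c"]
raw = bar_heights(sample)
scaled = bar_heights(sample, normalize=True)
```

Normalized heights make groups of very different sizes comparable on the same axes, at the cost of hiding how many samples each group contains.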