fairlens.FairnessScorer

class FairnessScorer(df, target_attr, sensitive_attrs=None, detect_sensitive=False, distr_type=None, sensitive_distr_types=None)[source]

Bases: object

This class analyzes a given DataFrame, looks for biases and quantifies fairness.

Methods

__init__

Fairness Scorer constructor

demographic_report

Generate a report on the fairness of different groups of sensitive attributes.

distribution_score

Returns a dataframe consisting of all unique sub-groups and their statistical distance to the rest of the population w.r.t. the target variable.

plot_distributions

Plot the distributions of the target variable with respect to all sensitive values.

__init__(df, target_attr, sensitive_attrs=None, detect_sensitive=False, distr_type=None, sensitive_distr_types=None)[source]

Fairness Scorer constructor

Parameters
  • df (pd.DataFrame) – Input DataFrame to be scored.

  • target_attr (str) – The target attribute name.

  • sensitive_attrs (Optional[Sequence[str]], optional) – The sensitive attribute names. Defaults to None.

  • detect_sensitive (bool, optional) – Whether to try to detect sensitive attributes from the column names. Defaults to False.

  • distr_type (Optional[str], optional) – The type of distribution of the target attribute. Can take values from [“categorical”, “continuous”, “binary”, “datetime”]. If None, the type of distribution is inferred based on the data in the column. Defaults to None.

  • sensitive_distr_types (Optional[Mapping[str, str]], optional) – The type of distribution of the sensitive attributes. Passed as a mapping from sensitive attribute name to corresponding distribution type. Can take values from [“categorical”, “continuous”, “binary”, “datetime”]. If None, the type of distribution of each sensitive attribute is inferred based on the data in the respective column. Defaults to None.
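As a rough illustration of the constructor arguments, the sketch below checks a sensitive_distr_types mapping against the four documented type names and then passes it to the constructor. The column names ("RawScore", "Sex", "DateOfBirth") and the helpers check_distr_types and build_scorer are hypothetical and not part of fairlens; only the constructor signature is taken from the text above.

```python
from typing import Mapping, Optional

# Distribution type names listed in the parameter descriptions above.
ALLOWED_DISTR_TYPES = ("categorical", "continuous", "binary", "datetime")


def check_distr_types(types: Optional[Mapping[str, str]]) -> dict:
    """Hypothetical helper: reject type names the documentation does not list."""
    checked = dict(types or {})
    for column, distr_type in checked.items():
        if distr_type not in ALLOWED_DISTR_TYPES:
            raise ValueError(f"{column!r}: unknown distribution type {distr_type!r}")
    return checked


def build_scorer(df):
    """Hypothetical wrapper; needs fairlens and a DataFrame with these columns."""
    import fairlens as fl  # imported lazily so the helper above works without it

    return fl.FairnessScorer(
        df,
        "RawScore",                              # target_attr (assumed column name)
        sensitive_attrs=["Sex", "DateOfBirth"],  # assumed column names
        distr_type="continuous",
        sensitive_distr_types=check_distr_types(
            {"Sex": "binary", "DateOfBirth": "datetime"}
        ),
    )
```

Leaving distr_type and sensitive_distr_types as None instead would let the scorer infer the types from the column data, as described above.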

demographic_report(metric='auto', method='dist_to_all', alpha=0.05, max_comb=4, min_count=100, max_rows=10, hide_positive=False)[source]

Generate a report on the fairness of different groups of sensitive attributes.

Parameters
  • metric (str, optional) – Choose a custom metric to use. Defaults to automatically chosen metric depending on the distribution of the target variable. See

  • method (str, optional) – The method used to apply the metric to the sub-group. Can take values [“dist_to_all”, “dist_to_rest”] which correspond to measuring the distance between the subgroup distribution and the overall distribution, or the overall distribution without the subgroup, respectively. Defaults to “dist_to_all”.

  • alpha (Optional[float], optional) – The maximum p-value to accept a bias. Defaults to 0.05.

  • max_comb (Optional[int], optional) – Max number of combinations of sensitive attributes to be considered. If None all combinations are considered. Defaults to 4.

  • min_count (Optional[int], optional) – If set, sub-groups with fewer samples than min_count will be ignored. Defaults to 100.

  • max_rows (int, optional) – Maximum number of biased demographics to display. Defaults to 10.

  • hide_positive (bool, optional) – Hides positive distances if set to True. This may be useful when using metrics which can return negative distances (binomial distance), in order to inspect a skew in only one direction. Alternatively, changing the method may yield more significant results. Defaults to False.
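The difference between the two method values can be seen on a toy binary target. The sketch below is only an illustration: rate_distances is a hypothetical helper, and the "distance" is a plain difference in positive rates standing in for whatever metric the scorer actually selects.

```python
from statistics import mean


def rate_distances(rows, group):
    """Return (dist_to_all, dist_to_rest) for one sub-group.

    rows is a list of (group_label, binary_outcome) pairs. The distance used
    here is a simple difference in positive rates, not a fairlens metric.
    """
    in_group = [y for g, y in rows if g == group]
    rest = [y for g, y in rows if g != group]
    everyone = [y for _, y in rows]
    # dist_to_all compares the sub-group with the whole population,
    # dist_to_rest compares it with the population minus the sub-group.
    return mean(in_group) - mean(everyone), mean(in_group) - mean(rest)


rows = [("A", 1), ("A", 1), ("A", 1), ("A", 0),
        ("B", 0), ("B", 0), ("B", 1), ("B", 0), ("B", 0), ("B", 0)]
to_all, to_rest = rate_distances(rows, "A")
print(round(to_all, 3), round(to_rest, 3))
```

Because the sub-group is itself part of the overall population, the distance to the rest is larger in magnitude here than the distance to all. For group "B" both values are negative, which is the kind of one-directional skew that hide_positive is meant to isolate.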

distribution_score(metric='auto', method='dist_to_all', p_value=False, max_comb=None)[source]

Returns a dataframe consisting of all unique sub-groups and their statistical distance to the rest of the population w.r.t. the target variable.

Parameters
  • metric (str, optional) – Choose a metric to use. Defaults to automatically chosen metric depending on the distribution of the target variable.

  • method (str, optional) – The method used to apply the metric to the sub-group. Can take values [“dist_to_all”, “dist_to_rest”] which correspond to measuring the distance between the subgroup distribution and the overall distribution, or the overall distribution without the subgroup, respectively. Defaults to “dist_to_all”.

  • p_value (bool, optional) – Whether or not to compute a p-value for the distances. Defaults to False.

  • max_comb (Optional[int], optional) – Max number of combinations of sensitive attributes to be considered. If None all combinations are considered. Defaults to None.

Return type

pandas.core.frame.DataFrame
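To give a feel for how max_comb bounds the work, the sketch below enumerates attribute combinations under the assumption that max_comb caps how many sensitive attributes are combined into one sub-group definition. The helper attribute_combinations and the attribute names are hypothetical, and the real scorer may enumerate sub-groups differently.

```python
from itertools import combinations
from typing import List, Optional, Sequence, Tuple


def attribute_combinations(
    attrs: Sequence[str], max_comb: Optional[int] = None
) -> List[Tuple[str, ...]]:
    """List attribute combinations of size 1 up to max_comb (all sizes if None)."""
    limit = len(attrs) if max_comb is None else min(max_comb, len(attrs))
    combos: List[Tuple[str, ...]] = []
    for size in range(1, limit + 1):
        combos.extend(combinations(attrs, size))
    return combos


attrs = ["Sex", "Ethnicity", "Age"]  # assumed sensitive attribute names
print(len(attribute_combinations(attrs, max_comb=2)))  # singles and pairs only
print(len(attribute_combinations(attrs)))              # every non-empty subset
```

The number of combinations grows quickly with the number of sensitive attributes, so a small max_comb keeps the resulting dataframe manageable.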

plot_distributions(figsize=None, max_width=3, max_quantiles=8, show_hist=None, show_curve=None, shade=True, normalize=False, cmap=None)[source]

Plot the distributions of the target variable with respect to all sensitive values.

Parameters
  • figsize (Optional[Tuple[int, int]], optional) – The size of each figure. Defaults to (6, 4).

  • max_width (int, optional) – The maximum number of figures. Defaults to 3.

  • max_quantiles (int, optional) – The maximum number of quantiles to use for continuous data. Defaults to 8.

  • show_hist (Optional[bool], optional) – Shows the histogram if True. Defaults to True if the data is categorical or binary.

  • show_curve (Optional[bool], optional) – Shows a KDE if True. Defaults to True if the data is continuous or a date.

  • shade (bool, optional) – Shades the curve if True. Defaults to True.

  • normalize (bool, optional) – Normalizes the counts so the sum of the bar heights is 1. Defaults to False.

  • cmap (Optional[Sequence[Tuple[float, float, float]]], optional) – A sequence of RGB tuples used to colour the histograms. If None, seaborn’s default palette will be used. Defaults to None.
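The effect of normalize can be sketched without any plotting. The helper bar_heights below is hypothetical and only mirrors the description above (counts rescaled so that the bar heights sum to 1); it is not the library's plotting code.

```python
from collections import Counter


def bar_heights(values, normalize=False):
    """Return the height of each histogram bar, optionally rescaled to sum to 1."""
    counts = Counter(values)
    if not normalize:
        return dict(counts)
    total = sum(counts.values())
    if total == 0:
        return {}  # nothing to normalize for an empty input
    return {label: count / total for label, count in counts.items()}


values = ["low", "low", "medium", "high"]
print(bar_heights(values))                  # raw counts per category
print(bar_heights(values, normalize=True))  # proportions per category
```

Normalized heights make sub-groups of very different sizes comparable on the same axes, since each histogram then shows proportions rather than raw counts.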