fairlens.metrics.stat_distance

stat_distance(df, target_attr, group1, group2, mode='auto', p_value=False, **kwargs)

Computes the statistical distance between two probability distributions, i.e. those of group 1 and group 2, with respect to the target attribute. The distance metric is chosen through the mode parameter. If mode is set to “auto”, the metric best suited to the target attribute’s distribution is chosen. If group1 is a dictionary and group2 is None, the distance is computed between group1 and the rest of the dataset.
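To make "statistical distance between two groups" concrete, here is a minimal sketch of one possible metric, the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical CDFs). It is an illustration only: this page does not say which metric mode="auto" selects, and the function name ks_statistic is hypothetical rather than part of fairlens.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: max |ECDF_a(x) - ECDF_b(x)|.

    Hypothetical helper for illustration; not the fairlens implementation.
    """
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")

    # The difference between the two ECDFs can only change at observed values,
    # so it is enough to evaluate it at every distinct data point.
    points = sorted(set(a) | set(b))
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in points
    )


# Target-attribute values of two made-up groups.
identical = ks_statistic([1, 2, 3], [1, 2, 3])
disjoint = ks_statistic([1, 2, 3], [4, 5, 6])
overlapping = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
```

Identical samples give a distance of 0 and fully separated samples give 1; partially overlapping samples fall in between.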

Parameters
  • df (pd.DataFrame) – The input dataframe.

  • target_attr (str) – The target attribute in the dataframe.

  • group1 (Union[Mapping[str, List[Any]], pd.Series]) – The first group of interest. Each group can be a mapping / dict from an attribute to a list of values, or a predicate itself, i.e. a pandas series of bools that can be used to index a subgroup from the dataframe. Examples: {“Sex”: [“Male”]}, df[“Sex”] == “Female”

  • group2 (Union[Mapping[str, List[Any]], pd.Series]) – The second group of interest. Each group can be a mapping / dict from an attribute to a list of values, or a predicate itself, i.e. a pandas series of bools that can be used to index a subgroup from the dataframe. Examples: {“Sex”: [“Male”]}, df[“Sex”] == “Female”

  • mode (str) – Which distance metric to use. Can be the names of classes from fairlens.metrics, or their id() strings. If set to “auto”, the method automatically picks a suitable metric based on the distribution of the target attribute. Defaults to “auto”.

  • p_value (bool) – Whether to return a suitable p-value for the metric, if one exists. Defaults to False.

  • **kwargs – Keyword arguments for the distance metric. Passed to the __init__ function of distance metrics.
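
The two ways of specifying a group can be related with plain pandas. The sketch below assumes that a mapping selects the rows whose attribute value is in the given list, and that several attributes are combined with a logical AND; neither detail is confirmed on this page. The helper mapping_to_predicate and the toy data are hypothetical and not part of fairlens.

```python
import pandas as pd

# Toy data; the column names and values are made up for illustration.
df = pd.DataFrame(
    {
        "Sex": ["Male", "Female", "Female", "Male", "Other"],
        "RawScore": [3, 7, 5, 2, 6],
    }
)


def mapping_to_predicate(frame, group):
    """Hypothetical helper: turn {"attr": [values]} into a boolean Series.

    Assumes one isin() test per attribute, combined with a logical AND.
    """
    mask = pd.Series(True, index=frame.index)
    for attr, values in group.items():
        mask &= frame[attr].isin(values)
    return mask


# Mapping form, converted to the equivalent predicate form.
group1 = mapping_to_predicate(df, {"Sex": ["Male"]})

# "The rest of the dataset" is the complement of the group1 predicate.
rest = ~group1

# Target-attribute values whose distributions would be compared.
scores_group1 = df.loc[group1, "RawScore"]
scores_rest = df.loc[rest, "RawScore"]
```

Here group1 selects the same rows as the predicate df["Sex"] == "Male", and rest covers every other row.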

Returns

The distance as a float, and the p-value if p_value is set to True and can be computed.

Return type

Tuple[float, …]
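
Because the result may or may not include a p-value, calling code may want to normalise it before use. The following is a minimal sketch of one way to do that; split_result is a hypothetical helper, not part of fairlens, and it accepts either a bare float or a tuple because the examples on this page show both shapes.

```python
from typing import Optional, Tuple


def split_result(result) -> Tuple[float, Optional[float]]:
    """Hypothetical helper: normalise a result to (distance, p_value_or_None)."""
    if isinstance(result, tuple):
        distance = float(result[0])
        # The p-value is only present when it was requested and could be computed.
        p = float(result[1]) if len(result) > 1 else None
        return distance, p
    # A bare number carries the distance only.
    return float(result), None


# Values copied from the examples on this page.
distance_only = split_result(0.1133214633580949)
with_p_value = split_result((0.0816143577815524, 0.02693435054772131))
```

With this, callers can always unpack two values and check the p-value against None.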

Examples

>>> import pandas as pd
>>> from fairlens.metrics import stat_distance
>>> df = pd.read_csv("datasets/compas.csv")
>>> group1 = {"Ethnicity": ["African-American", "African-Am"]}
>>> group2 = {"Ethnicity": ["Caucasian"]}
>>> group3 = {"Ethnicity": ["Asian"]}
>>> stat_distance(df, "RawScore", group1, group2, mode="auto")
0.1133214633580949
>>> stat_distance(df, "RawScore", group3, group2, mode="auto", p_value=True)
(0.0816143577815524, 0.02693435054772131)