COMPAS Recidivism#

The COMPAS dataset consists of the results of a commercial algorithm called COMPAS (Correctional Offender Management Profiling for Alternative Sanctions), used to assess a convicted criminal’s likelihood of reoffending. COMPAS has been used by judges and parole officers and is widely known for its bias against African-Americans.

In this notebook, we use fairlens to explore the COMPAS dataset for bias with respect to legally protected attributes. We then show that similar biases appear in a logistic regression model trained on the dataset to predict a person's risk of reoffending. [1]

[1]:

# Import libraries
import numpy as np
import pandas as pd
import fairlens as fl
import matplotlib.pyplot as plt
from itertools import combinations, chain
from sklearn.linear_model import LogisticRegression

# Load in the 2 year COMPAS Recidivism dataset
df = pd.read_csv("https://raw.githubusercontent.com/propublica/compas-analysis/master/compas-scores-two-years.csv")
df

[1]:

	id	name	first	last	compas_screening_date	sex	dob	age	age_cat	race	...	v_decile_score	v_score_text	v_screening_date	in_custody	out_custody	priors_count.1	start	end	event	two_year_recid
0	1	miguel hernandez	miguel	hernandez	2013-08-14	Male	1947-04-18	69	Greater than 45	Other	...	1	Low	2013-08-14	2014-07-07	2014-07-14	0	0	327	0	0
1	3	kevon dixon	kevon	dixon	2013-01-27	Male	1982-01-22	34	25 - 45	African-American	...	1	Low	2013-01-27	2013-01-26	2013-02-05	0	9	159	1	1
2	4	ed philo	ed	philo	2013-04-14	Male	1991-05-14	24	Less than 25	African-American	...	3	Low	2013-04-14	2013-06-16	2013-06-16	4	0	63	0	1
3	5	marcu brown	marcu	brown	2013-01-13	Male	1993-01-21	23	Less than 25	African-American	...	6	Medium	2013-01-13	NaN	NaN	1	0	1174	0	0
4	6	bouthy pierrelouis	bouthy	pierrelouis	2013-03-26	Male	1973-01-22	43	25 - 45	Other	...	1	Low	2013-03-26	NaN	NaN	2	0	1102	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
7209	10996	steven butler	steven	butler	2013-11-23	Male	1992-07-17	23	Less than 25	African-American	...	5	Medium	2013-11-23	2013-11-22	2013-11-24	0	1	860	0	0
7210	10997	malcolm simmons	malcolm	simmons	2014-02-01	Male	1993-03-25	23	Less than 25	African-American	...	5	Medium	2014-02-01	2014-01-31	2014-02-02	0	1	790	0	0
7211	10999	winston gregory	winston	gregory	2014-01-14	Male	1958-10-01	57	Greater than 45	Other	...	1	Low	2014-01-14	2014-01-13	2014-01-14	0	0	808	0	0
7212	11000	farrah jean	farrah	jean	2014-03-09	Female	1982-11-17	33	25 - 45	African-American	...	2	Low	2014-03-09	2014-03-08	2014-03-09	3	0	754	0	0
7213	11001	florencia sanmartin	florencia	sanmartin	2014-06-30	Female	1992-12-18	23	Less than 25	Hispanic	...	4	Low	2014-06-30	2015-03-15	2015-03-15	2	0	258	0	1

7214 rows × 53 columns

ProPublica's analysis suggests that in certain cases the recorded charge may not be the one that prompted the COMPAS screening [1]. We drop these rows, along with others that are not usable.

[2]:

df = df[(df["days_b_screening_arrest"] <= 30)
        & (df["days_b_screening_arrest"] >= -30)
        & (df["is_recid"] != -1)
        & (df["c_charge_degree"] != 'O')
        & (df["score_text"] != 'N/A')].reset_index(drop=True)

df

[2]:

	id	name	first	last	compas_screening_date	sex	dob	age	age_cat	race	...	v_decile_score	v_score_text	v_screening_date	in_custody	out_custody	priors_count.1	start	end	event	two_year_recid
0	1	miguel hernandez	miguel	hernandez	2013-08-14	Male	1947-04-18	69	Greater than 45	Other	...	1	Low	2013-08-14	2014-07-07	2014-07-14	0	0	327	0	0
1	3	kevon dixon	kevon	dixon	2013-01-27	Male	1982-01-22	34	25 - 45	African-American	...	1	Low	2013-01-27	2013-01-26	2013-02-05	0	9	159	1	1
2	4	ed philo	ed	philo	2013-04-14	Male	1991-05-14	24	Less than 25	African-American	...	3	Low	2013-04-14	2013-06-16	2013-06-16	4	0	63	0	1
3	7	marsha miles	marsha	miles	2013-11-30	Male	1971-08-22	44	25 - 45	Other	...	1	Low	2013-11-30	2013-11-30	2013-12-01	0	1	853	0	0
4	8	edward riddle	edward	riddle	2014-02-19	Male	1974-07-23	41	25 - 45	Caucasian	...	2	Low	2014-02-19	2014-03-31	2014-04-18	14	5	40	1	1
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
6167	10996	steven butler	steven	butler	2013-11-23	Male	1992-07-17	23	Less than 25	African-American	...	5	Medium	2013-11-23	2013-11-22	2013-11-24	0	1	860	0	0
6168	10997	malcolm simmons	malcolm	simmons	2014-02-01	Male	1993-03-25	23	Less than 25	African-American	...	5	Medium	2014-02-01	2014-01-31	2014-02-02	0	1	790	0	0
6169	10999	winston gregory	winston	gregory	2014-01-14	Male	1958-10-01	57	Greater than 45	Other	...	1	Low	2014-01-14	2014-01-13	2014-01-14	0	0	808	0	0
6170	11000	farrah jean	farrah	jean	2014-03-09	Female	1982-11-17	33	25 - 45	African-American	...	2	Low	2014-03-09	2014-03-08	2014-03-09	3	0	754	0	0
6171	11001	florencia sanmartin	florencia	sanmartin	2014-06-30	Female	1992-12-18	23	Less than 25	Hispanic	...	4	Low	2014-06-30	2015-03-15	2015-03-15	2	0	258	0	1

6172 rows × 53 columns
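To see what each condition in the filter removes, the same mask can be applied to a handful of made-up rows. This is only a sketch: the toy values below are invented for illustration, and `between` is used as an equivalent way of writing the two day-range comparisons.

```python
import pandas as pd

# Toy rows using the column names from the filter above (all values are made up)
toy = pd.DataFrame({
    "days_b_screening_arrest": [0, 45, -10, -31, 5],
    "is_recid": [0, 1, -1, 0, 1],
    "c_charge_degree": ["F", "M", "F", "F", "O"],
    "score_text": ["Low", "High", "Medium", "Low", "N/A"],
})

# between() is inclusive on both ends by default, matching <= 30 and >= -30
keep = (
    toy["days_b_screening_arrest"].between(-30, 30)
    & (toy["is_recid"] != -1)
    & (toy["c_charge_degree"] != "O")
    & (toy["score_text"] != "N/A")
)

# Only the first row passes every condition
print(toy[keep].reset_index(drop=True))
```

Rows with a missing `days_b_screening_arrest` also fail the range check, so they are dropped as well.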

Analysis#

We’ll begin by identifying the legally protected attributes in the data. fairlens detects these by fuzzy matching the column names and values against a custom preset of expected values.

[3]:

# Detect sensitive attributes
sensitive_attributes = fl.sensitive.detect_names_df(df, deep_search=True)
print(sensitive_attributes)
print(sensitive_attributes.keys())

{'sex': 'Gender', 'dob': 'Age', 'age': 'Age', 'race': 'Ethnicity'}
dict_keys(['sex', 'dob', 'age', 'race'])

We can see that the attributes that we should be concerned about correspond to gender, age and ethnicity.
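The idea of fuzzy matching column names against a preset can be sketched with the standard library. This is not fairlens's implementation: the `PRESET` dictionary, the `detect_sensitive` helper and the 0.8 threshold are all assumptions made for illustration, and only column names (not values) are matched.

```python
from difflib import SequenceMatcher

# Hypothetical preset: category -> names that are expected to indicate it
PRESET = {
    "Gender": ["gender", "sex"],
    "Age": ["age", "dob", "date of birth", "birthdate"],
    "Ethnicity": ["ethnicity", "race"],
}

def detect_sensitive(columns, threshold=0.8):
    """Map each column name to the best-matching preset category, if any."""
    detected = {}
    for col in columns:
        best_cat, best_score = None, 0.0
        for cat, names in PRESET.items():
            for name in names:
                # ratio() is 1.0 for identical strings and 0.0 for no overlap
                score = SequenceMatcher(None, col.lower(), name).ratio()
                if score > best_score:
                    best_cat, best_score = cat, score
        if best_score >= threshold:
            detected[col] = best_cat
    return detected

result = detect_sensitive(["sex", "dob", "age", "race", "priors_count", "age_cat"])
print(result)
```

With this threshold, a name such as `age_cat` is not close enough to `age` to be flagged by name alone, which is one reason a search over column values can be useful.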

[4]:

df[["sex", "race", "age", "dob", "age_cat", "decile_score"]].head()

[4]:

	sex	race	age	dob	age_cat	decile_score
0	Male	Other	69	1947-04-18	Greater than 45	1
1	Male	African-American	34	1982-01-22	25 - 45	3
2	Male	African-American	24	1991-05-14	Less than 25	4
3	Male	Other	44	1971-08-22	25 - 45	1
4	Male	Caucasian	41	1974-07-23	25 - 45	6

fairlens will discretize continuous sensitive attributes such as age to make the results more interpretable, i.e. “Greater than 45”, “25 - 45”, “Less than 25” in the case of age. The COMPAS dataset comes with a categorical column for age which we can use instead.
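As a rough illustration of this kind of discretization, the rule below reproduces the `age_cat` labels for the rows shown above. The exact boundary handling (whether 25 and 45 fall in the middle bin) is an assumption, and `age_category` is a hypothetical helper rather than part of fairlens.

```python
def age_category(age):
    """Bin an age in years into the three labels used by the age_cat column."""
    if age < 25:
        return "Less than 25"
    if age <= 45:  # assumed: 45 belongs to the middle bin
        return "25 - 45"
    return "Greater than 45"

# Ages taken from the first rows of the filtered dataset
ages = [69, 34, 24, 44, 41]
labels = [age_category(a) for a in ages]
print(list(zip(ages, labels)))
```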

We can inspect potential biases in decile scores by visualizing the distributions of different sensitive sub-groups in the data. Methods in fairlens.plot can be used to generate plots of distributions of variables in different sub-groups in the data.

[5]:

target_attribute = "decile_score"
sensitive_attributes = ["sex", "race", "age", "dob", "age_cat"]

# Set the seaborn style
fl.plot.use_style()

# Plot the distributions
fl.plot.mult_distr_plot(df, target_attribute, sensitive_attributes)
plt.show()

The largest horizontal disparity in scores seems to be in race, specifically between African-Americans and Caucasians, who make up most of the sample. We can visualize or measure the distance between two arbitrary sub-groups using predicates as shown below.

[6]:

# Plot the distributions of decile scores in subgroups made of African-Americans and Caucasians
group1 = {"race": ["African-American"]}
group2 = {"race": ["Caucasian"]}

fl.plot.distr_plot(df, "decile_score", [group1, group2])
plt.legend(["African-American", "Caucasian"])
plt.show()

[7]:

group1 = {"race": ["African-American"]}
group2 = df["race"] != "African-American"

fl.plot.distr_plot(df, "decile_score", [group1, group2])
plt.legend(["African-American", "Rest of Population"])
plt.show()

The above disparity can be quantified by measuring statistical distances between the two distributions. Since the decile scores are categorical, metrics such as the Earth Mover’s Distance, the LP-Norm, or the Hellinger Distance would be useful. fairlens.metrics provides a stat_distance method which can be used to compute these metrics.

[8]:

import fairlens.metrics as fm

group1 = {"race": ["African-American"]}
group2 = {"race": ["Caucasian"]}

distances = {}
for metric in ["emd", "norm", "hellinger"]:
    distances[metric] = fm.stat_distance(df, "decile_score", group1, group2, mode=metric)


pd.DataFrame.from_dict(distances, orient="index", columns=["distance"])

[8]:

	distance
emd	0.245107
norm	0.210866
hellinger	0.214566
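For intuition, the three distances can be written out by hand for two samples of decile scores. This is a minimal sketch using textbook definitions on made-up scores; fairlens may normalize or bin differently, so the numbers are not expected to match the table above. The helper names are assumptions.

```python
from collections import Counter
from math import sqrt

def distribution(values, support):
    """Relative frequency of each value in `support`."""
    counts = Counter(values)
    return [counts[v] / len(values) for v in support]

def emd(p, q):
    """Earth Mover's Distance on an ordered support with unit spacing:
    the sum of absolute differences between the two cumulative distributions."""
    total = cum_p = cum_q = 0.0
    for pi, qi in zip(p, q):
        cum_p += pi
        cum_q += qi
        total += abs(cum_p - cum_q)
    return total

def l2_norm(p, q):
    """Euclidean (L2) norm of the difference between the two distributions."""
    return sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def hellinger(p, q):
    """Hellinger distance, which lies between 0 and 1."""
    return sqrt(sum((sqrt(pi) - sqrt(qi)) ** 2 for pi, qi in zip(p, q))) / sqrt(2)

support = range(1, 11)                        # decile scores 1..10
scores_a = [1, 2, 2, 3, 5, 7, 8, 9, 10, 10]   # made-up sample
scores_b = [1, 1, 1, 2, 2, 3, 3, 4, 5, 6]     # made-up sample
p = distribution(scores_a, support)
q = distribution(scores_b, support)

for name, fn in [("emd", emd), ("norm", l2_norm), ("hellinger", hellinger)]:
    print(name, round(fn(p, q), 4))
```

The Earth Mover's Distance takes the ordering of the deciles into account, whereas the norm and the Hellinger distance compare the two distributions bin by bin.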

Measuring the statistical distance between the distribution of a variable in a subgroup and in the entire dataset can indicate how biased the variable is with respect to the subgroup. We can use the fl.FairnessScorer class to compute this for each sub-group.

[9]:

fscorer = fl.FairnessScorer(df, "decile_score", ["sex", "race", "age_cat"])
fscorer.distribution_score(max_comb=1, p_value=True)

[9]:

	Group	Distance	Proportion	Counts	P-Value
0	Greater than 45	0.322188	0.209494	1293	0.54
1	25 - 45	0.042123	0.572262	3532	0.98
2	Less than 25	0.272500	0.218244	1347	0.53
3	Other	0.253175	0.055574	343	0.69
4	African-American	0.130340	0.514420	3175	0.54
5	Caucasian	0.115572	0.340732	2103	0.64
6	Hispanic	0.184277	0.082469	509	0.67
7	Asian	0.331973	0.005023	31	0.86
8	Native American	0.492532	0.001782	11	0.90
9	Male	0.016210	0.809624	4997	0.97
10	Female	0.068942	0.190376	1175	0.62

The method fl.FairnessScorer.distribution_score() makes use of suitable hypothesis tests to determine how different the distribution of the decile scores is in each sensitive subgroup.
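One generic way to attach a p-value to a difference between a subgroup and the rest of the data is a permutation test. The sketch below is an assumption about the general approach, not a description of the test fairlens actually runs; the statistic (absolute difference in means), the helper name and the toy scores are chosen only for illustration.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def permutation_p_value(group, rest, n_perm=999, seed=0):
    """Estimate how often a random split of the pooled data gives a mean
    difference at least as large as the observed one."""
    rng = random.Random(seed)
    observed = abs(mean(group) - mean(rest))
    pooled = list(group) + list(rest)
    k = len(group)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(mean(pooled[:k]) - mean(pooled[k:])) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero
    return (hits + 1) / (n_perm + 1)

# Made-up scores: two clearly separated groups and two identical ones
low = [1, 2] * 50
high = [9, 10] * 50
same = [1, 2, 3] * 20

p_different = permutation_p_value(low, high)
p_same = permutation_p_value(same, same)
print(p_different, p_same)
```

A small p-value indicates that the observed difference would be unlikely if group membership were unrelated to the score.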

Training a Model#

The analysis above indicates that there are biases present in the COMPAS dataset. We now show the result of training a model on the COMPAS dataset and using it to predict an unseen individual’s likelihood of reoffending.

We use a logistic regressor trained on a subset of the features.

[10]:

# Select the features to use
df = df[["sex", "race", "age_cat", "c_charge_degree", "priors_count", "two_year_recid", "score_text"]]

# Split the dataset into train and test
sp = int(len(df) * 0.8)
df_train = df[:sp].reset_index(drop=True)
df_test = df[sp:].reset_index(drop=True)

# Convert categorical columns to numerical columns
def preprocess(df):
    X = df.copy()
    X["sex"] = pd.factorize(df["sex"])[0]
    X["race"] = pd.factorize(df["race"])[0]
    # Map the ordered age categories to integers (avoids chained in-place replace)
    X["age_cat"] = X["age_cat"].map({"Greater than 45": 2, "25 - 45": 1, "Less than 25": 0})
    X["c_charge_degree"] = pd.factorize(df["c_charge_degree"])[0]
    X.drop(columns=["score_text"], inplace=True)
    X = X.to_numpy()

    # Binary target: 1 for a medium or high COMPAS score, 0 for a low score
    y = (df["score_text"] != "Low").astype(int).to_numpy()

    return X, y

# Train a logistic regression classifier
X, y = preprocess(df_train)
clf = LogisticRegression(random_state=0).fit(X, y)

# Predict the probability of a non-low score for the training data
df_train["pred"] = clf.predict_proba(X)[:, 1]

# Plot the distributions
fscorer = fl.FairnessScorer(df_train, "pred", ["race", "sex", "age_cat"])
fscorer.plot_distributions()
plt.show()
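One caveat about encoding with `pd.factorize`: codes are assigned in order of first appearance, so encoding the train and test splits separately can give the same category different codes. The sketch below uses made-up values to show the effect and one possible alternative based on a fixed category list; the category order chosen here is an assumption.

```python
import pandas as pd

# Made-up splits in which the categories first appear in a different order
train_sex = pd.Series(["Male", "Female", "Male"])
test_sex = pd.Series(["Female", "Male", "Female"])

# factorize numbers categories by first appearance within each split
train_codes = pd.factorize(train_sex)[0]   # Male -> 0, Female -> 1
test_codes = pd.factorize(test_sex)[0]     # Female -> 0, Male -> 1

# A fixed category list gives the same code to the same value in both splits
categories = ["Male", "Female"]
fixed_train = pd.Categorical(train_sex, categories=categories).codes
fixed_test = pd.Categorical(test_sex, categories=categories).codes

print(train_codes.tolist(), test_codes.tolist())
print(fixed_train.tolist(), fixed_test.tolist())
```

With the fixed list, a value is encoded identically wherever it occurs, so a model fitted on one split sees consistent inputs on the other.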

[11]:

X = df.copy()
X["sex"] = pd.factorize(df["sex"])[0]
X["race"] = pd.factorize(df["race"])[0]
X["age_cat"] = X["age_cat"].map({"Greater than 45": 2, "25 - 45": 1, "Less than 25": 0})
X["c_charge_degree"] = pd.factorize(df["c_charge_degree"])[0]
X.drop(columns=["score_text"], inplace=True)

X.corr()

[11]:

	sex	race	age_cat	c_charge_degree	priors_count	two_year_recid
sex	1.000000	0.025568	0.002701	0.061848	-0.118722	-0.100911
race	0.025568	1.000000	0.104551	0.078824	-0.123963	-0.088536
age_cat	0.002701	0.104551	1.000000	0.097368	0.169822	-0.156930
c_charge_degree	0.061848	0.078824	0.097368	1.000000	-0.145433	-0.120332
priors_count	-0.118722	-0.123963	0.169822	-0.145433	1.000000	0.290607
two_year_recid	-0.100911	-0.088536	-0.156930	-0.120332	0.290607	1.000000

The distributions of the training predictions plotted earlier are similar to the distributions of the decile scores in the data. Let’s see the results on the held-out test set.

[12]:

# Classify the test data
X, _ = preprocess(df_test)
df_test["pred"] = clf.predict_proba(X)[:, 1]

# Plot the distributions
fscorer = fl.FairnessScorer(df_test, "pred", ["race", "sex", "age_cat"])
fscorer.plot_distributions()
plt.show()

[13]:

fscorer.distribution_score(max_comb=1, p_value=True).sort_values("Distance", ascending=False).reset_index(drop=True)

[13]:

	Group	Distance	Proportion	Counts	P-Value
0	Native American	0.513360	0.002429	3	3.046032e-01
1	Greater than 45	0.495185	0.204858	253	7.771561e-16
2	Less than 25	0.456088	0.229150	283	3.005322e-44
3	Other	0.333459	0.059109	73	2.491168e-07
4	Asian	0.238664	0.003239	4	9.366481e-01
5	Caucasian	0.200244	0.324696	401	4.152367e-11
6	Female	0.175752	0.208907	258	3.079097e-06
7	African-American	0.166902	0.522267	645	8.779233e-11
8	25 - 45	0.117271	0.565992	699	8.242076e-06
9	Hispanic	0.082636	0.088259	109	4.743583e-01
10	Male	0.046412	0.791093	977	1.830865e-01

Let’s try training a model after dropping “race” from the features and look at the results.

[14]:

# Drop the predicted column before training again
df_train.drop(columns=["pred"], inplace=True)

# Preprocess the data and drop race
X, y = preprocess(df_train)
X = np.delete(X, df_train.columns.get_loc("race"), axis=1)

# Train a regressor and classify the training data
clf = LogisticRegression(random_state=0).fit(X, y)
df_train["pred"] = clf.predict_proba(X)[:, 1]

# Plot the distributions
fscorer = fl.FairnessScorer(df_train, "pred", ["race", "sex", "age_cat"])
fscorer.plot_distributions()
plt.show()

[15]:

# Drop the predicted column before training again
df_test.drop(columns=["pred"], inplace=True)

# Classify the test data
X, _ = preprocess(df_test)
X = np.delete(X, df_test.columns.get_loc("race"), axis=1)

df_test["pred"] = clf.predict_proba(X)[:, 1]

# Plot the distributions
fscorer = fl.FairnessScorer(df_test, "pred", ["race", "sex", "age_cat"])
fscorer.plot_distributions()
plt.show()

[16]:

fscorer.distribution_score(max_comb=1, p_value=True).sort_values("Distance", ascending=False).reset_index(drop=True)

[16]:

	Group	Distance	Proportion	Counts	P-Value
0	Native American	0.672065	0.002429	3	7.123637e-02
1	Greater than 45	0.503900	0.204858	253	7.771561e-16
2	Less than 25	0.488846	0.229150	283	3.438215e-51
3	Asian	0.243927	0.003239	4	9.240617e-01
4	Other	0.211059	0.059109	73	3.564472e-03
5	Female	0.173141	0.208907	258	4.565218e-06
6	Caucasian	0.160895	0.324696	401	2.577566e-07
7	African-American	0.128776	0.522267	645	1.368363e-06
8	25 - 45	0.120753	0.565992	699	3.908314e-06
9	Hispanic	0.097017	0.088259	109	2.824348e-01
10	Male	0.045722	0.791093	977	1.962996e-01

We can see that despite dropping the attribute “race”, the biases toward people in different racial groups are relatively unaffected.
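A common explanation for this kind of result is that other features act as proxies for the dropped attribute. The simulation below is a sketch on synthetic data, with all numbers invented: a score that only uses a proxy feature still differs between the two groups, even though group membership is never given to the "model".

```python
import random
from statistics import mean

rng = random.Random(0)

# Synthetic population: a binary protected attribute and a proxy feature
# whose distribution depends on it (means 3.0 vs 1.0 are made up)
people = []
for _ in range(2000):
    in_group = rng.random() < 0.5
    proxy = rng.gauss(3.0 if in_group else 1.0, 1.0)
    people.append((in_group, proxy))

# A "model" that never sees the protected attribute: its score is the proxy
score_in = mean(proxy for in_group, proxy in people if in_group)
score_out = mean(proxy for in_group, proxy in people if not in_group)
gap = score_in - score_out

print(round(score_in, 2), round(score_out, 2), round(gap, 2))
```

Whether this mechanism accounts for the COMPAS results above would need to be checked on the data itself, for example by measuring how strongly the remaining features are associated with race.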

References#

[1] Jeff Larson, Surya Mattu, Lauren Kirchner, and Julia Angwin. How we analyzed the COMPAS recidivism algorithm. ProPublica, 2016. URL: https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm.
