Dataism: Assigning a 0-100 Percentile Score to Police Depts.
In this exercise, we walk through how to rank police departments on a scale from 0-100 based on the fraction of the population in their jurisdiction that is arrested for misdemeanors.
The best department, the one with the lowest frequency of misdemeanor arrests, gets a score of 100.
This can be interpreted as a measure of the extent to which a department engages in 'broken windows' policing, i.e. targeting minor crimes like vandalism and public drinking. The theory is that this creates an atmosphere of order and lawfulness, thereby preventing more serious crimes. However, evidence shows that this kind of policing reinforces systemic racial bias in the criminal justice system. (See 'Weapons of Math Destruction' for a breakdown of this topic.)
Assignment: Read through the following steps slowly. The code is all written, so you just have to run it and see what it does. If you want to code something yourself, try the challenge exercise at the end.
Important: You can use the same methods in this exercise to create 0-100 scores for all sorts of things. For example, an activist group might want to
Grade countries on their ecological footprints, or
Score companies on gender/racial bias in hiring practices...
At least while society is easily convinced by such metrics, these could be effective ways to draw attention ;)
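The percentile idea used in this exercise carries over directly to metrics like these. Here is a minimal sketch with invented ecological-footprint numbers (all country names and values are made up), where lower is better and the best entry scores 100:

```python
# Made-up footprint numbers: lower footprint = better, best gets a score of 100.
footprints = {'Countria': 2.1, 'Exampleland': 8.4, 'Testonia': 4.7}

# Order from best (smallest) to worst (largest).
ranked = sorted(footprints, key=footprints.get)

n = len(ranked)
scores = {country: (1 - i / n) * 100 for i, country in enumerate(ranked)}
print(scores)
```

With three entries, the best scores 100, the middle one (1 - 1/3) × 100 ≈ 66.7, and the worst (1 - 2/3) × 100 ≈ 33.3.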
1. Import Python Libraries
First we import the Python libraries we need. The 'as pd' and 'as np' allow us to write 'pd' and 'np' in our code instead of having to write 'pandas' and 'numpy' all the time. (Remember, you can run a cell by highlighting it and pressing 'shift' + 'enter')
!pip install xlrd
import pandas as pd
import numpy as np
import xlrd
2. Import the Police Scorecard Data
Now we import the data from its URL location on the course github.
Note: You can't just use the URL that you get when you click on the filename in Github. You have to right-click on 'View Raw' and select 'Copy Link Location.'
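If you only have the regular 'blob' URL from your browser's address bar, you can usually convert it to the raw form with two string replacements. A sketch (the URL below is a made-up placeholder, not the actual course file):

```python
# Hypothetical GitHub file URL, not the real course data file.
blob_url = 'https://github.com/example-org/example-course/blob/main/data.xlsx'

# Raw URLs live on raw.githubusercontent.com and drop the '/blob/' segment.
raw_url = blob_url.replace('github.com', 'raw.githubusercontent.com').replace('/blob/', '/')
print(raw_url)
```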
## Here we define a variable named 'data_url' (we can call it whatever we want)
## and set it equal to the URL of the data file.
data_url = 'Police Scorecard Source Data.xlsx'  # paste the raw-link URL you copied here
## Now we use the pandas read_excel() function to import the data using the URL.
## You can also pass it the location of a data file directly, but this is easiest for the exercise.
police_scorecard_data = pd.read_excel(data_url)
## We could also have called 'police_scorecard_data' whatever we want.
## Formally, the police_scorecard_data variable now refers to a 'dataframe' object.
## Print the first five rows so you can see how it looks.
police_scorecard_data.head(5)
3. Extracting the data about misdemeanor arrests and population.
The above data table has many more columns than we need. We are only interested in
Agency Name
Total Population of Jurisdiction
Total Misdemeanor Arrests (MACR 2016)
To clean things up, we create a new dataframe and name it misdemeanor_data
, containing only what we're interested in.
## Use the filter function to select only the columns we care about and group
## them into a new dataframe.
misdemeanor_data = police_scorecard_data.filter(['Agency Name', 'Total Population of Jurisdiction', 'Total Misdemeanor Arrests (MACR 2016)'], axis =1)
misdemeanor_data.head(5)
| Agency Name | Total Population of Jurisdiction | Total Misdemeanor Arrests (MACR 2016) |
---|---|---|---|
0 | Alameda Police Department | 79654 | 894.0 |
1 | Alhambra Police Department | 86475 | 642.0 |
2 | Anaheim Police Department | 354891 | 5803.0 |
3 | Antioch Police Department | 112630 | 2235.0 |
4 | Bakersfield Police Department | 379741 | 17643.0 |
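`filter()` is just one way to select columns; plain bracket indexing with a list of labels does the same thing. A sketch on a toy dataframe (the data here is made up):

```python
import pandas as pd

# Toy dataframe with one column we don't want.
toy = pd.DataFrame({'Agency Name': ['A PD', 'B PD'],
                    'Total Population of Jurisdiction': [1000, 2000],
                    'Extra Column': [1, 2]})

# Bracket indexing with a list of labels keeps only those columns.
subset = toy[['Agency Name', 'Total Population of Jurisdiction']]
print(list(subset.columns))
```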
4. Reindex the rows by column name
Currently the rows in our data table are indexed by the numbers 0, 1, 2 etc. These numbers aren't really connected to the data, and it's more convenient for us to index the rows by Agency Name.
misdemeanor_data = misdemeanor_data.set_index('Agency Name')
misdemeanor_data.head(5)
Agency Name | Total Population of Jurisdiction | Total Misdemeanor Arrests (MACR 2016) |
---|---|---|
Alameda Police Department | 79654 | 894.0 |
Alhambra Police Department | 86475 | 642.0 |
Anaheim Police Department | 354891 | 5803.0 |
Antioch Police Department | 112630 | 2235.0 |
Bakersfield Police Department | 379741 | 17643.0 |
5. Compute the frequency of misdemeanor arrests
Here we compute the percentage of the population in a jurisdiction that is arrested for a misdemeanor. This is given by

% Misdemeanor Arrests = 100 × (Total Misdemeanor Arrests / Total Population of Jurisdiction)

For example, if Exampleton Police Department arrested 100 people for misdemeanors and has a population of 10,000, then % Misdemeanor Arrests = 100/10,000 = 1%.
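The Exampleton arithmetic as a tiny helper function (just a sketch to check the formula, not part of the scorecard code):

```python
def percent_misdemeanor_arrests(arrests, population):
    """Percentage of the jurisdiction's population arrested for misdemeanors."""
    return 100 * arrests / population

# Exampleton: 100 arrests out of a population of 10,000 is 1.0 (%).
print(percent_misdemeanor_arrests(100, 10_000))
```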
## First we extract the column containing the populations.
population = misdemeanor_data['Total Population of Jurisdiction']
population.head(5)
Alameda Police Department | 79654 |
---|---|
Alhambra Police Department | 86475 |
Anaheim Police Department | 354891 |
Antioch Police Department | 112630 |
Bakersfield Police Department | 379741 |
## Then the column containing the misdemeanor arrests.
misdemeanor_arrests = misdemeanor_data['Total Misdemeanor Arrests (MACR 2016)']
misdemeanor_arrests.head(5)
Alameda Police Department | 894.0 |
---|---|
Alhambra Police Department | 642.0 |
Anaheim Police Department | 5803.0 |
Antioch Police Department | 2235.0 |
Bakersfield Police Department | 17643.0 |
## Now we add a new column to misdemeanor_data and set it equal to
## misdemeanor_arrests/population as in the equation on the start of this
## section. We multiply this by 100, to convert the result into a percentage.
misdemeanor_data['% Misdemeanor Arrests'] = 100*misdemeanor_arrests / population
misdemeanor_data.head(5)
Agency Name | Total Population of Jurisdiction | Total Misdemeanor Arrests (MACR 2016) | % Misdemeanor Arrests |
---|---|---|---|
Alameda Police Department | 79654 | 894.0 | 1.1223541818364426 |
Alhambra Police Department | 86475 | 642.0 | 0.7424111014744146 |
Anaheim Police Department | 354891 | 5803.0 | 1.6351499474486533 |
Antioch Police Department | 112630 | 2235.0 | 1.9843736127141969 |
Bakersfield Police Department | 379741 | 17643.0 | 4.646061394476762 |
The % Misdemeanor Arrests column shows how many out of every 100 people were arrested for misdemeanors in that jurisdiction.
6. Computing Percentiles
Now we want to rank the police departments based on the percentage of misdemeanor arrests using percentiles. If a department gets a score of 95, that means it's better than 95% of all CA police departments.
To compute this, you just have to order the departments from best (smallest % misdemeanor arrests) to worst (largest % misdemeanor arrests).
Suppose there are 50 departments in total. The first department on the list is the best and gets a percentile score of 100. The second on the list is the same or better than 49/50 = 98% of the other departments and gets a score of 98. The next on the list gets 48/50 = 96% and so on...
NOTE: You have to modify this if you have departments with the same % misdemeanor arrests (they can't occupy the same position on the list, but they should get the same score). Exact ties are unlikely here, but the idea is the same in any case.
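One tie-aware variant (a sketch using pandas' `rank(method='min')`, not the loop used below) gives tied departments the same, best-available score:

```python
import pandas as pd

# Made-up arrest percentages; the two middle departments are tied.
pct = pd.Series({'A PD': 0.5, 'B PD': 1.0, 'C PD': 1.0, 'D PD': 2.0})

n = len(pct)
# method='min' gives tied values the same (best) rank: 1, 2, 2, 4.
ranks = pct.rank(method='min')

# Same arithmetic as the loop below: rank 1 scores 100, rank 2 scores (1 - 1/n)*100, ...
percentiles = (1 - (ranks - 1) / n) * 100
print(percentiles)
```

Without ties this reproduces the loop's scores exactly; with ties, both tied departments get the score of the better position.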
## First reorder the departments from best to worst
misdemeanor_data = misdemeanor_data.sort_values('% Misdemeanor Arrests')
misdemeanor_data.head(5)
Agency Name | Total Population of Jurisdiction | Total Misdemeanor Arrests (MACR 2016) | % Misdemeanor Arrests |
---|---|---|---|
Monterey County Sheriff's Department | 99533 | 439.0 | 0.4410597490279606 |
San Jose Police Department | 1055778 | 5741.0 | 0.5437696182341364 |
Richmond Police Department | 111493 | 661.0 | 0.5928623321643511 |
Elk Grove Police Department | 169153 | 1031.0 | 0.60950736906824 |
Marin County Sheriff's Department | 101302 | 637.0 | 0.6288128566069772 |
## Now we create an empty column that will later be filled with the percentile
## scores.
## the numpy nan function fills the column with NaN = 'not a number'
misdemeanor_data['Misdemeanor Percentile'] = np.nan
misdemeanor_data.head(5)
Agency Name | Total Population of Jurisdiction | Total Misdemeanor Arrests (MACR 2016) | % Misdemeanor Arrests | Misdemeanor Percentile |
---|---|---|---|---|
Monterey County Sheriff's Department | 99533 | 439.0 | 0.4410597490279606 | |
San Jose Police Department | 1055778 | 5741.0 | 0.5437696182341364 | |
Richmond Police Department | 111493 | 661.0 | 0.5928623321643511 | |
Elk Grove Police Department | 169153 | 1031.0 | 0.60950736906824 | |
Marin County Sheriff's Department | 101302 | 637.0 | 0.6288128566069772 | |
## We define a new variable which gives the total number of police departments.
number_of_departments = len(misdemeanor_data.index)
## print the value
print('Total number of police departments:', number_of_departments)
## The following for-loop is a simple way to do the procedure explained at the
## beginning of the section. For each department it computes the percentage
## of departments that are lower (including itself) on the list.
row_index = 0 # This is the starting position on the list. We increase it by 1
# each time the loop runs to keep track of where we are on the list.
for agency_name in misdemeanor_data.index: ## misdemeanor_data.index is the list
# of Agency Names ordered from best to worst. The for-loop goes over every item
# in the list. We called these items agency_name, but you could name them whatever
# you want.
# the code belonging to the for-loop is indented
agency_percentile = (1 - row_index/number_of_departments)*100 # This calculates
# the percentile. Work out the first few on paper to convince yourself.
misdemeanor_data.at[agency_name, 'Misdemeanor Percentile'] = agency_percentile
# This inserts the value into our table
row_index = row_index + 1
# We add 1 to the row_index to go to the next position on the list.
# row_index = 0 on the first iteration, then 1,2,3, etc. all the way up to
# number_of_departments - 1.
# Once the for-loop is over, we unindent and show the results
misdemeanor_data
NOTE: The data for the San Francisco Sheriff's Department is incomplete, which led to it being sorted at the bottom of the list. Its score doesn't mean anything.
NOTE: There are fancier and faster ways than using a for-loop, but this is the most basic.
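One such faster alternative replaces the loop with a single numpy expression, assuming the dataframe is already sorted from best to worst. A sketch on toy data (the values are made up):

```python
import numpy as np
import pandas as pd

# Toy stand-in for misdemeanor_data, already sorted best to worst.
toy = pd.DataFrame({'% Misdemeanor Arrests': [0.4, 0.6, 1.1, 2.0]},
                   index=['A PD', 'B PD', 'C PD', 'D PD'])

n = len(toy)
# Same arithmetic as the loop: row 0 scores 100, row 1 scores (1 - 1/n)*100, ...
toy['Misdemeanor Percentile'] = (1 - np.arange(n) / n) * 100
print(toy['Misdemeanor Percentile'].tolist())
```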
CHALLENGE:
Do the same procedure with the Total Deadly Force Incidents 2016-2018 (All Firearms Discharges + Other Force Resulting in Death or Serious Injury) (URSUS Dataset) and Total Arrests, 2016-2018 columns of the police_scorecard_data that we imported at the beginning of the notebook.
Here we want to compute the percentage of deadly force incidents per arrest for each department and rank them accordingly.
NOTE: You might have to reimport the data by rerunning the cells at the beginning of the notebook.
Filter and index:
Agency Name
Total Arrests, 2016-2018
Total Deadly Force Incidents 2016-2018 (All Firearms Discharges + Other Force Resulting in Death or Serious Injury) (URSUS Dataset)
deadly_force = police_scorecard_data.filter(['Agency Name', 'Total Arrests, 2016-2018', 'Total Deadly Force Incidents 2016-2018 (All Firearms Discharges + Other Force Resulting in Death or Serious Injury) (URSUS Dataset)'], axis=1)
deadly_force.set_index('Agency Name', inplace=True)
deadly_force.rename(columns={'Total Arrests, 2016-2018': 'Total Arrests', 'Total Deadly Force Incidents 2016-2018 (All Firearms Discharges + Other Force Resulting in Death or Serious Injury) (URSUS Dataset)': 'Total Deadly Force'}, inplace=True)
deadly_force.head(5)
Agency Name | Total Arrests | Total Deadly Force |
---|---|---|
Alameda Police Department | 3581 | 1 |
Alhambra Police Department | 3911 | 3 |
Anaheim Police Department | 28200 | 17 |
Antioch Police Department | 10115 | 10 |
Bakersfield Police Department | 51337 | 67 |
Calculate what percentage of arrests involved deadly force, sort by this rate, and drop the San Francisco County Sheriff's Department because its data is incomplete.
deadly_force['Percent Deadly'] = deadly_force['Total Deadly Force']/deadly_force['Total Arrests'] * 100
deadly_force.sort_values('Percent Deadly', inplace=True)
deadly_force.drop(index='San Francisco County Sheriff\'s Department', inplace=True)
deadly_force
## rank(pct=True) gives a 0-1 score where the *worst* department gets 1.0,
## the reverse of the 0-100 convention from section 6.
deadly_force['Percent Deadly Percentile'] = deadly_force['Percent Deadly'].rank(pct=True, method='min')
deadly_force.tail(5)
Agency Name | Total Arrests | Total Deadly Force | Percent Deadly | Percent Deadly Percentile |
---|---|---|---|---|
Riverside County Sheriff's Department | 12084 | 36 | 0.29791459781529295 | 0.9745222929936306 |
Modoc County Sheriff's Department | 636 | 2 | 0.3144654088050315 | 0.9808917197452229 |
Orange County Sheriff's Department | 10833 | 42 | 0.3877042370534478 | 0.9872611464968153 |
Tehama County Sheriff's Department | 2293 | 9 | 0.39249890972525076 | 0.9936305732484076 |
Richmond Police Department | 4864 | 23 | 0.47286184210526316 | 1.0 |
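A small sketch of how `rank(pct=True, method='min')` behaves on toy values (unlike section 6's scores, the worst entry gets 1.0, not 0):

```python
import pandas as pd

s = pd.Series([0.1, 0.2, 0.2, 0.9])
# method='min' gives ranks 1, 2, 2, 4; pct=True divides them by the number of rows.
print(s.rank(pct=True, method='min').tolist())
```

To match the earlier 0-100 best-is-100 convention, you would have to flip and rescale these values.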
SUPER CHALLENGE:
Do it with a dataset of your choice.