Hisham Hussein Jul 08, 2017

The Analytical Police Officer

1.
Understanding predictive policing through visualization

The explosion of computerized data affects all parts of society, including law and order. Traditionally, hunches and experience have been the only techniques available for predicting crime. Police forces everywhere are now augmenting human judgment with statistical analytics, sometimes described as predictive policing.

This includes using predictive models, such as regression and classification — estimating relationships between variables, and assigning observations to known categories — to predict the location, time or number of crimes. Yet such models are useless on their own to your average patrol person.

Clear visualizations can bridge this knowledge gap. Let's look at two case studies: car theft in Chicago and murders across the United States.

1.1.
Setting up our R environment

We'll use the R language for this analysis.

Cell has changes
install dependencies
R
ipak <- function(pkg){
    new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
    if (length(new.pkg))
        install.packages(new.pkg, dependencies = TRUE)
    sapply(pkg, require, character.only = TRUE)
}

options(repos = c(CRAN = 'http://cran.rstudio.com'), unzip = 'internal')
packages <- c("ggplot2", "dplyr", "purrr", "readr", "lubridate", "maps", "ggmap", "ggthemes")
ipak(packages)
Done
libraries
R
library(ggplot2)      # visualization
library(dplyr)        # data manipulation
library(purrr)        # functional programming
library(readr)        # reading data into R
library(lubridate)    # date and time manipulation
library(plotly)       # interactive visualization on the web
library(maps)         # fetching map data
library(ggmap)        # plotting maps
library(ggthemes)     # adding custom themes to ggplot graphs

# let's set a minimalist theme for ggplot
theme_set(theme_minimal())   

2.
When and where are cars likely to be stolen in Chicago?

The Chicago Police Department's crime database offers us the time and location of car thefts in the CSV format.

2.1.
Reading And Preparing Data for Analysis

Let's see what information we have.

Done
vehicle data
R
# mvt stands for motor vehicle theft
mvt <- read.csv(mvt.csv, header = TRUE, stringsAsFactors = FALSE)

# have a look at the structure of data
str(mvt, give.attr = FALSE)
'data.frame':	191641 obs. of  3 variables:
 $ Date     : chr  "12/31/12 23:15" "12/31/12 22:00" "12/31/12 22:00" "12/31/12 22:00" ...
 $ Latitude : num  41.8 41.9 42 41.8 41.8 ...
 $ Longitude: num  -87.6 -87.7 -87.8 -87.7 -87.6 ...

We notice that the Date is parsed as a character. Let's use lubridate to convert it to a date-time structure, to extract the day of the week and the hour of the day.

Done
date fix
R
mvt$Date <- mdy_hm(mvt$Date)

# now let's look again
str(mvt, give.attr = FALSE)
'data.frame':	191641 obs. of  3 variables:
 $ Date     : POSIXct, format: "2012-12-31 23:15:00" "2012-12-31 22:00:00" ...
 $ Latitude : num  41.8 41.9 42 41.8 41.8 ...
 $ Longitude: num  -87.6 -87.7 -87.8 -87.7 -87.6 ...

Now our dates are ripe for extraction.

Done
show dates
R
print(head(date fix.mvt))
# A tibble: 6 x 3
                 Date Latitude Longitude
               <dttm>    <dbl>     <dbl>
1 2012-12-31 23:15:00 41.75628 -87.62164
2 2012-12-31 22:00:00 41.89879 -87.66130
3 2012-12-31 22:00:00 41.96919 -87.76767
4 2012-12-31 22:00:00 41.76933 -87.65773
5 2012-12-31 21:30:00 41.83757 -87.62176
6 2012-12-31 20:30:00 41.92856 -87.75400

Of course, we're not just interested in the exact time and date a crime occured. We want to elicit patterns across years, months, days of the week, and hours of the day.

So let's extract each date component as a new column.

Done
extract dates
R
mvt <- mvt %>% 
  mutate(Year = as.factor(year(Date)),
         Month = month(Date),
         Weekday = wday(Date),
         Hour = as.factor(hour(Date)))

# let's look at our data again
print(head(mvt))
                 Date Latitude Longitude Year Month Weekday Hour
1 2012-12-31 23:15:00 41.75628 -87.62164 2012    12       2   23
2 2012-12-31 22:00:00 41.89879 -87.66130 2012    12       2   22
3 2012-12-31 22:00:00 41.96919 -87.76767 2012    12       2   22
4 2012-12-31 22:00:00 41.76933 -87.65773 2012    12       2   22
5 2012-12-31 21:30:00 41.83757 -87.62176 2012    12       2   21
6 2012-12-31 20:30:00 41.92856 -87.75400 2012    12       2   20

2.2.
Visualizing Chicago car thefts by location and time

Let's try to answer these questions.

  • Where are the Chicago car theft hot spots?
  • Are cars stolen more often in a particular month of the year, day of the week, or hour of the day?

2.2.1.
Where are car theft hot spots in Chicago?

Let's graph the distribution of car theft locations over a map of Chicago.

Done
mapDF
R
ncrimesLatLong <- mvt %>%
  mutate(Long = round(Longitude, 2),
         Lat = round(Latitude, 2)) %>% 
  filter(!is.na(Lat) | !is.na(Long)) %>% 
  group_by(Long, Lat) %>% 
  summarize(CrimeCount = n())

print(head(ncrimesLatLong))
Source: local data frame [6 x 3]
Groups: Long [5]

    Long   Lat CrimeCount
   <dbl> <dbl>      <int>
1 -87.93 41.99          1
2 -87.92 41.96          1
3 -87.91 41.96          1
4 -87.91 41.98        340
5 -87.90 41.98          1
6 -87.89 41.98          4
Done
mvtMap
R
Sys.setenv("MAPBOX_TOKEN" = 'pk.eyJ1IjoiaGlzaGFtaHVzc2VpbiIsImEiOiJjajBsZTZxMGcwMDA0MndwaWlyd3pkNmZlIn0.TMEP6gfJCxD8OCIED5Ddrw')

plot_mapbox(ncrimesLatLong, x = ~Long, y = ~Lat, 
                  hoverinfo = 'text',
                  text = ~paste('No. of Crimes:', CrimeCount, 
                                '<br>Location:', Long, ",", Lat)) %>%
  add_markers(size = ~CrimeCount, sizes = c(10,40), 
              color = ~CrimeCount,
              colors = c("yellow", "red"),
              marker = list(symbol = 'circle', 
                            sizemode = 'diameter',
                            opacity = 0.6,
                            line = list(color = "red", width = 2))) %>% 
  layout(mapbox = list(accesstoken = 'pk.eyJ1IjoiaGlzaGFtaHVzc2VpbiIsImEiOiJjajBsZTZxMGcwMDA0MndwaWlyd3pkNmZlIn0.TMEP6gfJCxD8OCIED5Ddrw',
                       zoom = 9,
                       center = list(lat = ~median(Lat, na.rm=T),
                                     lon = ~median(Long, na.rm=T))))

2.2.2.
Are cars stolen more often in specific months in a year, days of the week, or hours of the day?

First, let's group our data by month and compute the total number of thefts in January over all ten years.

Done
thefts by month
R
mvt %>%
  group_by(Month) %>% 
  summarize(CrimeFreq = n()) %>%
  plot_ly(x = ~Month, y = ~CrimeFreq,
          type = "scatter", mode = "markers+lines", 
          marker = list(color = 'purple'),
          line = list(color = 'purple')) %>% 
  layout(title = "Total Monthly Crimes",
         xaxis = list(title = ""),
         yaxis = list(title = "Crime Frequency"))

Chicago car thieves seem busy in July and October, and barely at all in February. We don't know why - but this prompts more questions.

Let's repeat this process for weekdays and hours of the day. We'll automate this with a small function.

Done
thefts by weekday/hour
R
vars <- c("Month", "Weekday", "Hour")
cols <- c("red", "blue", "purple")

myPlot <- function(var, col) {
  mvt %>% 
    group_by_(var) %>% 
    summarize(CrimeFreq = n()) %>% 
    plot_ly(x = as.formula(paste0("~", var)), y = ~CrimeFreq, 
            type='scatter', mode="lines+markers",
            name = var, opacity = 0.7,
            marker=list(color = col),
            line = list(color = col))
}

plots <- map2(vars, cols, myPlot)

subplot(plots, nrows = 3) %>% layout(title = "Crime Frequency vs. Time Units")

More patterns emerge. But we can do better. Let's correlate by weekday and hour.

Done
weekday/hour correlation
R
mvt %>%
  group_by(Weekday, Hour) %>% 
  summarise(CrimeFreq = n()) %>% 
  plot_ly(x = ~Hour, y = ~CrimeFreq, color = ~Weekday) %>% 
  add_lines() %>% 
  layout(title = "Total Motor Vehicle Thefts by Day and Hour")
[error] export of term '__nj_self__' failed with: Error in Summary.factor(structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, : 'range' not meaningful for factors

Warning message:
Numeric color variables cannot (yet) be mapped to lines.
 when the trace type is 'scatter' or 'scattergl'.
 

Better — but this plot is still hard to interpret. Seven lines is a lot! Let’s try a heatmap instead.

Done
lightgreen
R
 ggplotly(mvt %>% 
  group_by(Weekday, Hour) %>% 
  summarize(CrimeFreq = n()) %>%
  ggplot(aes(x = Hour, y = Weekday)) +
  # add the heat map, and map CrimeFreq to the color of each square
  geom_tile(aes(fill = CrimeFreq)) +
  # change the name of the legend
  scale_fill_gradient(name = "No. of Thefts", low = 'white', high = 'red') +
  labs(x = "Hour of the Day", y = "Crime Frequency",
       title = "Car Thefts Heat Map") +
  theme_tufte())

Now it's clear: Friday night may need more police on the streets.

3.
Which US state has the highest murder rate?

While this seems like a simple question to answer, the data can be misleading. Let's ensure we're analyzing it right and create a heat map of murder rates per state. We can find this data at the U.S. Census Bureau and the FBI.

Let's look at the raw data.

Done
murders
R
murders <- read.csv(murders.csv, header = TRUE, stringsAsFactors = FALSE)
str(murders, give.attr = F)
'data.frame':	51 obs. of  6 variables:
 $ State            : chr  "Alabama" "Alaska" "Arizona" "Arkansas" ...
 $ Population       : int  4779736 710231 6392017 2915918 37253956 5029196 3574097 897934 601723 19687653 ...
 $ PopulationDensity: num  94.65 1.26 57.05 56.43 244.2 ...
 $ Murders          : int  199 31 352 130 1811 117 131 48 131 987 ...
 $ GunMurders       : int  135 19 232 93 1257 65 97 38 99 669 ...
 $ GunOwnership     : num  0.517 0.578 0.311 0.553 0.213 0.347 0.167 0.255 0.036 0.245 ...

We have 51 observations for the 50 states — plus Washington DC — and six different variables: name, population, population density, number of murders, the number of gun-related murders, and the rate of gun ownership.

R includes a geometric description of the US map. Let's load it and look at its contents.

Done
states
R
statesMap <- map_data('state')

# Let's see what this looks like by typing in str(statesMap).
str(statesMap)
'data.frame':	15537 obs. of  6 variables:
 $ long     : num  -87.5 -87.5 -87.5 -87.5 -87.6 ...
 $ lat      : num  30.4 30.4 30.4 30.3 30.3 ...
 $ group    : num  1 1 1 1 1 1 1 1 1 1 ...
 $ order    : int  1 2 3 4 5 6 7 8 9 10 ...
 $ region   : chr  "alabama" "alabama" "alabama" "alabama" ...
 $ subregion: chr  NA NA NA NA ...

We see a data frame summarizing how to draw the states. To draw the map here, we turn to ggplot polygons geometry.

Done
emptyMap
R
statesMap_p <-ggplot(data = statesMap, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = 'white', color= 'black', size = 0.5) +
  labs(title = "Raw Map of the United States") + 
  theme_tufte() +
  theme(axis.title.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.text.y = element_blank(),
        axis.title.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.text.x = element_blank())

ggplotly(statesMap_p) %>% layout(width = 880, height = 550)

Before plotting our murder data, let's ensure the state name matches across these data frames. We see some capitalization differences, so let’s create a new variable called region in our murders data frame to match state names up.

Done
addRegn
R
murders <- murders %>% 
  mutate(region = tolower(State)) %>% 
  select(State, region, everything())

# let's have a look again
print(head(murders,10))
                  State               region Population PopulationDensity
1               Alabama              alabama    4779736            94.650
2                Alaska               alaska     710231             1.264
3               Arizona              arizona    6392017            57.050
4              Arkansas             arkansas    2915918            56.430
5            California           california   37253956           244.200
6              Colorado             colorado    5029196            49.330
7           Connecticut          connecticut    3574097           741.400
8              Delaware             delaware     897934           470.700
9  District of Columbia district of columbia     601723         10298.000
10              Florida              florida   19687653           360.200
   Murders GunMurders GunOwnership
1      199        135        0.517
2       31         19        0.578
3      352        232        0.311
4      130         93        0.553
5     1811       1257        0.213
6      117         65        0.347
7      131         97        0.167
8       48         38        0.255
9      131         99        0.036
10     987        669        0.245

Now we can join these data frames using dplyr .

Done
joinedDF
R
murderMap <- statesMap %>% 
  inner_join(murders, by = "region") %>% 
  # add a column (region_abbr) that matches states full names to its appreviations
  mutate(region_abbr = state.abb[match(State,state.name)]) %>% 
  select(long, lat, group, order, region, region_abbr, everything())

# look at the new table
print(head(murderMap))
       long      lat group order  region region_abbr subregion   State
1 -87.46201 30.38968     1     1 alabama          AL      <NA> Alabama
2 -87.48493 30.37249     1     2 alabama          AL      <NA> Alabama
3 -87.52503 30.37249     1     3 alabama          AL      <NA> Alabama
4 -87.53076 30.33239     1     4 alabama          AL      <NA> Alabama
5 -87.57087 30.32665     1     5 alabama          AL      <NA> Alabama
6 -87.58806 30.32665     1     6 alabama          AL      <NA> Alabama
  Population PopulationDensity Murders GunMurders GunOwnership
1    4779736             94.65     199        135        0.517
2    4779736             94.65     199        135        0.517
3    4779736             94.65     199        135        0.517
4    4779736             94.65     199        135        0.517
5    4779736             94.65     199        135        0.517
6    4779736             94.65     199        135        0.517

Finally, let’s plot the number of murders on our US map. We’ll again use the ggplot function, but this time, our data frame is murderMap:

Done
freqMap
R
# create a df to store the coordinates of the centers of each state
# to place name abbreviation of each state on this center

centroids <- murderMap %>% 
  group_by(State, region, region_abbr) %>% 
  summarise(long = median(range(long), na.rm=T),
            lat  = median(range(lat), na.rm=T))

murderMap_p <-ggplot(data = murderMap, aes(x = long, y = lat)) +
  geom_polygon(aes(group = group, fill = Murders)) +
  geom_text(data = centroids, aes(long, lat, label = region_abbr), size = 2) +
  scale_fill_gradient(low = 'white', high = 'red') +
  theme_tufte() +
  theme(axis.title.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.text.y = element_blank(),
        axis.title.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.text.x = element_blank()) +
  labs(title = "Frequency of Murders Across The United States")

ggplotly(murderMap_p)%>% layout(width = 880, height = 550)
Warning message:
Specifying width/height in layout() is now deprecated.
Please specify in ggplotly() or plot_ly() 

So it looks like California and Texas lead the country in number of murders. Is that just because they're the most populous states? Let’s graph each state's population to verify.

Done
popMap
R
populationMap_p <-ggplot(data = murderMap, aes(x = long, y = lat)) +
  geom_polygon(aes(group = group, fill = Population)) +
  geom_text(data = centroids, aes(long, lat, label = region_abbr), size = 2) +
  scale_fill_gradient(low = 'white', high = 'red') +
  theme_tufte() +
  theme(axis.title.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.text.y = element_blank(),
        axis.title.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.text.x = element_blank()) +
  labs(title = "Population by State")

ggplotly(populationMap_p) %>% layout(width = 880, height = 550)

The population map is nearly identical to the murders map. So, we need to plot the murder rate (per every 100,000 people) to gain more insight.

Done
propMap
R
# add a column for the murder rate (per each 100,000 of population)
murderMap <- murderMap %>% 
  mutate(murderRate = Murders/Population * 100000) 

murderRateMap_p <-ggplot(data = murderMap, aes(x = long, y = lat)) +
  geom_polygon(aes(group = group, fill = murderRate)) +
  geom_text(data = centroids, aes(long, lat, label = region_abbr), size = 2) +
  scale_fill_gradient(low = 'white', high = 'red') +
  theme_tufte() +
  theme(axis.title.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.text.y = element_blank(),
        axis.title.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.text.x = element_blank()) +
  labs(title = "Murder Rate <br> Per every 100,000",
       fill = "Murder Rate")

ggplotly(murderRateMap_p) %>% layout(width = 880, height = 550)

Odd - why aren't there red states?

It turns out that Washington, DC is an outlier with a very high murder rate, but it's so small we can't see it! So let’s remove any observations with murder rates above 10. This should only exclude Washington, DC.

finalMap
R
murderRateMap2_p <-ggplot(data = murderMap, aes(x = long, y = lat)) +
  geom_polygon(aes(group = group, fill = murderRate)) +
  geom_text(data = centroids, aes(long, lat, label = region_abbr), size = 2) +
  # by adding the argument (limits), we can remove the unwanted values
  scale_fill_gradient(low = 'white', high = 'red', limits = c(0, 10)) +
  theme_tufte() +
  theme(axis.title.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.text.y = element_blank(),
        axis.title.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.text.x = element_blank()) +
  labs(title = "Murder Rate <br> Per every 100,000 Except WA",
       fill = "Murder Rate")

ggplotly(murderRateMap2_p) %>% layout(width = 880, height = 550)

The picture is clearer now. As far as states go, Louisiana has the highest murder rate.

4.
The future of open predictive policing

Our simple analyses show great promise. We can see how they may be more effective than raw intuition. But can we trust them? How do we ensure we don't introduce racial bias, or worse?

As commercial predictive policing tools come under fire from the public, it's important for data scientists to explore and expose these models to peer reviewers, and to the public at large.