Final 551 Project

Author

Bena Smith, Brandon Kim, Amanda Belden

Introduction

Many people consider Chicago a great place to visit for its tourist attractions and big-city culture, but the city is also known for its high crime rates. Our project uses Chicago crime data to determine whether certain neighborhoods of Chicago are associated with certain types of crime. We plan to use several modeling and visualization techniques to explore this topic and to give Chicago locals, tourists, and police a better understanding of crime in their city.

Our three tasks will be:

1. Using multiple unsupervised clustering methods of varying precision to determine whether specific locations (based on latitude and longitude) and times are associated with the frequency of certain crimes and with the arrest rates for these crimes, with particular attention to homicides, sexual assault, and officer interference.
2. Developing a gradient boosted decision tree model to predict which type of crime may be reported given a location, time, and other factors.
3. Developing a gradient boosted decision tree model to predict whether an arrest was made given the same factors.

Our goal with the first task is to help tourists and anyone who is unfamiliar with Chicago or concerned about safety to be better informed about trends in different crimes. Our clustering analysis aims to find specific times and locations in which certain crimes are significantly more common than others. Determining the centroids of these clusters and their ranges in coordinates, times, and dates will help people be more wary of certain crimes depending on location. We also want to focus on homicides, sexual assault, and officer interference by running separate, more specific cluster analyses to identify trends. Insights into these three incident types can be vital for Chicago residents: homicides and sexual assaults are very disturbing, and officer interference can help gauge the effectiveness of the Chicago Police Department in specific areas.

Although beneficial, this type of analysis is subject to several biases and can mislead readers who are less informed. For instance, it is important to realize that we cluster on the type of crime rather than on its frequency. This means we can determine the most common crimes for a given place and time, but not the odds of a particular crime happening. However, the size of a cluster gives an idea of how often crimes happen in a given area.

Our goal with the second task is to let locals, tourists, and police choose a location and time and receive a prediction of which type of crime may occur there. This could help locals choose places to live or routes to work, and help tourists make informed choices about where they should or should not visit. Additionally, it could allow police to send resources suited to that specific crime to the given location.

Our third task is to make a model to predict if an arrest was made based on information about the crime including type of crime, time of day, day of the week, and location. This will allow Chicago police to know how they might reorganize their efforts. This will also give future researchers insights into how arrests are made in Chicago.

Overall, the findings of this project are intended to benefit multiple groups of people. Tourists planning to visit Chicago could benefit from a better understanding of safe and potentially risky areas. Residents will have more information which will allow them to make better decisions on their living situation and daily activities. Finally, the Chicago police will have more insight into crime hotspots, enabling them to better distribute their resources as necessary.

Data Source and Context

The datasets used in this analysis are from the Chicago Data Portal, specifically the Crimes dataset (Chicago Data Portal) and the Homicide Map dataset (Chicago Data Portal). These are compiled by the Chicago Police Department using their system called CLEAR - Citizen Law Enforcement Analysis and Reporting System, which is available to the public.

While this data includes all reported crimes published by the Chicago Police Department, it may carry inherent bias. Racial profiling by police may inflate the number of crimes reported for people of color. Additionally, people who live in lower-income or gang-affected areas may be more likely to be arrested on suspicion of a crime than people from more affluent neighborhoods. It is also important to note that this data reflects reported crimes in Chicago, not the true number of crimes. Crimes that tend to be reported less often, such as domestic assault or sexual assault, are unlikely to be adequately represented in this analysis. We therefore account for these biases in our conclusions and limit our generalizations so that the groups mentioned, as well as others, are not negatively affected by our findings. We will also investigate how crimes change over time, but we recognize that such changes may reflect shifts in how heavily these crimes are policed rather than changes in how often they actually occur. We recognize that this data involves sensitive information, and we will do everything possible to ensure that our results are not used in a way that may harm certain groups of people.

In conclusion, this project seeks to find patterns of crime throughout Chicago and contribute to the greater good by giving the people of Chicago access to easily understandable information about the safety of their surroundings.

Previous Work

(Schreck, McGloin, and Kirk 2009) studied 300 neighborhoods in Chicago over 2 years (1995-1996) and found that these neighborhoods “clearly distinguish themselves based upon the types of crimes that occur there.” Some neighborhoods are more likely to have higher rates of violent or nonviolent crime, and these trends were often stable across years. We want to look at more recent years and a longer time span to see whether these findings still hold. We would also like to focus on more specific types of crime instead of the broad categories of violent and nonviolent.

Reasoning for the different distributions of different types of crime is proposed in “Street Gang Crime in Chicago” (Block and Block 1993) which finds that street gangs in Chicago have different areas that they are concentrated in and different crimes that they engage in. “Most of the criminal activity in smaller street gangs centered on representation turf defense. The most lethal street gang hot spot areas are along disputed boundaries between small street gangs…Street gangs specializing in instrumental violence were strongest in disrupted and declining neighborhoods. Street gangs specializing in expressive violence were strongest and most violent in relatively prosperous neighborhoods with expanding populations.”

This study was carried out in 1993, but “An analysis of police responses to gangs in Chicago” (Lemmer, Bensinger, and Lurigio 2008) finds that this hot-spot nature of street gangs has continued. This is relevant when locating the areas of Chicago in which violent and nonviolent crimes occur, since there are likely underlying reasons why certain crimes occur in certain areas. When we perform cluster analysis on latitude and longitude, we want to compare the results to an actual map of Chicago and, potentially, to current information about street gangs.

Ba et al. find in “The Role of Officer Race and Gender in Police-Civilian Interactions in Chicago” that “Chicago is also heavily segregated, [and] has a history of racial tensions between residents and police” (Ba et al. 2021). This potentially affects arrest rates, as Smith and Petrocelli write in “Racial Profiling? A Multivariate Analysis of Police Traffic Stop Data” that in the US, “Historically, minorities, and particularly African Americans, have had physical force used against them or have been arrested or stopped by police at rates exceeding their percentage in the population.” (Smith and Petrocelli 2001). We would like to study the distribution of arrest rates across different locations in Chicago and across different types of crime. When analyzing these rates and locations, we would like to reference maps that include demographic information about the people living in these areas.

Different types of crimes may also occur at different times of the day. The US Department of Justice (“Violent Crime Time of Day(per 1,000 in Age Group)” 2022) finds that, “In general, the number of violent crimes committed by adults increases hourly from 6 a.m. through the afternoon and evening hours, peaks at 9 p.m., and then drops to a low point at 5 a.m. In contrast, violent crimes committed by youth peak in the afternoon between 3 p.m. and 4 p.m., the hour at the end of the school day. More than one-third (37%) of all violent crime committed by youth occurs in the 5 hour period between noon and 5 p.m. In comparison, 30% of all violent crime committed by adults occurs between 6 p.m. and 11 p.m.” We want to perform analysis on the time of day that certain types of crimes occur in Chicago and compare them to this data that is an aggregate of 45 states and DC.

(Jenness and Grattet 1996) studied crime in Denver, Colorado and Los Angeles, California. They created a model to predict types of crimes occurring at a certain location, time, and day of the week. They used a Decision Tree classifier and a Naïve Bayesian classifier and achieved 51% prediction accuracy in Denver and 54% prediction accuracy in Los Angeles for predicting the type of crime.

We would like to draw inferences about crime trends with respect to location type, location, and time of day, and we would also like to build a model that attempts to predict crime type using time of day, location, and location type as predictors. We must be very careful about how this model is used, because we should not assume which crime someone committed based only on these non-descriptive predictors. This model should be used only for personal interest and resource organization and should not be used for prosecution in any manner. We should also be very careful about the ethical implications of police focus. We address these concerns further in the ethical implications section.

Data Cleaning

Code

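# Packages used by the cleaning code below
library(dplyr)
library(tidyr)      # drop_na()
library(lubridate)  # mdy_hms(), month(), hour()

crimes <- crimes %>%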
  select(-c(`District`, `Community Area`)) %>%
  bind_rows(homicides) %>%
  drop_na() %>% 
  filter(`X Coordinate` != 0) %>%
  mutate(`Primary Type` = case_when(
    `Primary Type` == "CRIM SEXUAL ASSAULT" ~ "CRIMINAL SEXUAL ASSAULT", 
    `Primary Type` == "NON-CRIMINAL (SUBJECT SPECIFIED)" ~ "NON-CRIMINAL",
    `Primary Type` == "OTHER NARCOTIC VIOLATION" ~ "NARCOTICS",
    `Primary Type` == "SEX OFFENSE" ~ "CRIMINAL SEXUAL ASSAULT",
    `Primary Type` == "NON - CRIMINAL" ~ "NON-CRIMINAL",
    .default = `Primary Type`), 
    Date = mdy_hms(Date), 
    Month = as.factor(month(Date)),
    Hour = as.factor(hour(Date)), 
    Weekday = weekdays(Date),
    Ward = as.factor(Ward),
    Arrest = as.factor(Arrest), 
    `Location Description` = case_when(
      grepl("airport", `Location Description`, ignore.case = TRUE) ~ "Airport/Aircraft",
      grepl("aircraft", `Location Description`, ignore.case = TRUE) ~ "Airport/Aircraft",
      grepl("Tavern", `Location Description`, ignore.case = TRUE) ~ "Tavern",
      grepl("CHA ", `Location Description`, ignore.case = TRUE) ~ "CHA",
      grepl("College", `Location Description`, ignore.case = TRUE) ~ "College",
      grepl("CTA", `Location Description`, ignore.case = TRUE) ~ "CTA",
      grepl("RESIDENCE", `Location Description`, ignore.case = TRUE) ~ "Residence",
      grepl("SCHOOL", `Location Description`, ignore.case = TRUE) ~ "School",
      `Location Description` == "VEHICLE - OTHER RIDE SERVICE" ~ "VEHICLE - OTHER RIDE SHARE SERVICE (LYFT, UBER, ETC.)",
      `Location Description` == "VEHICLE - OTHER RIDE SHARE SERVICE (E.G., UBER, LYFT)" ~ "VEHICLE - OTHER RIDE SHARE SERVICE (LYFT, UBER, ETC.)",
      `Location Description` == "PARKING LOT/GARAGE(NON.RESID.)" ~ "PARKING LOT/GARAGE (NON RESIDENTIAL)",
      `Location Description` == "POLICE FACILITY/VEH PARKING LOT" ~ "POLICE FACILITY/VEHICLE PARKING LOT",
      `Location Description` == "NURSING HOME/RETIREMENT HOME" ~ "NURSING/RETIREMENT HOME",
      `Location Description` == "VEHICLE-COMMERCIAL: TROLLEY BUS" ~ "VEHICLE-COMMERCIAL",
      .default = `Location Description`),
    # Normalize separators and case after the recoding above
    `Location Description` = gsub(" / ", "/", `Location Description`),
    `Location Description` = gsub(" - ", "-", `Location Description`),
    `Location Description` = toupper(`Location Description`))

We wanted both data sets to share the same features, so we removed the District and Community Area columns. We also decided to drop all rows containing NA values, as fewer than 1% of observations had an NA value in a feature we cared about.

Additionally, in the larger Crimes data set, some crime types appear under multiple labels, so we used case_when to merge equivalent labels.
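The share of rows lost to drop_na() can be checked before dropping them. The following is a minimal sketch on a toy data frame; the column names and values are made up for illustration and are not taken from the Chicago data.

```r
# Toy data standing in for the crimes table (hypothetical values)
toy <- data.frame(
  ward = c(1, 2, NA, 4, 5),
  x_coordinate = c(1150000, NA, 1160000, 1170000, 1180000),
  primary_type = c("THEFT", "ARSON", "THEFT", "BATTERY", "THEFT")
)

# Fraction of rows with at least one missing value in the kept columns
na_share <- mean(!complete.cases(toy))
na_share

# Missing values per column, to see which features drive the loss
na_per_column <- colSums(is.na(toy))
na_per_column
```

If the share is small, as reported above for the real data, dropping the incomplete rows should have little effect on the analysis.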

Code

homicides <- homicides %>%
  janitor::clean_names() %>%
  drop_na() %>%
  filter(x_coordinate != 0) %>%
  mutate(date = as.POSIXct(date, format = "%m/%d/%Y %I:%M:%S %p"),
         year = year(date))

We also clean the homicides data set for some surface-level analysis and remove seemingly incorrect entries whose x and y coordinates equal 0.
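The format string passed to as.POSIXct in the cleaning code can be sanity-checked on a single timestamp. The sketch below uses a made-up date string and sets tz = "UTC" explicitly, which is an assumption; the original call relies on the session time zone.

```r
# Hypothetical timestamp in the month/day/year, 12-hour layout used above
raw <- "11/07/2023 11:45:00 PM"

# tz = "UTC" is an assumption made here so the result does not depend on the session
parsed <- as.POSIXct(raw, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

# Pull out the pieces used later as features
hour_of_day <- as.integer(format(parsed, "%H"))  # 24-hour clock
yr <- as.integer(format(parsed, "%Y"))

parsed
hour_of_day
yr
```

A value of NA for parsed would indicate that the format string does not match the raw text, which is worth checking before calling drop_na() on the result.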

Exploratory Analysis

We need to perform some surface level analysis before fitting cluster models, to gauge the effectiveness of geospatial cluster analysis.

We can create a map view of the recorded incidents and informally examine (eyeball) whether any hotspots are visibly recognizable from their coordinates alone, as we would hope these hotspots become clusters in our analysis. To reduce visual clutter, we can create separate map visualizations for each combination of the features we are interested in (i.e. day, month, year, type of crime). This lets us examine how much each hotspot varies in size, range, and location, which in turn helps us decide how many clusters to use in our cluster analysis. Specifically, we visualized this using an R Shiny app in which one can select specific features and view the hotspots of incidents.

For simplicity’s sake (and since shinyapps.io doesn’t allow data that exceeds 100 MB), we use only data relating to homicides, sexual assault, and officer interference, as those are the crimes we particularly care about. In addition, we include assault, robbery, motor vehicle theft, and prostitution, as those are general crimes that people are especially wary of in Chicago.

Leaflet Shiny App:

Code

include_app("https://b7iuz3-brandon-kim.shinyapps.io/ChicagoCrimeSubset/", height = "925px")

Looking at multiple clusters across different inputs, we can see that the clusters do vary in size, shape and range.
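One common way to complement this visual check when deciding on the number of clusters is an elbow plot of the total within-cluster sum of squares. The sketch below is not part of the original analysis: it uses synthetic points rather than the crime coordinates, and the seed and the range of k are arbitrary choices for illustration.

```r
set.seed(551)  # arbitrary seed so the toy example is reproducible

# Three synthetic blobs standing in for incident coordinates
pts <- rbind(
  matrix(rnorm(200, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(200, mean = 5, sd = 0.5), ncol = 2),
  matrix(rnorm(200, mean = 10, sd = 0.5), ncol = 2)
)

# Total within-cluster sum of squares for k = 1..8
ks <- 1:8
wss <- sapply(ks, function(k) kmeans(pts, centers = k, nstart = 10)$tot.withinss)

# The "elbow" is the k after which the curve flattens out
plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

On real coordinates the elbow is usually less sharp than on synthetic blobs, so it is best read alongside the map rather than instead of it.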

Additionally, bar plots of the number of incidents per category can point to trends worth investigating.

Bar Plot - Incidents by Crime Type:

Code

data.frame(table(crimes$`Primary Type`), stringsAsFactors = FALSE) %>%
  mutate(Crime = factor(Var1, levels = unique(Var1)[order(Freq)])) %>%
  plot_ly(x = ~Freq, y = ~Crime, type = "bar") %>%
  layout(title = "Frequency of Incidents in Chicago by Type of Crime",
         xaxis = list(title = ""), 
         yaxis = list(title = ""),
         annotations = list(
                        list(
                          x = 0.5,
                          y = -0.1,  
                          xref = "paper",
                          yref = "paper",
                          text = "Recorded by the City of Chicago between January 1st, 2001 and November 7th, 2023", 
                          showarrow = FALSE,
                          yanchor = "bottom"  
                        )
                      )
  )

Bar Plot - Incidents by Year:

Code

crimes %>% 
  mutate(Year = as.factor(year(Date))) %>% 
  group_by(Year) %>%
  summarize(Freq = n()) %>% 
  plot_ly(x = ~Freq, y = ~Year, type = "bar") %>%
  layout(title = "Frequency of Incidents in Chicago by Year", 
         xaxis = list(title = ""),
         yaxis = list(title = ""),
         annotations = list(
                        list(
                          x = 0.5,
                          y = -0.1,  
                          xref = "paper",
                          yref = "paper",
                          text = "Recorded instances are those identified by the City of Chicago as the types listed in the plot above", 
                          showarrow = FALSE,
                          yanchor = "bottom"  
                        )
                      )
         )

Task 1: General Cluster Analyses

Geospatial Cluster Analysis - General Crime Pattern Discernment

We want to see which types of crime are more common in different areas of Chicago. To do that, we performed K-means clustering on crime observations based on their X and Y coordinates. We chose to look at Arson, Liquor Law Violation, Gambling, Kidnapping, Concealed Carry License Violations, Criminal Trespassing, Narcotics, Burglary, and Interference with a Public Officer; the filter below also keeps Criminal Sexual Assault and Homicide. We chose a subset of crimes because the full list is long and the analysis could be clouded by the vast array of crime types. These crimes were selected because they had some of the highest proportions in the dataset and because they may matter to someone deciding whether or not to live in a certain area.

Code

subsetcrimes <- crimes %>%
  filter(`Primary Type` %in% c("ARSON", "LIQUOR LAW VIOLATION", "GAMBLING", "KIDNAPPING", "CONCEALED CARRY LICENSE VIOLATION", "CRIMINAL TRESPASS", "NARCOTICS", "BURGLARY", "INTERFERENCE WITH PUBLIC OFFICER", "CRIMINAL SEXUAL ASSAULT", "HOMICIDE"))

For visualization purposes, since the subset contains 1,328,825 observations, which would clutter a single plot too much for any useful pattern to be discerned, we use an animation that splits the data by year. When we performed K-means clustering, we used the 2002 centroids from this larger, multi-crime dataset as the initial centroids for the other years and plots. This makes the clusters more comparable across years and plots throughout this project. The plot is still hard to analyze because there are so many observations, so we also created an interactive table and graphic of these clusters showing the proportion of each crime type within each cluster each year. We still show the plot below so the reader can visualize these clusters. We preferred analyzing the proportions of crime types within clusters over raw counts of crime types because these clusters are arbitrary and not based on population size or area.

Code

years <- 2002:2023

# Empty data frame with the columns of subsetcrimes plus a cluster label
new_colnames <- append(colnames(subsetcrimes), "cluster")
new_crimes_w_clusters <- data.frame(matrix(ncol = length(new_colnames), nrow = 0))  # Thomas on stackoverflow https://stackoverflow.com/questions/25051528/error-in-adding-rows-to-an-empty-data-frame-in-r
colnames(new_crimes_w_clusters) <- new_colnames

# Run K-means on the X/Y coordinates of one year's incidents.
# 2002 starts from 5 random centers; later years start from prev_centroids.
get_clusters <- function(selected_year, prev_centroids = NULL) {
  data_by_year <- subset(subsetcrimes, Year == selected_year)
  data_by_year_red <- select(data_by_year, `X Coordinate`, `Y Coordinate`)

  if (selected_year == 2002) {
    km <- kmeans(data_by_year_red, centers = 5)
  } else {
    km <- kmeans(data_by_year_red, centers = prev_centroids)  ## line from chatgpt
  }

  data_by_year$cluster <- km$cluster

  list(yr_df_w_centers = data_by_year, centroids = km$centers)
}

# Cluster one year and append the labelled rows to the running result.
# Global assignment (<<-) is used because the function is called for its side effects.
get_clusters_helper <- function(yr) {
  if (yr == 2002) {
    retval <- get_clusters(yr)
    first_centroids <<- retval$centroids  # the 2002 centroids seed every later year
  } else {
    retval <- get_clusters(yr, first_centroids)
  }

  new_crimes_w_clusters <<- rbind(new_crimes_w_clusters, retval$yr_df_w_centers)
}

# walk() instead of map(): only the side effects are needed, not a returned list
walk(years, get_clusters_helper)
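The effect of seeding later years with earlier centroids can be seen on toy data. In this sketch (synthetic points, hypothetical starting values, arbitrary seed), passing a matrix to the centers argument of kmeans makes row i of that matrix the starting point for cluster i, so cluster numbers stay aligned from one fit to the next.

```r
set.seed(1)  # arbitrary seed for the synthetic points

# Two well-separated synthetic blobs for "year A", and a shifted copy for "year B"
year_a <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 10), ncol = 2)
)
year_b <- year_a + 0.3

# Hypothetical starting centroids: row 1 near the low blob, row 2 near the high blob
init <- rbind(c(0, 0), c(10, 10))

fit_a <- kmeans(year_a, centers = init)
fit_b <- kmeans(year_b, centers = fit_a$centers)  # warm start from year A

# Cross-tabulate the labels: matching cluster numbers refer to the same blob
table(year_a = fit_a$cluster, year_b = fit_b$cluster)
```

With a random start in each year, cluster 1 in one year could correspond to cluster 2 in another, which would make year-to-year comparisons and a fixed color scale misleading.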

Code

animatedplot <- ggplot(new_crimes_w_clusters, aes(x = `X Coordinate`, y = `Y Coordinate`, group = interaction(Year, `X Coordinate`), color=as.factor(cluster))) +
  geom_point(alpha=0.15) +
  transition_time(Year) +
  scale_color_manual(name = "Cluster", values = c("lightblue", "orange", "green", "pink", "plum")) +
  labs(color = "Cluster", 
       subtitle = "Year: {str_sub(frame_time, 1, 4)}", 
       title = "Many Crimes on X and Y Coordinates by Year with K Means Clusters", x = "X Coordinate", y = "Y Coordinate") 


#Jon Spring on Stack Overflow https://stackoverflow.com/questions/56411604/how-to-make-dots-in-gganimate-appear-and-not-transition
#finnstats on R bloggers https://www.r-bloggers.com/2021/05/animated-graph-gif-with-gganimate-ggplot/

Code

animate(animatedplot,  height = 500, width = 800,
        duration = 20, end_pause = 10, res = 100)
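The within-cluster proportions mentioned above, as opposed to raw counts, could be tabulated along these lines. This is a minimal sketch on a made-up data frame; the column names cluster and primary_type and all values are placeholders, not the project's data.

```r
# Toy labelled incidents (hypothetical values)
toy <- data.frame(
  cluster = c(1, 1, 1, 1, 2, 2),
  primary_type = c("NARCOTICS", "NARCOTICS", "NARCOTICS", "BURGLARY",
                   "BURGLARY", "ARSON")
)

# Counts of each crime type per cluster
counts <- table(toy$cluster, toy$primary_type)

# Row-wise proportions: each cluster's row sums to 1, so a large and a small
# cluster can be compared on their mix of crime types rather than on volume
props <- prop.table(counts, margin = 1)
round(props, 2)
```

Because each row is normalized separately, a cluster covering a busier part of the city does not dominate the comparison simply by containing more incidents.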