Author

Bena Smith, Brandon Kim, Amanda Belden

Introduction

Many consider Chicago a great place to visit, with numerous tourist attractions and a big-city culture, but it is also known for high crime rates. Our project uses Chicago crime data to determine whether certain neighborhoods of Chicago are associated with certain types of crime. We plan to use various modeling and visualization techniques to explore this topic and to give Chicago locals, tourists, and police a better understanding of crime in their city.

Our three tasks will be:

  1. Using multiple unsupervised clustering methods of varying precision to determine whether specific locations (based on latitude and longitude) and times are associated with the frequency of certain crimes and with the arrest rates for those crimes, focusing particularly on homicides, sexual assault, and officer interference.
  2. Developing a gradient boosted decision tree model to make predictions on which type of crime may be reported given a location, time and other factors.
  3. Developing a gradient boosted decision tree model to make predictions on whether an arrest was made given the same factors.

Our goal with the first task is to help tourists and anyone unfamiliar with Chicago or concerned about safety to be better informed about trends in different crimes. Our clustering analysis aims to find specific times and locations in which certain crimes are significantly more common than others. Determining the centroids of these clusters and their range in coordinates, times, and dates will help people be more wary of certain crimes depending on location. We also wanted to focus our analysis on homicides, sexual assault, and officer interference by running separate, more specific cluster analyses to identify trends. Insights on these three incident types can be vital for Chicago residents: homicides and sexual assaults are very disturbing, and officer interference can help gauge the effectiveness of the Chicago Police Department in specific areas.

Although beneficial, this type of analysis has many biases and can be misleading when read without context. For instance, it is important to realize that we are performing the clustering on the type of crime rather than its frequency. This means that we will be able to determine the most common crimes given a place and time, rather than the odds of a certain crime happening. However, the size of a cluster will give people an idea of how often crimes happen in a given area.

Our goal with the second task is to allow locals, citizens, and police to choose their own location and time, and we will predict which type of crime may occur there. This will allow locals to potentially choose places to live or routes to work, and allow tourists to make informed choices on where they should or should not visit. Additionally, it could allow police to send additional resources for that specific crime to the given location.

Our third task is to make a model to predict if an arrest was made based on information about the crime including type of crime, time of day, day of the week, and location. This will allow Chicago police to know how they might reorganize their efforts. This will also give future researchers insights into how arrests are made in Chicago.

Overall, the findings of this project are intended to benefit multiple groups of people. Tourists planning to visit Chicago could benefit from a better understanding of safe and potentially risky areas. Residents will have more information which will allow them to make better decisions on their living situation and daily activities. Finally, the Chicago police will have more insight into crime hotspots, enabling them to better distribute their resources as necessary.

Data Source and Context

The datasets used in this analysis are from the Chicago Data Portal, specifically the Crimes dataset (Chicago Data Portal) and the Homicide Map dataset (Chicago Data Portal). These are compiled by the Chicago Police Department using their system called CLEAR - Citizen Law Enforcement Analysis and Reporting System, which is available to the public.

While this data includes all reported crimes published by the Chicago Police Department, there may be inherent bias involved. It is possible that racial profiling by police departments inflates the crimes reported for people of color. Additionally, those who live in lower-income or gang-related areas may be more likely to be arrested on suspicion of a crime than people from more affluent neighborhoods. It is also incredibly important to note that this data reflects the reported crimes in Chicago, not the true number of crimes. Crimes that tend to be reported less often are unlikely to be adequately represented in this analysis. For example, domestic assault or sexual assault cases may be underrepresented, as these types of crimes tend to be reported less often. Thus, it is important that we account for these biases in our conclusions and limit our generalizations, to ensure that the mentioned groups, as well as others, are not negatively affected by our conclusions. We will also investigate how crimes change over time, but we realize that any change may reflect greater or lesser focus on policing these crimes over time and not necessarily that people are committing them at different rates. We recognize that this data concerns sensitive information, and we will do everything possible to ensure that our results are not used in a way that may harm certain groups of people.

In conclusion, this project seeks to find patterns of crime throughout Chicago and contribute to the greater good by giving the people of Chicago access to easily understandable information about the safety of their surroundings.

Previous Work

Schreck, McGloin, and Kirk (2009) studied 300 neighborhoods in Chicago over 2 years (1995-1996) and found that these neighborhoods “clearly distinguish themselves based upon the types of crimes that occur there.” Some neighborhoods are more likely to have higher rates of violent or nonviolent crimes, and these trends were often stable across years. We want to look at more recent years and a longer time span to see whether these findings remain true. We would also like to focus on more specific types of crime instead of the broad categories of violent and nonviolent.

Reasoning for the different distributions of different types of crime is proposed in “Street Gang Crime in Chicago” (Block and Block 1993) which finds that street gangs in Chicago have different areas that they are concentrated in and different crimes that they engage in. “Most of the criminal activity in smaller street gangs centered on representation turf defense. The most lethal street gang hot spot areas are along disputed boundaries between small street gangs…Street gangs specializing in instrumental violence were strongest in disrupted and declining neighborhoods. Street gangs specializing in expressive violence were strongest and most violent in relatively prosperous neighborhoods with expanding populations.”

This study was carried out in 1993, but “An analysis of police responses to gangs in Chicago” (Lemmer, Bensinger, and Lurigio 2008) finds that this hot-spot nature of street gangs has continued. This is relevant when identifying the areas in which violent and nonviolent crimes occur in Chicago, since there are likely underlying reasons why certain crimes occur in certain areas. When we perform cluster analysis on latitude and longitude, we want to compare the results to an actual map of Chicago and potentially to current information about street gangs.

Ba et al. find in “The Role of Officer Race and Gender in Police-Civilian Interactions in Chicago” that “Chicago is also heavily segregated, [and] has a history of racial tensions between residents and police” (Ba et al. 2021). This potentially affects arrest rates, as Smith and Petrocelli write in “Racial Profiling? A Multivariate Analysis of Police Traffic Stop Data” that in the US, “Historically, minorities, and particularly African Americans, have had physical force used against them or have been arrested or stopped by police at rates exceeding their percentage in the population” (Smith and Petrocelli 2001). We would like to study the distribution of arrest rates across different locations in Chicago and across different types of crime. When analyzing these rates and locations, we would like to reference maps that include demographic information about the people living in these areas.

Different types of crimes may also occur at different times of the day. The US Department of Justice (“Violent Crime Time of Day (per 1,000 in Age Group)” 2022) finds that, “In general, the number of violent crimes committed by adults increases hourly from 6 a.m. through the afternoon and evening hours, peaks at 9 p.m., and then drops to a low point at 5 a.m. In contrast, violent crimes committed by youth peak in the afternoon between 3 p.m. and 4 p.m., the hour at the end of the school day. More than one-third (37%) of all violent crime committed by youth occurs in the 5 hour period between noon and 5 p.m. In comparison, 30% of all violent crime committed by adults occurs between 6 p.m. and 11 p.m.” We want to analyze the time of day at which certain types of crimes occur in Chicago and compare the results to this data, which is an aggregate of 45 states and DC.

(Jenness and Grattet 1996) studied crime in Denver, Colorado and Los Angeles, California. They created a model to predict types of crimes occurring at a certain location, time, and day of the week. They used a Decision Tree classifier and a Naïve Bayesian classifier and achieved 51% prediction accuracy in Denver and 54% prediction accuracy in Los Angeles for predicting the type of crime.

We would like to perform inference about crime trends regarding location type, location, and time of day, and we would also like to create a predictive model that attempts to predict crime type using time of day, location, and location type as predictors. We must be very careful about the use of this model, because no one should assume which crime a person committed based only on these non-descriptive predictors. This model should be used only for personal interest and resource organization and should not be used for prosecution in any manner. We should also be very careful about the ethical implications of police focus. We address these concerns further in the ethical implications section.

Data Cleaning

Code
crimes <- crimes %>%
  select(-c(`District`, `Community Area`)) %>%
  bind_rows(homicides) %>%
  drop_na() %>% 
  filter(`X Coordinate` != 0) %>%
  mutate(`Primary Type` = case_when(
    `Primary Type` == "CRIM SEXUAL ASSAULT" ~ "CRIMINAL SEXUAL ASSAULT", 
    `Primary Type` == "NON-CRIMINAL (SUBJECT SPECIFIED)" ~ "NON-CRIMINAL",
    `Primary Type` == "OTHER NARCOTIC VIOLATION" ~ "NARCOTICS",
    `Primary Type` == "SEX OFFENSE" ~ "CRIMINAL SEXUAL ASSAULT",
    `Primary Type` == "NON - CRIMINAL" ~ "NON-CRIMINAL",
    .default = `Primary Type`), 
    Date = mdy_hms(Date), 
    Month = as.factor(month(Date)),
    Hour = as.factor(hour(Date)), 
    Weekday = weekdays(Date),
    Ward = as.factor(Ward),
    Arrest = as.factor(Arrest), 
    `Location Description` = case_when(
      grepl("airport", `Location Description`, ignore.case = T) ~ "Airport/Aircraft",
      grepl("aircraft", `Location Description`, ignore.case = T) ~ "Airport/Aircraft",
      grepl("Tavern", `Location Description`, ignore.case = T) ~ "Tavern", 
      grepl("CHA ", `Location Description`, ignore.case = T) ~ "CHA",
      grepl("College", `Location Description`, ignore.case = T) ~ "College", 
      grepl("CTA", `Location Description`, ignore.case = T) ~ "CTA",
      grepl("RESIDENCE", `Location Description`, ignore.case = T) ~ "Residence",
      grepl("SCHOOL", `Location Description`, ignore.case = T) ~ "School",
      `Location Description` == "VEHICLE - OTHER RIDE SERVICE" ~ "VEHICLE - OTHER RIDE SHARE SERVICE (LYFT, UBER, ETC.)", 
      `Location Description` == "VEHICLE - OTHER RIDE SHARE SERVICE (E.G., UBER, LYFT)" ~ "VEHICLE - OTHER RIDE SHARE SERVICE (LYFT, UBER, ETC.)",
      `Location Description` == "PARKING LOT/GARAGE(NON.RESID.)" ~ "PARKING LOT/GARAGE (NON RESIDENTIAL)", 
      `Location Description` == "POLICE FACILITY/VEH PARKING LOT" ~ "POLICE FACILITY/VEHICLE PARKING LOT",
      `Location Description` == "NURSING HOME/RETIREMENT HOME" ~ "NURSING/RETIREMENT HOME",
      `Location Description` == "VEHICLE-COMMERCIAL: TROLLEY BUS" ~ "VEHICLE-COMMERCIAL",
      .default = `Location Description`),
      `Location Description` = gsub(" / ", "/", `Location Description`), 
      `Location Description` = gsub(" - ", "-", `Location Description`),
      `Location Description` = toupper(`Location Description`))

We wanted both data sets to have the same features, so we removed District and Community Area from the crimes data set. We also decided to drop all rows with NA values, as fewer than 1% of observations had an NA value in a feature we cared about.

Additionally, the larger data set labels some crime types in more than one way, so we used case_when to merge the equivalent labels.
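The label merging can also be illustrated with a small base R sketch. This is not the code used in the project: the sample vector `raw_types` and the helper `recode_types` are hypothetical, and a named lookup vector stands in for case_when. The variant-to-canonical pairs are copied from the cleaning code above.

```r
# Hypothetical sample of raw labels; in the project these live in `Primary Type`.
raw_types <- c("CRIM SEXUAL ASSAULT", "NON - CRIMINAL", "THEFT",
               "OTHER NARCOTIC VIOLATION", "NARCOTICS")

# Lookup vector: name = variant label, value = canonical label.
canonical <- c(
  "CRIM SEXUAL ASSAULT"              = "CRIMINAL SEXUAL ASSAULT",
  "NON-CRIMINAL (SUBJECT SPECIFIED)" = "NON-CRIMINAL",
  "OTHER NARCOTIC VIOLATION"         = "NARCOTICS",
  "SEX OFFENSE"                      = "CRIMINAL SEXUAL ASSAULT",
  "NON - CRIMINAL"                   = "NON-CRIMINAL"
)

# Replace labels found in the lookup; leave every other label unchanged,
# which mirrors the `.default` branch of case_when.
recode_types <- function(x) {
  hit <- x %in% names(canonical)
  x[hit] <- unname(canonical[x[hit]])
  x
}

clean_types <- recode_types(raw_types)
table(clean_types)
```

Tabulating the labels before and after recoding is a quick way to confirm that no variant spelling was missed.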

Code
homicides <- homicides %>%
  janitor::clean_names() %>%
  drop_na() %>%
  filter(x_coordinate != 0) %>%
  mutate(date = as.POSIXct(date, format = "%m/%d/%Y %I:%M:%S %p"),
         year = year(date))

We also clean the homicides data set for some surface-level analysis and remove seemingly incorrect entries whose x and y coordinates are 0.

Exploratory Analysis

We need to perform some surface level analysis before fitting cluster models, to gauge the effectiveness of geospatial cluster analysis.

We can create a map view of our collected incident records and informally examine (eyeball) whether any hotspots are visibly recognizable from their coordinates alone, as we would hope these hotspots become clusters in our analysis. To reduce visual clutter, we can create multiple map visualizations from combinations of the features we are interested in (e.g., day, month, year, type of crime). This lets us examine how much each hotspot varies in size, range, and location, which helps us determine how many clusters to use in our cluster analysis. Specifically, we visualized this using an R Shiny app in which one can select specific features and view the hotspots of incidents.

For simplicity’s sake (and since shinyapps.io doesn’t allow data that exceeds 100 MB), we will be using data that only relates to homicides, sexual assault, and officer interference, as those are the crimes we particularly care about. In addition, we will also be including assault, robbery, motor vehicle theft, and prostitution, as those are some general crimes that people are especially wary of in Chicago.

Leaflet Shiny App:

Code
include_app("https://b7iuz3-brandon-kim.shinyapps.io/ChicagoCrimeSubset/", height = "925px")

Looking at multiple clusters across different inputs, we can see that the clusters do vary in size, shape and range.

Additionally, creating some bar plots to see the distribution of the number of incidents per different categories could definitely tell us about some of the trends we could potentially investigate.

Bar Plot - Incidents by Crime Type:

Code
data.frame(table(crimes$`Primary Type`), stringsAsFactors = FALSE) %>%
  mutate(Crime = factor(Var1, levels = unique(Var1)[order(Freq)])) %>%
  plot_ly(x = ~Freq, y = ~Crime, type = "bar") %>%
  layout(title = "Frequency of Incidents in Chicago by Type of Crime",
         xaxis = list(title = ""), 
         yaxis = list(title = ""),
         annotations = list(
                        list(
                          x = 0.5,
                          y = -0.1,  
                          xref = "paper",
                          yref = "paper",
                          text = "Recorded by the City of Chicago between January 1st, 2001 and November 7th, 2023", 
                          showarrow = FALSE,
                          yanchor = "bottom"  
                        )
                      )
  ) 

Bar Plot - Incidents by Year:

Code
crimes %>% 
  mutate(Year = as.factor(year(Date))) %>% 
  group_by(Year) %>%
  summarize(Freq = n()) %>% 
  plot_ly(x = ~Freq, y = ~Year, type = "bar") %>%
  layout(title = "Frequency of Incidents in Chicago by Year", 
         xaxis = list(title = ""),
         yaxis = list(title = ""),
         annotations = list(
                        list(
                          x = 0.5,
                          y = -0.1,  
                          xref = "paper",
                          yref = "paper",
                          text = "Recorded instances are those identified by the City of Chicago as the types listed in the plot above", 
                          showarrow = FALSE,
                          yanchor = "bottom"  
                        )
                      )
         )

Task 1: General Cluster Analyses

Geospatial Cluster Analysis - General Crime Pattern Discernment

We want to see which types of crime are more common in different areas of Chicago. To do that, we performed K-means clustering on crime observations based on their X and Y coordinates. We chose to look at Arson, Liquor Law Violation, Gambling, Kidnapping, Concealed Carry License Violation, Criminal Trespass, Narcotics, Burglary, Interference with a Public Officer, Criminal Sexual Assault, and Homicide. We chose a subset of crimes because there were many listed and our analysis may be clouded by the vast array of crime types. These crimes were selected as they had some of the highest proportions in the dataset, and because these are crimes that may be important in helping determine why someone would choose to live or not live in certain areas.

Code
subsetcrimes <- crimes %>%
  filter(`Primary Type` %in% c("ARSON", "LIQUOR LAW VIOLATION", "GAMBLING", "KIDNAPPING", "CONCEALED CARRY LICENSE VIOLATION", "CRIMINAL TRESPASS", "NARCOTICS", "BURGLARY", "INTERFERENCE WITH PUBLIC OFFICER", "CRIMINAL SEXUAL ASSAULT", "HOMICIDE"))

The subset has 1,328,825 observations, which would clutter a single plot too much for any useful pattern to be discerned, so we use an animation to split it by year. When we performed K-means clustering, we used the 2002 centroids from this larger multi-crime dataset as the initial centroids for the other years and plots. This makes the clusters more comparable across years and plots throughout this project. The plot with many crimes is still hard to analyze because there are so many observations, so we also created an interactive table and graphic of these clusters and the proportions of crime types within each cluster each year. We still show the plot below so the reader can visualize these clusters. We preferred analyzing the proportions of crime types within clusters over raw counts because these clusters are arbitrary and not based on population size or area.
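The idea of reusing one set of starting centroids for every year can be sketched on synthetic data. Everything in this sketch is an assumption made for illustration: the data frame `toy`, its made-up hotspots, and the helpers `make_year` and `cluster_year` are hypothetical, and the coordinates are not real Chicago coordinates. It also collects the per-year results with lapply rather than global assignment, which is one possible alternative and not necessarily the best one.

```r
set.seed(1)  # the first k-means fit uses random starts, so fix the seed

# Synthetic stand-in for the crime subset: points scattered around
# three made-up hotspots (NOT real Chicago coordinates).
make_year <- function(year, n = 150) {
  hotspot <- sample(1:3, n, replace = TRUE)
  cx <- c(0, 10, 20)[hotspot]
  cy <- c(0, 10, 0)[hotspot]
  data.frame(Year = year, x = cx + rnorm(n), y = cy + rnorm(n))
}
toy <- rbind(make_year(2002), make_year(2003))

# Fit the first year with random starts and keep its centroids.
first_fit <- kmeans(toy[toy$Year == 2002, c("x", "y")], centers = 3, nstart = 10)
first_centroids <- first_fit$centers

# Every year is started from the same centroid matrix, so a given cluster
# label refers to roughly the same area in each year.
cluster_year <- function(year, start_centroids) {
  rows <- toy[toy$Year == year, ]
  rows$cluster <- kmeans(rows[, c("x", "y")], centers = start_centroids)$cluster
  rows
}

# lapply + do.call(rbind, ...) gathers the yearly results without `<<-`.
toy_clustered <- do.call(rbind, lapply(c(2002, 2003), cluster_year,
                                       start_centroids = first_centroids))
table(toy_clustered$Year, toy_clustered$cluster)
```

Passing a matrix to the `centers` argument of kmeans fixes the starting positions, but the centroids still move while the algorithm iterates, so the labels are comparable across years only to the extent that the hotspots stay in similar places.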

Code
years <- c(2002:2023)

new_colnames <- append(colnames(subsetcrimes), "cluster")
new_crimes_w_clusters <- data.frame(matrix(ncol=24, nrow = 0))  # Thomas on stackoverflow https://stackoverflow.com/questions/25051528/error-in-adding-rows-to-an-empty-data-frame-in-r
colnames(new_crimes_w_clusters) <- new_colnames

get_clusters <- function(selected_year, prev_centroids=NULL){
  
  data_by_year <- subset(subsetcrimes, Year == selected_year)
  
  data_by_year_red <- select(data_by_year, `X Coordinate`, `Y Coordinate`)
  
  if (selected_year == 2002) {
    km <- kmeans(data_by_year_red, centers=5)
  } 
  else {
    km <- kmeans(data_by_year_red, centers=prev_centroids) ##line from chatgpt

  }
  
  data_by_year$cluster <-  km$cluster

  return(list(yr_df_w_centers = data_by_year, centroids=km$centers))

}


get_clusters_helper <- function(yr){
  
  if(yr == 2002){
    prev_centroids = NULL
    retval <- get_clusters(yr, prev_centroids)
    first_centroids <<- retval$centroids
  }
  
  else{
    retval <- get_clusters(yr, first_centroids)
  }

  
  new_crimes_w_clusters <<- rbind(new_crimes_w_clusters, retval$yr_df_w_centers)
## Global assignment (<<-) is used because map() does not carry state between calls; a for loop that collects each retval would avoid the globals.
  
}

prev_centroids <- NULL
map(years, get_clusters_helper)
Code
animatedplot <- ggplot(new_crimes_w_clusters, aes(x = `X Coordinate`, y = `Y Coordinate`, group = interaction(Year, `X Coordinate`), color=as.factor(cluster))) +
  geom_point(alpha=0.15) +
  transition_time(Year) +
    scale_color_manual(name = "Cluster", values = c("lightblue", "orange", "green", "pink","plum"))+
  labs(color = "Cluster", 
       subtitle = "Year: {str_sub(frame_time, 1, 4)}", 
       title = "Many Crimes on X and Y Coordinates by Year with K Means Clusters", x = "X Coordinate", y = "Y Coordinate") 


#Jon Spring on Stack Overflow https://stackoverflow.com/questions/56411604/how-to-make-dots-in-gganimate-appear-and-not-transition
#finnstats on R bloggers https://www.r-bloggers.com/2021/05/animated-graph-gif-with-gganimate-ggplot/ 
Code
animate(animatedplot,  height = 500, width = 800,
        duration = 20, end_pause = 10, res = 100)

Interestingly, it looks like the number of crimes overall is decreasing over time or these crimes are being reported less.

The following table shows the proportion of each crime in each cluster for each year. Each row adds up to 1. This will allow us to see which types of crimes are more common in which areas and how these trends change over time.
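As a small illustration of why each row sums to 1, the sketch below computes within-cluster proportions for a hypothetical data frame (`toy_assignments`, invented for this example) using base R's prop.table instead of the dplyr pipeline used in the project.

```r
# Hypothetical cluster assignments for a single year; the real rows come
# from new_crimes_w_clusters.
toy_assignments <- data.frame(
  cluster = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  type    = c("BURGLARY", "BURGLARY", "NARCOTICS", "ARSON",
              "NARCOTICS", "NARCOTICS", "NARCOTICS",
              "BURGLARY", "BURGLARY", "BURGLARY")
)

# Raw counts per cluster and crime type.
counts <- table(cluster = toy_assignments$cluster, type = toy_assignments$type)

# margin = 1 divides each row (cluster) by its own total,
# so every row of `props` sums to 1.
props <- prop.table(counts, margin = 1)
round(props, 2)
```

In this toy data Cluster 2 has more burglaries than Cluster 1 in raw counts, yet burglary makes up the same share of both clusters, which is why proportions are easier to compare across clusters of different sizes.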

Code
## chat gpt and tb on Stack Overflow https://stackoverflow.com/questions/22767893/count-number-of-rows-by-group-using-dplyr

df_proportions <- new_crimes_w_clusters %>% 
  group_by(Year, cluster, `Primary Type`) %>%
  summarise(Count = n()) %>% 
  group_by(Year, cluster) %>%
  mutate(TotalCount = sum(Count)) %>%
  mutate(Proportion = Count / TotalCount) %>%
  select(Year, cluster, `Primary Type`, Proportion)


df_proportions$cluster <- as.character(df_proportions$cluster)

## one row for year, cluster combination
df_proportions %>%
  pivot_wider(names_from = `Primary Type`, values_from = Proportion, values_fill = 0) %>%
  datatable(options = list(pageLength = 5))

The following plot visualizes the proportions of crime types in each cluster.

Code
include_app("https://3efmzi-bena-smith.shinyapps.io/ShinyAppCrimeProportions/", height = "725px")

Here are some of the main observations from our table and visual:

  • The proportion of Interference with a Public Officer crimes increases over time in all clusters

  • The proportion of Narcotics crimes decreases over time in all clusters

  • The proportion of Criminal Sexual Assault crimes increases over time in all clusters

  • The proportion of Homicide crimes increases over time in all clusters but increases the least in Cluster 2

  • Cluster 4 has a higher proportion of Interference with a Public Officer crimes than other clusters in recent years

  • In most years, Cluster 1 looks to have a higher proportion of Burglary crimes than other clusters

  • The proportion of Gambling crimes has decreased over time overall; it used to be highest in Clusters 2 and 3

Based on this analysis, we cannot tell citizens which cluster or location is the safest; however, they could use this information to determine for themselves where they would feel most comfortable living. The overall decrease in the proportion of Narcotics crimes over time suggests that drug abuse prevention techniques may be working in Chicago, but the rise in the proportions of homicides and sexual assaults suggests that not enough is being done to prevent these crimes. The Chicago Police Department should consider diverting additional funding to investigating and preventing these types of crimes. Again, it is important to note that the rise in the proportion of sexual assault crimes may reflect a true increase, or there could be a rise in people willing to report sexual assault.

We would also like to focus on three specific types of crime using this geospatial cluster analysis, specifically Homicides, Criminal Sexual Assaults, and Interference with a Public Officer. We believe that these crimes are important to highlight as they can be threats to personal safety. The first two, homicides and sexual assault, are incredibly dangerous crimes that would have a large influence on whether or not someone would feel comfortable living in a certain area.

Interference with a Public Officer was also selected due to recent events regarding policing. The use of racial profiling and the unequal treatment of black people by police has been a huge problem in recent years and is drawing more and more attention. Additionally, the immunity of police officers for committing crimes is continuously questioned by the media and the public. We selected this crime, Interference with a Public Officer, to see whether there is an inclination for police officers to protect their own safety and ability to carry out their jobs more than they protect the lives of citizens and victims of crimes. Although this crime type does not exactly measure this idea, we thought it may still offer valuable insights into how interference with police officers is treated differently from or similarly to more dangerous crimes, such as homicides and sexual assault.

Geospatial Cluster Analysis - Homicides Pattern Discernment

We also wanted to look more in depth at homicide locations in Chicago, especially because the proportion of homicides increases over time in each cluster. We made an animated plot of the locations of homicides in Chicago based on x and y coordinates. We clustered this data using K-means with 5 clusters. We also looked at the arrest rates in each cluster. Although a homicide does not necessarily mean that someone is liable for a crime and can be arrested, we used arrests as a metric representing the effectiveness of police in the area. The arrest rate does not need to be close to 1 for us to deem the police effective, because again a homicide does not necessarily mean that someone is liable for a crime. However, if the arrest rate is lower in one cluster than in the others, this may be evidence that police need to focus more on that area.

Code
## do k means on homicide data by year

set.seed(1)
years <- c(2001:2023)

new_colnames <- append(colnames(homicides), "cluster")
new_homicides_w_clusters <- data.frame(matrix(ncol=21, nrow = 0)) # Thomas on stackoverflow https://stackoverflow.com/questions/25051528/error-in-adding-rows-to-an-empty-data-frame-in-r
colnames(new_homicides_w_clusters) <- new_colnames

get_clusters <- function(selected_year, prev_centroids=NULL){
  data_by_year <- subset(homicides, year == selected_year)
  
  data_by_year_red <- select(data_by_year, x_coordinate, y_coordinate)
  km <- kmeans(data_by_year_red, centers=first_centroids) ## starts from the 2002 centroids of the multi-crime analysis, so prev_centroids is unused here; line from chatgpt
  
  data_by_year$cluster <-  km$cluster

  return(list(yr_df_w_centers = data_by_year, centroids=km$centers))

}


get_clusters_helper <- function(yr){
  if(yr == 2001){
    retval <- get_clusters(yr,NULL)
    
  }
  else{

    retval <- get_clusters(yr, prev_centroids)
  }
  
  new_homicides_w_clusters <<- rbind(new_homicides_w_clusters, retval$yr_df_w_centers)
  prev_centroids <<- retval$centroids ## Global assignment (<<-) is used because map() does not carry state between calls; a for loop that collects each retval would avoid the globals.
  
}

prev_centroids <- NULL
map(years, get_clusters_helper)
Code
## get arrest rates for every year/ cluster combo 

arrest_rates <- data.frame(matrix(ncol=3, nrow = 0)) # Thomas on stackoverflow https://stackoverflow.com/questions/25051528/error-in-adding-rows-to-an-empty-data-frame-in-r
colnames(arrest_rates) <- c("year", "cluster", "arrest_rate")  # one name per column


years <- c(2001:2023)
clusters <- c(1:5)

  
get_arrest_rate <- function(combo){
  sel_year <- combo[1]
  sel_cluster <- combo[2]
  
  data_by_year <- subset(new_homicides_w_clusters, year == sel_year)
  data_by_year_and_cluster <- subset(data_by_year, cluster == sel_cluster)
  
  num_arrests <- sum(data_by_year_and_cluster$arrest == "TRUE")
  total_rows <- nrow(data_by_year_and_cluster)
  
  arrest_rate <- num_arrests/total_rows
  
  arrest_rates <<- rbind(arrest_rates,  list(year = sel_year, cluster=sel_cluster, arrest_rate=arrest_rate))

  
}

combos <- expand.grid(years, clusters) # Onyejiaku Theophilus Chidalu from educative https://www.educative.io/answers/what-is-the-expandgrid-function-in-r

apply(combos, 1,get_arrest_rate)
Code
percentile_25 <- quantile(arrest_rates$arrest_rate, 0.25)

The 25th percentile of arrest rates for homicides across clusters and years is 0.4033951. We colored the arrest rates on the plot according to whether they are above this value, rounded to 0.4.
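The arrest-rate and percentile calculation can be sketched compactly on a toy data frame. The data frame `toy_homicides` and the column `below_p25` are hypothetical, and aggregate is used here in place of the apply-based helper; this is an illustrative alternative under those assumptions rather than the code that produced the figures in this report.

```r
# Hypothetical rows; the real data is new_homicides_w_clusters.
toy_homicides <- data.frame(
  year    = c(2001, 2001, 2001, 2001, 2002, 2002, 2002, 2002),
  cluster = c(1, 1, 2, 2, 1, 1, 2, 2),
  arrest  = c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE)
)

# The mean of a logical vector is the share of TRUE values,
# i.e. the arrest rate for each year/cluster combination.
toy_rates <- aggregate(arrest ~ year + cluster, data = toy_homicides, FUN = mean)
names(toy_rates)[3] <- "arrest_rate"

# Threshold for coloring: the 25th percentile of all year/cluster rates.
toy_p25 <- quantile(toy_rates$arrest_rate, 0.25)
toy_rates$below_p25 <- toy_rates$arrest_rate <= toy_p25
toy_rates
```

Note that aggregate only returns year/cluster combinations that actually contain rows, so an empty cluster in some year would be omitted rather than produce a 0/0 rate.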

Code
## add column for mean x/ y coordinate as the coordinate to display the arrest rate 

merged_df <- merge(new_homicides_w_clusters, arrest_rates, by = c("year", "cluster"))

merged_df <- merged_df %>%
  group_by(year, cluster) %>%
  mutate(mean_x_coordinate = mean(x_coordinate))%>%
  mutate(mean_y_coordinate = mean(y_coordinate))%>% 
  ungroup()
Code
graph1 <- merged_df %>% ggplot()+
  xlim(1110000, 1225000)+
  geom_point(alpha=0.15, aes(x = x_coordinate, y = y_coordinate, group = interaction(year, x_coordinate), color=as.factor(cluster)))+
  scale_color_manual(name = "Cluster", values = c("lightblue", "orange", "green", "pink","plum")) +
  new_scale_color()+
    geom_text(aes(x=mean_x_coordinate,y=mean_y_coordinate, label = paste("Arrest Rate: ", round(arrest_rate, 2) ), color = ifelse(arrest_rate > 0.4, "Above 0.4", "Below or equal to 0.4"), group = interaction(year, mean_x_coordinate))) +
    scale_color_manual(name = "Arrest Rate", values = c("purple", "hotpink"))

#Jon Spring on Stack Overflow https://stackoverflow.com/questions/56411604/how-to-make-dots-in-gganimate-appear-and-not-transition
Code
graph1.animation <- graph1 +
  theme_minimal()+
  transition_time(year) +
  labs(subtitle = "Year: {str_sub(frame_time, 1, 4)}", color ="Arrest Rate", title="Homicides in Chicago by Year with KMeans Clusters",  x = "X Coordinate", y = "Y Coordinate")

#finnstats on R bloggers https://www.r-bloggers.com/2021/05/animated-graph-gif-with-gganimate-ggplot/ 
#dc37 on Stack Overflow https://stackoverflow.com/questions/11838278/plot-with-conditional-colors-based-on-values-in-r 
Code
animate(graph1.animation, height = 500, width = 800, fps = 30, duration = 30,
        end_pause = 60, res = 100)

Code
anim_save("homicide_cluster_analysis.gif")

#finnstats on R bloggers https://www.r-bloggers.com/2021/05/animated-graph-gif-with-gganimate-ggplot/ 

Arrest rates colored in purple are above 0.4 for that cluster. Arrest rates below or equal to 0.4 are colored in pink for that cluster. As time goes on, we can see in the visualization that more clusters have lower arrest rates (shown in pink). This may be because crime is increasing, because criminals are becoming better at hiding evidence, or because of changes in police organization and investigative techniques. It is also interesting that Cluster 2 has a higher arrest rate than the other clusters during most timeframes and especially in later years. We will also investigate other crimes individually to see if the spatial distribution of crimes differs by crime type.

In the future, we would like to look at numbers of homicides and see if they increase over time. It would be interesting to do more research about the area of Cluster 2. We could look at demographic information including earnings data in this area to see if there are factors that are associated with their higher arrest rates.

Geospatial Cluster Analysis - Criminal Sexual Assault Pattern Discernment

We also want to investigate Criminal Sexual Assaults, both because most people would find this crime highly disturbing to live near and because rates of Criminal Sexual Assaults are increasing over time in each cluster.

Code
sexassault_subsetcrimes <- crimes %>%
  filter(`Primary Type` %in% c("CRIMINAL SEXUAL ASSAULT"))
Code
years <- c(2002:2023)

new_colnames <- append(colnames(sexassault_subsetcrimes), "cluster")
new_sexassault_w_clusters <- data.frame(matrix(ncol=24, nrow = 0))  # Thomas on stackoverflow https://stackoverflow.com/questions/25051528/error-in-adding-rows-to-an-empty-data-frame-in-r
colnames(new_sexassault_w_clusters) <- new_colnames

get_clusters <- function(selected_year, prev_centroids=NULL){
  
  data_by_year <- subset(sexassault_subsetcrimes, Year == selected_year)
  
  data_by_year_red <- select(data_by_year, `X Coordinate`, `Y Coordinate`)
  km <- kmeans(data_by_year_red, centers=first_centroids) ## line from chatgpt; every year starts from first_centroids, so prev_centroids is not used here

  
  data_by_year$cluster <-  km$cluster

  return(list(yr_df_w_centers = data_by_year, centroids=km$centers))

}


get_clusters_helper <- function(yr){

  retval <- get_clusters(yr, prev_centroids)
  
  new_sexassault_w_clusters <<- rbind(new_sexassault_w_clusters, retval$yr_df_w_centers)
  prev_centroids <<- retval$centroids  ## global variables bc of map function - can change to a for loop where i return retval and add in the loop lmk if u think of a better way to do this
  
}

prev_centroids <- NULL
map(years, get_clusters_helper)
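
The helper above relies on global assignment (`<<-`) inside `map()`. One possible alternative, sketched below, runs `kmeans()` once per year with `lapply()` and binds the results in a single step. It assumes, as in the function above, that the same starting centroids are reused for every year. The data frame `toy`, the matrix `start_centers`, and the function `cluster_one_year` are made up for illustration, and the toy data has two clusters rather than five.

```r
set.seed(42)

# Hypothetical stand-in for one crime subset: two spatial groups in each year.
toy <- data.frame(
  Year = rep(2002:2004, each = 40),
  x = c(replicate(3, c(rnorm(20, 0), rnorm(20, 10)))),
  y = c(replicate(3, c(rnorm(20, 0), rnorm(20, 10))))
)

# Fixed starting centroids, reused for every year so cluster labels stay comparable.
start_centers <- matrix(c(0, 0, 10, 10), ncol = 2, byrow = TRUE,
                        dimnames = list(NULL, c("x", "y")))

# Cluster a single year and return that year's rows with a cluster column added.
cluster_one_year <- function(yr, data, centers) {
  yr_data <- data[data$Year == yr, ]
  km <- kmeans(yr_data[, c("x", "y")], centers = centers)
  yr_data$cluster <- km$cluster
  yr_data
}

# One data frame per year, stacked at the end; no global variables are modified.
clustered <- do.call(rbind, lapply(unique(toy$Year), cluster_one_year,
                                   data = toy, centers = start_centers))
```

Because each call returns its own data frame, the function has no side effects and can be tested on a single year in isolation.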
Code
## get arrest rates for every year/ cluster combo 

arrest_rates <- data.frame(matrix(ncol=3, nrow = 0)) # Thomas on stackoverflow https://stackoverflow.com/questions/25051528/error-in-adding-rows-to-an-empty-data-frame-in-r
colnames(arrest_rates) <- c("Year", "cluster", "arrest_rate")  # one name per column, matching the rows bound below


years <- c(2002:2023)
clusters <- c(1:5)

  
get_arrest_rate <- function(combo){
  sel_year <- combo[1]
  sel_cluster <- combo[2]
  
  data_by_year <- subset(new_sexassault_w_clusters, Year == sel_year)
  data_by_year_and_cluster <- subset(data_by_year, cluster == sel_cluster)
  
  num_arrests <- sum(data_by_year_and_cluster$Arrest == "TRUE")
  total_rows <- nrow(data_by_year_and_cluster)
  
  arrest_rate <- num_arrests/total_rows
  
  arrest_rates <<- rbind(arrest_rates,  list(Year = sel_year, cluster=sel_cluster, arrest_rate=arrest_rate))

  
}

combos <- expand.grid(years, clusters) # Onyejiaku Theophilus Chidalu from educative https://www.educative.io/answers/what-is-the-expandgrid-function-in-r

apply(combos, 1,get_arrest_rate)
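
The same year-by-cluster arrest rates can also be computed without a global accumulator. The sketch below uses base R's `aggregate()` on a small made-up data frame (`incidents` and `arrest_rates_toy` are illustrative names, not objects from our analysis). It relies on the fact that the mean of a logical vector is the share of `TRUE` values.

```r
# Hypothetical incidents with the three columns used in the arrest-rate step.
incidents <- data.frame(
  Year    = c(2002, 2002, 2002, 2002, 2003, 2003),
  cluster = c(1, 1, 2, 2, 1, 1),
  Arrest  = c(TRUE, FALSE, FALSE, FALSE, TRUE, TRUE)
)

# mean() of a logical vector is the proportion of TRUE values, i.e. the arrest rate.
arrest_rates_toy <- aggregate(Arrest ~ Year + cluster, data = incidents, FUN = mean)
names(arrest_rates_toy)[3] <- "arrest_rate"

arrest_rates_toy
```

Only year/cluster combinations that actually occur in the data appear in the result, so a combination with no incidents does not produce a 0/0 arrest rate.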
Code
## add column for mean x/ y coordinate as the coordinate to display the arrest rate 

merged_df <- merge(new_sexassault_w_clusters, arrest_rates, by = c("Year", "cluster"))

merged_df <- merged_df %>%
  group_by(Year, cluster) %>%
  mutate(mean_x_coordinate = mean(`X Coordinate`))%>%
  mutate(mean_y_coordinate = mean(`Y Coordinate`))%>% 
  ungroup()
Code
percentile_25 <- quantile(arrest_rates$arrest_rate, 0.25)

The 25th percentile of arrest rates for Criminal Sexual Assaults across clusters and years is 0.1207512. We colored the arrest rates on the plot according to whether they are above or below this rate, rounded to 0.12.
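
The cutoff can also be carried directly from `quantile()` into the color labels instead of being typed by hand. A minimal sketch with made-up arrest rates (`rates`, `cutoff`, and `rate_group` are illustrative names, not objects from our analysis):

```r
# Hypothetical arrest rates for a handful of year/cluster combinations.
rates <- c(0.05, 0.10, 0.14, 0.15, 0.20, 0.25, 0.30, 0.40)

# 25th percentile of the rates; unname() drops the "25%" label quantile() attaches.
cutoff <- unname(quantile(rates, 0.25))
cutoff_label <- round(cutoff, 2)

# Label each rate relative to the computed cutoff, for use as a color aesthetic.
rate_group <- ifelse(rates > cutoff,
                     paste("Above", cutoff_label),
                     paste("Below or equal to", cutoff_label))

table(rate_group)
```

With this approach the labels and the comparison always use the same computed value, even when the underlying data changes.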

Code
graph1 <- merged_df %>% ggplot()+
  xlim(1110000, 1225000)+
  geom_point(alpha=0.15, aes(x = `X Coordinate`, y = `Y Coordinate`, group = interaction(Year, `X Coordinate`), color=as.factor(cluster)))+
  scale_color_manual(name = "Cluster", values = c("lightblue", "orange", "green", "pink","plum")) +
  new_scale_color()+
    geom_text(aes(x=mean_x_coordinate,y=mean_y_coordinate, label = paste("Arrest Rate: ", round(arrest_rate, 2) ), color = ifelse(arrest_rate > 0.12, "Above 0.12", "Below or equal to 0.12"), group = interaction(Year, mean_x_coordinate))) +
    scale_color_manual(name = "Arrest Rate", values = c("purple", "hotpink"))

#Jon Spring on Stack Overflow https://stackoverflow.com/questions/56411604/how-to-make-dots-in-gganimate-appear-and-not-transition
Code
graph1.animation <- graph1 +
  theme_minimal()+
  transition_time(Year) +
  labs(subtitle = "Year: {str_sub(frame_time, 1, 4)}", color ="Arrest Rate", title="Criminal Sexual Assaults in Chicago by Year with KMeans Clusters",  x = "X Coordinate", y = "Y Coordinate")

#finnstats on R bloggers https://www.r-bloggers.com/2021/05/animated-graph-gif-with-gganimate-ggplot/ 
#dc37 on Stack Overflow https://stackoverflow.com/questions/11838278/plot-with-conditional-colors-based-on-values-in-r 
Code
animate(graph1.animation, height = 500, width = 800, fps = 30, duration = 30,
        end_pause = 60, res = 100)

Code
anim_save("sexassault_cluster_analysis.gif")

#finnstats on R bloggers https://www.r-bloggers.com/2021/05/animated-graph-gif-with-gganimate-ggplot/ 

Arrest rates above 0.12 for a cluster are colored purple; arrest rates at or below 0.12 are colored pink. One observation is that as time goes on, more clusters have lower arrest rates (shown in pink). In earlier years, the northernmost cluster has the highest arrest rates, but over time all clusters converge to similarly low arrest rates, between 0.05 and 0.1 in 2022 and 2023.

The decrease in arrests could be due to an increase in crimes without an answering increase in policing, criminals becoming better at hiding evidence, or changes in police organization and investigative techniques. This suggests that the Chicago Police Department should allocate more resources to investigating Criminal Sexual Assaults, both to get these offenders off the street and to prevent them from committing more sexual assaults.

Geospatial Cluster Analysis - Public Officer Interference Pattern Discernment

We also want to investigate Public Officer Interference, both because it can reflect how much citizens respect the police and how effective they perceive them to be, and because rates of Public Officer Interference increase over time in each cluster.

Code
interferencewPO_subsetcrimes <- crimes %>%
  filter(`Primary Type` %in% c("INTERFERENCE WITH PUBLIC OFFICER"))
Code
years <- c(2002:2023)

new_colnames <- append(colnames(interferencewPO_subsetcrimes), "cluster")
new_interferencewPO_w_clusters <- data.frame(matrix(ncol=24, nrow = 0))  # Thomas on stackoverflow https://stackoverflow.com/questions/25051528/error-in-adding-rows-to-an-empty-data-frame-in-r
colnames(new_interferencewPO_w_clusters) <- new_colnames

get_clusters <- function(selected_year, prev_centroids=NULL){
  
  data_by_year <- subset(interferencewPO_subsetcrimes, Year == selected_year)
  
  data_by_year_red <- select(data_by_year, `X Coordinate`, `Y Coordinate`)
  km <- kmeans(data_by_year_red, centers=first_centroids) ## line from chatgpt; every year starts from first_centroids, so prev_centroids is not used here

  
  data_by_year$cluster <-  km$cluster

  return(list(yr_df_w_centers = data_by_year, centroids=km$centers))

}


get_clusters_helper <- function(yr){

  retval <- get_clusters(yr, prev_centroids)
  
  new_interferencewPO_w_clusters <<- rbind(new_interferencewPO_w_clusters, retval$yr_df_w_centers)
  ## global variables bc of map function - can change to a for loop where i return retval and add in the loop lmk if u think of a better way to do this
  
}

prev_centroids <- NULL
map(years, get_clusters_helper)
Code
## get arrest rates for every year/ cluster combo 

arrest_rates <- data.frame(matrix(ncol=3, nrow = 0)) # Thomas on stackoverflow https://stackoverflow.com/questions/25051528/error-in-adding-rows-to-an-empty-data-frame-in-r
colnames(arrest_rates) <- c("Year", "cluster", "arrest_rate")  # one name per column, matching the rows bound below


years <- c(2002:2023)
clusters <- c(1:5)

  
get_arrest_rate <- function(combo){
  sel_year <- combo[1]
  sel_cluster <- combo[2]
  
  data_by_year <- subset(new_interferencewPO_w_clusters, Year == sel_year)
  data_by_year_and_cluster <- subset(data_by_year, cluster == sel_cluster)
  
  num_arrests <- sum(data_by_year_and_cluster$Arrest == "TRUE")
  total_rows <- nrow(data_by_year_and_cluster)
  
  arrest_rate <- num_arrests/total_rows
  
  arrest_rates <<- rbind(arrest_rates,  list(Year = sel_year, cluster=sel_cluster, arrest_rate=arrest_rate))

  
}

combos <- expand.grid(years, clusters) # Onyejiaku Theophilus Chidalu from educative https://www.educative.io/answers/what-is-the-expandgrid-function-in-r

apply(combos, 1,get_arrest_rate)
Code
## add column for mean x/ y coordinate as the coordinate to display the arrest rate 

merged_df <- merge(new_interferencewPO_w_clusters, arrest_rates, by = c("Year", "cluster"))


merged_df <- merged_df %>%
  group_by(Year, cluster) %>%
  mutate(mean_x_coordinate = mean(`X Coordinate`))%>%
  mutate(mean_y_coordinate = mean(`Y Coordinate`))%>% 
  ungroup()
Code
percentile_25 <- quantile(arrest_rates$arrest_rate, 0.25)

The 25th percentile of arrest rates for Public Officer Interference across clusters and years is 0.8682178. We colored the arrest rates on the plot according to whether they are above or below a cutoff of 0.86.

Code
graph1 <- merged_df %>% ggplot()+
  xlim(1110000, 1225000)+
  geom_point(alpha=0.15, aes(x = `X Coordinate`, y = `Y Coordinate`, group = interaction(Year, `X Coordinate`), color=as.factor(cluster)))+
  scale_color_manual(name = "Cluster", values = c("lightblue", "orange", "green", "pink","plum")) +
  new_scale_color()+
    geom_text(aes(x=mean_x_coordinate,y=mean_y_coordinate, label = paste("Arrest Rate: ", round(arrest_rate, 2) ), color = ifelse(arrest_rate > 0.86, "Above 0.86", "Below or equal to 0.86"), group = interaction(Year, mean_x_coordinate))) +
    scale_color_manual(name = "Arrest Rate", values = c("purple", "hotpink"))

#Jon Spring on Stack Overflow https://stackoverflow.com/questions/56411604/how-to-make-dots-in-gganimate-appear-and-not-transition
Code
graph1.animation <- graph1 +
  theme_minimal()+
  transition_time(Year) +
  labs(subtitle = "Year: {str_sub(frame_time, 1, 4)}", color ="Arrest Rate", title="Interference with Public Officer in Chicago by Year with KMeans Clusters",  x = "X Coordinate", y = "Y Coordinate")

#finnstats on R bloggers https://www.r-bloggers.com/2021/05/animated-graph-gif-with-gganimate-ggplot/ 
#dc37 on Stack Overflow https://stackoverflow.com/questions/11838278/plot-with-conditional-colors-based-on-values-in-r 
Code
animate(graph1.animation, height = 500, width = 800, fps = 30, duration = 30,
        end_pause = 60, res = 100)

Code
anim_save("interferencewPO_cluster_analysis.gif")

#finnstats on R bloggers https://www.r-bloggers.com/2021/05/animated-graph-gif-with-gganimate-ggplot/ 

Arrest rates above 0.86 for a cluster are colored purple; arrest rates at or below 0.86 are colored pink. As time goes on, the visualization shows more clusters with higher arrest rates (shown in purple). It also looks like the northernmost cluster has the lowest arrest rates for the most part.

This increase may be due to fewer crimes occurring, criminals becoming more obvious or intense, or a change in police organization and investigative techniques. We believe that the northernmost cluster having the lowest arrest rates may have to do with demographics in this area. Police may not arrest certain groups of people for police interference, potentially more affluent groups of people. In future analysis it would be interesting to look at earnings and demographic distributions in Chicago to investigate this.

It is also important to compare this cluster analysis to the analyses of Homicides and Criminal Sexual Assault. As discussed previously, the arrest rates for both Homicide and Criminal Sexual Assault are decreasing, while the arrest rates for Interference with a Public Officer are increasing. This suggests that the Chicago police may be more concerned with ensuring their own safety and freedom of movement than with ensuring the safety of people who are or could be victims of Homicide and Sexual Assault. We would like the police department to reevaluate its allocation of resources to investigating and arresting people for these crimes, as the least harmful of these crimes seems to be treated the most seriously by the police, based on this analysis.

Multi-Factor Cluster Analysis - General Crime Pattern Discernment

We would like to see if we can find clusters that are more pure in terms of type of crime. This means that we would like to find clusters that distinguish between types of crimes very distinctly, such that ideally each type of crime would have its own cluster.

To unpack and analyze how crime frequencies vary with factors beyond coordinates, we need a clustering algorithm other than K-means that can accommodate categorical data (data that falls into categories instead of being numeric). The algorithm we decided upon is K-prototypes clustering, a cluster analysis method that combines K-means for numerical data with K-modes for categorical data. We use this approach to determine whether certain crimes can be grouped into specific clusters, which would allow inferences about the general timing and setting of specific crimes.

The issue with our current subset of data is that the frequencies of the crimes can hinder the inferential power of the algorithm. Since K-prototypes clusters observations around centroids, if a certain crime appears significantly more often with relatively moderate variability, a lot of the clusters can have a high influx of observations of that crime. To combat this, we will be using a sample of that subset in which each crime is represented equally within the data.

Crime frequencies:

Code
table(subsetcrimes$`Primary Type`) %>%
  kable %>% 
  kable_styling("striped")
Var1                                 Freq
ARSON                               12160
BURGLARY                           394447
CONCEALED CARRY LICENSE VIOLATION    1183
CRIMINAL SEXUAL ASSAULT             58093
CRIMINAL TRESPASS                  198558
GAMBLING                            13417
HOMICIDE                            25108
INTERFERENCE WITH PUBLIC OFFICER    18131
KIDNAPPING                           6103
LIQUOR LAW VIOLATION                12883
NARCOTICS                          671943
Code
proto_data <- subsetcrimes %>%
  group_by(`Primary Type`) %>%
  sample_n(1000) %>% 
  ungroup() %>% 
  select(`Primary Type`, `X Coordinate`, `Y Coordinate`, `Location Description`, Beat, Ward, Month, Hour, Weekday)
Code
proto_analysis <- kproto(select(proto_data, -`Primary Type`), 9)
Code
proto_data$clusters <- proto_analysis$cluster

proto_data %>%
  count(clusters, `Primary Type`) %>%
  spread(key = `Primary Type`, value = n) %>%
  mutate(across(-1, ~./sum(.))) %>%
  rowwise() %>%
  mutate(HighestColumn = names(.)[-1][which.max(c_across(-1))]) %>%
  cbind(proto_analysis$centers) %>%
  kable %>% 
  kable_styling("striped")
clusters | ARSON | BURGLARY | CONCEALED CARRY LICENSE VIOLATION | CRIMINAL SEXUAL ASSAULT | CRIMINAL TRESPASS | GAMBLING | HOMICIDE | INTERFERENCE WITH PUBLIC OFFICER | KIDNAPPING | LIQUOR LAW VIOLATION | NARCOTICS | HighestColumn | X Coordinate | Y Coordinate | Location Description | Beat | Ward | Month | Hour | Weekday
1 | 0.183 | 0.115 | 0.073 | 0.130 | 0.107 | 0.245 | 0.145 | 0.182 | 0.145 | 0.138 | 0.227 | GAMBLING | 1144999 | 1907134 | SIDEWALK | 1522 | 28 | 8 | 14 | Friday
2 | 0.100 | 0.097 | 0.079 | 0.084 | 0.063 | 0.079 | 0.138 | 0.104 | 0.136 | 0.047 | 0.093 | HOMICIDE | 1178286 | 1833753 | SIDEWALK | 2211 | 34 | 5 | 19 | Sunday
3 | 0.120 | 0.140 | 0.089 | 0.149 | 0.225 | 0.086 | 0.094 | 0.118 | 0.080 | 0.223 | 0.127 | CRIMINAL TRESPASS | 1161649 | 1907130 | APARTMENT | 1122 | 27 | 10 | 0 | Monday
4 | 0.085 | 0.154 | 0.041 | 0.147 | 0.127 | 0.056 | 0.049 | 0.073 | 0.107 | 0.196 | 0.081 | LIQUOR LAW VIOLATION | 1158042 | 1931712 | VEHICLE NON-COMMERCIAL | 1434 | 49 | 11 | 21 | Monday
5 | 0.013 | 0.010 | 0.209 | 0.011 | 0.021 | 0.001 | 0.007 | 0.007 | 0.023 | 0.009 | 0.005 | CONCEALED CARRY LICENSE VIOLATION | 1108230 | 1934071 | RESIDENCE | 1623 | 41 | 8 | 8 | Wednesday
6 | 0.147 | 0.168 | 0.240 | 0.142 | 0.108 | 0.167 | 0.169 | 0.164 | 0.163 | 0.089 | 0.117 | CONCEALED CARRY LICENSE VIOLATION | 1163690 | 1859644 | RESIDENCE | 0611 | 17 | 7 | 11 | Friday
7 | 0.126 | 0.082 | 0.071 | 0.114 | 0.134 | 0.113 | 0.103 | 0.104 | 0.121 | 0.091 | 0.098 | CRIMINAL TRESPASS | 1171212 | 1875393 | RESIDENCE | 0913 | 3 | 5 | 19 | Thursday
8 | 0.119 | 0.083 | 0.106 | 0.096 | 0.072 | 0.109 | 0.139 | 0.110 | 0.079 | 0.135 | 0.136 | HOMICIDE | 1154875 | 1889370 | APARTMENT | 1022 | 24 | 8 | 23 | Wednesday
9 | 0.107 | 0.151 | 0.092 | 0.127 | 0.143 | 0.144 | 0.156 | 0.138 | 0.146 | 0.072 | 0.116 | HOMICIDE | 1184955 | 1855021 | VEHICLE NON-COMMERCIAL | 0333 | 8 | 9 | 19 | Monday

The above table shows the centroids of these clusters along with, for each crime type, the share of its sampled incidents assigned to each cluster (each crime column sums to 1 across the nine clusters). Our current run shows a broad mix of crime types in every cluster: no crime type has more than about a quarter of its incidents in any single cluster, with the largest share being 0.245 for Gambling in Cluster 1. Thus, we were unable to find a clear separation between types of crime based on variables about location and time.
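
Cluster purity can also be measured from the other direction, by asking what share of each cluster belongs to its most common crime type. The sketch below does this in base R on a small made-up set of cluster assignments (`toy`, `within_cluster`, `purity`, and `dominant` are illustrative names, not objects from our analysis).

```r
# Hypothetical cluster assignments for three crime types.
toy <- data.frame(
  cluster = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3),
  type    = c("ARSON", "ARSON", "ARSON", "GAMBLING",
              "GAMBLING", "GAMBLING", "HOMICIDE",
              "HOMICIDE", "HOMICIDE", "ARSON")
)

# Counts of each crime type (columns) within each cluster (rows).
counts <- table(toy$cluster, toy$type)

# Row-normalized: within each cluster, what share of incidents is each type?
within_cluster <- prop.table(counts, margin = 1)

# Purity of a cluster is the share held by its most common crime type.
purity <- apply(within_cluster, 1, max)
dominant <- colnames(within_cluster)[apply(within_cluster, 1, which.max)]

data.frame(cluster = rownames(within_cluster), dominant, purity)
```

A purity near 1 would mean a cluster is made up almost entirely of one crime type. With eleven equally sampled crime types, a purity near 1/11 would mean the cluster is no better than a random mix.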

Tasks 2 and 3: Predictive Modeling for Crime Type and Arrests

The following tasks use gradient boosted trees, fit with XGBoost. Gradient boosting combines many weak learners, usually simple decision trees, into a single more robust and accurate model. The trees are created sequentially: each new tree learns from the mistakes of the trees before it, and a loss function is tracked to determine whether additional trees are improving the model. In this way the ensemble improves on the predictions of a single decision tree. Gradient boosted trees tend to be more accurate than single decision trees and than random forests, in which many decision trees are grown independently and a majority vote of these trees determines the overall classification.
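
To make the sequential idea concrete, here is a minimal sketch of boosting for a squared-error regression problem, using one-split trees (stumps) written in base R. It is a toy illustration on simulated data, not the XGBoost algorithm used in our models, which additionally uses regularization and second-order gradient information. The names `fit_stump`, `predict_stump`, and `learn_rate` are made up for this example.

```r
# Fit a decision stump (a tree with a single split) to residuals r on predictor x.
fit_stump <- function(x, r) {
  vals <- sort(unique(x))
  splits <- (head(vals, -1) + tail(vals, -1)) / 2  # midpoints between neighbours
  best <- NULL
  for (s in splits) {
    left <- mean(r[x <= s])
    right <- mean(r[x > s])
    sse <- sum((r - ifelse(x <= s, left, right))^2)
    if (is.null(best) || sse < best$sse) {
      best <- list(split = s, left = left, right = right, sse = sse)
    }
  }
  best
}

# Predict with a stump: one constant on each side of the split.
predict_stump <- function(stump, x) {
  ifelse(x <= stump$split, stump$left, stump$right)
}

set.seed(1)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.1)

learn_rate <- 0.3
pred <- rep(mean(y), length(y))  # start from a constant prediction
mse <- numeric(50)

for (m in 1:50) {
  stump <- fit_stump(x, y - pred)                          # fit the current residuals
  pred <- pred + learn_rate * predict_stump(stump, x)      # take a small step
  mse[m] <- mean((y - pred)^2)                             # training error after m trees
}

round(mse[c(1, 10, 50)], 4)
```

Each stump on its own is a poor model of the sine curve, but because every new stump is fit to what the ensemble still gets wrong, the training error shrinks as trees are added.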

Task 2: Predicting Crime Types Using XGBoosted Trees

For simplicity and runtime, we use a random sample of 100,000 observations from our data. After dropping rows with missing values, we split this sample into training (90%) and testing (10%) sets.

Code
subsetted_crimes <- crimes %>%
  sample_n(100000) 

subsetted_crimes <- na.omit(subsetted_crimes)

subsetted_crimes$`Primary Type` <- as.factor(subsetted_crimes$`Primary Type`)

splits <- subsetted_crimes %>% 
  initial_split(0.9, strata = `Primary Type`)

training_data <- splits %>% training()
testing_data <- splits %>% testing()
Code
training_data$`Primary Type` <- as.factor(training_data$`Primary Type`)

crimes_rec <- recipe(`Primary Type` ~ `Location Description` + Beat + 
                     Ward + `X Coordinate` + `Y Coordinate` + Year + Month + 
                     Hour + Weekday + Arrest
                     , data = training_data) %>%
  step_dummy(all_nominal_predictors()) 


xtrees <- boost_tree() %>%
  set_mode("classification") %>%
  set_engine("xgboost") 

gbdt <- workflow() %>%
  add_model(xtrees) %>%
  add_recipe(crimes_rec) %>%
  fit(training_data)
Code
testing_data <- testing_data %>%
  mutate(pred = predict(gbdt, new_data = testing_data)$.pred_class)
Code
# Compare predicted and observed crime types on the held-out test set

prec <- testing_data %>% precision(truth = `Primary Type`, estimate = pred) 
sens <- testing_data %>% sensitivity(truth = `Primary Type`, estimate = pred)
spec <- testing_data %>% specificity(truth = `Primary Type`, estimate = pred)
acry <- testing_data %>% accuracy(truth = `Primary Type`, estimate = pred) 

rf_class_metrics <- bind_rows(prec, sens, spec, acry)
rf_class_metrics %>%
  kable %>%
  kable_styling("striped")
.metric .estimator .estimate
precision macro 0.4046626
sensitivity macro 0.1310069
specificity macro 0.9742527
accuracy multiclass 0.3796620

As shown by this gradient boosted model, we have fairly low accuracy in predicting the type of crime that occurred based on location and time. This tells us that more information, such as information provided in a 911 call, would be needed for police officers to accurately predict which crime is occurring in a certain location. It also suggests that the types of crimes occurring in Chicago are not highly separated across neighborhoods, and thus citizens and tourists may need to look at overall crime rates and crime type distributions for each neighborhood to determine where they would feel the most comfortable.

Task 3: Predicting Arrest Rates Using XGBoosted Trees

For simplicity and runtime, we again use a random sample of 100,000 observations from our data. After dropping rows with missing values, we split this sample into training (90%) and testing (10%) sets.

Code
subsetted_crimes <- crimes %>%
  sample_n(100000) 

subsetted_crimes <- na.omit(subsetted_crimes)

subsetted_crimes$`Primary Type` <- as.factor(subsetted_crimes$`Primary Type`)

splits <- subsetted_crimes %>% 
  initial_split(0.9, strata = `Primary Type`)

training_data <- splits %>% training()
testing_data <- splits %>% testing()
Code
crimes_rec <- recipe(Arrest ~ `Primary Type` + `Location Description` + Beat + 
                     Ward + `X Coordinate` + `Y Coordinate` + Year + Month + 
                     Hour + Weekday, data = training_data) %>%
  step_dummy(all_nominal_predictors())

xtrees <- boost_tree() %>%
  set_mode("classification") %>%
  set_engine("xgboost", objective = "binary:logistic")  # logistic loss for the two-class Arrest outcome

gbdt <- workflow() %>%
  add_model(xtrees) %>%
  add_recipe(crimes_rec) %>%
  fit(training_data)
Code
testing_data <- testing_data %>%
  mutate(pred = predict(gbdt, new_data = testing_data)$.pred_class)
Code
prec <- testing_data %>% precision(truth = Arrest, estimate = pred) 
sens <- testing_data %>% sensitivity(truth = Arrest, estimate = pred)
spec <- testing_data %>% specificity(truth = Arrest, estimate = pred)
acry <- testing_data %>% accuracy(truth = Arrest, estimate = pred) 

rf_class_metrics <- bind_rows(prec, sens, spec, acry)
rf_class_metrics %>%
  kable %>%
  kable_styling("striped")
.metric .estimator .estimate
precision binary 0.8744563
sensitivity binary 0.9775767
specificity binary 0.6000770
accuracy binary 0.8795120

As shown in our gradient boosted model, we have fairly high accuracy in predicting whether or not an arrest will be made in these Chicago crimes based on only a few predictors. This analysis can benefit police departments by helping inform them of which crimes may need extra resources or processes devoted to finding the offenders. Additionally, this analysis can keep the citizens of Chicago informed about which crimes tend to go unresolved, indicating they may need to be wary of repeat offenders of those crimes.

Discussion

Cluster Analysis

Through our cluster analysis we found that different areas have different frequencies of crime types. From our initial cluster analysis over many crime types, we can view our table of frequencies of crimes in each of our clusters over time and see that:

  • The proportion of Interference with a Public Officer crimes increases over time in all clusters

  • The proportion of Narcotics crimes decreases over time in all clusters

  • The proportion of Criminal Sexual Assault crimes increases over time in all clusters

  • The proportion of Homicide crimes increases over time in all clusters but increases the least in Cluster 2

  • Cluster 4 has a higher frequency of Interference with a Public Officer crimes than other clusters in recent years

  • In most years, Cluster 1 looks to have a higher frequency of Burglary crimes than other clusters

  • The frequency of Gambling crimes has decreased over time overall but used to have the highest frequencies in Clusters 2 and 3.

We also find that arrest rates differ between different crimes and different areas. Arrest rates for homicides decrease over time for all clusters, with the northernmost cluster having the highest arrest rate, especially in recent years.

Criminal Sexual Assaults also have decreasing arrest rates over time for all clusters. In earlier years, the northernmost cluster has the highest arrest rates, but over time all clusters converge to similarly low arrest rates, between 0.05 and 0.1 in 2022 and 2023.

Interference with a Public Officer has the opposite trend, with arrest rates increasing over time. It also looks like the northernmost cluster has the lowest arrest rates for the most part. Our hypothesis is that this may have to do with demographics in this area. It is possible that police may not arrest certain groups of people for police interference, such as more affluent groups of people.

In future analysis, it would be interesting to perform ANOVA testing between clusters for frequencies of crimes and arrest rates to find the statistical significance of these findings. This could show us whether there is a significant difference between these clusters and help us determine what factors are influencing this difference.
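
As a rough sketch of what such a test could look like, the example below runs a one-way ANOVA of arrest rate on cluster with base R's `aov()`. The data are simulated purely for illustration (22 yearly rates for each of 5 clusters, with one cluster given a higher mean), and `toy_rates` is a made-up name. Real arrest rates are proportions measured repeatedly over years, so the usual ANOVA assumptions would need to be checked before relying on the result.

```r
set.seed(7)

# Simulated yearly arrest rates: cluster 1 is given a higher mean than clusters 2 to 5.
toy_rates <- data.frame(
  cluster = factor(rep(1:5, each = 22)),
  arrest_rate = c(rnorm(22, mean = 0.50, sd = 0.05),
                  rnorm(88, mean = 0.35, sd = 0.05))
)

# One-way ANOVA: does mean arrest rate differ between clusters?
fit <- aov(arrest_rate ~ cluster, data = toy_rates)
summary(fit)

# Extract the p-value for the cluster term from the ANOVA table.
p_value <- summary(fit)[[1]][["Pr(>F)"]][1]
```

A small p-value would indicate that at least one cluster's mean arrest rate differs from the others; a follow-up such as `TukeyHSD(fit)` could then show which pairs of clusters differ.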

Predicting Crime Type

Our boosted model correctly predicts the crime type for 0.3796620 of test observations (accuracy). It does not do a very good job of detecting when a particular type of crime was committed (0.1310069 macro-averaged sensitivity). It does do a good job of recognizing when the crime committed was not a particular type (0.9742527 specificity). Of all observations predicted to be a certain crime type, 40.47% actually were that crime type (0.4046626 precision). This is a low number, indicating that our model does not do well at predicting crime types from only a few predictors.
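
The gap between accuracy and macro-averaged sensitivity is typical when class sizes are very unequal, because a macro average weights every crime type equally regardless of how often it occurs. A small base R sketch with made-up labels (not our data) shows the effect: a model that always predicts the most common class has high accuracy but low macro sensitivity.

```r
# Hypothetical test set: 8 common-class incidents and 1 each of two rare classes.
truth <- factor(c(rep("NARCOTICS", 8), "ARSON", "GAMBLING"),
                levels = c("ARSON", "GAMBLING", "NARCOTICS"))

# A model that predicts the most common class for every observation.
pred <- factor(rep("NARCOTICS", 10), levels = levels(truth))

# Confusion matrix with true classes in rows and predictions in columns.
conf <- table(truth = truth, pred = pred)

# Sensitivity per class: correctly predicted cases divided by true cases of that class.
per_class_sens <- diag(conf) / rowSums(conf)

macro_sens <- mean(per_class_sens)       # every class counts equally
accuracy <- sum(diag(conf)) / sum(conf)  # every observation counts equally

c(macro_sensitivity = macro_sens, accuracy = accuracy)
```

Here accuracy is 0.8 while macro sensitivity is only one third, since two of the three classes are never detected.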

Overall, this suggests that this model may not be very helpful to the public, as it does not perform well in predicting the type of crime being committed based on location and time. Instead, we would suggest that citizens use overall crime rates and distribution of crime types for each specific area in order to determine where they feel the most comfortable.

Predicting Arrests

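Our boosted model classifies 0.8795120 of test observations correctly (accuracy). It identifies one outcome far more reliably than the other (0.9775767 sensitivity versus 0.6000770 specificity), and 0.8744563 of its predictions of the event class are correct (precision). Because yardstick treats the first factor level as the event by default, these figures describe arrests only if TRUE is the first level of Arrest. If the levels are ordered FALSE then TRUE, the high sensitivity instead reflects incidents where no arrest was made, and the model is weaker at identifying arrests.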
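
One way to avoid ambiguity about which outcome a metric refers to is to compute the recall of each class separately. The base R sketch below does this on made-up labels (not our data); `recall_for` is an illustrative helper, not a function from any package.

```r
# Hypothetical outcomes: 6 incidents without an arrest and 4 with an arrest.
truth <- factor(c(rep("FALSE", 6), rep("TRUE", 4)), levels = c("FALSE", "TRUE"))

# Hypothetical predictions: all non-arrests are caught, half of the arrests are missed.
pred <- factor(c(rep("FALSE", 6), "TRUE", "TRUE", "FALSE", "FALSE"),
               levels = c("FALSE", "TRUE"))

# Recall for one class: among observations truly in that class,
# the share that the model also assigned to that class.
recall_for <- function(level) {
  mean(pred[truth == level] == level)
}

c(no_arrest = recall_for("FALSE"), arrest = recall_for("TRUE"))
```

Whichever class is treated as the event, its recall is the sensitivity and the other class's recall is the specificity, so reporting both by name removes the dependence on factor level order.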

The police department may use this model to look at specific scenarios involving crime type, location, and time to see how they can better improve their arrest rates. Citizens can use this model to examine the arrest rates for specific locations and crimes to see how well their local police departments are doing and to inform them of potential unresolved crimes in their area.

Limitations and Ethics Considerations

As mentioned in the introduction, it is crucial to assess and acknowledge the limitations associated with the dataset and address the ethical implications that arise from analyses of crimes. This section aims to provide a comprehensive understanding of the constraints and ethical considerations in our analysis.

  1. Inherent Bias and Racial Profiling: This dataset, comprising all reported crimes published by the Chicago Police Department, is susceptible to inherent bias. A notable concern is the potential impact of racial profiling and racism in law enforcement practices, which may result in an overestimation of reported crimes for people of color and in neighborhoods with a majority of people of color. This bias can skew the interpretation of crime rates and contribute to an inaccurate portrayal of crime distribution among different racial groups and neighborhoods.

  2. Socioeconomic and Geographic Biases: People residing in lower income or gang related areas may face a higher likelihood of arrest, introducing socioeconomic and geographic biases into the dataset. This bias could potentially lead to an overrepresentation of reported crimes in specific neighborhoods, influencing the overall crime statistics. This can also exacerbate the overestimation for people of color, as the long-lasting effects of red-lining, gentrification, and generational wealth mean that people of color may be overrepresented in lower income areas, compounding the effects of both racism and socioeconomic biases.

  3. Reporting Discrepancies: It is also incredibly important to recognize that the dataset represents reported crimes, not the true number of crimes. Crimes that are less frequently reported, such as domestic assault and sexual assault, may be inadequately represented in this dataset. This reporting discrepancy poses a challenge in accurately assessing the prevalence of certain types of crimes and limits the generalizability of our results to only those crimes that are represented in our data.

  4. Generalizability and Limitations of Conclusions: Given the biases and reporting discrepancies, we must be cautious in generalizing our findings. Biased reporting and the potential underrepresentation of specific crime types mean that conclusions should be generalized beyond the dataset only with careful consideration. The validity of our conclusions is contingent upon awareness of the limitations stated above.

  5. Ethical Sensitivity: The data analyzed contains sensitive information, and we recognize the ethical responsibility associated with handling such data. Special care has been taken to ensure that our analysis is conducted with the utmost integrity and respect for privacy. This includes a commitment to preventing any misuse of our results that could adversely impact certain demographic groups.

  6. Implications for Policy and Practice: Acknowledging the limitations and biases in the dataset, it is imperative to approach any policy or practice recommendations with caution. Recommendations should be guided by an understanding of the potential biases and limitations, ensuring that they do not inadvertently harm specific groups or perpetuate existing disparities.

In conclusion, our exploration of this data requires a nuanced awareness of its limitations and ethical considerations. By addressing these issues, we aim to contribute responsibly to the broader discourse on crime while recognizing the complexity of the social and ethical landscapes in which our analysis is situated.

Bibliography

  1. Ashby, Matthew P. J. 2020. “Initial Evidence on the Relationship between the Coronavirus Pandemic and Crime in the United States.” Crime Science 9 (1). https://doi.org/10.1186/s40163-020-00117-6.

  2. Ba, Bocar A., Dean Knox, Jonathan Mummolo, and Roman Rivera. 2021. “The Role of Officer Race and Gender in Police-Civilian Interactions in Chicago.” Science 371 (6530): 696–702. https://doi.org/10.1126/science.abd8694.

  3. Block, Carolyn R., and Richard Block. 1993. Street Gang Crime in Chicago. Google Books. U.S. Department of Justice, Office of Justice Programs, National Institute of Justice. https://books.google.com/books?hl=en&lr=&id=cozaAAAAMAAJ&oi=fnd&pg=PA6&dq=chicago+crime&ots=qNTfmVtjVa&sig=3KBxL5jazUY-ZBubpi1DeqlKq20#v=onepage&q=chicago%20crime&f=false.

  4. Campedelli, Gian Maria, Serena Favarin, Alberto Aziani, and Alex R. Piquero. 2020. “Disentangling Community-Level Changes in Crime Trends during the COVID-19 Pandemic in Chicago.” Crime Science 9 (1). https://doi.org/10.1186/s40163-020-00131-8.

  5. Chicago Police Department. 2011. “Crimes - 2001 to Present.” Cityofchicago.org. September 30, 2011. https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2.

  6. dc37. n.d. “Plot with Conditional Colors Based on Values in R.” Stack Overflow. https://stackoverflow.com/questions/11838278/plot-with-conditional-colors-based-on-values-in-r.

  7. finnstats. 2021. “Animated Graph GIF with Gganimate & Ggplot | R-Bloggers.” R Bloggers. May 15, 2021. https://www.r-bloggers.com/2021/05/animated-graph-gif-with-gganimate-ggplot/.

  8. “Homicide Map | City of Chicago | Data Portal.” n.d. Chicago. https://data.cityofchicago.org/Public-Safety/Homicide-Map/53tx-phyr.

  9. Huynh, Y. Wendy. n.d. 6.3 Group_by() and Ungroup() | R for Graduate Students. Bookdown.org. https://bookdown.org/yih_huynh/Guide-to-R-Book/groupby.html.

  10. Jenness, Valerie, and Ryken Grattet. 1996. “The Criminalization of Hate: A Comparison of Structural and Polity Influences on the Passage of ‘Bias-Crime’ Legislation in the United States.” Sociological Perspectives 39 (1): 129–54. https://doi.org/10.2307/1389346.

  11. Lemmer, Thomas J., Gad J. Bensinger, and Arthur J. Lurigio. 2008. “An Analysis of Police Responses to Gangs in Chicago.” Police Practice and Research 9 (5): 417–30. https://doi.org/10.1080/15614260801980836.

  12. “Return the Index of the First Maximum Value of a Numeric Vector in R Programming - Which.max() Function.” 2020. GeeksforGeeks. June 6, 2020. https://www.geeksforgeeks.org/return-the-index-of-the-first-maximum-value-of-a-numeric-vector-in-r-programming-which-max-function/.

  13. RHertel. n.d. “R - Extract Year from Date.” Stack Overflow. https://stackoverflow.com/questions/36568070/extract-year-from-date.

  14. Schreck, Christopher J., Jean Marie McGloin, and David S. Kirk. 2009. “On the Origins of the Violent Neighborhood: A Study of the Nature and Predictors of Crime‐Type Differentiation across Chicago Neighborhoods.” Justice Quarterly 26 (4): 771–94. https://doi.org/10.1080/07418820902763079.

  15. Smith, Michael R., and Matthew Petrocelli. 2001. “Racial Profiling? A Multivariate Analysis of Police Traffic Stop Data.” Police Quarterly 4 (1): 4–27. https://doi.org/10.1177/1098611101004001001.

  16. Spring, Jon. n.d. “How to Make Dots in Gganimate Appear and Not Transition.” Stack Overflow. Accessed November 18, 2023. https://stackoverflow.com/questions/56411604/how-to-make-dots-in-gganimate-appear-and-not-transition.

  17. Thomas. n.d. “Error in Adding Rows to an Empty Data Frame in R.” Stack Overflow. Accessed November 18, 2023. https://stackoverflow.com/questions/25051528/error-in-adding-rows-to-an-empty-data-frame-in-r.

  18. “Violent Crime Time of Day(per 1,000 in Age Group).” 2022. Www.ojjdp.gov. 2022. https://www.ojjdp.gov/ojstatbb/offenders/qa03401.asp#:~:text=In%20general%2C%20the%20number%20of.