Following more than twenty years of declining violent crime rates, the FBI reports that violent crime, including murder, has risen for the second year in a row. In particular, the FBI's 2016 report estimates that a total of 1,248,185 violent crimes occurred in the US that year, a 4.1 percent increase from the 2015 estimate. The same report estimates that murders in the nation totaled 17,250, an 8.6 percent increase from 2015. This increase in crime, and particularly in homicides, spotlights the importance of predicting murder rates, both to understand their correlates and to divert resources to areas with higher crime.
Thus, following recent reports of increasing murder rates, we build a series of models to predict murder rates from Census data at the MSA (Metropolitan Statistical Area) level. Our ability to predict murder rates can help us identify important features correlated with crime, and these models may also be used to project murder rates into the future. They can therefore serve the purposes of both inference and prediction.
One of the major challenges we hope to address is how to deal with murder-count outliers in less populated MSAs. Some years may yield an unusual number of murders for a small MSA, making our models subject to high variance. To address this concern, we try a series of different modeling approaches and select the best one. An additional challenge is that modeling crime rates will not reveal causal relationships with our features, so a model of this kind may not be highly insightful in determining the causes of murder rates.
​
​
Data Processing
​
Our data consists of FBI murder counts by MSA from 2006 to 2016, Census data by MSA for the same years, and an additional dataset with GDP per capita from the Bureau of Economic Analysis.
Response Variable:
- murder rate

Features:
- percentage of the MSA population that is male or female
- percentage of the MSA population in certain age groups
- percentage of the MSA population over 25 with a high school degree
- percentage of the MSA population over 25 with a bachelor's degree
- percentage of the MSA population in poverty
- median income of the MSA population
- percentage of the MSA population of each race
- GDP/capita

All of our data are numerical, consisting of percentages, counts, and dollar amounts.
In building our tables -- one murder-count and feature table for each year -- we quickly realized that joining the dataframes would require some name cleaning to minimize the need for imputation. Because the FBI and Census data follow different naming conventions, we cleaned the names and defined a process to match each FBI MSA name to a Census MSA name for each year. To do this, we built a scoring function using the fuzzywuzzy package that blends an overall string-similarity score with an average of partial similarity scores, comparing only MSAs within the same state. After matching the names, we noticed that some MSAs appeared in only one of the two datasets (FBI or Census); we dropped these unmatchable MSAs (~20 per year). We also joined the GDP/capita feature to each year's dataframe, and finally, we stored each combined dataframe in a dictionary keyed by year.
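A minimal sketch of this matching step, assuming hypothetical dataframes fbi_df and census_df with msa_name and state columns; the blend weights and the 80-point threshold are illustrative assumptions, not the exact originals:

```python
# Illustrative sketch of the FBI-to-Census MSA name matching; column
# names, weights, and threshold are assumptions, not the exact originals.
from fuzzywuzzy import fuzz

def match_score(fbi_name, census_name):
    # Blend an overall similarity score with a token-based similarity score.
    return 0.5 * fuzz.ratio(fbi_name, census_name) \
         + 0.5 * fuzz.token_sort_ratio(fbi_name, census_name)

def best_census_match(fbi_row, census_df, threshold=80):
    # Only compare MSAs that sit in the same state.
    candidates = census_df[census_df['state'] == fbi_row['state']]
    if candidates.empty:
        return None
    scores = candidates['msa_name'].apply(
        lambda name: match_score(fbi_row['msa_name'], name))
    if scores.max() < threshold:
        return None  # treated as unmatchable and dropped
    return candidates.loc[scores.idxmax(), 'msa_name']
```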
​
EDA
​
Our EDA revealed three major findings that guided our modeling strategy. First, murder counts are highly correlated with population, leading us to change our response variable to murders per 100,000 people. Second, the majority of our features do not change much over time, suggesting that training and testing on an averaged dataset is feasible. Third, many of our features are correlated with one another, suggesting that we need some form of feature selection.
While the FBI data gives us murder counts, our EDA shows that population is the largest driver of those counts. We therefore combine population and murder counts to compute the number of murders per 100,000 people, and use this column as our response variable when building our models. Below we show how strongly population correlates with murders, and we plot histograms of raw murder counts and of murders per 100,000.

[Figures: murder counts vs. population; histograms of murder counts and murders per 100,000]
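The rate transformation itself is a one-liner; a sketch assuming hypothetical murder_count and population columns:

```python
# Murders per 100,000 residents (hypothetical column names).
df['murder_rate'] = df['murder_count'] / df['population'] * 100_000
```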
In addition, the plot below shows that the percentage of white residents has no clear trend over time. In fact, the majority of our features do not change dramatically from year to year, suggesting that a model built from an average of the time-series data may perform well.

[Figure: percentage of white residents over time]
From a correlation heat map, we find that many of our features are correlated. We also notice that it may be beneficial to collapse some of our age columns. Thus, we collapse the age features into twenty-year increments.

[Figure: correlation heat map of features]
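A minimal sketch of the age-column collapse described above, assuming hypothetical names for the original Census age-bucket columns:

```python
# Collapse fine-grained age-bucket percentages into twenty-year bins.
# All column names here are hypothetical stand-ins for the Census fields.
age_bins = {
    'age_0_20':  ['age_0_5', 'age_5_10', 'age_10_15', 'age_15_20'],
    'age_20_40': ['age_20_25', 'age_25_30', 'age_30_35', 'age_35_40'],
    # ...remaining twenty-year bins defined the same way
}

def collapse_ages(df, bins=age_bins):
    out = df.copy()
    for new_col, old_cols in bins.items():
        out[new_col] = out[old_cols].sum(axis=1)  # percentages add within an MSA
    return out.drop(columns=[c for cols in bins.values() for c in cols])
```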
Introduction and Description of Data
Problem Statement and Motivation
In this report, we document our process of building and selecting models that use specific features (gender, age, race, income, etc.) to predict murders per 100,000 people in MSAs across the US. We do this in two ways: first, by training on data from 2006-2015 and testing on 2016 data; second, by averaging over all the years (2006-2016) and training and testing on subsets of the averaged data.
Literature Review and Related Works
There is a wealth of literature across fields investigating differences in crime trends and attempting to predict crime. We discuss two papers that speak to the relevance of our project.
Edward L. Glaeser's "Why Is There More Crime in Cities?" develops a theory of the differences in crime rates between big and small cities. He theorizes that higher crime in more populated areas may be attributed to three factors: the higher returns to crime in urban areas, the lower probability of arrest, and urban areas' attraction of crime-prone individuals. Using data from the National Crime Victimization Survey, the FBI, and the National Longitudinal Survey of Youth, Glaeser runs individual-level regressions to explain the effect of city size on crime. In particular, he finds that the higher returns to crime account for about 13 percent of the effect of city size on crime, the lower probability of arrest accounts for about one third, and the attraction of crime-prone individuals (individual characteristics) accounts for about 30 percent.
Remi Boivin and Maurizio D'Elia's "A Network of Neighborhoods: Predicting Crime Trips in a Large Canadian City" attempts to distinguish between elements of an area that would uniquely attract or deter a particular individual from committing crime and elements that would attract or deter anyone. The authors use police-recorded statistics from 2013 for a city in Eastern Canada and analyze violent and property crimes across more than 500 census tracts (small statistical areas within a county). Using multilevel negative binomial regressions, they find that differences in reward, effort, and risk are significantly related to the number of crime-associated trips per pair of tracts.
The results of these two papers are consistent with each other: both conclude that the reward and risk of committing a crime are significantly related to crime. Glaeser quantifies these effects in large cities, attributing about 13 percent of the city-size effect to reward and about one third to risk. While controlling the rewards of crime is unfeasible from a preventative standpoint, increasing the risk of committing crime through predictive allocation of police enforcement may be an effective means of reducing crime, since Glaeser shows that a higher probability of arrest is associated with lower crime. This makes predicting crime rates crucial for crime deterrence.
Modeling Approach and Project Trajectory
One of our main objectives was to tackle the challenge of working with time-series data. While our initial goal was simply to predict 2016 murder rates using data from previous years, we also came to appreciate the value of aggregating the data to make accurate predictions of murder rates across all the years. Thus, we use two different approaches to train and test our data: in the first, we train on data from 2006-2015 and test on 2016; in the second, we average over data from all the years and use subsets of the averaged data to train and test. We use three different methods, each of which fits into one of these two train-test splits.
​
The base models we discuss for each method are composed of the age and gender features. When we reference additional features, we mean marriage status, education, race, and GDP/capita.
Method 1
In this method, we took data from 2006 to 2015 and built a regression model for each year to estimate that year's regression coefficients. We then measured the trends in these coefficients over the years and projected them out to 2016 in order to predict the coefficient of each feature in 2016. We then used these predicted coefficients to make murder rate predictions on the 2016 feature data. This approach let us capture information about how the correlations change over the years.

[Figures: trend in the 'percent male' coefficient over time from the LassoCV and RidgeCV models]
Because this method explicitly models the coefficients in order to predict the 2016 coefficients, we are limited to regression-type models. From the EDA, we know that many of our features are correlated, leading us to use the LassoCV and RidgeCV models. In our cross-validation, we use three-fold splits.
The figures above illustrate the general trend in the coefficient of the feature 'percent male' over time for both the LassoCV and RidgeCV regression models. One thing to note is that the set of observations varies from year to year, since MSAs change over time; this may be one contributing factor to the large visible variation in the coefficient's value, which we discuss further in the results. Using these trends, we projected the coefficients to 2016 and used them to predict murders per 100,000 people. Within this framework, we build a base model using LassoCV and RidgeCV, and we also build a model with the additional features.
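A minimal sketch of this procedure, assuming the per-year dictionary `data` built during data processing, a hypothetical list `feature_cols` of feature columns, and a `murder_rate` response column (RidgeCV shown; the LassoCV variant is analogous):

```python
import numpy as np
from sklearn.linear_model import RidgeCV

years = list(range(2006, 2016))           # training years: 2006-2015
coefs, intercepts = [], []
for year in years:
    df = data[year]
    model = RidgeCV(cv=3).fit(df[feature_cols], df['murder_rate'])
    coefs.append(model.coef_)
    intercepts.append(model.intercept_)
coefs = np.array(coefs)                   # shape: (n_years, n_features)

# Fit a linear trend to each coefficient over time and extrapolate to 2016.
t = np.array(years)
coef_2016 = np.array([np.polyval(np.polyfit(t, coefs[:, j], 1), 2016)
                      for j in range(coefs.shape[1])])
b_2016 = np.polyval(np.polyfit(t, intercepts, 1), 2016)

# Predict 2016 murder rates from the projected coefficients.
preds_2016 = data[2016][feature_cols].values @ coef_2016 + b_2016
```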
​
​
Method 2
In this method, we aggregate the time-series data by taking the mean of each feature across all the years for a given MSA, and we use this averaged data as our features rather than the raw time series. We then select a subset of MSAs as our test set and fit Lasso, Ridge, random forest, and PCA-based models on the base features, on the base plus additional features, and on polynomial features of degree three (see the sketch below). The appeal of this approach is that averaging dampens outliers while still folding in information from every year of the time series.
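A minimal sketch of the averaging approach, again assuming the `data` dictionary and `feature_cols`, plus a hypothetical `msa_name` column to group on; the test fraction is an illustrative assumption:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import RidgeCV

# Average each feature and the response across all years for each MSA.
stacked = pd.concat(list(data.values()))
averaged = stacked.groupby('msa_name').mean(numeric_only=True)

X, y = averaged[feature_cols], averaged['murder_rate']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)  # test fraction is an assumption

model = RidgeCV(cv=3).fit(X_train, y_train)
print(model.score(X_test, y_test))         # test R^2
```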
Method 3
In this method, we simply use 2015 data as our training set and 2016 data as our test set. By ignoring earlier years, we assume that the time-series structure is not relevant to the problem at hand. While simple, this approach is a useful benchmark for the time-series models.
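A minimal sketch of this benchmark, reusing the assumed `data` dictionary and `feature_cols` (a random forest is shown, since it performed best for this method):

```python
from sklearn.ensemble import RandomForestRegressor

# Train on 2015 only, then score on the held-out 2016 data.
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(data[2015][feature_cols], data[2015]['murder_rate'])
print(rf.score(data[2016][feature_cols], data[2016]['murder_rate']))  # test R^2
```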
Results
Since we developed three different methods, we evaluate the results of each in turn.

Method 1: Results

[Tables and figures: Method 1 predicted vs. actual murder rates and test R^2 scores]
Method 2: Results
Method 3: Results

[Tables and figures: model performance for Methods 2 and 3]
Feature Importance

[Figures: predicted vs. actual murder rates for the base and augmented models; random forest feature importances]
One of the main things we notice across all our tables is a marked improvement in test R^2 from the addition of more features. One way to visualize the difference in performance is through the figures presented above: the green dotted line represents perfect prediction, and with our base model we see very poor performance on both the training and test sets. With the addition of features such as race, income, and GDP, however, performance improves and the predictions align more closely with the dotted green line. The natural question is: which specific features are driving the improvement?
​
One way to check this is to run a random forest regression, which can rank predictors by importance. In the figure above, we plot the top features and their relative importance to the model. The important features are race, marriage status, and the 20-40 age group. Within race, we find that the percentage of Black/African American residents is highly correlated with murder rates. The fact that race is a dominant feature does not imply a causal relationship; it may simply be proxying for individual-level income, which we cannot capture with MSA-level data.
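A minimal sketch of this check, reusing the assumed X_train and y_train from the Method 2 split above:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Fit a random forest and rank features by impurity-based importance.
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))  # top predictors
```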
​
​
Strengths and Shortcomings of Each Method
​
Method 1: Predicting the Betas
​
One major assumption we made is that the coefficients obey a linear trend. We concluded that this is not unrealistic, since strong predictors of murder rates are likely to remain strong predictors and coefficients are unlikely to change drastically. Still, it is a limitation of the model, since the coefficients could follow higher-order polynomial relationships. An additional shortcoming is that this model depends on having data from multiple years: it becomes increasingly variable as the number of years of available data shrinks, and it is the only model that requires at least two years of data. Because MSAs can have years with unusually high or low murder counts, and because each coefficient is estimated from a single year, the fitted coefficient trends are extremely sensitive to outliers. Finally, this model is only suited to projecting a few years out; its performance would degrade if we projected murder rates for, say, 2050.
One strength of this model is that it is the only one that explicitly accounts for the time series. While Method 2, averaging across the years, accounts for time implicitly, here we model the coefficients directly as a function of time. By doing so, we quantify how the influence of each feature changes over the years.
We find that our RidgeCV model performs best for this method, with a test R^2 of 0.44. This clearly beats our base model, whose test R^2 was around 0.05, demonstrating that the additional features have more predictive power than the base features. Even so, this approach performs the worst of the three methods, probably because data from 2006-2015 alone is not enough to accurately model the betas.
Because this model depends on having data from many years, with more time we would use data from additional years, perhaps back to 1995, and model some coefficients with non-linear trends.
​
Method 2: Averaging
By averaging over the years and training and testing on subsets of this averaged data, our model falls short in that any additional prediction would require data averaged over the same years. Its relevance also comes into question because we do not know how the model would perform when projecting murder rates for the following year (2017): since we train and test on an averaged dataframe, we only know how the model performs on data averaged over those same years.
One major strength of this method is its overall performance: we obtain the highest test R^2, 0.64, from the RidgeCV model. This superior performance can likely be attributed to the method's ability to neutralize outliers. By averaging over all the years, MSAs with abnormal murder rates in a given year are pulled back toward more typical rates, and the effect strengthens as we average over more years. This makes the method the most stable of the three and minimizes variance.
Just like Method 1, given more time we would use more data to build our model. Because this approach's superior performance stems from its minimization of variance, however, beyond a certain number of years of data we would no longer be able to improve its performance. In contrast, since Method 1's performance is determined by how well we model each feature's coefficient, data from more years matters there: the number of coefficient estimates we can fit a trend to grows one-to-one with the number of years of data. In addition, to assess how well this method forecasts murder rates for a specific year, we could average over the years 2006-2015 and use our 2016 data as the test set.
​
Method 3: Using 2015
​
This method is by far the most naive approach to predicting murder rates. It assumes that only the previous year's data matters for predicting murders in a given year, discarding any time-series structure. The model is also highly sensitive to year-to-year variability, since every year has MSAs with unusual murder counts.
The best model from this method, the random forest, gives a test R^2 of 0.573, which is much higher than the best R^2 in Method 1 (0.44) and not much worse than that of Method 2 (0.64). Given its simplicity and computational efficiency relative to the other methods, this model is a good choice for projecting murder rates. Its comparable performance to Method 2's best model may suggest that data from earlier years adds little.
​
Given more time, we would like to explore the importance of using time series versus this simple approach. We would obtain more data and, for various subsets of years, build models with the three approaches we have defined and compare their performance. If we could not obtain more data, we would rerun Method 1 on data from 2006-2014 to predict 2015, rerun Method 2 averaging over the years 2006-2015 with a train-test split, and rerun Method 3 training on 2014 data alone to predict 2015.
​
Summary
​
In the tree below, we outline our entire approach and highlight the best models for each method within each train-test split. These methods have helped us build better intuition both for the correlates of murder rates and for how time-series data can inform model building.

[Figure: tree summarizing the three methods and the best models within each train-test split]