Predicting Crime in Detroit, Michigan

Predictive Analytics and utilizing gun data in Detroit, MI from 1961-1973.

The Detroit dataset is from a book called “Subset selection in Regression”. The original data was collected by J.C. Fisher and used in his paper “Homicide in Detroit: The Role of Firearms”, Criminology, vol.14, 387-400 (1976). The data collected represents homicide data in Detroit from 1961-1973. The data was obtained from Detroit Dataset The following table lists each variable and their descriptions.

Detroit Dataset Description

Type in the variable search term below:

Variable Scale Predictor/Output Measurement
FTP(Full-Time Police) Predict future homicides Predictor Quantitative
UEMP(Unemployment) 3.2%-11% Predictor/Output Quantitative
LIC(Hangun Licences) 156.4-1134.2 Predictor Quantitative
GR(Handgun Registrations) 180.5-1029.8 Predictor/Output Quantitative
CLEAR(Homicides cleared) 58.90-94.40 Predictor/Output Quantitative
WM(White Males) 359647-558724 Predictor/Output Quantitative
NMAN(Non-manufacturing workers) 539.1-819.8 Predictor/Output Quantitative
GOV(# of government workers) 133.9-230.9 Predictor/Output Quantitative
HE(Average Hourly Earnings) 2.91-5.76 Predictor/Output Quantitative
WE(Average Weekly Earnings) 117.2-258.1 Predictor/Output Quantitative
HOM(# of homicides) 8.52-52.33 Predictor/Output Quantitative
ACC(Death rate in accidents) 39.17 Predictor/Output Quantitative
ASR(# of assaults) 218-473 Predictor/Output Quantitative

Methods of Analysis


Overview: For the Detroit dataset I have used several modeling techniques (Correlation, Simple Linear Regression, Multiple Linear Regression, Subset Selection, and Generalized Additive Models). The utilization of the plot function helped to determine what variables in the dataset would be significant and those which would not be significant. The correlation function was also used to determine which variables had a high correlation with the outcome variable of homicide. The models were chosen due to the continuous quantitative features of the Detroit dataset. Simple linear regression was utilized and helped to determine the number of predictors in the dataset that would be needed to predict homicide(s) in Detroit . GLM, LDA, QDA, and K-Nearest-Neighbors(KNN) were ignored due to the dataset having continuous variables which would not model well as low or high classification. Lastly the outcome variable is continuous and the above methods are utilized on qualitative outcome variables.

Correlation

Type in the variable search term below:

Model Learning Technique Variables Used Correlation: Scale = -1 to 1
Model 1 Correlation HOM - FTP 0.964
Model 2 Correlation HOM - UEMP 0.210
Model 3 Correlation HOM - MAN 0.546
Model 4 Correlation HOM - LIC 0.726
Model 5 Correlation HOM - GR 0.816
Model 6 Correlation HOM - CLEAR -0.968
Model 7 Correlation HOM - WM -0.952
Model 8 Correlation HOM - NMAN 0.955
Model 9 Correlation HOM - GOV 0.958
Model 10 Correlation HOM - HE 0.913
Model 11 Correlation HOM - WE 0.888
Model 12 Correlation HOM - ACC -0.204
Model 13 Correlation HOM - ASR 0.824

Best Model Correlation = Model 1- HOM~FTP. The correlation of 0.964 is the best predictor of correlation between homicide and all of the other predictors.



Simple Linear Regression

Simple Linear Regression Table - Detroit Dataset



Model Learning Technique Variables Used Multiple R-Squared Adjusted R-Squared
Model 14 Simple Linear Regression HOM - FTP 0.93 0.923
Model 15 Simple Linear Regression HOM - UEMP 0.04 -0.04
Model 16 Simple Linear Regression HOM - MAN 0.30 0.23
Model 17 Simple Linear Regression HOM - LIC 0.53 0.48
Model 18 Simple Linear Regression HOM - GR 0.67 0.636
Model 19 Simple Linear Regression HOM - CLEAR 0.94 0.93
Model 20 Simple Linear Regression HOM - WM 0.91 0.89
Model 21 Simple Linear Regression HOM - NMAN 0.91 0.91
Model 22 Simple Linear Regression HOM - GOV 0.92 0.91
Model 23 Simple Linear Regression HOM - HE 0.83 0.81
Model 24 Simple Linear Regression HOM - WE 0.79 0.76
Model 25 Simple Linear Regression HOM - ACC 0.04 -0.04
Model 26 Simple Linear Regression HOM - ASR 0.68 0.65

Best Model Simple Linear Regression: The best model is Model 19 and has an adjusted r-squared of 0.93.

summary(lm.fitdetroit6)

Call:
lm(formula = HOM ~ CLEAR, data = Detroit_Data_Ben_Gonzalez)

Residuals:
   Min     1Q Median     3Q    Max 
-7.385 -1.636 -1.059  2.804  8.737 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 127.22163    8.00767   15.89 6.21e-09 ***
CLEAR        -1.25352    0.09724  -12.89 5.55e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.264 on 11 degrees of freedom
Multiple R-squared:  0.9379,	Adjusted R-squared:  0.9323 
F-statistic: 166.2 on 1 and 11 DF,  p-value: 5.553e-08

Multiple Linear Regression Table - Detroit Dataset

Type in the variable search term below:

Model Learning Technique Variables Used Multiple R-Squared Adjusted R-Squared p-value
Model 27 Multiple Linear Regression Outcome=HOM Predictors=FTP, MAN, LIC, GR,WM,NMAN,GOV,HE,WE,ASR 0.998 0.9986 0.001188
Model 28 Multiple Linear Regression Outcome=HOM Predictors=NMAN, GOV, HE, WE, ACC, ASR 0.9912 0.9823 6.829E-06
Model 29 Multiple Linear Regression Outcome=HOM Predictors=LIC,FTP, GR,HE,WE,NMAN, ASR 0.9936 0.9847 3.512E-05

Best Model Multiple Linear Regression: The best model for simple linear regression is Model 27 and has the highest adjusted r-squared 0.9986. The model also has a low p-value of 0.001188.

Call:
lm(formula = HOM ~ FTP + MAN + LIC + GR + WM + NMAN + GOV + HE + 
    WE + ASR, data = Detroit_Data_Ben_Gonzalez)

Residuals:
        1         2         3         4         5         6         7         8         9        10        11 
 0.134676 -0.119242  0.035018  0.156071 -0.527640  0.528712 -0.262829  0.140073 -0.115613 -0.167151  0.101167 
       12        13 
-0.002143  0.098901 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)  
(Intercept) -2.021e+02  4.986e+01  -4.054   0.0558 .
FTP          2.933e-02  2.345e-02   1.251   0.3375  
MAN         -1.150e-01  1.832e-02  -6.279   0.0244 *
LIC          2.704e-02  4.221e-03   6.406   0.0235 *
GR          -4.355e-03  2.362e-03  -1.844   0.2066  
WM           2.385e-04  6.509e-05   3.664   0.0671 .
NMAN         1.147e-01  3.824e-02   3.000   0.0954 .
GOV          2.740e-01  9.392e-02   2.917   0.1001  
HE          -9.155e+00  3.038e+00  -3.013   0.0947 .
WE           4.393e-01  7.687e-02   5.714   0.0293 *
ASR         -1.437e-02  1.602e-02  -0.897   0.4644  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6188 on 2 degrees of freedom
Multiple R-squared:  0.9998,	Adjusted R-squared:  0.9986 
F-statistic: 841.1 on 10 and 2 DF,  p-value: 0.001188

Subset Selection Models

Subset Selection Models Table - Detroit Dataset

Type in the variable search term below:

Model Learning Technique Variables Used R.S.Q Adjusted R2 Min B.I.C.
Model 30 Subset Model Selection Method=Forward Outcome=HOM Predictors=All Other Variables 0.999 0.9989 8
Model 31 Subset Model Selection NVMAX=12 Outcome=HOM Predictors=All Other Variables 1.000 0.9999 12
Model 32 Subset Model Selection Method=Backward Outcome=HOM Predictors=All Other Variables 0.999 0.999 8

Best Model Subset Selection: The best subset model selection is Model 32 utilizing the backward method. The model utilizes the lowest BIC and has the highest adjusted r-squared 0.999 utilizing the least number of variables.


Generalized Additive Models

Generalized Additive Models (GAMs) Approach - Detroit Dataset

Type in the variable search term below:

Model Learning Technique Variables Used Total Predictors Multiple R2 Adjusted R2
Model 33 Generalized Additive Model Splines Regression Outcome=HOM Predictors=UEMP,LIC,WE,HE,ASR,NMAN,MAN,ASR 8 0.998 0.996 6.714e-07
Model 34 Generalized Additive Model Splines Regression Outcome=HOM Predictors=MAN,LIC,WM,NMAN,HE,WE 6 0.997 0.994 2.056e-07
Model 35 Generalized Additive Model Splines Regression Outcome=HOM Predictors=FTP,LIC,GR,NMAN,GOV,HE,WE,ASR 8 0.994 0.984 0.002583
Model 36 Generalized Additive Model Polynomial Regression Outcome=HOM Predictors=FTP,LIC,NMAN,GOV^2,HE^3,WE,ASR 7 0.999 0.9944 0.004688
Model 37 Generalized Additive Model Polynomial Regression Outcome=HOM Predictors=FTP^2,GR,LIC^2,NMAN,GOV,HE,WE,ASR 8 0.998 0.993 0.005753

Best Model Generalized Additive Model: The best model for the generalized additive model is Model 36 that utilizes polynomial regression. The model utilizes the least amount of predictors of all the generalized additive models. The model has the highest adjusted r-squared of 0.9944 that utilizes the least number of predictors. The p-values is also low at 0.004688.

Analysis of Variance Table

Response: HOM
             Df  Sum Sq Mean Sq   F value    Pr(>F)    
FTP           1 2994.37 2994.37 1978.9471 0.0005049 ***
LIC           1  151.19  151.19   99.9199 0.0098602 ** 
NMAN          1   30.18   30.18   19.9431 0.0466611 *  
poly(GOV, 2)  2   18.27    9.13    6.0363 0.1421203    
poly(HE, 3)   3   23.30    7.77    5.1325 0.1673830    
WE            1    0.65    0.65    0.4326 0.5782914    
ASR           1    0.80    0.80    0.5300 0.5423085    
Residuals     2    3.03    1.51                        
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> summary(gam.detroitmodel44)

Call:
lm(formula = HOM ~ FTP + LIC + NMAN + poly(GOV, 2) + poly(HE, 
    3) + WE + ASR, data = Detroit_Data_Ben_Gonzalez)

Residuals:
        1         2         3         4         5         6         7         8         9        10        11        12 
 0.008567 -0.163223  0.199979 -0.520727  0.990593 -0.700423  0.534176 -0.012935 -0.837255  0.361341  0.277549 -0.149193 
       13 
 0.011550 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)   -70.16794  127.52402  -0.550    0.637
FTP            -0.29850    0.31798  -0.939    0.447
LIC             0.03193    0.01821   1.753    0.222
NMAN            0.12534    0.18927   0.662    0.576
poly(GOV, 2)1   1.33967   48.34475   0.028    0.980
poly(GOV, 2)2  40.96870   38.86995   1.054    0.402
poly(HE, 3)1  -42.82354   75.21346  -0.569    0.627
poly(HE, 3)2   -3.68839   10.90049  -0.338    0.767
poly(HE, 3)3   -3.24754    7.03880  -0.461    0.690
WE              0.54739    0.55821   0.981    0.430
ASR            -0.02719    0.03734  -0.728    0.542

Residual standard error: 1.23 on 2 degrees of freedom
Multiple R-squared:  0.9991,	Adjusted R-squared:  0.9944 
F-statistic: 212.7 on 10 and 2 DF,  p-value: 0.004688

> summary(gam.detroitmodel44)

Call:
lm(formula = HOM ~ FTP + LIC + NMAN + poly(GOV, 2) + poly(HE, 
    3) + WE + ASR, data = Detroit_Data_Ben_Gonzalez)

Residuals:
        1         2         3         4         5         6         7         8         9        10        11        12 
 0.008567 -0.163223  0.199979 -0.520727  0.990593 -0.700423  0.534176 -0.012935 -0.837255  0.361341  0.277549 -0.149193 
       13 
 0.011550 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)   -70.16794  127.52402  -0.550    0.637
FTP            -0.29850    0.31798  -0.939    0.447
LIC             0.03193    0.01821   1.753    0.222
NMAN            0.12534    0.18927   0.662    0.576
poly(GOV, 2)1   1.33967   48.34475   0.028    0.980
poly(GOV, 2)2  40.96870   38.86995   1.054    0.402
poly(HE, 3)1  -42.82354   75.21346  -0.569    0.627
poly(HE, 3)2   -3.68839   10.90049  -0.338    0.767
poly(HE, 3)3   -3.24754    7.03880  -0.461    0.690
WE              0.54739    0.55821   0.981    0.430
ASR            -0.02719    0.03734  -0.728    0.542

Residual standard error: 1.23 on 2 degrees of freedom
Multiple R-squared:  0.9991,	Adjusted R-squared:  0.9944 
F-statistic: 212.7 on 10 and 2 DF,  p-value: 0.004688

Tree Based Methods

Tree Based Methods - Detroit Dataset

Type in the variable search term below:

Model Learning Technique Variables Used Mean of Squared Residuals Minimum M.S.E. % Variance Explained Optimal Configuration Optimal Trees/Interaction F-Statistic
Model 38 Random Forest Outcome=HOM Predictors=All other variables 216.2263 212.892 -8.47 7 Predictors 10 Trees
Model 39 Random Forest K-Fold CV Approach Outcome=HOM Predictors=All other variables 216.2263 5.84121 -8.47 8 Predictors 3 Trees
Model 40 Random Forest Boosting K-Fold CV Approach Outcome=HOM Predictors=All other variables N/A N/A N/A N/A N/A
Model 41 Linear Model K-Fold CV Approach Outcome=HOM Predictors=All other variables N/A 210.021 N/A 4 Predictors 9 Trees

Note: N/A = Sample Size too small to utilize method

Best Model Tree Based Methods: The best model of the tree based methods is Model 41 the Linear Model K-Fold CV Approach Model. While it has a higher minimum MSE. The other models have a negative percentage of the variance explained for the data. The optimal number of predictors is 4 and the optimal number of trees is 9.


Validation Set Approach

Validation Set Approach - Detroit Dataset

Type in the variable search term below:

Model Learning Technique Variables Used Minimum M.S.E.
Model 42 Cross Validation Approach Outcome=HOM Predictor=LIC 180.6076
Model 43 K-Fold Cross Validation Outcome=HOM Predictor=LIC 108.0558
Model 44 K-Fold Cross Validation Outcome=HOM Predictor=LIC 179.4411

Best Model Validation Set Approach: The best model for the validation set approach is Model 43. The K-Fold Cross Validation Approach has the lowest MSE 108.0558 out of all of the models in the set.

> cv.error

[1]  182.3850  108.0558  592.6336  519.8474 1318.5811

Results: Best Model Selection - Detorit Dataset

Best Model Selection - Detroit Dataset

Type in the variable search term below:

Model Learning Technique Variables Used Correlation(s)/MSE/RSQ/Adjusted R2/%Variance Explained AIC/BIC/CP Optimal Configuration/F-Statistic/p-value
Model 1 Simple Linear Regression Outcome=HOM Predictor=FTP Correlation=0.964
Model 2 Multiple Linear Regression Outcome=HOM Predictors=FTP, MAN, LIC, GR,WM,NMAN,GOV,HE,WE,ASR Multiple R-Squared=0.9998 ADJR2=0.9986 p-value=0.001188
Model 3 Subset Selection Method=Backward Outcome=HOM Predictors=All Other Variables Multiple R-Squared=0.999 ADJR2=0.999 BIC = 8
Model 4 Generalized Additive Model / Polynomial Regression Outcome=HOM Predictors=FTP,LIC,NMAN,GOV^2,HE^3,WE,ASR Multiple R-Squared=0.999 ADJR2=0.9944 p-value=0.004688
Model 5 Tree Based Methods Linear Model K-Fold CV Approach Outcome=HOM Predictors=All other variables Optimal Configuration= 4 Predictors 9 Trees
Model 6 Validation Set Approach Outcome=HOM Predictor=LIC MSE = 108.0558
> anova(gam.detroitmodel44)

Analysis of Variance Table

Response: HOM
             Df  Sum Sq Mean Sq   F value    Pr(>F)    
FTP           1 2994.37 2994.37 1978.9471 0.0005049 ***
LIC           1  151.19  151.19   99.9199 0.0098602 ** 
NMAN          1   30.18   30.18   19.9431 0.0466611 *  
poly(GOV, 2)  2   18.27    9.13    6.0363 0.1421203    
poly(HE, 3)   3   23.30    7.77    5.1325 0.1673830    
WE            1    0.65    0.65    0.4326 0.5782914    
ASR           1    0.80    0.80    0.5300 0.5423085    
Residuals     2    3.03    1.51                        
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

summary(gam.detroitmodel44)


Call:
lm(formula = HOM ~ FTP + LIC + NMAN + poly(GOV, 2) + poly(HE, 
    3) + WE + ASR, data = Detroit_Data_Ben_Gonzalez)

Residuals:
        1         2         3         4         5         6         7         8         9        10        11        12 
 0.008567 -0.163223  0.199979 -0.520727  0.990593 -0.700423  0.534176 -0.012935 -0.837255  0.361341  0.277549 -0.149193 
       13 
 0.011550 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)   -70.16794  127.52402  -0.550    0.637
FTP            -0.29850    0.31798  -0.939    0.447
LIC             0.03193    0.01821   1.753    0.222
NMAN            0.12534    0.18927   0.662    0.576
poly(GOV, 2)1   1.33967   48.34475   0.028    0.980
poly(GOV, 2)2  40.96870   38.86995   1.054    0.402
poly(HE, 3)1  -42.82354   75.21346  -0.569    0.627
poly(HE, 3)2   -3.68839   10.90049  -0.338    0.767
poly(HE, 3)3   -3.24754    7.03880  -0.461    0.690
WE              0.54739    0.55821   0.981    0.430
ASR            -0.02719    0.03734  -0.728    0.542

Residual standard error: 1.23 on 2 degrees of freedom
Multiple R-squared:  0.9991,	Adjusted R-squared:  0.9944 
F-statistic: 212.7 on 10 and 2 DF,  p-value: 0.004688

Best Model Selection: The best model for the Detroit Dataset is Model 4 in the best model selection table. The learning technique is the generalized additive model-polynomial regression. This model has the highest adjusted r-squared utilizing the least amount of predictors at 7. The p-value is 0.004688 and is the lowest for this number of predictors. The significant predictors for the outcome variable are FTP(full time police), LIC( Gun Licenses), and NMAN(Non-Manufacturing Jobs).