Car crashes and the number of fatal victims: a Peruvian case

Multiple linear regression
Car Crash
Does the number of vehicles involved in a car crash influence the number of fatal victims?
Author

Daniel Calenzani

Published

March 12, 2023

General considerations:

Does the number of vehicles involved in a car crash have an impact on the number of fatal victims? We used the national census of Peruvian police precincts (INEI, 2017) to find out. However, this analysis is, in essence, an exercise in applying the regression approach to publicly accessible databases.

We’ve divided the analysis into 8 steps, following the guide proposed by Rebecca Bevans in Linear Regression in R - A Step-by-Step Guide & Examples.

Step 1: Install and load the packages

library(ggplot2)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(broom)
library(ggpubr)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ lubridate 1.9.2     ✔ tibble    3.1.8
✔ purrr     1.0.1     ✔ tidyr     1.3.0
✔ readr     2.1.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)
library(haven)
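
The step title mentions installation, but the chunk above only loads packages. A minimal sketch of how missing packages could be detected first, assuming the same seven packages; the names `required` and `missing_pkgs` are ours, and the install call is guarded so it only runs in an interactive session:

```r
# Packages loaded in this post
required <- c("ggplot2", "dplyr", "broom", "ggpubr", "tidyverse", "readr", "haven")

# requireNamespace() returns FALSE (instead of failing) when a package is absent
is_installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required[!is_installed]

# Install only what is missing, and only when run interactively
if (length(missing_pkgs) > 0 && interactive()) {
  install.packages(missing_pkgs)
}
```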

Step 2: Load the data into R

# Load the database
data_at <- read_sav("data/db_censo_comisarias.sav")
summary(data_at)
     ID_N              UBIGEO              CCDI             NOMBREDI        
 Length:68811       Length:68811       Length:68811       Length:68811      
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
      ANIO          ID_AT             AT_N         AT_NALF         
 Min.   :2016   Min.   :  1.00   Min.   : NA     Length:68811      
 1st Qu.:2016   1st Qu.: 10.00   1st Qu.: NA     Class :character  
 Median :2016   Median : 25.00   Median : NA     Mode  :character  
 Mean   :2016   Mean   : 29.52   Mean   :NaN                       
 3rd Qu.:2017   3rd Qu.: 46.00   3rd Qu.: NA                       
 Max.   :2017   Max.   :200.00   Max.   : NA                       
                                 NA's   :68811                     
     AT_TOT           AT101         AT101_O           AT101_NRO        
 Min.   :  1.00   Min.   :1.000   Length:68811       Length:68811      
 1st Qu.: 38.00   1st Qu.:3.000   Class :character   Class :character  
 Median : 75.00   Median :5.000   Mode  :character   Mode  :character  
 Mean   : 58.03   Mean   :3.996                                        
 3rd Qu.: 75.00   3rd Qu.:5.000                                        
 Max.   :200.00   Max.   :6.000                                        
                                                                       
  AT101_ALF            AT102             AT103_D            AT103_M         
 Length:68811       Length:68811       Length:68811       Length:68811      
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
   AT103_A            AT103_HOR       AT103_MIN         AT104A     
 Length:68811       Min.   : 0.00   Min.   : 0.00   Min.   :1.000  
 Class :character   1st Qu.: 9.00   1st Qu.: 5.00   1st Qu.:2.000  
 Mode  :character   Median :14.00   Median :30.00   Median :4.000  
                    Mean   :14.96   Mean   :24.46   Mean   :3.667  
                    3rd Qu.:18.00   3rd Qu.:40.00   3rd Qu.:4.000  
                    Max.   :99.00   Max.   :99.00   Max.   :8.000  
                                                                   
   AT104A_O             AT104B        AT104B_O          AT104C_V1        
 Length:68811       Min.   :1.000   Length:68811       Length:68811      
 Class :character   1st Qu.:2.000   Class :character   Class :character  
 Mode  :character   Median :2.000   Mode  :character   Mode  :character  
                    Mean   :2.623                                        
                    3rd Qu.:2.000                                        
                    Max.   :7.000                                        
                                                                         
  AT104C_R1          AT104C_V2          AT104C_R2         AT104D_DPTO       
 Length:68811       Length:68811       Length:68811       Length:68811      
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
 AT104D_PROV        AT104D_DIST            AT105         AT105_O         
 Length:68811       Length:68811       Min.   : 1.00   Length:68811      
 Class :character   Class :character   1st Qu.: 4.00   Class :character  
 Mode  :character   Mode  :character   Median : 5.00   Mode  :character  
                                       Mean   : 5.39                     
                                       3rd Qu.: 8.00                     
                                       Max.   :12.00                     
                                                                         
    AT106_1        AT106_1_CANT      AT106_2         AT106_2_CANT  
 Min.   :0.0000   Min.   :1.00    Min.   :0.00000   Min.   :1.00   
 1st Qu.:0.0000   1st Qu.:1.00    1st Qu.:0.00000   1st Qu.:1.00   
 Median :0.0000   Median :1.00    Median :0.00000   Median :1.00   
 Mean   :0.3613   Mean   :1.19    Mean   :0.05586   Mean   :1.04   
 3rd Qu.:1.0000   3rd Qu.:1.00    3rd Qu.:0.00000   3rd Qu.:1.00   
 Max.   :1.0000   Max.   :8.00    Max.   :1.00000   Max.   :3.00   
                  NA's   :43947                     NA's   :64967  
    AT106_3         AT106_3_CANT      AT106_4        AT106_4_CANT  
 Min.   :0.00000   Min.   :1.00    Min.   :0.0000   Min.   :1.00   
 1st Qu.:0.00000   1st Qu.:1.00    1st Qu.:0.0000   1st Qu.:1.00   
 Median :0.00000   Median :1.00    Median :0.0000   Median :1.00   
 Mean   :0.08968   Mean   :1.06    Mean   :0.1725   Mean   :1.09   
 3rd Qu.:0.00000   3rd Qu.:1.00    3rd Qu.:0.0000   3rd Qu.:1.00   
 Max.   :1.00000   Max.   :4.00    Max.   :1.0000   Max.   :6.00   
                   NA's   :62640                    NA's   :56939  
    AT106_5          AT106_5_CANT      AT106_6         AT106_6_CANT  
 Min.   :0.000000   Min.   :1.00    Min.   :0.00000   Min.   :1.00   
 1st Qu.:0.000000   1st Qu.:1.00    1st Qu.:0.00000   1st Qu.:1.00   
 Median :0.000000   Median :1.00    Median :0.00000   Median :1.00   
 Mean   :0.008676   Mean   :1.03    Mean   :0.05389   Mean   :1.05   
 3rd Qu.:0.000000   3rd Qu.:1.00    3rd Qu.:0.00000   3rd Qu.:1.00   
 Max.   :1.000000   Max.   :2.00    Max.   :1.00000   Max.   :4.00   
                    NA's   :68214                     NA's   :65103  
    AT106_7         AT106_7_CANT      AT106_8         AT106_8_CANT  
 Min.   :0.00000   Min.   :1.00    Min.   :0.00000   Min.   :1.00   
 1st Qu.:0.00000   1st Qu.:1.00    1st Qu.:0.00000   1st Qu.:1.00   
 Median :0.00000   Median :1.00    Median :0.00000   Median :1.00   
 Mean   :0.02042   Mean   :1.06    Mean   :0.08654   Mean   :1.05   
 3rd Qu.:0.00000   3rd Qu.:1.00    3rd Qu.:0.00000   3rd Qu.:1.00   
 Max.   :1.00000   Max.   :3.00    Max.   :1.00000   Max.   :4.00   
                   NA's   :67406                     NA's   :62856  
    AT106_9         AT106_9_CANT      AT106_10        AT106_10_CANT  
 Min.   :0.00000   Min.   :1.00    Min.   :0.000000   Min.   :1.00   
 1st Qu.:0.00000   1st Qu.:1.00    1st Qu.:0.000000   1st Qu.:1.00   
 Median :0.00000   Median :1.00    Median :0.000000   Median :1.00   
 Mean   :0.04305   Mean   :1.11    Mean   :0.008967   Mean   :1.03   
 3rd Qu.:0.00000   3rd Qu.:1.00    3rd Qu.:0.000000   3rd Qu.:1.00   
 Max.   :1.00000   Max.   :4.00    Max.   :1.000000   Max.   :3.00   
                   NA's   :65849                      NA's   :68194  
    AT106_11       AT106_11_CANT      AT106_12        AT106_12_O       
 Min.   :0.00000   Min.   :1.00    Min.   :0.00000   Length:68811      
 1st Qu.:0.00000   1st Qu.:1.00    1st Qu.:0.00000   Class :character  
 Median :0.00000   Median :1.00    Median :0.00000   Mode  :character  
 Mean   :0.07667   Mean   :1.12    Mean   :0.00372                     
 3rd Qu.:0.00000   3rd Qu.:1.00    3rd Qu.:0.00000                     
 Max.   :1.00000   Max.   :5.00    Max.   :1.00000                     
                   NA's   :63535                                       
 AT106_12_CANT      AT106_13      AT106_13_CANT      AT106_14     
 Min.   :1.00    Min.   :0.0000   Min.   :1.00    Min.   :0.0000  
 1st Qu.:1.00    1st Qu.:0.0000   1st Qu.:1.00    1st Qu.:0.0000  
 Median :1.00    Median :0.0000   Median :1.00    Median :0.0000  
 Mean   :1.01    Mean   :0.2003   Mean   :1.06    Mean   :0.1692  
 3rd Qu.:1.00    3rd Qu.:0.0000   3rd Qu.:1.00    3rd Qu.:0.0000  
 Max.   :2.00    Max.   :1.0000   Max.   :3.00    Max.   :1.0000  
 NA's   :68555                    NA's   :55028                   
 AT106_14_CANT      AT106_15        AT106_15_CANT      AT106_16       
 Min.   : 0.00   Min.   :0.000000   Min.   :1.00    Min.   :0.000000  
 1st Qu.: 1.00   1st Qu.:0.000000   1st Qu.:1.00    1st Qu.:0.000000  
 Median : 1.00   Median :0.000000   Median :1.00    Median :0.000000  
 Mean   : 1.12   Mean   :0.003081   Mean   :1.02    Mean   :0.008022  
 3rd Qu.: 1.00   3rd Qu.:0.000000   3rd Qu.:1.00    3rd Qu.:0.000000  
 Max.   :10.00   Max.   :1.000000   Max.   :2.00    Max.   :1.000000  
 NA's   :57167                      NA's   :68599                     
 AT106_16_CANT      AT106_17        AT106_17_CANT      AT106_18
 Min.   :1.00    Min.   :0.000000   Min.   :1       Min.   :0  
 1st Qu.:1.00    1st Qu.:0.000000   1st Qu.:1       1st Qu.:0  
 Median :1.00    Median :0.000000   Median :1       Median :0  
 Mean   :1.02    Mean   :0.001061   Mean   :1       Mean   :0  
 3rd Qu.:1.00    3rd Qu.:0.000000   3rd Qu.:1       3rd Qu.:0  
 Max.   :2.00    Max.   :1.000000   Max.   :1       Max.   :0  
 NA's   :68259                      NA's   :68738              
  AT106_18_O        AT106_18_CANT      AT107_1          AT107_2      
 Length:68811       Min.   : NA     Min.   :0.0000   Min.   :0.0000  
 Class :character   1st Qu.: NA     1st Qu.:0.0000   1st Qu.:0.0000  
 Mode  :character   Median : NA     Median :0.0000   Median :1.0000  
                    Mean   :NaN     Mean   :0.2991   Mean   :0.5817  
                    3rd Qu.: NA     3rd Qu.:1.0000   3rd Qu.:1.0000  
                    Max.   : NA     Max.   :1.0000   Max.   :1.0000  
                    NA's   :68811                                    
    AT107_3           AT108          AT108_1         AT108_2      
 Min.   :0.0000   Min.   :1.000   Min.   : 1.00   Min.   : 0.000  
 1st Qu.:0.0000   1st Qu.:2.000   1st Qu.: 1.00   1st Qu.: 1.000  
 Median :0.0000   Median :2.000   Median : 1.00   Median : 1.000  
 Mean   :0.2245   Mean   :2.332   Mean   : 1.28   Mean   : 1.609  
 3rd Qu.:0.0000   3rd Qu.:3.000   3rd Qu.: 1.00   3rd Qu.: 2.000  
 Max.   :1.0000   Max.   :3.000   Max.   :24.00   Max.   :72.000  
                                  NA's   :66355   NA's   :25305   
    AT108_3          AT109_1          AT109_2           AT109_3        
 Min.   : 0.000   Min.   :0.0000   Min.   :0.00000   Min.   :0.000000  
 1st Qu.: 1.000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.000000  
 Median : 1.000   Median :0.0000   Median :0.00000   Median :0.000000  
 Mean   : 1.278   Mean   :0.2434   Mean   :0.01888   Mean   :0.006336  
 3rd Qu.: 2.000   3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.000000  
 Max.   :66.000   Max.   :1.0000   Max.   :1.00000   Max.   :1.000000  
                                                                       
    AT109_4            AT109_5           AT109_6          AT109_7       
 Min.   :0.000000   Min.   :0.00000   Min.   :0.0000   Min.   :0.00000  
 1st Qu.:0.000000   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.00000  
 Median :0.000000   Median :0.00000   Median :0.0000   Median :0.00000  
 Mean   :0.006903   Mean   :0.08513   Mean   :0.2281   Mean   :0.03225  
 3rd Qu.:0.000000   3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:0.00000  
 Max.   :1.000000   Max.   :1.00000   Max.   :1.0000   Max.   :1.00000  
                                                                        
    AT109_8           AT109_9           AT109_10           AT109_11      
 Min.   :0.00000   Min.   :0.00000   Min.   :0.000000   Min.   :0.00000  
 1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.:0.00000  
 Median :0.00000   Median :0.00000   Median :0.000000   Median :0.00000  
 Mean   :0.03119   Mean   :0.05479   Mean   :0.007106   Mean   :0.02375  
 3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.000000   3rd Qu.:0.00000  
 Max.   :1.00000   Max.   :1.00000   Max.   :1.000000   Max.   :1.00000  
                                                                         
    AT109_12           AT109_13           AT109_14           AT109_15     
 Min.   :0.000000   Min.   :0.000000   Min.   :0.000000   Min.   :0.0000  
 1st Qu.:0.000000   1st Qu.:0.000000   1st Qu.:0.000000   1st Qu.:0.0000  
 Median :0.000000   Median :0.000000   Median :0.000000   Median :0.0000  
 Mean   :0.006336   Mean   :0.008371   Mean   :0.001119   Mean   :0.1537  
 3rd Qu.:0.000000   3rd Qu.:0.000000   3rd Qu.:0.000000   3rd Qu.:0.0000  
 Max.   :1.000000   Max.   :1.000000   Max.   :1.000000   Max.   :1.0000  
                                                                          
    AT109_16          AT109_17           AT109_18         AT109_19      
 Min.   :0.00000   Min.   :0.000000   Min.   :0.0000   Min.   :0.00000  
 1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.:0.0000   1st Qu.:0.00000  
 Median :0.00000   Median :0.000000   Median :0.0000   Median :0.00000  
 Mean   :0.00981   Mean   :0.001177   Mean   :0.2867   Mean   :0.02758  
 3rd Qu.:0.00000   3rd Qu.:0.000000   3rd Qu.:1.0000   3rd Qu.:0.00000  
 Max.   :1.00000   Max.   :1.000000   Max.   :1.0000   Max.   :1.00000  
                                                                        
  AT109_19_O         Factor_2016      Factor_2017   
 Length:68811       Min.   : 1.000   Min.   : 1.00  
 Class :character   1st Qu.: 1.000   1st Qu.: 1.00  
 Mode  :character   Median : 1.293   Median : 1.00  
                    Mean   : 2.912   Mean   : 2.00  
                    3rd Qu.: 3.080   3rd Qu.: 1.95  
                    Max.   :19.760   Max.   :12.05  
                    NA's   :28751    NA's   :40060  
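
Many columns in this summary are mostly `NA`, which matters because `lm()` drops incomplete rows by default. A small sketch of how missingness in the two columns of interest could be counted; `toy` is a made-up data frame standing in for `data_at`, not real census values:

```r
# Made-up stand-in for data_at (illustrative values only)
toy <- data.frame(
  AT106_1_CANT = c(1, 1, NA, NA, 2, 1),  # automobiles involved
  AT108_2      = c(NA, 2, NA, 1, 3, 0)   # people injured
)

# Missing values per column
na_per_column <- colSums(is.na(toy))

# Rows that have both values and could enter a regression
n_complete <- sum(complete.cases(toy))

print(na_per_column)
print(n_complete)
```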

Step 3: Explore outcome variable (must be numeric)

We checked whether the independent variable, automóviles involucrados (automobiles involved, AT106_1_CANT), is numeric, along with the outcome used in the code below, heridos (number of injured people, AT108_2). A general look at the database shows that this count covers only sedan or hatchback cars; other vehicles such as trucks, station wagons, buses or vans are excluded from it.

The variable could arguably be treated as categorical. Although it takes numeric values, they only range from 1 to 8 in this dataset. In addition, every value belongs to \(\mathbb{N}\), as you can’t have half a vehicle involved in a car accident, which furthers the argument for a categorical analysis. Linear regression is possible with a single categorical independent variable, but its results need to be interpreted differently.

For practice purposes we’ll treat it as a numeric variable, since an analysis with a categorical variable leaves out many of the interpretations available for a numeric one.

heridos <- data_at$AT108_2
vehiculos <- data_at$AT106_1_CANT
glimpse(heridos) #Numeric
 num [1:68811] NA 2 NA 1 1 1 0 NA 1 1 ...
 - attr(*, "label")= chr "NÚMERO DE HERIDOS"
 - attr(*, "format.spss")= chr "F11.0"
 - attr(*, "display_width")= int 11
glimpse(vehiculos) #Numeric
 num [1:68811] 1 1 NA NA NA 1 1 NA NA NA ...
 - attr(*, "label")= chr "VEHÍCULO MAYOR  AUTOMÓVIL - CANTIDAD"
 - attr(*, "format.spss")= chr "F11.0"
 - attr(*, "display_width")= int 11
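
To make the numeric-versus-categorical choice concrete, the sketch below fits both versions on a made-up pair of vectors (the values are illustrative, not census data). As a number, the predictor gets a single slope; wrapped in `factor()`, each level after the first gets its own coefficient.

```r
# Made-up illustrative values, not census data
vehicles <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
injured  <- c(0, 1, 2, 1, 2, 3, 2, 2, 5)

# How often each whole-number value occurs
print(table(vehicles))

# Numeric predictor: intercept + one slope
fit_numeric <- lm(injured ~ vehicles)

# Categorical predictor: intercept (mean of level 1) + one offset per other level
fit_categorical <- lm(injured ~ factor(vehicles))

print(coef(fit_numeric))
print(coef(fit_categorical))
```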

Step 4: Make sure the data meet the assumptions

A. Linearity

When checking for linearity we can see that the points are not scattered along a line; they form columns at each whole-number value of the predictor. They do not form any sort of bell curve either, and most of the cases had only 1 automobile involved.

plot(heridos ~ vehiculos, data = data_at)
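
Because the predictor only takes whole values, the scatterplot stacks points in columns. One way to still judge linearity is to compare the mean outcome at each predictor value: under a linear relation the means should change by roughly the same amount per step. A sketch on made-up values (not census data):

```r
# Made-up illustrative values, not census data
vehicles <- c(1, 1, 1, 1, 2, 2, 2, 3, 3)
injured  <- c(1, 0, 2, 1, 2, 1, 3, 3, 2)

# Mean number injured at each vehicle count
group_means <- tapply(injured, vehicles, mean)
print(group_means)

# Change in the mean from one vehicle count to the next
print(diff(group_means))
```

With the real vectors, the equivalent call would be `tapply(heridos, vehiculos, mean, na.rm = TRUE)`, since both contain missing values.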

B. Independence of observations

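Theoretically, we expect each observation (each recorded accident) to be independent of the others. Since the model has a single independent variable, there are no relationships among predictors to check.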

C. Normality

When checking for normality, the histogram is not bell-shaped: it is skewed to the right, with values concentrated at the low end of the independent variable. Most of the cases had 1 or 2 automobiles involved.

hist(vehiculos)

# Normality can also be assessed with a Q-Q plot or kurtosis
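
The comment above mentions Q-Q plots and kurtosis. A base-R sketch of both, using made-up counts in place of `vehiculos`; the moment-based formulas below are one common definition, and the helper name `central_moment` is ours:

```r
# Made-up right-skewed counts standing in for vehiculos
x <- c(1, 1, 1, 1, 1, 1, 2, 2, 3, 5)

# Q-Q plot: points far from the reference line suggest non-normality
qqnorm(x)
qqline(x)

# Moment-based skewness and kurtosis (a normal distribution has 0 and 3)
central_moment <- function(v, k) mean((v - mean(v))^k)
skewness <- central_moment(x, 3) / central_moment(x, 2)^1.5
kurtosis <- central_moment(x, 4) / central_moment(x, 2)^2

print(skewness)
print(kurtosis)
```

With the real data, missing values would have to be removed first, for example with `x <- vehiculos[!is.na(vehiculos)]`.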

Step 5: Perform the linear regression analysis

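In the analysis we see that the p-value is very low (< 2e-16), so the estimated slope is statistically significant. There appears to be a significant positive relationship between vehicles and people hurt in car accidents, with an increase of 0.44 people hurt for every additional vehicle involved. The fit itself is weak, though: with an R-squared of 0.0154, the model explains only about 1.5% of the variation in the number of people hurt.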

However, if we treat the number of vehicles as a categorical variable, the linearity assumption no longer applies, and the relation can be read as a difference in the average number of people hurt between categories.

data_at_lm <- lm(heridos ~ vehiculos, data = data_at)
summary(data_at_lm)

Call:
lm(formula = heridos ~ vehiculos, data = data_at)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.3146 -0.5449 -0.5449  0.4551 27.4551 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.10247    0.03667   30.06   <2e-16 ***
vehiculos    0.44243    0.03089   14.32   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.307 on 13116 degrees of freedom
  (55693 observations deleted due to missingness)
Multiple R-squared:  0.0154,    Adjusted R-squared:  0.01533 
F-statistic: 205.2 on 1 and 13116 DF,  p-value: < 2.2e-16
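
To read the coefficients concretely, the fitted line can be evaluated by hand. The sketch below copies the two estimates from the output above and computes the expected number of people hurt for 1, 2 and 3 automobiles:

```r
# Estimates copied from the summary output above
intercept <- 1.10247
slope     <- 0.44243

automobiles <- 1:3
expected_hurt <- intercept + slope * automobiles

print(data.frame(automobiles, expected_hurt))
```

For one automobile the expected value is about 1.54, which is why a crash with one automobile and one person hurt has a residual of about -0.54, the median residual reported above. With the fitted object at hand, `predict(data_at_lm, newdata = data.frame(vehiculos = 1:3))` should give the same numbers.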

Step 6: Check for homoscedasticity

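The residuals show bias, as most of them are concentrated at the first two values of the predictor. Note that treating the “vehicles” variable as categorical would not let us skip this check, since the residual variance would still need to be similar across categories.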

par(mfrow=c(2,2))
plot(data_at_lm)

par(mfrow=c(1,1))
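
As a numeric complement to the diagnostic plots, the spread of the residuals can be compared across predictor values; similar standard deviations are what homoscedasticity would look like. A sketch on made-up data (not census values):

```r
# Made-up illustrative values, not census data
vehicles <- c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3)
injured  <- c(1, 1, 2, 0, 2, 1, 3, 2, 1, 6, 2, 3)

fit <- lm(injured ~ vehicles)

# Standard deviation of the residuals at each vehicle count
resid_sd <- tapply(resid(fit), vehicles, sd)
print(resid_sd)
```

With `data_at_lm`, the grouping vector would have to come from `model.frame(data_at_lm)$vehiculos`, because rows with missing values are dropped from the residuals.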

Step 7: Visualize the results with a graph

With the previous steps done, we can graph the linear model and see that the number of vehicles involved has a relation (even though a small one) to the number of people hurt in the accident.

graph <- ggplot(data_at, aes(x=vehiculos, y=heridos)) + geom_point()
graph <- graph + geom_smooth(method='lm', col='blue')
graph <- graph + stat_regline_equation(label.x = 3, label.y = 7)
graph
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 55693 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 55693 rows containing non-finite values
(`stat_regline_equation()`).
Warning: Removed 55693 rows containing missing values (`geom_point()`).

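# Fit the same model as above: heridos as a function of vehiculos
relation <- lm(heridos ~ vehiculos)
# Name the output file for the chart
png(file = "linearregression.png")
# Plot the points, then add the fitted regression line
plot(vehiculos, heridos, col = "blue", main = "Regression of vehicles and injured people in car crashes",
     cex = 1.3, pch = 16, xlab = "vehiculos", ylab = "heridos")
abline(relation)
# Close the graphics device so the file is written to disk
dev.off()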

Step 8: Report and interpret your results

From this analysis we can conclude that the number of automobiles involved in a car accident is related to the number of wounded people. In our next analysis we will sum the numbers of the different types of vehicles involved, so we can get a complete picture of the relation between vehicles involved and the total number of wounded people.
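
A sketch of how that total could be built, assuming every vehicle-type count sits in a column named `AT106_<number>_CANT` (as the summary in Step 2 suggests) and that an `NA` count means no vehicle of that type was involved. That reading of `NA` is our assumption and should be checked against the census documentation. `toy` is a made-up stand-in for `data_at`, and `vehiculos_total` is a hypothetical column name:

```r
# Made-up stand-in for data_at with a few of the vehicle-type columns
toy <- data.frame(
  AT106_1_CANT  = c(1, NA, 2),
  AT106_2_CANT  = c(NA, 1, NA),
  AT106_12_O    = c("", "otro", ""),  # text column, must be excluded
  AT106_13_CANT = c(1, NA, 1)
)

# Keep only the numeric count columns
count_cols <- grep("^AT106_[0-9]+_CANT$", names(toy), value = TRUE)

# Total vehicles per accident, treating NA as zero (assumption)
toy$vehiculos_total <- rowSums(toy[, count_cols], na.rm = TRUE)
print(toy$vehiculos_total)
```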

We will also try to create an HTML site where you can select the type of vehicle you wish to add to the analysis.