P REDICTING SEVERE WILDLIFE VEHICLE CRASHES (WVC S ) ON N EW H AMPSHIRE ROADS USING A HYBRID GENERALIZED ADDITIVE MODEL

: Across the United States, wildlife vehicle crashes (WVCs) are increasing and remain consistently deadly to drivers, despite a downward trend in fatal automobile accidents overall. That said, the factors related to severe WVCs are unclear. With this in mind, we pursued a statistical model to reveal factors associated with WVCs that result in severe injury or death to drivers. We hypothesize that there are statistically significant interactions and non-linear relationships between these factors and severity occurrence. We developed a generalized additive model (GAM) with linear terms, additive terms, and a binary response for severity. We surmise that our fitted model results will quantify the relationship between significant variables and severity occurrence, and ultimately help to develop countermeasures to mitigate serious injury. The model was fitted to WVC records occurring between 2002 and 2019 in the state of New Hampshire. Fitted linear terms revealed: 1) in inclement weather, there is about a 22% increase in the odds of severity for slick surface conditions compared to dry surface conditions; 2) for the warmer months (spring/summer), there is a 42% decrease in the odds of severity for straight roads compared to those with curvature/incline; 3) for highways, the odds of severity decreases by 48% for accidents occurring on NH’s two major intestates highways, and 4) for spring/summer (as compared to the fall/winter), there is more than a 3-fold increase in the odds of severity for two-way traffic. Fitted additive terms revealed: 1) the odds of severity increased in the early hours, between midnight and 6AM, and after 5PM; 2) speeds between 45 and 60 mph are associated with an increase in the odds of a severe accident, while both lower and higher speeds (those below 45 and above 60 mph) are associated with a decrease in the odds of a severe accident; and 3) low, mid-range, and high human population densities are associated with decreases, increases, and decreases in odds of severity, respectively. Cross validation and resulting ROC curves gave evidence that our model is well specified and an effective predictor. Results could be used to inform drivers of potentially dangerous roadways/conditions/times.


Introduction
Currently, in the United States, it is estimated that there are between one and two million crashes between motor vehicles and large animals each year (Skroch and St. Hilaire, 2021).Most recently, according to State Farm Insurance Company, during the 12-month period between July of 2020 and June of 2021, this number exceeded 2 million (Ortega, 2022).Such accidents result in approximately 26,000 human injuries and at least $8 billion in property damage per year, and in some rural states, these wildlife-vehicle crashes (WVCs) represent as many as 20% of all reported collisions (Skroch and St. Hilaire, 2021).As stated by Huiser et al. (2008), WVCs are amongst the leading concerns for road safety in the rural areas of North America Despite an overall flattening trend of automobile accidents across the US, WVCs are increasing, likely caused by increases in both travel and some animal populations (Huiser et al., 2008).Furthermore, according to the Insurance Institute for Highway Safety (IIHS), the number of deaths due to WVCs been consistent since the mid-2000s, despite the fact that fatal automobile accidents have been generally decreasing since 1975 (IIHS, 2022).In 2020 alone, there were 202 reported human deaths due to WVCs in the United States (IIHS, 2022), two of which occurred in NH, the focal area of this study.The number of deaths due to WVCs vary considerably from state to state, but totals typically reflect the size of the driving population, the size of the animal population, and the percentage of rural roads (IIHS, 2022).With all this in mind, we investigate WVCs with a particular focus on those that result in injury to the driver.We aim to identify the characteristics that distinguish serious WVCs from their less serious counterparts, to model the severity of the human injury.More specifically, our working hypothesis is that there are statistically significant interactions between predictors; season, time-of-day, weather conditions, roadway features/geometry, speed limit, etc.; as well as non-linear relationships between such predictors and severity occurrence.To this end, we conducted a multiple logistic regression analysis with 20 potential variables (not including interactions between variables), carefully selected based on existing literature focused on similar studies (See Literature review).We surmise that our fitted model results will allow us to quantify the relationship between significant variables and the occurrence of severe WVCs.Then, as stated by Gharraie and Sacchi (2020), after identifying factors that distinguish severe accidents, effective countermeasures may be pursued to mitigate the degree of human injury sustained in association with WVCs.To date, there are many thorough works investigating the characteristics associated with WVCs, or essentially when and under what circumstances WVCs occur.However, only a minority of these explicitly model the severity of the WVC.Additionally, our work makes use of a hybrid model, a flexible platform comprised of a variety of linear terms (including interactions) and additive terms.

Literature review
In the following sections, we present the relevant literature used to inform our procedure.First, we review works identifying characteristics/factors/variables associated with WVC occurrence/frequency (Section 2.1).These works do not address the severity of the accidents, as in this analysis, but such literature can help to inform our procedure.That is, many of the characteristics associated with WVC occurrence/frequency are also likely to help distinguish WVC severity.Next, we review works analyzing accident severity of automobile accidents (Section 2.2).There is a rich field of work in predicting accident severity and identifying the factors that distinguish accidents of varying severity, but few devoted to WVCs specifically.Lastly, and most relevant to our work, we review the few works devoted to studying accident severity among WVCs (Section 2.3).

Characteristics related to WVC occurrence/frequency 2.1.1. Time-of-day
In their review of WVCs in Texas between 2010 and 2016, Wilkins et al. (2019), found peaks in occurrence between 5 and 8 AM and between 5 and 10 PM.In a study of moose and deer in Finland, Haikonen and Summala (2001) found crash peaks an hour after sunset for deer, and generally that WVCs are more likely to occur at night when some species are more active.Rowden et al. (2006) studied WVCs in Australia and found that nighttime trips and areas with higher speeds were a major risk factor compared to other accident types.Bil et al. (2017) found that Roe deer account for the vast majority of WVCs along the Czech road system, and most of these occurred within a few hours of sunrise and sunset.In an older work, Hughes et al. (1996) analyzed several years of WVCs across the United States.They found, among other trends, that the greatest number of animal crashes occurred during the early morning hours (5 to 8 a.m.) and the night hours (6 p.m. to midnight).In a study of WVCs (ungulates) from Lithuania between 2002 and 2017, Kucas and Balčiauskas (2020) found that most took place before or during sunrise and after or during sunset.In a review of literature/studies associated with deertraffic collisions on roads and railways, Steiner et al. (2014), using cluster analysis, summarized that, despite some seasonal dissimilarity, diurnal patterns exist with pronounced peaks during the hours of dusk and dawn.

Day-of-week, season, etc.
Kucas and Balčiauskas (2020) found peak WVC levels early in the morning and late in the evening during the spring season, and late mornings and early evenings during the winter season.Nationally, and based on several years of data, the State Farm Insurance Company website reports the most dangerous months for animal collisions are November, October and December, in this order (Ortega, 2022).More recently in 2020, according to the IIHS, the highest number of deaths in collisions with animals occurred during July-September (IIHS, 2022).We should note that these reports are based on national aggregates of all states, and seasonal trends could be more regionally nuanced.Garriga et al. (2017) studied WVCs from the Iberian Peninsula and found that, overall, roadkill pattern was seasonal with peaks in autumn and spring.In a study of WVCs involving Moose in Newfoundland (Canada), Joyce and Mahoney (2001) found the vast majority of accidents occurring between June and October.Germane to this analysis, Cherry et Gkritza et al. (2010) found that the frequency of deer-vehicle crashes in urban deer management zones is higher in zones with higher deer density, with the exception of zones with a higher percentage of land within the city limits.This, according to the authors, is likely identifying an effect of human settlement on deer habitats.The previously mentioned work of Wilkins et al. (2019) identified that most WVCs in Texas occur at night in unlit locations, usually on rural roads with very low traffic volumes.Using deer accident data collected from Irish mortorways, Liu et al. (2018) used logistic regression to find that WVCs were more likely to occur on road segments where the distance to the nearest forest on the west side of the road was greater.Using a threshold analysis, the previously mentioned Valerio et al. (2012) identified the relative roadkill abundance increased mostly in landscapes with anthropogenic land cover classes, such as complex cultivations, orchards, or urban surfaces.

Weather
Using a logistic regression model and two years of roadkill data from Brazil, Ascensao et al. (2019) found that both temperature and precipitation had significant effects on accident occurrence, but the exact nature of these effects was dependent on animal species.Machado et al. (2015), using roadkill data collected over a 6-month from Brazilian roadways, found that the highest number of roadkill occurring during the wet, rainy season.In their study of moose accidents, Dussault et al. (2006) found that the probability of a WVC increased when air temperature and atmospheric pressure were high.Olsen et al. (2015) investigated the effect of winter conditions (precipitation and snow depth) on DVC rates.The authors found that deer populations in the Rocky Mountains were typically found further from roads and crossed high volume roads less often when snow depths were lower, thus resulting in lower DVC rates.

Models for Accident Severity (non-WVCs)
To date, there are numerous, fine works that investigating accident severity associated with roadway collisions.These works use a variety of methodologies, and we mention several here to highlight the diversity of these analyses.Using an ordered probit models and accident data from roadways, intersections, and toll plazas in Florida, Abdel-Aty (2003) identified the significance of driver's age, gender, seatbelt use, point of impact, speed, and vehicle type on the injury severity.Zhang et al. (2021) used a hierarchical ordered probit model (HOPIT) to study driver injury severity in left-turn crash data collected in Utah between 2010 and 2017.Although results vary between season and type of accident, they found that factors like snowfall, drug use, high speed limits, position of turn signal, nighttime conditions, and head-on collisions are associated with increases in injury severity.Huang et al. (2008) studied multivehicle crash data from Singapore using Bayesian hierarchical binomial logit models with random effects to account for unobserved factors.They found that crashes occurring in peak time with good lighting and those involving pedestrian injuries tend to be less severe, while crashes that occurring during the night (under certain other conditions as well) tend to be more severe.Eluru et al. (2010) use a copulabased methodology to model injury severity for crashes that involve any number of occupants.The approach was applied to crash data from the 2007 General Estimates System in the United States and results identified inter-occupant dependency in some types of vehicles and how crash characteristics, environmental factors, and roadway attributes affect the injury severity levels of occupants in different seat positions.In a study that considered the combined effects of seatbelt usage, fixed objects, and environmental conditions, Yamamoto et al. (2008) provided evidence that sequential binary probit models outperform the ordered-response probit models.Malyshkina and Mannering (2009) used a Markov switching model in which accident-severity outcomes are generated by separate multinomial logit processes.Tested on Indiana roads, the methodology proved superior to standard multinomial logit models for a number of roadway classes and accident types.Abdelwahab and Abdel-Aty (2001) used neural networks to identify the relationship between several factors and driver injury severity.Using accident data from Florida, and a validation procedure, they found the multi-layer perceptron more accurate than the ordered logit model.Using 2001 accident data from Taipei, Chang and Wang (2006) explored a classification and regression tree (CART) approach, a type of data mining, to establish the relationship between injury severity and driver/vehicle characteristics, highway/environmental variables, and accident variables.While these works above relate to accident severity, none relates specifically to isolated WVCs.There are, however, some works in the literature that related to WVC severity.

Models for accident severity among WVCs
Based on deer population data from Iowa, Gkritza et al. (2010) used a binary logistic regression model to identify the factors associated with accident severity.In their case, a crash was deemed 'not severe' if there was property damage only and 'severe' if there was a possible, minor, major, or fatal injury.They found that crashes on dark roads, dry roads, principal arterials, those with posted speed limit over 55 miles per hour, and those in zones with a larger percentage of cropland were more likely to result in injury.On the other hand, they found that accidents on roads with lower speed limits (below 55), gravel right shoulders, and undivided with wider left shoulders; occurring on Fridays; on roads with a gravel right shoulder and higher traffic volume; and in zones with a higher percentage of roads were less likely to result in injury.Using more than 10,000 WVCs occurring in the province of Saskatchewan (Canada), Gharraie and Sacchi (2020) investigated the severity outcomes of WVCs and their influencing factors via a structural equation modeling (SEM) with generalized (ordered probit) links.They found that speeding attitude and visibility impairment as latent variables positively influenced the overall crash severity, and road surface and weather conditions can indirectly affect crash severity.In an analysis aimed at identifying factors associated with driver injury severity, and to account for unobserved heterogeneity in the data, Al-Bdairi et al. (2020) used a mixed logit model, a mixed logit model with heterogeneity in means, and a mixed logit model with heterogeneity in means and variances.Based on animal accidents occurring in Washington between 2012 and 2016, they found an increased likelihood of severity associated with freeways/expressways, daylight conditions, early morning times, dry road surfaces, and clear weather conditions.Based on 12,000 singlevehicle AVCs from the state of Arizona, Russo (2017) investigated the factors affecting injury severity via an ordered logit statistical model with random parameters to account for unobserved heterogeneity.Among the several driver-, vehicle-, and environmental-related variables found to be significantly associated with injury severity, AVCs involving livestock or pets were more likely to result in severe injury.

Data 3.1.1. Initial data preparation
Initially, two vehicular collision datasets were made available to us: the 2002-2017 IDMS and the 2017-2019 VISION data, both curated by the New Hampshire Department of Safety (NHDOS) and shared with the NH Department of Transportation (NHDOT).Both, essentially, are detailed records for auto accidents occurring in the state of New Hampshire.From 2002-2017, the IDMS was the official data recording platform, but NH state offices transitioned to the VISION platform during 2017.There were 444 duplicate values for 2017, the year of the transition, and we retained only the VISION records for this year.The combined dataset from 2002-2019 contained a total of 605,390 accident records occurring across the state.The dataset included information pertaining to the street, time-of-day, day-ofthe-week, month-of-the-year, severity of the collisions, number of vehicles involved, number of humans injured, road design, road alignment, weather conditions, light conditions, and the crash type.The crash type field was used to exclude non-WVC records as we focus on WVCs in this work.Unfortunately, nearly all WVCs were recorded as 'Animal (Other),' so type of animal (bear, deer, moose, etc.) could not be reliably distinguished.In the end, there were 27,383 WVCs recorded by the state between 2002 and 2019.Also, the IDMS data (2002)(2003)(2004)(2005)(2006)(2007)(2008)(2009)(2010)(2011)(2012)(2013)(2014)(2015)(2016) and the VISION data (2017-2019) were slightly different in terms of attribute names and, occasionally, data format.For such cases, we deferred to the attribute name and format used in the most recent platform, the VISION data (2017-2019).

Spatial analysis
During meetings with regional planners that use the NH collision dataset, we learned that the collision data included information that were not spatially explicit.Specifically, some records completely omitted latitude/longitude information, while others used lat./long.centroids corresponding to municipality centers.Given the project goals, we decided to remove records that did not closely correspond to actual roadways (within 200ft).Identifying such locations was done via ArcMap 10.8.1 and ArcGIS Pro 2.9 using the NH DOT 2021 roadways data layer (herein referred to as "NH Roads") as a reference.

Roadway characteristics
The WVC dataset, at this point, did not include detailed characteristics of the roadways.To resolve this, we used ArcMap 10.8.1 to spatially join the WVC points to the NH Roads GIS layer.This then linked each WVC with additional roadway information such as AADT, number of lanes, road width, shoulder width, functional system tier, road surface (paved/unpaved), and direction.The NH Roads GIS layer does not include posted speed limit, so we used the functional system to identify this attribute.AADT data was not available for all roadways included in the NH Roads layer, so we appended NH Roads with AADT data stored in prior NH roads layers.

Wildlife, human, and road density
Using 2020 data estimates from the NH Department of Fish and Game, we integrated regional deer, bear, and moose populations for each WVC.Deer and moose were chosen as they are believed to be the animals most closely associated with severe accidents in the state.Next, to complement the wildlife Laflamme, E.M., Villamagna, A., Kim, H.J., Archives of Transport, 69 (1), [39][40][41][42][43][44][45][46][47][48][49][50][51][52][53][54][55][56][57]2024 population densities, we calculated road density using the Kernel Density approach in ArcMap 10.8.1 and then extracting these values for WVC locations.Lastly, human population density was calculated using Block Group population data from the 2015-19 American Community Survey (Census Bureau) and dividing this value by the land area of the Block Group geographic area.This data was spatially joined to the WVC locations in ArcMap 10.8.1.

Final dataset
Since our goal is to identify the factors associated with severe WVCs, we first need to identify a severity variable, or in this case, a binary variable distinguishing WVCs that result in serious injury to the driver from those that do not.From the aggregated dataset (described in previous sections), we have 'Severity' information collected at the scene of each WVC.This variable is categorical and distinguishes between 'fatal,' 'incapacitating,' 'non-incapacitating,' 'minor,' and 'serious' injuries to the driver.We defined a new binary variable called 'severity,' and any record originally categorized as 'fatal,' 'incapacitating,' and 'serious,' was given the 'severe' designation.All other records, those associated with minor or no injury, were assigned the 'not severe' designation.This new variable, severity, will be our binary response for our statistical model to be described in subsequent sections.In Al-Bdairi et al.
(2020), despite using three categories for their ordered response, 'severe' accidents mirrored our definition.In that work, the authors defined severe injuries as either incapacitating or fatal.In their severity analysis of deer accidents, Gkritza et al. ( 2010) used a binary response with two discrete outcomes, injury or no-injury.Their 'injury' category included minor injuries, however, so their response is less of a severity distinction and more of a distinction between injury and property damage only.
In our final dataset used for this analysis, we observed 349 severe WVCs, a relatively low number compared to the non-severe accidents.We did initially pursue zero-inflated and Firth-correction models to address this relative scarcity, but we found no substantial change using these remedial models.Thus, we pursued the model form described in the following sections.We acknowledge, however, that fitted coefficients may be deflated due to the relative rarity of outcomes.
Motivated by the current literature (research identifying the factors associated with WVC frequency and severity), the objectives of our project, and the limitations of the data available to us, several variables were considered for our analysis.The table below (Table 1) presents these variables, a short description of each, and simple summary statistics.We note that to simplify the procedure and for ease of interpretation of future results, many categorical variables were discretized into broad categories or converted into binary variables (yes/no).This is a relatively common approach.For example, Gharraie and Sacchi (2020) discretized road type, accident type, time-of-day, day-of-the-week, road surface, lighting, weather, and road alignment into binary variables.Similarly, Malin et al. ( 2019) discretized categories of weather conditions in their investigation of the effects of weather on accident risk.As a final preprocessing step, we 'cleansed' the data by scanning for missing or unusual/suspicious entries.Most commonly, accident records were missing information related to accident severity, posted speed limit, or AADT.These entries were discarded and we were left with a dataset of 15,234 WVC records, each complete with information for the variables listed above and accident severity.This dataset will be used for the model-fitting procedure described below.

Logistic Regression Model
For our statistical model, we opted for a multiple logistic regression form.Such a model, of course, requires a binary response, and we define our binary response  in the following way.
where 'severe' is previously defined.
For  predictor variables, we assume a vector-valued input  = (1,  1 ,  2 , … ,   ) and vector of regression coefficients  = ( 0 ,  1 ,  2 , … ,   ) .We define  as the probability of our response taking on a value of '1', the probability of a WVC being severe, as follows.
We note that the probability depends on our predictors and coefficients.Next, we consider the logit of , defined as follows.
where the quantity  1− is known as the 'odds.'Thus, we have: By setting the logit (the log-odds) equal to a linear combination of predictor values and coefficients, we have the logistic regression equation: or in its common vector form where  and  are defined as above and (•) represents the vector dot product.We note that the logodds is now dependent on the combination of variables and coefficients and  is guaranteed to be ∈ (0, 1).For a somewhat simpler, and possibly more intuitive interpretation of the model, equation ( 6) is often expressed in terms of the odds by exponentiation.This yields We can also rewrite (2) as follows.
From the last line of (8), we can see that a one-unit change in the  ℎ predictor, while holding all other variables constant, will result in an exp{  } multiplicative change in the odds.

Additive terms
A generalized additive model (GAM), introduced by Hastie and Tibshirani (1986,1990), is a generalized model (for example, a binary logistic regression model) with a linear predictor involving a sum of smooth functions of some number of explanatory variables.Furthermore, a model form can be hybrid in the sense that our link function (the left side of our model equation, the log-odds) can be set equal to a combination of both linear terms and smooth functions of variables.That is, we may have the following.
where ∑      =1 are the  linear terms in the model, ∑   (  )  =1 are the  smooth functions of  variables (the additive terms), and  =  +  ( total predictor variables).The smooth functions of this model form allow for a flexible specification of the dependence between the response (the log-odds) and the corresponding explanatory variables.To create these smooth functions, statistical software automatically "plots" the response versus an independent variable, and then fits a smooth curve that represents the data as well as possible.This fit is achieved by dividing the data into a number of sections, assigning "knots" to the ends of the sections, and fitting a polynomial or spline function to the data in the section.The resulting functions can be high-order curves that capture complicated relationships between the response and the explanatory variable.Although GAMs are flexible and often provide better results in terms of goodness-of-fit, their primary shortcoming is that their results are difficult to interpret.That is, the fitted model result is not reported in terms of a parametric equation, and model fits are typically evaluated graphically.For this analysis, we opted to use the hybrid generalized additive model (given in equation ( 9)) as our final model form.We also note that in our statistical model, we pursue several interactions between explanatory variables.

Results and Discussion
All data preparation/manipulation/aggregation was performed using the base packages of R statistical software (R Core Team, 2022).The model form pursued in this analysis was a hybrid GAM, the model form presented in equation ( 9).This model uses severity as the binary outcome (response) and a variety of explanatory variables included as linear or additive terms.To fit this particular model, we used the "mgcv" package based on the code of Simon N. Wood (Wood, 2011).A challenge with our model form is determining which of the terms should be considered linear and which should be considered additive.Initially, all numerical variables were designated as additive terms.From investigatory, initial model-fitting, specifically looking at the estimated smooths associated with the numerical variables, we were able to determine (graphically) which terms, in fact, should remain additive and which should simply be linear terms.In the end, time-of-day, posted speed limit, and human population density required high-order smooths (to be presented and interpreted in subsequent section).The remaining categorical variables were included in our model as linear terms.Also, as previously mentioned, in addition to main effects, several twoway interactions were investigated.In fact, during the model-fitting procedure, most combinations of the categorical variables (see Table 1) were pursued.While main effects help to gain a very general understanding of the effect of a variable, the interactions will potentially reveal more complicated, nuanced features not necessarily discernable in the main effects alone, and these complicated relationships are most interesting.Though they worked more generally with accident prediction models, Islam et al. (2014) highlighted the importance of incorporating interactions between variables for the estimation of models.Lastly, the overall fit and appropriateness of the model form was assessed via a Hosmer-Lemeshow test, a commonly used technique for assessing logistic regression models (Hosmer and Lemeshow, 2000).The Hosmer-Lemeshow test creates groups of observations based on estimated probabilities from the logistic regression model and then compares observed and expected probabilities within these groups.This test, however, can become too powerful with large sample sizes, and the number of groups chosen should be adjusted to keep the power of the test consistent (Paul et al., 2013).Following the advice of Paul et al. ( 2013), and based on our sample size and number of successes (severe accidents), we chose to use 100 groups.The Hosmer-Lemeshow goodness-of-fit test yielded a test statistic of Χ 2 = 105.32(df = 98, p-value = 0.2885).This gives no evidence of a lack of fit and we can assume our model is appropriately specified.

Linear terms
Related to the linear terms in the model, fitted results were generally positive as a number of statistically significant explanatory variables were identified.In other words, a number of beta coefficients associated with the linear terms in our model form (equation (9)), were found to be statistically different from zero based on a p-value analysis and a significance level of 0.05.However, because of our large sample size ( > 15,000), we suspect our p-values may be artificially deflated and should thus be looked upon with some skepticism.So, instead of reporting and interpreting statistically significant terms (based only on p-values), we have opted to report and interpret 'practically' significant terms based on effect sizes.In the case of logistic models, "odds ratios" are considered suitable effect sizes.Odds ratios are the exponentiated beta terms in our model identified in equation (8), and these values are interpreted as the relative multiplicative effects of the variables on the odds.While there are different opinions as to what should be considered a "large" effects size, many suggest that "rules of thumb" should be avoided and that effect sizes should be "contextualized," interpreted in terms of the field of study and phenomenon being measured (for example, Ialongo, 2016).Following this advice, and albeit an arbitrary threshold, we have decided to use odds ratios of (or around) 0.50 and 2 as boundaries for practical significance (or, equivalently, coefficient estimates of -0.69 and 0.69, respectively).That is, any odds ratio resulting in a value greater than 2 or less than 0.50 (doubling or halving the odds, respectively) is considered practically significant.Note that for practically significant interaction terms, the main effects will not be discussed, regardless of their significance, since they should not be interpreted on their own.Lastly, since our model does identify several practically/statistically significant terms, we will discuss those deemed "most" significant and interesting/germane to our analysis.Table 2 below presents a selection of practically significant variables, their corresponding regression estimates, and their odds ratios.

Weather and surface interaction
The relevant fitted result for these variables is as follows.
Both weather and surface are categorical variables with just two categories each: rain, clear weather; slick, dry surface; respectively.For such variables in regression, there is always a base case (or baseline setting), and the other level of the variable is an adjustment to that baseline.For these particular variables, clear weather and dry surfaces are the bases, and the main effects would be the multiplicative adjustments to the odds of severity for inclement and slick conditions, respectively.However, since we have a practically/statistically significant interaction between weather and surface, we can no longer interpret the main effects since these variables now depend on the each other.The good news is that these variables are binary (take on values of 0 and 1 for the two respective categories), so interpretation of the model results is relatively straightforward and understandable.
Table 2.A selection of practically significant variables and their corresponding regression coefficients (with confidence intervals) and odds ratios (with confidence intervals).Note that the highway variable is weakly practically significant = 0.12.Thus, for clear weather, slick surface affects the odds of having a severe accident by a multiplicative factor of 0.12; or in other words, for clear weather, there is about an 88% decrease in the odds of severity for slick surfaces compared to dry surfaces.This result has some support in the literature.Al-Bdairi et al.
(2020) found an increased likelihood of severity associated with dry road surfaces (and decreased likelihood of severity for slick roadways) and clear weather conditions.Also, indirectly, Gharraie and Sacchi (2020), found a decrease of crash severity for wet road surface conditions, likely because of a decreased speeding attitude (which makes intuitive sense).Next, we observe the effect of slick surfaces for inclement weather conditions to see the interaction effect, or how the effect of slick surfaces depends on the observed weather.For inclement weather (setting weather to 1), we find that slick surfaces (setting surface equal to 1) corresponds to an odds of exp{−.So, the combination of slick surfaces and inclement weather results in an increase in the odds of a severe accident.

Direction and season
The relevant fitted result for these variables is as follows.
= exp{… 0.28 *  − 1.65 *  + 0.93 *  *  + ⋯ } Both direction and season are categorical variables with just two categories each: one-way, two-way traffic; sping/summer, fall/winter; respectively.For these variables, the fall/winter season and one-way traffic are the base cases, so the spring/summer season and two-way traffic are adjustments to their respective bases.Initially, we did consider four separate seasons, but found no significant differences between winter and fall and between spring and summer.Thus, the variable was discretized to distinguish between fall/winter and spring/summer.For the fall/winter season (keeping season at 0), we find that two-way traffic (setting direction equal to 1) corresponds to an odds of exp{0.28* 0 − 1.65 * 1 + 0.93 * 1 * 0} = exp{−1.65}= 0.19.Thus, for cooler months, two-way traffic affects the odds of having a severe accident by a multiplicative factor of 0.19, or there is about an 81% decrease in the odds of severity.Next, we observe the interaction effect.For spring/summer (setting season to 1), we find that two-way traffic (setting direction equal to 1) corresponds to an odds of exp{0.28* 1 − 1.65 * 1 + 0.93 * 1 * 1} = .64.If we compare this to the odds for two-way traffic in the fall/winter, we have spring/summer (as compared to the fall/winter), two-way traffic increases the odds of having a severe accident by a multiplicative factor of 3.37, more than a 3-fold increase.This is an interesting result: during the cooler months, two-way traffic is safer than oneway traffic; during the warmer months, two-way traffic is much more dangerous.At the time of this analysis, no literature was found related to the combined effect of traffic direction and season.This effect may be attributable to changes in driver attitude (speeding, etc.) and increased animal behavior/activity during the warmer months.

Alignment and season interaction
The relevant fitted result for these variables is as follows.
In this case, alignment and season are both categorical variables, each with only two categories: straight, curved/inclined roads; fall/winter, spring/summer; respectively.Therefore, as we see above in the equation excerpt, there is one adjustment for the alignment (one adjustment from curvature/incline, the base case), one adjustment for the season (adjustment from the winter/fall base case), and a single interaction term.First, as a reference, for the fall/winter season (when seasonal adjustment is set to 0), we find that straight roadways (setting alignment equal to 1) affects the odds of having a severe accident by a multiplicative factor of exp{0.30* 1} = 1.35 .This means that, during the fall/winter, there is about a 35% increase in the odds of severity for straight roadways compared to roadways with curvature or incline/decline.Now, we consider the spring/summer season, and we compare our result to that of the fall/winter.In the spring/summer (with seasonal adjustment set to 1), we find that straight roadways (setting alignment equal to 1) have an odds of exp{0.

Interstate highways
The relevant fitted result for this variable is as follows.
In this case, the highway variable is a categorical variable and simply distinguishes between interstate highways (I-93 and I-89 in NH) and all other roads.Despite pursuing numerous interactions with this variable, no meaningful relationships were identified.Interestingly, two-way interactions between highway and habitat variables (habitat block, primary corridor, and secondary corridor) were not found to be significant.This was somewhat unexpected as numerous studies have identified that WVCs often peak where roads interact with animal habitats (for example, Ha and Shilling, 2018).None of the other variables in the analysis were found to interact with highway either, so we interpret the effect of highway as a main effect.In our case, nonhighway roads are (by default) the base case, so the effect of being on NH's interstates on accident severity is given by exp{−0.65}= 0.52.That is, being on an interstate highway affects the odds of having a severe accident by a multiplicative factor of 0.52, or there is about a 48% decrease in the odds of severity for highways.Though initially unexpected, this result does make sense.Based on national crash data, Huijser et al. (2008) found that WVCs are more likely to occur on low volume roads with nearly half of all observed WVCs occurring on roadways with less than 5,000 average daily traffic (ADT).In addition to this, interstates tend to be better maintained, have wider lanes, have more space between vehicles, and have Laflamme, E.M., Villamagna, A., Kim, H.J., Archives of Transport, 69 (1), [39][40][41][42][43][44][45][46][47][48][49][50][51][52][53][54][55][56][57]2024 isolated paths of travel, so generally, WVCs occurring on highways can be 'managed' to avoid severe injury.

Additive terms
Using the default setting in the mgcv package, thin plate regression splines were used for the smooth functions.A useful output from the GAM fitting output is the effective degrees of freedom (edf) for each smooth term.An edf of 1 corresponds to a model with the smooth term penalized to a simple linear relationship, so the edf can be loosely interpreted as the order/complexity of the fitted smooth, or akin to the number of knots used.As mentioned previously, all continuous variables in our data were originally included in the model with smooth terms.However, many of the associated smooth functions had edf values of 1 (or nearly 1), so their inclusion as additive terms was not justified.In the end, three variables; time-of-day, posted speed limit, and human population density; had edf > 1 and their corresponding smooth functions were retained in the final model.

Time-of-day
The estimated smooth function associated with dayof-the-week is included below in Figure 1.The smooth function associated with day-of-the-week had an edf value around 6 which agrees with the very high-order, complicated curve shown below in Fig- ure1.Using y=0 as a reference point (at y=0 there is even odds of a severe accident), we see that the early hours, between around midnight and 6AM, there is an increase in the odds of a severe accident, with a peak around 3AM. Between about 6AM and 5PM, we find negative values with a low point around 8AM.These negative values correspond to decreases in the odds of observing a severe accident.After 5PM, we see positive values, or increases in the odds of a severe accident, with a peak around 10PM.There is literature to support this finding.(2012) applied a generalized additive model using the relative frequency of collisions as a function of the interaction of a given week and hour.They found that for certain times of the year, the predicted probability of collisions was highest between 4 and 10PM, interestingly the same times as the increasing portion of our second peak.Based on a very thorough time series analysis of nearly a decade of deerrelated accidents, Hothorn et al. ( 2015) identified a fine-scale temporal pattern of DVCs, and generally a trend that showed that DVCs occurred shortly after sunset and shortly before sunrise (during the dark hours).While that work is based on DVCs only (and not severity of the accident), these periods are nearly identical to the positive areas, those corresponding to increases in odds of severity, on our GAM plot below.Huijser et al. (2008) found that most WVCs occurred during early morning and late evening hours, and many other works cited in section 2.1.1 agree with our findings.It is worth mentioning that the time-of-day effect likely has to do with both animal and traffic behavior (human behavior) that vary throughout the day.

Posted speed limit
The estimated smooth function associated with dayof-the-week is included below in Figure 2. The associated smooth function had an edf value slightly above 2 which agrees with the parabolic curve seen below.In this case, because the smooth is relatively  2016) theorized that improved safety at higher speeds, their unexpected result, may be due to increased design standards and longer available distances between vehicles at these conditions.In a work focused specifically on WVCs on a highway segment, Murphy and Xia (2016) found a significant negative relationship between speed and accident occurrence.Lastly, Lave and Elias (1994) contend that increases in posted speed limits are safer (fewer fatal accidents) due to reductions in speed variances on higher speed segments.We note that, after reviewing the NH WVC data, high speed limits are not found exclusively on interstate highways (some smaller state routes had high speed sections), and some portions of the interstate highways themselves had lower posted speed limits (45-55 mph range).This confirms that high speed limits are not exclusive to well-designed interstates, but occur on a variety of roadways, those with different geometries, varying degrees of protection against encroachment from animals, etc.In fact, a fine gradient of speed/roadway type combinations are represented in our data.Thus, we can then generalize our result by concluding that the relationship between speed and accident severity is non-linear, a relationship that may have been overlooked with a linear model.Although we stand by our result, we can only speculate as to why high-speed conditions are associated

Human population density
The smooth function associated with human population per square mile is given below in Figure 3.This smooth had an edf value around 3.4, evident below in the non-linear, cubic curve.Between 0 and about 3,000 (persons per square mile), the curve is below zero indicating a negative effect on the odds of severity.The curve is then positive up to a point cor-responding to about 9,000, indicating a positive effect on the odds of severity for these densities.Beyond this point, for values larger than 9,000, the curve is negative and continues to decrease as the density increases.We note, however, the large increase in confidence bands for these larger densities, and thus interpret this area of the plot cautiously.
In summary, we see increased odds of severity at mid-range human densities (between around 3,000 and 9,000) and decreased odds for both low and high densities.Intuitively, we can explain this.In lower density areas, while there may be more animal activity, the roadways may be less disposed to severe accidents (lower speeds, etc.).In highly dense areas, there may simply be fewer animals present, or consistent traffic deters/scares off animals.The existing literature is split on this issue, and a variety of measures are used to quantify density (number of roads, artificial lights, etc.).Some studies have indicated that urban roads are associated with fewer WVCs (for example, Zuberogoita et al., 2014), while others have found the opposite (for example, Ha and Shilling, 2018).Gkritza et al. ( 2010), modeling injury severity associated with WVCs, found that both higher traffic volumes and areas higher percentages of roads (both proxies for high densities) were less likely to result in injury.This agrees with our result, specifically the upper end of our figure.
Fig. 3. Plot of smooth function fit to the human population density variable

Model validation
To further evaluate the appropriateness of our model form, and to assess the predictive capability of our model/predictors, we employ a validation procedure quantified by a receiver-operating characteristic (ROC) curve analysis.First, for our validation, we define distinct, nonoverlapping "training" and "testing" portions of the data.Then, our "best" model, that identified in the Results section, is refit to the training set (this establishes new coefficient estimates based on the training data only).Next, this model form is used to predict responses for the testing data, and these predicted responses can be compared to the actual responses of this testing data.Importantly, this type of procedure identifies the true predictive capability of the model since the testing data was not used in the model-fitting.It is somewhat standard to use a slightly larger dataset for the training data, and we have chosen to do so.In addition, to mimic the way our model could be used in application, we have decided to partition our data by year.Specifically, we chose 13 years of the 18 total years of data as a training set; this equates to slightly more than 70% of the entire data.The remaining 5 years (not necessarily 5 consecutive years), about 30% of the full dataset, is designated as the testing set.Lastly, we repeat this procedure several times, randomly choosing training and testing sets, to ensure reliable results.For every training/testing iteration, we observe the receiver operation characteristic (ROC) curve.The ROC curve methodology has been adapted to numerous fields and is especially useful for evaluating the accuracy of a statistical model that classifies subjects into 1 of 2 categories (for example, logistic regression).The ROC curve is a plot of sensitivity versus 1−specificity for varying values of the threshold for  (the probability of an accident being severe in our case).The success of a model can be easily visualized by observing the ROC curve in relation to the transverse diagonal line, which corresponds to random chance.Typically, we observe the area under the curve (AUC) to quantify the difference from random chance, which is 50%.The ROC curve has two major benefits: 1) it automatically reports results for a variety of threshold values, and 2) it considers both forms of misclassification.
For each iteration of training/testing described above, we calculated the AUC via the 'pROC' package in R (Robin et al., 2011).This also, by default, includes a 95% CI are computed with 2000 stratified bootstrap replicates.From the iterations, the AUCs are generally in the neighborhood of 75%.A ROC curve from one training/testing iteration is given below in Figure 4. Interpreting AUC is subjective and relative, but this result indicates our model (as specified) is, by most standards, a good predictor of accident severity.This also lends credence to our procedure and model form as an appropriate choice.Also reported is the AUC, 75.4% in this case, along with the bootstrap confidence interval for the AUC

Conclusion
In this analysis, we develop a model framework for identifying the meaningful factors associated with severe wildlife vehicle collisions (WVCs).The approach is driven, primarily, by the data made available to us by the NHDOT.This data contains detailed information for WVCs occurring between 2002 and 2019, including the severity of the accident.Using a severity designation (either 'severe' or 'not severe') as the binary response, and using numerous explanatory variables based on the available information and variables commonly believed to be associated with WVCs, we used a logistic regression model form.We further augmented this model by using several additive terms for numerical variables.The benefit of this model, technically a generalized additive model (GAM), is that the additive terms automatically estimate complicated, non-linear relationships between the variables and response.Fitted results yielded several interesting relationships.The linear terms in our model (non-additive terms) correspond mostly to categorical variables, so interpretation is relative to base case settings.Furthermore, many of the statistically/practically significant results, and all those presented in this work, are interactions, so interpretation of one variable depends on the setting of the other.A brief summary of our results for the linear terms is as follows.First, we found that, in inclement weather conditions, there is about a 22% increase in the odds of severity for slick surface conditions compared to dry surface conditions.Second, for the warmer months (spring/summer), there is a 42% decrease in the odds of severity for straight roads compared to those with curvature/incline.Third, for spring/summer months (as compared to the fall/winter), there is more than a 3-fold increase in the odds of severity for two-way traffic.Fourth, the odds of severity decreases by 48% for accidents occurring on NH's two major interstates highways, I-93 and I-89.The additive terms are interpreted graphically, and a brief summary of our results for these terms is as follows.First, for time-of-day, we observe increases in odds of severity in the early hours, between midnight and 6AM, and after 5PM.The morning and afternoons observed decreases in odds of severity with a low point around 8AM.For posted speed limit, speeds between 45 and 60 mph are associated with an increase in the odds of a severe accident.On the other hand, both lower and higher speeds (those below 45 and above 60 mph) are associated with a decrease in the odds of a severe accident.Lastly, for human density, low, mid-range, and high densities are associated with decreases, increases, and decreases in odds of severity, respectively.We validated our procedure by randomly partitioning years of data into training and testing sets.Then, for the training data, we refit our model form to reestimate the parameters before applying the model to the testing data.ROC curves with AUC were used to evaluate the predictions, and the validation results gave ample evidence that our model is well specified and an effective predictor.Ideally, our results could be used to inform decisionmakers and ultimately affect policy or remedial Laflamme, E.M., Villamagna, A., Kim, H.J., Archives of Transport, 69 (1), [39][40][41][42][43][44][45][46][47][48][49][50][51][52][53][54][55][56][57]2024 measures, so-called "targeted preventative measures…to have the maximum impact on driver alertness" (Neumann et al., 2012).There are a variety of preventative measures that exist today (some relatively low cost, others more costly), such as wildlife warning signs, speed limit reductions, wildlife warning reflectors (which create an optical fence for wildlife), and wildlife corridors, to name a few.Our results associated with roadway features, posted speed limit and roadway alignment, could be used to identify locations where some preventative measures (as mentioned above) may be added.However, generally, these remedial measures have been shown to be either ineffective in the long term (Beben, 2012) or of little value (Benten et al., 2018).In addition to this, our other results, either interactions or additive terms, have temporal/seasonal effects, which means these effects are not constant throughout the day/year, and there is often a habituation effect associated with fixed preventative measures in such cases (Beben, 2012).Instead, a more realistic approach would be to implement a tailored driver warning added into a navigation system.This would allow drivers to be alerted only when necessary.Like any analysis, there were some limitations which may be addressed in future work.First and foremost, the data collected by the NHDOT, while spanning several years and mostly complete, did not distinguish the type of animal involved in the crash.Ideally, with this information, we could incorporate animal-specific behavior in the analysis.For example, Etter et al. (2002) suggested behavioral response of deer to hunting might contribute to an autumn peak in DVCs.Other species-specific behavior would surely contribute to a deeper understanding of WVCs.Next, there is no information related to the demographics of the driver involved in the WVC.A key finding in Huijser et al. (2008) was that, unlike vehicular accidents in general, WVCs have no pronounced spike for young drivers.This suggests that driving experience does not reduce the risk of being involved in a WVC, or possibly that younger drivers are not typically on the types of roads where WVCs occur.In any event, this information would likely help to identify more factors associated with severe WVCs.Also, it would be interesting to know if alcohol was involved in the WVCs, if the driver was distracted (cell phone use, etc.), if the driver was using a seatbelt, or the type of vehicle involved.These factors are known to contribute to accident severity in general, and this information should be collected at the accident scene, so theoretically it should be possible to obtain for future studies.Also, with all works related to WVCs, underreporting must be acknowledged.As stated by Savolainen et al. (2011), "[underreporting] is an extremely important matter because estimating models with under-reported data can easily lead to erroneous inferences, particularly because less severe crashes are far more likely to be underreported than more severe crashes, which results in a data sample which is nonrandom in its dependent variable thereby violating the fundamentals of traditional statistical and econometric model derivations."Furthermore, according to Snow et al. (2015), as much as 2/3 of WVCs may be unreported in some cases.Fortunately, as stated by the authors, estimates based on underreported data are reliable until very high rates of underreporting occur.Lastly, our data comes exclusively from the state of New Hampshire.While this data does represent a variety of different locations, such as rural and urban areas, it is not clear that our results are generally transferable.Future work may include WVCs from more diverse areas.

Fig. 1 .
Fig. 1.Plot of smooth function fit to the time-of-day variable

Fig. 2 .
Fig. 2. Plot of smooth function fit to the posted speed variable

Fig. 4 .
Fig. 4. ROC curve for one training/testing procedure in which 13 years of data is randomly selected as the training set and the remaining 5 years of data is set aside as the testing set.Also reported is the AUC, 75.4% in this case, along with the bootstrap confidence interval for the AUC al. (2019) found that the highest counts of WVCs were reported during fall for the northeast region of the US.Dussault et al. (2006) studied moose accidents between 1990 and 2002 and found the highest number of accidents in late June, but accident frequency remained relatively high from mid-May to late August.The previously mentioned work of Wilkins et al. (2019) identified that Texans are significantly more likely to collide with a wild animal during the months of October, November, and December.Using an additive model and citizen science data collected from urbanized regions of Italy, Valerio et al. (2021) found significant peaks in roadkill occurrence (for a variety of species) between the winter and spring months.Using Polish railway data from 2012 to 2015 and generalized additive mixed models, Krauze-Gryz et al. (2017) found that the number of wildlife-train collisions varied depending on time of year with most collisions occurring in autumn when animals form winter herds and migrate.In a cluster analysis of deer and boar accidents on Czech roads, Bil et al. (2019) found closed habitats, the presence of shrubs along roads, road width, and the distance to forests to be significant indicators of WVC hotspots.Similarly, Bartonicka et al. (2018) applied a kernel density estimation method to identify hotspots among WVCs.Using data from the Czech Republic collected between 2006 and 2011, the method identified local clusters associated with distance from forest and linear vegetation.Based on deer population data from the Iowa Department of Natural Resources,

Table 1 .
List of variables used in procedure.First column gives variable name, second column gives type of variable, and third column gives units for continuous variables or categories of categorical variable Winter/Fall -Sept., Oct., Nov., Dec., Jan., Feb. Spring/Summer -Mar., Apr., May, Jun., Jul., Aug.