Modeling Severe Acute Respiratory Syndrome Coronavirus 2019 (SARS-CoV-19) Incidence across Conterminous US Counties: A Spatial Perspective

Abstract. This study examines the spatial distribution of COVID-19 incidence and mortality rates across the counties of the conterminous US in the first 604 days of the pandemic. The dataset, acquired from Emory University, Atlanta, United States, includes socio-economic and health outcome variables (N = 3106). OLS estimates accounted for 31% of the variance in incidence (adjusted R2 = 0.31), with an AIC value of 9263, a Breusch-Pagan test statistic for heteroskedasticity of 472.4, and a multicollinearity condition number of 74.25. These results necessitated spatial autoregressive models, which were estimated in GeoDa 1.18. ArcGIS 10.7 was used to map the residuals and selected significant variables. Both the Spatial Lag Model (SLM) and the Spatial Error Model (SEM) accounted for substantially more of the variance than OLS. Ranked by fit, SLM (AIC: 8264.4; Breusch-Pagan test: 584.4; Adj. R2 = 0.56) outperformed SEM (AIC: 8282.0; Breusch-Pagan test: 697.2; Adj. R2 = 0.56), so SEM is the less predictive of the two spatial models. The significant contributions of the male, black race, poverty, and urban/rural dummy variables indicate that COVID-19 transmission is more a function of socio-economic and rural/urban conditions than of health outcomes. Diabetes and obesity showed positive relationships with COVID-19 incidence, but these relationships were relatively weak in this dataset. The study concludes that policymakers and health practitioners should consider spatial peculiarities, rural-urban migration, and access to resources in reducing the transmission of COVID-19.



Introduction
SARS-CoV-19, otherwise known as Severe Acute Respiratory Syndrome Coronavirus 2019, was first reported on December 30, 2019, in Wuhan, China, as a pneumonia-related diagnosis (Xie et al., 2020). Several weeks later, the World Health Organization (WHO) officially designated the virus "2019-nCoV" (2019 novel coronavirus) and estimated its incubation period to be about 2 to 14 days. The primary aim of the spatial analytical process is to measure geographic distributions, analyze patterns, map clusters, and model spatial relationships among observed variables. Hence, spatial analysis is vital in medical geography because the distribution of diseases tends to be intrinsically linked with socioeconomic, political, and environmental conditions that affect susceptibility. Moreover, mapping patterns of phenomena offers the considerable advantage of revealing hidden relationships among variables (Oluwafemi et al., 2013). Community transmission was first reported in February 2020 (CDC, 2020; Desjardins et al., 2020). By mid-March 2020, confirmed cases of COVID-19 had been reported in every state in the U.S. (Schuchat, 2020). As of April 7, 2021, worldwide cases of COVID-19 had risen to over 131 million, with 2.85 million deaths and over 74.5 million recoveries (The New York Times, The COVID-19 Tracking Project, 2020). As of the same date, the virus had infected 30,732,250 people in the US and resulted in 554,579 deaths (The New York Times, The COVID-19 Tracking Project, 2020). The major transmission route of COVID-19 is through respiratory droplets from close direct contact with symptomatic, pre-symptomatic, or asymptomatic people, and through indirect contact via objects or aerosols over longer distances. The Basic Reproduction Number (R0) is a commonly used epidemiologic measure of the transmissibility of an infectious agent.
R0 will be greater than 1 during an outbreak and will drop to less than 1 as the outbreak subsides. This statistic can be used to estimate the proportion of a population that must be vaccinated in order to control the spread of the infection (Delamater et al., 2019). Across different regions of the U.S., the R0 ranged from 1.3 to 3.8 on March 1, 2020 and from 0.64 to 1.1 on May 1, 2020, according to the Rt.live.com website (Rt Live, 2020). Early research suggested that the average number of days from transmission of COVID-19 to case confirmation was 18 (Backer et al., 2020). According to Desjardins et al. (2020), the estimated R0 of COVID-19 ranges from 2.2 to 6.7 depending on the source. COVID-19 has been of global concern because of its "unknowns" and the impact of its emergence on virtually all aspects of life. Early studies of COVID-19 suggested that preexisting health conditions, air pollution, and socioeconomic variables could be pointers to COVID-19 incidence and mortality (Petroni et al.; Xie et al.; Wu et al.; Pansini and Fornacca, 2020). However, understanding the spatial distribution of COVID-19 cases and modelling the variables that influence its transmission rely mainly on big data. The quality of such data allows digital manipulation in spatially inclined software that supports exploratory spatial analysis. A number of studies have explored the relationship between COVID-19 transmission and socio-economic variables in New York City and Chicago, United States. Maroko et al. (2020) showed a relationship between COVID-19 transmission and socio-economic and demographic characteristics, and affirmed certain socio-economic variables as predictors of COVID-19. Safray et al. (2020) adopted both global and local spatial correlation statistics to examine dependencies in US counties. They affirmed that race and certain health outcomes are predictors of COVID-19 transmission.
Traditional statistical models do not account for spatial dependence (Smirnov & Anselin, 2001; Anselin, 2003; Anselin et al., 2006). When spatial dependence is present, Ordinary Least Squares (OLS) regression produces biased parameter estimates (Smirnov & Anselin, 2001). To mitigate this, spatial autoregressive models are used; these rely on Maximum Likelihood (ML) estimation and account for the presence of spatial dependence in the data. The incidence of COVID-19 has been found to vary over space at the global, national, and local scales depending on the risk factors (Franch-Pardo et al., 2020; Desjardins et al., 2020; Petroni et al., 2020; Adekunle et al., 2020; Sobral et al., 2020). In particular, there is notable geographical variation in the distribution of COVID-19 cases across US counties (Desjardins et al., 2020). In addition, COVID-19 mortality also varies across counties in the US (Zhang and Schwartz, 2020). In spatial autoregressive models, spatial dependence is incorporated in one of two ways: through a spatially lagged dependent variable or through the error term. The former is known as a spatial lag model and the latter as a spatial error model (Anselin, 2003). When spatial error dependence is unaccounted for, regression yields inefficient results (Anselin et al., 1996), potentially giving incorrect standard errors, significance levels, or model fit. When the spatial lag term is not treated as an endogenous variable under a proper estimation method, the regression model produces biased and inconsistent results (Anselin, 1988; Baltagi et al., 2007; Fotheringham & Rogerson, 2008; Lee & Yu, 2010; Badr et al., 2020). To address the inconsistent, inefficient, and biased results of traditional statistical models, we employed spatial autoregressive models to examine the effects of selected socio-economic and health outcome variables on COVID-19 cases in the contiguous U.S. between March 2020 and August 2020 using Emory University datasets.

Study Area
The United States is a country in North America and is widely regarded as the world's most powerful nation in terms of Gross Domestic Product (GDP). According to the United States Census Bureau's 2019 projection, the population is 329,256,465, and the capital city is Washington, DC. The contiguous United States has 3,143 counties and 5 administrative regions (Figure 1). The country covers 3,796,725 square miles (9,833,517 square kilometers) and comprises 50 states (ThoughtCo.), 48 of which, together with the District of Columbia, make up the contiguous United States. The climate is mostly temperate, but tropical in Florida, semi-arid along the Mississippi River, and arid in the Great Basin of the southwest (ThoughtCo.). The study area includes all counties in the contiguous United States for which COVID-19 incidence and mortality data are available for the study period from January 21 to September 16, 2020.

Dataset and Descriptive Statistics
Secondary data were used for this study. The dataset comprises over 40 variables, including COVID-19 case counts (7 and 14 days), mortality rate, COVID-19 incidence, race (Black, White, Hispanic), % male, % female, household income, Community Vulnerability Index, population density, % insured, % uninsured, age over 65, poverty, diabetes, and obesity, among others. These datasets were acquired from the Emory University COVID-19 Health Equity Interactive Dashboard, which draws on government and non-government agencies.

Data Summary
The datasets used for this study were summarized, and variable statistics were calculated, using STATA 13. Descriptive statistics for the explanatory variables are provided in Table 2.

Research Questions
The study seeks to answer the following research questions: i. Does COVID-19 occurrence show spatial variation and spatial dependence across counties? ii. What spatial effects are present in the datasets? iii. What is the relationship between COVID-19 occurrence and selected socio-economic and health outcome variables across US counties?

Aim and Objectives of the study
The aim of this cross-sectional study is to use spatial analytical models to statistically investigate the spatial relationship between selected explanatory variables and COVID-19 transmission, with a view to solving public health problems and providing a framework for allocating resources to deprived counties. The objectives of the study are: i. to use Ordinary Least Squares estimation as a diagnostic tool to determine the pattern of correlation and spatial dependence of COVID-19 incidence across the counties; ii. to use spatial autoregressive models to explain the spatial relationship between the COVID-19 case rate and selected variables in the continental United States.

Working Hypothesis
The study sets out to test the following hypotheses: i. There is spatial variation and spatial dependence in COVID-19 incidence. ii. COVID-19 incidence variation can be statistically explained by selected socio-economic and health outcome variables.

Spatial Analytical Procedure
In order to isolate the real predictor variables that influence COVID-19 incidence, the incidence variable was transformed by squaring it, and the data were subjected to stepwise and exploratory analysis. COVID-19 incidence is the dependent variable, while the other variables (population density, male, poverty, diabetes, black, obesity, and an urban/rural dummy variable coded 1 for urban and 0 for rural counties) were entered into the model as potential predictor variables (regressors).

Spatial Weight Matrix
We first characterized the spatial relationships and identified the neighborhood structure by defining which observations in the dataset are neighbors; this was done by creating a spatial weight matrix (Anselin, 2003; Fotheringham and Rogerson, 2008). The spatial weight matrix expresses the existence of neighbor relations and quantifies the neighborhood structure between observations using an n x n matrix, W (Anselin, 2003). The spatial weight matrix was calculated in GeoDa 1.18 as a binary Queen contiguity matrix, which defines a neighbor as any spatial unit that shares a common edge or vertex. The spatial weight is 1 if two units i and j are neighbors and 0 if they are not. The diagonal cells of the spatial weight matrix are also 0 because a geographic unit is not considered a neighbor of itself (Anselin, 2003). The spatial weight matrix is row standardized (Equation 1): each weight, Wij, is divided by its row sum, so that every row sums to 1 and the sum of all weights equals n, the total number of observations (Anselin, 2003).
$$W^{*}_{ij} = \frac{W_{ij}}{\sum_{k=1}^{n} W_{ik}} \qquad (1)$$
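The construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the GeoDa procedure used in the study: the spatial units are the cells of a hypothetical 3 x 3 regular grid rather than county polygons, and the helper names `queen_weights` and `row_standardize` are invented for this example.

```python
import numpy as np

def queen_weights(n_rows, n_cols):
    """Binary queen-contiguity matrix for the cells of a regular grid.

    Assumption for illustration: cells stand in for counties. Two cells are
    neighbors if they share an edge or a corner (vertex).
    """
    n = n_rows * n_cols
    W = np.zeros((n, n))
    for r in range(n_rows):
        for c in range(n_cols):
            i = r * n_cols + c
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    inside = 0 <= rr < n_rows and 0 <= cc < n_cols
                    if (dr, dc) != (0, 0) and inside:
                        W[i, rr * n_cols + cc] = 1.0  # i and (rr, cc) are neighbors
    return W

def row_standardize(W):
    """Divide every weight by its row sum; rows without neighbors stay zero."""
    row_sums = W.sum(axis=1, keepdims=True)
    return np.divide(W, row_sums, out=np.zeros_like(W), where=row_sums > 0)

W = queen_weights(3, 3)       # hypothetical 3 x 3 study area, n = 9
W_std = row_standardize(W)    # each row now sums to 1

print(W.sum(axis=1))          # number of neighbors per cell
print(W_std.sum())            # total of all standardized weights
```

In this toy grid the centre cell has eight neighbors and each corner cell has three; after row standardization every row sums to 1, so the weights sum to n as long as no unit is isolated.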

Estimation method & Models adopted
To establish a relationship between the isolated predictor variables and the dependent variable (COVID-19 case rate), we adopted three estimators: Ordinary Least Squares (OLS), the Spatial Lag Model (SLM), and the Spatial Error Model (SEM). As a first step, we subjected the potential predictor variables to OLS regression in GeoDa 1.18 (geodacenter.github.io). OLS is a regression method that investigates the relationship between a set of explanatory (independent) variables and a dependent variable, and it has the general form (Ward and Gleditsch, 2018):
$$y_i = \beta_0 + x_i \beta + \varepsilon_i$$
where, at county i, $y_i$ is the COVID-19 incidence, $\beta_0$ is the intercept, $x_i$ is the vector of the selected variables, $\beta$ is the vector of regression coefficients, and $\varepsilon_i$ is a random error term.
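As a rough illustration of the OLS baseline, the sketch below fits an intercept and slope coefficients by least squares and reports the adjusted R2. It runs on synthetic data generated inside the block, not on the Emory dataset; the function name `ols_fit`, the two made-up regressors, and the chosen coefficients are assumptions for this example only.

```python
import numpy as np

def ols_fit(X, y):
    """Least-squares fit of y on X with an intercept.

    Returns the coefficient vector (intercept first), the residuals,
    and the adjusted R-squared.
    """
    n = X.shape[0]
    Xd = np.column_stack([np.ones(n), X])        # prepend the intercept column
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    k = Xd.shape[1]                              # number of estimated coefficients
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k)
    return beta, resid, adj_r2

# Synthetic stand-in data (assumption): 200 "counties", two regressors.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))
y = 1.0 + X @ np.array([0.5, -0.3]) + rng.normal(scale=0.1, size=n)

beta, resid, adj_r2 = ols_fit(X, y)
print(beta, adj_r2)
```

The residuals returned here are the quantities on which spatial diagnostics such as Moran's I and the Breusch-Pagan test are computed.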

Determining the Spatial Dependence
To test the first hypothesis of the study, that there is spatial variation and spatial dependence in COVID-19 incidence, we used the Moran's I test reported in the OLS diagnostics for spatial dependence (Table 3). The Moran's I value (0.4225, p < 0.00001) implies a clustered pattern and significant positive spatial dependence in COVID-19 incidence across US counties. Since the Moran's I statistic is only a diagnostic tool, this result pointed us toward testing for the marginal effect of spatial dependence using the Spatial Lag Model (SLM) or the Spatial Error Model (SEM).
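For readers unfamiliar with the statistic, Moran's I for a variable x and weight matrix W is I = (n / S0) * (z' W z) / (z' z), where z holds deviations from the mean and S0 is the sum of all weights. The sketch below computes it for two toy series; the six-unit chain of neighbors and the helper names are assumptions made for this example and do not reproduce the county-level diagnostic reported by GeoDa.

```python
import numpy as np

def morans_i(x, W):
    """Global Moran's I of values x under spatial weight matrix W."""
    x = np.asarray(x, dtype=float)
    z = x - x.mean()              # deviations from the mean
    s0 = W.sum()                  # sum of all weights
    return (len(x) / s0) * (z @ W @ z) / (z @ z)

def chain_weights(n):
    """Binary weights for n units on a line: unit i neighbors i-1 and i+1."""
    W = np.zeros((n, n))
    idx = np.arange(n - 1)
    W[idx, idx + 1] = 1.0
    W[idx + 1, idx] = 1.0
    return W

W = chain_weights(6)
clustered = morans_i([0, 1, 2, 3, 4, 5], W)        # similar values sit side by side
alternating = morans_i([1, -1, 1, -1, 1, -1], W)   # neighbors always differ
print(clustered, alternating)
```

The smoothly increasing series yields a positive value (0.6) and the alternating series a negative one (-1.0), which is the sense in which a positive Moran's I indicates clustering of similar values.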

Spatial Lag Model (SLM)
According to Anselin (2003) and Ward and Gleditsch (2018), the SLM assumes dependency among the values of the dependent variable and incorporates spatial dependence into the regression model through a "spatially lagged dependent variable". The SLM is denoted by:
$$y_i = \beta_0 + x_i \beta + \rho W_i y + \varepsilon_i$$
where $\rho$ is the spatial autoregressive coefficient (i.e. the spatial lag parameter) and $W_i$ is a row of the matrix of spatial weights (that is, the vector of spatial weights for county i). The SEM instead incorporates spatial dependence through the error term:
$$y_i = \beta_0 + x_i \beta + \lambda W_i \xi + \varepsilon_i$$
where $\xi$ represents the spatial component of the error, $\lambda$ connotes the level of correlation among these components, and $\varepsilon_i$ denotes the spatially uncorrelated error term. The outcome of the SEM is shown in Table 3, and the SEM residuals were squared and later mapped (see Figure 3b).
The SLM residuals, like those of the SEM, were squared in GeoDa 1.18 so as to remove the negative signs. The OLS results indicated a model suitable for spatial regression, with population density, male, poverty, black, diabetes, obesity, and the urban/rural dummy as the selected independent variables. That is, only these seven variables were taken as the major predictors of COVID-19 incidence across US counties between January 21 and September 16, 2020. For uniformity and consistency, we subjected the same predictor variables to each of the adopted regression models. SLM and SEM were implemented in GeoDa 1.18 (geodacenter.github.io), and ArcGIS 10.7 was used for mapping. The performance of the models was compared on the basis of the Breusch-Pagan test, R2, and the Akaike Information Criterion (AIC), and the coefficients of the selected independent variables were used to assess the relationships among the variables across US counties during the first eight months of the global pandemic.
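To make the spatial lag specification concrete, the sketch below simulates data from y = rho * W y + b0 + b1 * x + e and recovers rho by maximizing the concentrated Gaussian log-likelihood over a grid. It is a minimal illustration under stated assumptions, not GeoDa's estimator: the units sit on a hypothetical ring in which each has two neighbors, the data are synthetic, and the names `ring_weights` and `fit_slm` are invented here.

```python
import numpy as np

def ring_weights(n):
    """Row-standardized weights: unit i neighbors i-1 and i+1 on a ring."""
    eye = np.eye(n)
    return 0.5 * (np.roll(eye, 1, axis=1) + np.roll(eye, -1, axis=1))

def fit_slm(y, X, W, rho_grid):
    """Grid-search ML for the spatial lag model.

    For each candidate rho, beta is the OLS fit of (y - rho * W y) on X,
    and the concentrated log-likelihood (up to a constant) is
    log|I - rho W| - (n / 2) * log(sigma^2).
    """
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])
    Wy = W @ y
    best = None
    for rho in rho_grid:
        y_star = y - rho * Wy                       # spatially filtered response
        beta, *_ = np.linalg.lstsq(Xd, y_star, rcond=None)
        resid = y_star - Xd @ beta
        sigma2 = resid @ resid / n
        _, logdet = np.linalg.slogdet(np.eye(n) - rho * W)
        loglik = logdet - 0.5 * n * np.log(sigma2)
        if best is None or loglik > best[0]:
            best = (loglik, rho, beta)
    return best[1], best[2]

# Synthetic data (assumption): true rho = 0.6, intercept 1, slope 2.
rng = np.random.default_rng(42)
n = 300
W = ring_weights(n)
x = rng.normal(size=n)
eps = rng.normal(scale=0.5, size=n)
y = np.linalg.solve(np.eye(n) - 0.6 * W, 1.0 + 2.0 * x + eps)

rho_hat, beta_hat = fit_slm(y, x, W, np.linspace(-0.95, 0.95, 191))
print(rho_hat, beta_hat)
```

The log-determinant term is what distinguishes this from plain OLS on a lagged variable: because W y is endogenous, dropping that term would give biased and inconsistent estimates of rho.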

Analysis of Results
The initial OLS estimates, with the seven regressors (population density, male, black, poverty, diabetes, obesity, and the urban/rural dummy) used to model COVID-19 incidence across US counties in the first eight months (604 days) of the pandemic, are shown in Table 3. COVID-19 incidence exhibited positive relationships with male, poverty, black, diabetes, obesity, and the urban/rural dummy, and was negatively associated with population density (Table 3). The OLS estimates accounted for 31% of the variance in incidence (adjusted R2 = 0.31), which is low, although the model provides a baseline for SLM and SEM. This means that almost 69% of the variation in COVID-19 incidence across the contiguous US is attributable to variables not included in the model, likely including local variations that OLS does not capture. The AIC value was 9263, the Breusch-Pagan test statistic for heteroskedasticity was 472.4, and the multicollinearity condition number was 74.25. This implies that the selected regressors are correlated with one another to some degree and that the model errors are heteroskedastic. The SLM and SEM presented slightly improved adjusted R2 values compared to OLS. This improvement is credited to the incorporation of spatial dependence into the regression of the dependent variable on the predictor variables. Rho (ρ) and Lambda (λ) were highly significant for SLM and SEM, respectively (p = 0.000). The adjusted R2 value computed for SLM (0.29) is slightly higher than that of SEM (0.28). In addition, a lower standard error was recorded for SLM.
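The two comparison diagnostics used above can be sketched as follows. The Breusch-Pagan statistic regresses the scaled squared OLS residuals on the regressors and takes half the explained sum of squares; the AIC is 2k minus twice the Gaussian log-likelihood. The data are synthetic, the function names are assumptions for this example, and conventions for counting k differ between software packages, so the values are not comparable with GeoDa output.

```python
import numpy as np

def ols_residuals(X, y):
    """Residuals from a least-squares fit of y on X with an intercept."""
    Xd = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return y - Xd @ beta

def breusch_pagan(X, y):
    """Classic Breusch-Pagan LM statistic and its degrees of freedom."""
    e = ols_residuals(X, y)
    n = len(y)
    g = e ** 2 / (e @ e / n)                  # squared residuals scaled by ML variance
    Zd = np.column_stack([np.ones(n), X])
    gamma, *_ = np.linalg.lstsq(Zd, g, rcond=None)
    fitted = Zd @ gamma
    ess = ((fitted - g.mean()) ** 2).sum()    # explained sum of squares
    return 0.5 * ess, Zd.shape[1] - 1

def gaussian_aic(resid, k):
    """AIC = 2k - 2 lnL under normal errors, with k estimated coefficients."""
    n = len(resid)
    sigma2 = resid @ resid / n
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1.0)
    return 2 * k - 2 * loglik

# Synthetic data (assumption): one regressor, with and without heteroskedasticity.
rng = np.random.default_rng(1)
n = 500
x = rng.uniform(1.0, 5.0, size=n)
noise = rng.normal(size=n)
y_hom = 1.0 + 2.0 * x + noise          # constant error variance
y_het = 1.0 + 2.0 * x + x * noise      # error variance grows with x

bp_hom, df = breusch_pagan(x, y_hom)
bp_het, _ = breusch_pagan(x, y_het)
aic_full = gaussian_aic(ols_residuals(x, y_hom), k=2)
aic_null = gaussian_aic(y_hom - y_hom.mean(), k=1)
print(bp_hom, bp_het, aic_full, aic_null)
```

A large statistic relative to the chi-square distribution with `df` degrees of freedom signals heteroskedasticity, and the model with the lower AIC is preferred, which is the rule applied when ranking SLM above SEM.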

Regression analysis
Based on the model evaluation given above, we adopted the SLM. The results of the SLM regression analysis are presented in Table 5. Male (0.05), Black (0.01), Poverty (0.02), Diabetes (0.002), Obesity (0.007), and the urban/rural dummy variable (0.02) exhibited positive and significant relationships with COVID-19 incidence. On the other hand, population density (-3.721) demonstrated a negative relationship with COVID-19 incidence across the study area in the first 604 days of the pandemic. The three positive, statistically significant variables in the SLM were also mapped to show their spatial variation across US counties between January 21 and September 16, 2020 (see Figures 4, 5, and 6).

Discussion and Results
In this study, we examined the influence of socio-economic and health outcome variables on COVID-19 incidence across the counties of the contiguous USA. To achieve this, we selected seven predictor variables (population density, male, black, poverty, diabetes, obesity, and the urban/rural dummy) and subjected them to stepwise, GIS-based exploratory, and GeoDa-based regression analysis. We then adopted spatial autoregressive models to establish an explanatory relationship between the dependent and predictor variables. The results provide insight into which of the two models considered fits better. The study found SLM preferable to SEM because of its lower AIC value and Breusch-Pagan test statistic. The map of the squared Spatial Lag residuals indicated clustering around the centre of the map, stretching north and south (Figures 2 and 3). This pattern reveals an evident cross-border spatial autocorrelation of the examined socio-economic and health variables. The significant contributions of the male, black race, poverty, and urban dummy variables indicate that COVID-19 transmission is more a function of socio-economic and rural/urban conditions than of health outcomes. Diabetes and obesity showed positive relationships with COVID-19 incidence, but these relationships were very weak. Nevertheless, the available data show that socio-economic and health conditions are both potent determinants of COVID-19 incidence across the counties of the contiguous USA. In particular, vulnerable and poor populations (e.g., black and Hispanic populations, among others) have recorded higher COVID-19 incidence in the USA (Saffray et al., 2020).
The results of the regression analysis showed that the most influential factors of COVID-19 incidence varied significantly across the counties of the USA, which underscores the importance of spatial context in modelling outbreaks of infectious diseases in space and time.
All the outputs of this study revealed an interesting pattern that reflects cross-border spatial autocorrelation of events. Although researchers have attributed the impacts of COVID-19 to socio-economic disadvantages and inequalities arising from the pandemic itself (Ahmed et al., 2020; Mollalo et al., 2020), we observed that the severity of the impacts of COVID-19 is a function of cumulatively poor socio-economic conditions, rural/urban interactions, and health conditions of the people. For instance, poverty-stricken persons may be prone to underlying health conditions such as diabetes, obesity, and upper respiratory tract infections, which may result from eating habits or inadequate access to health care. These underlying health conditions in turn expose sick individuals to complications when infected with COVID-19. Results of the correlation analysis show that poor persons are likely to be susceptible to these disease conditions. The study also showed that population density had a negative effect on COVID-19 incidence. Our study did not consider the quantity and quality of health-care providers as explanatory variables; evaluating the influence of such variables on COVID-19 incidence would be highly revealing. Perhaps the most influential, but also the most difficult to capture, is the influence of behavioral factors on COVID-19 incidence. For instance, COVID-19 occurrence could be influenced by people's willingness to comply with rules and regulations regarding the pandemic. Addiction to certain behaviors or lifestyles could also expose some individuals or groups to infection. In addition, Mollalo et al. (2020) highlighted the possible influence of differences among the states in enforcing COVID-19 guidelines. Therefore, there is still more to learn about the factors influencing the COVID-19 pandemic.

Conclusions
The study revealed the footprint of COVID-19 incidence across counties in the contiguous United States between January 21 and September 16, 2020. Its findings may help to mitigate the diffusion and severity of the disease and to establish early warning surveillance where attention should be focused. The study has shown statistically significant positive spatial dependence in COVID-19 incidence across counties in the United States.
The study further confirms previous findings (Saffray et al., 2020; Maroko et al., 2020) that racial minority status, poverty, and migration patterns between rural and urban locations could explain COVID-19 incidence across US counties. These groups are at high risk of severe COVID-19 infection and death, which could be explained by poor living standards and poor access to healthcare facilities. The study further concludes that policymakers should take into account spatial peculiarities, rural-urban migration, and access to resources in addressing the transmission of COVID-19 in the United States.