
Data on Alcohol, Seatbelts and Airbags Added to FARS

Dr. Peter Cummings of the Harborview Injury Prevention and Research Center has produced extensions to the National Highway Traffic Safety Administration's FARS (Fatal Accident Reporting System) that fill in missing data with multiply imputed estimates. Each file is named CRASHIMPUTEDyyyy.DTA, where yyyy is the data year.

The CRASHIMPUTED files are Stata (version 9.0) files that can be linked to data from FARS. FARS data contain information about all fatal crashes on public roads in the U.S.; they have been collected since 1975 and are publicly available. Information about FARS can be found at http://www.nhtsa.dot.gov/

Together, the CRASHIMPUTED files contain imputed information for some FARS variables for the years 1982 through 2001, one file per year. Each file has four variables that can be used to link each record to the correct FARS record:

year: Crash year.

stcase: A six-character code indicating the state in which the crash occurred (two characters) and the crash number (four characters).

Vehno: Vehicle number. Each vehicle in a crash is assigned a number.

Perno: Person number. Each individual within a vehicle is assigned a person number.
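As an illustration of how the four linking variables work together, the following is a minimal sketch in Python, not the author's procedure. The two small record lists, the `injury` field, and the `link` helper are hypothetical; real FARS and CRASHIMPUTED records carry many more fields, and stcase is treated here as a six-character string, as described above.

```python
# Hypothetical records for illustration only.
fars_persons = [
    {"year": 2001, "stcase": "010001", "Vehno": 1, "Perno": 1, "injury": "fatal"},
    {"year": 2001, "stcase": "010001", "Vehno": 1, "Perno": 2, "injury": "none"},
]
imputed_rows = [
    {"year": 2001, "stcase": "010001", "Vehno": 1, "Perno": 1, "ibelt0": 1},
    {"year": 2001, "stcase": "010001", "Vehno": 1, "Perno": 2, "ibelt0": 0},
]

# The four linking variables named in the text.
KEYS = ("year", "stcase", "Vehno", "Perno")


def link_key(row):
    """Build the composite key that identifies one person in one crash."""
    return tuple(row[k] for k in KEYS)


def link(fars_rows, imputed):
    """Attach imputed fields to each FARS row that has a matching key."""
    index = {}
    for row in imputed:
        key = link_key(row)
        if key in index:
            raise ValueError(f"duplicate key in imputed data: {key}")
        index[key] = row
    linked = []
    for row in fars_rows:
        merged = dict(row)
        match = index.get(link_key(row))
        if match is not None:
            merged.update(match)  # unmatched FARS rows are kept unchanged
        linked.append(merged)
    return linked


linked = link(fars_persons, imputed_rows)
```

All four variables are needed because a person number is unique only within a vehicle, and a vehicle number only within a crash.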

Cummings has added the variable airbag, which indicates the presence or absence of an air bag for each person in a crash. FARS data contain information about air bags, but it is often missing. Cummings used the partial vehicle identification number that is in FARS and software (Vindicator 2001, Release No. 1. Arlington, VA: Highway Loss Data Institute, 2001) that could link that number to manufacturer information about air bag presence for the driver or right front seat passenger. The airbag variable is based on that linked information. When no link was obtained, FARS information was used. For some records, assumptions were made about air bag presence; for example, it was assumed there were no air bags in large trucks or buses. Some missing information remained.

Multiple-imputation:

Cummings used multiple imputation to impute unknown information 10 times for several variables. These variables have names that start with “i” for imputed and end with a digit from 0 to 9 that identifies the imputation. The variables are:

ibacbin0-ibacbin9: imputed values for any (bin = binary) blood alcohol.

ibacamt0-ibacamt9: imputed values for blood alcohol concentration (gm/dL).

ibelt0-ibelt9: imputed values for use of a seat belt or car seat.

iairbag0-iairbag9: imputed values for the presence of an air bag.

isex0-isex9: imputed values for sex (F=0, M=1)

iage0-iage9: imputed values for age (years)

Alcohol levels were imputed only for drivers or non-occupants, not for passengers.
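The naming scheme above lends itself to a simple way of pulling out one completed data set at a time. The sketch below is an illustration only: the helper names `imputed_names` and `completed_dataset` are hypothetical, and it assumes each record is a dictionary keyed by variable name.

```python
# Stems of the imputed variables listed above; each has versions 0 through 9.
IMPUTED_STEMS = ("ibacbin", "ibacamt", "ibelt", "iairbag", "isex", "iage")
N_IMPUTATIONS = 10


def imputed_names(stem):
    """Return the 10 variable names for one stem, e.g. ibelt0 ... ibelt9."""
    return [f"{stem}{m}" for m in range(N_IMPUTATIONS)]


ALL_IMPUTED = {name for stem in IMPUTED_STEMS for name in imputed_names(stem)}


def completed_dataset(rows, m):
    """Return a copy of rows holding only imputation m (0-9).

    The chosen version of each imputed variable is stored under its stem,
    so ibelt3 becomes ibelt when m == 3.
    """
    if not 0 <= m < N_IMPUTATIONS:
        raise ValueError("m must be between 0 and 9")
    completed = []
    for row in rows:
        # Keep every non-imputed field (link keys, known values) unchanged.
        new_row = {k: v for k, v in row.items() if k not in ALL_IMPUTED}
        for stem in IMPUTED_STEMS:
            name = f"{stem}{m}"
            if name in row:
                new_row[stem] = row[name]
        completed.append(new_row)
    return completed
```

An analysis would then be run once on each of the 10 completed data sets and the 10 results combined.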

The imputation methods used to create this information involved over 10,000 files that took up over 13 gigabytes of hard drive space. We cannot guarantee that this process was free of error.

These data are freely available to researchers who wish to use them. If you use these data in a publication, we would appreciate it if you would acknowledge the source with wording such as: “We used multiple imputations of missing data created by Peter Cummings of the Harborview Injury Prevention & Research Center, University of Washington, Seattle WA. The work was supported by grants R49/CCR002570 and R49/CCR019477-01 from the Centers for Disease Control and Prevention, Atlanta, Ga and by the Crash Injury Research and Engineering Network of the National Highway Traffic Safety Administration.”

For anyone who wishes to use these data, we recommend that you be familiar with methods for combining multiple-imputed data. Useful references include:

1. Schafer JL. Multiple imputation: a primer. Stat Meth Med Res 1999; 8:3-15.

2. Schafer JL. Analysis of Incomplete Multivariate Data. New York: Chapman & Hall, 1997, p 109-10.

3. Rubin DB. Multiple Imputation for Nonresponse in Surveys. New York: John Wiley & Sons, 1987, p 75-8.

4. Little RJA, Rubin DB. Statistical Analysis With Missing Data. Hoboken, NJ: John Wiley & Sons, 2002, p 86-7.
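For orientation, the standard combining rules described in these references (often called Rubin's rules) can be sketched as follows. This is a minimal illustration for a single scalar estimate, not code distributed with the data; the function name `pool_rubin` and the example numbers are hypothetical.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Combine one scalar estimate across m imputed data sets.

    estimates: the point estimate from each completed data set
    variances: the squared standard error from each completed data set
    Returns (pooled estimate, total variance, degrees of freedom).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need matching lists from at least 2 imputations")
    q_bar = mean(estimates)        # pooled point estimate
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    if between > 0:
        df = (m - 1) * (1 + within / ((1 + 1 / m) * between)) ** 2
    else:
        df = float("inf")          # no between-imputation variation
    return q_bar, total, df


# Hypothetical log odds ratios and squared standard errors from 10 analyses.
example_estimates = [0.52, 0.49, 0.55, 0.50, 0.47, 0.53, 0.51, 0.48, 0.54, 0.50]
example_variances = [0.010] * 10
pooled, total_var, dof = pool_rubin(example_estimates, example_variances)
```

The pooled standard error is the square root of the total variance. It is larger than the standard error from any single completed data set would suggest, because it also reflects disagreement among the imputations.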

A Brief Description of the Imputation Process That Was Used:

We used multiple imputation to create 10 sets of data which were identical in regard to known information, but could differ, one from another, on imputed values for missing information. We first imputed 10 values for the presence of any alcohol in the blood of drivers or non-occupants.

The imputation process was done separately for each year of data and each of eight categories:

1) non-occupants (pedestrians and bicyclists);

2) motorcycles;

3) passenger cars;

4) light trucks and vans;

5) utility vehicles;

6) minivans;

7) medium and heavy trucks;

8) buses, motor homes, and miscellaneous vehicles.
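Running the imputation separately by year and category amounts to splitting the records into strata before any model is fitted. The sketch below illustrates that bookkeeping only; the `category` field and the `split_into_strata` helper are hypothetical, and how each record was actually assigned to a category is not shown here.

```python
from collections import defaultdict

# The eight categories listed above.
CATEGORIES = (
    "non-occupants",
    "motorcycles",
    "passenger cars",
    "light trucks and vans",
    "utility vehicles",
    "minivans",
    "medium and heavy trucks",
    "buses, motor homes, and miscellaneous vehicles",
)


def split_into_strata(rows):
    """Group records by (year, category) so each group can be imputed alone."""
    strata = defaultdict(list)
    for row in rows:
        if row["category"] not in CATEGORIES:
            raise ValueError(f"unknown category: {row['category']}")
        strata[(row["year"], row["category"])].append(row)
    return dict(strata)
```

With 20 data years (1982 through 2001) and eight categories, this yields up to 160 groups.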

Variables considered for each imputation included the outcome (survival or death), age, gender, police report of drinking, seatbelt or helmet use, valid license, previous drunk driving episodes, day of the week, hour of the day, vehicle on roadway, and whether the crash involved a single vehicle, the vehicle was struck by another, or the vehicle struck another. [See Subramanian 2002 and Schafer 1997 for more detail.]

Imputations for any blood alcohol used an expectation-maximization algorithm to estimate a probability distribution for the values in each possible cell of the incomplete data. A Markov-chain Monte Carlo method was used for simulating draws from the cell probabilities. [See Schafer 1997] We used S-Plus software. [See Schimert 2001]

Among those known or imputed to have some alcohol in their blood, we then imputed the level of alcohol using the method of chained equations (or regression switching).[See van Buuren 1999] Blood alcohol and the other imputation variables were each in turn the response variable for a linear or logistic model in which known values of each variable were used to impute the missing information; this process cycled through each variable 10 times, updating the imputed values with each cycle. We used Stata software for this step. [See Royston 2005] After the imputation of blood alcohol levels was complete, we used the chained equations method to multiply impute missing information about seat belts, helmets, air bags, age, and sex, using the imputed blood alcohol levels and the other variables.
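The cycling idea behind chained equations can be shown with a deliberately simplified sketch. It is not the procedure used to create these files: it assumes only two continuous variables (the hypothetical `x` and `y`), uses linear models only, and does not draw the regression parameters from their posterior distribution, as a full implementation would.

```python
import random


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b, residual sd)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx if sxx > 0 else 0.0
    a = my - b * mx
    rss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, (rss / max(n - 2, 1)) ** 0.5


def chained_impute(rows, names=("x", "y"), cycles=10, seed=0):
    """Return one completed copy of rows; missing values are given as None.

    Assumes each variable has at least one observed value.
    """
    rng = random.Random(seed)
    data = [dict(r) for r in rows]
    missing = {v: {i for i, r in enumerate(rows) if r[v] is None} for v in names}
    # Starting values: fill each gap with the observed mean of its variable.
    for v in names:
        observed = [r[v] for r in rows if r[v] is not None]
        start = sum(observed) / len(observed)
        for i in missing[v]:
            data[i][v] = start
    first, second = names
    for _ in range(cycles):
        # Each variable takes a turn as the response variable.
        for target, predictor in ((first, second), (second, first)):
            known = [i for i in range(len(data)) if i not in missing[target]]
            a, b, sd = fit_line([data[i][predictor] for i in known],
                                [data[i][target] for i in known])
            for i in sorted(missing[target]):
                # Replace the current guess with a prediction plus random noise.
                data[i][target] = a + b * data[i][predictor] + rng.gauss(0, sd)
    return data
```

Calling `chained_impute` with 10 different seeds would give 10 completed data sets that agree on every known value and differ only where values were missing.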

Other useful references about multiple-imputation include:

1. Greenland S, Finkle WD. A critical look at methods for handling missing covariates in epidemiologic regression analysis. Am J Epidemiol 1995; 142:1255-1264.

2. Harrell FE, Jr. Regression modeling strategies: with applications to linear models, logistic regression, and survival analysis. New York: Springer-Verlag, 2001.

3. Heitjan DF, Little RJA. Multiple imputation for the Fatal Accident Reporting System. Appl Statist 1991; 40:13-29.

4. Raghunathan TE. What do we do with missing data? Some options for analysis of incomplete data. Annu Rev Public Health 2004; 25:99-117.

5. Royston P. Multiple imputation of missing values: update. Stata J 2005; 5:188-201.

6. Rubin DB, Schafer JL, Subramanian R. Multiple imputation of missing blood alcohol concentration (BAC) values in FARS. Washington DC: National Highway Traffic Safety Administration, 1998.

7. Schimert J, Schafer JL, Hesterberg T, Fraley C, Clarkson DB. Analyzing Data with Missing Values in S-Plus. Seattle, WA: Insightful Corporation, 2001.

8. Subramanian R. Transitioning to multiple imputation - a new method to estimate missing blood alcohol concentration (BAC) values in FARS. DOT HS 809 403. Washington, DC: National Highway Traffic Safety Administration, 2002.

9. Subramanian R. Alcohol involvement in fatal crashes 2001. DOT HS 809 579. Washington, DC: National Highway Traffic Safety Administration, National Center for Statistics and Analysis, 2003.

10. van Buuren S, Boshuizen HC, Knook DL. Multiple imputation of missing blood pressure covariates in survival analysis. Stat Med 1999; 18:681-94.

