
Surveillance Research Efforts Highlight Innovation and Outreach

Continuous Improvement in Processing Data

Over the years, our surveillance research team has witnessed the transition from carbon copy paper records to electronic medical records, and the change from basic computing (floppy discs, anyone?!) to the explosive growth in computing power and machine learning. This constant technology change requires continual learning and adaptation. Fortunately, it also means the opportunity to refine our methods and make the surveillance system more robust.

One of the guidelines for evaluating surveillance systems is “acceptability,” which includes the time burden of running the system; any means of reducing human review while maintaining accuracy is therefore important. Integrating machine learning is one of the strategies we have implemented.

We previously developed a naïve Bayes-based classification strategy to extract non-fatal injury cases from pre-hospital (EMS) free-text records. In the spirit of continual improvement, our most recent work aimed to improve retrieval rates, measured as the false positive rate required to obtain a true positive rate of 0.90, by benchmarking the reference naïve Bayes classifier against three other algorithms: elastic net regression, support vector machines, and boosted decision trees (XGBoost).
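To make the benchmark metric concrete, here is a minimal sketch of how a false positive rate at a fixed true positive rate could be computed from classifier scores. The function name `fpr_at_tpr` and the toy scores are illustrative assumptions, not the team's actual code or data.

```python
def fpr_at_tpr(labels, scores, target_tpr=0.90):
    """Return (false positive rate, threshold) at the strictest score
    threshold whose true positive rate reaches `target_tpr`.

    labels: 1 for a true injury case, 0 otherwise.
    scores: classifier scores, higher meaning "more likely a case".
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")

    # Lowering the threshold can only raise TPR and FPR, so the first
    # threshold that reaches the target also has the lowest FPR.
    for threshold in sorted(set(scores), reverse=True):
        flagged = [y for y, s in zip(labels, scores) if s >= threshold]
        true_pos = sum(flagged)
        false_pos = len(flagged) - true_pos
        if true_pos / positives >= target_tpr:
            return false_pos / negatives, threshold
    raise ValueError("target_tpr must be at most 1.0")


# Made-up scores: ten true cases followed by ten non-cases.
labels = [1] * 10 + [0] * 10
scores = [0.95, 0.90, 0.85, 0.80, 0.75, 0.70, 0.65, 0.60, 0.55, 0.20,
          0.58, 0.50, 0.40, 0.35, 0.30, 0.25, 0.15, 0.10, 0.05, 0.02]
fpr, threshold = fpr_at_tpr(labels, scores)
print(f"FPR at TPR >= 0.90: {fpr:.2f} (threshold {threshold})")
```

On this toy data, nine of the ten true cases are captured at a threshold of 0.55, at the cost of one false positive among ten non-cases. A lower value of this metric means fewer records flagged in error, and so less human review, for the same share of true cases retrieved.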

Using a labeled, gold-standard dataset (N=60,143) with substantial missing data (24%), we benchmarked these algorithms using a 75:25 train/test split with stratified sampling. We ran the benchmark on complete case data (N=44,566) and on data imputed using two methods:

  1. grouped hot-deck (filling in each missing value based on other attributes of the dataset)
  2. recoding of missing units to the category “unknown”
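The two imputation approaches and the stratified split can be sketched as follows. This is a simplified illustration under stated assumptions: the field names (`age_group`, `sex`, `location`, `injury`) are hypothetical, missing values are represented as `None`, and the hot-deck variant shown simply draws a random donor value from records in the same group, which may differ from the procedure the team used.

```python
import random
from collections import defaultdict


def recode_unknown(records, field):
    """Return copies of the records with missing `field` values set to 'unknown'."""
    return [{**r, field: "unknown" if r[field] is None else r[field]}
            for r in records]


def grouped_hot_deck(records, field, group_by, rng):
    """Fill missing `field` values with a value drawn from a donor record
    that shares the same `group_by` attributes."""
    donors = defaultdict(list)
    for r in records:
        if r[field] is not None:
            donors[tuple(r[g] for g in group_by)].append(r[field])
    filled = []
    for r in records:
        if r[field] is None:
            pool = donors.get(tuple(r[g] for g in group_by))
            # Leave the value missing when the group has no donor.
            filled.append({**r, field: rng.choice(pool) if pool else None})
        else:
            filled.append(dict(r))
    return filled


def stratified_split(records, label, test_fraction, rng):
    """Split records so each class keeps roughly the same train/test ratio."""
    by_class = defaultdict(list)
    for r in records:
        by_class[r[label]].append(r)
    train, test = [], []
    for rows in by_class.values():
        rows = rows[:]          # copy so the caller's list is not shuffled
        rng.shuffle(rows)
        n_test = round(len(rows) * test_fraction)
        test.extend(rows[:n_test])
        train.extend(rows[n_test:])
    return train, test


# Hypothetical records; None marks a missing value.
rng = random.Random(0)
records = [
    {"age_group": "adult", "sex": "M", "location": "farm", "injury": 1},
    {"age_group": "adult", "sex": "M", "location": None, "injury": 1},
    {"age_group": "youth", "sex": "F", "location": None, "injury": 0},
]
print(recode_unknown(records, "location")[1]["location"])
print(grouped_hot_deck(records, "location", ["age_group", "sex"], rng)[1]["location"])
```

Recoding to “unknown” keeps every record and lets the model treat missingness as its own category, while hot-deck replaces the gap with a plausible observed value. Calling `stratified_split(records, "injury", 0.25, rng)` on a larger dataset would produce the 75:25 split with class proportions preserved.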

Here’s what we learned:

  • While all models produced similar accuracies (0.96 to 0.98), XGBoost performed best with respect to false positive rates.
  • All four models performed well on complete data; however, our dataset contained missing units, which led to misclassifications and omissions and required additional human coding.
  • Relying on a machine learning method that is robust to missing data and to the choice of imputation method, such as XGBoost, is a reasonable approach to improving classification rates without omitting data.

The Northeast Center surveillance team continues to evaluate methods to enhance the system, especially in how it handles missing data. In addition, newer data are continually being evaluated.

Getting the Data to Our Stakeholders

Collecting and analyzing data is useless if the derived insights don’t reach stakeholders. We conducted a survey to understand how best to communicate our findings. Preferences differed notably among industries (e.g., agriculture, forestry, and fishing; education; public health) and occupations (e.g., managers, technicians, workers). The most popular methods of dissemination were infographics and short reports (1-5 pages), the latter preferably delivered quarterly. Survey respondents deemed surveillance data useful for analyzing trends and tailoring trainings, among other uses.

Read more at “Understanding Stakeholder Dissemination Preferences for an Agriculture, Forestry, and Fishing Injury Surveillance System,” Journal of Agromedicine, Vol 29, No 2.