Tuesday, April 18, 2006

Handling Holidays

When dealing with daily syndromic data, a major source of variability is holidays. Counts of hospital visits on holidays are usually lower, and the same is true for medication and grocery sales. Some of this is attributable to holiday hours and staffing at shops, workplaces, etc. Individuals also probably tend to avoid shopping on some holidays (whereas on others shopping is extremely popular). In data such as school absences, holiday data are missing altogether.

In the context of outbreak detection, holidays are usually bad news. They can affect the performance of monitoring methods and lead to false alarms. The good news is that we know exactly when holidays take place.

From the algorithm-design standpoint, the question is what to do about holiday data. Should it be removed? Imputed? Ignored? And should this be done before the algorithm is applied to the data, or should the algorithm have a built-in solution?

A discussion that we've had in our research group raises the following issues:

- If we are trying to precondition the data in order to remove seasonality and temporal trends, some methods are more sensitive to the presence of holiday data than others. For example, 7-day differencing is affected by holidays only locally: the week of the holiday will have an extremely low 7-day difference, and the following week an extremely high one (see the sketch after this list). In contrast, a regression model with day-of-week predictors might be heavily influenced by holidays.

- What metrics are affected by holiday data? (e.g., local/global averages, autocorrelation)

- How should holiday data be imputed? (and, of course, toward what goal?)
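
To make the first point concrete, here is a minimal sketch (in Python, on a hypothetical simulated series - not real syndromic data) of how a single holiday propagates into exactly two extreme values of the 7-day difference:

```python
import numpy as np

# Hypothetical daily counts: a weekly cycle plus Poisson noise,
# with one holiday (day 30) at roughly half the usual volume.
rng = np.random.default_rng(0)
days = np.arange(70)
baseline = 100 + 20 * np.sin(2 * np.pi * days / 7)
counts = rng.poisson(baseline)
counts[30] = counts[30] // 2  # holiday drop

# 7-day differencing removes the day-of-week effect...
diff7 = counts[7:] - counts[:-7]

# ...but the holiday surfaces twice: an extreme low when the holiday
# is the current day, and an extreme high exactly one week later.
print("difference at the holiday :", diff7[30 - 7])
print("difference one week later :", diff7[30])
```

The rest of the differenced series is untouched, which is exactly the "local" behavior described above; a regression with day-of-week dummies, by contrast, would let the holiday pull down the estimate for its entire weekday.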

I hope we can have a discussion here about different approaches to handling holiday data.

Wednesday, March 29, 2006

Biosurveillance meetings

Because of the very diverse nature of the research, implementation, and usage of syndromic surveillance systems, the platforms for discussing advances are dispersed. I've attended and given talks at a range of conferences, workshops, and workgroups, from purely academic to non-academic. Even within academia there are different types of meetings, from multiple disciplines. Let me describe a few past and upcoming events, and hopefully others will add to the list.

The largest venue is of course the annual Syndromic Surveillance Conference, hosted this year in Baltimore, MD. It is attended by many of the researchers who develop and design the systems, by organizations that deploy them, and by users.

The DIMACS working group on Biosurveillance, Data Monitoring and Information Exchange, spearheaded by Henry Rolka and Colleen Martin from the CDC and David Madigan from Rutgers, has organized several meetings in the last few years. The latest brought together many health monitors who use syndromic surveillance systems.

The upcoming 2006 INFORMS conference will feature an invited session on "Quality and Statistical Decision-Making in Healthcare Applications", which might include a biosurveillance aspect.

Statistical flavor:
SAMSI Anomaly Detection Workgroup - a group of statisticians who have been meeting weekly since September 2005 (some of them remotely) to discuss statistical methods for biosurveillance. There was a kickoff meeting and a mid-year workshop.

The 2005 Joint Statistical Meetings (JSM) in Minneapolis featured several sessions on biosurveillance, including an invited panel on "national systems for biosurveillance" and a session on "Innovations in Prospective Anomaly Detection for Biosurveillance".

The upcoming 2006 ENAR spring meeting will have a few biosurveillance-related talks.

The upcoming 2006 International Workshop on Applied Probability will have several sessions on biosurveillance (update from Daniel Neill).

On the data-mining/machine-learning side are the KDD 2005 workshop "Data Mining Methods for Anomaly Detection" and its successor, "ML Algorithms for Surveillance and Event Detection", to be held in conjunction with ICML-2006 in Pittsburgh.

A more medical-informatics-oriented venue is the annual American Medical Informatics Association (AMIA) symposium, which in 2006 will be held in Washington, DC (update from Daniel Neill).

Wednesday, March 01, 2006

Syndromic surveillance systems in practice

Last week's DIMACS working group on biosurveillance brought together users of the BioSense, ESSENCE, and RODS surveillance systems alongside a few researchers. This was an eye-opener for me, a researcher. I learned a very important point: most health monitors, who are the users of such systems, have learned to ignore alarms triggered by their system. This is due to the excessive false alarm rate that is typical of most systems - there is an alarm nearly every day!

The increased alarm rate can result from several factors. First, both BioSense and ESSENCE use classic control charts such as CuSum and EWMA (although with some adaptations). One pattern that is very prevalent in daily syndromic data is a day-of-week effect. This violates one of the main assumptions underlying these control charts, namely that the "target value" is constant. In practice, we do not expect the same number of ER visits on weekends as on weekdays. In the presence of such an effect, a CuSum or EWMA chart is likely to raise false alarms on "high" days (e.g., weekday ED visits) and to miss real outbreaks on "low" days (e.g., weekend ED visits).
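
To illustrate the point - a sketch with simulated counts, not any deployed system's actual chart - a textbook one-sided CuSum built around a single constant target alarms repeatedly on ordinary weekdays:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical ER visit counts: mean 120 on weekdays, 60 on weekends,
# and no outbreak anywhere in the series.
days = np.arange(20 * 7)
mean = np.where(days % 7 < 5, 120, 60)
counts = rng.poisson(mean)

# Classic one-sided CuSum with a single constant target value,
# as if the series had no day-of-week structure.
target = counts.mean()              # sits between the two regimes
sigma = np.sqrt(target)             # Poisson-style in-control sd
k, h = 0.5 * sigma, 4.0 * sigma     # common reference/decision values

s, alarms = 0.0, []
for t, x in enumerate(counts):
    s = max(0.0, s + (x - target) - k)
    if s > h:
        alarms.append(t)
        s = 0.0                     # restart after signaling

# Alarms pile up on ordinary weekdays, while a real weekend outbreak
# would first have to overcome the weekend deficit to be noticed.
print("false alarms on days:", alarms)
```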

The day-of-week effect is only one pattern that, if ignored, can lead to excessive false alarms. Another factor is the autocorrelation present in daily syndromic data even after the day-of-week effect is removed. Howard Burkom gives the example of a common cold's effect on daily counts of respiratory syndrome data, when the cold is not what you want to detect. This, too, can lead to excessive false alarms.

The bottom line is that excessive false alarms are not surprising when these control charts are applied directly to raw daily syndromic data. This is closely related to my posting on monitoring complex data. The good news is that this can actually be improved: there are ways to precondition the data that directly address the structure of these new daily syndromic streams. Some examples are regression models (e.g., Brillman et al. 2005), ARIMA models (e.g., Reis and Mandl, 2003), exponential smoothing (e.g., Burkom, Murphy, and Shmueli, work in progress), and wavelet methods (e.g., Goldenberg et al. 2002; Shmueli, 2005). The difficulty is to have one automated method (or suite of methods) that can be used with diverse syndromic data streams. That diversity therefore needs to be characterized first.
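
As one concrete, deliberately simple instance of such preconditioning - a sketch, not the exact models of the papers cited above - the day-of-week effect can be removed with per-weekday baseline means (equivalent to a regression on day-of-week indicators), after which the standardized residuals can be handed to a standard chart:

```python
import numpy as np

def precondition_day_of_week(counts, baseline_days=56):
    """Remove the day-of-week effect using per-weekday means estimated
    from a baseline window (equivalent to regressing on day-of-week
    dummies); return standardized residuals for the remaining days."""
    counts = np.asarray(counts, dtype=float)
    dow = np.arange(len(counts)) % 7
    dow_means = np.array(
        [counts[:baseline_days][dow[:baseline_days] == d].mean()
         for d in range(7)])
    resid = counts[baseline_days:] - dow_means[dow[baseline_days:]]
    return resid / resid.std()

# The residual stream now has a roughly constant target of 0 and unit
# variance, so a standard CuSum or EWMA chart applies directly to it.
```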

Another reason for the high rate of alarms has to do with multiple testing: a health monitor typically examines data from multiple locations and can also slice the data by categories (e.g., by age group). This relates to a previous posting of mine on multiple testing. How should this multiplicity be addressed? Should there be a ranking system that displays the "most abnormal" series?
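
One candidate answer - my own suggestion, not something the current systems are known to do - is to control the false discovery rate across all monitored series and display the flagged ones ranked by extremeness, e.g., with the Benjamini-Hochberg step-up procedure:

```python
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    """Return the indices of series flagged at false discovery rate q,
    ordered from most to least anomalous."""
    p = np.asarray(p_values)
    order = np.argsort(p)                   # most anomalous first
    m = len(p)
    thresh = q * np.arange(1, m + 1) / m    # BH step-up thresholds
    below = p[order] <= thresh
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    return order[:k]

# e.g., one p-value per (location x age group) series on a given day:
p = [0.001, 0.20, 0.03, 0.74, 0.008, 0.55]
print(benjamini_hochberg(p, q=0.10))        # flagged series, ranked
```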

And finally, there is a host of data quality issues related to alarming. These are typically harder to tackle through models.

As a statistician, I feel that we must regain health monitors' trust in the alerting abilities of syndromic surveillance systems. Designing automated algorithms for data preconditioning should be a priority. We should also be able to demonstrate the benefit of such preconditioning before applying popular control charts like the CuSum: its effectiveness in reducing false alarms while still capturing real abnormalities.


References

Brillman JC, Burr T, Forslund D, Joyce E, Picard R, and Umland E (2005). Modeling Emergency Department Visit Patterns for Infectious Disease Complaints: Results and Application to Disease Surveillance. BMC Medical Informatics and Decision Making, 5:4, pp 1-14. http://www.biomedcentral.com/content/pdf/1472-6947-5-4.pdf

Goldenberg A, Shmueli G, Caruana RA, and Fienberg SE (2002). Early Statistical Detection of Anthrax Outbreaks by Tracking Over-the-Counter Medication Sales. PNAS, 99(8), 5237-5240.

Reis BY and Mandl KD (2003). Time Series Modeling for Syndromic Surveillance. BMC Medical Informatics and Decision Making, 3:2. http://www.biomedcentral.com/1472-6947/3/2

Shmueli G (2005). Wavelet-Based Monitoring in Modern Biosurveillance. Working Paper, Smith School of Business, University of Maryland.

Tuesday, February 14, 2006

Monitoring Complex Data

I. Relevance
A main challenge in biosurveillance is that pre-diagnostic data are much more complicated than traditional diagnostic data. They are more frequent (daily vs. weekly or monthly), noisier, and include a lot of irrelevant variability. Many other challenges arise on top of this, but the "ugly" data are a fundamental issue that needs to be accounted for.

II. Current State
Many current monitoring methods, which may have worked for traditional data, appear somewhat over-simplistic for monitoring modern pre-diagnostic data. The major effort has therefore been to design more sophisticated algorithms. Indeed, this approach is also popular in other fields where modern data are of a different nature than previously collected data. In a forthcoming paper (see references below), Stephen Fienberg and I survey advanced monitoring methods in other fields. The idea was to see whether we can learn from other fields and how they tackle the new data era. We found methods ranging from singular spectrum analysis and wavelets to exponential smoothing and data depth. These more sophisticated methods are designed for monitoring noisy, frequent, and often non-stationary data.

III. Alternative Approach
An alternative approach to designing more complicated algorithms is to simplify the data. This is actually standard statistical thinking, if we consider transformations and other pre-processing techniques that are common in statistical analyses. However, unlike traditional pre-processing, practical biosurveillance systems require a much more automated way of data simplification. We cannot afford to have a statistician tweak every data stream separately, because there is an abundance of data sources, data types, and data streams, and all of these are subject to frequent changes. So unless we want to develop a new population of "data tweakers", finding automated ways sounds more reasonable.

By "simplification" I mean brining the data to a form that can then be fed into standard monitoring techniques. I am sure that the end-users, mainly epidemiologists and public health experts, would prefer using CuSum and EWMA charts to learning complicated new monitoring methods, or treating the system as a black-box.

Our research group, together with Howard Burkom and Sean Murphy from Johns Hopkins APL, has been focusing on this goal. I plan to post a few papers on some results soon. I'd be interested in hearing about similar efforts.


IV. References

Shmueli G and Fienberg SE (2006). Current and Potential Statistical Methods for Monitoring Multiple Data Streams for Bio-Surveillance. In Statistical Methods in Counter-Terrorism, eds. A Wilson and D Olwell, Springer, to appear.