Thoughts on Biosurveillance

Tuesday, April 18, 2006

Handling Holidays

When dealing with daily syndromic data, a major source of variability is holidays. Counts of hospital visits on holidays are usually lower, and the same is true for medication and grocery sales. Some of this is attributable to holiday hours and staffing at shops, workplaces, etc. Also, individuals probably tend to avoid shopping on some holidays (whereas on other holidays shopping is extremely popular). In data such as school absences, holiday data are missing altogether.

In the context of outbreak detection, holidays are usually bad news. They can affect the performance of monitoring methods and lead to false alarms. The good news is that we know exactly when holidays take place.

From the algorithm design standpoint, the question is what to do about holiday data. Should it be removed, imputed, or ignored? Should this be done before the algorithm is applied to the data, or should the algorithm have a built-in solution?

A discussion that we've had in our research group raises the following issues:

- If we are trying to precondition the data in order to remove seasonality and temporal trends, some methods are more sensitive to the presence of holiday data than others. For example, 7-day differencing will only be affected by holidays locally: the holiday itself will have an extremely low 7-day difference, and the same weekday in the following week will have an extremely high 7-day difference. In contrast, a regression model with day-of-week predictors might be heavily influenced by holidays, because the holiday counts enter the estimates that are then applied to every week.

- What metrics are affected by holiday data? (e.g., local/global averages, autocorrelation)

- How should holiday data be imputed? (of course, for what goal?)
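To make the first point concrete, here is a minimal Python sketch. The counts, the weekly pattern, and the position of the holiday are all made up for illustration; it only shows that a single holiday touches two 7-day differences, while it shifts the day-of-week average that a regression with day-of-week predictors would rely on.

```python
# Four weeks of hypothetical daily counts (Mon..Sun): weekdays near 100, weekends near 60.
weekly_pattern = [100, 102, 98, 101, 99, 60, 58]
counts = weekly_pattern * 4

# Assumption for illustration: day 10 (a Thursday) is a holiday with a very low count.
holiday_index = 10
counts[holiday_index] = 40

# 7-day differencing compares each day with the same weekday one week earlier.
# Collect only the days whose 7-day difference is not zero.
affected = {t: counts[t] - counts[t - 7]
            for t in range(7, len(counts))
            if counts[t] != counts[t - 7]}
print(affected)  # {10: -61, 17: 61}

# A day-of-week mean, as a regression with day-of-week predictors would estimate it,
# is pulled down for every Thursday, not just the holiday.
thursdays = counts[3::7]
thursday_mean = sum(thursdays) / len(thursdays)
print(thursday_mean)  # 85.75, although a typical Thursday is 101
```

Only the holiday and the same weekday a week later are disturbed by differencing, whereas the Thursday average is off by about 15 counts for all four weeks.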

I hope we can have a discussion here about different approaches to handling holiday data.

Wednesday, March 29, 2006

Biosurveillance meetings

Because of the very diverse nature of the research, implementation, and usage of syndromic surveillance systems, the platforms for discussing advances are dispersed. I've attended and given talks at a range of conferences, workshops, and workgroups ranging from purely academic to non-academic. Even within the academic conferences there are different types of meetings, from multiple disciplines. Let me describe a few past and upcoming events, and hopefully others will add to the list.

The largest venue is of course the annual Syndromic Surveillance Conference, to be hosted this year in Baltimore, MD. This is attended by many of the researchers developing and designing the systems, organizations that deploy them, as well as users.

The DIMACS group on Biosurveillance, Data Monitoring and Information Exchange, spearheaded by Henry Rolka and Colleen Martin from CDC and David Madigan from Rutgers, organized a few workgroup meetings in the last few years. The latest brought together many health monitors who use syndromic surveillance systems.

The upcoming 2006 INFORMS conference will feature an invited session on "Quality and Statistical Decision-Making in Healthcare Applications", which might include a biosurveillance aspect.

Statistical flavor:
SAMSI Anomaly Detection Workgroup: a group of statisticians who have been meeting weekly since September 2005 (some of them remotely) to discuss statistical methods for biosurveillance. There was a kickoff meeting and a mid-year workshop.

The 2005 Joint Statistical Meeting (JSM) in Minneapolis featured several sessions on biosurveillance, including an invited panel on "national systems for biosurveillance" and a session on "Innovations in Prospective Anomaly Detection for Biosurveillance".

The upcoming 2006 ENAR spring meeting will have a few biosurveillance-related talks.

The upcoming 2006 International Workshop on Applied Probability will have several sessions on biosurveillance (update from Daniel Neill).

More data-mining/machine-learning venues are the KDD 2005 workshop "Data Mining Methods for Anomaly Detection" and the next one, called "ML Algorithms for Surveillance and Event Detection", in conjunction with ICML-2006 in Pittsburgh.

A venue that is more medical-informatics oriented is the annual American Medical Informatics Association symposium. In 2006 it will be in Washington, DC (update from Daniel Neill).

Wednesday, March 01, 2006

Syndromic surveillance systems in practice

Last week's DIMACS working group on biosurveillance brought together users of the BioSense, ESSENCE, and RODS surveillance systems alongside a few researchers. This was an eye-opener for me, a researcher. I learned a very important point: most health monitors, who are the users of such systems, have learned to ignore alarms triggered by their system. This is due to the excessive false alarm rate that is typical of most systems: there is an alarm nearly every day!

The increased alarm rate can be a result of several factors. First, both BioSense and ESSENCE use classic control charts such as CuSum and EWMA (although with some adaptations). One clear pattern that is very common in daily syndromic data is a day-of-week effect. This violates one of the main assumptions underlying the above control charts, which is that the "target value" is constant. In contrast, we do not expect the same number of ER visits on weekends and on weekdays. In the presence of such an effect, a CuSum or EWMA chart is likely to lead to false alarms on "high" days (e.g., weekday visits to the ED), and to miss real outbreaks on "low" days (e.g., weekend visits to the ED).
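Here is a small Python sketch of that mechanism. It is not the BioSense or ESSENCE implementation: it is a textbook one-sided upper CuSum that restarts after each signal, and the counts, the reference value k, and the threshold h are made-up values. The series contains no outbreak at all, only a weekday/weekend pattern.

```python
def cusum_alarms(series, target, k, h):
    """One-sided upper CuSum with a constant target.

    Returns the indices at which the statistic exceeds the threshold h.
    The statistic is reset to zero after each alarm.
    """
    alarms, s = [], 0.0
    for t, x in enumerate(series):
        s = max(0.0, s + (x - target - k))
        if s > h:
            alarms.append(t)
            s = 0.0
    return alarms

# Hypothetical outbreak-free counts (Mon..Sun), repeated for four weeks.
counts = [100, 100, 100, 100, 100, 60, 60] * 4

# The constant "target value" is taken to be the overall mean (about 88.6).
target = sum(counts) / len(counts)

alarms = cusum_alarms(counts, target, k=5.0, h=20.0)
print(alarms)  # [3, 10, 17, 24]
```

Every alarm falls on a Thursday: the weekday counts sit above the overall mean, so the statistic climbs during each work week, even though nothing unusual is happening.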

The day-of-week effect is only one pattern that can lead, if ignored, to excessive false alarms. Another factor is the autocorrelation present in daily syndromic data, even after removing the day-of-week effect. Howard Burkom gives the example of a common cold: it raises daily counts of respiratory syndrome data over a run of days, even though the cold is not what you want to detect. Such runs, of course, can lead to excessive false alarms.

The bottom line is that the excessive false alarms are not surprising when these control charts are applied directly to raw daily syndromic data. This is closely related to my posting on monitoring complex data. The good news is that this can actually be improved: there are ways to precondition the data that directly address the structure of these new daily syndromic streams. Some examples are regression models (e.g., Brillman et al. 2005), ARIMA models (e.g., Reis and Mandl, 2003), exponential smoothing (e.g., Burkom, Murphy, and Shmueli, work in progress), wavelet methods (e.g., Goldenberg et al. 2002, Shmueli, 2005), etc. The difficulty is to have one automated method (or suite of methods) that can be used with diverse syndromic data streams. The diversity therefore needs to be characterized.
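As an illustration of preconditioning, here is a minimal Python sketch of the simplest version of a day-of-week regression: subtract the baseline mean for each day of the week and monitor the residuals. It is not taken from any of the cited papers. The counts, the size of the injected weekend increase, and the CuSum parameters are invented, and the function names are my own.

```python
def day_of_week_means(baseline):
    """Mean count for each day of the week (0..6) over a baseline that starts on day 0."""
    return [sum(baseline[d::7]) / len(baseline[d::7]) for d in range(7)]

def residuals(series, dow_means):
    """Observed count minus the baseline mean for that day of the week."""
    return [x - dow_means[t % 7] for t, x in enumerate(series)]

def cusum_alarms(series, target, k, h):
    """One-sided upper CuSum; returns indices where the statistic exceeds h."""
    alarms, s = [], 0.0
    for t, x in enumerate(series):
        s = max(0.0, s + (x - target - k))
        if s > h:
            alarms.append(t)
            s = 0.0
    return alarms

# Hypothetical baseline: four outbreak-free weeks (Mon..Sun).
baseline = [100, 102, 98, 101, 99, 60, 58] * 4
dow_means = day_of_week_means(baseline)

# Hypothetical test week with 15 extra cases on Saturday and on Sunday.
test_week = [100, 102, 98, 101, 99, 75, 73]

resid = residuals(test_week, dow_means)
print(resid)  # [0.0, 0.0, 0.0, 0.0, 0.0, 15.0, 15.0]

# The residuals have a constant target of zero, which is what the chart assumes.
alarms = cusum_alarms(resid, target=0.0, k=5.0, h=15.0)
print(alarms)  # [6]
```

The weekend counts (75 and 73) are still below the overall baseline mean of about 88, so a chart with a single constant target applied to the raw counts would not see this increase. After preconditioning it stands out.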

Another reason for the high rate of alarms has to do with multiple testing: A health monitor will be examining data from multiple locations, and can also slice the data by categories (e.g., by age group). This relates to a previous posting that I had on multiple testing. How should this multiplicity be addressed? Should there be a ranking system that displays the "most abnormal" series?
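A quick back-of-the-envelope calculation shows how fast the multiplicity adds up. The sketch below assumes the series are independent and that each one is tested once a day at the same false alarm probability; both are simplifying assumptions, and the numbers are illustrative only.

```python
def prob_any_false_alarm(alpha, m):
    """Chance of at least one false alarm among m independent series,
    each tested at false alarm probability alpha."""
    return 1.0 - (1.0 - alpha) ** m

for m in (1, 10, 50, 100):
    print(m, round(prob_any_false_alarm(0.01, m), 3))
# 1 0.01
# 10 0.096
# 50 0.395
# 100 0.634
```

With a 1% daily false alarm probability per series, a monitor watching 100 series should expect some alarm on roughly two days out of three.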

And finally, there is a host of data quality issues that are related to the alarming. These are typically harder to tackle through models.

As a statistician, I feel that we must regain the trust of health monitors in the alerting abilities of syndromic surveillance systems. Designing automated algorithms for data preconditioning should be a priority. We should also be able to show the benefit of employing such preconditioning before using popular control charts like the CuSum, and its effectiveness in reducing false alarms and capturing real abnormalities.

References

Judith C Brillman, Tom Burr, David Forslund, Edward Joyce, Rick Picard and Edith Umland, Modeling emergency department visit patterns for infectious disease complaints: results and application to disease surveillance, BMC Medical Informatics and Decision Making 2005, 5:4, pp 1-14 http://www.biomedcentral.com/content/pdf/1472-6947-5-4.pdf

Goldenberg A, Shmueli G, Caruana RA and Fienberg SE (2002). Early Statistical Detection of Anthrax Outbreaks by Tracking Over-the-Counter Medication Sales. PNAS, 99 (8), 5237-5240.

Reis BY, Mandl KD. Time series modeling for syndromic surveillance. BMC Med Inform Decis Mak. 2003;3(1). http://www.biomedcentral.com/1472-6947/3/2

Shmueli, G. (2005), Wavelet-Based Monitoring in Modern Biosurveillance, Working Paper, Smith School of Business, University of Maryland.

Tuesday, February 14, 2006

Monitoring Complex Data

I. Relevance
A main challenge in biosurveillance is that pre-diagnostic data are much more complicated than more traditional diagnostic data. These data are more frequent (daily vs. weekly or monthly), noisier, and include a lot of irrelevant variability. There are many other challenges that arise on top of this, but the "ugly" data are a fundamental issue that needs to be accounted for.

II. Current State
Many current monitoring methods, which may have worked for traditional data, appear to be somewhat over-simplistic for monitoring modern pre-diagnostic data. The major effort has therefore been to design more sophisticated algorithms. And indeed, this approach is also popular in other fields where modern data are of a different nature than previously collected data. In a forthcoming paper (see references below), Stephen Fienberg and I survey advanced monitoring methods in other fields. The thought behind this was to see whether we can learn from other fields, and see how they tackle the new data era. We found methods that range from singular spectral analysis, wavelets, and exponential smoothing to data depth. These more sophisticated methods are designed for monitoring noisy, frequent, and often non-stationary data.

III. Alternative Approach
An alternative approach to designing more complicated algorithms is to simplify the data. This is actually standard statistical thinking, if we consider transformations and other pre-processing techniques that are common in statistical analyses. However, unlike traditional pre-processing, practical biosurveillance systems require a much more automated way of data simplification. We cannot afford to have a statistician tweak every data stream separately, because there is an abundance of data sources, data types, and data streams, and all of these are subject to frequent changes. So unless we want to develop a new population of "data tweakers", finding automated ways sounds more reasonable.

By "simplification" I mean bringing the data to a form that can then be fed into standard monitoring techniques. I am sure that the end-users, mainly epidemiologists and public health experts, would prefer using CuSum and EWMA charts to learning complicated new monitoring methods, or treating the system as a black box.

Our research group, together with Howard Burkom and Sean Murphy from Johns Hopkins APL, has been focusing on this goal. I plan to post a few papers on some results soon. I'd be interested in hearing about similar efforts.

IV. References

Shmueli G. and Fienberg, S. E. (2006), Current and Potential Statistical Methods for Monitoring Multiple Data Streams for Bio-Surveillance, in Statistical Methods in Counter-Terrorism, Eds: A Wilson and D Olwell, Springer, to appear.

Sunday, December 18, 2005

Outbreak Simulation

I. Relevance

In the absence of syndromic data that include a bioterrorist-related disease outbreak, it is hard to evaluate the detection ability of different algorithms. This is on top of the complication of having natural disease outbreaks in the data, which are not easily labeled (when exactly was the last flu season in a certain geographical location?).

One approach has been to try and simulate signatures of such attacks and inject them into real, but attack-less data. A second approach has been to simulate the attack-less data as well. Yet, a different approach has been to try and model the consequences of a bio-agent release using meteorological and atmospheric models and use those to simulate an attack.

In this posting we concentrate on temporal data streams and algorithms. However, similar issues arise in spatial and spatio-temporal data and approaches.

II. Examples
In Goldenberg et al. (2002) we injected linearly increasing outbreaks into cough medication sales, over a 3-day period. We tried different slopes and different magnitudes.

Stoto et al. (2004) “seeded” real ER data with a “fast-outbreak”, which was constructed as a 3-day linear increase in cases (adding 3, 6, and 9 cases on the first, second, and third day, respectively), or a “slow-outbreak” constructed as a 9-day step function (adding 1,1,1,2,2,3,3,3 cases to each of the first through 9th days).

Burkom et al. (2005) simulated background (=attack-less) counts from a Poisson distribution. They then injected counts drawn randomly from a lognormal distribution, based on the lognormal distribution of incubation periods of infectious diseases.
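The seeding step itself is simple. Here is a minimal Python sketch using the “fast-outbreak” signature of 3, 6, and 9 extra cases described above. The background counts and the start day are made up, and the function name inject is my own.

```python
def inject(background, signature, start):
    """Return a copy of the background series with the outbreak signature
    added day by day, beginning at index start."""
    seeded = list(background)
    for offset, extra in enumerate(signature):
        if start + offset < len(seeded):
            seeded[start + offset] += extra
    return seeded

# Hypothetical attack-free daily ER counts.
background = [20, 22, 19, 21, 20, 23, 18, 21, 20, 22]

# "Fast-outbreak": a 3-day linear increase of 3, 6, and 9 additional cases.
fast_outbreak = [3, 6, 9]

seeded = inject(background, fast_outbreak, start=4)
print(seeded)  # [20, 22, 19, 21, 23, 29, 27, 21, 20, 22]
```

The original series is left untouched, so the same background can be seeded repeatedly with different signatures or start days.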

III. Determining the outbreak structure
The main issue is that we do not really know what the signature of a disease outbreak caused by a bioterrorist attack would look like in medication sales, ER admissions, etc. In particular:
- Different types of outbreaks can lead to different signatures
- Different data streams might have different “reactions” to outbreaks

What we do have is some knowledge on disease progression. The lognormal curve derives from such information. But what does it measure when it comes to syndromic data streams? According to Burkom et al. (2005),

“The incubation period distribution was used to estimate the idealized curve for the expected number of new symptomatic cases on each outbreak day. The lognormal parameters were chosen to give a median incubation period of 3.5 days, consistent with the symptomatology of known weaponized diseases and a temporal case dispersion consistent with previously observed outbreaks”
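To see what such an idealized curve looks like, here is a rough Python sketch. Only the median of 3.5 days comes from the quote; the dispersion parameter sigma, the total number of cases, and the 14-day horizon are my own assumptions, and this is not the code used by Burkom et al.

```python
import math

def lognormal_cdf(x, mu, sigma):
    """Cumulative distribution function of a lognormal distribution."""
    if x <= 0:
        return 0.0
    z = (math.log(x) - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))

def expected_daily_cases(total_cases, median_days, sigma, n_days):
    """Expected number of new symptomatic cases on each outbreak day,
    when incubation periods are lognormal with the given median."""
    mu = math.log(median_days)  # the median of a lognormal is exp(mu)
    return [total_cases * (lognormal_cdf(d, mu, sigma) - lognormal_cdf(d - 1, mu, sigma))
            for d in range(1, n_days + 1)]

# Assumed values: 100 infected people, sigma = 0.5, two-week horizon.
curve = expected_daily_cases(total_cases=100, median_days=3.5, sigma=0.5, n_days=14)
for day, cases in enumerate(curve, start=1):
    print(day, round(cases, 1))
```

With these assumed values the expected counts rise quickly, peak on the third day, and then tail off slowly. Counts for a simulated outbreak could then be drawn around this curve.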

The question is what disease progression can tell us about how the outbreak will manifest itself in pre-diagnostic data. There will clearly be large effects of media, word-of-mouth, and mass psychology. Can these be integrated to some degree?

Another approach has been to model behavior at the individual level. Wong et al. (2005) consider the fact that “the majority of the background knowledge of the characteristics of respiratory anthrax disease is at an individual rather than a population level.” They therefore build a model based on “person-level” activity for detecting infectious but noncontagious diseases such as Anthrax.

IV. Implications
Given that outbreak simulation is used for the purpose of evaluating the performance of detection algorithms, the main issue with simulating a pre-defined outbreak shape is that we can design the most efficient monitoring algorithm to detect the particular simulated outbreak shape!!!

For instance, it can be shown that the Shewhart chart is most efficient at detecting a (large) single spike, a CuSum chart at detecting a step function, an EWMA chart at detecting an exponential increase, etc. (Box & Luceño, 1997).
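A toy Python example illustrates the point for two of these charts. It is deliberately artificial: the data are noise-free standardized values, the spike and step sizes are invented, and the chart parameters (a 3-sigma Shewhart limit, CuSum with k = 0.5 and h = 4) are common textbook choices rather than anything taken from the cited book.

```python
def shewhart_first_alarm(z, limit=3.0):
    """Index of the first value above the control limit, or None."""
    for t, x in enumerate(z):
        if x > limit:
            return t
    return None

def cusum_first_alarm(z, k=0.5, h=4.0):
    """Index at which a one-sided upper CuSum first exceeds h, or None."""
    s = 0.0
    for t, x in enumerate(z):
        s = max(0.0, s + x - k)
        if s > h:
            return t
    return None

# Twenty days of standardized values; the outbreak starts on day 5.
spike = [0.0] * 5 + [4.0] + [0.0] * 14   # one large single-day spike
step = [0.0] * 5 + [1.0] * 15            # small sustained shift

print(shewhart_first_alarm(spike), cusum_first_alarm(spike))  # 5 None
print(shewhart_first_alarm(step), cusum_first_alarm(step))    # None 13
```

Each chart catches the shape it is suited to and misses the other one entirely, so the choice of simulated outbreak shape effectively decides which algorithm "wins".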

In the recent Bio-ALIRT competition, a group of medical and epidemiological experts examined the datasets and, by eyeballing and using a CuSum chart, determined when there were outbreaks:
“Using visual and statistical techniques, ODG found evidence of disease outbreaks in the data” (Siegrist & Pavlin, 2004)

The participating groups were then to detect those outbreaks. Clearly, those who used a CuSum or algorithms that mimic human sight were most likely to do “best”.

V. Injecting simulated outbreaks
Another issue with outbreak simulation is how to inject the simulated outbreak into the no-outbreak data. Clearly there are some periods when it is more likely to get detected than others.

In Goldenberg et al. (2002) we injected the simulated outbreak at every point in the series, and then evaluated an overall rate of how many times it was detected (as well as false alarms). A similar approach was taken in Stoto et al. (2004).

In contrast, Burkom et al. (2005) injected the simulated outbreak at a randomly chosen start day (recall that their background data is in itself Poisson-simulated data).
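The "inject at every point" evaluation can be sketched in a few lines of Python. This is not the code from either paper: the background counts are made up, the detector is a deliberately naive fixed threshold, and the signature is the 3, 6, 9 linear increase mentioned earlier.

```python
def inject(background, signature, start):
    """Copy of the background with the signature added from index start."""
    seeded = list(background)
    for offset, extra in enumerate(signature):
        if start + offset < len(seeded):
            seeded[start + offset] += extra
    return seeded

def detected(series, start, length, threshold):
    """True if any day within the outbreak window exceeds the threshold."""
    return any(x > threshold for x in series[start:start + length])

# Hypothetical outbreak-free counts (Mon..Sun) for four weeks.
background = [100, 102, 98, 101, 99, 60, 58] * 4
signature = [3, 6, 9]
threshold = 105  # naive detector: alarm whenever a daily count exceeds 105

# Try every start day for which the whole signature fits in the series.
starts = range(len(background) - len(signature) + 1)
hits = [s for s in starts
        if detected(inject(background, signature, s), s, len(signature), threshold)]

detection_rate = len(hits) / len(starts)
missed_start_days = sorted({s % 7 for s in starts} - {s % 7 for s in hits})

print(round(detection_rate, 3))  # 0.692
print(missed_start_days)         # [3, 4], i.e. outbreaks starting Thursday or Friday
```

Outbreaks that begin late in the week run into the low weekend counts and are never detected by this detector, which is exactly why the injection point matters. Drawing a single random start day, as in Burkom et al. (2005), would amount to sampling one element of starts.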

VI. Some solutions
Since we really do not know the shape or magnitude of an outbreak, one approach is to simulate a range of different outbreaks and then evaluate algorithms over all the different types. This will most likely give preference to algorithms that are not very tightly coupled with a certain outbreak type (e.g., wavelets or other multi-resolution methods).

A practical consideration is to choose an outbreak duration that is not longer than the period within which we would act. For instance, if an anthrax attack is detected more than 3 days after it begins, it is too late. In that sense we can consider just the outbreak signature in its first three days.

VII. References
Box, G. and Luceño, A. (1997). Statistical Control: By Monitoring and Feedback Adjustment. Wiley-Interscience, 1st edition.

Burkom, H, Murphy, S, Coberly, J, and Hurt-Mullen K (2005), Public Health Monitoring Tools for Multiple Data Streams, MMWR 54 (suppl), 55-62.

Goldenberg A, Shmueli G, Caruana RA and Fienberg SE (2002). Early Statistical Detection of Anthrax Outbreaks by Tracking Over-the-Counter Medication Sales. PNAS, 99 (8), 5237-5240.

Siegrist, D and Pavlin, J (2004), Bio-ALIRT Biosurveillance Detection Algorithm Evaluation, MMWR 53 (suppl), 152-158.

Stoto MA, Schonlau M, Mariano LT (2004). Syndromic Surveillance: Is it Worth the Effort? Chance, 17 (1), 19-24.

Wong, W-K, Cooper, G, Dash, D, Levander, J., Dowling, J, Hogan, W, and Wagner M (2005), Use of Multiple Data Streams to Conduct Bayesian Biologic Surveillance, MMWR 54 (suppl), 63-69.

Thursday, November 10, 2005

Multiple Testing

The goal of this post is to generate reactions on how we should tackle the multiple testing that takes place in biosurveillance systems. The notes are based on personal opinions and questions and on conversations with Dr. Howard Burkom from the Johns Hopkins Applied Physics Lab.

I. RELEVANCE
Multiplicity occurs at multiple levels within a biosurveillance system:

Regional level -- when monitoring multiple regions. This is also true within a region, where we are monitoring multiple locations (e.g., hospitals, offices, stores).
Source level -- within a region we are monitoring multiple sources (OTC, ER…).
Series level -- within each data source we are monitoring multiple series. Sometimes multiple series are created from a single series by stratifying the data by age group or gender.
Algorithm level -- within a single series, using multiple algorithms (e.g., for detecting changes in different parameters) or even a method such as wavelets that breaks down a single series into multiple series.

The multiplicity actually plays a slightly different role in each case, because we have different numbers and sets of hypotheses.

Howard Burkom et al. (2005) coin the terms “parallel monitoring” vs. “consensus monitoring” to distinguish between the case of multiple hypotheses being tested simultaneously by multiple independent data streams (“parallel”) and the case of monitoring multiple data sources for testing a single hypothesis (“consensus”). According to this distinction we have parallel monitoring at the regional level, but consensus monitoring at the source-, series-, and algorithm-level.

Are the two types of multiplicity conceptually different? Should the multiple results (e.g., p-values) be combined in the same way?

II. HYPOTHESES
Regional level -- Each region has a separate null hypothesis. For region i we have

H0: no outbreak in region i
H1: outbreak in region i

Therefore we have multiple sets of hypotheses.

If we consider singular bioterrorist attacks, then these sets (and the tests) are independent.
If we expect a coordinated terrorist attack at multiple locations simultaneously, then there is positive dependence. For an epidemic, there is positive dependence as well.

Source level -- even if we limit ourselves to a certain geographic location and, say, a single zipcode, then we have multiple data sources. In this case we have a single conceptual null hypothesis for all sources:

H0: no outbreak in this region
H1: outbreak in this region

However, we should really treat the outbreak occurrence as a hidden event. We are using the syndromic data as a proxy for measuring the hidden Bernoulli variable, and in fact testing source-specific hypotheses:

H0: no (outbreak-related) increase in OTC sales
H1: (outbreak-related) increase in OTC sales

When we test this we ignore the (outbreak-related) part. We only search for increases in OTC sales or ER admissions, etc. and try to eliminate as many external factors as possible (promotions, day of week, etc.) To see the added level of uncertainty in using proxy information, consider the following diagram:

When there is no outbreak we might still be getting false alarms that we would not have received had we been measuring a direct manifestation of the outbreak (such as laboratory results). For example, a new pharmacy opening up in a monitored chain would show an increase in medication sales, which might cause an alert. So we should expect a much higher false alarm rate.

On the other hand, in the presence of an outbreak we are likely to miss it if it does not get manifested in the data.

So the underlying assumptions when monitoring syndromic data are
(1) The probability of outbreak-related anomalies manifesting themselves in the data is high (removing red nodes from tree)
(2) The probability of an alarm due to non-outbreak reasons is minimal (removing blue nodes from tree)

Based on these two assumptions, most algorithms are designed to test:

H0: No change in parameter of syndromic data
H1: change in parameter of syndromic data

With respect to assumption (2), it has been noted that there are many cases where non-outbreak reasons lead to alarms. Thus, controlling tightly for those is important. Alternatively, the false alarm rate (and correct detection rate) should be adjusted to account for these additional factors. These same issues are true for the series- and algorithm-level monitoring.

Series level -- multiple series within a data source are usually collected for monitoring different syndromes. For instance: cough medication/cc, fever medication/cc, etc. This is also how CDC is thinking about the multiple series, by grouping ICD-9 codes into (11) categories by symptoms. If we treat each symptom separately then we have 11 tests going on.

H0: no increase in syndrome j
H1: increase in syndrome j

Ideally, we’d look at the specific combination of symptoms (i.e. a syndrome) that increase to better understand which disease is being spread. Also, we believe that an outbreak will lead to an increase in multiple symptoms. So these hypotheses are really related. Again, the conditional part comes in (whether there is an outbreak + the additional level of uncertainty on if/how the syndromic data will show this)

Algorithm level -- multiple algorithms running on the same series might be used because each algorithm is looking for a different type of signal. This would give a single H0 but multiple H1:

H0: Series mean is same
H1: change in series mean of type k

Or, if we have algorithms that monitor different parameters (CuSum for the mean and an F-statistic for the variance), then we also have a multiplicity of H0. Finally, algorithms such as wavelet-based MS SPC break down the series into multiple resolutions. Then, if we test at each resolution we have a multiplicity. Whether the resolutions are correlated or not depends on the algorithm (e.g., using downsampling or not).

III. HANDLING MULTIPLE TESTING STATISTICALLY

There are several methods for handling multiple testing, ranging from Bonferroni-type corrections to Benjamini and Hochberg’s False Discovery Rate (and its variants). There are also Bayesian methods aimed at tackling this problem. Each of the methods has its limitations: Bonferroni is considered over-conservative; FDR corrections depend on the number of hypotheses and are problematic with too few hypotheses; and Bayesian methods are sensitive to the choice of prior, and it is unclear how to choose one. Howard Burkom et al. (2005) consider these methods for correcting for "parallel monitoring" (multiple hypotheses with independent data streams). For "consensus monitoring" they consider a different set of methods for combining the multiple p-values, from the world of clinical trials. These include Fisher’s statistic and Edgington’s method (which can be approximated by a normal distribution for a large number of tests). Burkom et al. (2005) discuss the advantages and disadvantages of these two methods in the context of the ESSENCE surveillance system.
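For readers who want to experiment, here is a compact Python sketch of the textbook versions of these procedures. It is not the ESSENCE implementation. The p-values are invented, all tests are assumed independent, and for Edgington's method I use the normal approximation to the sum of p-values (under the null hypothesis the sum of n uniform p-values has mean n/2 and variance n/12).

```python
import math

def bonferroni(pvalues, alpha=0.05):
    """Flag each hypothesis whose p-value is at most alpha / m."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]

def benjamini_hochberg(pvalues, alpha=0.05):
    """Benjamini-Hochberg step-up procedure controlling the false discovery rate."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0  # largest rank r whose ordered p-value is at most r * alpha / m
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * alpha / m:
            cutoff = rank
    flagged = [False] * m
    for i in order[:cutoff]:
        flagged[i] = True
    return flagged

def fisher_combined(pvalues):
    """Fisher's method: -2 * sum(log p) is chi-square with 2k degrees of freedom."""
    k = len(pvalues)
    half = -sum(math.log(p) for p in pvalues)  # half of the Fisher statistic
    term, total = 1.0, 1.0
    for i in range(1, k):                      # closed-form tail for even df
        term *= half / i
        total += term
    return math.exp(-half) * total

def edgington_normal(pvalues):
    """Edgington's sum-of-p-values method, normal approximation (assumes large n)."""
    n = len(pvalues)
    z = (sum(pvalues) - n / 2.0) / math.sqrt(n / 12.0)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# "Parallel monitoring": one hypothetical p-value per region.
regional = [0.001, 0.012, 0.02, 0.035, 0.30]
print(bonferroni(regional))          # [True, False, False, False, False]
print(benjamini_hochberg(regional))  # [True, True, True, True, False]

# "Consensus monitoring": hypothetical p-values from several sources in one region.
sources = [0.05, 0.05]
print(round(fisher_combined(sources), 4))  # 0.0175
```

The example shows how differently the two corrections behave on the same five regions, and how two borderline sources combine into stronger evidence under Fisher's method.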

But should we really use different methods for accounting for the multiplicity? What is the link between the actual corrections and the conceptual differences?

IV. WHAT CAN BE DONE?

- Rate the quality of data sources: signaling by more reliable sources should be more heavily weighted.
- Evaluate the risk level of the different regions: alarms in higher-risk regions should be taken more seriously (like Vicky Bier’s arguments about investment in higher-risk cases).
- “The more the merrier” is probably not a good strategy when it comes to the number of data sources. It is better to invest in a few reliable data sources than in multiple less-reliable ones. Along the same lines, the series chosen to be monitored should also be carefully screened according to their real contribution and their reliability. With respect to regions, it is better to monitor riskier regions (in the context of bioterrorist attacks or epidemics).
- Solutions should depend on who the monitoring body is: national surveillance systems (e.g., CDC's BioSense) face more of the regional issue than local systems do.
- The choice of symptom grouping and syndrome definitions, which is currently based on medical considerations (http://www.bt.cdc.gov/surveillance/syndromedef/index.asp), would benefit from incorporating statistical considerations.

V. REFERENCES

Burkom, H. S, Murphy, S., Coberly, J., and Hurt-Mullen, K. “Public Health Monitoring Tools for Multiple Data Streams”, MMWR, Aug 26, 2005 / 54(Suppl);55-62.

Marshall, C., Best, N., Bottle, A., and Aylin, P. “Statistical Issues in the Prospective Monitoring of Health Outcomes Across Multiple Sources”, JRSS A, 2004, vol 167 (3), pp. 541-559.