Multiple Testing
I. RELEVANCE
Multiplicity occurs at multiple levels within a biosurveillance system:
- Regional level-- When monitoring multiple regions. This is also true within a region, where we are monitoring multiple locations (e.g., hospitals, offices, stores)
- Source level-- Within a region we are monitoring multiple sources (OTC, ER…)
- Series level -- within each data source we are monitoring multiple series. Sometimes multiple series are created from a single series by stratifying the data by age group or gender.
- Algorithm level -- Within a single series, using multiple algorithms (e.g., for detecting changes in different parameters) or even a method such as wavelets that breaks down a single series into multiple series
The multiplicity actually plays a slightly different role in each case, because we have different numbers and sets of hypotheses.
Howard Burkom et al. (2005) coin the terms “parallel monitoring” vs. “consensus monitoring” to distinguish between the case of multiple hypotheses being tested simultaneously by multiple independent data (“parallel”) and the case of monitoring multiple data sources for testing a single hypothesis (“consensus”). According to this distinction we have parallel testing at the regional level, but consensus monitoring at the source-, series-, and algorithm-level.
Are the two types of multiplicity conceptually different? Should the multiple results (e.g., p-values) be combined in the same way?
II. HYPOTHESES
Regional level -- Each region has a separate null hypothesis. For region i we have
H1: outbreak in region i
If we expect a coordinated terrorist attack at multiple locations simultaneously, then there is positive dependence. For an epidemic, there is positive dependence
H1: outbreak in this region
H1: (outbreak-related) increase in OTC sales

On the other hand, in the presence of an outbreak we are likely to miss it if it does not get manifested in the data.
So the underlying assumptions when monitoring syndromic data are
(1) The probability of outbreak-related anomalies manifesting themselves in the data is high (removing red nodes from tree)
(2) The probability of an alarm due to non-outbreak reasons is minimal (removing blue nodes from tree)
Based on these two assumptions, most algorithms are designed to test:
H1: change in parameter of syndromic data
Series level – multiple series within a data source are usually collected for monitoring different syndromes. For instance: cough medication/cc, fever medication/cc, etc. This is also how CDC is thinking about the multiple series, by grouping ICD-9 codes into (11) categories by symptoms. If we treat each symptom separately then we have 11 tests going on.
H1: increase in syndrome j
Algorithm level – multiple algorithms running on a same series might be used because each algorithm is looking for a different type of signal. This would give a single H0 but multiple H1
H1: change in series mean of type k
III. HANDLING MULTIPLE TESTING STATISTICALLY
But should we really use different methods for accounting for the multiplicity? What is the link between the actual corrections and the conceptual differences?
IV. WHAT CAN BE DONE?
- Rate the quality of data sources: signaling by more reliable sources should be more heavily weighted
- Evaluate the risk level of the different regions: alarms in higher-risk regions should be taken more seriously (like Vicky Bier’s arguments about investment in higher-risk cases)
- “The more the merrier” is probably not a good strategy when concerning the number of data sources. It is better to invest in few reliable data sources than in multiple less-reliable ones. Along the same lines, the series chosen to be monitored should also be carefully screened according to their real contribution and their reliability. With respect to regions, better to monitor more risky regions (in the context of bioterrorist attacks or epidemics).
- Solutions should depend on who the monitoring body is: national surveillance systems (e.g., the BioSense by CDC) have more of the regional issue than local systems.
- The choice of symptom grouping and syndrome definitions, which is currently based on medical considerations (http://www.bt.cdc.gov/surveillance/syndromedef/index.asp), would benefit from incorporating statistical considerations.
V. REFERENCES
Burkom, H. S, Murphy, S., Coberly, J., and Hurt-Mullen, K. “Public Health Monitoring Tools for Multiple Data Streams”, MMWR, Aug 26, 2005 / 54(Suppl);55-62.
Marshall, C., Best Nicky, Bottle, A., and Aylin, P. “Statistical Issues in the Prospective Monitoring of Health Outcomes Across Multiple Sources”, JRSS A, 2004, vol 167 (3), pp. 541-559.