Tuesday, April 18, 2006

Handling Holidays

When dealing with daily syndromic data, a major source of variability is attributable to holidays. Counts of hospital visits on holidays are usually lower, and the same is true for medication and grocery sales. Some of this is attributable to holiday hours and staffing at shops, workplaces, etc. Also, individuals probably tend to avoid shopping on some holidays (whereas on other holidays shopping is extremely popular). In data such as school absences, holiday data are missing altogether.

In the context of outbreak detection, holidays are usually bad news. They can affect the performance of monitoring methods and lead to false alarms. The good news is that we know exactly when holidays take place.

From the algorithm design standpoint, the question is what to do about holiday data. Should they be removed, imputed, or ignored? Should this be done before the algorithm is applied to the data, or should the algorithm have a built-in solution?

A discussion that we've had in our research group raises the following issues:

- If we are trying to precondition the data in order to remove seasonality and temporal trends, some methods are more sensitive to the presence of holiday data than others. For example, 7-day differencing will only be affected by holidays locally: the holiday itself will have an extremely low 7-day difference, and the same weekday one week later will have an extremely high 7-day difference. In contrast, a regression model with day-of-week predictors might be heavily influenced by holidays, because the low holiday counts enter the estimate for the weekday on which they fall.

- What metrics are affected by holiday data? (e.g., local/global averages, autocorrelation)

- How should holiday data be imputed? (The answer, of course, depends on the goal of the imputation.)
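
To make the first and last points concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the counts are synthetic and noise-free, the holiday position is made up, and the imputation rule (average of the same weekday one week before and one week after) is just one simple option, not a recommendation. It shows how a single holiday produces a low and then a high 7-day difference, and how imputing the holiday before differencing removes both.

```python
def seven_day_diff(counts):
    """Return y[t] - y[t-7]; the first seven entries are None."""
    return [None] * 7 + [counts[t] - counts[t - 7]
                         for t in range(7, len(counts))]


def impute_holidays(counts, is_holiday):
    """Replace each holiday count with the mean of the same weekday
    one week earlier and one week later (skipping neighbours that are
    missing or are holidays themselves). Hypothetical rule."""
    imputed = list(counts)
    for t, flag in enumerate(is_holiday):
        if not flag:
            continue
        neighbours = [counts[s] for s in (t - 7, t + 7)
                      if 0 <= s < len(counts) and not is_holiday[s]]
        if neighbours:  # leave the value untouched if no neighbour exists
            imputed[t] = sum(neighbours) / len(neighbours)
    return imputed


# Synthetic data: four weeks with a fixed day-of-week pattern (Mon..Sun).
weekly_pattern = [100, 95, 90, 92, 98, 60, 55]
counts = [weekly_pattern[i % 7] for i in range(28)]

# Assumed holiday: Thursday of the second week (index 10), with a low count.
holiday_index = 10
is_holiday = [i == holiday_index for i in range(28)]
counts[holiday_index] = 40

raw_diff = seven_day_diff(counts)
clean_diff = seven_day_diff(impute_holidays(counts, is_holiday))

print(raw_diff[10], raw_diff[17])      # -52 52: dip, then rebound a week later
print(clean_diff[10], clean_diff[17])  # 0.0 0.0 after imputation
```

With real data the differences on ordinary days would not be exactly zero, so the question becomes whether the holiday-induced pair of spikes stands out enough to trigger a false alarm.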

I hope we can have a discussion here about different approaches to handling holiday data.