For an introduction to flu biosurveillance in the US, see:
How Does ILI-Nearby Work?
We use sensor fusion theory to optimally combine, in real time, diverse sources of information about the prevalence of Influenza-Like-Illness (ILI) in various geographic areas of the US. These sources, enumerated below, vary in their geographic granularity (metro areas, US states, multi-state regions, entire US), their temporal granularity (daily or weekly), their noise (error) characteristics, and their correlation with one another.
We estimate a geographically detailed joint distribution of ILI prevalence over the entire US, which we then integrate over to derive real-time nowcasts for any desired geographic area. This means that even an information source that is available only at a national or regional level contributes indirectly to the accuracy of nowcasting a state. All information sources benefit all of our estimates, but to different degrees in different circumstances. (To see how the information sources contribute to nowcasting for a particular geographic area, hover over the icons representing these sources.) Our combined estimate is significantly more accurate than estimates based on any one of the sources alone.
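As a rough illustration of how an estimate for a larger area can be derived from a finer-grained joint distribution, the sketch below collapses state-level values into one regional value. It assumes the joint distribution is summarized by a mean vector and a covariance matrix, and that states are combined by population weights. Both are assumptions made for this example, as are the function name `aggregate` and all numbers; this is not a description of ILI-Nearby's actual computation.

```python
def aggregate(means, cov, weights):
    """Mean and variance of a weighted average of jointly distributed values.

    means   -- per-state estimates (hypothetical)
    cov     -- covariance matrix of the per-state estimation errors
    weights -- non-negative weights, e.g. populations (an assumption)
    """
    total = sum(weights)
    w = [x / total for x in weights]  # normalize so the weights sum to 1
    n = len(w)
    mean = sum(w[i] * means[i] for i in range(n))
    # Variance of a linear combination: w^T * cov * w
    var = sum(w[i] * cov[i][j] * w[j] for i in range(n) for j in range(n))
    return mean, var


# Two hypothetical states; the off-diagonal term says their errors are correlated.
state_means = [2.0, 4.0]
state_cov = [[0.04, 0.01],
             [0.01, 0.09]]
populations = [3.0, 1.0]

region_mean, region_var = aggregate(state_means, state_cov, populations)
print(region_mean, region_var)
```

The off-diagonal covariance terms matter here: ignoring them would misstate the uncertainty of the regional value whenever the state-level errors move together.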
We continuously monitor the correlations of the information sources and use them to estimate (again in real time) a covariance matrix. This helps us avoid over- or underestimation when the errors of two or more sources are correlated.
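To see why the covariance matrix matters, consider the textbook minimum-variance linear combination of several unbiased readings of the same quantity. The sketch below is a generic illustration of that idea, not ILI-Nearby's implementation. It assumes every source measures the same scalar without bias, and the names `solve` and `fuse` and all numbers are invented for the example.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fuse(readings, cov):
    """Combine unbiased readings of one quantity, given their error covariance.

    Returns (estimate, variance, weights), with weights proportional to
    cov^-1 * 1 and normalized to sum to 1.
    """
    w = solve(cov, [1.0] * len(readings))  # cov^-1 * 1
    total = sum(w)                         # 1^T * cov^-1 * 1
    weights = [wi / total for wi in w]
    estimate = sum(wi * zi for wi, zi in zip(weights, readings))
    return estimate, 1.0 / total, weights


# Three hypothetical sources with equal noise; the first two have
# strongly correlated errors (correlation 0.9).
cov = [[1.0, 0.9, 0.0],
       [0.9, 1.0, 0.0],
       [0.0, 0.0, 1.0]]
estimate, variance, weights = fuse([2.0, 2.2, 3.0], cov)
print(weights)
```

In this example the two correlated sources together receive only slightly more weight than the single independent one, because they largely repeat the same information. Treating all three as independent would give each a weight of one third and would overstate the confidence of the combined estimate.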
How Accurate Is ILI-Nearby?
As the gold standard for national and regional weighted ILI, we use the CDC-reported 'final' values, typically defined as the values available at epi-week 30 (late July, well after the end of the flu season). We wait this long because updates continue to arrive many months after each initial report. ILI-Nearby overlays the gold standard (black line) on its own historical nowcasts (which are done “out of sample”, without knowledge of anything that wasn’t known at nowcasting time), from which you can assess its historical accuracy.
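The comparison between historical nowcasts and the gold standard can be summarized with a simple error measure. The sketch below computes the mean absolute error over the weeks for which both a nowcast and a final value exist. The function name, the integer YYYYWW week keys, and all values are hypothetical and chosen only for illustration.

```python
def mean_absolute_error(nowcasts, finals):
    """Average |nowcast - final| over the weeks present in both mappings.

    Both arguments map a week identifier to a weighted-ILI value.
    """
    weeks = sorted(set(nowcasts) & set(finals))
    if not weeks:
        raise ValueError("no overlapping weeks to compare")
    return sum(abs(nowcasts[w] - finals[w]) for w in weeks) / len(weeks)


# Hypothetical values, keyed by week (YYYYWW is an assumed encoding).
nowcasts = {201601: 2.0, 201602: 2.5, 201603: 3.0}
finals = {201601: 2.2, 201602: 2.4, 201603: 3.3, 201604: 3.0}

print(mean_absolute_error(nowcasts, finals))
```

For the comparison to be fair, each nowcast must be the one that was actually produced at the time, using only the data available then.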
In addition, the following visualizations summarize the historical out-of-sample absolute error of ILI-Nearby, alongside that of Delphi’s near-term forecasts (available a few weeks earlier) and CDC’s ILINet initial report and later reports (available a few weeks later). From these visualizations, you can see the tradeoff between accuracy and timeliness offered by Delphi’s nowcasts and forecasts:
As the gold standard for fitting the US states’ %ILI, we use the best stable estimate available from each state. Twenty-five states graciously agreed to share with us the %ILI data they sent to the CDC. Forty-three states post a version of their %ILI data online, which we scrape and combine with the CDC-provided data when both are available. Only five states do not fall into either of these categories, and thus lack a gold standard. However, their nowcasts still benefit from the data of their neighboring states and regions.
CAVEAT: %ILI numbers are not meaningfully comparable across different states or regions. This is because each state's ILINet consists of a different mix of healthcare provider types (hospital emergency departments, pediatric clinics, family practices, student health centers, etc.), and different provider types have different typical rates of seeing ILI patients. We are currently working to derive measures that will be more meaningfully comparable across states and regions.
Information Sources Currently Used by ILI-Nearby
Our method can readily incorporate any additional information sources: Electronic Health Records, relevant retail sales (e.g. thermometers), crowdsourced reports, etc. If you would like to contribute an information source or have an idea for a new potential source, please contact us!
Google Search Query Signal (GHT)
The Google Trends team has graciously granted us access to its research-grade Google Health Trends interface. This allows us to build a specialized query-based signal optimized for ILI nowcasting. GHT data is available from 2003 through the present, at daily (and even finer) temporal resolution, and at national, state, and metro-area geographic resolution.
The Johns Hopkins Social Media and Health Research Group has graciously allowed us access to HealthTweets.org, an ongoing, real-time signal of flu activity based on Twitter content. This signal is available from the end of 2011 through the present, at a daily resolution, and at the level of US states (and other locations).
Wikipedia Article Access Counts (WIKI)
The Wikimedia Foundation makes available the number of visits (“hits”), per hour, to every Wikipedia article. There is a strong correlation between weighted ILI (wILI) in the US and the number of hits to English articles related to influenza. This allowed us to develop an ILI-related signal. We currently have no access to any geographic information associated with Wikipedia access. This reduces the usefulness of the signal, but we are able to compensate somewhat by considering the time of day (or night) of access.
CDC Page Visits (CDCP)
CDC has graciously been sharing with us some of their website page analytics, which allowed us to develop an ILI signal based on the frequency of access to flu-related pages. The signal has been available since 2013, currently at weekly resolution and for each individual state.
Epicast Prediction (EPIC)
Epicast is a ‘wisdom of crowds’ system that aggregates predictions from multiple lay human forecasters. In any given week, it allows anyone to predict the rest of the flu season given the ILINet estimates available to date. Epicast was developed by David Farrow as part of his PhD thesis (ch. 5), and has performed at or near the top in CDC’s “Predict the Flu” challenges in 2014-2015 and 2015-2016. It has been available since 2014 at weekly temporal resolution and at the geographic resolution of HHS regions and the US as a whole.
Seasonal Autoregression (SAR3)
SAR3 is a seasonal auto-regressive model that estimates the final wILI value of the current week based on the current week number and the preliminary (currently available) wILI values of the past three weeks. SAR3 was developed by David Farrow as part of his PhD thesis. SAR3 is available for the same time period and at the same temporal and geographic resolution as wILI: ongoing weekly for the US nationally and for all HHS and census regions.
The Archetype (ARCH)
The Archetype is a nonparametric Kalman Filter-based forecasting system built by David Farrow as part of his PhD thesis (Appendix B). ARCH is available weekly from 2003w40–2016w20 for the US nationally and for all HHS and census regions. However, the ARCH system does not produce forecasts during the off-season, and so no values are reported between epiweeks 21 and 39, inclusive, in any year.
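The availability rule above is easy to get wrong at the boundaries, so here is a minimal sketch of it. It assumes epiweeks are encoded as YYYYWW integers (for example, 2016w20 as 201620). That encoding and the function name `arch_has_value` are assumptions made for this illustration.

```python
ARCH_FIRST = 200340  # 2003w40, first week with ARCH values
ARCH_LAST = 201620   # 2016w20, last week with ARCH values


def arch_has_value(epiweek):
    """Return True if ARCH is expected to report a value for this epiweek.

    epiweek is an integer in the assumed YYYYWW encoding.
    """
    week = epiweek % 100
    in_span = ARCH_FIRST <= epiweek <= ARCH_LAST
    off_season = 21 <= week <= 39  # inclusive on both ends, every year
    return in_span and not off_season


for ew in (200339, 200340, 201020, 201021, 201039, 201040, 201621):
    print(ew, arch_has_value(ew))
```

A consumer that merges ARCH with the other sources would need to treat the off-season weeks as missing rather than as zero.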
For more details about any of these sources, see David Farrow's PhD thesis and subsequent publications.
We are grateful to Google (Google Trends team), Johns Hopkins Social Media and Health Research Group, the Wikimedia Foundation, CDC Influenza Division (Epidemiology and Prevention Branch), and the Health Departments of many US states for giving us access to their data.