By DataKind San Francisco
After the Tōhoku earthquake and tsunami devastated eastern Japan and caused the subsequent Fukushima Daiichi nuclear meltdown in 2011, the public lacked accurate and trustworthy radiation information. Safecast was formed in response, aiming to democratize data access, and grew quickly in size, scope, and geographic reach as its volunteers began monitoring, collecting, and sharing information on environmental radiation. The nonprofit organization has since supported the citizen-led collection of radiation and air quality data in 102 countries. Safecast has empowered around 5,000 volunteers worldwide who independently perform environmental monitoring with Safecast’s open hardware and data, with 5,000 devices deployed and more than 159 million environmental readings published into the public domain to date.
With an innovative model of citizen-led deployments spanning the globe, Safecast has multiple device types that measure radiation levels and air quality. The data collected by these sensors sometimes contain anomalies, which may be caused by environmental conditions, human error, hardware issues, firmware issues, or data transmission issues. Safecast and DataKind San Francisco partnered to detect these anomalies, allowing the team to quickly determine which sensors could be malfunctioning and may need to be repaired/replaced. This faster response improves the accuracy and reliability of measurements and reduces the need for Safecast volunteers to manually monitor incoming data.
In this blog post, we describe how we analyzed the time series air-quality data using both rule-based and statistical anomaly detection. Over the second half of 2020, this approach helped identify two critical bugs in how sensors tracked time. We also developed open-source code that allows this monitoring to continue, hoping to provide Safecast volunteers all over the world with another tool to create valuable and independent data.
Data and approach overview
We analyzed historical time series of radiation and air-quality data and developed strategies to identify anomalous events from the Solarcast devices. Solarcast devices are static panels with 36 deployments around the world. Many of the devices have been collecting data since 2017. These GPS-enabled devices are made up of two radiation sensors, three air-quality sensors, and sensors for measuring relative humidity and panel temperature. They log data at different intervals; a representative frequency of logging would be around 15 minutes.
Images above courtesy of Safecast.
We noted that the data itself lacked a definition of what qualifies as an anomalous event; i.e., the ground-truth labels for anomalies were absent. Therefore, we needed to design an approach that would work in this unsupervised setting. Because it’s more important to detect faulty sensors than to avoid triggering a false alert, we wanted to take a more exhaustive approach to identifying anomalies, prioritizing recall over precision. For that reason, we took two complementary anomaly detection approaches: the first based on simpler rules to clean the data and the second applying more complex statistical filters.
Rule-based anomaly detection
Our rule-based anomaly detection screened the data for the following qualities:
- Type consistency: The format of the various entries wasn’t consistent in the dataset. For instance, with both air quality and radiation data, the same column might have both string and float entries. We standardized the type.
- Range validity: Environmental humidity, recorded as relative humidity (RH) here, is by definition bounded between 0 and 100. Sensor issues sometimes caused RH values to be out of range. We replaced those out-of-range values with null values and flagged such events.
- Timestamp accuracy: Some devices had multiple different records that all shared the same timestamp, caused by data transmission and firmware issues (see here). Since there’s no way to tell which of these records, if any, is correct, they create ambiguity and so are worth flagging.
- Consistency between measurements: The Air Quality Indices (AQI) as measured by the sensors in the Solarcast panels correspond to PM1.0, PM2.5, and PM10.0 measurements. PM1.0 is proportional to the concentration of particulate matter that’s 1.0 micrometer or less in size, and PM2.5 and PM10.0 follow the same convention. As defined, for the same reading, the measurements must be such that PM1.0<PM2.5<PM10.0. We identified measurements in the dataset that didn’t follow this constraint and flagged them as well.
After running the dataset through rule-based anomaly detection and correcting those values, we now had a clean dataset to use for statistical anomaly detection.
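The four rule-based checks above can be sketched with pandas roughly as follows. This is a minimal illustration, not the project's actual script: the column names (`device`, `timestamp`, `rh`, `pm1_0`, `pm2_5`, `pm10_0`) and the function name are hypothetical placeholders rather than the real Safecast schema.

```python
import pandas as pd


def rule_based_flags(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy of df with one boolean flag column per rule.

    All column names used here are hypothetical placeholders.
    """
    out = df.copy()

    # Type consistency: coerce mixed string/float entries to floats.
    # Entries that cannot be parsed as numbers become NaN.
    for col in ["rh", "pm1_0", "pm2_5", "pm10_0"]:
        out[col] = pd.to_numeric(out[col], errors="coerce").astype(float)

    # Range validity: relative humidity must lie in [0, 100].
    # Out-of-range values are flagged and replaced with nulls.
    out["flag_rh_range"] = (out["rh"] < 0) | (out["rh"] > 100)
    out["rh"] = out["rh"].mask(out["flag_rh_range"])

    # Timestamp accuracy: flag every record that shares a
    # (device, timestamp) pair with another record.
    out["flag_dup_timestamp"] = out.duplicated(
        subset=["device", "timestamp"], keep=False
    )

    # Consistency between measurements: flag rows that break the strict
    # ordering PM1.0 < PM2.5 < PM10.0 stated in the text. Rows with
    # missing values compare as False and are therefore not flagged here.
    out["flag_pm_order"] = (out["pm1_0"] >= out["pm2_5"]) | (
        out["pm2_5"] >= out["pm10_0"]
    )
    return out
```

Keeping each rule in its own flag column makes it possible to see which rule fired for a given record instead of collapsing everything into a single "anomalous" label.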
Statistical anomaly detection
Our statistical anomaly detection screened the data for slightly more complex qualities, and each requires input hyperparameter(s) from stakeholders to tune the results:
- Night vs. day readings: Based on the data, we expected better air quality readings during the nighttime than during the preceding daytime. This isn’t a hard-bound rule of course. Nevertheless, if the nighttime median AQ is observed to be greater than the preceding day’s median AQ, the sensor could be faulty or could be detecting a significant environmental trend, either of which merits an alert. We developed this alert to be configurable by changing the hours considered nighttime vs. daytime and by specifying a minimum number of data points required in each time period to trigger the alert.
- Time lag between consecutive measurements: We flagged events for which the next measurement is available only after a specified number of days (default configuration set to two days) as anomalous. A large gap could hint at which sensors might be failing since they aren’t logging any data.
- Standard deviations from a rolling median: To find outliers in the traditional statistical sense, we first needed to identify a “normal” trend in the data. To capture this normal, we took the rolling median of the time series (using the median instead of the mean to minimize the effect of outliers). An event that doesn’t lie within a specified number of standard deviations of the rolling median is tagged as an anomaly.
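As a rough illustration of the first two statistical checks, the sketch below flags large gaps between consecutive readings and compares each night's median against the preceding day's median. The function names, the default night window (21:00 to 06:00), and the default minimum of 10 points are assumptions made for this example, not the project's actual configuration. The night window is assumed to wrap past midnight, and timestamps are assumed to be in the device's local time.

```python
import pandas as pd


def flag_time_gaps(timestamps: pd.Series, max_gap_days: float = 2.0) -> pd.Series:
    """Flag readings whose next reading arrives more than max_gap_days later."""
    ts = pd.to_datetime(timestamps).sort_values()
    gap_to_next = ts.shift(-1) - ts  # the last reading has no successor (NaT)
    return gap_to_next > pd.Timedelta(days=max_gap_days)


def night_vs_day_alerts(
    series: pd.Series,
    night_start: int = 21,  # assumed default, hour of day
    night_end: int = 6,     # assumed default, hour of day
    min_points: int = 10,   # assumed default
) -> pd.DataFrame:
    """Compare each night's median with the preceding day's median.

    `series` holds one air-quality measurement indexed by a DatetimeIndex.
    """
    hours = series.index.hour
    is_night = (hours >= night_start) | (hours < night_end)

    # Shift timestamps back by night_end hours so that readings taken after
    # midnight are grouped with the calendar date of the preceding day.
    date_key = (series.index - pd.Timedelta(hours=night_end)).normalize()

    df = pd.DataFrame(
        {"value": series.to_numpy(), "date": date_key, "night": is_night}
    )
    day = df[~df["night"]].groupby("date")["value"].agg(["median", "count"])
    night = df[df["night"]].groupby("date")["value"].agg(["median", "count"])
    both = day.join(night, lsuffix="_day", rsuffix="_night", how="inner")

    # Only alert when both periods contain enough data points.
    enough = (both["count_day"] >= min_points) & (both["count_night"] >= min_points)
    both["alert"] = enough & (both["median_night"] > both["median_day"])
    return both
```

Both functions expose their thresholds as arguments, mirroring the idea that stakeholders tune these hyperparameters rather than relying on fixed values.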
In the figure above, radiation data from one of the sensors is visualized using the interactive holoviews Python library along with the rolling median and +/- 3 standard deviation bounds. The 3 standard-deviation bounds capture the range of measurements for the most part. The outliers are then tagged as anomalies. The window used to calculate the rolling median and the number of standard deviations in the bound are both hyperparameters and therefore configurable.
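A minimal pandas sketch of this rolling-median filter is shown below. It assumes the standard deviation is computed over the same rolling window as the median, which the post does not specify; the function name and the defaults (a centered window of 96 readings, roughly one day at a 15-minute cadence) are likewise illustrative assumptions.

```python
import pandas as pd


def rolling_median_outliers(
    series: pd.Series,
    window: int = 96,       # assumed default: about one day of 15-minute readings
    n_std: float = 3.0,     # number of standard deviations in the bound
    min_periods: int = 10,  # assumed default
) -> pd.DataFrame:
    """Tag points outside rolling median +/- n_std * rolling std."""
    roll = series.rolling(window, center=True, min_periods=min_periods)
    median = roll.median()
    std = roll.std()  # assumption: std uses the same window as the median

    lower = median - n_std * std
    upper = median + n_std * std
    return pd.DataFrame(
        {
            "value": series,
            "rolling_median": median,
            "lower": lower,
            "upper": upper,
            # Comparisons against NaN bounds are False, so points without
            # enough surrounding data are never flagged.
            "is_anomaly": (series < lower) | (series > upper),
        }
    )
```

Note that an outlier inflates the standard deviation of its own window, so very short windows can hide isolated spikes; a window of at least a few dozen readings avoids this.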
Monitoring moving forward
Beyond the findings of the project, we wanted to set up a way to continue monitoring these anomalies moving forward. We packaged the code for both rule-based and statistical approaches into separate Python scripts, documented and open-sourced here along with instructions for reproducible software setup. This package takes the project from a one-time analysis to a safeguard that can continuously review new data for anomalies.
The two approaches, particularly the statistical one, can identify many anomalies of varying severity. The final output of the Python scripts helps identify which anomalies are the most severe by assigning each anomaly a 1-10 rating, normalized against historical data, and writing the results to a .csv file.
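The post does not say how the 1-10 rating is computed. One plausible way to normalize against historical data, shown here purely as a hypothetical sketch, is to rank an anomaly's score (for example, its distance from the rolling median) against the distribution of past scores and map the percentile onto a 1-10 scale. The function name and the percentile-based scheme are assumptions, not the project's method.

```python
import numpy as np


def severity_rating(score: float, historical_scores) -> int:
    """Map an anomaly score to a 1-10 rating by historical percentile.

    Hypothetical scheme: the rating is the decile of `score` within
    `historical_scores`, clipped to the range 1..10.
    """
    hist = np.sort(np.asarray(historical_scores, dtype=float))
    # Fraction of historical scores that are less than or equal to score.
    pct = np.searchsorted(hist, score, side="right") / len(hist)
    return int(np.clip(np.ceil(pct * 10), 1, 10))
```

With this scheme, a score larger than every historical score receives a 10, and a score at or below the lowest tenth of past scores receives a 1.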
Thanks to the Safecast team, Angela Eaton, Mat Schaffer, and Rob Oudendijk, for their technical expertise, engaged partnership, and commitment to open data for good. At DataKind San Francisco, we appreciate the opportunity to work together in support of citizen science, and we’re grateful for the efforts of Edwin Zhang and other DataKind San Francisco volunteers on this project.