Guest blog post by Heinrich von Keler
In this post, I will cover in depth a Big Data use case: monitoring and forecasting air pollution.
A typical Big Data use case in the modern Enterprise includes the collection and storage of sensor data, executing data analytics at scale, generating forecasts, creating visualization portals, and automatically raising alerts in the case of abnormal deviations or threshold breaches.
This article will focus on an implemented use case: monitoring and analyzing air quality sensor data using Axibase Time-Series Database and R Language.
Steps taken by the data science team to execute the use case:
- Collect historical data from AirNow into ATSD
- Stream current data from AirNow into ATSD
- Use R Language to execute data analytics and generate forecasts for all collected entities and metrics
- Create Holt-Winters forecasts in ATSD for all collected entities and metrics
- Build a visualization portal
- Set up alert and notification rules in the ATSD Rule Engine
Hourly readings of several key air quality metrics are generated by over 2,000 monitoring sensor stations located in over 300 cities across the United States. Both the historical and the streaming data are retrieved and stored in ATSD.
The data is provided by AirNow, which is a U.S. government EPA program that protects public health by providing forecast and real-time air quality information.
The two main collected metrics are PM2.5 and Ozone (o3).
PM2.5 refers to particles less than 2.5 micrometers in diameter, often called “fine” particles. These particles are so small they can be detected only with an electron microscope. Sources of fine particles include all types of combustion, including motor vehicles, power plants, residential wood burning, forest fires, agricultural burning, and industrial processes.
o3 (Ozone) occurs naturally in the Earth’s upper atmosphere, 6 to 30 miles above the Earth’s surface, where it forms a protective layer that shields us from the sun’s harmful ultraviolet rays. Man-made chemicals are known to destroy this beneficial ozone. At ground level, by contrast, ozone is a harmful air pollutant, which is why it is monitored.
Other collected metrics are: pm10 (particulate matter up to 10 micrometers in size), co (carbon monoxide), no2 (nitrogen dioxide) and so2 (sulfur dioxide).
Collecting/Streaming the Data
A total of 5 years of historical data has been collected, stored, analyzed and used for forecasting. For the forecasts to be as accurate as possible and to account for trends and seasonal cycles, at least 3 to 5 years of detailed historical data is recommended.
An issue with the accuracy of the data was identified immediately. The data was becoming available with a fluctuating time delay of 1 to 3 hours. To investigate, all values were collected for each metric and entity, which resulted in several data points being recorded for the same metric, entity and time. This led us to believe that there was both a time delay and a stabilization period. Below are the results:
Once available, the data then took another 3 to 12 hours to stabilize, meaning that the values were fluctuating during that time frame for most data points.
As a result of this analysis, it was decided that all data would be collected with a 12-hour delay in order to increase the accuracy of the data and forecasts.
Axibase Collector was used to collect the data from monitoring sensor stations and stream it into Axibase Time-Series Database.
In Axibase Collector, a job was set up to collect data from the air monitoring sensor stations in Fresno, California. For this particular example, Fresno was selected because it is considered one of the most polluted cities in the United States, with air quality warnings often being issued to the public.
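The delay and stabilization analysis described above can be sketched in a few lines. The post does not show the code that was actually used, so this is only an illustration in Python under stated assumptions: each download of a reading is kept as a hypothetical `(observed_at, retrieved_at, value)` tuple for one metric and entity, and a value counts as stable from the earliest download after which it no longer changes.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def availability_and_stabilization(snapshots):
    """For each observation time, return (first_seen_lag, stable_lag).

    snapshots: iterable of (observed_at, retrieved_at, value) tuples for one
    metric and entity (a hypothetical layout, not the AirNow file format).
    """
    versions_by_time = defaultdict(list)
    for observed_at, retrieved_at, value in snapshots:
        versions_by_time[observed_at].append((retrieved_at, value))

    lags = {}
    for observed_at, versions in versions_by_time.items():
        versions.sort()  # order the downloads by retrieval time
        first_seen = versions[0][0]
        final_value = versions[-1][1]
        # Walk backwards to the earliest download that already had the final value.
        stable_at = versions[-1][0]
        for retrieved_at, value in reversed(versions):
            if value != final_value:
                break
            stable_at = retrieved_at
        lags[observed_at] = (first_seen - observed_at, stable_at - observed_at)
    return lags


# Made-up readings for a single hourly sample that was downloaded four times.
observed = datetime(2015, 5, 1, 0, 0)
snapshots = [
    (observed, observed + timedelta(hours=2), 10.0),
    (observed, observed + timedelta(hours=3), 12.0),
    (observed, observed + timedelta(hours=5), 11.5),
    (observed, observed + timedelta(hours=6), 11.5),
]
lags = availability_and_stabilization(snapshots)
print(lags[observed])
```

For this made-up sample the value first appears 2 hours after the observation and stops changing 5 hours after it. Taking the largest stabilization lag across all samples is one way to arrive at a safe collection delay.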
The File Job sets up a cron task that runs at a specified interval to collect the data and batch upload it into ATSD.
The File Forwarding Configuration is a parser configuration for data incoming from an external source. It specifies the path to the external data source and assigns a default entity to the Fresno monitoring sensor station. The start time and end time determine the time frame for retrieving new data (end time syntax is used).
Once these two configurations are saved, the collector starts streaming fresh data into ATSD.
The entities and metrics streamed by the collector into ATSD can be viewed from the UI.
The whole data-set currently has over 87,000,000 records for each metric, all stored in ATSD.
Generating Forecasts in R
The next step was to analyze the data and generate accurate forecasts. Built-in Holt-Winters and ARIMA algorithms were used in ATSD, and custom R language forecasting algorithms were used for comparison.
To analyze the data in R, the R language API client was used to retrieve the data and then save the custom forecasts back into ATSD.
Forecasts were built for all metrics for the period of May 11 until June 1.
The steps taken to forecast the pm2.5 metric will be highlighted.
The Rssa package was used to generate the forecast. This package implements the Singular Spectrum Analysis (SSA) method.
Recommendations from the following sources were used to choose parameters for SSA forecasting:
The following steps were executed when building the forecasts:
pm2.5 series was retrieved from ATSD using the query() function. 72 days of data were loaded.
SSA decomposition was built with a window of 24 days (24 * 24 hourly values) and 100 eigentriples:
dec <- ssa(values, L = 24 * 24, neig = 100)
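To make the decomposition step more concrete, here is a minimal sketch of what an SSA decomposition does: it embeds the series into a trajectory matrix, extracts eigentriples, and averages anti-diagonals to turn an eigentriple back into a series. It is written in plain Python for illustration and is not the Rssa implementation. For brevity it extracts only the leading eigentriple with power iteration instead of a full SVD, and all function names are hypothetical.

```python
import math


def trajectory_matrix(series, window):
    """Embed a series of length N into a window x (N - window + 1) Hankel matrix."""
    k = len(series) - window + 1
    return [[series[i + j] for j in range(k)] for i in range(window)]


def leading_eigentriple(X, iterations=200):
    """Approximate the largest singular value of X and its singular vectors
    with power iteration (a stand-in for the full SVD used in practice)."""
    rows, cols = len(X), len(X[0])
    u = [1.0 / math.sqrt(rows)] * rows
    for _ in range(iterations):
        # One step of power iteration on X * X^T: v = X^T u, then u = X v.
        v = [sum(X[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        w = [sum(X[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        norm = math.sqrt(sum(c * c for c in w))
        u = [c / norm for c in w]
    v = [sum(X[i][j] * u[i] for i in range(rows)) for j in range(cols)]
    sigma = math.sqrt(sum(c * c for c in v))
    return sigma, u, [c / sigma for c in v]


def reconstruct(sigma, u, v):
    """Turn one eigentriple back into a series by averaging anti-diagonals."""
    rows, cols = len(u), len(v)
    sums = [0.0] * (rows + cols - 1)
    counts = [0] * (rows + cols - 1)
    for i in range(rows):
        for j in range(cols):
            sums[i + j] += sigma * u[i] * v[j]
            counts[i + j] += 1
    return [s / c for s, c in zip(sums, counts)]


# Synthetic hourly series: a constant level plus a 24-hour cycle (made-up data).
series = [10.0 + 3.0 * math.sin(2 * math.pi * t / 24) for t in range(96)]
X = trajectory_matrix(series, window=24)
sigma, u, v = leading_eigentriple(X)
component = reconstruct(sigma, u, v)
print(len(X), len(X[0]), len(component))
```

In a full decomposition, every eigentriple is extracted this way, and groups of them are summed to separate trend, seasonal cycles and noise.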
Eigenvalues, eigenvectors, pairs of sequential eigenvectors and the w-correlation matrix of the decomposition were graphed:
plot(dec, type = "values")
plot(dec, type = "vectors", idx = 1:20)
plot(dec, type = "paired", idx = 1:20)
plot(wcor(dec), idx = 1:100)
A group of eigentriples was then selected for forecasting. The plots suggest several options.
Three different options were tested: 1, 1:23, and 1:35, because groups 1, 2:23 and 24:35 are separated from the other eigenvectors, as judged from the w-correlation matrix.
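For readers unfamiliar with the w-correlation matrix, the sketch below shows how a single w-correlation between two reconstructed components can be computed. It is a plain Python illustration of the standard definition, not the Rssa code, and the function names are hypothetical. Each point is weighted by the number of times it appears in the trajectory matrix. Values near 0 suggest that two components are well separated, while values near 1 or -1 suggest they belong in the same group.

```python
import math


def w_weights(n, window):
    """Weight of each point: how many times it occurs in the trajectory matrix."""
    l_star = min(window, n - window + 1)
    return [min(i + 1, l_star, n - i) for i in range(n)]


def w_correlation(a, b, window):
    """Weighted correlation between two reconstructed series of equal length."""
    w = w_weights(len(a), window)

    def inner(x, y):
        return sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))

    return inner(a, b) / math.sqrt(inner(a, a) * inner(b, b))


# Tiny made-up components, chosen so the weighted inner product is exactly zero.
first = [1.0, 1.0, 1.0, 1.0, 1.0]
second = [2.0, -1.0, 0.0, 0.0, 0.0]
print(w_weights(5, 2))
print(w_correlation(first, second, window=2))
print(w_correlation(first, first, window=2))
```

Computing this value for every pair of components gives the matrix that is plotted and inspected for blocks of mutually correlated eigentriples.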
The rforecast() function was used to build the forecast:
rforecast(x = dec, groups = 1:35, len = 21 * 24, base = "original")
Tests were also run with vforecast() and bforecast() using different parameters, but rforecast() was determined to be the best option in this case.
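Recurrent SSA forecasting continues a series with a linear recurrence relation derived from the selected eigenvectors. The sketch below illustrates that idea in plain Python and is not the Rssa implementation. It assumes the selected left singular vectors are already available as an orthonormal `basis` (a hypothetical input), and it applies the recurrence to whatever series is passed in, which loosely corresponds to the `base = "original"` choice in the call above.

```python
import math


def lrr_coefficients(basis):
    """Linear recurrence coefficients from orthonormal left singular vectors.

    basis: list of vectors, each of length L (the window).
    Returns L - 1 coefficients, ordered from the oldest to the newest point.
    """
    window = len(basis[0])
    nu2 = sum(u[-1] ** 2 for u in basis)  # verticality coefficient
    if nu2 >= 1.0:
        raise ValueError("verticality coefficient must be below 1")
    return [sum(u[-1] * u[i] for u in basis) / (1.0 - nu2)
            for i in range(window - 1)]


def recurrent_forecast(series, basis, steps):
    """Extend the series one point at a time using the recurrence."""
    coeffs = lrr_coefficients(basis)
    extended = list(series)
    for _ in range(steps):
        tail = extended[-len(coeffs):]
        extended.append(sum(c * x for c, x in zip(coeffs, tail)))
    return extended[len(series):]


# Made-up example: a geometric series has a single eigentriple whose left
# vector is proportional to (1, r, r^2, ...), so the forecast should keep
# following the same geometric decay.
r, window = 0.9, 4
raw = [r ** i for i in range(window)]
norm = math.sqrt(sum(x * x for x in raw))
basis = [[x / norm for x in raw]]
series = [100.0 * r ** t for t in range(10)]
forecast = recurrent_forecast(series, basis, steps=3)
print(forecast)
```

With real data the basis would contain all eigenvectors of the chosen group, for example the 35 leading ones, and the forecast horizon would be the 21 * 24 hourly points used above.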
Graph of the original series and three resulting forecasts:
The forecast with eigentriples 1:35 was selected as the most accurate and saved into ATSD.
To save forecasts into ATSD, the save_series() function was used.
Generating Forecasts in ATSD
The next step was to create a competing forecast in ATSD using the built-in forecasting features. The majority of the settings were left in automatic mode, so the system itself determines the best parameters (based on the historical data) when generating the forecast.
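As background on what a Holt-Winters forecast computes, here is a minimal additive Holt-Winters sketch in plain Python. It is not the ATSD implementation: the smoothing parameters `alpha`, `beta` and `gamma` are fixed by hand here, whereas ATSD is described above as selecting its parameters automatically, and the initialization scheme is one simple choice among several.

```python
def holt_winters_additive(series, period, alpha, beta, gamma, horizon):
    """Additive Holt-Winters: level + trend + seasonal offsets.

    series: historical values at a regular interval
    period: number of points in one seasonal cycle (24 for hourly data
            with a daily cycle)
    horizon: number of future points to forecast
    """
    if len(series) < 2 * period:
        raise ValueError("need at least two full seasons of data")
    first = series[:period]
    second = series[period:2 * period]
    # Simple initial estimates taken from the first two seasons.
    level = sum(first) / period
    trend = (sum(second) - sum(first)) / period ** 2
    seasonal = [x - level for x in first]

    for t in range(period, len(series)):
        s = seasonal[t % period]
        previous_level = level
        level = alpha * (series[t] - s) + (1 - alpha) * (level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
        seasonal[t % period] = gamma * (series[t] - level) + (1 - gamma) * s

    n = len(series)
    return [level + h * trend + seasonal[(n + h - 1) % period]
            for h in range(1, horizon + 1)]


# Made-up, perfectly periodic history with a cycle of 4 points and no trend.
history = [8.0, 6.0, 10.0, 16.0] * 3
forecast = holt_winters_additive(history, period=4, alpha=0.3, beta=0.1,
                                 gamma=0.2, horizon=4)
print(forecast)
```

For this trend-free, perfectly periodic toy history the forecast simply repeats the cycle. On real pollution data the level, trend and seasonal terms all adapt as new values arrive.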
Visualizing the Results
To visualize the data and forecasts, a portal was created using the built-in visualization features.
Thresholds have been set for each metric in order to alert the user when either the forecast or the actual data reaches unhealthy levels of air pollution.
When comparing the R forecasts and ATSD forecasts to the actual data, the ATSD forecasts turned out to be significantly more accurate in most cases, learning and recognizing the patterns and trends with more certainty. So far, as the actual data comes in, it follows the ATSD forecast very closely. Any deviations are minimal and fall within the confidence interval.
In this use case, the built-in forecasting of ATSD often produced more accurate results than even one of the most advanced R language forecasting algorithms. ATSD can be relied on to forecast air pollution a few days or weeks into the future.
You can keep track of how these forecasts perform in comparison to the actual data in Chart Lab.
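The post does not state which error measure was used for the comparison. One common way to compare two forecasts against the actual data is to compute the mean absolute error (MAE) and root mean squared error (RMSE) of each. The sketch below does this in plain Python. The numbers are made up purely to show the calculation and are not results from this use case.

```python
import math


def mae(actual, forecast):
    """Mean absolute error between two equally long series."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def rmse(actual, forecast):
    """Root mean squared error between two equally long series."""
    return math.sqrt(
        sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)
    )


# Made-up values for illustration only.
actual = [10.0, 12.0, 14.0]
forecast_a = [11.0, 12.0, 12.0]
forecast_b = [10.0, 13.0, 14.0]

for name, forecast in (("A", forecast_a), ("B", forecast_b)):
    print(name, round(mae(actual, forecast), 3), round(rmse(actual, forecast), 3))
```

The forecast with the lower error over the same evaluation period would be considered the more accurate one. RMSE penalizes large misses more heavily than MAE.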
Alerts and Notifications
A smart alert notification was set up in the Rule Engine to notify the user by email if the pollution levels breach the set threshold or deviate from the ATSD forecast.
Analytical rules set in the Rule Engine for the pm2.5 metric. Alerts will be raised if the streaming data satisfies one of the rules:
value > 30 - Raise an alert if the last metric value exceeds the threshold
forecast_deviation(avg()) > 2 - Raise an alert if the actual values exceed the forecast by more than 2 standard deviations, see image below. Smart rules capture extreme spikes in air pollution.
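The logic of the two rules can be approximated outside ATSD as follows. This Python sketch is not the Rule Engine implementation. It assumes that `forecast_deviation(avg())` measures how far the average of the current window is from the forecast, in units of the forecast's standard deviation, as the description above suggests. The inputs `forecast_value` and `forecast_stdev` are hypothetical and would come from the stored forecast.

```python
from statistics import mean


def evaluate_rules(window, forecast_value, forecast_stdev,
                   threshold=30.0, max_deviation=2.0):
    """Return the names of the rules that fire for the current window.

    window: recent actual values, oldest first
    forecast_value: forecast for the same period (hypothetical input)
    forecast_stdev: standard deviation of the forecast (hypothetical input)
    """
    alerts = []
    # Rule 1: the last value exceeds the fixed threshold.
    if window[-1] > threshold:
        alerts.append("THRESHOLD")
    # Rule 2: the window average is more than max_deviation standard
    # deviations above the forecast.
    if forecast_stdev > 0:
        deviation = (mean(window) - forecast_value) / forecast_stdev
        if deviation > max_deviation:
            alerts.append("FORECAST_DEVIATION")
    return alerts


# Made-up readings: a sudden spike, then a quiet period.
spike = evaluate_rules([12.0, 14.0, 35.0], forecast_value=15.0, forecast_stdev=2.0)
quiet = evaluate_rules([12.0, 13.0, 14.0], forecast_value=12.0, forecast_stdev=2.0)
print(spike, quiet)
```

The fixed threshold catches absolute breaches, while the deviation rule can fire on unusual spikes even when the readings are still below the threshold.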
At this point the use case is fully implemented and will function autonomously; ATSD automatically streams the sensor data, generates a new forecast every 24 hours for 3 weeks into the future and raises alerts if the pollution levels rise above the threshold or if a negative trend is discovered.
Results and Conclusions
The results of this use case are useful for travelers, who need an accurate forecast of the environmental and pollution-related issues they may face during their visits, and for expats moving to work in a new city or country. Studies have shown that long-term exposure to high levels of pm2.5 can lead to serious health issues.
This research and environmental forecasting is especially valuable in regions like China, where air pollution is seriously affecting the local population and visitors. In cities like Shanghai, Beijing and Guangzhou, pm2.5 levels are constantly fluctuating from unhealthy to critical levels and yet accurate forecasting is limited. PM2.5 forecasting is critical for travelers and tourists who need to plan their trips during periods of lower pollution levels due to the potential health risks associated with exposure to this sort of pollution.
Government agencies can also take advantage of pollution monitoring to plan and issue early warnings to travelers and locals, so that precautions can be taken to prevent exposure to unhealthy levels of pm2.5 pollution. Detecting a trend and raising an alert prior to pm2.5 levels breaching the unhealthy threshold is critical for public safety and health. Having good air quality data and performing data analytics can allow people to adapt and make informed decisions.
Big Data Analytics is an empowerment tool that can put valuable information in the hands of corporations, governments and individuals, and that knowledge can help motivate or give people tools to stimulate change. Air pollution is currently affecting the lives of over a billion people across the globe, and with current trends the situation will only get worse. Often the exact source of the air pollution, how it’s interacting in the air and how it’s dispersing cannot be determined, and the lack of such information makes it a difficult problem to tackle. With advances in modern technologies and new Big Data solutions, it is becoming possible to combine sensor data with meteorological satellite data to perform extensive data analytics and forecasting. Through Big Data analytics it will be possible to pinpoint the pollution source and dispersion trends days in advance.
I sincerely believe that Big Data has a large role to play in tackling air pollution and that in the coming years advanced data analytics will be a key tool influencing government decisions and regulation change.
You can learn more about Big Data analytics, forecasting and visualization at Axibase.