At the RCUK Centre for Energy Epidemiology, based at the UCL Energy Institute, a research project led by Tadj Oreszczyn and Jonathan Chambers set out to find better ways of using smart meter data to measure energy efficiency in UK homes. The aim was to apply big data methods to smart meter data to generate models of homes and their energy use.
The introduction of smart meters in the UK has provided high-quality, high-resolution electricity and gas consumption data for millions of homes. These datasets have enormous potential, but older data-processing frameworks struggle to cope with both their size and the complex calculations required. Researchers using traditional ‘desktop’ analysis were spending most of their time trying to process the data rather than addressing research questions.
The value of the raw smart meter energy data is fully unlocked when it is integrated with other time series and contextual data, such as local weather series and information on the characteristics of the homes. The Climate Forecast System Reanalysis dataset from the National Center for Atmospheric Research is a 10GB global, high-resolution gridded weather dataset drawn from sophisticated numerical weather prediction (NWP) models. Traditional desktop systems would struggle to store and link datasets of this size.
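To illustrate the kind of linkage involved, the sketch below attaches an hourly weather observation to each half-hourly meter reading by truncating the reading's timestamp to the hour. All names and data here are hypothetical, pure-Python stand-ins for what would run at scale on the platform.

```python
from datetime import datetime

# Hypothetical half-hourly meter readings: (timestamp, kWh)
readings = [
    (datetime(2016, 1, 4, 0, 0), 0.42),
    (datetime(2016, 1, 4, 0, 30), 0.38),
    (datetime(2016, 1, 4, 1, 0), 0.51),
]

# Hypothetical hourly outdoor temperatures from a gridded weather dataset
weather = {
    datetime(2016, 1, 4, 0, 0): 3.1,
    datetime(2016, 1, 4, 1, 0): 2.8,
}

def join_weather(readings, weather):
    """Attach the temperature for the hour each reading falls in."""
    joined = []
    for ts, kwh in readings:
        hour = ts.replace(minute=0, second=0, microsecond=0)
        joined.append((ts, kwh, weather.get(hour)))
    return joined
```

At scale, the same join would be expressed as a distributed join on the truncated timestamp (and grid cell) rather than a Python loop.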
The UK Data Service offers a solution to the volume and complexity of the datasets involved, as well as the complexity of the analysis and the associated data security concerns. Our Data Services as a Platform (DSaaP) prototype infrastructure uses Hadoop, an industry-standard “big data” platform, to store massive data collections in HDFS (the Hadoop Distributed File System). HDFS can scale to petabytes of storage, so managing datasets in the tens of gigabytes was straightforward.

The DSaaP infrastructure can process hundreds of millions of records in seconds or minutes instead of hours or days. The existing UCL data pipeline, written in Python, was re-implemented in PySpark, the Python API for Apache Spark, Hadoop’s distributed computation engine. This kept the pipeline maintainable within Hadoop and took advantage of Spark’s substantial performance benefits. The DSaaP analytics platform also provided data exploration and analysis tools, including Jupyter Notebooks, an integrated environment for interactive data analysis and visualisation.
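Typical platform queries, such as the mean yearly energy demand across all sites, reduce to a group-by aggregation over the observations. The sketch below shows the same logic in plain Python on hypothetical data; in the PySpark pipeline this would be a distributed `groupBy` over billions of rows rather than a dictionary in memory.

```python
from collections import defaultdict

# Hypothetical observations: (site_id, year, kWh for one interval)
observations = [
    ("site_a", 2015, 0.42), ("site_a", 2015, 0.58),
    ("site_a", 2016, 0.40),
    ("site_b", 2015, 1.10), ("site_b", 2016, 0.95),
]

def yearly_demand(observations):
    """Total consumption per (site, year) -- a group-by-sum."""
    totals = defaultdict(float)
    for site, year, kwh in observations:
        totals[(site, year)] += kwh
    return dict(totals)

def mean_yearly_demand(observations):
    """Mean of the per-site yearly totals across all sites and years."""
    totals = yearly_demand(observations)
    return sum(totals.values()) / len(totals)

# In PySpark the grouping step would be roughly:
#   df.groupBy("site", "year").sum("kwh")
```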
| Benefit | Impact |
| --- | --- |
| Performance | Reduced query times over 2.5 billion observations (for example, the mean yearly energy demand computed across all sites), cutting the time taken to analyse and visualise from four hours to 15 seconds. |
| Productivity | Reduced the development time of new models from approximately two weeks to 20 minutes, by dramatically reducing the time spent tackling the challenges of processing high volumes of data. |
| Scalability | Scaled the analysis out from a single site to over 8,000 sites, with up to 2.5 billion observations per query, leading to better model design and error detection. For example, an outlier detection filter applied to a single site on a local desktop ran for an hour and then crashed; with DSaaP, the same filter could be run on over 8,000 sites in five minutes. |
| Simplicity | Provided a single point of access to disparate data sources and interactive data analysis environments, allowing researchers to easily carry out a wide variety of analyses at scale. |
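The outlier detection filter mentioned above is not specified in detail in this account; an interquartile-range (IQR) rule is one common choice for meter data, sketched here in plain Python on hypothetical readings as an illustration only.

```python
def iqr_filter(values, k=1.5):
    """Keep values within k * IQR of the quartiles.

    A common outlier rule, used here purely as an illustration; the
    project's actual filter is not specified in the source text.
    """
    ordered = sorted(values)
    n = len(ordered)
    q1 = ordered[n // 4]          # crude lower quartile
    q3 = ordered[(3 * n) // 4]    # crude upper quartile
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lo <= v <= hi]

# Hypothetical half-hourly readings with one spurious spike
readings = [0.4, 0.5, 0.45, 0.55, 0.5, 99.0, 0.48, 0.52]
```

Run on 2.5 billion observations, such a filter would be expressed as a per-site distributed computation rather than a list comprehension.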