Olivia, a PhD researcher, is using the Integrated Census Microdata (ICeM) dataset to do research on foreign female domestic servants in London from 1880-1939. Her research consists of quantifying their presence in the capital during this period and investigating what nationality the head of household was when hiring the servants. Her theory is that there is a great correlation between the mistress’ nationality and that of the servants. She would also like to capture how many others in the household had the same nationality as the servants, and what position in the family they were.
The data consists of over 180 million census records, digitised and harmonised from the original enumeration books, detailing up to 118 variables for all individuals resident in Great Britain at the time of each census. Researchers can currently download subsets of the ICeM dataset, by using a purpose-built access tool to filter the 180 million census records based on 20 key variables, before downloading the resulting data table of individual census records with the full complement of variables (118) per record. Data download, however, is limited to 1 million records. Alternatively, researchers can explore and analyse the records of individual census years online via the ICeM-Nesstar system, by tabulating variables and exporting resulting statistical tables. This system however is limited to only 47 key variables. Either way, it is not straightforward to home in the data that really matters to the researcher.
The detailed census microdata are held in flattened legacy database tables, one per census year. These needed to be pre-processed and then queried using the existing ICeM Application Programming Interface (API), limiting the data to London by regional code and including only individuals who live in private households with a female domestic servant. The resulting ICeM dataset was hosted in the DSaaP environment (based on Hadoop – an industry-standard big data platform) and made available for Olivia to customise and run SQL queries written with assistance from the DSaaP team. Olivia then used Zeppelin notebook (an interactive analysis tool native to Hadoop) to export the data extractions generated from the SQL queries on to her desktop for offline analysis.
The Zeppelin notebook also tool gave Olivia access to the power of Apache Spark – a fast, in-memory data processing engine, as well as providing her with access to the Spark API which supports powerful data visualisations including histograms, thematic mapping, and rendering of live streaming inputs.
The process of framing the research question, identifying, subsetting and downloading the data in an enhanced structure and format, was deemed a positive learning experience by Olivia. The explicit value of the data outputs was realised through iterative discussions of the research goals with the DSaaP team. Their support was considered invaluable in optimising Olivia’s data queries.
|Performance||Spark facilitates the iterative tuning of queries containing multiple subqueries – each amended re-run taking just 12 seconds. For example, correlated totals (foreigner, female foreigner, same nationality, same nationality servant, same nationality female servant) for the household of the servant, could be calculated in this timeframe or less.|
|Productivity||Utilising the legacy UK Data Service ICeM API (with its one million record download limit) would have involved multiple downloads and complex joins to calculate the totals detailed above|
|Scalability||This particular use case only required the London area subset of the six decennial census years of ICeM data. The entire dataset (180 million records) would easily be within the capacity of secure On-Premises DSaaP infrastructure.|
|Simplicity||Spark, with its SQL query interface, gives researchers with rudimentary SQL skills access to the immense storage capacity of Hadoop and fast in-memory processing|
The UK Data Service’s new DSaaP infrastructure provided a powerful and unified environment for the interactive examination and analysis of the ICeM dataset. It offered a low barrier of entry for Olivia to be able to exploit the computational power of Spark and the vast data storage of Hadoop. The flexibility of the Zeppelin notebook environment also gives data scientists complete flexibility to craft queries unique to their research domain in multiple programming languages, in SQL. R, Python and many others. Data was easier to download from DSaaP and easier to manipulate, allowing Olivia to derive meaning from a significantly larger subset of the census microdata than is currently possible using standard desktop techniques