Synthetic ALSPAC longitudinal datasets for the Big Data VR project

Three synthetic datasets - of observation size 15,000, 155,000 and 1,555,000 participants, respectively - were created by simulating eleven cardiac and anthropometric variables from nine collection ages of the ALSPAC birth cohort study. The synthetic datasets retain similar data properties to the ALSPAC study data they are simulated from (covariance matrices, as well as the mean and variance values of the variables) without including the original data itself or disclosing participant information. In this instance, the three synthetic datasets have been utilised in an academia-industry collaboration to build a prototype virtual reality data analysis software, but they could have a broader use in method and software development projects where sensitive data cannot be freely shared.


Introduction
In 2015 Wellcome Trust and Epic Games ran a challenge, pairing computer games developers with researchers, to develop visualisation methods in virtual reality (VR) for big biomedical datasets from the following Wellcome Trust funded research projects:
1. A collection of historical medical records from the Casebooks Project
2. Genomics data from the Sanger Institute
3. Cohort data from the ALSPAC study (also known as Children of the Nineties)
University of Bristol researchers were paired with team Luma Pie (comprising Masters of Pie http://www.mastersofpie.com and Lumacode http://www.lumacode.com), who won the challenge with the vARC concept designed to visualise complex cohort data from the ALSPAC study. The Masters of Pie blog records vARC development (part 1, http://www.mastersofpie.com/big-data-vrchallenge/ and part 2, http://www.mastersofpie.com/big-data-vrchallenge-phase-2-update/), and a description of the challenge-winning vARC prototype (http://www.mastersofpie.com/big-datavr-challenge-winners/).
The value of this emerging technology and its potential applications to e-health and wider use in medicine were recognised by the winning collaboration, who continue to develop a VR proof-of-concept biomedical data exploration and visualisation tool under the Big Data VR project, using the ALSPAC cohort study as a use case. This project has additionally explored a variety of VR visual analytic methodologies, investigated VR analytics applied to different scales of data and scoped the integration of privacy-protecting analytical methods via DataSHIELD 1. Findings will be reported in a forthcoming paper.
Due to the nature of the Big Data VR project, it was necessary to use a dataset that could be freely shared across the project team of researchers and games developers, and that could also be deployed as an open test dataset for a demo release of the Big Data VR tool. There exist, however, ethical-legal constraints on the open sharing of, or access to, biomedical study data due to concerns around participant privacy and disclosure risk. ALSPAC deploys a rigorous data governance and access policy to protect participant confidentiality and guard against disclosure. This meant that we could not simply share real ALSPAC data with the developers without going through a potentially lengthy process of formally assessing the bona fides of every single person in the development team who might need to work with or see the data. Given the very short time scale of the project, this was not feasible. However, in order to properly challenge the developers and their evolving tools, and to ensure that the tools would ultimately be useful in a meaningful scientific context, it was nevertheless important that the test datasets closely mirrored real ALSPAC data. To ensure privacy protection, it was therefore necessary to generate synthetic datasets to be used in the project, an approach commonly utilised within the research health data domain 2. This paper outlines three synthetic datasets simulated from ALSPAC study data for the purposes of the Big Data VR project.

Methods
Based at the University of Bristol, ALSPAC (also known as Children of the 90s) studies health and well-being in a cohort of pregnancies from the Avon region, with children born between 1991 and 1992. The whole cohort includes children from original enrolment (phase I recruitment), as well as children invited to join from the age of 7 onwards (from phase II and III recruitment), n = 15445 participants (excluding triplets and quadruplets) at the time of this work. Cohort profiles are described in Boyd et al. 3 and Fraser et al. 4, and the study website contains details of all available data through a fully searchable data dictionary (http://www.bris.ac.uk/alspac/researchers/data-access/data-dictionary/). The variables from 15445 ALSPAC child participants used for the simulated data generation are outlined in Table 1. They include cardiac measures (i.e. blood pressure and pulse rate) and anthropometric measures (i.e. height, sitting height, weight, BMI, hip and waist circumference) of children visiting different ALSPAC clinics. The age indicated at each clinic is the age of the child at attendance, which is calculated from the date of the visit and the child's date of birth. All variables used were continuous, except gender, which is a binary variable (with 1 indicating male and 2 indicating female). The coverage of these variables at different clinic ages is shown in Table 2, highlighting any variables missing from collection.
Synthetic data were simulated using the statistical programming language R (5, version 3.2.3) in the following steps, with the corresponding R functions noted inline:

Data cleaning
The ALSPAC dataset described in Table 1 and Table 2 was cleaned by removing all rows with missing values, leaving 1593 observations.
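As an illustration of this complete-case step, the following is a minimal sketch in R. The text does not state which function was used for cleaning, so na.omit() is an assumption, and the data frame, its column names and its values are hypothetical toy data rather than ALSPAC data.

```r
# Toy stand-in for the ALSPAC extract (hypothetical column names and values)
raw <- data.frame(
  height_f7 = c(125.1, NA, 128.4, 122.9),
  weight_f7 = c(24.3, 26.0, NA, 23.1),
  pulse_f7  = c(82, 79, 85, 90)
)

# Keep only the rows that have no missing value in any column
# (na.omit() is an assumed choice; the paper does not name the function)
clean <- na.omit(raw)

nrow(raw)    # 4 rows before cleaning
nrow(clean)  # 2 rows after cleaning
```

Complete-case deletion is simple, but it can discard many rows when missingness is spread across many variables and clinic ages.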

Standardising continuous variables
Each continuous variable, x, was standardised using the z-score transformation:

$$z = \frac{x - \mu}{\sigma}$$

where z denotes the standardised version of the variable, with µ and σ representing the mean and standard deviation of x, respectively (using mean() and sd()). This z-score transformation was used to transform normally distributed data N(µ, σ) to standard normally distributed data N(0, 1).
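A minimal sketch of the standardisation in R, using the mean() and sd() functions named above. The vector x holds hypothetical toy values, not ALSPAC measurements.

```r
# Toy values standing in for one continuous variable (hypothetical)
x <- c(120, 125, 130, 135, 140)

mu    <- mean(x)  # sample mean of x
sigma <- sd(x)    # sample standard deviation of x

# z-score transformation: subtract the mean, divide by the standard deviation
z <- (x - mu) / sigma

mean(z)  # zero, up to floating-point error
sd(z)    # one, up to floating-point error
```

Keeping mu and sigma is what later allows simulated z-scores to be mapped back to the original scale.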

Data generation: Continuous variables
It was assumed that the continuous variables (excluding BMI) follow an approximate multivariate normal distribution. Using the pseudo-random multivariate normal generator (mvrnorm()), three synthetic datasets were generated of observation sizes 15500, 155000 and 1550000 participants. Because approximate multivariate normality was assumed (without transforming any non-normal data to normal), the synthetic data do not have precisely the same joint and marginal distributions as the original ALSPAC data, but they approximate them closely, with most variables passing formal tests of normality. The simulated continuous variables were then rescaled back to their original mean and standard deviation by the inverse z-score transformation:

$$X = Z\sigma + \mu$$

where X and Z denote the simulated data for x and z respectively, with µ and σ representing (as above) the mean and standard deviation of the real x data.
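The generation and rescaling steps can be sketched as follows. This is an illustration under stated assumptions, not the authors' script: the "real" data are hypothetical toy values with two variables, and estimating the covariance matrix from the standardised data is an assumed reading of the method. mvrnorm() comes from the MASS package.

```r
library(MASS)  # provides mvrnorm()

set.seed(1)

# Toy "real" data standing in for the cleaned ALSPAC variables (hypothetical)
height <- rnorm(200, mean = 126, sd = 5.5)
weight <- 25 + 0.6 * (height - 126) + rnorm(200, sd = 3)
real   <- data.frame(height = height, weight = weight)

mu    <- colMeans(real)      # per-variable means of the real data
sigma <- apply(real, 2, sd)  # per-variable standard deviations

# Standardise each column, then estimate the covariance of the z-scores
# (assumption: the covariance passed to mvrnorm() is that of the z-scores)
z_real  <- scale(real, center = mu, scale = sigma)
Sigma_z <- cov(z_real)

# Draw n synthetic z-score vectors from N(0, Sigma_z)
n <- 15500
Z <- mvrnorm(n, mu = rep(0, ncol(real)), Sigma = Sigma_z)

# Inverse z-score transformation, column by column: X = Z * sigma + mu
X <- sweep(sweep(Z, 2, sigma, `*`), 2, mu, `+`)
colnames(X) <- names(real)
synthetic <- as.data.frame(X)
```

Comparing colMeans(synthetic), apply(synthetic, 2, var) and cov(synthetic) with the same summaries of the real data gives a quick check that the first two moments are approximately preserved.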

Data generation: Binary variables
The simulated gender variable retains the same proportions of males and females as that in the original ALSPAC dataset. This was achieved by converting the levels 1-2 (indicating males and females respectively) to 0-1 data and then fitting a logistic regression model (glm()) of gender on all continuous variables using the original dataset. The estimated coefficients were then used to calculate the linear predictors of the simulated datasets. Then, using the log odds, y, from the linear predictors, we calculated the probability, p, of the category coded 1 (female under the recoding above), using the inverse logit (also known as expit) transformation:

$$p = \frac{e^{y}}{1 + e^{y}}$$

The simulated binary variable denoting gender in each subject was then generated using the value of p in that individual (derived from the expit transformation) as the probability argument in R's rbinom() function.
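The steps for the binary variable can be sketched in R as below. All data are hypothetical toy values, the two-predictor model formula is only an example, and the use of predict() and plogis() is an assumption about how the linear predictor and expit transformation might be computed; the text names only glm() and rbinom().

```r
set.seed(2)

# Toy "real" data: gender coded 1 = male, 2 = female, as in the text
n_real <- 500
height <- rnorm(n_real, mean = 126, sd = 5.5)
weight <- rnorm(n_real, mean = 25, sd = 4)
gender <- rbinom(n_real, size = 1, prob = plogis(-0.05 * (height - 126))) + 1
real   <- data.frame(gender, height, weight)

# Recode 1/2 to 0/1 and fit a logistic regression on the continuous variables
real$female <- real$gender - 1
fit <- glm(female ~ height + weight, data = real, family = binomial)

# Toy synthetic continuous data (stand-in for the mvrnorm() output)
synthetic <- data.frame(height = rnorm(10000, mean = 126, sd = 5.5),
                        weight = rnorm(10000, mean = 25, sd = 4))

# Linear predictor (log odds) y for each simulated individual
y <- predict(fit, newdata = synthetic, type = "link")

# Inverse logit (expit): p = exp(y) / (1 + exp(y))
p <- plogis(y)

# One Bernoulli draw per individual, mapped back to the 1/2 coding
synthetic$gender <- rbinom(nrow(synthetic), size = 1, prob = p) + 1
```

Because each individual's probability depends on their simulated continuous values, this approach carries the association between gender and the other variables into the synthetic data, rather than only matching the overall proportion.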

Data generation: BMI variable
The simulated BMI variable was calculated from the simulated values of weight and height for the clinics F@7, F@9, F@10, F@11 and TF4 using the relationship:

$$\mathrm{BMI} = \frac{\mathrm{weight}}{(\mathrm{height}/100)^{2}}$$

Data generation: Age variable
The age at each clinic, initially reported in months, was divided by 12 to represent its values in years. The simulated age variable at each clinic was generated assuming normality and using the rnorm() R function with mean and variance set equal to the actual mean and variance of age at each clinic.
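A minimal sketch of the BMI and age steps for a single clinic, in R. The values are hypothetical toy data, and the units (weight in kg, height in cm) are an assumption inferred from the division by 100 in the BMI relationship. Note that rnorm() takes a standard deviation rather than a variance, so the square root of the variance is passed.

```r
set.seed(3)

# Toy simulated values for one clinic (hypothetical; assumed kg and cm)
synthetic <- data.frame(height = rnorm(1000, mean = 126, sd = 5.5),
                        weight = rnorm(1000, mean = 25, sd = 4))

# BMI = weight / (height / 100)^2
synthetic$bmi <- synthetic$weight / (synthetic$height / 100)^2

# Toy "real" ages at the clinic, in months, converted to years
age_months <- c(89, 90, 91, 92, 90, 88, 93)
age_years  <- age_months / 12

# Simulate age from a normal distribution matching the real mean and variance
synthetic$age <- rnorm(nrow(synthetic),
                       mean = mean(age_years),
                       sd   = sqrt(var(age_years)))
```

Because BMI is derived from simulated weight and height rather than simulated directly, its distribution in the synthetic data is a consequence of those two variables and need not match the real BMI moments exactly.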
The three synthetic datasets have similar properties to the ALSPAC data they are simulated from. This is demonstrated by the close similarity of the estimated means, variances and covariance matrices for the relevant variables in the original ALSPAC dataset and the three synthetic datasets (see Supplementary material). The synthetic datasets contain none of the original data itself.

This report describes the creation of three simulated datasets using the ALSPAC (Children of the Nineties) cohort. The data are easily accessible, and the provision of the links to the 'Masters of Pie' blog helps provide some interesting context for the work.
Some very minor points for the authors' consideration: It is mentioned that the VR project had a very short time scale. It might be useful to provide some information about the time scale, and how this relates to a typical turnaround time for a formal data request using the usual ALSPAC process.
Could the authors provide a little more information to explain why they created three synthetic datasets of different sizes? Why 15500, 155000 and 1550000?
Table 2 suggests that weight and height were available at all timepoints. However, Table 2 lists 'NA' for BMI at various timepoints. If weight and height were available, why was BMI not derived?
Finally, in Supplementary File 2 (variable variance), the variances for the real ALSPAC data are really similar to the variances of the simulated data for most of the variables. The variances for BMI seem to be a little different, with the variances in the simulated datasets being a little higher at all ages across the simulated datasets. Perhaps the authors could comment on this observation?

Is the rationale for creating the dataset(s) clearly described? Yes
Are the protocols appropriate and is the work technically sound? Yes

Are sufficient details of methods and materials provided to allow replication by others? Yes
Are the datasets clearly presented in a useable and accessible format? Yes