Characterising the index mothers in the Avon Longitudinal Study of Parents and Children (ALSPAC) who are also UKBioBank participants [version 1; peer review: awaiting peer review]

Background: This is an examination of the characteristics of the women who enrolled into both the ALSPAC cohort during the initial recruitment phase in 1991-1992 and subsequently into the UKBiobank (UKBB) cohort from 2007 to 2010. There is substantial overlap in recruitment across the two cohorts, with the existing Avon Longitudinal Study of Parents and Children (ALSPAC) sampling frame being included in that later deployed for UKBiobank. Methods: A process was developed using anonymisation and encryption to flag individuals participating in both studies without revealing the identity of those participants. The demographic, socio-economic, locational, consent and participatory characteristics of the women in ALSPAC who were also eligible to participate in UKBiobank were then described and compared by UKBB participation status. Results: The ALSPAC mothers enrolled in UKBB were more engaged in and contributed more data to ALSPAC than those who were eligible to join UKBB but did not enrol. The women in both studies were more likely to be married, have higher levels of educational attainment and more financial stability than their counterparts who were only enrolled in ALSPAC. Conclusion: It is possible these characteristics are reflected more generally within UKBB compared to the wider population.


Introduction
The Avon Longitudinal Study of Parents and Children (ALSPAC) is a multigenerational birth cohort study which began in the early 1990s with the recruitment of around 14,000 pregnant women who lived within a defined geographic area in and around the city of Bristol, UK i,ii,iii . The mothers, their partners, and the study children have been followed up ever since via self-complete questionnaires and clinic visits. Study participants have been provided 'fair processing' materials describing ALSPAC's intended use of their health and administrative records and were given clear means to consent or object or even withdraw (or dissent) from the study via a written form. Data were not extracted for participants who objected, dissented or withdrew. This has resulted in a rich databank containing information on many aspects of participants' lives, including their health, socio-economic circumstances, and families. As a birth cohort with an initial recruitment rate over 75%, where the only qualifying factors were being pregnant within a certain timeframe and within a certain geographic region, ALSPAC can be considered broadly representative of the population at the time iv .
UKBiobank (UKBB) is a much larger study than ALSPAC; approximately 500,000 participants aged 40 to 69 were enrolled between 2006 and 2010 v . In contrast to ALSPAC, UKBiobank is a national study, with participants living within 10 miles of one of 22 recruitment centres spread across the UK. One of these centres was located in Bristol. The location requirement was for practical and financial reasons. At the point of recruitment, UKBiobank collected biological samples, physical measurements and other data relevant to epidemiological studies. They have also run further clinics and studies. The recruitment rate was much lower than for ALSPAC: 9.2m adults were invited to enrol in UKBB, 576k accepted the invitation (6.2%), and 503k were ultimately enrolled, a recruitment rate of 5.5% vi .
Cohort studies can be combined in studies of rare events or weak associations, where large numbers are needed to increase statistical power. This is often done using methods that can account for between-study heterogeneity (for example, meta-analyses). ALSPAC and UKBiobank are both examples of cohorts used routinely in this way within the UK vii . Combining cohorts assumes that the study samples are independent. However, given the sampling frames of these two studies, it was considered highly likely that a subset of the ALSPAC mothers and partners are also enrolled in UKBiobank. The level of overlap between the two samples, and the consequences of this, needs to be understood in order to assess the implications for analyses and the potential benefits of having participants who are in both studies. It has only recently become possible to identify the individuals in each cohort using an anonymised, encrypted code based on their unique NHS number. This method allows us to identify women enrolled in both studies. The aims of the work described in this data note were to: (1) determine which women in the ALSPAC sample are also enrolled in UKBB and create a flag so they can be easily identified; (2) examine whether the women in both ALSPAC and UKBB differ from the women in ALSPAC only, in terms of demographic, socio-economic, location, consent, and participation characteristics. Note that at the time this work was undertaken (July 2019) ALSPAC had NHS identifiers for the women in ALSPAC (known as the 'ALSPAC mothers'), but not for the men (the partners of the women, also referred to as the 'ALSPAC fathers'). Therefore, only women are included in this current study. It is anticipated that enrolled partners/fathers will be linked to the NHS in the future, and the encrypted data exchange protocol may be repeated to establish the number of them also enrolled in UKBiobank.
in Avon (former county covering Bristol and the surrounding areas in the Southwest UK), with expected dates of delivery between 1st April 1991 and 31st December 1992 i,ii . 13,640 of these women had a study child alive at one year of age. These children and their mothers are referred to as the 'core sample'. This core sample was boosted a few years later by the recruitment of children who met the original eligibility criteria but whose mothers had not joined the study during pregnancy. Please note that the study website contains details of all the data that are available through a fully searchable data dictionary and variable search tool (http://www.bristol.ac.uk/alspac/researchers/our-data).

Identifying women in both ALSPAC and UKBiobank
Both ALSPAC and UKBiobank used the 'OpenPseudonymiser' package (https://www.openpseudonymiser.org/) to create their respective lists of encrypted/hashed (AES256bit) identifiers. Initially, 14,875 encrypted identifiers were generated for all the non-dissenting women in ALSPAC and 502,383 for participants in UKBB, both male and female. The lists were based on the following identifiers:
• Month and year of birth
The list of UKBiobank encrypted identifiers was securely transferred from UKBB to ALSPAC and merged to identify matches and generate a flag against participants enrolled in both studies. A list of matching participants was then securely transferred from ALSPAC back to UKBB. The list of 14,875 included women recruited into ALSPAC at any time. However, the descriptive analyses presented in this Data Note are restricted to women in the core sample, as only they have baseline characteristic data.
A list of the basic identifiers of all the women enrolled in ALSPAC was securely compared against the identifiers of those enrolled in UKBiobank, as described above, and a marker was created to indicate women found in both groups.
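The matching step described above can be illustrated with a minimal sketch. It is not the OpenPseudonymiser implementation: a salted SHA-256 digest from the Python standard library stands in for the encryption used by the two studies, and the identifier strings, the salt and the function names are hypothetical. The point it shows is that only digests need to be exchanged, and the flag is derived by set membership.

```python
import hashlib

def pseudonymise(identifier: str, salt: str) -> str:
    """Return a salted SHA-256 digest of a normalised identifier (stand-in for the real tool)."""
    normalised = identifier.strip().upper()
    return hashlib.sha256((salt + normalised).encode("utf-8")).hexdigest()

def flag_matches(own_ids, other_digests, salt):
    """Map each of our raw identifiers to True if its digest appears in the other study's list."""
    other = set(other_digests)
    return {raw: pseudonymise(raw, salt) in other for raw in own_ids}

# Hypothetical toy data: both studies must agree on the salt and identifier format.
SHARED_SALT = "example-shared-salt"
alspac_ids = ["ID001|03-1962", "ID002|11-1958", "ID003|07-1965"]
ukbb_digests = [pseudonymise(x, SHARED_SALT) for x in ["ID002|11-1958", "ID999|01-1950"]]

# Only the digests cross the study boundary; raw identifiers stay local.
in_both = flag_matches(alspac_ids, ukbb_digests, SHARED_SALT)
```

Because both sides normalise and salt the identifiers in the same way, identical inputs give identical digests, while the receiving study cannot recover identifiers for people it does not already hold.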
Defining the denominator of women in ALSPAC who were also eligible for UKBiobank
In order to determine the denominator of our study, it was necessary to identify which women in ALSPAC had the potential to also be a UKBiobank participant (i.e. met the eligibility criteria) (Figure 1). We first restricted our study sample to those in the core ALSPAC sample, with a study child alive at one year. These restrictions were because: (1) women who had a miscarriage, or whose child died before one year of age, did not continue participating in ALSPAC, and (2) women in the non-core sample did not complete the early questionnaires that contain baseline characteristics. Women who had declined consent for the use of their health records were then dropped as we have no way of identifying whether those women are in UKBB or not. Finally, as eligibility for participation in UKBB was constrained to those aged 40 to 69 years old, we only included women in ALSPAC who were in that age range during UKBB's recruitment phase (UKBB recruitment began in April 2007 and ended in August 2010 when the target of 500,000 was reached). It was deduced that those enrolled in UKBB aged 69 in April 2007 at the start of recruitment for UKBB would have been born in 1938 and those aged 40 in August 2010 at the close of recruitment would have been born in 1970. Restricting our study sample to those born between 1938 and 1970 resulted in a final study sample of 12,127 women who are in the core ALSPAC sample, have not opted out of health record linkage, and who were eligible to enrol in UKBB. Our study sample was not restricted on geographic criteria due to uncertainties in the full residential history of all ALSPAC participants and the difficulties in obtaining the precise boundaries used during recruitment into UKBB.
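The restrictions above amount to a sequence of filters. The sketch below is illustrative only: the record structure and field names are hypothetical, and only the birth-year bounds (1938 and 1970) are taken from the text.

```python
from dataclasses import dataclass

# Birth-year bounds as deduced in the text: aged 69 in 2007, aged 40 in 2010.
UKBB_MIN_BIRTH_YEAR = 2007 - 69  # 1938
UKBB_MAX_BIRTH_YEAR = 2010 - 40  # 1970

@dataclass
class Woman:
    """Hypothetical record; field names are assumptions, not ALSPAC variable names."""
    core_sample: bool
    child_alive_at_one_year: bool
    linkage_consent: bool
    birth_year: int

def eligible_for_ukbb(w: Woman) -> bool:
    """Apply the study-sample restrictions in the order described."""
    return (
        w.core_sample
        and w.child_alive_at_one_year
        and w.linkage_consent
        and UKBB_MIN_BIRTH_YEAR <= w.birth_year <= UKBB_MAX_BIRTH_YEAR
    )

# Toy data, one woman failing each criterion in turn.
women = [
    Woman(True, True, True, 1962),   # eligible
    Woman(False, True, True, 1962),  # not in core sample
    Woman(True, True, False, 1962),  # declined record linkage
    Woman(True, True, True, 1973),   # too young for UKBB recruitment
]
study_sample = [w for w in women if eligible_for_ukbb(w)]
```

No geographic filter appears in the sketch, in line with the decision not to restrict the sample on location.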

ALSPAC measures
A number of measures, detailed below, were used to compare the women in the study sample who were in both studies compared to those in only ALSPAC, based on demographic, partner, socio-economic, locational, consent and participatory data.
Participation: a participation score was generated by counting how many of the 56 possible questionnaires each woman had returned. This was grouped into ten categories, from category 10 (low participation, <10% of questionnaires returned) to category one (high participation, >90% of questionnaires returned). A binary variable (yes, no) was also derived to indicate whether the mother has a partner who is enrolled in the ALSPAC cohort.
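As an illustration of how such a score could be banded, the sketch below maps a count of returned questionnaires onto ten categories. Only the two end categories are defined in the text; the equal 10-percentage-point bands in between, and the treatment of values falling exactly on a boundary, are assumptions.

```python
TOTAL_QUESTIONNAIRES = 56  # number of possible questionnaires, from the text

def participation_category(returned: int) -> int:
    """Return 10 (lowest participation) to 1 (highest), assuming equal-width 10% bands."""
    if not 0 <= returned <= TOTAL_QUESTIONNAIRES:
        raise ValueError("returned must be between 0 and 56")
    # Integer arithmetic avoids floating-point surprises at band edges.
    band = min((returned * 10) // TOTAL_QUESTIONNAIRES, 9)  # 0 = under 10%, 9 = 90% or more
    return 10 - band

# 5 of 56 is about 8.9% (category 10); 51 of 56 is about 91.1% (category 1).
examples = {n: participation_category(n) for n in (0, 5, 6, 50, 51, 56)}
```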
Age: The women's age was known directly from the study child's birth records. The age of the women at the time of delivery was categorised into five-year brackets. A binary variable was created to indicate whether, within each couple, the partner was older than the woman based on their respective ages at the birth of the index child (yes, no).
Ethnicity: self-reported by the women (White; non-White; no further disaggregation possible due to small numbers). The child's ethnicity was derived from the ethnicity of the mother and the father, when this was known. It was designated as 'White' if both parents were 'White' and 'non-White' if at least one parent was known to be 'non-White'.
Partner status: woman lived with partner during pregnancy with study child (yes, no).
Pregnancy/child related: length of gestation (weeks) and birthweight (kg) of study child; parity (0; 1+); multiple birth (yes, no); breastfeeding of study child (yes, no). A further variable recorded whether the mother's previous child(ren) had been breastfed. This was only relevant if the mother had had a previous child, so it was first determined whether the mother was primiparous or not. The feeding method used for previous child(ren) was then reduced to a binary (yes, no) breastfeeding variable.
Smoking: two binary (yes, no) variables were derived: (i) ever smoked and (ii) smoking during pregnancy. These were based on self-reported smoking behaviours reported at three time points: during the pregnancy (where the women were asked whether they had ever been a smoker and their current smoking); when the study child was aged two months (asked whether mother smoked in labour); and when the study child reached the age of eight years (asked about whether they have ever smoked and current smoking).
Alcohol: detailed responses about the number of glasses viii of alcohol consumed, the type of alcoholic beverage and the pattern of drinking through the days of the week were sought through questionnaires during the pregnancy and again in the period two months to 21 months after. The categories 'one or two glasses per day' and 'three to nine glasses per day' were combined into a single category due to the presence of small numbers.
Alcoholism: a binary variable, ever suffered from alcoholism (yes, no), was derived based on women's self-reports during pregnancy. The 'yes' group includes those who answered 'recently' and 'in the past'. When the study child was aged eight and 21 months, the women reported whether their partner had had alcoholism since the baby was born and a binary variable was derived (no, yes). The partner was asked directly during the pregnancy if they had suffered alcoholism and a binary variable was derived (yes, no).
The mother was also asked during her pregnancy whether either of their natural parents, or alternatively a parental figure, had an alcohol problem. The partner was asked the same question of their own parents at the same time. Any response indicating an index child's grandparent or grandparent figure had an alcohol problem was consolidated into a 'yes'. A 'no' was only recorded if it was definitely recorded that no grandparent or grandparent figure of the index child had an alcohol problem, otherwise the value was left blank (missing). These responses were consolidated into a single binary (yes, no) variable.
Substance misuse: The women were asked about substance misuse during their pregnancy, again when their child was aged about eight years and later when their child was aged about 11 years. Their partners were asked the same questions at the same timepoints. Any answers of 'yes' were aggregated together over time. Any answer of 'no' was included if not contradicted. The remaining data was recorded as missing.
Health and Wellbeing: The women and their partners were asked to self-assess their state of general health at several time points before and after the birth of their child. While still pregnant the women were asked to assess their own and their partner's health. The women and partners were asked to re-assess their own health eight weeks after the birth of their child and again at 21 months after the birth of their child.
viii The questionnaires stipulated 'By glass we mean a pub measure of spirits, half a pint of lager or cider, a wine glass of wine, etc.'
Financial difficulties: The women were asked whether they had experienced financial difficulty when their child was aged about eight months, during the second year at about one year nine months and finally at about two years nine months. A binary flag was generated in which any period of financial difficulty was recorded as 'yes', whereas a 'no' required all periods to be recorded as being without financial difficulty. Any participants without a complete response record were excluded as 'missing'.
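The same consolidation pattern recurs for several of the measures above (family alcohol problems, substance misuse and financial difficulties): repeated responses are collapsed into one binary flag. A minimal sketch follows, assuming each time point is coded True (yes), False (no) or None (no response), and assuming that a single 'yes' is sufficient even when other time points are missing. The function name and coding are hypothetical.

```python
from typing import Optional, Sequence

def consolidate(responses: Sequence[Optional[bool]]) -> Optional[bool]:
    """Collapse repeated yes/no responses into one flag.

    Any 'yes' gives True; False requires a complete record of 'no' responses;
    everything else is treated as missing (None).
    """
    if any(r is True for r in responses):
        return True
    if responses and all(r is False for r in responses):
        return False
    return None

# Responses at roughly 8, 21 and 33 months (toy data).
ever_difficulty = consolidate([False, True, None])       # a single 'yes' is enough
never_difficulty = consolidate([False, False, False])    # complete record of 'no'
incomplete = consolidate([False, None, False])           # cannot confirm 'no'
```

The asymmetry is deliberate: one positive report establishes 'yes', while a gap in the record means 'no' cannot be confirmed.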
Place of Residence: The women were asked how many times they had moved in the five years prior to their pregnancy. They were also asked for the status of their residential circumstances at the time of their pregnancy. The historic address records available to ALSPAC of those registered in the study were utilised to examine the residential stability of the participants. The residential address postcodes were linked to the Index of Multiple Deprivation (IMD ix ) by decile using the year 2000 indices for the period during recruitment into ALSPAC and the 2007 indices for the period of recruitment into UKBB.
Socio-Economic Class and Education: during the pregnancy, the women and their partners were asked about their highest educational qualification. The women and their partners were also asked about their employment at the later stages of the pregnancy. The responses were provided as free text and were manually coded against the OPCS x occupation codes. From the responses, the maternal and paternal social classes were derived as at the time just prior to the birth of the study child using the Standard Occupation Classification 1990.

Women in both ALSPAC and UKBiobank
When the 14,875 encrypted identifiers generated for the women in ALSPAC were compared to the 502,383 for participants in UKBB, there were initially 938 matches in total; however, only 866 of these were in the study sample as defined in Figure 1. Therefore, of the 12,127 women in ALSPAC who were also eligible to enrol in UKBiobank, 866 (7.1%) were confirmed to be enrolled in both studies. This is higher than the general rate of enrolment in UKBB (5.5%). It was observed that 88.9% of the originally enrolled women were also eligible to be enrolled into UKBB, demonstrating, as expected, the overlap in the sampling frames of the two projects.
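The percentages quoted here can be reproduced from the counts given in this data note. The short check below uses only figures stated in the text (the UKBB invitation and enrolment counts are the rounded values from the Introduction); the variable names are illustrative.

```python
# Counts taken from the text.
core_sample = 13_640          # women with a study child alive at one year
eligible = 12_127             # core, consenting to linkage, born 1938-1970
in_both = 866                 # matched to UKBB within the study sample
ukbb_invited = 9_200_000      # rounded figure
ukbb_enrolled = 503_000       # rounded figure

def pct(numerator: int, denominator: int) -> float:
    """Percentage rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)

overlap_rate = pct(in_both, eligible)                  # share of eligible ALSPAC women in UKBB
eligible_share = pct(eligible, core_sample)            # share of core sample eligible for UKBB
ukbb_general_rate = pct(ukbb_enrolled, ukbb_invited)   # general UKBB recruitment rate
alspac_only = eligible - in_both                       # women in ALSPAC only
```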

Comparison by UKBB enrolment status
The 866 women in both UKBB and ALSPAC (also referred to as 'in' UKBB, or Group B, for Both ALSPAC and UKBiobank) were compared to the 11,261 women only in ALSPAC (also referred to as 'not in' UKBB, 'out' of UKBB, or Group A, for ALSPAC only) across a range of measures. Note that the denominator may vary by variable due to missing data.
The women in both cohorts had higher levels of engagement in ALSPAC than women only in ALSPAC. Table 1 shows that women who were enrolled in ALSPAC and UKBB were more likely to have a partner also enrolled in ALSPAC, and nearly half (49.5%) of women enrolled in both cohorts completed 90% or more of the surveys compared to less than a fifth (17.5%) of mothers only in ALSPAC.
Participation in ALSPAC over time by UKBB enrolment status is shown in Figure 2. This shows the percentage of women in each group who returned questionnaires, from pregnancy up until the study children were aged twenty years. Both groups show a slow but steady decline in participation over time; however, participation declined more slowly among the women enrolled in both cohorts.
Women enrolled in both cohorts (Group B) were more likely to report co-habitation with their partners during the pregnancy than those only enrolled in ALSPAC (Group A) (Table 2). There was negligible difference between the groups in whether the women had a partner older than themselves. A lower proportion of women in Group B than in Group A were in the younger age brackets, with a correspondingly higher proportion in the older age brackets. There was little difference between the two groups in terms of ethnicity.
Pregnancy and birth characteristics (birth weight, gestation, parity) were similar for both groups of women (Table 3). However, women enrolled in both cohorts were more likely to have breastfed than those only enrolled in ALSPAC.
Women in Group B were less likely to report that they had ever been a regular smoker and were less likely to report they had smoked during the pregnancy (Table 4). With regard to alcohol consumption, women in Group B were more likely than those in Group A to report drinking 'one or more glasses a week' both before and after the pregnancy. Group B were also more likely to report 'never' consuming alcohol during the pregnancy.
There is negligible difference between the women in Group B (both ALSPAC and UKBiobank) and those in Group A with regard to their self-assessment of both their own and their partner's health (Table 5).
Compared to women in Group A, the women in Group B were of higher socio-economic position across a number of measures: they had fewer financial difficulties, more stable housing and higher rates of home ownership, and lived in more affluent neighbourhoods (Table 6). The women in Group B were also of higher occupational social classes xi , and more likely to be educated to degree level (Table 7).

Conclusions
It had long been suspected that many of the women and men in ALSPAC were also enrolled into UKBiobank. ALSPAC and UKBiobank are both large cohorts that are often used together in studies, so the level of independence between the two data sources needs to be understood. For the first time, using a novel encryption method, we have identified a cohort of women who are enrolled in both ALSPAC and UKBB (7.1% of the ALSPAC women who met the UKBB age eligibility criteria). It is not clear whether it will remain possible to identify members-in-common between two different research projects if they are held within two different TREs (Trusted Research Environments).
ALSPAC can be considered as broadly representative of the population at the time due to being a birth cohort with very basic eligibility criteria and the project's high recruitment rate. UKBB is a cohort with a significantly lower recruitment rate possibly indicating a self-selection bias in the study sample.
We compared those women who are in both cohorts, Group B, to the women in ALSPAC who were eligible but did not enrol into UKBiobank (Group A) and found that they differed on many characteristics. Notably the women in Group B participated more fully in ALSPAC and provided more data over time. Those in Group B were older and more advantaged across a range of important socio-economic measures such as housing and education, and they had some healthier behaviours such as smoking less, drinking less during pregnancy and being more likely to have breastfed their babies. However, they tended to drink more when not pregnant.
It is conjectured that the differences in the characteristics of the ALSPAC mothers enrolled in UKBB reflect how UKBB participants differ from the wider population, as represented by the ALSPAC mothers not in UKBB. This could be of assistance when interpreting results using UKBB data.
As ALSPAC and UKBB are often used together there is potentially the issue of the samples being non-independent, and we now know that there is an overlap of about 900 women. A key benefit of this work is the derivation of a flag which will allow researchers using a combination of the two cohorts to identify women who are in both, and to delete these women from one of the studies if desired.
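A minimal sketch of how the flag might be applied before pooling the two cohorts is given below. The record layout, identifiers and function name are hypothetical; the sketch simply drops flagged women from the ALSPAC side so that nobody is counted twice.

```python
def drop_flagged(records, flagged_ids):
    """Return the records whose 'id' is not in the set of flagged identifiers."""
    return [r for r in records if r["id"] not in flagged_ids]

# Toy ALSPAC records and a toy set of women flagged as also enrolled in UKBB.
alspac = [
    {"id": "A1", "ever_smoked": False},
    {"id": "A2", "ever_smoked": True},
    {"id": "A3", "ever_smoked": False},
]
in_both = {"A2"}

# Remove the overlap from one study only; the women remain in the other cohort's data.
alspac_for_pooling = drop_flagged(alspac, in_both)
```

Which study the overlapping women are removed from is an analytic choice; the flag also supports sensitivity analyses comparing results with and without them.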
As there is a very large amount of data on the approximately 900 women who are in both studies, there could be the potential for studies focused only on these women. The knowledge that there are disparate datasets holding detailed, and different, information on these individuals and that potentially these are able to be linked quickly and accurately is important. This could be leveraged to facilitate rapid research into evolving real-world situations, such as the COVID pandemic, with minimal extra resource or through sharing existing resource.

Data availability
UKBioBank details can be found at https://www.ukbiobank.ac.uk/
Further details on ALSPAC can be found at http://www.bristol.ac.uk/alspac/
ALSPAC data access is through a system of managed open access. The steps below highlight how to apply for access to the data included in this data note and all other ALSPAC data:
i. Please read the ALSPAC access policy (http://www.bristol.ac.uk/media-library/sites/alspac/documents/researchers/dataaccess/ALSPAC_Access_Policy.pdf) which describes the process of accessing the data and samples in detail, and outlines the costs associated with doing so.
ii. You may also find it useful to browse our fully searchable research proposals database (https://proposals.epi.bristol.ac.uk/?q=proposalSummaries), which lists all research projects that have been approved since April 2011.
iii. Please submit your research proposal (https://proposals.epi.bristol.ac.uk/) for consideration by the ALSPAC Executive Committee. You will receive a response within 10 working days to advise you whether your proposal has been approved.
The availability of our linked participant records is dependent on our ethical approvals and contractual arrangements with the NHS. If you are interested in using these data, then please contact the ALSPAC Data Linkage Team (alspac-linkage@bristol.ac.uk).

Author contributions
Data note conceptualisation and funding organised by NT, who also assisted with review and editing. Methodology, review and editing by AB. RT contributed to creating the dataset of matched participants. AT assisted with review and editing.
Visualisation, investigation, curation and writing of the data note by MM.