Revealing the extent of the first wave of the COVID-19 pandemic in Kenya based on serological and PCR-test data

Policymakers in Africa need robust estimates of the current and future spread of SARS-CoV-2. We used national surveillance PCR test, serological survey and mobility data to develop and fit a county-specific transmission model for Kenya up to the end of September 2020, which encompasses the first wave of SARS-CoV-2 transmission in the country. We estimate that the first wave of the SARS-CoV-2 pandemic peaked before the end of July 2020 in the major urban counties, with 30-50% of residents infected. Our analysis suggests, first, that the reported low COVID-19 disease burden in Kenya cannot be explained solely by limited spread of the virus, and second, that a 30-50% attack rate was not sufficient to avoid a further wave of transmission.

Keywords: Kenya, dynamic model, serology, PCR cases
This article is included in the KEMRI | Wellcome Trust gateway.

Introduction
The potential risk from severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) to Africa was identified early in the global pandemic 1 . As the epicenter of transmission moved from East Asia to West Asia and Europe and then to North America, there was speculation as to the likely impact of the pandemic on the African continent with its young populations, high infectious disease burden, undernutrition and fragile health infrastructure. However, as health systems and economies of high-income countries strained, the reported burden of COVID-19 cases and associated deaths in Africa remained low with the exception of South Africa and Northern Africa 2 . The question is whether this is the result of lower risk due to demographic structure (young age 3 ), either cross-reacting immunity (e.g. pre-existing SARS-CoV-2 cross-reactive T cells 4 ) or dampened immunological over-reaction 5 , a low reproduction number from rapidly imposed interventions (such as school closures and lockdowns 6 ), environmental conditions (e.g. temperature and humidity 7 ), or under-reporting. The reason this remains a conundrum is, at least in part, a paucity of good quality data to reveal the probable extent of SARS-CoV-2 spread in African populations.
Following the first confirmed coronavirus disease 2019 (COVID-19) case in Kenya on 13th March 2020, the Kenyan Government moved rapidly, closing international borders, schools, restaurants, bars and nightclubs, banning meetings and social gatherings, and imposing a dusk-to-dawn curfew and movement restrictions in the two major city counties, Nairobi and Mombasa 8 . The major concerns from unmitigated spread were a limited surge capacity of the Kenyan health system 9 and groups of the Kenyan population identified as potentially highly vulnerable to infection, due to socio-economic factors such as crowded households or lack of access to handwashing, and/or severe disease, due to epidemiological factors such as higher rates of obesity and hypertension 10 . Throughout the months of April, May and into June 2020 few people in Kenya were reported SARS-CoV-2 test positive by polymerase chain reaction (PCR), or severely diseased or dying with COVID-19 as the established cause 11 . There followed a relaxation of some measures in June and July including controlled opening of restaurants and places of worship and the removal of travel restrictions into and out of Mombasa and Nairobi counties. As of 30th September 2020, there were 45,795 laboratory-confirmed positive swab tests out of over 340,000 tests (about 13.5%), and 749 deaths with a positive test result in Kenya 11 . This should be compared with the 200,000-250,000 cases and 30,000-40,000 deaths attributable to SARS-CoV-2 for similar sized countries in Europe (France, Italy, UK) by the end of September 12 .
The reason for this apparently low level of COVID-19 disease in Kenya is unknown; one possible explanation is that SARS-CoV-2 had not widely spread among the Kenyan population by the end of September. However, two pieces of information suggest that SARS-CoV-2 had already spread extensively by the end of September. First, a regionally-stratified seroprevalence study of 3098 Kenyan blood donors sampled between May and June reported a national estimate of 4.3% (adjusted to reflect the population distribution by age, sex and region) 13 . Seroprevalence was higher in Nairobi (7.6%) and Mombasa (8.3%). These levels of seropositivity are comparable to those reported in May in the United Kingdom (UK) 14 , April/May in Spain 15 , and March/April in some United States (US) cities 16 , where high numbers of PCR-positive cases, hospitalizations and deaths have also been reported, in contrast to Kenya. Second, we noticed that test-positive PCR cases, and daily reported test-positive deaths, were declining first in Mombasa (from early July 2020) and then in Nairobi (from early August 2020); respectively Kenya's second and first largest cities. In Europe, declining case and mortality rates have been closely associated with non-pharmaceutical interventions (NPIs) 17 . However, in Kenya this went counter to evidence of increased mixing, and hence reproduction potential, arising from Google Mobility data for these cities, which showed a steady reversion in mobility towards pre-COVID-19 intervention levels since early April (Fig. S1). These observations, in turn, lead to the conclusion that a smaller than expected proportion of infected individuals have had severe disease, that there has been under-reporting of severe disease, or both.
To investigate these findings, we developed a simple SEIR (susceptible-exposed-infectious-recovered) compartmental mechanistic and data-driven transmission model for Kenya, which integrates three sources of longitudinal data: national time series polymerase chain reaction (PCR) tests, the Kenyan serological survey and Google mobility behavioural data. The overall modelling approach is similar to Flaxman et al. 17 ; that is, both approaches use time-to-event lag distributions and the daily incidence time series, and both generate the daily incidence time series using a simple deterministic transmission model with the key unknowns being the initial numbers of infected individuals and R(t). Where we differ in approach from Flaxman et al. 17 is that, instead of using reported test-positive deaths as the most reliable data for inferring underlying transmission patterns, we use a combination of PCR test-positive and serological data. The PCR test-positive data informs the model on the epidemic trajectory but does not account for likely under-detection of cases. This under-detection of cases is inferred from the proportion exposed to SARS-CoV-2 evidenced by the seroprevalence estimates, hence scaling the incidence estimation. Finally, the mobility data, as a proxy for the contact rate, determines the contribution of the intervention (which acts to alter contact patterns) relative to other factors that alter incidence and the effective reproduction number, the most important of which is the susceptible proportion of the population. Our aim is to derive a coherent picture of the SARS-CoV-2 epidemiology in Kenya in the first wave and reveal the historic and future patterns of spread across the country and by county. Reported deaths are not used as primary data for inference, but rather the trend in changing rates of reported deaths is used as a validation data set for model predictive accuracy (see supporting information for description of model validation). Reported deaths may be subject to substantial under-reporting, and we assume that the bias in under-reporting is consistent over time.

Amendments from Version 2
In this version we respond to two comments from the reviewer. First, our assertion in the abstract that a 30-50% attack rate in Kenya after the first wave would not be sufficient to prevent a further wave was not based on the expectation of a new immune-evading variant being introduced, but rather on a presumption of heterogeneity in population structure and mixing rates. We investigate this explanation in a subsequent publication. Second, the reviewer called into question the comparison with a very high seroprevalence estimate from a South American study; we agreed and hence removed the sentence.
Any further responses from the reviewers can be found at the end of the article.

Results
Underlying transmission rates in Mombasa and Nairobi during the first wave
As at 30th September, a substantial proportion of PCR positive tests have been samples from the capital Nairobi (25,182 positive tests), while Kenya's second largest city, Mombasa, has reported the next highest number of PCR positive tests (2,056). We infer that the underlying rate of new infections peaked on May 18th 2020 (CI May 16th - May 21st) in Mombasa and July 9th 2020 (CI July 7th - July 10th) in Nairobi, and subsequently declined from peak transmission (Figure 1 H, G). The model suggests that the PCR test and serology data can be explained by the initial presence of <200 infected individuals in both Mombasa and Nairobi on 21st February, three weeks before the first reported case in Kenya. Thereafter, growth of transmission was rapid in both counties. In early March, the reproductive ratio was estimated to be 1.94 (CI 1.89-1.98) and 2.00 (CI 1.97-2.02) in Mombasa and Nairobi, respectively, with associated doubling-time of 4.84 and 4.59 days, respectively. After March, the transmission curves flattened substantially. This change is consistent with the introduction of containment measures by the Kenyan government, and evidence of substantial reduction in mobility (see Google Mobility data Fig. S1). However, we should note that there was very limited PCR testing available in Kenya before April 2020, and our estimates of R(t) pre-April 2020 rely on the assumption that R(t) dropped by ~45% in late March, in parallel to the drop in mobility data (see Methods and supporting data).
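The quoted doubling times can be checked against the reproductive ratios under one assumption: that early growth follows a linearised SEIR model with the rates stated in the transmission model definition (σ = 1/3.1 and γ = 1/2.4 per day). The Python sketch below solves (r + σ)(r + γ) = σγR for the growth rate r. The function name is illustrative and is not taken from the authors' code.

```python
import math

SIGMA = 1 / 3.1  # incubation-to-infectious rate per day (from the model definition)
GAMMA = 1 / 2.4  # recovery rate per day (from the model definition)


def doubling_time(R, sigma=SIGMA, gamma=GAMMA):
    """Doubling time in days of early exponential growth in a linearised SEIR model.

    Assumes R > 1 and a fully susceptible at-risk population.
    """
    # Positive root of r^2 + (sigma + gamma) * r - sigma * gamma * (R - 1) = 0
    b = sigma + gamma
    r = (-b + math.sqrt(b * b + 4 * sigma * gamma * (R - 1))) / 2
    return math.log(2) / r


print(round(doubling_time(1.94), 2))  # Mombasa estimate of R
print(round(doubling_time(2.00), 2))  # Nairobi estimate of R
```

Under these assumptions the computed values agree with the 4.84 and 4.59 days reported above.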
From late April, through May and June, and into July the evidence suggests movement restrictions became steadily less effective. The waning effectiveness of movement restrictions results in an inferred increase in R(t) across Kenyan counties and an increased rate of epidemic growth (Figure 2). The increasing R(t) estimates are broadly in line with predicted trends from Google mobility data (supporting information), although it should be noted that the R(t) estimates exhibit secondary fluctuations around the increasing mobility trend (Figure 2). In Nairobi and Mombasa we predict that reduction in susceptibility of the population (Figure 1 C,D) caused the effective reproductive ratio (R eff ; the mean number of secondary cases accounting for reduced susceptibility) to drop significantly below the basic R value from June onwards (Figure 2). However, in other counties, where the epidemic did not establish itself as early as in Mombasa and Nairobi, and where a substantial majority of the population are likely to still be susceptible, we estimate that R(t) has rebounded to the levels inferred for the period before Kenyan public health measures (Figure 2).
By accounting for the delay of an average of 19 days between infection and death (supporting information for details on infection to death distribution) we find the transmission curve, estimated from PCR tests and serology, generates a good prediction of the observed trend in daily deaths in Nairobi and Mombasa (Figure 1 E, F). We did not use mortality data in transmission model inference, therefore the good fit to the observed trend in deaths with a PCR-confirmed test result represents an out-of-sample validation of the modelling 18 . Note, it is the distribution of deaths over time, rather than the absolute numbers, that we consider to be a good fit. In accord with observations, we estimate a peak of positive PCR test samples occurred at the end of July or early August in Nairobi and earlier, mid-June, in Mombasa. The lag between the transmission peak and the positive swab testing peak is explained by both the delay between infection and becoming detectable by PCR, and the period after an infected individual has ceased being actively infectious but remains detectable by PCR 19 (Figure 1 G,H and A,B). As of the end of September 2020 we estimate that about 35.4% (CI 29.0%-40.4%) of the Nairobi population, and 30.3% (CI 23.6%-36.7%) of the Mombasa population were seropositive for SARS-CoV-2 (Figure 1 C,D). This estimated level of seropositivity is substantially higher than has been estimated in some countries that have been hit hard by the pandemic 14-16 . However, it is in broad agreement with a study in Niger state, Nigeria, from June 2020 20 , as well as seropositivity rates reported from the hard-hit city of Manaus, Brazil, in May 2020 21 . Note that these estimates of seropositivity at the end of September assume both that waning seropositivity would not have had a significant effect on serological observations by late September, and furthermore that waning immunity leading to re-infection remained insignificant by late September.
SARS-CoV-2 attack rates in the first wave in Kenyan counties and the estimated crude infection-to-fatality ratio
Accounting for the sensitivity of the serological assay, and the delay between infection and seroconversion, we estimate that the actual exposure of the population to SARS-CoV-2 by September 30th was 43.3% (CI 35.3%-49.5%) in Nairobi and 37.6% (CI 29.2%-45.7%) in Mombasa (Figure 1 C,D). Such levels of population exposure are predicted to be associated with decreased rates of new cases due to reduced numbers of susceptible individuals in these urban populations, although this is also influenced by the estimated reproductive number and effective population size at risk of exposure (P eff ). The effective population size accounts for the impact of heterogeneity in the susceptibility, transmissibility and social interactivity in the population (supporting information for more details on effective population size in transmission modelling); for Nairobi it was inferred as 81.8% of actual population size (CI 66.7%-93.2%), for Mombasa 71.9% (CI 56.3%-86.5%). The effective population size estimates rest upon inferred variation in risk across the population. There remains a possibility of future increase in transmission if population mobility continues to rise, if population mixing patterns alter leading to changed risk heterogeneity or if immunity is short lived, leading to a rebound in reported cases. One or more of these factors could lead either to lengthening the tail after the first peak in cases/deaths, or even to a secondary increase in cases and/or deaths.
Figure 2. The posterior mean reproductive number for Nairobi (red curves), Mombasa (green curves), and the inter-quartile range (IQR) over mean reproductive number estimates for all other Kenyan counties (blue curve and shading). Shown are both the basic reproductive numbers (expected secondary infections in a susceptible population adjusted for mobility changes since the epidemic start; solid curves), and effective reproductive numbers (expected secondary infections accounting for depletion of susceptible prevalence in the population; dotted curves). The effective reproductive number varied significantly from county to county and is not shown except for Mombasa and Nairobi. Restrictions aimed at reducing mobility in risky transmission settings (black dotted lines) are labelled in groups. The chronologically ordered restrictions in each group are: 1) first PCR-confirmed case in Kenya, suspension of all public gatherings, closure of all schools and universities, and retroactive quarantine measures for recent returnees from foreign travel; 2) suspension of all inbound flights for foreign nationals, imposition of a national curfew, and regional lockdowns of Kilifi, Kwale, Mombasa and Nairobi counties; and 3) additional no-movement restriction of worst affected areas within Mombasa and Nairobi, and closure of the border with Somalia and Tanzania. There were two relaxations of measures in this time frame: the end of no-movement restriction to Mombasa and Nairobi, and the resumption of international air travel.
The inferred IFR crude values for both Nairobi (IFR crude = 0.019% (CI 0.014%-0.024%)) and Mombasa (IFR crude = 0.022% (CI 0.016%-0.027%)) are substantially lower than the age-adjusted IFR expected for Kenya under full ascertainment from the age-specific IFR estimates given by Verity et al. (IFR verity = 0.26% 22 ; and supporting information). This is a crude observational value for the infection to fatality ratio, since we do not currently have an estimate of the reporting bias of deaths of individuals infected with SARS-CoV-2. Therefore, our estimate of IFR crude potentially reflects lower detection in Kenya compared to China, as well as any lower mortality risk due to fewer comorbidities. We extended our model-based inference to each of the 47 counties in Kenya (see dataset S1 for parameter estimates, peak time estimates and IFR crude estimates for each county). We find that, in addition to the two main Kenyan city counties, more than 25-30% of the population in each of the semi-urban counties neighbouring Nairobi (Kiambu, Kajiado, and Machakos) had been infected. However, the infection rate is predicted to be either lower than 25% and/or subject to high uncertainty in other counties (with high uncertainty defined as a prediction standard error of > 10% of county population size; Figure 3).
Due to the lag between infection and the observability of the infected person (whether by swab PCR test, serology test, or death), we estimate that both daily PCR positive test detections and daily observed deaths attributed to COVID-19 across the two main cities, and semi-urban counties neighbouring Nairobi had a peak in early August 2020 (Figure 3 B,C). Hospitalisation rates are not available for all Kenyan hospitals. However, sentinel clinical surveillance of severe acute respiratory infection (SARI), with or without a PCR test for SARS-CoV-2, at 14 county hospitals suggests an increasing rate of adult admissions in June and July 2020 23 . However, SARI admissions were lower in the early phase of the Kenyan epidemic than observed counts from the same months in 2018 and 2019 23 and the apparent rise in SARI admissions could represent a reversion towards pre-COVID numbers; this observation underlines the difficulties in using hospital data to understand the penetration of SARS-CoV-2 in Kenya.

Conclusions and discussion
Our modelling analysis provides a coherent account of the SARS-CoV-2 pandemic in Kenya up to end September 2020. Limitations include lacking information on the PCR testing denominators for the full time frame, the limited serological survey and that we have applied a simple dynamic model. In mitigation similar results were obtained when excluding all negative tests, and the dynamic model is transparently a fit to the data where the availability of the latter is a key strength of our study.
Our analysis suggests that 30-50% of the urban population were already exposed by the end of September, and that the first wave of the Kenya epidemic peaked in the urban and semi-urban counties during a period of relatively few restrictions and little physical distancing. This level of exposure, however, was not sufficient to prevent a second wave which came shortly after the first (October to December 2020), which we assume to have resulted from heterogeneous spread of the virus, perhaps due to variation in population susceptibility, transmissibility or social interactivity. Whilst the full picture of the epidemiology in Kenya will not be established until cause-specific mortality data become available (e.g. from resumption of Demographic Surveillance System and verbal autopsy activities), our model, fitted to three sources of nationwide longitudinal data, suggests that the number of symptomatic COVID-19 cases reported and the mortality attributed to the SARS-CoV-2 epidemic are substantially lower in Kenya than in Europe and the USA at a similar stage of the epidemic. This would remain the case even if reported deaths accounted for just 1/10th of the true value. However, there is insufficient data for speculating on the degree of under-reporting, and previous estimates of 1 in 4 deaths occurring in hospital may not be generalizable to hospital access during the COVID-19 pandemic 24 .
Late 2020 saw the spread of COVID-19 to more rural areas of Kenya, with less infrastructure and access to public health facilities and a second wave of SARS-CoV-2. This second wave needs to be dissected and understood. Policy makers need to balance the direct and indirect health and socio-economic consequences of any control measures; a balance that becomes more precise as we develop a better understanding of SARS-CoV-2 dynamics in Kenya.

Transmission model definition
The dynamics of transmission in each Kenyan county were assumed to follow a SEIR transmission model with an effective population size parameter (P eff ) 25 . The SEIR model with effective population size is an extension of the homogeneous SEIR model 26 with the additional flexibility that P eff N out of a total population size N in each county is at risk of contracting SARS-CoV-2. P eff = 1 recovers the homogeneous SEIR model, whereas, P eff < 1, recovers the effect of underlying heterogeneity in the transmission potential and risk in the population of the county on the aggregate dynamics of epidemic. This aspect of heterogeneous models of transmission has been widely investigated, for example, in the context of comparing vaccination coverage thresholds for elimination between uniform and targeted vaccination policies 27 . In the context of the SARS-CoV-2 pandemic modelling literature, the possible role of population heterogeneity in decoupling estimates of R 0 from predictions of the "herd-immunity" threshold and final attack rate has again been identified 28,29 . In this study, rather than make strong assumptions about the mechanism of population heterogeneity, e.g. differential susceptibility, differential rates of social mobility etc., we have taken a phenomenological approach; the effect of heterogeneity in the population was encoded in the effective population parameter P eff , and this parameter was inferred jointly with R 0 . Our a priori belief was that the most probable value was P eff = 1. We assumed that P eff was constant over the period of inference.
The model dynamics for each Kenyan county were represented as a system of ordinary differential equations (1), with initial conditions specified at time 0 (the calendar date 21st Feb 2020; all rates are per day). The dynamic variables S(t), E(t), I(t), R(t) were the numbers of susceptibles-at-risk, exposed (but not yet infectious), infectious, and recovered individuals in the county. The full number of susceptibles in the county at any time was (1 − P eff )N + S(t). C(t) was the cumulative number of infected individuals in the county.
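The displayed equations of the system are not reproduced in this text. One system consistent with the definitions given here is sketched below as an assumption, not as the authors' exact formulation. In particular, the normalisation of the force of infection by P_eff N and the form of the initial conditions are assumed. Here σ and γ denote the incubation-to-infectious and recovery rates, R_t the instantaneous reproductive ratio (distinct from the recovered compartment R(t)), and E_0, I_0 the initial numbers exposed and infectious.

```latex
\begin{aligned}
\frac{dS}{dt} &= -\gamma R_t \, \frac{S(t)\, I(t)}{P_{\mathrm{eff}} N}, \\
\frac{dE}{dt} &= \gamma R_t \, \frac{S(t)\, I(t)}{P_{\mathrm{eff}} N} - \sigma E(t), \\
\frac{dI}{dt} &= \sigma E(t) - \gamma I(t), \\
\frac{dR}{dt} &= \gamma I(t), \\
\frac{dC}{dt} &= \gamma R_t \, \frac{S(t)\, I(t)}{P_{\mathrm{eff}} N},
\end{aligned}
\qquad
\begin{aligned}
S(0) &= P_{\mathrm{eff}} N - E_0 - I_0, \\
E(0) &= E_0, \quad I(0) = I_0, \\
R(0) &= 0, \quad C(0) = 0.
\end{aligned}
```

With this form, setting P_eff = 1 gives the homogeneous SEIR model, and the population outside the at-risk group, (1 − P_eff)N, stays susceptible throughout.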
The incubation-to-infectious rate was σ = 1/3.1 per day, and the recovery rate was γ = 1/2.4 per day, implying a mean generation time of 5.5 days (see Supporting information for a comparison to the generation distribution inferred by Ferretti et al. 30 ). The instantaneous reproductive ratio R t = R 0 β t decomposed into a basic reproductive ratio R 0 and an effective contact rate β t , where β t = 1 represents a pre-pandemic baseline contact rate in the population.
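To illustrate how the effective population size changes the aggregate dynamics, the Python sketch below integrates an SEIR model in which only P_eff N individuals are at risk. It is a minimal illustration under stated assumptions, not the authors' Julia implementation: the force of infection is assumed to be γ R_t S I / (P_eff N), integration uses a simple forward Euler scheme, and the function name, population size and initial conditions are hypothetical.

```python
SIGMA = 1 / 3.1  # incubation-to-infectious rate per day (from the text)
GAMMA = 1 / 2.4  # recovery rate per day (from the text)


def simulate_seir(N, p_eff, R0, beta, E0=10.0, I0=10.0, steps_per_day=20):
    """Return cumulative infections C at the end of each day.

    beta is a list of daily effective contact rates (piece-wise constant by day),
    so the instantaneous reproductive ratio on day n is R0 * beta[n].
    Assumed form: new infections occur at rate GAMMA * R_t * S * I / (p_eff * N).
    """
    at_risk = p_eff * N
    S, E, I, C = at_risk - E0 - I0, E0, I0, 0.0
    dt = 1.0 / steps_per_day
    cumulative = []
    for beta_n in beta:
        R_t = R0 * beta_n
        for _ in range(steps_per_day):
            new_infections = GAMMA * R_t * S * I / at_risk * dt
            new_infectious = SIGMA * E * dt
            recoveries = GAMMA * I * dt
            S -= new_infections
            E += new_infections - new_infectious
            I += new_infectious - recoveries
            C += new_infections
        cumulative.append(C)
    return cumulative


# Hypothetical county of one million people with no change in contact rates.
N = 1_000_000
homogeneous = simulate_seir(N, p_eff=1.0, R0=2.0, beta=[1.0] * 400)
heterogeneous = simulate_seir(N, p_eff=0.8, R0=2.0, beta=[1.0] * 400)
print("final attack rate, P_eff = 1.0:", homogeneous[-1] / N)
print("final attack rate, P_eff = 0.8:", heterogeneous[-1] / N)
```

In this sketch the same R 0 produces the same early growth rate in both runs, while the final attack rate scales with P_eff. This is the decoupling of R 0 from the final attack rate described above.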

Transmission model inference
We used a mixed Bayesian and maximum a-posteriori (MAP) approach to parameter inference for each of the 47 Kenyan counties, based on daily observations of positive and negative PCR and serology tests in each county. The likelihood of individuals being detectable on any given day was based on whether they had been infected before that day, and the number of days since their infection. The number of new infections on each day n was denoted ι n . For a given set of model parameters, ι n was generated for each day n by solving the ODE system (1). Given the daily numbers of new infections, the number of people in the county on each day n who are detectable by PCR testing, denoted (P + ) n , and serological testing, (S + ) n , were given by convolving the new infection time series with the probability of (respectively) being detectable by a PCR or serological test τ days after infection, Q PCR (τ) and Q sero (τ). The log-likelihood function for each county combined the terms ln f PCR ((ObsP + ) n |(P + ) n , θ OM ) and ln f sero ((ObsS + ) n |(S + ) n , θ OM ), which were, respectively, the log-probability of observing (ObsP + ) n PCR test-positives and (ObsS + ) n serological test positives on days n = 1,..., T given the model prediction of numbers of PCR and serological detectable people in the population, and the observation model parameters θ OM . Day n = 1 corresponded to the calendar date 21st February 2020, and day n = T = 223 corresponded to 30th September 2020.
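The convolution step described above can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, daily incidence is assumed to be the day-on-day difference of the cumulative infections C(t), and the detection curve in the example is made up and is not the Q PCR or Q sero used in the study.

```python
def daily_incidence(cumulative, c0=0.0):
    """New infections per day, assumed here to be iota_n = C(n) - C(n - 1)."""
    previous = [c0] + list(cumulative[:-1])
    return [c - p for c, p in zip(cumulative, previous)]


def detectable(new_infections, q):
    """Expected number detectable on each day n.

    Computes sum over tau of new_infections[n - tau] * q[tau], where q[tau] is the
    probability of being detectable tau days after infection.
    """
    out = []
    for n in range(len(new_infections)):
        total = 0.0
        for tau in range(min(n, len(q) - 1) + 1):
            total += new_infections[n - tau] * q[tau]
        out.append(total)
    return out


# Made-up detection curve: undetectable on the day of infection, then rising.
q_example = [0.0, 0.5, 1.0]
iota = daily_incidence([10.0, 10.0, 10.0, 10.0])  # ten infections on day 1 only
print(detectable(iota, q_example))
```

The same routine serves for both test types. Only the detection curve changes between PCR and serology.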
The underlying transmission prediction depended only on parameters relevant to infection (e.g. basic pre-measures reproductive ratio etc.); however, the statistical modelling of the observation of evidence of these infections varied by type of test and availability of negative PCR test data. Together these form a likelihood function, which integrates the different data sources, since they are all, ultimately, generated by the same underlying infection process. The three statistical models of observation data were:
• Serological tests: On each day that serological samples were collected, the log-probability of the observed number of positive tests (ln f sero ((ObsS + ) n |(S + ) n , θ OM )) was assumed to be that of a Beta-Binomial distribution with unbiased sampling of the underlying proportion of serologically detectable people in the county ((S + ) n /N). The extra dispersion compared to a Binomial sample was due to uncertainty in the underlying sensitivity of the serological assay (see supporting information in supporting data).
• PCR swab positive tests when no data on negative PCR tests was available: Negative PCR swab tests were not available in every county on every day of simulation. When negative swab tests were not available we assumed that the log-probability of the daily observed PCR test positives was from a Negative-binomial distribution, where the mean number of daily observed test positives, conditional on the model prediction of PCR-detectable people in the population, is based on sampling a fraction p test TR(n). Here p test was an observation parameter that was jointly inferred, and TR(n) was a normalized testing rate based on nationally reported data (see supporting information in supporting data). α was a clustering factor for negative-binomial sampling, jointly inferred with other model parameters.
• PCR swab positive tests when data on negative PCR tests was available: When both positive and negative PCR test data was available, we assumed that the fraction of positive samples reflected a biased observation of the underlying true fraction of PCR-detectable individuals in the population; for example, being infected with SARS-CoV-2 could be expected to influence the odds of someone seeking a PCR test. We assumed that the daily detection of PCR test positives could be modelled as samples from a Beta-Binomial distribution with two parameters to infer: 1) the bias of a PCR-detectable individual being PCR tested compared to a PCR-undetectable individual (χ), and 2) the effective sample size parameter (M PCR ). N PCR,n is the total number of PCR swab samples collected on day n and p n is the proportion of tests performed returning positive expected by the model, accounting for bias in the sampling regime. The bias parameter χ = 1 recovers an unbiased sample of PCR positives from the underlying population.
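The three observation models above can be sketched as log-probability functions in Python. The exact parameterisations are given only in the supporting information, so everything below is an assumption made for illustration: the negative binomial is written with mean μ and clustering factor α such that the variance is μ + αμ², the Beta-Binomial is written with mean p and effective sample size M (shape parameters pM and (1 − p)M), and the sampling bias χ is assumed to act on the odds of being tested. Function names are hypothetical.

```python
import math


def neg_binomial_logpmf(k, mean, alpha):
    """Log-probability of k counts; assumed variance is mean + alpha * mean**2."""
    r = 1.0 / alpha
    p = r / (r + mean)
    return (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
            + r * math.log(p) + k * math.log(1.0 - p))


def log_beta(a, b):
    """Logarithm of the Beta function."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_logpmf(k, n, p, M):
    """Log-probability of k positives out of n tests, with mean p and sample size M."""
    a, b = p * M, (1.0 - p) * M
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + log_beta(k + a, n - k + b) - log_beta(a, b)


def biased_positive_fraction(p, chi):
    """Expected positive fraction when detectable people are chi times as likely to be tested.

    Assumed odds-based form; chi = 1 returns the unbiased fraction p.
    """
    return chi * p / (chi * p + (1.0 - p))


# Hypothetical day: 2% of the county PCR-detectable, 500 swabs taken, 40 positive.
p_n = biased_positive_fraction(0.02, chi=4.0)
print(beta_binomial_logpmf(40, 500, p_n, M=50.0))
```

Writing both distributions in terms of a mean and a dispersion-like parameter keeps the model prediction (the mean) separate from the inferred overdispersion. That may differ from how the authors' code organises these terms.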
Supporting information gives further details on the data sources and the log-likelihood calculation including a full description of all observation model parameters and the functional forms and underlying evidence for Q PCR and Q sero . The data sources used were: The Kenya Ministry of Health National linelist, the Kenya Medical Research Institute Wellcome Trust Research Programme (KEMRI-WTRP) serological surveillance programme and Google mobility data 31 . The full Kenyan SARS-CoV-2 line list contains sensitive personal information that could potentially allow the identification of individual cases. The analysis performed in this study only required an aggregated dataset derived from the Kenyan linelist. Other data used in this paper was openly available. All data is available in the main text or as underlying data 32 .
We assumed that β t was piece-wise constant on days, and, therefore, could be reconstructed from daily effective contact rates (β n ) n=1,...,T . For any fixed estimate of the effective contact rate β t , we used Hamiltonian Markov-chain Monte Carlo (HMC) 33 to estimate the posterior distribution for the transmission model parameters; that is, the initial condition values (E 0 , I 0 ) and fixed parameters (P eff , R 0 ) jointly with the observation model parameters θ OM . Prior distributions for parameters were chosen for groups of counties (e.g. largely rural counties had different priors to major urban conurbations like Nairobi and Mombasa; see supporting information for further details). Starting from an initial estimate that β t followed daily Google mobility trends 31 for the whole period, we sequentially improved our β t estimate using the expectation-maximisation (EM) algorithm 34 . The E-step corresponded to posterior distribution estimation using HMC, and the M-step corresponded to optimising the daily effective contact rate estimates (β n ) n=41,...,T using the popular stochastic gradient descent algorithm ADAM 35 . The first 40 days of effective contact rate estimates (β n ) n=1,...,40 were assumed to be fixed to their Google estimate; this improved identifiability jointly with R 0 and captured the observed sharp drop in mobility in response to Kenyan public health measures following the first identified case on 13th March 2020. See supporting information for further details on the use of Google mobility data and the EM algorithm method used in this study.
After inference of transmission parameters, the model implied a prediction of the expected number of daily deaths due to COVID, E(X + ) n , based on an overall population infection-to-fatality ratio (IFR) and the delay distribution between infection and death, p ID . In this study, we assume that the IFR is constant for each county over the period of inference, which allows us to construct a Bayesian estimator of the crude IFR, IFR crude , by fitting to the observed daily numbers of test-positive deaths, (ObsX + ) n (see supporting information for details and background data informing p ID ). Because the observed test-positive deaths were not used in inferring model parameters, we treat the log-predictive density of deaths from the model as an out-of-sample validation metric for the model. However, we emphasise that the out-of-sample comparison is to the trend of daily deaths, because this is invariant to the IFR crude estimator, which is itself sensitive to under-reporting of COVID deaths. Supporting information gives full details on the Bayesian model validation used in this study.
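The remark that the trend of daily deaths is invariant to the IFR can be made concrete with a short Python sketch. It assumes the expected deaths on day n are the IFR multiplied by the convolution of daily new infections with the infection-to-death delay distribution p ID . The incidence series, the delay distribution and the function names are made up for illustration, and the two IFR values are merely of the order discussed in the text.

```python
def expected_deaths(new_infections, ifr, p_id):
    """Expected daily deaths: ifr * sum over tau of new_infections[n - tau] * p_id[tau]."""
    out = []
    for n in range(len(new_infections)):
        total = 0.0
        for tau in range(min(n, len(p_id) - 1) + 1):
            total += new_infections[n - tau] * p_id[tau]
        out.append(ifr * total)
    return out


def normalised(series):
    """Rescale a series to sum to one, leaving only its shape over time."""
    total = sum(series)
    return [x / total for x in series]


# Made-up incidence and delay distribution (probabilities of death tau days after infection).
iota = [0, 5, 20, 60, 100, 80, 40, 10, 0, 0]
p_id = [0.0, 0.1, 0.3, 0.4, 0.2]

low = expected_deaths(iota, 0.0002, p_id)   # IFR of the order of the crude estimates
high = expected_deaths(iota, 0.0026, p_id)  # IFR of the order of the age-adjusted value

# Absolute numbers differ by the ratio of the IFRs, but the time profile is identical.
print(max(abs(a - b) for a, b in zip(normalised(low), normalised(high))))
```

Because the IFR enters only as a constant multiplier, under-reporting that is constant over time changes the level of the death curve but not its shape, which is why the validation compares trends and not absolute counts.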
This study was approved by the Kenya Medical Research Institute Scientific and Ethics Review Unit (KEMRI-SERU) with approval numbers KEMRI/SERU/CGMR-C/203/4085 and KEMRI/SERU/CGMR-C/203/3426 for the modelling and serosurvey studies respectively.

Data availability
Underlying data
Zenodo: Revealing the extent of the first wave of the COVID-19 pandemic in Kenya based on serological and PCR-test data. https://doi.org/10.5281/zenodo.4705244 32
This project contains the following underlying data:
• Data S4. (The number of positive, and negative where available, PCR-confirmed swab tests for each county by date of sample collection (21st Feb to 30th September)).
• Data S5. (The number of positive and negative serological results for each county by date of sample collection (21st Feb to 6th August)). This is from the Kenyan Ministry of Health National linelist.
• Data S6. (The number of deaths with a PCR-confirmed swab test for each county by recorded date of death (21st Feb to 30th September)).
• Data S7. (Summary data of Kenyan epidemic, including reported total number of tests performed in Kenya).
• supp material.docx (A more detailed description of the data)

Software availability
The analysis code was written in the Julia language, version 1.4.
• The code base underlying the analysis is accessible at the open GitHub repository https://github.com/ojal/KenyaSerology. For the analysis presented here we developed a module in the Julia programming language called KenyaSerology. Tutorial notebooks for running KenyaSerology and analysing the underlying data are available in the repository.

September 2020, which encompasses the first wave of SARS-CoV-2 transmission in the country. Authors estimated that the first wave of the SARS-CoV-2 pandemic peaked before the end of July 2020 in the major urban counties, with 30-50% of residents infected.
This is an important study and likely has implications for other neighbouring countries in Africa as well. Data on COVID-19 from the African continent are very limited. I encourage indexing of this revised submission and I just had two minor comments:

○ Abstract final phrase - "further wave of transmission": do you mean specifically a further wave of transmission with an antigenically-different strain such as a new variant which can escape the population immunity that has built up? Or do you mean that the first wave was controlled before population immunity reached a herd immunity threshold? Even so, if the same strain circulates again, one wouldn't expect a large wave because of the existing immunity from the first wave. Waning immunity in the medium term could also play a role in allowing subsequent epidemics.

○ Conclusions - I don't find the Manaus estimate of 75% particularly compelling due to methodological issues in that study. There should be other locations with less extreme first-year serological data?

If applicable, is the statistical analysis and its interpretation appropriate? Yes
Are all the source data underlying the results available to ensure full reproducibility? Yes

Are the conclusions drawn adequately supported by the results? Yes
We thank the reviewer for the useful review and comments.
Point 1 "Abstract final phrase -"further wave of transmission" do you mean specifically a further wave of transmission with an antigenically-different strain such as a new variant which can escape the population immunity that has built up? Or you mean that the first wave was controlled before population immunity reached a herd immunity threshold? Even so, if the same strain circulates again, one wouldn't expect a large wave because of the existing immunity from the first wave. Waning immunity in medium-term could also play a role in allowing subsequent epidemics."

We were not, at this early stage in the pandemic, suggesting a further wave from a new variant. Instead, we inferred some heterogeneity in population susceptibility, transmissibility or social interactivity, encapsulated by the phenomenological term, the effective population size at risk of exposure (Peff). This population heterogeneity put a brake on virus spread in the first wave but made possible a second wave that moved into less infected sections of the population. This was unmeasured and not well understood at the time. However, in our subsequent paper (DOI: 10.1126/science.abk0414) we were able to explicitly account for this heterogeneity as differences in mobility of lower (high transmission in wave 1) and higher (low transmission in wave 1) socio-economic classes, particularly in the urban setting.
No change has been made to the manuscript.

We agree with the reviewer and have revised the section of text referring to the estimate of 75% (see appended). We do already make other comparisons with serosurveys from that period from other settings including Spain, England, United States and Niger.
The revised text is 'Our analysis suggests that 30-50% of the urban population were already exposed by the end of September, and that the first wave of the Kenya epidemic peaked in the urban and semi-urban counties during a period of relatively little restrictions or physical distancing. This level of exposure however was not sufficient to prevent a second wave which came shortly after the first (October to December 2020), which we assume to have resulted from heterogeneous spread of the virus, perhaps due to variation in population susceptibility, transmissibility or social interactivity.'

https://doi.org/10.21956/wellcomeopenres.18470.r46411

Is the work clearly and accurately presented and does it cite the current literature? Yes
Is the study design appropriate and is the work technically sound? Yes

If applicable, is the statistical analysis and its interpretation appropriate? Yes
Are all the source data underlying the results available to ensure full reproducibility? Yes
Are the conclusions drawn adequately supported by the results? Yes

understanding of the overall estimates of model predictions and Rt estimates by county would be incredibly valuable. For example, the authors provide the percentage infected by county in Figure 3; however, these values are difficult to interpret in context, particularly without seeing the data and estimates by county.
This is a very good point. In our original analysis we plotted model prediction intervals against actual data by county as part of model diagnostics. We have now improved our diagnostic visualisations to match the format of the main manuscript plots for Nairobi and Mombasa, and generated county-specific plots for model-based prediction/credible intervals for (i) PCR swab test positives, (ii) population exposure, (iii) deaths, (iv) R(t) against data (where available). All 188 plots (4 x 47 counties) are available in the data and code repository associated with this paper: https://github.com/ojal/KenyaSerology.
In Figure 1 (C,D) the model prediction CIs are incredibly narrow, which seems surprising. Does this occur across the country? Or is this mainly due to the higher quality data in both Nairobi and Mombasa?
This is correct. The reasonably high model certainty about seroprevalence in Nairobi and Mombasa was because of higher data quality in the main cities in Kenya. Other counties had much wider CIs for model-predicted seroprevalence. In Figure 3 we attempted to visualise this by candy-striping the county shading for counties with a posterior standard deviation in model prediction of population exposure of greater than 10%; that is, the counties where a >10% deviation from the posterior mean estimate of population exposure would not be highly unexpected. We have now added population exposure plots for every county, including credible intervals for the population seropositivity.
In Figure 2, it would be helpful (perhaps in the supplement or in this main figure) to provide context on when restrictions were lifted in addition to when they were put in place.
Over the time scale this paper is concerned with (February to October 2020) there were only two significant relaxations, however, this included lifting the movement restrictions on travel out of Nairobi and Mombasa (6th July 2020), and, therefore, was an oversight to not include in Figure

Further, it is interesting that the IQR for the Rt estimates early in the pandemic seem exceptionally narrow. Additional elaborations on these points (is it likely due to overfitting? Some factors associated with the model fitting? Etc.) should be included.
The early tight estimate for Rt reflected (i) fairly tight estimates that Rt ~ 1.1-1.2 in most counties in late April 2020, and (ii) our assumption that for the first 40 days of the simulation (20th Feb 2020 to 31st March 2020) Rt was proportional to Google data derived estimates of mixing in indoor settings outside the home. In the early stages of the epidemic in Kenya there was very limited testing capacity for detection of SARS-CoV-2 transmission in the community (first positive test result was on 12th March 2020). This means that we were forced to make an assumption about the effective relative contact rates, because we could not infer them from epidemiological data. The Google data suggested a fairly uniform 40-45% decrease in mixing in inside settings (e.g., the workplace, etc.) outside the home by mid-April 2020, by which time Rt ~ 1.1-1.2 in most counties. Because of our modelling assumption, this confidence in Rt in April-May 2020 was translated into confidence in Rt ~ 1.8-2.0 during an unobserved epidemic in February 2020.
We have added this sentence to the opening paragraph of the Underlying transmission rates in Mombasa and Nairobi during the first wave section: "However, we should note that there was very limited PCR testing available in Kenya before April 2020, and our estimates of R(t) pre-April 2020 rely on the assumption that R(t) dropped by ~45% in late March, in parallel to the drop in mobility data (see Methods and supporting data)."

In addition, the Rt estimates for Mombasa in August/September seem substantially different than the rest of the country. Can the authors provide additional context? And do they see similar patterns across the coastal counties during this time?
This is an interesting observation. Upon inspection, other counties in coastal province (Kilifi, Lamu, Tana River and Taita Taveta, but not the Tanzanian border county Kwale) also have a spike in Rt estimates in late August/early September 2020 (see county specific plots). A possibility is that this is connected to the relaxation of movement into Mombasa, however, there is a delay of greater than 4 weeks between that relaxation and the Rt increase.
The authors do an excellent job appropriately combining different data sets, which is well explained in the supplementary information. Some of these details would be incredibly helpful to move to the main text, in particular additional detail on how the authors treat the serological versus PCR data (and when there were both negative and positive PCR results) and how these two pieces of evidence are integrated.
Thank you very much. The aim was not to overwhelm a non-specialist audience, whilst providing full details within the supporting information. We have now added a further paragraph in the Methods section which we hope makes our methodology clearer to the reader.
"The underlying transmission prediction depended only on parameters relevant to infection (e.g. basic pre-measures reproductive ratio etc), however, the statistical modelling of the observation of evidence of these infections varied by type of test and availability of negative PCR test data. Together these form a likelihood function, which integrates the different data sources, since they are all, ultimately, generated by the same underlying infection process.
The three statistical models of observation data were:

○ Serological tests: On each day that serological samples were collected, the log-probability of the observed number of positive tests, $\ln f_{sero}((ObsS^+)_n \mid (S^+)_n, \theta_{OM})$, was assumed to be that of a Beta-Binomial distribution with unbiased sampling of the underlying proportion of serologically detectable people in the county, $(S^+)_n / N$. The extra dispersion compared to a Binomial sample was due to uncertainty in the underlying sensitivity of the serological assay (see supporting information in supporting data).

○ PCR swab positive tests when no data on negative PCR tests was available: Negative PCR swab tests were not available in every county on every day of simulation. When negative swab tests were not available we assumed that the log-probability of the daily observed PCR test positives was from a Negative-binomial distribution:

$$\mu_n = p_{test}\, TR(n)\, (P^+)_n, \qquad (ObsP^+)_n \sim \mathrm{NegBin}(\hat{\mu} = \mu_n,\ \hat{\alpha} = \alpha), \qquad (5)$$

where the mean number of daily observed test positives, conditional on the model prediction of PCR-detectable people in the population, is based on sampling a fraction $p_{test}\, TR(n)$. $p_{test}$ was an observation parameter that was jointly inferred during inference, and $TR(n)$ was a normalized testing rate based on nationally reported data (see supporting information in supporting data). $\alpha$ was a clustering factor for negative-binomial sampling, jointly inferred with other model parameters.

○ PCR swab positive tests when data on negative PCR tests was available: When both positive and negative PCR test data was available, we assumed that the fraction of positive samples reflected a biased observation of the underlying true fraction of PCR-detectable individuals in the population, e.g. being infected with SARS-CoV-2 could be expected to influence the odds of someone seeking a PCR test.

We assumed that the daily detection of PCR test positives could be modelled as samples from a Beta-Binomial distribution with two parameters to infer: 1) the bias of a PCR-detectable individual being PCR tested compared to a PCR-undetectable individual ($\chi$), and 2) the effective sample size parameter ($M_{PCR}$). $M_{PCR} \to \infty$ recovered a Binomial distribution for the number of positive PCR tests observed among the tests conducted that day, whereas $M_{PCR} < \infty$ allowed the model to infer much greater variance in the daily proportion of test positives than would be expected from a Binomial distribution. On days where negative swab tests were available, we connect the observable status of the epidemic to the data thus:

$$p_n = \frac{\chi\, (P^+)_n}{(\chi - 1)(P^+)_n + N}, \qquad (ObsP^+)_n \sim \mathrm{BetaBin}(\hat{N}_s = N_{PCR,n},\ \hat{p} = p_n,\ \hat{M} = M_{PCR}), \qquad (6)$$

where $N_{PCR,n}$ is the total number of PCR swab samples collected on day $n$ and $p_n$ is the proportion of tests performed returning positive expected by the model, accounting for bias in the sampling regime. The bias parameter $\chi = 1$ recovers an unbiased sample of PCR positives from the underlying population."

Finally, in the supplement the authors use the phrase 'Chinese epidemic', but a small point that it may be more appropriate to say 'epidemic in China'.
Noted, and we have changed our language in the supporting information.
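For readers who want to experiment with the biased-sampling PCR observation model quoted in the response above (equation 6), the following Python sketch computes the expected positive fraction and a Beta-Binomial log-probability. The parameterisation of the Beta-Binomial by a mean p and an effective sample size M, with shape parameters pM and (1 - p)M, is an assumption made for this example and may differ from the one used in KenyaSerology; the function names are hypothetical.

```python
import math

def biased_positive_fraction(chi, pcr_detectable, population):
    """Expected fraction of positive tests, p_n, when PCR-detectable people
    have chi times the odds of being tested (equation 6)."""
    return chi * pcr_detectable / ((chi - 1.0) * pcr_detectable + population)

def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_binomial_logpmf(k, n, p, m):
    """Log-probability of k positives among n tests.

    Assumed parameterisation: shape parameters a = p * m and b = (1 - p) * m,
    so that a large m approaches a Binomial(n, p) distribution.
    """
    a = p * m
    b = (1.0 - p) * m
    log_choose = (math.lgamma(n + 1) - math.lgamma(k + 1)
                  - math.lgamma(n - k + 1))
    return log_choose + log_beta(k + a, n - k + b) - log_beta(a, b)
```

With chi set to 1 the first function returns the population fraction of PCR-detectable people, matching the unbiased case noted in the quoted text.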