Estimating the overdispersion in COVID-19 transmission using outbreak sizes outside China

Background: A novel coronavirus disease (COVID-19) outbreak has now spread to a number of countries worldwide. While sustained transmission chains of human-to-human transmission suggest high basic reproduction number R 0, variation in the number of secondary transmissions (often characterised by so-called superspreading events) may be large as some countries have observed fewer local transmissions than others. Methods: We quantified individual-level variation in COVID-19 transmission by applying a mathematical model to observed outbreak sizes in affected countries. We extracted the number of imported and local cases in the affected countries from the World Health Organization situation report and applied a branching process model where the number of secondary transmissions was assumed to follow a negative-binomial distribution. Results: Our model suggested a high degree of individual-level variation in the transmission of COVID-19. Within the current consensus range of R 0 (2-3), the overdispersion parameter k of a negative-binomial distribution was estimated to be around 0.1 (median estimate 0.1; 95% CrI: 0.05-0.2 for R0 = 2.5), suggesting that 80% of secondary transmissions may have been caused by a small fraction of infectious individuals (~10%). A joint estimation yielded likely ranges for R 0 and k (95% CrIs: R 0 1.4-12; k 0.04-0.2); however, the upper bound of R 0 was not well informed by the model and data, which did not notably differ from that of the prior distribution. Conclusions: Our finding of a highly-overdispersed offspring distribution highlights a potential benefit to focusing intervention efforts on superspreading. As most infected individuals do not contribute to the expansion of an epidemic, the effective reproduction number could be drastically reduced by preventing relatively rare superspreading events.

R characterised by so-called superspreading events) may be large as some countries have observed fewer local transmissions than others.
We quantified individual-level variation in COVID-19 Methods: transmission by applying a mathematical model to observed outbreak sizes in affected countries. We extracted the number of imported and local cases in the affected countries from the World Health Organization situation report and applied a branching process model where the number of secondary transmissions was assumed to follow a negative-binomial distribution.
Our model suggested a high degree of individual-level variation in Results: the transmission of COVID-19. Within the current consensus range of R (2-3), the overdispersion parameter of a negative-binomial distribution k was estimated to be around 0.1 (median estimate 0.1; 95% CrI: 0.05-0.2 for R0 = 2.5), suggesting that 80% of secondary transmissions may have been caused by a small fraction of infectious individuals (~10%). A joint estimation yielded likely ranges for and (95% CrIs: 1.4-12; R k R k 0.04-0.2); however, the upper bound of was not well informed by the R model and data, which did not notably differ from that of the prior distribution.
Our finding of a highly-overdispersed offspring distribution Conclusions: highlights a potential benefit to focusing intervention efforts on superspreading. As most infected individuals do not contribute to the

Introduction
A novel coronavirus disease (COVID-19) outbreak, which is considered to be associated with a market in Wuhan, China, is now affecting a number of countries worldwide 1,2 . A substantial number of human-to-human transmission has occurred; the basic reproduction number R 0 (the average number of secondary transmissions caused by a single primary case in a fully susceptible population) has been estimated around 2-3 3-5 .
More than 100 countries have observed confirmed cases of COVID-19. A few countries have already been shifting from the containment phase to the mitigation phase 6,7 , with a substantial number of locally acquired cases (including those whose epidemiological link is untraceable). On the other hand, there are countries where a number of imported cases were ascertained but fewer secondary cases have been reported than might be expected with an estimated value of R 0 of 2-3.
This suggests that not all symptomatic cases cause a secondary transmission, which was also estimated to be the case for past coronavirus outbreaks (SARS/MERS) 8,9 . High individual-level variation (i.e. overdispersion) in the distribution of the number of secondary transmissions, which can lead to so-called superspreading events, is crucial information for epidemic control 9 . High variation in the distribution of secondary cases suggests that most cases do not contribute to the expansion of the epidemic, which means that containment efforts that can prevent superspreading events have a disproportionate effect on the reduction of transmission.
We estimated the level of overdispersion in COVID-19 transmission by using a mathematical model that is characterised by R 0 and the overdispersion parameter k of a negative binomial branching process. We fit this model to worldwide data on COVID-19 cases to estimate k given the reported range of R 0 and interpret this in the context of superspreading.

Data source
We extracted the number of imported/local cases in the affected countries (Table 1) from the WHO situation report 38 10 published on 27 February 2020, which was the latest report of the number of imported/local cases in each country (as of the situation report 39, WHO no longer reports the number of cases stratified by the site of infection). As in the WHO situation reports, we defined imported cases as those whose likely site of infection is outside the reporting country and local cases as those whose likely site of infection is inside the reporting country. Those whose site of infection was under investigation were excluded from the analysis (Estonia had no case with a known site of infection and was excluded). In Egypt and Iran, no imported cases have been confirmed, which cause the likelihood value to be zero; data in these two countries were excluded. To distinguish between countries with and without an ongoing outbreak, we extracted daily case counts from an online resource 11 and determined the dates of the latest case confirmation for each country (as of 27 February).

Model
Assuming that the offspring distributions (distribution of the number of secondary transmissions) for COVID-19 cases are identically-and independently-distributed negative-binomial distributions, we constructed the likelihood of observing the reported number of imported/local cases (outbreak size) of COVID-19 for each country. The probability mass function for the final cluster size resulting from s initial cases is, according to Blumberg et al. 12 , given by If the observed case counts are part of an ongoing outbreak in a country, cluster sizes may grow in the future. To address this issue, we adjusted the likelihood for those countries with ongoing outbreak by only using the condition that the final cluster size of such a country has to be larger than the currently observed number of cases. The corresponding likelihood function is ).

Statistical analysis
Varying the assumed R 0 between 0-5 (fixed at an evenlyspaced grid of values), we estimated the overdispersion parameter k using the likelihood function described above. We used the Markov-chain Monte Carlo (MCMC) method to provide 95% credible intervals (CrIs). The reciprocal of k was sampled where the prior distribution for the reciprocal was weakly-informed half-normal (HalfNormal(σ = 10)). We employed the adaptive hit-and-run Metropolis algorithm 13

Amendments from Version 2
A typo in the equation on p 80% was corrected; originally it was The typo was only present in the manuscript and did not affect the analysis or other parts of the manuscript.
Any further responses from the reviewers can be found at the end of the article and obtained 500 thinned samples from 10,000 MCMC steps (where the first half of the chain was discarded as burn-in). We confirmed that the final 500 samples have an effective sample size of at least 300, indicating sufficiently low auto-correlation.
We also performed a joint-estimation of R 0 and k by the MCMC method (with a weakly-informed normal prior N(μ = 3, σ = 5) for R 0 and the weakly-informed half-normal prior (HalfNormal(σ = 10)) for the reciprocal of k.
Statistical analysis was implemented in R-3.6.1 with a package {LaplacesDemon}-16.1.1. The reproducible code for this study is available on GitHub 14 .
Proportion responsible for 80% of secondary transmissions Using the estimated R 0 and k, we computed the estimated proportion of infected individuals responsible for 80% of the total secondary transmissions. Such proportion p 80% is given as  represents the probability mass of a negative-binomial distribution with a mean R 0 and an overdispersion parameter k. This calculation is eased by the following rearrangement: We computed p 80% for each MCMC (Markov-chain Monte Carlo) sample to yield median and 95% CrIs.

Model comparison with a Poisson branching process model
To test if our assumption of overdispersed offspring distribution better describes the data, we compared our negative-binomial branching process model with a Poisson branching process model, which assumes that the offspring distribution follows a Poisson distribution instead of negative-binomial. Since a negative-binomial distribution converges to a Poisson distribution as k → ∞, we approximately implemented a Poisson branching process model by fixing k of the negative-binomial model at 10 10 . We compared the two models by the widelyapplicable Bayesian information criterion (WBIC) 15 .

Simulation of the effect of underreporting
We used simulations to investigate potential bias caused by underreporting, one of the major limitations of the present study. Underreporting in some countries may be more frequent than others because of limited surveillance and/or testing capacity, causing heterogeneity in the number of cases that could have affected the estimated overdispersion. See Extended data (Supplementary materials) 16 for detailed methods.
The effect of a differential reproduction number for imported cases Due to interventions targeting travellers (e.g. screening and quarantine), the risk of transmission from imported cases may be lower than that from local cases. As part of the sensitivity analysis in Extended data, we estimated k assuming that the reproduction number of imported cases is smaller than that of local cases.

Results
Our estimation suggested substantial overdispersion (k << 1) in the offspring distribution of COVID-19 ( Figure 1A and Figure 2). Within the current consensus range of R 0 (2-3), k was estimated to be around 0.1 (median estimate 0.1; 95% CrI: 0.05-0.2 for R 0 = 2.5). For the R 0 values of 2-3, the estimates suggested that 80% of secondary transmissions may have been caused by a small fraction of infectious individuals (~10%; Figure 1B).
The result of the joint estimation suggested the likely bounds for R 0 and k (95% CrIs: R 0 1.4-12; k 0.04-0.2). The upper bound of R 0 did not notably differ from that of the prior distribution (=13.5), suggesting that our model and the data only informed the lower bound of R 0 . This was presumably because the contribution of R 0 to the shape of a negative-binomial distribution is marginal when k is small (Extended data, Figure S1) 16 . A scatterplot (Extended data, Figure S2) 16 exhibited a moderate correlation between R 0 and k (correlation coefficient -0.4).
Model comparison between negative-binomial and Poisson branching process models suggested that a negative-binomial model better describes the observed data; WBIC strongly supported the negative-binomial model with a difference of 11.0 ( Table 2). The simulation of the effect of underreporting suggested that possible underreporting is unlikely to cause underestimation of overdispersion parameter k (Extended data, Figure S3) 16 . A slight increase in the estimate of k was observed when the reproduction number for imported cases was assumed to be lower due to interventions (Extended data, Table S1).

Discussion
Our results suggested that the offspring distribution of COVID-19 is highly overdispersed. For the likely range of R 0 of 2-3, the overdispersion parameter k was estimated to be around 0.1, suggesting that the majority of secondary transmission may be caused by a very small fraction of individuals (80% of transmissions caused by ~10% of the total cases). These results are consistent with a number of observed superspreading events observed in the current COVID-19 outbreak 17 , and also in line with the estimates from the previous SARS/MERS outbreaks 8 .
The overdispersion parameter for the current COVID-19 outbreak has also been estimated by stochastic simulation 18 and from contact tracing data in Shenzhen, China 19 . The former study did not yield an interpretable estimate of k due to the limited data input. In the latter study, the estimates of R e (the effective reproduction number) and k were 0.4 (95% confidence interval: 0.3-0.5) and 0.58 (0.35-1.18), respectively, which did not agree with our findings. However, these estimates were obtained from pairs of cases with a clear epidemiological link and therefore may have been biased (downward for R 0 and upward for k) if superspreading events had been more likely to be missed during the contact tracing.
Although cluster size distributions based on a branching process model are useful in inference of the offspring distribution from limited data 12,20 , they are not directly applicable to an ongoing outbreak because the final cluster size may not yet have been observed. In our analysis, we adopted an alternative approach which accounts for possible future  growth of clusters to minimise the risk of underestimation. As of 27 February 2020, the majority of the countries in the dataset had ongoing outbreaks (36 out of 43 countries analysed, accounting for 2,788 cases of the total 2,816). Even though we used the case counts in those countries only as the lower bounds of future final cluster sizes, which might have only partially informed of the underlying branching process, our model yielded estimates with moderate uncertainty levels (at least sufficient to suggest that k may be below 1). Together with the previous finding suggesting that the overdispersion parameter is unlikely to be biased downwards 21 , we believe our analysis supports the possibility of highly-overdispersed transmission of COVID-19.
A number of limitations need to be noted in this study. We used the confirmed case counts reported to WHO and did not account for possible underreporting of cases. Heterogeneities between countries in surveillance and intervention capacities, which might also be contributing to the estimated overdispersion, were not considered (although we investigated such effects by simulations; see Extended data, Figure S3) 16 . Reported cases whose site of infection classified as unknown, which should in principle be counted as either imported or local cases, were excluded from analysis. Some cases with a known site of infection could also have been misclassified (e.g., cases with travel history may have been infected locally). The distinction between countries with and without ongoing outbreak (7 days without any new confirmation of cases) was arbitrary. However, we believe that our conclusion is robust because the distinction does not change with different thresholds (4-14 days), within which the serial interval of SARS-CoV-2 is likely to fall 22,23 .
Our finding of a highly-overdispersed offspring distribution suggests that there is benefit to focusing intervention efforts on superspreading. As most infected individuals do not contribute to the expansion of transmission, the effective reproduction number could be drastically reduced by preventing relatively rare superspreading events. Identifying characteristics of settings that could lead to superspreading events will play a key role in designing effective control strategies.

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
1.
written and the results are clearly presented. However, the data used in the study may have a few potential bias representing several alternative scenarios that I recommend the authors to explore: The number of imported cases is likely an underestimate of the true number of cases as screening of travelers is unlikely to reach high detection rate for SARS-CoV-2 . Certain studies estimated the reporting rate is only around 30-40%, even for countries with high surveillance intensity . This is likely different from the reporting rate of local cases (see point 3). I recommend the authors explore reporting rate of imported cases and local cases separately.
The over-dispersion estimated for SARS in the previous study is under un-controlled epidemic scenario . However, for imported cases detected through travel screening, certain control measures is likely in-place such as isolation/quarantine, which will reduce the effective reproduction number, thus in Figure 1, the effective R0 range for imported cases could extend to <1.
It's also quite likely that local transmission is heavily under-reported during February as well. A way to gauge this under-detection is to see when each country reported the first few deaths due to COVID-19. Assuming an infection fatality of 1% will suggest a few hundred cumulative infections about 2 weeks before the detection of death. The authors already listed a number of deaths at the same date of case reporting, I recommend the authors also reports the number of deaths 2-weeks later (or the delay from case detection to death that the authors finds appropriate) and comment on the possible rate of under reporting for local cases, and together with point 1, how it may affects the estimates of k.

If applicable, is the statistical analysis and its interpretation appropriate? Yes
Are all the source data underlying the results available to ensure full reproducibility? Yes Are the conclusions drawn adequately supported by the results? Thank you for your constructive suggestions. We have discussed potential biases as suggested and performed additional analysis where applicable. Please find our responses below.

The number of imported cases is likely an underestimate of the true number of cases as screening of travelers is unlikely to reach high detection rate for SARS-CoV-2. Certain studies estimated the reporting rate is only around 30-40%, even for countries with high surveillance intensity. This is likely different from the reporting rate of local cases (see point 3). I recommend the authors explore reporting rate of imported cases and local cases separately.
> We agree that the number of imported cases is likely to be under ascertained. In our sensitivity analysis, we explored how the estimated value of k may be biased with different probabilities of reporting. We found that the same level of underreporting for both imported and local cases tend to result in overestimation of k, while the estimate was hardly affected if the imported cases are fully reported but the local cases are underreported. As the scenario suggested by the reviewer may likely to lie in between these two extreme scenarios we explored, we expect that such scenario would lead to the overestimation of k (but not as much as our "equally underreported" scenario; Figure S3A).
2. The over-dispersion estimated for SARS in the previous study is under un-controlled epidemic scenario. However, for imported cases detected through travel screening, certain control measures is likely in-place such as isolation/quarantine, which will reduce the effective reproduction number, thus in Figure 1, the effective R0 range for imported cases could extend to <1.
> We have included the suggested scenario as part of our sensitivity analysis. Assuming that R0 for local cases is 2.5, we varied the effective reproduction number for the imported cases (0.5, 0.8 and 1.2) and estimated the value of k. We found that the estimates of k were larger than those in our baseline analysis for R0 = 2.5 if the assumed reproduction number for the imported cases (R ) was below 1 (k = 0.3 for R = 0.5; k = 0.2 for R = 0.8). For R = 1.2, the estimate was similar to our baseline analysis (k = 0.1). We have added the description of this additional analysis in the supplementary document. I   I  I  I 1.

It's also quite likely that local transmission is heavily under-reported during February as well. A way to gauge this under-detection is to see when each country reported the first
2. 3.

well. A way to gauge this under-detection is to see when each country reported the first few deaths due to COVID-19.
Assuming an infection fatality of 1% will suggest a few hundred cumulative infections about 2 weeks before the detection of death. The authors already listed a number of deaths at the same date of case reporting, I recommend the authors also reports the number of deaths 2-weeks later (or the delay from case detection to death that the authors finds appropriate) and comment on the possible rate of under reporting for local cases, and together with point 1, how it may affects the estimates of k.
> As the reviewer suggests, the number of deaths is a useful measure to assess underreporting. Assuming the average infection fatality 1% for the initial deaths may be subject to bias because the earliest cases may have specific age profiles that result in a different fatality. However, when averaged over the dataset, the overall case fatality may suggest the possible degree of underreporting in the dataset. As the mean lags of 8-13 days from case confirmation to death have been used for early outbreaks in existing studies [1,2], we referred to the WHO situation report 45 (5 March) and 52 (12 March), published on the 7 and 14 day from the situation report 38 we used in the analysis [3,4]. The total number of deaths in the countries included in our analysis was 168 and 1,065, respecctively. Given 2,815 total confirmed cases as of February 27 , these suggest ascertainment ratios of 16.8% and 2.6%, respectively (assuming the true infection fatality risk is 1%). However, these ratios may be underestimates because of the rapid growth in both cases and deaths. It was suggested that the lag distribution from confirmation to death has a large variation (coefficient of variation 50%-100% [1,2]), and early-reported deaths from the cases confirmed later than February 27 may have inflated the number of deaths in situation reports 45 and 52. In either case, we believe that the assumed reporting probability in our sensitivity analysis ( Figure S3C) was consistent overall with these observations. Although we did not include the above calculation in the manuscript because it is only a rough estimation in which we are not completely confident, we cited Niehus et al. to show that our assumed range of the reporting probability in the sensitivity analysis was plausible: (Supplementary document, Section 3) An existing study suggested 38% as an optimistic global estimate of the detection probability for imported cases from Wuhan, China, with a substantial variation between countries [1].

1.
When countries have mitigations in place including wearing masks, this lowers the probability of transmission events from the more casual contacts and shifts the likelihood to more crowded close-contact indoor poor air exchange environments where they were always risky but now represent the main situations that exceed the "probability gate" limited by mask-wearing and physical distancing elsewhere. That is, . there is a bias towards super-spreading events Also, when countries put mitigations in place this shifts the probability of transmission to be more with those infected individuals with higher viral load (not necessarily measured in NP swabs but viral load where it counts in the lungs bringing virus-laden mucus to the vocal folds and epiglottis) or higher droplet/aerosol output because those with moderate viral load have much lower transmission such as from wearing masks or maintaining physical distancing. That is, there is a bias towards super-spreader . individuals This latter point requires a more subtle and complex analysis because it also requires non-linear limits on super-spreading individuals and these exist in the form of 1) the dose-response curve that flattens towards 100% probability of infection (i.e. you cannot infect someone more than once --that is, you can't count your very high viral load as infecting them three times as much) and 2) the limited number of contacts an individual has during the contagious period. So as mitigations are removed, the super-spreading individuals infect at a relative lower proportional rate than moderate viral-load individuals --that is, if the moderate viral-load individuals doubled in their R value when taking off their masks, the super-spreading individuals would be less than doubled (i.e. they are saturating in both the dose-response and opportunities). This means that the estimated "k" value would be higher during the unmitigated exponential growth phase of disease transmission and this appears to be the case in China where most transmission was within families. After mitigation and the more restrictive the mitigation, the lower the "k" estimate would be because only those rarer individuals and events would occur and have one person infect many but have most of those infected not infect many others.
So while it is reasonable to conclude that super-spreading individuals and/or super-spreader events are driving the majority of transmissions after mitigations are in place, it would be incorrect to conclude that this was (as much of) the situation when there were few mitigations . If mitigations are generally followed, then this can result in and there was wider community spread outbreaks that quickly die off unless the super-spreading individuals or events were numerous enough to create chains (i.e. if the R value of a super-spreading individual was more than 10 and they represent 10% of people who are infected then the overall R > 1 would continue the spread).
It would also be good to be able to distinguish between super-spreader individuals vs. super-spreading events because the latter is more readily controlled via public policy by limiting crowds and requiring wearing masks. So far it appears that outbreaks are largely super-spreading events and while they could be caused from individuals with higher viral load there does not appear to be evidence of people wearing masks indoors maintaining social distance and having proper air exchange causing outbreaks.

No competing interests. Competing Interests:
Reader Comment 06 Jun 2020 , Aspect Managemebt Corp, Houston, USA Alex Cranberg These results and conclusions may be significantly affected by heterogeneity deriving from susceptibility