Study participants
Ethical permission for this study was granted by The East of England (Essex) Research Ethics Committee (REC reference 15/EE/0327). The SardiNIA longitudinal study recruited individuals from four towns in the Lanusei Valley in Sardinia, capturing 5 phases of sample and data collection26 over more than 20 years. Informed consent was obtained from all participants. We analysed serial samples from 385 individuals in the SardiNIA project.
Targeted sequencing and variant calling
Target enrichment of whole-blood DNA was performed using a custom RNA bait set (Agilent SureSelect ELID 3156971), designed complementary to 56 genes implicated in clonal haematopoiesis and haematological malignancies (Supplementary Table 1). Libraries were sequenced on Illumina HiSeq 2000 and variant calling was performed as we described previously5,50. In brief, somatic single-nucleotide variants and small indels were called using Shearwater (v.1.21.5), an algorithm designed to detect subclonal mutations in deep sequencing experiments51. Two additional variant-calling algorithms were applied to complement this approach: CaVEMan (v.1.11.2) for single-nucleotide variants, and Pindel (v.2.2) for small indels52,53. VAF correction was performed using an in-house script (https://github.com/cancerit/vafCorrect). Finally, allele counts at recurrent mutation hotspots were verified using an in-house script (https://github.com/cancerit/allelecount). Variants were filtered as we described previously5,50, but were not curated with regard to existing notions of oncogenicity, that is, all somatic variants passing quality filters were retained for analysis.
If a variant was identified in an individual at any time point in the study, this site was re-queried in the same individual at all other time points, using an in-house script (cgpVAF) to provide pileup (SNV) and Exonerate (indel) output (https://github.com/cancerit/vafCorrect). No additional filters were applied to these back-called variants.
Selection analyses
To quantify selection, we used the dNdScv algorithm, a maximum-likelihood implementation of dN/dS, which measures the ratio of non-synonymous (N) to synonymous (S) mutations, while controlling for gene sequence composition and variable substitution rates27. We first applied this method to the mutation calls from the longitudinal SardiNIA cohort in order to identify which genes are under positive selection in the context of clonal haematopoiesis. For this analysis, any mutation that was present in a single individual at multiple time points was counted only once. We also compared dN/dS ratios at the beginning and end of study, and found the latter to be higher, consistent with stronger cumulative effects of selection at older ages (Supplementary Note 2).
To characterize patterns of selection in AML and MDS, we applied dNdScv to two published data sets. The AML set was derived from 1,540 patients enrolled in three prospective trials of intensive therapy41. The MDS set included 738 patients with MDS or closely related neoplasms such as chronic myelomonocytic leukaemia42. Both used deep targeted sequencing of 111 cancer genes, which overlapped with 13 of the 17 genes of interest in our longitudinal clonal haematopoiesis study (PPM1D, CTCF, GNB1 and BRCC3 were not sequenced in the AML or MDS studies). We called and filtered variants in the 13 overlapping genes using the strategy described above (‘Targeted sequencing and variant calling’). Variants were identified in all 13 genes in both AML and MDS datasets (Supplementary Table 10). We calculated dN/dS values both at the level of individual genes, and at single-site level for hotspots, the latter using the sitednds function in the dNdScv R package.
Finally, we compared dN/dS ratios in shared and private branches of the three phylogenies, and found selection to be stronger in the former, consistent with the fact that mutations along shared branches were the ones driving subsequent clonal expansions (and therefore were more strongly selected) (Supplementary Note 2).
Modelling of clone trajectories through time
We use Bayesian hierarchical modelling to model clonal trajectories. Since we are unable to reliably phase different mutations into specific clones (Supplementary Note 3) and given that individual clonal haematopoiesis clones typically harbour a single driver mutation54, we assume that each mutation is heterozygous and its VAF is representative of the prevalence of a single clone. Accordingly, for a given individual j and mutation i, we have a mutant clone cij. We model the counts ({rm{count}}{{rm{s}}}_{{c}_{{ij}}})for ({c}_{{ij}}) at age (t) as a binomial distribution (Bin), such that ({rm{count}}{{rm{s}}}_{{c}_{{ij}}}(t)sim {rm{Bin}}({rm{co}}{{rm{v}}}_{{ij}}(t),{p}_{{ij}}(t))), with ({rm{co}}{{rm{v}}}_{{ij}}) as the coverage of this mutation at age (t) and ({p}_{{ij}}(t)sim {rm{Beta}}(alpha (t),beta )) as the expected proportion of mutant allele copies. As such, ({rm{count}}{{rm{s}}}_{{c}_{{ij}}}(t)sim {rm{BB}}({rm{co}}{{rm{v}}}_{{ij}}(t),alpha (t),beta )), where BB is the beta binomial distribution. Here, (beta sim N({mu }_{{rm{od}}},{sigma }_{{rm{od}}})) is the technical overdispersion parameterized as a normal distribution whose parameters (μod and σod, the mean and standard deviation, respectively) are estimated using replicate data (details below) and (alpha (t)=frac{beta q(t)}{1-q(t)}), where (q(t)={rm{ilogit}}(left({b}_{{rm{gen}}{{rm{e}}}_{i}}+{b}_{{rm{sit}}{{rm{e}}}_{i}}+{b}_{{c}_{{ij}}}right)times t+{u}_{{ij}})), with ilogit representing the inverse function. We use this parameterization to guarantee that (E[{rm{count}}{{rm{s}}}_{{c}_{{ij}}}]={p}_{{ij}}{rm{co}}{{rm{v}}}_{{ij}}). ({b}_{{rm{gen}}{{rm{e}}}_{i}}sim N(mathrm{0,0.1})) and ({b}_{{rm{sit}}{{rm{e}}}_{i}}sim N(mathrm{0,0.1})) are the gene and site growth effects for mutation (i), respectively. ({b}_{{c}_{{ij}}}sim N(mathrm{0,0.05})) is the growth effect associated exclusively with mutation (i) in individual (j)—that is, of mutant clone ({c}_{{ij}})—and ({u}_{{ij}}) is the offset accounting for the onset of different clones at different points in time. We also define the growth effect of ({c}_{{ij}}) as ({b}_{{rm{tota}}{{rm{l}}}_{{ij}}}=({b}_{{rm{gen}}{{rm{e}}}_{i}}+) ({b}_{{rm{sit}}{{rm{e}}}_{i}}+{b}_{{{uc}}_{{ij}}})). Throughout this work we will refer to ({b}_{{rm{gen}}{{rm{e}}}_{i}}+{b}_{{rm{sit}}{{rm{e}}}_{i}}) as the driver (growth) effect and to ({b}_{{c}_{{ij}}}) as the unknown-cause (growth) effect—the fraction of growth that is quantifiable but not explained by the driver mutation, and is attributable to other factors that may affect clonal growth, but differ between individuals, such as age, sex, interclonal competition and others.
Preventing identifiability issues and reducing uninformed estimates
To address possible identifiability issues in our model, when a gene has a single mutation (JAK2V617F and IDH2R140Q), the effect is considered to occur only at the site level. To avoid estimating the dynamics of a site from a single individual, we only model ({b}_{{rm{sit}}{{rm{e}}}_{i}}) when two or more individuals have a missense mutation on site (i), we refer to these sites as ‘recurrent sites’. Overall, we consider a total of 17 genes and 39 recurrent sites (Supplementary Table 5).
Estimating and validating growth parameters
Using the model described above, we use Markov chain Monte Carlo (MCMC) with a Hamiltonian Monte Carlo (HMC) sampler with 150–300 leapfrog steps as implemented in greta55. We sample for 5,000 iterations and discard the initial 2,500 to get estimates for the distribution of our parameters. As such, our estimates for each parameter are obtained considering their mean, median and 95% highest density posterior interval for 2,500 samples.
We assess the goodness-of-fit using the number of outliers detected in any trajectory and consider only trajectories with no outliers as being explained by our model and, as such, growing at constant rate. Outliers are assessed by calculating the tail probabilities of the counts under our model with a hard cut-off at 2.5%. Thus, ({P}_{{rm{outlier}}}=1,)if (P({rm{counts}}|{b}_{{rm{gen}}{{rm{e}}}_{i}},{b}_{{rm{sit}}{{rm{e}}}_{i}},{b}_{{c}_{{ij}}},{u}_{{ij}},t))( < 0.025{|P}({rm{counts}}|{b}_{{rm{gen}}{{rm{e}}}_{i}},{b}_{{rm{sit}}{{rm{e}}}_{i}},) ({b}_{{c}_{{ij}}},{u}_{{ij}},t)) ( > 0.975) and ({P}_{{rm{outlier}}}=0) otherwise. We validate this approach using Wright–Fisher simulations (Supplementary Methods). We additionally assess the predictive power of this model on an additional time point that was available for a subset of individuals and that was not used in the inference of parameters in our model (Supplementary Methods).
Estimating the technical overdispersion parameter
Technical VAF overdispersion used two distinct sets of data:
-
(1)
Horizon Tru-Q-1 was serially diluted to VAFs of 0.05, 0.02, 0.01, 0.005 and 0 using Horizon Tru-Q-0 (verified wild-type at these variant sites), then sequenced in duplicate or triplicate;
-
(2)
19 SardiNIA samples with mutations across 15 genes at a range of VAFs, were sequenced in triplicate.
Sample processing and analysis was performed as described in ‘Targeted sequencing and variant calling’ section. Replicate samples were picked from the same stock of DNA, then library preparation and sequencing steps were performed in parallel. Variant calls for these replicate samples are in Supplementary Table 12.
For (1), we model the distribution over the expected ({VAF}) as a beta distribution such that ({rm{VAF}}sim {rm{Beta}}(alpha ,beta )) and for (2) we adopt a model identical to the one described earlier in this section but use only gene growth effects (({rm{count}}{{rm{s}}}_{{c}_{{ij}}}(t)sim {rm{BB}}({rm{co}}{{rm{v}}}_{{ij}}(t),alpha (t),beta ),,)(alpha (t)=frac{beta q(t)}{1-q(t)}), (q(t)={rm{ilogit}}({b}_{{rm{gen}}{{rm{e}}}_{i}}times t+{u}_{{ij}}))). Here, we model (beta sim {rm{exp }}(r)) with (r) as a variable with no prior. We use MCMC with HMC sampling with 400–500 leapfrog steps as implemented in greta55 to estimate the mean and standard deviation of (beta ). For this estimate we use 1,000 samples from the posterior distribution.
Non-mutation factors and clonal growth rate
Inherited polymorphisms and JAK2-mutant clonal growth
The SardiNIA cohort had previously been characterized using two Illumina custom arrays: the Cardio-MetaboChip and the ImmunoChip26. Inherited genotypes at 12 loci previously associated with MPN risk were extracted for the 12 individuals with JAK2V617F mutation23,24. The relationship between each individual’s total number of inherited risk alleles and JAK2-mutant clonal growth rate was assessed by Pearson’s correlation. The 46/1 haplotype, which harbours 4 SNPs in complete linkage disequilibrium, was considered as a single risk allele.
Age, sex and smoking experience
We assess the association between unknown-cause growth and age through the calculation of a Pearson correlation considering all genes, both together and separately while controlling for multiple testing. We also assess the association between unknown-cause growth and sex and smoking history using a multivariate regression where unknown-cause growth is the dependent variable and sex and previous smoking experience are the covariates, while also controlling for age.
Determining the age at clone onset
We consider that HSC clones grow according to a Wright–Fisher model. According to this, for an initial population of HSC (n/2), we can consider two scenarios—that of a single growth process where the time at which the cell first starts growing ({t}_{0}) is described as ({t}_{0}=frac{{rm{log }}left(frac{1}{n}right)-u,}{{b}_{{rm{tota}}{rm{l}}}}), or that of a two-step growth process, where ({t}_{0}{rm{adjuste}}{rm{d}}={t}_{0}+frac{{rm{log }}(g/{b}_{{rm{tota}}{rm{l}}})}{{b}_{{rm{tota}}{rm{l}}}}-frac{1}{{b}_{{rm{total}}}}), where (g) is the number of generations per year. The latter scenario is the one chosen, due to its strong theoretical foundation and previous application to mathematical modelling of cancer evolution56. The two regimes that describe it are an initial stochastic growth regime and, once the clone reaches a sufficient population size, a deterministic growth regime. The adjustment made to ({t}_{0}) in ({t}_{0}{rm{adjust}}{rm{ed}}) can be interpreted as first estimating the age at which the clone reached the deterministic growth phase (({t}_{0}+frac{{rm{log }}(g/{b}_{{rm{tota}}{rm{l}}})}{{b}_{{rm{tota}}{rm{l}}}})) followed by subtracting the expected time for a clone to overcome its stochastic growth phase ((frac{1}{{b}_{{rm{total}}}})). For both (n) and (g) we use the estimates based on ref. 33: (n=mathrm{50,000}) and (g=2). We validate this approach using simulations (Supplementary Methods) and test the approach against our serial VAF data and verify that changes in (n) and (g) do not have a marked effect on age at onset estimates by considering a range of values ((n={mathrm{10,000};mathrm{50,000};)(mathrm{100,000};mathrm{200,000};mathrm{600,000}}) and (g={1;2;5;10;13;20})).
Cell colonies and phylogenetic trees
Sample preparation and sequencing
We selected 3 individuals with splicing gene mutations from the SardiNIA cohort for detailed blood phylogenetic analysis. Peripheral blood samples were drawn into Lithium-heparin tubes (vacutest, kima, 9 ml) and buccal samples were taken (Orangene DNA OG-250). Peripheral blood mononuclear cells were isolated from blood and plated at 50,000 cells per ml in MethoCult 4034 (Stemcell Technologies). After 14 days in culture, 96 single haematopoietic colonies were plucked per individual (total 288 colonies, each made up of hundreds to thousands of cells) and lysed in 50 μl of RLT lysis buffer (Qiagen).
Library preparation for WGS was performed using our low-input pipeline as previously described57,58. The 150 bp paired-end sequencing reads were generated using the NovaSeq 6000 platform to a mean sequencing depth of 15× per sample. Reads were aligned to the human reference genome (NCBI build37) using BWA-MEM.
Variant calling and filtering
Single-nucleotide variants (SNVs) and small indels were called against an unmatched reference genome using the in-house pipelines CaVEMan and Pindel, respectively52,53. ‘Normal contamination of tumour’ was set to 0.05; otherwise, standard settings and filters were applied. For all mutations passing quality filters in at least one sample, in-house software (cgpVAF, https://github.com/cancerit/vafCorrect) was used to produce matrices of variant and normal reads at each mutant site for all colonies from that individual. Copy-number aberrations and structural variants were identified using matched-normal ASCAT59 and BRASS (https://github.com/cancerit/BRASS). Low-coverage samples (mean <4×) were excluded from downstream analysis (n = 1, PD41305). Samples in which the peak density of somatic mutation VAFs was lower than expected for heterozygous changes (in practice VAF < 0.4) were suspected to be contaminated or mixed colonies, and were also excluded from further analysis (n = 3, PD41305; n = 9, PD41276; n = 3, PD34493).
Multiple post-hoc filtering steps were then applied to remove germline mutations, recurrent library prep or sequencing artefacts, and in vitro mutations, as described previously60 and detailed in custom R scripts (https://github.com/margaretefabre/Clonal_dynamics). Buccal samples were used as an additional filter; mutations were removed if the variant:normal count in the buccal sample was consistent with that expected for a germline mutation (0.5 for autosomes and 0.95 for X and Y chromosomes, binomial probability >0.01), and were retained if (1) the variant:normal count in the buccal sample was not consistent with germline (binomial probability <1 × 10−4) and (2) the mutation was not present in either of 2 large SNP databases (1000 Genomes Project and Kaviar) with MAF > 0.001.
Phylogenetic tree construction and assignment of mutations back to the tree
These steps were also performed as described previously60 and are detailed here: https://github.com/margaretefabre/Clonal_dynamics. In brief, samples were assigned a genotype for each mutation site passing filtering steps (‘present’ = ≥2 variant reads and probability > 0.05 that counts came from a somatic distribution; ‘absent’ = 0 variant reads and depth ≥6; ‘unknown’ = neither ‘absent’ nor ‘present’ criteria met). The proportion of ‘unknown’ genotypes going into tree-building was low: 1.5% (PD34493), 1.4% (PD41276) and 1.3% (PD41305; Extended Data Fig. 5a–c). A genotype matrix of shared mutations was fed into the MPBoot program61, which constructs a maximum parsimony phylogenetic tree with bootstrap approximation. The in-house-developed R package treemut (https://github.com/NickWilliamsSanger/treemut), which uses original count data and a maximum likelihood approach, was then used to assign mutations back to individual branches on the tree. Since individual edge length is influenced by the sensitivity of variant calling, lengths were scaled by 1/sensitivity, where sensitivity was calculated as the proportion of germline variants called (mean sensitivity: 85.4%, 87.0% and 83.5% for PD41305, PD41276 and PD34493, respectively). The approaches we used to validate the phylogenies, including comparison of MPBoot with an alternative phylogeny-inference algorithm, SCITE62, are detailed in Supplementary Methods.
Reconstruction of population trajectories
Phylogenies were made ultrametric (branch lengths normalized) using a bespoke R function (make.tree.ultrametric, https://github.com/margaretefabre/Clonal_dynamics/my_functions). With the root of the tree representing conception and the tips representing age at sampling, we scaled the age axis in two phases by: (1) assigning the first 55 mutations to the period between conception and birth (in light of evidence for this higher rate of mutation acquisition during this period36,60, and (2) scaling the axis linearly throughout life after birth (in light of evidence for a constant rate of mutation acquisition in HSCs during postnatal life32,33,34,35,36,37. We then analysed population size trajectories by fitting Bayesian nonparametric phylodynamic reconstructions (BNPR) as implemented in the phylodyn R package38,39 to clades – sets of samples in a phylogenetic tree sharing a most recent common ancestor (MRCA)—defined by either having a driver mutation on the MRCA or a MRCA branch length that spans more than 10% of the tree depth and with 5 tips or more. We also estimated the lower and upper bounds for age at onset of clonal expansion to be the limits of the branch containing the most recent common ancestor.
Detection of clonal deceleration
We detect deceleration using two different approaches—the ratio between expected and observed clone size using phylodynamic estimates and the ratios between observed and historical (from longitudinal data) and between late and expected (from phylogenetic data), respectively. To obtain the late growth rate we fit a biphasic log-linear model to our phylodynamic estimation of Neff—this enables us to obtain an early and a late growth rate (details in the Supplementary Methods).
Expected and observed clone size
The expected clone size is calculated by extrapolating the early growth rate until the age of sampling; having this we can calculate the ratio between expected and observed growth. The ratio between these quantities is then used as a measure of deceleration (details in the Supplementary Methods).
Growth ratio in phylogenetic data
The late growth rate is defined as the late growth rate defined in the previous section of the methods. The expected growth rate for the phylogenies is calculated as the growth coefficient for a sigmoidal regression that assumes a population size of 200,000 HSC as the carrying capacity. We then use the ratio between these quantities as a measure of deceleration (1 implies no deceleration; <1 implies deceleration).
Growth ratio in longitudinal data
The observed growth rate is defined as the growth rate inferred directly from the data. The minimal historical growth is the growth rate estimate obtained by restricting clone initiation to a time after conception (age at at onset > −1).
Clonal haematopoiesis dynamics and malignant progression
To calculate the association between clonal haematopoiesis dynamics and AML we used the risk coefficients from our previous work in predicting the onset of AML5, which were calculated by fitting a Cox-proportional hazards model that calculated the risk of AML onset associated with each gene (agnostic of clone size) while controlling for age, sex and cohort, and estimate the coefficient of correlation between the expected value of the annual growth for the posterior distribution of each gene (considering gene, site and unknown-cause effects) and the AML progression risk.
The association between clonal haematopoiesis dynamics and selection in MDS and AML use the dN/dS values calculated with dNdScv as previously described in the methods, using two distinct cohorts from previous studies41,42. dN/dS values were calculated for all hotspots and their coefficient of correlation with the expected value of the annual growth for the posterior distribution of each hotspot (also considering gene, site and unknown-cause effects) was calculated.
Statistical analyses
All statistical analyses were conducted using the R software63 – MCMC models were fitted using greta55 and hypothesis testing, generalized linear models and maximum likelihood fits were performed in base R.
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this paper.