A Malvaceae mystery: A mallow maelstrom of genome multiplications and maybe misleading methods?

Previous research suggests that Gossypium has undergone a 5to 6-fold multiplication following its divergence from Theobroma. However, the number of events, or where they occurred in the Malvaceae phylogeny remains unknown. We analyzed transcriptomic and genomic data from representatives of eight of the nine Malvaceae subfamilies. Phylogenetic analysis of nuclear data placed Dombeya (Dombeyoideae) as sister to the rest of Malvadendrina clade, but the plastid DNA tree strongly supported Durio (Helicteroideae) in this position. Intraspecific Ks plots indicated that all sampled taxa, except Theobroma (Byttnerioideae), Corchorus (Grewioideae), and Dombeya (Dombeyoideae), have experienced whole genome multiplications (WGMs). Quartet analysis suggested WGMs were shared by Malvoideae-Bombacoideae and SterculioideaeTilioideae, but did not resolve whether these are shared with each other or Helicteroideae (Durio). Gene tree reconciliation and Bayesian concordance analysis suggested a complex history. Alternative hypotheses are suggested, each involving two independent autotetraploid and one allopolyploid event. They differ in that one entails an allopolyploid origin for the Durio lineage, whereas the other invokes an allopolyploid origin for Malvoideae-Bombacoideae. We highlight the need for more genomic information in the Malvaceae and improved methods to resolve complex evolutionary histories that may include allopolyploidy, incomplete lineage sorting, and variable rates of gene and genome evolution. Disciplines Ecology and Evolutionary Biology | Genetics and Genomics | Plant Breeding and Genetics Comments This is the peer reviewed version of the following article: Conover, Justin L., Nisa Karimi, Noah Stenz, Cécile Ané, Corrinne E. Grover, C. Skema, Jennifer A. Tate et al. "A Malvaceae mystery: A mallow maelstrom of genome multiplications and maybe misleading methods?." Journal of Integrative Plant Biology (2018), which has been published in final form at doi: 10.1111/jipb.12746. This article may be used for non-commercial purposes in accordance with Wiley Terms and Conditions for Use of Self-Archived Versions. Authors Justin L. Conover, Nisa Karimi, Noah Stenz, Cécile Ané, Corrinne E. Grover, C. Skema, Jennifer A. Tate, Kirsten Wolff, Samuel A. Logan, Jonathan F. Wendel, and David A. Baum This article is available at Iowa State University Digital Repository: https://lib.dr.iastate.edu/eeob_ag_pubs/322


INTRODUCTION
An exciting revelation from the genomics revolution is the large number of ancient whole genome multiplication (WGM) events that punctuate angiosperm evolution, with many plant lineages experiencing multiple WGMs over their histories (Adams and Wendel 2005;Cui et al. 2006; Van de Peer et al. 2009;Wood et al. 2009;Jiao et al. 2011;Bowers et al. 2013;Soltis et al. 2015; Van de Peer et al. 2017). Polyploidy is frequently considered a source of evolutionary innovation (Stebbins 1947;Grant 1971) leading to increased species diversification and phenotypic innovation (Soltis and Soltis 2016). Thus, WGMs just prior to the radiation of such families as the Asteraceae, Brassicaceae, Fabaceae, Poaceae, and Solanaceae have been assigned a causal role in those clade's species-richness (Paterson et al. 2004;Soltis et al. 2009;Cannon et al. 2015).
Despite the profound importance of WGMs in plant evolution, the identification and characterization of past events is often challenging. In particular, mechanisms such as rapid fractionation and rediploidization tend to remove evidence of the WGMs that precipitated them (Wendel 2015;Cheng et al. 2018). Maize, for example, was classically considered a pure diploid, but genomic research has since reclassified maize as a stabilized paleo-allotetraploid that rapidly jettisoned much of its duplicated genome (Gaut and Doebley 1997;Schnable et al. 2011;Zhao et al. 2017).
Multiple methods have been developed to infer ancient WGMs, each leveraging different, though sometimes overlapping, lines of evidence. Early inferences were based on replicated intragenomic synteny, such as seen in maize (Wendel et al.1986;Helentjaris et al.1988;Gaut and Doebley 1997) and Arabidopsis (Vision et al. 2000). Soon after, synonymous substitution "peaks" in EST distance data among paralogs within a genome were used to infer WGMs (Schlueter et al. 2004). Since then, methods including gene tree/species tree reconciliation (Ness et al. 2011), collinearity of full genome sequences (Tang et al. 2010), and statistical methods based on gene counts (Rabier et al. 2013) have been combined to uncover many WGM events, some of them very old.
Genomic evidence for WGM in the Malvaceae was first described in the publication of the Gossypium raimondii genome, where collinearity data supported an ancient 5-to 6-fold WGM sometime after its divergence from chocolate (Theobroma cacao), estimated at approximately 90 million years ago (Paterson et al. 2012). Subsequent analyses (using various lines of evidence) have attempted to infer whether this event consisted of two successive whole genome duplication/triplication events, or a single deca-or dodecaploidization event (Wang et al. 2016). In addition, studies have investigated if any other Malvaceae taxa show evidence of the same event(s), namely Durio zibethinus (Teh et al. 2017) and Firmiana danxiaensis . The draft genome of D. zibethinus (Teh et al. 2017) suggested a shared WGM history with G. raimondii; however, while the 28 chromosomes of D. zibethinus show remarkable (hexaploid, 3x) synteny with respect to the 10 chromosomes of T. cacao, this inference is insufficient to explain the decaploid (5x) or dodecaploid (6x) multiplication history of G. raimondii relative to Theobroma.
In this study, we used comparative genomics to further explore WGM(s) in the Malvaceae.
The Malvaceae includes nine subfamilies (Bombacoideae, Brownlowioideae, Byttnerioideae, Dombeyoideae Grewioideae, Helicteroideae, Malvoideae, Sterculioideae, and Tilioideae), several of which were at one time treated at the family rank (Alverson et al.1999;Bayer et al.1999). Phylogenetic analyses have consistently supported a sister relationship between Bombacoideae and Malvoideae, in a clade called Malvatheca (Baum et al. 1998;Alverson et al. 1999;Bayer et al.1999;Baum et al. 2004). Likewise, all subfamilies except Grewioideae and Byttnerioideae, the latter containing Theobroma, have been grouped into a large clade, Malvadendrina, which is supported by a unique and unreversed 21-bp deletion in the plastid-encoded ndhF gene . Relationships between Malvatheca and the other five subfamilies of Malvadendrina remain uncertain (Nyffeler et al. 2005).
Here, we employed transcriptome and genome sequences for representatives of eight subfamilies (all but Brownlowioideae) to investigate the phylogeny and timing of WGM events.
We analyzed distributions of synonymous divergence (Ks) to detect evidence of ancient WGMs, and inferred orthologous nuclear gene trees to test alternative possibilities for the timing and placement of WGMs within the family. Collectively, our analyses suggest a complicated history and considerable residual uncertainty. This study highlights the challenges of inferring WGM events in the face of reticulation and other phenomena that impact species-tree inference.
Nonetheless, we propose alternative hypotheses for the history of the Malvaceae and suggest how these might ultimately be distinguished through additional sampling.

Novel transcriptomic data sets
To complement already published genome and transcriptome data (see Methods and Supplemental Table 1 for summary of taxa sampled and references), we generated transcriptomes for Tilia cordata (Tilioideae), Adansonia digitata (Bombacoideae), and Bombax ceiba (Bombacoideae), although the Bombax transcriptome was superseded by the recent publication of a genome sequence for Bombax ceiba (Gao et al. 2018 (Table S1). For the sake of readability, we will generally refer to each taxon by generic name.

Malvaceae phylogeny inferred by plastid genes
For eight taxa, genomic DNA-based plastomes were available, either in published genome sequences or from nuclear targeted sequence capture data sets (Adansonia and Bombax). For three taxa (Dombeya, Heritieria, and Corchorus), DNA sequences were not available, so we instead mapped RNAseq reads to the published Theobroma plastid genome sequence. We aligned 67 plastid-encoded genes that were not located in the inverted repeat regions and were present in all taxa sampled in this study (see Methods). The tree was rooted with the published plastid sequence of Arabidopsis thaliana.
As genes derived from RNAseq may be affected by RNA editing, we were concerned that mixing transcriptomic and genomic data could create artefactual synapomorphies. To investigate this possibility, we analyzed three different datasets: genomic sequences only ( Figure 1A), genomic sequences plus Dombeya ( Figure 1B), or all sequences combined ( Figure   1C). The first two datasets yielded trees where all internal nodes had high bootstrap support A clade composed of Tilia (Tilioideae) and Firmiana+Heritiera (Sterculioideae) was sister to the well-supported Malvatheca clade. Of note, the internodes near the base of Malvadendrina are extremely short, suggesting a rapid radiation of the six subfamilies and the potential for a large amount of incongruence due to incomplete lineage sorting. Combined with other source of phylogenetic noise, such as homoplasy, gene conversion, and orthology-paralogy issues, these short branches help to explain prior difficulties in resolving subfamilial relationships.
To minimize dependence on transcriptomic datasets, and because the topology of the three trees did not change between datasets, we pruned Corchorus and Heritiera and used the resulting tree for downstream analyses. This pruning is further justified because the phylogenetic placement of Corchorus (Grewioideae, outside of Malvadendrina; Baum et al. 1998;Alverson et al. 1999;Bayer et al.1999) and Heritiera (in the same subfamily as Firmiana [Sterculioideae]; Wilkie et al. 2006) are uncontroversial.

Ks histograms identify subfamilies with a history of WGM
If a species were to have undergone a WGM event in its past, a large proportion of the paralogs would originate from this event. Therefore, when the synonymous rates of divergence (Ks) between paralogs is plotted against frequency, a WGM event may be evident as a "peak" in this distribution. The position of this peak along the Ks axis provides a proxy for divergence time. In autopolyploids this divergence time coincides with an estimate of when autopolyploidy occurred.
In allopolyploidy, this peak coincides with the time of divergence of the progenitor diploids that whereas the two members of Malvoideae (Gossypium and Hibiscus), each had two peaks at ~0.1/0.5 and ~0.3/0.6, respectively. We interpret the younger peak in Gossypium as part of the background gene duplication rate, most likely due to tandemly duplicated genes that retain high sequence similarity to one another, via gene conversion (Panchy et al. 2016;Train et al. 2017).
The younger peak in Hibiscus, in contrast, is likely the result of a recent WGM unique to this lineage, as previous work suggests this was an independent event with respect to Gossypium (Kim et al. 2016;Teh et al. 2017). Furthermore, we established Hibiscus to have twice as many gene models as other taxa sampled (Table S1).
Allowing for differences in rates of molecular evolution across lineages, which can be considerable among Malvaceae subfamilies (Baum et al. 2004), the pattern could be suggestive of a single shared WGM in all Malvadendrina subfamilies except Dombeyoideae (Kim et al. 2016;Teh et al. 2017). The peaks of these distributions closely coincide with Ks divergences associated with the divergence of Theobroma, Durio, Gossypium, and Bombax (see Methods), indicating that the WGM event(s) we detected occurred early in the group's radiation. However, the subfamilies that have experienced a WGM in Malvaceae are not monophyletic on the plastid tree, which places Durio rather than Dombeya as sister to the rest of Malvdendrina. This indicates either that WGM events occurred independently in Durio and other Malvadendrina, or there was deep allopolyploidy, or that the plastid DNA has a discordant evolutionary history (e.g., due to reticulation or incomplete lineage sorting).

Identification of shared whole genome multiplications using gene quartets
To help assess whether the Ks peaks represent the same WGM event shared between pairs of species, we created gene trees consisting of two paralogs with a Ks distance of 0.2-0.6 from each species (i.e. four genes total), tabulated whether inferred tree topologies were consistent with either a shared duplication (i.e., conspecific paralogs not sister to each other) or separate duplications (i.e., conspecific paralogs sister to each other).
The frequency of trees for each taxon pair suggestive of separate duplication events varied from well below to well above the 33% expected for random trees (Table 1). Evidence for shared duplications between Hibiscus and any other taxon is the least supported. This is likely because Hibiscus has experienced an independent WGM since its divergence from Gossypium (Kim et al. 2016) such that paralogs from this recent WGM event obscure evidence of earlier events shared with other taxa. With this exception, the gene topologies consistently support shared duplications for Gossypium and Adansonia-Bombax (Bombacoideae-Malvoideae) and for Firmiana+Heritiera+Tilia (Sterculioideae-Tilioideae). The signal for a duplication shared between these two groups and Durio (Helicteroideae) is equivocal, with the number of genes supporting separate or shared duplications being close to the null expectation.

Multi-labeled trees (MUL-trees) reveal hybridization and allopolyploidy events
We used GRAMPA (Gregg et al. 2017) to place WGM event(s) on the Malvaceae phylogeny, inferred from both the nuclear and plastid genes, to determine whether these were allo-or autopolyploidy events. GRAMPA uses a parsimony approach to reconcile multiply-labeled gene trees (which include paralogs) with various species trees. The method searches for multiplylabeled trees that minimizes the sum of the implied number of gene duplications and gene losses; i.e., the reconciliation score. The MUL-trees derived when using the plastid phylogeny as the species tree had lower reconciliation scores and were used instead of the nuclear topology for downstream analysis (although each yielded similar conclusions).
Using the 12,426 gene trees extracted from either the transcriptomic or genomic datasets, GRAMPA identified 53 MUL-trees (File S1) that had a lower reconciliation score, across all input gene trees, than did the single-labeled species tree ( Figure 3). The twenty topologies with the lowest reconciliation scores suggested that an allopolyploidization event generated either a Malvatheca+Sterculioideae+Tilioideae clade (six MUL-trees), Malvatheca alone (seven MULtrees), or Malvoideae alone (seven MUL-trees). For each of these possible allopolyploidization scenarios there is conflict regarding the parents involved.
One possible confounding factor in these analyses is the high rate of implied "deletion," which includes a lack of expression in the species represented only by transcriptomic data. Therefore, we repeated the analyses with only those taxa that had published genome sequences (Theobroma, Gossypium, Bombax, and Durio), plus our de novo transcriptome assembly of Dombeya and Firmiana. To ensure proper rooting of the gene trees, we also included the published genome sequence of Carica papaya L. (Brassicacee), which is not reported to have undergone any independent WGM events (Ming et al. 2008).
To be conservative, we restricted our dataset to only include those 1,032 gene trees in which all internal nodes had at least 90% bootstrap support (see Materials and Methods).
GRAMPA identified seven MUL-tree topologies that fit the data better than a single-labeled species tree matching the optimal plastid tree (File S2). The three MUL-tree topologies with the lowest reconciliation scores suggest that Malvatheca (Gossypium and Bombax) arose via an allopolyploidization event between an ancestor of Malvatheca and either an extinct lineage sister to Malvadendrina, or to Malvadendrina minus Durio, or to Durio ( Figure 3). The fourth lowest scoring tree topology suggested that the clade consisting of Malvatheca + Firmiana was formed, via an allopolyploidy event between an ancestor of the clade and a member of the Durio lineage.
The remaining three topologies suggested a hybrid origin of clades including Dombeya, which based on Ks plots, does not have evidence of WGM in its evolutionary history. We attribute these topologies to an artifact of GRAMPA, possibly due to the very short internodes in the species tree and many other possible causes of gene tree incongruence, including erroneous splitting of isoforms and/or paralogs during transcriptome assembly, incomplete lineage sorting (ILS), and homoplasy.

Species tree inference: Bayesian concordance analysis and maximum likelihood inference of nuclear genes
The short internal branches of the species tree, possibly combined with a history of allopolyploidy, should lead us to expect a large amount of phylogenetic incongruence among single-copy gene trees. One way around this is to use genes that are represented by a single gene in each species (presumably because they are lost rapidly after each WGM). However, because only two orthogroups were identified once, and only once in all taxa, we focused on a smaller taxon set, which included the published genomic data of four genera (Theobroma, Gossypium, Bombax, and Durio) as well as transcriptomic data from Dombeya. We included Dombeya, even though it does not contain published genomic information, because it is the lone sampled Malvadendrina taxon whose Ks plot does not show evidence of a WGM. Theobroma was treated as the outgroup.
We identified 1,214 singleton groups (genes present as only one copy in each taxon sampled). We conducted Bayesian concordance analysis using BUCKy (Ané et al. 2007;Larget et al. 2010), which, taking account of uncertainty in individual gene trees, estimates the optimal population ("species") tree under the assumption that all discordance is due to incomplete lineage sorting (ILS). The resulting primary concordance (Baum 2007) and population trees had the same topology ( Figure 4A). There is strong support for a Malvatheca clade composed of Specifically, this result could be explained by a population tree with the topology (Theobroma, (Dombeya, (Durio, Malvatheca))) with subsequent gene flow between the Durio and Dombeya lineages, perhaps involving allopolyploid hybridization between the Dombeya and ancestral Durio lineages to generate an allopolyploid Durio lineage. This hypothesis, and why it did not emerge from GRAMPA analysis is discussed further below.
To further characterize the possible scenarios of hybridization suggested by BUCKy and GRAMPA, we identified 381 orthogroups that are present in the genomes of Gossypium, Theobroma, Durio, and Bombax plus the transcriptome of Dombeya, and have every node supported by a maximum likelihood bootstrap support >80%. We then used a custom R script to recursively combine sister groups from the same species (treating Malvatheca as a single species). Of the 381 trees, 265 reduced down to a tree with one tip per taxon, and of these, 113 trees had the rooted topology (Dombeya, (Durio, Malvatheca)), 95 had (Malvatheca, (Durio, Dombeya)), and 57 had (Durio, (Dombeya, Malvatheca)). This distribution is significantly different from 1/3 proportions (p-value=9.584e-05), although the frequency of the first two is not significantly different. Thus, despite using a different (though overlapping) gene tree set than was analyzed in BUCKy, we see a very similar pattern: significantly more trees supporting either Dombeya or Malvatheca as sister to the rest of Malvadendrina over a Durio-sister topology.

Modeling orthologous gene family sizes
Analyses of gene trees provides rich information for a subset of gene families, but a majority of families are ignored because they yield inadequately resolved gene trees or because the interpretation of gene trees is confounded by a history of gene duplication and gene loss.
Therefore, as a complement to gene topology-based methods, we conducted analysis of gene family size, which can also provide information on WGM events (Rabier et al. 2013;Parks et al. 2018).
Using the four published genome sequences of Gossypium, Theobroma, Durio, and Bombax, we used OrthoFinder to identify 19,621 gene families, with an average number of genes/family/genome ranging from 1.13 to 1.58 (Table S2, column A). Excluding the 1,368 gene families that are not represented by at least one gene in all taxa, increases the number of genes/family, especially in Theobroma (Table S2, column B). Further excluding the 68 clusters that had 31 or more genes in total (the largest cluster had 94 genes) had only a modest effect on the average gene family size in each species (Table S2, column C).
The R-program WGDgc (Rabier et al. 2013) was used to analyze the resulting gene count data and test alternative models for the distribution of WGMs over the three terminal and two internal branches of the rooted four-taxon tree used. While the tree topology is not in doubt, relative branch-lengths could significantly affect the conclusions because WGDgc explicitly models rates of gene duplication and loss. We used two alternative sets of branch lengths: (1) the branch lengths in substitutions/site as estimated from concatenated analysis of the singleton genes, and (2) branch length in substitutions per unit time, as inferred by imposing ultrametricity using penalized likelihood.
When raw branch lengths are used, the best model supports three separate WGM events ( Figure 5A): one whole genome duplication (WGD) shared by Gossypium + Bombax, one whole genome triplication (WGT) specific to Bombax, and one WGT specific to Durio. When the tree is rendered ultrametric the best model suggests four separate events ( Figure 5B): a WGT+WGD specific to Gossypium, a WGT specific to Bombax, and a WGT specific to Durio. Hypotheses with shared events are disfavored (see Tables S3, 4 for AIC differentials), but the best such scenario has one WGT event shared among the ingroup taxa (Durio, Bombax and Gossypium) and another event (WGD) specific to Bombax. Detailed results for likelihood, Alkaike Information Criterion (AIC), and Bayesian Information Criterion (BIC) of all models can be found in Tables   S2, 3.

DISCUSSION
Since the discovery that the genome of Gossypium raimondii has an evolutionary history that includes WGMs (Paterson et al. 2012), it has been unclear whether WGMs detected in Gossypium are independent or shared with other clades of Malvaceae. Here, we attempted to ascertain the number and phylogenetic placements of WGM events, using a combination of tools, including nuclear and plastid phylogenetic reconstructions, Ks plots, quartet tree inferences, a reconciliation method (GRAMPA), Bayesian concordance analysis, and a method based on gene copy number variation. We anticipated integrating these multiple approaches to reconcile our results with those of previously published studies. However, there was no agreement among approaches, leaving considerable uncertainty as to the history of genome evolution in the Malvaceae. We discuss some of the better-supported alternative hypotheses and note some methodological issues that complicate WGM inference, including ILS, asymmetrical gene fractionation, varying rates of evolution, and gene conversion (Panchy et al. 2016;Train et al. 2017).

Nuclear-plastid discordance
To investigate the phylogenetic placement of such WGM events, we first sought to infer a phylogeny of the Malvaceae using multiple plastid gene sequences, as previously explored by La Duke and Doebley (1995) and Nyffeler et al. (2005), among others. We compared this to the trees inferred by 1214 nuclear genes for a core set of taxa. The nuclear and plastid data yielded similar trees supporting Malvatheca (Malvoideae + Bombacoideae) sister to Sterculioideae + Tilioideae (represented by Firmiana + Tilia, respectively). They differed in having either Dombeya (Dombeyoideae) or Durio (Helicteroideae) as strongly supported sister group to the rest of the Malvadendrina for the nuclear and plastid data, respectively.
One possible explanation of this discordance is that our plastome tree does not track the dominant evolutionary history. Due to the lower effective population size of plastid DNA, however, plastomes should be less prone to ILS. Nonetheless ILS remains a possibility, as do other sources of phylogenetic conflict, such as errors in tree estimation. For example, it is possible that, despite high bootstrap support, the short and ancient internodes of the plastid tree were incorrectly resolved due to artifacts such as long-branch attraction (Bergsten 2005;Qu et al. 2017). That said, it is reasonable to assert that the plastid tree is the best current estimated of at least the maternal side of what appears to be a complex history of reticulation and genome multiplication.
Bayesian concordance analysis of singleton nuclear orthologs documented extensive phylogenetic incongruence, much of it presumably due to ILS along short branches near the base of Malvadendrina. These short branches, in fact, are expected under the scenario of a rapid radiation, which the plastid tree also implies. Although BUCKy does not directly return a credibility interval (CI) on a population tree topology as a whole, by looking at the CIs of conflicting concordance factors it is possible to evaluate the implied population tree. In this case credibility interval analysis on a small subset of taxa with genome sequences (The outgroup Theobroma and ingroup taxa Durio, Bombax, and Gossypium), plus transcriptome-derived data from Dombeya, allowed rejection of a topology in which Durio and Dombeya formed a clade, but could not distinguish between the alternative topologies in which either Durio or Dombeya is sister to Malvatheca (Bombax plus Gossypium). This pattern is most easily explained by a history of reticulation (discussed below). It should be borne in mind, however, that WGM followed by different rates of gene loss has the potential to cause erroneous inferences of orthology, which could potentially lead to systematic errors in population tree inference.

Ks plots and quartet analyses imply shared WGMs
The release of a draft genome sequence of Hibiscus syriacus (Kim et al. 2016) indicated two independent WGD events since its divergence from G. raimondii. We detected two broad peaks in the Ks plot for H. cannabinus, which has less than half as many chromosomes as H. syriacus (CCDB online: http://ccdb.tau.ac.il/), leading us to infer that at least the older Ks peak is a WGM shared with Gossypium. Likewise, phylogenetic inference, quartet analysis, and Ks plots suggest that Bombacoideae (Bombax and Adansonia) shares at least one WGM with Gossypium. However, while the two other subfamilies closely related to Malvatheca, namely Tilioideae and Sterculioideae, have experienced at least one WGM, our results are equivocal as to whether this was shared with Malvatheca (i.e., Malvoideae + Bombacoideae).
The Ks approach identified a Ks peak in most species, corroborating prior analyses (Paterson et al. 2012;Chen et al. 2015;Kim et al. 2016;Teh et al. 2017;Gao et al. 2018). While the modes of these distributions differ between species, the full distributions overlap markedly among species, and also overlap with speciation estimates. These results are compatible with a shared WGM event early in the radiation of Malvadendrina, but multiple separate events cannot be ruled out. Furthermore inference from Ks plots alone should be treated with caution. As a case in point, there are both known and suspected recent WGM events in our phylogeny that are not visible in the Ks plots. For example, Adansonia digitata is a recent tetraploid (Baum & Oginuma 1994), but no recent peak is observed. One possibility is that this was an autotetraploidy event that occurred recently enough that tetrasomic inheritance is still in effect, which could prevent paralogs diverging enough to result in two separate gene models during transcriptome assembly (Scott et al. 2016).
There is also a suspected WGM in the lineage leading to Dombeya burgessiae based on chromosome counts (2n = 46, 54; Seyani 1991) that are more than double that reported for some other Dombeyoideae (e.g., Corchoropsis with n=10; Tang 1992), but this is not visible in the Ks plot except, perhaps, in the relatively high frequency of paralogs with small Ks values (0.0-0.1). We note that the transcriptome assembly for Dombeya was the lowest quality assembly of our sampled taxa, highlighting that there should remain a degree of caution in interpreting results based solely on transcriptomic data. Additional high-quality genome sequences, especially for Dombeya, would help overcoming the limitations of transcriptomebased analyses.

Modeling WGMs by gene counts
Whereas Ks plots, quartet analyses, and GRAMPA analyses suggest some sharing of WGM events among subfamilies, gene cluster size analyses generally supported separate WGM events. We suspect that this is a failure of the WGDgc method resulting from errors during clustering and orthology detection, which could lead to violations of model assumptions. The model of gene size evolution assumes a constant rate of gene turnover (gene loss and small scale gene duplications) within and among lineages. The model seems sensitive to this assumption, as evidenced by the different results on trees with different branch lengths. The different lineages might have different rates of gene deletion and the assembled genomes also differ in quality, which could inflate implied deletion rates in less well validated genomes. As a case in point, the preference for more WGM events in the Bombax lineage than in the Gossypium lineage might be caused by a smaller number of genes per family in Gossypium than in any other ingroup species.
Another assumption of the gene count model is that, after each WGM, fractionation occurs rapidly compared to the rate of speciation. Here, we found evidence of rapid diversification between subfamilies, resulting in short internal branch lengths. Consequently, fractionation might have spanned one or more rapid speciation events, violating the model assumption. This might explain the prevalence of non-shared WGM events in WGDgc results. Thus, pending availability of more high-quality genome assemblies, we consider gene family size to provide insufficient information for resolving WGM events in Malvaceae.

Hypotheses to reconcile the conflicting signals from the different data sets
The foregoing synopsis explains why there might be reasons to doubt inference coming from Ks plots or gene-count based analysis. However, the discrepancy between the analysis of singlylabelled gene trees (BUCKy and gene tree counting) and multiple-labelled gene trees (GRAMPA) is less easily discounted. Specifically, as discussed below and illustrated in Figure   6, these two approaches suggest somewhat different hypotheses to account for the patterns observed in our data, both involving shared WGMs and allopolyploidy.

Hypothesis one
The first hypothesis to explain these data is that soon after the initial split separating Dombeya This hypothesis provides an explanation for why all Malvadendrina except Dombeya (Dombeyoideae) have a peak in Ks plots. Additionally, collinearity analysis shows that at least some duplicate chromosomal arms are shared between Durio and Gossypium (Teh et al. 2017), which supports such a shared WGM. This hybridization model is also consistent with the BUCKy results, since it explains why 1) a plurality of genes support a Durio+Malvatheca clade, which is the expected pattern for Durio genes from the tetraploid Malvatheca parent, 2) a significant subset of genes support a Durio+Dombeya clade, which is the expected pattern for Durio genes from the diploid dombeyoid parent, and 3) there are many fewer genes showing a Dombeya+Malvatheca clade, which can only arise by ILS. On the other hand, this hypothesis would predict a plastome phylogeny in which either Dombeya or Dombeya+Durio was sister to the remainder of Malvadendrina, rather than the observed Durio-sister topology. It should be borne in mind, however, that the plastid DNA tree is inferred from a single locus, which is subject to ILS, and might, therefore, not match the population tree.

Hypothesis two
A second hypothesis proposes an allopolyploid origin for Malvatheca (see Figure 6B). The initial This hypothesis is one of the scenarios supported by GRAMPA, using both transcriptomic and genomic data, which identified the Malvatheca clade as having an allopolyploid origin. It readily explains the Ks peaks in all of Malvadendrina except Dombeya; the observation that Durio has undergone an ancient hexaploidy event after its divergence from Theobroma; that Gossypium has undergone an ancient 5-or 6-fold multiplication; and that at least one of the paleopolyploidy event(s) in cotton must have been an allopolyploid event that resulted in biased fractionation (Renny-Byfield et al. 2015). Additionally, the pattern of WGMs is consistent with our plastid DNA tree. While this hypothesis is contradicted by the BUCKy results of a Dombeya-Malvatheca clade being significantly rarer than other resolutions, this could be attributed to noise or an artifact, for example caused by using a poor quality de novo transcriptome assembly for Dombeya.

CONCLUSIONS
Here we have illustrated the application of a number of approaches to dissect the complexities involved in phylogenetic inference for taxa that have experienced multiple polyploidization and hybridization events. Our analyses demonstrate that even when large-scale genomic and transcriptomic datasets are analyzed, with the most current methodologies, teasing apart the true history of WGM events is a challenging endeavor, and that large ploidy increases magnify these difficulties.
Even in sampling genomic-scale data for eight of the nine subfamilies in the Malvaceae, it is still uncertain as to whether cotton has undergone a 5-or 6-fold multiplication (Paterson et al. 2012;Wang et al. 2016), and whether this same history is shared by other subfamilies in the Malvaceae. Nonetheless, our data suggest two alternative hypotheses, which are quite different despite both invoking two autotetraploidy events and one allopolyploidy event.
There is reason to be hopeful that future work can determine which of these is correct. A higher quality transcriptome or genome assembly for Dombeya could resolve whether it has a history of WGM. Furthermore, genome sequences of additional exemplar species are likely to be able to distinguish our main competing models, one of which suggests independent WGMs in Malvatheca and Sterculioideae/Tilioideae and also that Durio is an allopolyploid, whereas the second model suggests that Malvatheca, Sterculioideae, and Tilioideae share a WGM and that members of Malvatheca are allopolyploid. Such work would not only help clarify the history of WGM in the Malvaceae, but would also help the community better understand causes of disagreement among methods used to study genome evolution.  Table S1 summarizes transcriptome assembly statistics.

Sources of transcriptomic/genomic data
Transcriptomes from Adansonia digitata L. and Bombax ceiba L. were generated at the University of Wisconsin -Madison. RNAs were extracted from greenhouse-grown leaves of Adansonia digitata and Bombax ceiba using an optimized CTAB extraction protocol based on (Chang et al. 1993)  Raw reads were quality trimmed and removed of adapter sequences using TrimGalore (Krueger 2015) and assembled with SOAP-Denovo-Trans (Luo et al. 2012) with Kmer size of 51. Open reading frames were identified with Transdecoder (Haas et al. 2013), and only those ORFs with sequence similarity to those of the published Gossypium raimondii (Paterson et al. 2012) reference genome annotation were kept for further analysis. Quality of transcriptome assembly was assessed using BUSCO v2 and the Embryophyta odb9 database (Simão et al. 2015).

Phylogenetic analysis of plastid DNA
Published plastid DNA sequences were downloaded from NCBI ( Figure S2), for all taxa in this study, including Arabidopsis thaliana as an outgroup, except for the following taxa: Adansonia, Bombax, Corchorus, Dombeya, and Heritiera. Data for A. digitata and B. ceiba were generated by mapping nuclear targeted sequence capture reads (Karimi et al. unpublished) to the published plastid genome of Theobroma as described below. For the remaining three taxa for which no published plastid DNA sequences were available for the loci of interest (Heritiera, Corchorus, and Dombeya), we utilized our transcriptomic data. Trimmed RNAseq reads from these remaining taxa were mapped to the Theobroma published plastid genome using HISAT version 2.0.4 (Kim et al. 2015). Alignment (bam) files were converted into fasta files using bam2consensus (Page et al. 2014). Genes which were present in all taxa in single copy (i.e., not encoded in the IR regions of the plastid genome) were retained, and any gap sites were excluded to decrease phylogenetic errors caused by missing data. Nucleotide sequences were aligned with MAFFT (Katoh et al. 2002) using FFT-NS-2.
We performed phylogenetic analyses on three concatenated data sets: 1) genomic-derived sequences only (excluding taxa from RNA-Seq data, i.e., no Dombeya, Heritiera, or Corchorus), 2) genomic-derived sequences plus Dombeya (Dombeyoideae), 3) all sequences combined. We compared topologies from these three data sets given that transcriptome data are subject to RNA editing while samples represented by genomic data are not; thus the inclusion of more than one sample with RNA editing could potentially alter tree topology and bootstrap support due to the introduction of artificial homoplasy (Bowe and dePamphilis1996). For the concatenated datasets, maximum likelihood phylogenetic inference was performed with RAxML version 8.2.10 (Stamatakis 2014) using the GTR + Γ model (as determined by jModelTest2; Darriba et al. 2012) and 100 bootstrap replicates.

Ks distributions
We used full predicted gene regions for those taxa that have published genome sequences (G. raimondii, and Theobroma cacao) and full coding region predictions from Transdecoder (Haas et al. 2013) for all remaining taxa. We clustered the predicted protein sequences of all open reading frames (ORFs) from the de novo assembled transcriptomes (Table S1) with the published protein sequences of G. raimondii and T. cacao using OrthoFinder version 2.1.2 (Emms & Kelly 2015) to create orthogroups, groups of homologs (i.e., groups containing both orthologs and paralogs) defined by inclusion of the outgroup Theobroma. For all orthogroups that contained more than one gene for a given species (i.e., potential paralogs), we used every possible pairwise combination of genes and calculated the synonymous site (Ks) difference of these genes using the codeml package of PAML (Yang 1997). We then created species-specific distributions of Ks values to look for peaks that may indicate a WGM event.

Gene quartets
If a gene duplication event that gives rise to paralogs occurs after a speciation event (i.e., the gene duplication event was not shared between taxa of interest), then we expect conspecific paralogs to be sister in the resulting gene tree. Alternatively, if the duplication event occurred prior to speciation, we expect one paralog from each of the species to be sister to one another in the resulting gene tree. By comparing the proportion of all gene quartets that share the latter topology, we can assess the likelihood that two species share the same WGM event. Each species was compared to itself with BLAT (Kent 2002), to find protein pairs in the Ks range of 0.2-0.6, with a minimum of 300bp in their alignment. ProteinOrtho (Lechner et al. 2011) was then used to cluster proteins across two species of interest, using one representative sequence from each pair obtained in the previous step. ProteinOrtho is similar to OrthoMCL or OrthoFinder, but uses "spectral" partitioning to decompose large clusters into smaller clusters.
The one-to-one clusters were retained, and the second protein in each pair was merged back to form 4-protein clusters, each with 2 proteins from each of the 2 species. Each cluster was aligned with MUSCLE, and alignments were restricted to homologous sites with BLASTn.
Finally, a maximum likelihood tree and its bootstrap support were obtained for each quartet with RAxML version 8.2.10 (Stamatakis 2014), run with the GTR + Γ model and 100 bootstrap replicates. A given quartet was only kept if its tree (with a single internal edge) had a bootstrap >70%. Quartet trees were then summarized across all quartets, for each given pair of species.

Bayesian concordance analysis of nuclear genes
To infer phylogenies based on nuclear genes, we used Orthofinder version 2.1.2 (Emms and Kelly 2015) on the genomes of Bombax, Gossypium, Durio, and Theobroma, and the transcriptome of Dombeya to identify single homologous genes for analysis. Bayesian phylogenetic analysis was performed independently on each gene using MrBayes version 3.2.2 (Ronquist and Huelsenbeck 2003) with 2 million generations, 3 chains with a swap rate of 0.45, and 10% discarded as burn-in. The resulting posterior probabilities were used for Bayesian concordance analysis implemented in BUCKy version 1.4.4 (Ané et al. 2007;Larget et al. 2010), with an alpha of 1 and one million generations.

Classifying MUL gene tree topologies
We extracted protein-sequence orthogroups using OrthoFinder version 2.1.2 (Emms and Kelly 2015) from genomic datasets of Bombax, Durio, Gossypium, and Theobroma, and our de novo assembly of Dombeya. Sequences were aligned with MAFFT (Katoh et al. 2002) and gene trees were estimated using the PROTGAMMAWAG model of substitution in RAxML (Stamatakis 2014) with 100 bootstrap replicates. Gene trees were filtered to include only those with >80% bootstrap support on every branch. Trees were rooted on Theobroma and we used custom R scripts to classify the topologies (see Results).

Multiple-Labeled (MUL) gene tree analysis and hybridization testing
Using the same orthogroups used to construct the Ks plots, we aligned protein sequence data and obtained gene trees using OrthoFinder version 2.1.2 (Emms and Kelly 2015), which is based on the DendroBlast (Kelly and Maini 2013) algorithm. The resulting 21,757 non-trivial gene trees were rooted (-O flag) using Urec version 1.02 (Górecki and Tiuryn 2006) with the plastid DNA tree defined as the most probable species tree. Because gene trees often contain multiple sets of orthologs (each derived from a single gene at the root of the species tree), there may be more than one correct rooting. Urec minimizes the number of gene gains and gene loss events to reconcile each gene tree with the given species tree. These rooted gene trees were then input into GRAMPA (Gregg et al. 2017) with the maximum number of multiply labeled taxa in each gene tree set to 8 (default), and no restrictions as to which nodes in the species tree (if any) were involved in allopolyploidy.
For additional analyses using only published genomic datasets and our de novo assembly of Dombeya, protein sequences were clustered using OrthoFinder version 2.1.2 (Emms and Kelly 2015), and sequence alignments for each orthogroup were constructed using MAFFT (Katoh et al. 2002). Gene trees were inferred using the PROTGAMMAWAG model of substitution in RAxML (Stamatakis 2014) with 100 bootstrap replicates. Only those gene trees in which all internal nodes had at least 50% bootstrap support were retained for further analyses. Gene trees were rooted with Urec version 1.02 (Górecki and Tiuryn 2006) using the species tree topology generated from our plastid sequences (see Results), and GRAMPA was run with the maximum number of multiply labeled taxa in each gene tree set to 18. No restrictions were set on which lineages were involved in allopolyploidy.

Modeling orthologous gene family sizes
To further test the hypothesis of a single or separate WGM events, we implemented a statistical framework for using counts of orthologous genes between species to comparing hypotheses for the placement of WGD or WGT events on a phylogenetic tree. Counts of orthologs per orthogroup were obtained from OrthoFinder version 2.1.2 (Emms and Kelly 2015) for the published genomes (Gossypium, Theobroma, Bombax, and Durio). We used concatenation of single-copy nuclear-encoded proteins whose gene trees matched the expected species tree topology to infer a phylogeny in RAxML (Stamatakis 2014) using the PROTGAMMAWAG model of substitution with branch lengths in substitutions/site. This phylogeny was also rendered ultrametric with penalized likelihood (PL) using Chronos (Sanderson et al. 2002;Kim and Sanderson 2008) in the R package APE (Paradis et al. 2004;Paradis 2011;Popescu et al. 2012), with default parameters of a correlated clock and lambda=1 (as in Paradis 2013). On the non-calibrated tree, to make birth rate comparable with that on the calibrated tree, branch lengths were rescaled to have an average distance of 1 between the root and the tips. Analyses described below were run both on this rescaled maximum likelihood tree and on the PLcalibrated tree.
Modeling gene family sizes was implemented in the R package WGDgc as described in Rabier et al. (2013). Using the phylogenetic tree inferred above, gene family size was modelled using equal background (small-scale) gene duplication and gene loss rates, with the same rate per unit branch rate assumed for the entire tree. Shared duplication or triplication events were modeled by doubling or tripling counts in each orthogroup, followed by immediate fractionation and loss of each duplicate with some probability, 1 minus the retention rate, with each event having an independent retention rate. Parameters were estimated jointly by maximum likelihood and models were compared using the Bayesian Information Criterion (BIC). The estimated parameters include the background duplication/loss rate, one retention rate for each WGM, the timing of each event, and the average number of genes per family at the root of the tree (with the number of genes at the root assumed to follow a geometric distribution, translated to start at 1). The likelihood calculation accounts for some sources of ascertainment bias. In particular, we corrected for the fact that we cannot observe families that went extinct or are unsampled (zero genes in all the sampled taxa) and that OrthoFinder excluded singletons, i.e. families with a single gene. To correct for these missing data, means were calculated excluding gene families that went extinct either in Theobroma, or in all of the other three taxa.
Given our four-taxon tree, we tested a total of 89 models accounting for possible phylogenetic placements of WGD/WGT events. Tables S3 and 4 present results for each scenario, but briefly the models tested were: 1) No events, 2) one WGD or one WGT event occurring on five possible tree edges resulting in total of 10 models, 3) either two WGD or one WGD + one WGT, each event placed on any of the 5 possible edges, resulting in 40 possible models, 4) three events with one specific to Durio and two events (two WGD or one WGD + one WGT) in Malvatheca (3 edges), accounting for 30 models, and 5) four events, but none of them shared between taxa, accounting for 8 models. Our constraint on four-event scenarios were to have two in Gossypium (two WGD or one WGD + one WGT), one in Bombax, and one in Durio.
The likelihood of shared WDG/WDT models was compared by BIC.  Table S1. Taxa sampled in this study, subfamily, and genomic or transcriptomic, source, and transcriptome assembly statistics, including BUSCO Table S2. Average gene family size in each taxon Table S3. Gene count analyses from WGDgc. Models, their likelihoods and parameter estimates for the starting tree with branch lengths proportional to their raw values in substitutions/site Table S4. Gene count analyses from WGDgc. Models, their likelihoods and parameter estimates for the starting tree with branch lengths calibrated with Penalized Likelihood     Numbers by WGM symbols are the estimated retention probability of each extra copy. λ is the estimated gene turnover (duplication & loss) rate. Numbers in parentheses, next to taxon names, are the mean number of genes/family among the families that were used for analysis. Figure 6. Hypotheses presented to explain the data (A) Hypothesis one, which is consistent with Bayesian concordance analysis, ML phylogeny of concatenated nuclear genes, Ks histograms, and gene tree topology testing, suggests that Durio is an allopolyploid. (B) Hypothesis two, which is consistent with Ks histograms, plastid phylogeny, and GRAMPA, suggests that the Malvatheca clade is allopolyploid.

Table 1. Quartet-based gene tree topology test
For any two species that contained a Ks peak, we extracted quartets (pairs of paralogs) under the peak and constructed gene trees to test if the tree topology is consistent with a shared gene duplication event. Below diagonal: number of genes that support separate gene duplication event + genes suggesting shared duplications. Above diagonal: percentage of quartets (pairs of paralogs) yielding a topology consistent with a separate duplication. Values well-below or well-above the expected 33% for random sampling are colored: <15% = red; >45% = blue.