FRAGARIA.Zurn.etal.PopulationStructure.2022

Evaluation location: Oregon, United States

Population Structure and Diversity Analysis

All analyses of diversity and structure were conducted using R v. 4.0.3 (ref. 31). Accessions were stratified into 13 geographical regions based on their passport information on GRIN Global. The regions within North America were: Alaska, U.S.; Northwest U.S. (Idaho, Oregon, Washington, and Wyoming); California, U.S.; Midwest U.S. (Illinois, Indiana, Iowa, Michigan, Minnesota, Missouri, and Wisconsin); Southeastern U.S. (Arkansas, Florida, Louisiana, Mississippi, North Carolina, Tennessee, Texas, and South Carolina); Mid-Atlantic U.S. (Maryland and Delaware); New England U.S. (Connecticut, Maine, Massachusetts, New Hampshire, New Jersey, and New York); Western Canada (British Columbia and Alberta); and Eastern Canada (Ontario and Nova Scotia). Europe was divided into Western Europe (Belgium, Denmark, France, Germany, Italy, Ireland, the Netherlands, Norway, Sweden, and the United Kingdom) and Eastern Europe (Belarus, Lithuania, Poland, and Western Russia). The Asia geographic region included accessions from Eastern Russia, Japan, and Taiwan. A single accession that originated from South Africa made up the final geographic region.

Three methods were used to evaluate population structure. The first was principal component analysis (PCA) followed by k-means clustering as implemented in adegenet v. 2.1.3 (refs. 32 & 33). When conducting k-means clustering, the maximum number of principal components was retained and the optimum number of clusters was selected using the minimum Bayesian information criterion. The second was the sparse non-negative matrix factorization (sNMF) algorithm as implemented in the R package LEA (refs. 34 & 35), which was used to evaluate population structure and admixture between populations. The number of K subpopulations evaluated ranged from 2 to 14 and each analysis was repeated 10 times. The elbow method was used to identify K clusters for the sNMF algorithm. Sample orders were calculated using CLUMPP v. 1.1.2 (ref. 37) and results were visualized using the barchart function from LEA (ref. 34). Finally, population structure and admixture between populations were assessed using STRUCTURE v. 2.3.4 (ref. 36) for 2 to 14 subpopulations. Parameters were set to 25,000 burn-in steps and 50,000 Markov chain Monte Carlo (MCMC) steps with 10 replications per K subpopulation. All remaining parameters were set to default. The optimal number of K subpopulations for the STRUCTURE results was identified using STRUCTURE HARVESTER v. 0.6.94 (refs. 38 & 39). Sample orders were calculated using CLUMPP v. 1.1.2 (ref. 37) and results were visualized using Structure Plot v. 2.0 (ref. 40).

The geographic sub-populations, except for the South African accession, were evaluated for population richness, intra-group diversity, expected heterozygosity, and evenness. Intra-group diversity was evaluated using Simpson’s index (ref. 36) and expected heterozygosity was evaluated using Nei’s expected heterozygosity (ref. 37). Richness, Simpson’s index, Nei’s expected heterozygosity, and evenness were calculated using the R package poppr v. 2.8.7 (refs. 41 & 42). The pairwise fixation index (FST) was also calculated for each geographic sub-population, excluding South Africa, using hierfstat v. 0.5-7 (ref. 43) to assess the amount of interbreeding or sharing of germplasm between breeding programs in these regions.

Core Collection Creation

Two 100-individual core collections were created using the R package corehunter v. 3.2.1 (ref. 6). The first core collection was a type 1 core collection (also known as a CC-I collection) designed to evenly represent the diversity of the collection. The second was a type 2 core collection (also known as a CC-X collection) designed to represent the extremes of the entire collection. The type 1 collection used the average distance between each accession and the nearest entry (A-NE) criterion and works to minimize this value (ref. 44).
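As a rough illustration of the A-NE criterion just described, the sketch below computes the average distance from every accession to its nearest core entry. It is not the corehunter implementation: the function name `a_ne`, the toy distance matrix, and the index-based core sets are hypothetical, and the sketch assumes that an accession already in the core counts as its own nearest entry (distance 0).

```python
def a_ne(dist, core):
    """Average distance from every accession to its nearest core entry.

    dist: square, symmetric distance matrix as a list of lists.
    core: indices of the accessions selected as core entries.
    Assumption: a selected accession is its own nearest entry (distance 0).
    """
    if not core:
        raise ValueError("core must contain at least one entry")
    n = len(dist)
    nearest = [min(dist[i][j] for j in core) for i in range(n)]
    return sum(nearest) / n


# Hypothetical distances among four accessions; 0 and 1 are close to each
# other, as are 2 and 3.
toy_dist = [
    [0.0, 0.1, 0.6, 0.7],
    [0.1, 0.0, 0.5, 0.6],
    [0.6, 0.5, 0.0, 0.2],
    [0.7, 0.6, 0.2, 0.0],
]

spread_core = [0, 2]   # one entry from each group
clumped_core = [0, 1]  # both entries from the same group

print(a_ne(toy_dist, spread_core))
print(a_ne(toy_dist, clumped_core))
```

In this toy case the core that takes one entry from each group yields the smaller A-NE, which is the direction a type 1 collection is optimized in.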
The type 2 collection used the average distance between each entry and the nearest neighboring entry (E-NE) criterion and works to maximize this value (ref. 7). For each collection, a set of 13 accessions was pre-selected as “seeds”. These individuals were selected based on their geographical origin and because they are positive controls for various DNA tests, were sequenced or are a parent of a major mapping population, or are known to be notable cultivars from their geographic region (refs. 12, 45 & 46). These 13 accessions were as follows: ‘Camarosa’ (PI 670238), ‘Charm’ (PI 664911), ‘Deutsch Evern’ (PI 551626), ‘Holiday’ (PI 551653), ‘Korona’ – Netherlands (PI 666636), ‘Mara des Bois’ (PI 687353), ‘Ooishi shikinari 2’ (PI 641185), ‘Senga Sengana’ (PI 264680), ‘Strawberry Festival’ (PI 664337), ‘Tochiotome’ (PI 617008), ‘Totem’ (PI 551501), ‘Tribute’ (PI 551953), and US 4809 (PI 637938). Prevosti’s absolute genetic distance was used in construction of each core collection (ref. 47). Because the corehunter package uses stochastic algorithms, it was run 2,000 times when constructing each core collection, and the core collection with the minimum A-NE (type 1) or maximum E-NE (type 2) value was retained.

Pedigree Confirmation

Percent identity by state (IBS) was calculated between each pair of individuals for all individuals. Individuals sharing greater than or equal to 98% IBS were considered to be synonyms. The software COLONY v. 2.0.6.6 (ref. 48) was used for parentage inference. The parameters polygamy for both males and females, inbreeding mating, without clones, monoecious, and diploid were used to describe hybridization within strawberry. The full-likelihood estimates algorithm with precision set to high was used. All remaining parameters were set to the default. Potential parents with a pairwise likelihood under 90% were excluded as parental candidates unless a full-likelihood estimate was provided.
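The synonym screen described above can be sketched as follows. The record does not state how percent IBS was computed, so this example assumes one common definition: genotypes are coded as alternate-allele counts (0, 1, 2), each locus contributes `2 - |g1 - g2|` shared alleles, and loci with a missing call (`None`) are skipped. The function names and toy accessions are hypothetical.

```python
from itertools import combinations


def percent_ibs(g1, g2):
    """Percent identity by state between two genotype vectors.

    Assumption: genotypes are alternate-allele counts (0, 1, 2) and None
    marks a missing call; loci missing in either individual are skipped.
    """
    shared = 0
    compared = 0
    for a, b in zip(g1, g2):
        if a is None or b is None:
            continue
        shared += 2 - abs(a - b)  # alleles shared at this locus (0, 1, or 2)
        compared += 1
    if compared == 0:
        raise ValueError("no loci with calls in both individuals")
    return 100.0 * shared / (2 * compared)


def find_synonyms(genotypes, threshold=98.0):
    """Return name pairs whose percent IBS meets or exceeds the threshold."""
    return [
        (x, y)
        for x, y in combinations(sorted(genotypes), 2)
        if percent_ibs(genotypes[x], genotypes[y]) >= threshold
    ]


# Hypothetical data: 60 markers per accession.
base = [0, 1, 2, 2, 0, 1] * 10
near_copy = list(base)
near_copy[0] = 1  # a single one-allele difference
distinct = [2, 1, 0, 0, 2, 1] * 10

toy_genotypes = {
    "accession_A": base,
    "accession_B": near_copy,
    "accession_C": distinct,
}

print(find_synonyms(toy_genotypes))
```

With the 98% threshold from the text, only the pair differing by a single allele call is flagged in this toy data set.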
Trait Associated Haplotype Prevalence

For haplotype identification, markers from the curated dataset that had been mapped to the F. ×ananassa ‘Camarosa’ v. 1.0 assembly were used (ref. 14). Data were imputed and phased using Beagle v. 5.2 (refs. 49 & 50). Pairwise linkage disequilibrium was calculated using VCFtools v. 0.1.16 (ref. 51) to assess linkage disequilibrium decay. Haploblocks for each region of interest were defined as N nucleotides proximally and distally from markers associated with each gene or QTL, where N is the genome-wide distance required to reach an r² of 0.20 when estimating linkage disequilibrium. The genetic regions for the remontancy gene FaPFRU (ref. 20) and disease resistance genes FaRCa1 (anthracnose fruit rot; ref. 26), FaRCg1 (Colletotrichum crown rot; ref. 22), FaRMp1 (charcoal rot; ref. 24), FaRMp2 (charcoal rot; ref. 24), FaRPc2 (Phytophthora crown rot; ref. 23), and Fw1 (Fusarium wilt; ref. 25) were investigated. Haplotypes associated with perpetual flowering and disease resistance were identified using previously reported favorable alleles in known positive accessions within the collection. The prevalence of these haplotypes within the collection and their geographical distributions were assessed. Haplotypes with identical sequences were arbitrarily named, except for those that had been previously identified. Previously identified haplotypes were named using the gene name followed by the haplotype designation used in previous research.
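The haploblock definition above can be illustrated with a minimal sketch. The record does not say how the decay distance N was estimated, so the binned-mean approach here is an assumption, as are the function names, the 10 kb bin size, and the toy values. It takes (distance, r²) pairs of the kind that can be derived from pairwise linkage disequilibrium output, reports the lower edge of the first distance bin whose mean r² is at or below 0.20, and extends a region by that many nucleotides on either side of its outermost associated markers.

```python
def decay_distance(pairs, threshold=0.20, bin_size=10_000):
    """Estimate the distance at which mean r2 falls to the threshold.

    pairs: iterable of (distance_bp, r2) for marker pairs.
    Assumption: r2 is averaged within fixed-width distance bins and the
    lower edge of the first bin at or below the threshold is returned.
    Returns None if no bin reaches the threshold.
    """
    bins = {}
    for distance, r2 in pairs:
        idx = distance // bin_size
        total, count = bins.get(idx, (0.0, 0))
        bins[idx] = (total + r2, count + 1)
    for idx in sorted(bins):
        total, count = bins[idx]
        if total / count <= threshold:
            return idx * bin_size
    return None


def haploblock_bounds(marker_positions, n):
    """Extend n nucleotides beyond the outermost associated markers."""
    start = max(0, min(marker_positions) - n)
    end = max(marker_positions) + n
    return start, end


# Hypothetical (distance in bp, r2) observations.
toy_pairs = [
    (1_000, 0.80), (5_000, 0.60),
    (12_000, 0.40), (18_000, 0.30),
    (25_000, 0.18), (29_000, 0.15),
    (40_000, 0.05),
]

n = decay_distance(toy_pairs)
# Hypothetical positions of two markers associated with one locus.
print(n, haploblock_bounds([150_000, 180_000], n))
```

A fitted decay curve could replace the binned means; the binning is used here only to keep the example short.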

Trait(s) evaluated

TRUE TO TYPE

238 Accessions

Citation(s)

Zurn, J. D., K. E. Hummer & N. V. Bassil. 2022. Exploring the diversity and genetic structure of the US National Cultivated Strawberry Collection. Hort. Res. 9, uhac125. DOI: 10.1093/hr/uhac125. Note: ISSN 2052-7276 (online)