1
Whole genome sequence of Mapuche-Huilliche Native Americans
Elena A. Vidal1,2,15,* Tomás C. Moyano1,2,*, Bernabé I. Bustos1,3,*, Eduardo Pérez-Palma1,3,*, Carol
Moraga1,2,*, Alejandro Montecinos1,2, Lorena Azócar1,4, Daniela C. Soto1,2, Eleodoro Riveras1,5,
Mabel Vidal1,2, Alex Di Genova1,5, Klaus Puschel6, Peter Nürnberg7, Stephan Buch8, Jochen
Hampe8, Miguel L. Allende1,9, Verónica Cambiazo1,10, Mauricio González1,10, Christian Hodar1,10,
Martín Montecino1,3, Claudia Muñoz-Espinoza1,11, Ariel Orellana1,11, Angélica Reyes-Jara1,11,
Dante Travisany1,5, Paula Vizoso1,11, Mauricio Moragal2,13, Susana Eyheramendy14, Alejandro
Maass1,6, Giancarlo V. De Ferrari1,3,**, Juan Francisco Miquel1,4,**, Rodrigo A. Gutiérrez1,2,**
1FONDAP Center for Genome Regulation. 2Departamento de Genética Molecular y
Microbiología, Facultad de Ciencias Biológicas, Pontificia Universidad Católica de Chile. 3Centro
de Investigaciones Biomédicas, Facultad de Ciencias Biológicas and Facultad de Medicina,
Universidad Andrés Bello. 4Departamento de Gastroenterología, Facultad de Medicina, Pontificia
Universidad Católica de Chile. 5Departamento de Ingeniería Matemática and Centro de
Modelamiento Matemático, Facultad de Ciencias Físicas y Matemáticas, Universidad de Chile.
6Departamento de Medicina Familiar, Escuela de Medicina, Pontificia Universidad Católica de
Chile, Chile. 7Cologne Center for Genomics (CCG), University of Cologne, Cologne, Germany.
8Medical Department I, University Hospital Dresden, TU Dresden, Dresden, Germany.
9Departamento de Biología. Facultad de Ciencias, Universidad de Chile. 10Laboratorio de
Bioinformática y Expresión Génica, Instituto de Nutrición y Tecnología de los Alimentos,
Universidad de Chile. 11Centro de Biotecnología Vegetal, Facultad de Ciencias Biológicas,
.CC-BY-NC-ND 4.0 International licenseis made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It. https://doi.org/10.1101/252619doi: bioRxiv preprint
2
Universidad Andrés Bello. 12Instituto de Ciencias Biomédicas, Facultad de Medicina, Universidad
de Chile. 13Departamento de Antropología, Facultad de Ciencias Sociales, Universidad de Chile.
14Departmento de Estadística, Facultad de Matemáticas, Pontificia Universidad Católica de Chile.
15Current address: Centro de Genómica y Bioinformática, Universidad Mayor.
*These authors contributed equally to this work.
**These authors co-supervised this work.
Correspondence should be addressed to RAG ([email protected]), JFM
([email protected]) or GVD ([email protected]).
.CC-BY-NC-ND 4.0 International licenseis made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It. https://doi.org/10.1101/252619doi: bioRxiv preprint
3
Abstract
Background
Whole human genome sequencing initiatives provide a compendium of genetic variants that help
us understand population history and the basis of genetic diseases. Current data mostly focuses on
Old World populations and information on the genomic structure of Native Americans, especially
those from the Southern Cone is scant.
Results
Here we present a high-quality complete genome sequence of 11 Mapuche-Huilliche individuals
(HUI) from Southern Chile (85% genomic and 98% exonic coverage at > 30X), with 96-97% high
confidence calls. We found approximately 3.1x106 single nucleotide variants (SNVs) per
individual and identified 403,383 (6.9%) of novel SNVs that are not included in current
sequencing databases. Analyses of large-scale genomic events detected 680 copy number variants
(CNVs) and 4,514 structural variants (SVs), including 398 and 1,910 novel events, respectively.
Global ancestry composition of HUI genomes revealed that the cohort represents a marginally
admixed population from the Southern Cone, whose genetic component is derived from early
Native American ancestors. In addition, we found that HUI genomes display highly divergent and
novel variants with potential functional impact that converge in ontological categories essential in
cell metabolic processes.
Conclusions
Mapuche-Huilliche genomes contain a unique set of small- and large-scale genomic variants in
functionally linked genes, which may contribute to susceptibility for the development of common
complex diseases or traits in admixed Latinos and Native American populations. Our data
.CC-BY-NC-ND 4.0 International licenseis made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It. https://doi.org/10.1101/252619doi: bioRxiv preprint
4
represents an ancestral reference panel for population-based studies in Native and admixed Latin
American populations.
Keywords
Mapuche-Huilliche - Native American - whole genome - structural DNA variation
.CC-BY-NC-ND 4.0 International licenseis made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It. https://doi.org/10.1101/252619doi: bioRxiv preprint
5
Background
Sequencing complete human genomes has greatly expanded the knowledge of our genetic
diversity, providing insights into the evolutionary history of man and the bases of human diseases.
Large-scale genomic initiatives such as HapMap [1], the 1000 Genomes Project (1kGP) [2] or the
ExAC initiave [3] have revealed that individuals from multiple populations carry different profiles
of rare and common variants that differ substantially among human continental groups. While
current high-coverage full genome efforts have mostly focused on Old World populations
(Europeans, Asians and Africans), there is still limited information concerning the genetic
structure of Native American groups [4].
Genome-wide sequence data from ancient and present-day humans has been described from
Greenland, Arctic Canada, Alaska, Aleutian Islands and Siberia and used to understand migration
pulses into the Arctic regions of America [5, 6]. Likewise, whole genome/exome and large-scale
genotyping data has been used to study the genetic history, multiple streams of migration and
population-genomic variables that underlie patterns of deleterious variation in African, Asian,
European, and Native American ancestry in populations of Latin America and the Caribbean [7-
10], as well as the Pacific Northwest [11]. More recently, studies on the demographic history and
population structure of admixed South American Latinos has been reported with the aid of
genome-wide genotyping technologies [12-14]. Nevertheless, whole genome sequencing efforts
aimed to identify genetic variants that may influence susceptibility to develop complex common
disorders affecting modern Native American and admixed American populations are still scarce.
Mapuche-Huilliches descend from early hunter-gatherers who colonized the subcontinent about
15000 years ago and are the modern representatives of one of the most prominent indigenous
groups in the Southern Cone of South America [15]. Here we sequenced at high-coverage and
.CC-BY-NC-ND 4.0 International licenseis made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It. https://doi.org/10.1101/252619doi: bioRxiv preprint
6
analyzed the complete genome of 11 individuals belonging to a native Mapuche-Huilliche
population from Southern Chile. Addressing genetic contributions from Native populations
represents a major step towards understanding the genetic basis for common traits or disorders in
current admixed populations.
Results
Overview of Mapuche-Huilliche genomes
Individuals sequenced belong to a community living in Ranco Lake (latitude 40°13'27.62"S,
longitude 72°22'50.16"W), in the Los Rios region of Southern Chile (Additional file 1: Figure
S1). A common difficulty encountered while studying Native American genetic history is the
admixture with individuals from Europe and Africa that occurred since the arrival of Europeans to
America in 1492. For example, Spanish conquerors arrived in Chile in the sixteenth century and
began interbreeding with native females, primarily of the Mapuche-Huilliche group, giving birth
to the current Chilean population [12, 16, 17]. Therefore, to ensure a priori as little admixture as
possible, 11 individuals from the Mapuche-Huilliche community (10 females and 1 male) with > 3
surnames of Mapuche origin, ABO type O and Rh+, the most common blood type in southern
Cone Native Americans, and that were homozygous for a SNV located 13.9 kb upstream of the
human lactase gene (LCT: C>T-13910; rs4988235), which determines the lactase nonpersistent
state in Native Americans [18], were selected for sequencing (Table 1), hereinafter referred as
“HUI” cohort. HUI DNA samples were sequenced using the combinatorial probe-anchor ligation
and DNA nanoarray technology of Complete Genomics [19]. We obtained an average of 85%
genomic and 98% exonic coverage of at least 30X, with 97% and 98% high confidence calls,
respectively.
.CC-BY-NC-ND 4.0 International licenseis made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It. https://doi.org/10.1101/252619doi: bioRxiv preprint
7
Approximately 3.1x106 single nucleotide variants (SNVs) were determined for each individual,
with a total of 5,847,034 SNVs in the cohort (Additional file 1: Table S1). We found a high level
of concordance (99.70%) in the SNV calling rate between genome sequencing and genotyping
results obtained using the Illumina Infinium Human Core Exome BeadChip on the same
individuals (Additional file 1: Table S2). Since the Illumina chip is focused only in exonic and
gene-surrounding variants, we determined the genome-wide transition versus transversion
mutations (Ts/Tv) ratio for all variants to assess for the presence of false positive calls. We
observed that the Ts/Tv ratio in sequenced genomes is 2.1:1 (Additional file 1: Table S3), in
agreement with the 1kGP expected ratio of 2:1 [2], further confirming the good quality of variants.
Identity by descent (Additional file 1: Table S4) and inbreeding (Additional file 1: Table S5)
analyses indicated that individuals sequenced were not closely related or inbred. A genome-wide
summary of main genetic elements in HUI individuals is presented in Fig. 1.
We identified 403,383 (6.9%) novel SNVs that are not included in dbSNP build 144 release or do
not have a reported frequency either in the 1kGP-phase 3 database [2], the Exome Sequencing
Project [20] or the Exome Aggregation Consortium (ExAc) [21] (Additional file 1: Table S1). To
reduce the amount of variants with missing calls, we applied a calling rate (CR) threshold of 90%
giving a total number of 321,803 SNVs, including an excess of individual privative variants (1/22
alleles; 256,550 SNVs, 79.7 %; Additional file 1: Fig. S2a), in agreement with the current
literature [22]. A number of 175,897 novel SNVs fell in intergenic regions (54.66%), 118,777 in
introns (36.91%) and 1,769 in the coding portion of the genome (0.55%, exonic) (Additional file
1: Fig. S2b). Likewise, we observed 88,173 (19.92%) insertions and 54,718 (12.06%) deletions
that are novel and observed in at least 1 individual. In addition, analyses of large-scale genomic
events detected 680 copy number variants (CNVs) and 4,514 structural variants (SVs), including
.CC-BY-NC-ND 4.0 International licenseis made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It. https://doi.org/10.1101/252619doi: bioRxiv preprint
8
398 and 1,910 novel events (Additional file 1: Table S6 and S7, respectively), that did not
overlap any region reported in the CNV map from the Database of Genomic Variants [23] or the
1kGP-phase 3 release [24]. We found 1,096 genes partially or completely overlapped by ≥1 CNVs
or SVs in at least one HUI sample (Additional file 1: Fig. S3a and Additional file 2: Table S8)
and 37 genes consistently affected by novel events in all HUI genomes analyzed. Interestingly,
some of these large structural events alter the coding sequence of genes and thus may have a
potential functional impact (Additional file 1: Fig. S3b).
Global ancestry composition of HUI and Chilean Latino genomes
We used ADMIXTURE software [25] to determine ancestry composition of HUI genomes by
comparing a set of 105,252 SNVs shared with those present in the complete set of samples of the
1kGP-phase 3 (n = 2,504 individuals) [2]. In addition, we included a panel of 1,191 Chilean
Latinos genotyped by the Affymetrix Axiom World Array LAT 1, to represent the general Chilean
population. We ran ADMIXTURE from K = 1 to K = 15 models and obtained near minimum
values beyond K = 5 (CV error = 0.51976, Additional file 1: Fig. S4a) and a minimum cross-
validation error at K = 10 (CV error = 0.51868, Additional file 1: Fig. S4b). First, we explored a
continental model that considers 5 ancestry components (K = 5), which includes the super-
populations from European (EUR), East Asian (EAS), South Asian (SAS), African (AFR) and
admixed American (AMR) ancestries from 1kGP-phase 3 data. We found that HUI samples have
a high degree of admixed American ancestry (average = 93.8%) with a minimal contribution of
other founder populations (Fig. 2a, Top), which validates our ascertainment scheme for selection
of individuals to be sequenced. In agreement with recent data regarding the ancestry and complex
demographic history of South America [13], the Chilean Latino panel behaves as an admixed
.CC-BY-NC-ND 4.0 International licenseis made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It. https://doi.org/10.1101/252619doi: bioRxiv preprint
9
population with a strong genetic contribution of both EUR and AMR ancestries (average =
49.97% and 45.54%, respectively). Other AMR cohorts from 1kGP-phase 3, such as Peruvians
from Lima (PEL), Mexicans from Los Angeles (MXL), Colombians from Medellin (CLM) and
Puerto Ricans in Puerto Rico (PUR), showed important contributions from EUR and AMR and to
a lesser extent from AFR and other super-populations. Second, when we ran ADMIXTURE
considering the minimum cross-validation error (K = 10) we observed that all super-populations
split into two major components within each specific cluster, as described before for K = 5 (Fig.
2a, Bottom). Notably, a large component of the AMR ancestry in PEL, MXL, CLM and PUR
populations (dark gray) is not present in HUI genomes (average = 0.5%) and is marginally
represented in Chilean Latino individuals (average = 6.9%, compared with 76.2% in PEL, 42.9%
in MXL, 25.6% in CLM and 13.5% in PUR samples). These results suggest that HUI individuals
and the broader Chilean cohort derive this genetic component from shared Native American
ancestors with low genetic representation in other admixed American populations.
To further explore the genetic structure of the HUI cohort, we used the same set of samples and
SNVs and run a principal component analysis (PCA) using EIGENSTRAT [26] (Fig. 2b). PCA
results clearly defined world population structure, showing clusters composed of African (AFR),
European (EUR) and Asian (EAS and SAS) populations, while the admixed American
superpopulation (AMR) was widely distributed between EUR and HUI individuals. Consistent
with absence of significant recent admixture, our HUI cohort cluster in the PCA plot at one edge
of the AMR samples (red vs. gray dots, respectively). As expected, the largest genetic distance
existed between the AFR population and the rest of the groups. In turn, we observed that
genotyped Chilean Latino samples (black dots) spread in a relatively narrow cluster that begins at
the edge of HUI individuals and ends with individuals from the EUR founder population and
.CC-BY-NC-ND 4.0 International licenseis made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It. https://doi.org/10.1101/252619doi: bioRxiv preprint
10
overlapped most of the AMR cluster. However, in contrast to admixed Chilean Latinos, the AMR
super populations were widely distributed between EUR, AFR and HUI populations. These results
are in agreement with the admixture analysis showing that the Chilean Latino population exhibits
minimal genetic contribution of other population beyond EUR and HUI/AMR, in accordance to
the Chilean demographic history. Unlike other AMR populations with considerable contribution
or African/Asiatic immigration (i.e. PUR; average = 15.0% of AFR contribution) [7, 27, 28],
African and Asian ancestry in Chilean samples was almost negligible (average = 1.5% and 1.1%,
AFR and EAS + SAS, respectively).
Analysis of mitochondrial DNA showed that all HUI individuals belong to the Native American
haplogroups C and D, two of the major pan-continental founder haplogroups. The majority of
genomes sequenced (7 out of 11) belong to the C1b haplogroup and 6 of them were assigned to
the clade C1b13 (Additional file 1: Fig. S5a), which is a branch found mainly in the Southern
Cone of South America between 38° and 42°S [15, 29]. While the other 4 individuals belong to
the D haplogroup, 3 of them are in the D1g clade, which is found almost exclusively in the
central-southern part of Chile and Argentina, and only one is in the D4h3a clade (Additional file
1: Fig. S5b), found mainly in the Southern Patagonia [15, 29]. These results are in agreement with
the admixture data (K = 10, as described before) showing that the genetic component of the HUI
cohort differs from the genetic component of other Native American populations living in the
northern region of South America.
Highly divergent and novel variants with potential functional impact in HUI genomes
Genetic variation explaining differential susceptibility to diseases or metabolic conditions derives
mostly from studies in populations from developed countries, and more recently from genotyping
.CC-BY-NC-ND 4.0 International licenseis made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It. https://doi.org/10.1101/252619doi: bioRxiv preprint
11
admixed Latin American populations [12, 14, 30]. To identify highly divergent SNVs in HUI
individuals compared to other world populations, as a measure of differentiation due to population
structure, we determined pairwise fixation index (Fst) statistics [31] between HUI and 1kGP-
phase 3 populations from Africa (AFR: YRI, LWK, GWD, MSL, ESN, ASW and ACB), Europe
(EUR: CEU, IBS, GBR, FIN, TSI), America (AMR: CLM, PUR, MXL and PEL), Southern Asia
(SAS: PJL, GIH, BEB, STU and ITU) and East Asia (EAS: CHB, JPT, CHS, CDX and KHV) [2].
At the population level, weighted Fst statistics revealed that HUI individuals are genetically closer
to admixed American individuals (i.e. PEL, MXL, CLM and PUR), then with individuals from
Eastern and Southern Asian and finally with European and African populations (Fig. 3a), in
agreement with the settlement and population history of ancient Native Americans [7, 10].
Likewise, and at the variant level, Fst analyses allowed the identification of an average number of
165,746 SNVs surpassing the 95th percentile of the genome-wide Fst distribution, which was used
as a threshold of high divergence (HUI95th: average Fst cutoff of 0.51, vertical lines in Fig. 3b and
Additional file 1: Table S9). The union of all Fst variants yielded a total number of 842,780
HUI95th divergent SNVs, representing the whole genetic variability of HUI compared to other
1kG-P3 populations.
To identify HUI divergent variants with potential functional impact (VPFIs) we followed a two-
step variant/gene prioritization procedure, focused solely on exonic/exonic-splicing variants that
may result in nucleotide changes affecting protein folding, structure or stability. First, Fst-derived
HUI95th variants detected in 2 or more populations were queried for functional impact using
ANNOVAR [32] with the Combined Annotation Dependent Depletion (CADD) database [33],
which quantitatively prioritize functional, deleterious and disease causal variants. We detected a
total number of 529 intolerant HUI VPFIs annotated to 500 genes (Additional file 3: Table S10).
.CC-BY-NC-ND 4.0 International licenseis made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It. https://doi.org/10.1101/252619doi: bioRxiv preprint
12
Second, we searched for significant protein-protein interaction connectivity among these HUI
VPFIs containing genes using the online database resource Search Tool for the Retrieval of
Interacting Genes (STRING) [34], using only highest confidence interaction scores (0.9;
probability that a predicted link exists between two proteins), and recovered 115 gene products
that were significantly connected in multiple specific molecular networks (STRING Whole
Network p-value=2.2x10-10; Fig. 4). Gene-annotated HUI VPFIs were enriched in essential gene
ontology (GO) categories or pathways, including: extracellular matrix organization (GO.0030198,
p-value=7.0x10-23); axonemal dynein complex (GO.0005858, p-value=1.4x10-11); proteasome
complex (GO.0000502, p-value=2.3x10-10) and protein ubiquitination (GO.0016567, p-
value=3.9x10-6); glycerolipid metabolism (KEGG.00561, p-value=1.9x10-7) and lipoprotein
transport (GO.0042953, p-value=6.7x10-5); 90S preribosome (GO.0030686, p-value=8.1x10-5);
and cell adhesion molecules (KEGG.04514, p-value=0.00023), which are involved in the response
of the immune system to viral pathogens (Additional file 4: Table S11).
To detect novel VPFIs we filtered HUI variants through the two-step prioritization protocol
described above. Starting with 68,615 novel variants (HUI alleles ≥ 2, no singleton/no
monomorphic), 145 surpassed CADD, fell within exonic boundaries and annotated to same
number of genes (Additional file 5: Table S12). No significant protein-protein connectivity was
observed among these products (STRING, highest confidence = 0.9, PPI enrichment p-
value=0.303). Nonetheless, novel VPFIs increased the number and further connected the networks
of interacting proteins (STRING, highest confidence = 0.9, PPI enrichment p-value=2.2x10-9)
previously observed with Fst-derived HUI95th variants (Additional file 1: Fig. S6). Most notably,
the extracellular matrix organization network increased both the number of direct interacting
partners and also connected with genes from the glycerolipid metabolism and lipoprotein transport
.CC-BY-NC-ND 4.0 International licenseis made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It. https://doi.org/10.1101/252619doi: bioRxiv preprint
13
networks. Other enriched networks similarly increased interacting proteins and connections. We
finally analyzed overrepresentation of genes linked to diseases or disorders using available
information on DisGeNET database [35] and found that divergent and novel variants clustered in
networks associated with diseases, most notably in Hypercholesterolemia (umls:C0020443,
adj_pvalue=0,00073). Altogether, these findings reveal that HUI genomes contain highly
divergent and novel VPFIs that converge in essential ontological categories, which may contribute
to common traits, metabolic conditions or prevalent disorders in HUI Native Americans.
Discussion
We report here for the first time the complete genome sequences of a group of 11 Mapuche-
Huilliche individuals and describe common and novel SNVs and large-scale structural variants.
Global ancestry composition revealed that HUI genomes have a minimal contribution of
European, East and South Asian and African founder populations (K=5) and therefore represent
and original source of genetic variation for modern admixed individuals living in America. We
found that contemporary American populations, including admixed Latinos from Chile and PEL,
have a high component of a previously unknown ancestral genetic contribution identified only in
HUI genomes. Such contribution decreased in other Amerindian groups represented in MXL,
CLM and PUR populations, likely due to the demographic history of Central and North America.
Importantly, in these AMR populations we detected a large genetic component that was not
present in HUI genomes and that is also marginally represented in Chilean Latino individuals
(gray color in K=10 admixture model), suggesting that AMR ancestors once they reached and
settled in the Southern Cone remained isolated and did not mix with other ancestral groups
inhabiting the northern tip of South America until very recently, when Spanish conquerors arrived
.CC-BY-NC-ND 4.0 International licenseis made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It. https://doi.org/10.1101/252619doi: bioRxiv preprint
14
to America. Such idea is further supported by PCA analyses, mtDNA haplogrous, and Fst
analyses, which show that HUI individuals are positioned in a narrow cluster at the edge of the
distribution of AMR populations.
Native Americans, particularly those from the Southern Cone, are a neglected group in
population-based epidemiological studies and remain poorly investigated at the whole genome
level [4]. We found highly divergent and novel HUI VPFIs significantly enriched in essential
cellular metabolic processes that may help us to understand the differential susceptibility of
Mapuche-Huilliches to prevalent metabolic conditions. It has been observed that Native
Americans and Latin populations with Amerindian heritage exhibit a substantial predisposition to
dyslipidemias, diseases of the circulatory system and neoplasms [14, 36, 37]. We note that the
most significant network detected in our study assembled around genes carrying VPFIs in
extracellular matrix molecules, glycerolipid metabolism and lipoprotein transport components.
Since lipid metabolic pathways tightly control cholesterol homeostasis, it is likely that functional
variants altering these cellular processes are involved in triggering lipid-related metabolic
disorders in admixed Latin American individuals with Mapuche ancestry.
In summary, here we present the genome sequence of a marginally admixed Native American
population from the Southern Cone. HUI genomes show highly divergent large-scale structural
and single nucleotide variants that may contribute to modulate susceptibility or resistance to
develop complex common diseases in the Chilean and Native American populations. Our study
represents an important resource providing a reference panel for Native American populations to
be used in future population-based studies on traits of interest (i.e. GWAS with a Native American
SNV panel), as well as in the design of early diagnostics and prevention tools.
.CC-BY-NC-ND 4.0 International licenseis made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It. https://doi.org/10.1101/252619doi: bioRxiv preprint
15
Methods
Whole-genome sequencing
Mapuche-Huilliche individuals from Huapi Island are part of an ongoing longitudinal
ultrasonographic study on the prevalence and risk factors of common metabolic diseases in Chile.
Mapuche-Huilliche DNA samples were sequenced with the combinatorial probe-anchor ligation
sequencing process of Complete Genomics [19]. The standard Complete Genomics bioinformatics
pipeline (Assembly Pipeline version 1.10 and CGA Tools 1.4) was used for sequence alignment,
read mapping, assembly, and variant call. Human reference genome used was GRCh37.p5
(ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/).
Genotyping of admixed Chilean Latinos
We used a Chilean Latino panel composed of individuals belonging to the ANCORA family
health centers located in Santiago-Chile (La Florida and La Pintana) that constitutes an admixed
(European-Amerindian) population aged between 20-80 years old from an urban area and
representative of the Chilean general population. All subjects were genotyped under the AXIOM®
Genome-Wide Platform (version LAT 1) using the GeneTitan® Multi-Channel (MC) Instrument
following manufacturer instructions. Samples with discordant sex, elevated missing genotypes rate
(≥0.03) or outlying heterozygosity rate (>3SD) were excluded.
1000 Genomes Project Phase 3 samples used in our study
The complete dataset from the 1kGP-phase 3 was obtained through the NCBI FTP site
(http://www.ncbi.nlm.nih.gov/public/).
.CC-BY-NC-ND 4.0 International licenseis made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It. https://doi.org/10.1101/252619doi: bioRxiv preprint
16
Identification of total SNVs and Indels
Genomic variants were obtained from MasterVar archives delivered by Complete Genomics
(MasterVar file description in
http://www.completegenomics.com/documents/DataFileFormats_Standard_Pipeline_2.5.pdf).
Only high quality reads were used (low_coverage and half variants were filtered). We registered
zygosity for every genomic variation: homozygous variations (both alleles are the same and are
different from the reference), heterozygous-reference variations (one of the alleles is different
from the reference) and heterozygous-alternative variations (both alleles are different and are
different from the reference). Variant Call Format files (VCF) were generated from all 11
MasterVar files using the CGAtools software v.1.8 (http://cgatools.sourceforge.net/). Novel
SNVs, Insertions and Deletions were defined as genomic variants that are not included in dbSNP
build 144 (http://www.ncbi.nlm.nih.gov/SNP) or that have no frequency reported in the 1kGP-
phase 3 Database (http://www.1000genomes.org/home) [2], the Exome Sequencing Project
(http://evs.gs.washington.edu/EVS/), the Exome Aggregation Consortium
(http://exac.broadinstitute.org/), and 46 whole-genome sequences from the Complete Genomics
public data (http://www.completegenomics.com/public-data/), which were extracted using the
ANNOVAR software [38]. Circular representation of SNVs, CNVs and SVs across HUI genomes
was drawn with Circos [39].
Validation of variants
Variant calling was validated by microarray genotyping using the Illumina Infinium® Human
Core Exome BeadChip. The chip consisted in 538,448 variants of which 537,385 are SNVs and
1,063 are indels; 263,929 variants fall in exons. Genotyping study was performed in 9 of the 11
.CC-BY-NC-ND 4.0 International licenseis made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It. https://doi.org/10.1101/252619doi: bioRxiv preprint
17
whole-genome sequenced individuals: GS000011194, GS000011195, GS000011196,
GS000011198, GS000011200, GS000011201, GS000011215, GS000020403 and GS000020711.
The comparison was performed using Variant Call Format (VCF) files which were generated from
the Illumina raw genotyping data (Final Report format) taking the genomic positions
(chromosome, base pair), the reference allele for the SNV extracted from the NCBI GRCh build
37 reference human genome and the alternative allele from the beadchip annotation data provided
by Illumina. The genotype for each SNV (reference homozygote: Hom Ref; reference
heterozygote: Het-Ref and alternative homozygote: Hom Alt) was obtained taking the Allele1 and
Allele2 Plus information from the raw genotypes. Variants with a Gene Call score (GC) equal or
above 0.15 were taken as confident calls, as reported elsewhere [40, 41]. The concordance
percentage (Conc%) for each individual was obtained taking the number of matching variants and
calculating the percentage according to the total number of matching positions.
Identity by descent (IBD) and inbreeding analysis
The IBD proportion for each pair of the 11 HUI individuals was calculated in PLINK v.1.9, using
a subset of 391,284 SNVs that had > 90% call-rate and were shared between HUI and 1kGP-phase
3 individuals as described before [42]. Linkage disequilibrium (LD) pruning was applied with
PLINK so that no pair of SNPs within a 50 SNPs window present an r2 value greater than 0.2.
Likewise, inbreeding coefficients were calculated in PLINK v.1.9 for each HUI individual using
the same subset of LD-pruned SNVs. IBD and Inbreeding analyses shows little cryptic
relationships among the samples used in this work. IBD results show no duplicate individuals of
first-degree relationships among our sample (IBD > 0.5). Only two couples of individuals
(GS000011194-GS000011201 and GS0000111-GS000012210) show second-degree relationships
.CC-BY-NC-ND 4.0 International licenseis made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It. https://doi.org/10.1101/252619doi: bioRxiv preprint
18
(IBD 0.25, 0.23, respectively). The inbreeding F coefficient analysis shows only 1 individual
above the cutoff of -0.12 (GS000012242, F= -0.14). Negative inbreeding values indicate an excess
of heterozygosity or “outbreeding” which could happen because of sample contamination,
admixture, or genotyping errors. In this case, this particular HUI sample has the highest European
(16.5%) and African ancestries (4,4%) among all 11 individuals. These results indicate HUI
individuals sequenced are not closely related or inbred.
Identification of structural variants
Large-scale structural variants (SVs) were identified by two independent methods: The first one is
specifically designed to detect copy number variants (CNVs) based on sequence coverage among
samples using a Complete Genomics Hidden Markov Model (HMM) that detects significant
abnormal coverage over sliding windows. Under the assumption that a sample is diploid, the
method can determine if a genomic segment behaves as a “Gain” or “Loss” in comparison to the
reference genome (GRCh37.p5) providing Ploidy -number of times the genomic segment in
present- and a PHRED-like score which denotes the confidence of the call (computed as -10*log10
of the probability of the assigned call being wrong). When whole genome coverage variability is
greater than expected, the sample is assigned to a “no-call” state, which impede CNV analysis.
The second method uses the CGATOOLS junctions2events pipeline and is based on junctions
analysis – defined as regions of the genome where sequences are not adjacent or in the same
orientation as present in the reference genome – that rationalize junctions sets in to the following
event types: Deletions, Distal Duplications, Tandem Duplications, Inversions and Translocations.
Novel CNVs and SVs events were defined as a variable genomic segment present in at least one
sample that did not intersect with any region reported in the inclusive map of the Database of
.CC-BY-NC-ND 4.0 International licenseis made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It. https://doi.org/10.1101/252619doi: bioRxiv preprint
19
Genomic Variants [23] or any of the structural variants reported in the latest release of the 1000
genomes project [24]. Genes affected by structural variants were annotated using Refseq genes
from the UCSC table browser and classified into the following categories regarding their overlap
context with a structural variant: 5-Prime, 3-Prime, Internal, Complete and Chimeric.
ADMIXTURE and Principal Component Analysis (PCA)
We selected a subset of 105,252 SNVs with common and eligible genotypes within the 2,504
unrelated samples from the 1000 genomes project phase 3 population and the Chilean panel
composed of 1,191 samples giving a final total of 3,706 samples to be analyzed. LD pruning was
performed with PLINK so that no pair of SNPs within a 50 SNPs window present an r2 value
greater than 0.2. This dataset was then introduced to the ADMIXTURE [26] software under
default parameters exploring from K = 1 to K = 15 models. Cross validation errors values (CV
error) were extracted from each iteration and plotted with R statistical software. Next, this dataset
was used to perform a principal component analysis to model ancestry differences between
populations using smartpca from EIGENSOFT 5.0 software with default settings [43, 44].
Maternal ancestry analysis
The complete sequences of mitochondrial DNA of the 11 Mapuche-Huilliches were obtained from
the sequencing performed by Complete Genomics. Additionally, 1,016 bp corresponding to the
mtDNA control region (rCRS positions 16032-16544 and 051-555) were amplified, purified and
Sanger sequenced by Macrogen, South Korea as described [15, 29]. Sequences were aligned and
edited with Alignment Explorer (MEGA 4.0)[45]. There was complete concordance between
Sanger sequencing and mtDNA genome sequences provided by Complete Genomics.
.CC-BY-NC-ND 4.0 International licenseis made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It. https://doi.org/10.1101/252619doi: bioRxiv preprint
20
Polymorphisms were confirmed directly using Sequencher 4.9 vDemo (http://genecodes.com/).
Sequences were grouped by mitochondrial haplogroup and analyses were performed separately.
The results were confirmed by comparison with mtDNA tree Build 15 (rCRS-oriented version of
Build 15) available on the PhyloTree.org website. Mitochondrial DNA haplogroups from different
Native-Chilean and other southern South American populations have been described elsewhere
[15, 29]. Calculations were performed using the Network 4.5.0 program
(www.fluxusengineering.com/sharenet_rn.htm); median joining and maximum parsimony were
used as post-processing options.
Fixation index (Fst) analysis in HUI genomes
Fst analysis was performed using the Weir & Cockerham's Fst estimator (wcFst) [31] function
inside VCFTools software V.0.1.12b [46]. We obtained Fst statistics at population (weighted Fst)
and at SNV level from the comparison between HUI and all individuals from each of the 26
populations of the 1kGP-phase 3: Africa (AFR: YRI, LWK, GWD, MSL, ESN, ASW and ACB),
Europe (EUR: CEU, IBS, GBR, FIN and TSI), America (AMR: CLM, PUR, MXL, and PEL),
Southern Asia (SAS: PJL, GIH, BEB, STU and ITU) and East Asia (EAS: CHB, JPT, CHS, CDX
and KHV). Briefly, to get genotype homogeneity between all HUI individuals, we merged no
singleton and no monomorphic SNVs with call rate above 90% with the whole panel of genotypes
from 1kGP-phase 3 (also filtered using Vcftools V.0.1.12b) to get a set of common variants
between all populations. In this subset of variants, we calculated the number of divergent SNVs
per population using the genome-wide Fst distribution and setting a 95th percentile cutoff as a
divergence threshold. Violin plots for the population level Fst distributions were done using the
ggplot2 R software [47, 48].
.CC-BY-NC-ND 4.0 International licenseis made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It. https://doi.org/10.1101/252619doi: bioRxiv preprint
21
Identification of VPFIs with CADD, STRING networks
Genetic variants were queried for functional impact using ANNOVAR [32] with the Combined
Annotation Dependent Depletion (CADD) database [33]. We selected variants surpassing the
Phred-Scale score of 20, which represent the top 1% of most deleterious substitutions, and defined
them as Variants with Potential Functional Impact (VPFI). Protein-protein interaction connectivity
among products encoded by divergent and novel HUI VPFIs was examined using the online
database resource Search Tool for the Retrieval of Interacting Genes (STRING) [34]. Significant
networks were visualized using Cytoscape 3.5.1 (http://www.cytoscape.org).
Declarations
Ethics approval and consent to participate
Mapuche-Huilliche individuals from Huapi Island are part of an ongoing longitudinal
ultrasonographic study on the prevalence and risk factors of common metabolic diseases in Chile.
Informed consent for the study of genetic and metabolic risk factors for prevalent metabolic
diseases was obtained from all studied participants in years 1993 and/or 2001 [49, 50]. Oral and
written informed consent from the legal representatives of Huapi Island Mapuche-Huilliche
community for whole genome sequencing in some members of the community was obtained in
January 2012. The present study was approved by the Institutional review Board for Human
Studies of the Faculty of Medicine at Pontificia Universidad Católica de Chile.
.CC-BY-NC-ND 4.0 International licenseis made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It. https://doi.org/10.1101/252619doi: bioRxiv preprint
22
Availability of data and material
Whole genome sequencing data used in this manuscript is available at the Sequence Read Archive
(SRA, https://www.ncbi.nlm.nih.gov/sra), under accession number SUB2180735, BioProject ID at
http://www.ncbi.nlm.nih.gov/bioproject/358028.
Competing interests
The authors declare no conflict of interest.
Funding
This work was funded by grants from Fondo de Areas Prioritarias (FONDAP) Center for Genome
Regulation (number 1509000), FONDAP Center for intercultural and indigenous research
(number 15110006), Fondo Nacional de Desarrollo Científico y Tecnológico (FONDECYT)
numbers 1130303 to J.F.M., 1140353 to G.V.D. and 1120813 and 1160833 to S.E.
Authors' contributions
RAG, JFM, and GVD designed the study; EAV, TCM, BIB, EP-P, CM, AM, LA, DS, ER, MV,
AD and MM developed analytical tools, performed analysis and interpretation of the data. KP,
PN, SB, JH, CH, CM, AR-J, DT, FV, PV, MM, SE, analyzed the data. MLA, VC, MG, MM, AO
and AM helped in drafting the manuscript. EAV, RAG, JFM and GVD wrote the paper.
.CC-BY-NC-ND 4.0 International licenseis made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It. https://doi.org/10.1101/252619doi: bioRxiv preprint
23
Acknowledgements
The authors thank the Mapuche-Huilliche community of Huapi Island (Ranco Lake), without
whose participation and encouragement this work would not have been made possible.
.CC-BY-NC-ND 4.0 International licenseis made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It. https://doi.org/10.1101/252619doi: bioRxiv preprint
24
References
1. Altshuler DM, Gibbs RA, Peltonen L, Altshuler DM, Gibbs RA, Peltonen L, Dermitzakis E, Schaffner SF, Yu F, Peltonen L, et al: Integrating common and rare genetic variation in diverse human populations. Nature 2010, 467:52-58.
2. Altshuler DM, Durbin RM, Abecasis GR, Bentley DR, Chakravarti A, Clark AG, Donnelly P, Eichler EE, Flicek P, Gabriel SB, et al: A global reference for human genetic variation. Nature 2015, 526:68-+.
3. Lek M, Karczewski KJ, Minikel EV, Samocha KE, Banks E, Fennell T, O'Donnell-Luria AH, Ware JS, Hill AJ, Cummings BB, et al: Analysis of protein-coding genetic variation in 60,706 humans. Nature 2016, 536:285-291.
4. Harris K, Pritchard JK: Rapid evolution of the human mutation spectrum. Elife 2017, 6. 5. Rasmussen M, Li Y, Lindgreen S, Pedersen JS, Albrechtsen A, Moltke I, Metspalu M, Metspalu
E, Kivisild T, Gupta R, et al: Ancient human genome sequence of an extinct Palaeo-Eskimo. Nature 2010, 463:757-762.
6. Raghavan M, DeGiorgio M, Albrechtsen A, Moltke I, Skoglund P, Korneliussen TS, Gronnow B, Appelt M, Gullov HC, Friesen TM, et al: The genetic prehistory of the New World Arctic. Science 2014, 345:1255832.
7. Reich D, Patterson N, Campbell D, Tandon A, Mazieres S, Ray N, Parra MV, Rojas W, Duque C, Mesa N, et al: Reconstructing Native American population history. Nature 2012, 488:370-374.
8. Gravel S, Zakharia F, Moreno-Estrada A, Byrnes JK, Muzzio M, Rodriguez-Flores JL, Kenny EE, Gignoux CR, Maples BK, Guiblet W, et al: Reconstructing Native American migrations from whole-genome and whole-exome data. PLoS Genet 2013, 9:e1004023.
9. Szpiech ZA, Xu J, Pemberton TJ, Peng W, Zollner S, Rosenberg NA, Li JZ: Long runs of homozygosity are enriched for deleterious variation. Am J Hum Genet 2013, 93:90-102.
10. Raghavan M, Steinrucken M, Harris K, Schiffels S, Rasmussen S, DeGiorgio M, Albrechtsen A, Valdiosera C, Avila-Arcos MC, Malaspinas AS, et al: POPULATION GENETICS. Genomic evidence for the Pleistocene and recent population history of Native Americans. Science 2015, 349:aab3884.
11. Verdu P, Pemberton TJ, Laurent R, Kemp BM, Gonzalez-Oliver A, Gorodezky C, Hughes CE, Shattuck MR, Petzelt B, Mitchell J, et al: Patterns of admixture and population structure in native populations of Northwest North America. PLoS Genet 2014, 10:e1004530.
12. Eyheramendy S, Martinez FI, Manevy F, Vial C, Repetto GM: Genetic structure characterization of Chileans reflects historical immigration patterns. Nat Commun 2015, 6.
13. Homburger JR, Moreno-Estrada A, Gignoux CR, Nelson D, Sanchez E, Ortiz-Tello P, Pons-Estel BA, Acevedo-Vasquez E, Miranda P, Langefeld CD: Genomic Insights into the Ancestry and Demographic History of South America. PLoS Genet 2015, 11:e1005602.
14. Lorenzo Bermejo J, Boekstegers F, Gonzalez Silos R, Marcelain K, Baez Benavides P, Barahona Ponce C, Muller B, Ferreccio C, Koshiol J, Fischer C, et al: Subtypes of Native American ancestry and leading causes of death: Mapuche ancestry-specific associations with gallbladder cancer risk in Chile. PLoS Genet 2017, 13:e1006756.
15. de Saint Pierre M, Bravi CM, Motti JMB, Fuku N, Tanaka M, Llop E, Bonatto SL, Moraga M: An alternative model for the early peopling of southern South America revealed by analyses of three mitochondrial DNA haplogroups. PLoS ONE 2012, 7:e43486.
16. Rocco P, Morales C, Moraga M, Miquel JF, Nervi F, Llop E, Carvallo P, Rothhammer F: Composición genética de la población chilena: distribución de polimorfismos de DNA
.CC-BY-NC-ND 4.0 International licenseis made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It. https://doi.org/10.1101/252619doi: bioRxiv preprint
25
mitocondrial en grupos originarios y en la población mixta de Santiago. Rev Med Chil 2002, 130:125-131.
17. Cifuentes L, Morales R, Sepúlveda D, Jorquera H, Acuña M: DYS19 and DYS199 loci in a Chilean population of mixed ancestry. Am J Phys Anthropol 2004, 125:85-89.
18. Morales E, Azocar L, Maul X, Perez C, Chianale J, Miquel JF: The European lactase persistence genotype determines the lactase persistence state and correlates with gastrointestinal symptoms in the Hispanic and Amerindian Chilean population: a case–control and population-based study. BMJ Open 2011, 1.
19. Drmanac R, Sparks AB, Callow MJ, Halpern AL, Burns NL, Kermani BG, Carnevali P, Nazarenko I, Nilsen GB, Yeung G, et al: Human genome sequencing using unchained base reads on self-assembling DNA nanoarrays. Science 2010, 327:78-81.
20. 21. Exome Aggregation Consortium (ExAC) C, MA. Available: http://exac.broadinstitute.org.
Accesed: November 2015. 22. Genome of the Netherlands C: Whole-genome sequence variation, population structure and
demographic history of the Dutch population. Nat Genet 2014, 46:818-825. 23. Zarrei M, MacDonald JR, Merico D, Scherer SW: A copy number variation map of the human
genome. Nature Reviews Genetics 2015, 16:172-183. 24. Sudmant PH, Rausch T, Gardner EJ, Handsaker RE, Abyzov A, Huddleston J, Zhang Y, Ye K,
Jun G, Fritz MHY, et al: An integrated map of structural variation in 2,504 human genomes. Nature 2015, 526:75-+.
25. Loh P-R, Lipson M, Patterson N, Moorjani P, Pickrell JK, Reich D, Berger B: Inferring admixture histories of human populations using linkage disequilibrium. Genetics 2013, 193:1233-1254.
26. Alexander DH, Novembre J, Lange K: Fast model-based estimation of ancestry in unrelated individuals. Genome Res 2009, 19:1655-1664.
27. Silva-Zolezzi I, Hidalgo-Miranda A, Estrada-Gil J, Fernandez-Lopez JC, Uribe-Figueroa L, Contreras A, Balam-Ortiz E, del Bosque-Plata L, Velazquez-Fernandez D, Lara C, et al: Analysis of genomic diversity in Mexican Mestizo populations to develop genomic medicine in Mexico. Proc Natl Acad Sci U S A 2009, 106:8611-8616.
28. Zhou D, Udpa N, Ronen R, Stobdan T, Liang J, Appenzeller O, Zhao Huiwen W, Yin Y, Du Y, Guo L, et al: Whole-Genome Sequencing Uncovers the Genetic Basis of Chronic Mountain Sickness in Andean Highlanders. Am J Hum Genet 2013, 93:452-462.
29. de Saint Pierre M, Gandini F, Perego UA, Bodner M, Gómez-Carballa A, Corach D, Angerhofer N, Woodward SR, Semino O, Salas A, et al: Arrival of paleo-indians to the southern cone of south America: new clues from mitogenomes. PLoS ONE 2012, 7:e51311.
30. Welter D, MacArthur J, Morales J, Burdett T, Hall P, Junkins H, Klemm A, Flicek P, Manolio T, Hindorff L: The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic acids research 2014, 42:D1001-D1006.
31. Weir BS, Cockerham CC: Estimating F-Statistics for the Analysis of Population Structure. Evolution 1984, 38:1358-1370.
32. Wang K, Li M, Hakonarson H: ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res 2010, 38:e164.
33. Kircher M, Witten DM, Jain P, O'Roak BJ, Cooper GM, Shendure J: A general framework for estimating the relative pathogenicity of human genetic variants. Nat Genet 2014, 46:310-315.
34. Szklarczyk D, Morris JH, Cook H, Kuhn M, Wyder S, Simonovic M, Santos A, Doncheva NT, Roth A, Bork P, et al: The STRING database in 2017: quality-controlled protein-protein association networks, made broadly accessible. Nucleic Acids Res 2017, 45:D362-D368.
.CC-BY-NC-ND 4.0 International licenseis made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It. https://doi.org/10.1101/252619doi: bioRxiv preprint
26
35. Piñero J, Queralt-Rosinach N, Bravo À, Deu-Pons J, Bauer-Mehren A, Baron M, Sanz F, Furlong LI: DisGeNET: a discovery platform for the dynamical exploration of human diseases and their genes. Database 2015, 2015.
36. LaRosa JC, Brown CD: Cardiovascular risk factors in minorities. The American journal of medicine 2005, 118:1314-1322.
37. Aguilar-Salinas CA, Canizales-Quinteros S, Rojas-Martínez R, Mehta R, Villarreal-Molina MT, Arellano-Campos O, Riba L, Gómez-Pérez FJ, Tusié-Luna MT: Hypoalphalipoproteinemia in populations of Native American ancestry: an opportunity to assess the interaction of genes and the environment. Current Opinion in Lipidology 2009, 20:92-97.
38. Wang K, Li MY, Hakonarson H: ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Research 2010, 38:7.
39. Krzywinski M, Schein J, Birol İ, Connors J, Gascoyne R, Horsman D, Jones SJ, Marra MA: Circos: an information aesthetic for comparative genomics. Genome Res 2009, 19:1639-1645.
40. Yokoyama J, Erdman C, Hamilton S: Array-based whole-genome survey of dog saliva DNA yields high quality SNP data. PLoS ONE 2010, 5:e10809.
41. Li G, Gelernter J, Kranzler HR, Zhao H: M3: an improved SNP calling algorithm for Illumina BeadArray data. Bioinformatics 2012, 28:358-365.
42. Lohmueller KE, Sparso T, Li Q, Andersson E, Korneliussen T, Albrechtsen A, Banasik K, Grarup N, Hallgrimsdottir I, Kiil K, et al: Whole-exome sequencing of 2,000 Danish individuals and the role of rare coding variants in type 2 diabetes. Am J Hum Genet 2013, 93:1072-1086.
43. Patterson N, Price A, Reich D: Population structure and eigenanalysis. PLoS Genet 2006, 2:e190. 44. Ronquist F, Teslenko M, van der Mark P, Ayres DL, Darling A, Höhna S, Larget B, Liu L,
Suchard MA, Huelsenbeck JP: MrBayes 3.2: efficient bayesian phylogenetic inference and model choice across a large model space. Syst Biol 2012, 61:539-542.
45. Tamura K, Dudley J, Nei M, Kumar S: MEGA4: molecular evolutionary genetics analysis (MEGA) software version 4.0. Mol Biol Evol 2007, 24:1596-1599.
46. Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, Handsaker RE, Lunter G, Marth GT, Sherry ST, et al: The variant call format and VCFtools. Bioinformatics 2011, 27:2156-2158.
47. Wickham H: ggplot2: elegant graphics for data analysis. 2009. 48. Team RDC: R: A language and environment for statistical computing. Vienna, Austria: The R
Foundation for Statistical Computing; 2015. 49. Miquel JF, Covarrubias C, Villaroel L, Mingrone G, Greco AV, Puglielli L, Carvallo P, Marshall
G, Pino GD, Nervi F: Genetic epidemiology of cholesterol cholelithiasis among Chilean Hispanics, Amerindians, and Maoris. Gastroenterology 1998, 115:937-946.
50. Krawczyk M, Lütjohann D, Schirin-Sokhan R, Villarroel L, Nervi F, Pimentel F, Lammert F, Miquel JF: Phytosterol and cholesterol precursor levels indicate increased cholesterol excretion and biosynthesis in gallstone disease. Hepatology 2012, 55:1507-1517.
.CC-BY-NC-ND 4.0 International licenseis made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It. https://doi.org/10.1101/252619doi: bioRxiv preprint
27
Figure Legends
Fig. 1. Genetic and structural variants in Mapuche-Huilliche genomes.
Circos plot of the spatial distribution of SNV densities (i), deletions and insertions (ii), structural
variant (SV) loses and gains (iii), copy number variant (CNV) losses and gains (iv), inversions (v)
and translocations (vi). Light or dark colors in different tracks indicate known or novel variants,
respectively. Tandem (red lines) and distal duplications (blue arrows) are shown within the inner
circle of the plot. Translocation events are shown as green arrows.
Fig. 2. Ancestry analysis of HUI and Chilean Latino individuals.
(a) ADMIXTURE plots for K = 5 (Continental model) and K = 10 (minimum error model). All
3,706 samples included are depicted as vertical thin bars colored by their corresponding ancestry
percentage. HUI genomes are highlighted at the left with thicker bars followed by Chilean Latino
genotyped individuals and samples included in 1kGP-phase 3, which are clustered in 5 super-
populations (AMR, EUR, EAS, SAS and AFR). For K = 5, the colors were defined as follows:
Red for “Amerindian”, yellow for “European”, blue for “East Asian”, green for “South Asian” and
purple for “African”. For K = 10, light colors are used to show subcomponents within super-
populations EUR, EAS, SAS and AFR. Grey color is used to represent the AMR component
common to PEL, MXL, CLM and PUR populations but almost absent in HUI. Bottom thick bars
define key colors used in the PCA. (b) Principal Component (PC) analysis including the same set
of samples (colored dots) and markers. Color legend and number of samples belonging to each
super population defined in (a) is provided in the legend inside brackets. Left Panel: PC1 vs. PC2,
.CC-BY-NC-ND 4.0 International licenseis made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It. https://doi.org/10.1101/252619doi: bioRxiv preprint
28
right panel: PC3 vs. PC4. Percentage of variance explained by each component is given in
parenthesis in the corresponding axis.
Fig. 3. Highly divergent SNVs in HUI genomes.
(a) World map showing all 26 populations from 1kGP-phase 3 coming from the 5 super
populations (AFR, SAS, EAS, EUR and AMR) and their Weir and Cockerham’s Fst statistic
(weighted Fst) from yellow to red according to their lower or higher divergence obtained from the
comparison with the HUI sequenced individuals. (b) Violin plots comparing SNV density between
HUI and other 26 populations from 1kGP-phase 3. Fst distributions are sorted by decreasing
genetic divergence from HUI (top to bottom). Vertical bars on each population plot indicate 95th
percentile cutoff used to identify highly divergent SNVs. SuperPop = Super populations from
1kGP-phase 3: AFR = Africans, AMR = Admixed Americans, ASN = Asians, EUR= Europeans.
Fig. 4. Functional networks between divergent VPFIs detected in Mapuche-Huilliches.
Each node in the network represents a gene containing at least 1 highly divergent Fst variant. The
scale of green colors (light to dark green) represents the number of 1kGP-phase 3 populations
(less to more, respectively) where the variant is divergent. Node size (small to large) indicates the
number of alleles present in the sequenced HUI individuals (less to more, respectively).
Interactions observed between genes correspond to the information retrieved from the STRING
database. Subnetwork names in grey clouds correspond to the most significant
ontological/pathway categories obtained from the respective databases.
.CC-BY-NC-ND 4.0 International licenseis made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It. https://doi.org/10.1101/252619doi: bioRxiv preprint
10 20 40 60 80 100
120
140
160
180
200
220
240
2
020
4060
80
100
120
140
160
180
200
220
240
3
020
4060
80100
120
140
160
180
4
020406080100120
140160
180
5
020
4060
80100
120140
160180
6
020
4060
80100120140160
7
020406080100120140
8
020406080100120041
9
020406080100
120
14010
02040608010012
0
11 02040
608010
0120
120
2040
6080
100120
13
020
4060
80100
14
020
406080
100
15
020406080
100
16
02040
6080
17
0
2040
6080
18
020
4060
19
020
40
20
020
4060
21
020
40
22
020
40
X
020
4060
80100
120140
i
ii
iii
ivvvi
i SNV densityii Deletions and Insertionsiii SV losses a gainsiv CNV losses and gainsv Inversionsvi Translocations
TranslocationsDistal duplicationsTandem duplications
Vidal_Fig. 1
.CC-BY-NC-ND 4.0 International licenseis made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It. https://doi.org/10.1101/252619doi: bioRxiv preprint
aK=5
Anc
estry
(%)
PEL MXL CLM PURHUI EUR SASEAS AFRCHI
Anc
estry
(%)
K=10
AFR (661)AMR (347)EAS (504)EUR (503)SAS (489)HUI (11)CHI (1191)
bPCA Results PC3 vs PC4
−0.10 −0.08 −0.06 −0.04 −0.02 0.00 0.02Principal Component 4 (6,77%)
PCA Results PC1 vs PC2
Principal Component 1 (48,67%)−0.03 −0.02 −0.01 0.00 0.01 0.02
−0.0
3−0
.02−
0.01
0.00
0.01
0.02
0.03
Prin
cipa
l Com
pone
nt 2
(28,
07%
)
−0.0
4−0
.02
0.00
0.02
Prin
cipa
l Com
pone
nt 3
(11,
03%
)
Vidal_Fig. 2
.CC-BY-NC-ND 4.0 International licenseis made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It. https://doi.org/10.1101/252619doi: bioRxiv preprint
0.0 0.2 0.4 0.6
PELMXLCLMPURBEBJPT
CHBCHSPJL
KHVSTUITU
CDXGIHFIN
GBRCEUIBSTSI
ASWACBLWKMSLESNYRI
GWD
Popu
latio
ns
Fst
SuperPop
AFR AMREASEUR SAS
AMR EASSAS EUR AFR0 0.22
Fst
HUI
HUI
a b
Vidal_Fig. 3
.CC-BY-NC-ND 4.0 International licenseis made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It. https://doi.org/10.1101/252619doi: bioRxiv preprint
Gene Size
Number of HUI Alleles2 21
Protein ubiquitination (BP)
COL6A3
COL27A1
COL6A5
COL11A1COL22A1
COL5A3
COL23A1
LAMA5
ITGA6
ITGAV
PLAUR
NID1
ITGA1
MMP15
SHB
ACAN
COL2A1
SPHK1
TNC
FBN3
CERS4
COL24A1
ABCA1
OBSCN
NEB
HELZ2
MYH14
TTN
TRIB3CYP1A1
LRR1
HERC3
LTN1
HERC6
SPSB2
PSMD9
PSMD5
ASXL1
PSMB4
PSMB3
CEP192
CEP63
SFI1
PSMD8
UGT2A3
KL
SCNN1D
INHBC
ERBB2IP
IMMT
CLSPN
NRG1
ACVR1C
WNK2
EXO1
CDH23
MUC6
MUC19
HSD17B4
CHCHD6
PCDH15
ECH1
ASZ1
PLXNA2
NRP1
SYNE2
SYNE1
MOV10L1
CEBPZ
GRWD1
HIST3H2A
PHF21A
LIG4
NOP9
PES1
MPHOSPH10
WDR18
RRP7A
DNAH1
DNAH8
DNAH7
DNAH5
DNAH2
LMO7
IQCB1
PNLIPRP3
PLD1
LPIN1
SCAMP2
PNPLA3LIPF
GAK
MSR1
LRP2
HSPA1L
APOB
SYNJ1RBP1
NUP153
CUBN
Lipoprotein transport (BP)
Glycerolipid metabolism (KEGG)
Extracellular matrix organization (BP)
ICAM1
SELE PTPN22
TRIM45
HLA−DPB1
IRF6GBP1
IL27
TRIM31
GTF2E1
STAT2
HLA−DPA1
Cell adhesion molecules (KEGG)
Steroid hormone biosynthesis (KEGG)
Centriole (CC)
90S preribosome (CC)
Axonemal dynein complex (CC)
Proteasome complex (CC)
Chromosome organization (BP)
Number of edges = 174Expected number of edges = 104STRING PPI enrichment p-value = 2.21x10
Number of nodes = 496
-10
Gene Color
Divergent populations2 25
Vidal_Fig. 4
.CC-BY-NC-ND 4.0 International licenseis made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It. https://doi.org/10.1101/252619doi: bioRxiv preprint
Table 1. Details of HUI individuals selected for this study and genome sequencing statistics.
Assembly ID Sex Mapuche surnames
Age at screening
Called genome fraction
Genome coverage
30X
Exome coverage
30X
Mapping yield (Gb)
GS000011194 F 4 22 0.97 0.87 0.98 180.69 GS000011195 F 4 47 0.97 0.87 0.98 181.74 GS000011196 F 4 42 0.97 0.85 0.98 180.88 GS000011198 F 4 46 0.97 0.84 0.98 180.61 GS000011200 F 4 18 0.97 0.84 0.98 177.51 GS000011201 F 4 53 0.96 0.83 0.98 175.46 GS000011215 F 4 43 0.97 0.88 0.98 180.44 GS000012210* M 4 49 0.96 0.86 0.99 266.52 GS000012242* F 4 27 0.96 0.92 0.99 357.46 GS000020403 F 4 39 0.97 0.82 0.98 151.00 GS000020711 F 3 60 0.97 0.84 0.98 177.72
Called genome fraction: Fraction of the reference genome with full (diploid) calls in the sequenced sample following assembly; Genome coverage 30X: Fraction of the reference genome bases where coverage is greater than or equal to 30X; Exome coverage 30X: Fraction of the reference exome bases where coverage is greater than or equal to 30X; Mapping yield: Total base-pairs of sequence reads mapped to the reference genome. Samples marked with an asterisk were sequenced twice, thus they have greater mapping yields and average coverage than the other samples.
.CC-BY-NC-ND 4.0 International licenseis made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It. https://doi.org/10.1101/252619doi: bioRxiv preprint
Table 2. Total variants identified in HUI individuals listed by chromosome.
Chr SNVs Novel SNVs
MNVs Novel MNVs
Insertions Novel insertions
Deletions Novel deletions
1 445,194 30,238 18,046 9,526 34,697 7,135 35,957 4,355 2 462,032 32,788 19,053 10,377 34,634 6,997 35,359 4,264 3 393,966 28,174 17,287 9,587 28,805 5,536 29,884 3,503 4 422,384 28,848 17,561 9,207 29,285 5,205 31,284 3,257 5 354,460 27,064 14,966 8,410 26,126 4,974 26,884 3,207 6 370,596 24,192 17,430 8,580 27,368 5,217 28,325 3,336 7 338,891 22,801 15,530 8,280 25,440 5,008 25,845 3,046 8 301,338 21,588 12,258 6,666 20,756 4,236 21,420 2,621 9 237,998 16,520 9,719 5,150 17,436 3,517 17,242 2,181
10 289,851 17,280 11,519 5,605 21,799 4,122 22,931 2,651 11 279,792 19,926 13,604 7,655 20,608 4,065 21,206 2,543 12 269,146 19,103 12,360 6,706 20,936 4,072 21,744 2,497 13 212,821 14,028 8,489 4,377 15,859 2,796 16,282 1,590 14 179,874 12,881 7,538 3,993 13,876 2,690 14,100 1,655 15 173,696 10,567 6,897 3,286 13,434 2,636 13,764 1,609 16 180,184 10,960 7,890 3,701 13,037 2,762 12,828 1,577 17 152,629 9,571 6,832 3,245 13,808 3,141 13,738 1,779 18 168,802 10,195 6,324 3,144 12,140 2,138 12,912 1,426 19 130,506 7,699 7,184 3,221 12,346 2,673 12,245 1,766 20 125,336 7,736 5,342 2,589 10,128 2,106 9,852 1,263 21 83,605 5,078 4,147 1,985 6,664 1,166 6,579 697 22 75,040 4,086 3,193 1,419 6,420 1,381 6,350 772 X 196,744 21,790 11,258 7,446 16,699 4,561 16,546 3,090
Y* 2,066 242 183 62 160 39 109 31 Mt 83 28 5 3 1 0 3 2
Total 5,847,034 403,383 254,615 134,220 442,462 88,173 453,389 54,718
*indicates information for chromosome Y for one male individual (GS000012210).
.CC-BY-NC-ND 4.0 International licenseis made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It. https://doi.org/10.1101/252619doi: bioRxiv preprint