Joint analysis of sequence data and single-nucleotide polymorphism data using pedigree information for imputation and recombination inference
© Song et al.; licensee BioMed Central Ltd. 2014
Published: 17 June 2014
We developed a general framework for family-based imputation using single-nucleotide polymorphism (SNP) data and sequence data distributed by Genetic Analysis Workshop 18. Using PedIBD, we first inferred the haplotypes and inheritance patterns of each family from SNP data. New variants in unsequenced family members can then be obtained from sequenced relatives through their shared haplotypes. We compared the results of our method against the imputation results provided by the Genetic Analysis Workshop organizers. The results showed that our strategy uncovered more variants for more unsequenced relatives. We also showed that recombination breakpoints inferred by PedIBD have much higher resolution than those reported in previous studies.
Next-generation sequencing (NGS) technologies have profoundly changed the landscape of genetic studies [1]. As sequencing becomes more affordable, a growing number of studies are choosing NGS as the primary platform for data collection, either at the whole genome level or for targeted regions. However, the costs of sequencing thousands of individuals and of the downstream analysis are still prohibitively high. On the other hand, many projects have already accumulated single-nucleotide polymorphism (SNP) data from previous studies. In such cases, researchers only need to sequence a small subset of family members (e.g., proband and parents) to reduce costs. By jointly analyzing sequence data from a subset of family members together with SNP data from the families, computational approaches may fully recover variant information in unsequenced members. The data distributed by Genetic Analysis Workshop 18 (GAW18) provide an excellent example of this design strategy. Many of the pedigrees are very large, and all of them have a significant number of members without SNP genotypes, which makes imputation computationally very challenging. Our laboratory has recently developed an efficient haplotype inference algorithm called PedIBD, which is designed specifically for large pedigrees with many untyped individuals [2]. By taking advantage of haplotypes inferred by PedIBD from SNP data, we developed a procedure to computationally impute variants for unsequenced individuals based on haplotype sharing between them and their sequenced relatives. The advantage of our approach over the imputation provided by GAW is that our approach can treat each pedigree as a whole when inferring haplotypes or inheritance, whereas GAW had to partition large pedigrees into smaller families. Our approach is thus expected to provide more complete and more accurate results.
In addition, using the SNP data, we can also infer recombination breakpoints with high resolution within each pedigree.
We focused our analysis on chromosome 3 of the GAW18 dataset. The dataset consists of 1389 individuals from 20 families. Among them, 959 individuals were genotyped using SNP chips, and a subset of 464 genotyped individuals were also sequenced. The total number of SNPs in the chip data is 65,519. Because only the rs numbers of these SNPs were provided, we obtained their map positions from the NCBI dbSNP database (Build 37). Nineteen SNPs were removed because they either had no matching rs numbers or the SNPs with the same rs numbers mapped to a different chromosome. Sequence data were converted to A/C/G/T format using VCFtools [3]. The total number of SNPs called from the sequence data is approximately 1.75 million per individual. After removing SNPs with a high missing rate (>5%), the total number of sequence variants used in our analysis is approximately 1.69 million.
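The missing-rate filter described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the in-memory layout (a dictionary from variant ID to a list of per-individual calls), the use of `None` for a missing call, and the function names are all assumptions made for the example.

```python
from typing import Dict, List, Optional

# Assumed encoding: a call is a two-letter string such as "AG";
# None marks a missing call.
Call = Optional[str]


def missing_rate(calls: List[Call]) -> float:
    """Fraction of individuals with no call at one variant."""
    if not calls:
        return 1.0
    return sum(call is None for call in calls) / len(calls)


def filter_by_missing_rate(
    variants: Dict[str, List[Call]], max_missing: float = 0.05
) -> Dict[str, List[Call]]:
    """Keep variants whose missing rate does not exceed max_missing."""
    return {
        variant_id: calls
        for variant_id, calls in variants.items()
        if missing_rate(calls) <= max_missing
    }


# Toy data (hypothetical variant IDs, 20 individuals each).
toy = {
    "var_complete": ["AA"] * 20,
    "var_sparse": ["AG"] * 18 + [None] * 2,  # 10% missing
}
kept = filter_by_missing_rate(toy)
print(sorted(kept))
```

With the 5% threshold from the text, the variant with 2 of 20 calls missing is dropped and the fully called variant is kept.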
However, because most new variants from sequence data are rare, the probability of having no homozygous genotypes is extremely low.
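One way to see why homozygous genotypes matter when variants are passed along shared haplotypes is the following minimal sketch. It is an illustration under stated assumptions, not the authors' implementation and not the PedIBD algorithm: it assumes that, within one haplotype segment, every pedigree member has already been assigned two founder-haplotype labels (the labels `h1` to `h4` and all function names are hypothetical), and it ignores conflicting loci and genotyping errors.

```python
from typing import Dict, Optional, Tuple

Hap = str                                # hypothetical founder-haplotype label
Carriers = Dict[str, Tuple[Hap, Hap]]    # individual -> the two haplotypes carried
Genotypes = Dict[str, Tuple[str, str]]   # sequenced individual -> two alleles


def resolve_haplotype_alleles(carriers: Carriers, genotypes: Genotypes) -> Dict[Hap, str]:
    """Assign an allele to each founder haplotype at a single variant."""
    alleles: Dict[Hap, str] = {}
    # Homozygous genotypes fix the allele on both carried haplotypes.
    for ind, (a, b) in genotypes.items():
        if a == b:
            for hap in carriers[ind]:
                alleles[hap] = a
    # Heterozygous genotypes resolve one haplotype once the other is known.
    changed = True
    while changed:
        changed = False
        for ind, (a, b) in genotypes.items():
            if a == b:
                continue
            h1, h2 = carriers[ind]
            for known, unknown in ((h1, h2), (h2, h1)):
                if known in alleles and unknown not in alleles and alleles[known] in (a, b):
                    alleles[unknown] = b if alleles[known] == a else a
                    changed = True
    return alleles


def impute_genotype(ind: str, carriers: Carriers,
                    alleles: Dict[Hap, str]) -> Optional[Tuple[str, str]]:
    """Return the imputed genotype, or None if either haplotype is unresolved."""
    h1, h2 = carriers[ind]
    if h1 in alleles and h2 in alleles:
        return tuple(sorted((alleles[h1], alleles[h2])))
    return None


# Toy nuclear family: two sequenced children, unsequenced parents.
carriers = {"father": ("h1", "h2"), "mother": ("h3", "h4"),
            "child1": ("h1", "h3"), "child2": ("h2", "h3")}
genotypes = {"child1": ("A", "A"), "child2": ("A", "G")}
alleles = resolve_haplotype_alleles(carriers, genotypes)
print(impute_genotype("father", carriers, alleles))
print(impute_genotype("mother", carriers, alleles))
```

In this toy case the homozygous child seeds the resolution; without any homozygous genotype the loop would have nothing to start from, and a haplotype seen in no sequenced relative (here `h4`) leaves the corresponding genotype unimputed.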
For each offspring in a family, a switch on its haplotype assignment indicates a recombination event. We collected all recombination events on chromosome 3 and examined the resolution of recombination breakpoints.
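The idea that a switch in haplotype assignment marks a recombination event can be sketched as below. The input format is an assumption for illustration: for one offspring and one parent, each marker carries `0` or `1` for whichever of the parent's two haplotypes was transmitted, or `None` where the marker is uninformative (for example, the parent is homozygous). The function name is hypothetical.

```python
from typing import List, Optional, Tuple


def find_recombinations(
    positions: List[int], assignment: List[Optional[int]]
) -> List[Tuple[int, int]]:
    """Return (left, right) intervals that bracket each haplotype switch.

    The breakpoint lies somewhere between the last informative marker
    before the switch and the first informative marker after it.
    """
    events: List[Tuple[int, int]] = []
    prev_pos: Optional[int] = None
    prev_state: Optional[int] = None
    for pos, state in zip(positions, assignment):
        if state is None:          # uninformative marker: skip it
            continue
        if prev_state is not None and state != prev_state:
            events.append((prev_pos, pos))
        prev_pos, prev_state = pos, state
    return events


# Toy example: base-pair positions and transmitted parental haplotype.
positions = [100, 200, 300, 400, 500]
assignment = [0, 0, None, 1, 1]
print(find_recombinations(positions, assignment))
```

Here the uninformative marker at position 300 widens the reported interval to (200, 400), which is how uninformative genotypes lower the resolution of an inferred breakpoint.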
Inconsistency between single-nucleotide polymorphism chip data from genome-wide association studies and sequence data
Table: Missing rate and inconsistency between whole genome sequencing (WGS) and genome-wide association study (GWAS) genotypes. [Table values not recovered. Surviving labels: total number of individuals; genotype inconsistency between GWAS and WGS; GWAS and WGS; cause of genotype inconsistency; missing in GWAS; missing in WGS.]
Comparison of imputation results between Genetic Analysis Workshop and our approach
Table: Pedigree information of masked individuals and imputation accuracy. [Table only partially recovered. Surviving column labels: masked individual ID; half-sibling, grandparent, grandchildren, aunt and uncle, niece and nephew. Surviving entries: 5(G) + 1(GS); 10(G) + 2(GS); 3(U) + 1(G) + 1(GS); 1(U) + 3(G) + 2(GS); 3(G) + 1(GS).]
The haplotypes and recombination breakpoints were obtained for all families based only on GWAS data. Overall, a total of 3089 recombination events were identified. For a fraction of them, the parental origin cannot be determined because of missing genotypes in the parents. After filtering out recombination events with unknown parental origin, our final dataset had 1361 maternal and 933 paternal recombination events. Because of homozygous genotypes, a recombination breakpoint cannot always be localized between two adjacent SNPs. Still, the resolution of our inferred recombination breakpoints is very high: more than 94% of them are localized to an interval of 20 kb or less, and the median interval length is about 8 kb, which is a great improvement over previous results [7, 8].
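Resolution figures of this kind (the share of breakpoints within a given range and the median interval length) can be computed from a list of breakpoint intervals as in the sketch below. The interval representation, the function name, and the toy values are assumptions for illustration; the numbers are not taken from the study.

```python
from statistics import median
from typing import Dict, List, Tuple


def resolution_summary(
    intervals: List[Tuple[int, int]], threshold_bp: int = 20_000
) -> Dict[str, float]:
    """Summarize breakpoint resolution from (left, right) intervals in bp."""
    if not intervals:
        raise ValueError("no recombination intervals supplied")
    lengths = [right - left for left, right in intervals]
    within = sum(length <= threshold_bp for length in lengths)
    return {
        "n_events": len(lengths),
        "median_bp": median(lengths),
        "fraction_within_threshold": within / len(lengths),
    }


# Toy intervals (made-up coordinates in base pairs).
toy_intervals = [(0, 5_000), (0, 8_000), (0, 30_000), (0, 12_000)]
summary = resolution_summary(toy_intervals)
print(summary)
```

For the toy intervals, three of the four are 20 kb or shorter, and the median of the four lengths is the mean of the two middle values.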
In this study, we have proposed a computational framework to infer haplotypes and recombination breakpoints and, finally, to impute genotypes based on a subset of sequenced members in a pedigree. Results on GAW18 data have demonstrated that (a) our approach is efficient for extremely large pedigrees and (b) it imputed more variants for more individuals than the imputation provided by the GAW organizers.
Our approach can be further improved in several directions. First, data quality, including missing data and genotyping errors, can have a substantial effect on the final results. Many genotyping errors are actually Mendelian consistent, which makes error detection a challenging task. With the development of sequencing technologies as well as SNP calling algorithms, we expect the quality of genotype calling from sequence data to improve, which in turn will improve our imputation results (e.g., by reducing the number of conflicting loci). Second, given the high density of SNPs, population-level linkage disequilibrium can be used in imputation even for family data. Investigating approaches that jointly consider information within families and information at the population level will be our future work. Third, our haplotype segments are defined based on all observed recombination events in a family. Therefore, the haplotype segments of a particular individual may be cut short unnecessarily by the recombination breakpoints of other individuals, so that some variants falling between haplotype segments are dropped. We will define the haplotype segmentation of each individual based on her or his own recombination breakpoints, which will reduce the number of dropped variants. Last, our results show that the strategy of sequencing only a small subset of family members and imputing the others is very effective. However, the final imputation results may depend on many factors, such as the number of sequenced relatives and their relationships, as well as the quality (e.g., missing rate) of the sequence data. An important decision is therefore how researchers select individuals to sequence so as to optimize the amount of information acquired within the constraints of a budget.
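The third point, family-wide versus per-individual segmentation, can be illustrated with a small sketch. It assumes that each recombination breakpoint is known only as a (left, right) interval, that haplotype segments are the stretches between those intervals, and that variants falling inside an interval are dropped; the coordinates, individual names, and function names are hypothetical.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]


def segments_from_breakpoints(
    start: int, end: int, breakpoints: List[Interval]
) -> List[Interval]:
    """Return the stretches of [start, end] not covered by any breakpoint interval."""
    segments: List[Interval] = []
    cursor = start
    for left, right in sorted(breakpoints):
        if left > cursor:
            segments.append((cursor, left))
        cursor = max(cursor, right)
    if cursor < end:
        segments.append((cursor, end))
    return segments


def is_covered(position: int, segments: List[Interval]) -> bool:
    """True if the position lies inside one of the haplotype segments."""
    return any(left <= position <= right for left, right in segments)


# Toy family: each child has one recombination interval.
breakpoints_by_child: Dict[str, List[Interval]] = {
    "child1": [(200, 400)],
    "child2": [(600, 700)],
}
all_breakpoints = [bp for bps in breakpoints_by_child.values() for bp in bps]

family_segments = segments_from_breakpoints(0, 1000, all_breakpoints)
child1_segments = segments_from_breakpoints(0, 1000, breakpoints_by_child["child1"])

variant_position = 650
print(is_covered(variant_position, family_segments))
print(is_covered(variant_position, child1_segments))
```

In this toy case a variant at position 650 is dropped under the family-wide segmentation because it falls in the breakpoint interval of `child2`, yet it remains usable for `child1` once segments are defined from that individual's own breakpoints.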
This research was supported by National Institutes of Health (NIH)/National Library of Medicine grant LM008991. JL was supported in part by NIH/DC012380 and NSF/IIS 1162374. SS and SR were supported in part by Choose Ohio First Scholarship. The GAW18 whole genome sequence data were provided by the Type 2 Diabetes Genetic Exploration by Next-Generation Sequencing in Ethnic Samples (T2D-GENES) Consortium, which is supported by NIH grants U01 DK085524, U01 DK085584, U01 DK085501, U01 DK085526, and U01 DK085545. The other genetic and phenotypic data for GAW18 were provided by the San Antonio Family Heart Study and San Antonio Family Diabetes/Gallbladder Study, which are supported by NIH grants P01 HL045222, R01 DK047482, and R01 DK053889. The GAW is supported by NIH grant R01 GM031575.
This article has been published as part of BMC Proceedings Volume 8 Supplement 1, 2014: Genetic Analysis Workshop 18. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcproc/supplements/8/S1. Publication charges for this supplement were funded by the Texas Biomedical Research Institute.
1. Mardis ER: The impact of next-generation sequencing technology on genetics. Trends Genet. 2008, 24: 133-141. doi:10.1016/j.tig.2007.12.007.
2. Li X, Li J: Haplotype reconstruction in large pedigrees with untyped individuals through IBD inference. J Comput Biol. 2011, 18: 1411-1421. doi:10.1089/cmb.2011.0167.
3. Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, Handsaker RE, Lunter G, Marth GT, Sherry ST, et al: The variant call format and VCFtools. Bioinformatics. 2011, 27: 2156-2158. doi:10.1093/bioinformatics/btr330.
4. Sobel E, Papp JC, Lange K: Detection and integration of genotyping errors in statistical genetics. Am J Hum Genet. 2002, 70: 496-508. doi:10.1086/338920.
5. Abecasis GR, Cherny SS, Cookson WO, Cardon LR: Merlin--rapid analysis of dense genetic maps using sparse gene flow trees. Nat Genet. 2002, 30: 97-101. doi:10.1038/ng786.
6. Browning SR, Browning BL: Haplotype phasing: existing methods and new developments. Nat Rev Genet. 2011, 12: 703-714. doi:10.1038/nrg3054.
7. Coop G, Wen X, Ober C, Pritchard JK, Przeworski M: High-resolution mapping of crossovers reveals extensive variation in fine-scale recombination patterns among humans. Science. 2008, 319: 1395-1398. doi:10.1126/science.1151851.
8. Kong A, Gudbjartsson DF, Sainz J, Jonsdottir GM, Gudjonsson SA, Richardsson B, Sigurdardottir S, Barnard J, Hallbeck B, Masson G, et al: A high-resolution recombination map of the human genome. Nat Genet. 2002, 31: 241-247.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.