Vignette: Calling IBD with ancIBD

This notebook walks you through the steps to call IBD segments with ancIBD. It assumes one has data in hdf5 format, including genetic map and ideally also allele frequency data. For how to produce such a .hdf5 file from an imputed VCF, please see the vignette notebook create_hdf5_from_vcf.ipynb.

[1]:

import sys as sys
import matplotlib.cm as cm
import pandas as pd
import os as os

### Set working directory to your ancIBD vignette folder
path = "/n/groups/reich/hringbauer/git/hapBLOCK/notebook/vignette/"

os.chdir(path)  # Set the right Path (in line with Atom default)
print(os.getcwd())

### The following Code sets the working directory to your ancIBD code
### Only uncomment if you want to use not the pip installed package
#sys.path.insert(0,"/n/groups/reich/hringbauer/git/hapBLOCK/package/")  # hack to get development package first in path

/n/groups/reich/hringbauer/git/hapBLOCK/notebook/vignette

Code to run IBD calling

Note: For a quick test run for a pairs of indivdiuals and a specific chromosome including a visualization of the posterior, see the vignette notebook ./plot_IBD.ipynb. This visual test is always recommended to verify whether your data is sound and everything works as expected.

Here, the walk-through is through the calling of IBD in a full data set including multiple individuals and chromosomes.

[2]:

from ancIBD.run import hapBLOCK_chroms

Specify list of Indivdiuals to screen for IBD

This list is the list of indivdiuals that will be loaded into memory by ancIBD.

Please Remember: ancIBD is data-hungry, and only works for ancient genomes with >600,000 1240k SNPs covered (or WGS data with >200,000 1240k SNPs covered). If you run it for indivdiuals with less coverage, you will still receive output, but that output is not trustworthy, with little power to call IBD and high false positive rates.

[3]:

iids = ["I12439", "I12440", "I12438",
        "I12896", "I21390", "I30300"] # The six Hazelton Indivdiuals of the example data.

1) Run ancIBD on your sample to output all IBD segments

The function hapBLOCK_chroms calls IBD of an input hdf5. It outputs a dataframe of IBD, which below the function is saved into a specified output folder.

Important parameters varying from application to application are: - folder_in: The path of the hdf5 files used for IBD calling. The format is so that everything up to the chrosome number is specified. - iids: All iids to load from the hdf. Has to match the IIDs in the sample field of the HDF5 - run_iids: Which pairs to run [[iid1, iid2], …] If this parameter is left empty, then all pairs are run. - p_col: This specifies which field to use for reference allele frequencies (that should be encoded into your HDF5 file). You can set ‘variants/RAF’ - which are the 1000G reference allele frequencies, or ‘variants/AF_ALL’, which encodes the allele frequencies calculated from the HDF5 file (using all high confidence genotypes). If your dataset is small, the 1000G allele frequencies are a better choice as there will be large sampling error in the sample allele frequencies - and for most Eurasian aDNA 1000G allele frequencies are proven to work well. If your data set in the h5 is very big (100s or even 1000s of samples), it might be better to use the sample specific allele frequencies.

The other parameters specifiying the various modes of ancIBD, and the parameters are default values. E.g. ibd_in, ibd_out, ibd_jump control the jump rates of the underlying HMM (in rates per Morgan), and the various _model parameter specificy various modes to load the data (l_model), to specify the HMM (e_model: Emission, t_model: Transition) and run the ancIBD HMM (h_model).

The default values and modes are the recommend parameters for typical human aDNA data, but power users can modify those advanced settings.

[4]:

%%time

for ch in range(1,23):
    df_ibd = hapBLOCK_chroms(folder_in='./data/hdf5/example_hazelton_chr',
                             iids=iids, run_iids=[],
                             ch=ch, folder_out='./output/ibd_hazelton/',
                             output=False, prefix_out='', logfile=False,
                             l_model='h5', e_model='haploid_gl2', h_model='FiveStateScaled', t_model='standard',
                             p_col='variants/RAF',
                             ibd_in=1, ibd_out=10, ibd_jump=400,
                             min_cm=6, cutoff_post=0.99, max_gap=0.0075,
                             processes=1)

CPU times: user 7.8 s, sys: 314 ms, total: 8.12 s
Wall time: 8.81 s

Congrats, now you have all the IBD segments called! Notice the speed of the IBD caller: All pairs of six indivduals (15 in total) and all chromosomes only took few seconds to run.

2) Combine IBD calls from all chromosomes

Now we combine the IBD calls from each chromosome into one overall file. The reason why this is done as seperate function is to allow for parallelization of the above function (i.e. as an array submission on a scientific cluster).

[5]:

from ancIBD.IO.ind_ibd import combine_all_chroms

[6]:

combine_all_chroms(chs=range(1,23),
                   folder_base='./output/ibd_hazelton/ch',
                   path_save='./output/ibd_hazelton/ch_all.tsv')

Chromosome 1; Loaded 22 IBD
Chromosome 2; Loaded 15 IBD
Chromosome 3; Loaded 15 IBD
Chromosome 4; Loaded 10 IBD
Chromosome 5; Loaded 10 IBD
Chromosome 6; Loaded 13 IBD
Chromosome 7; Loaded 14 IBD
Chromosome 8; Loaded 7 IBD
Chromosome 9; Loaded 20 IBD
Chromosome 10; Loaded 9 IBD
Chromosome 11; Loaded 12 IBD
Chromosome 12; Loaded 11 IBD
Chromosome 13; Loaded 12 IBD
Chromosome 14; Loaded 10 IBD
Chromosome 15; Loaded 11 IBD
Chromosome 16; Loaded 10 IBD
Chromosome 17; Loaded 12 IBD
Chromosome 18; Loaded 11 IBD
Chromosome 19; Loaded 19 IBD
Chromosome 20; Loaded 15 IBD
Chromosome 21; Loaded 11 IBD
Chromosome 22; Loaded 10 IBD
Saved 279 IBD to ./output/ibd_hazelton/ch_all.tsv.

3) Postprocess output into single summary table

For easy screening for IBD between pairs, the function create_ind_ibd_df produces a summary table. Each row is one pair of individuals, and there are columns for summary statistics: - max_ibd: Maximum Length of IBD - sum_IBD>x: The total length of all IBD segments longer than x Morgan - n_IBD>x: The total number of all IBD segments longer than x Morgan

By default, these are recorded for >8,>12,>16 and >20 Morgan. This can be changed with the min_cms keyword.

The function also does post-processing of trustworthy IBD blocks. Most importantly, only IBD with at least a certain SNP density are kept. The reason for this is that areas of low SNP density (such as regions with large gaps of SNPs) are very prone to false positives.

Note: Only pairs with at least one IBD are recorded (in the above >6 cM). So if a pair of indivdiuals is missing means that this pair of indivdiuals does not have any shared IBD segments. The reason for this omission is that in large samples, most indivdiual pairs will have 0 IBD - thus there would be a large memory requirement but for little informational gain.

[16]:

from ancIBD.IO.ind_ibd import create_ind_ibd_df

[17]:

%%time
df_res = create_ind_ibd_df(ibd_data = './output/ibd_hazelton/ch_all.tsv',
                      min_cms = [8, 12, 16, 20], snp_cm = 220, min_cm = 5, sort_col = 0,
                      savepath = "./output/ibd_hazelton/ibd_ind.d220.tsv")

> 5 cM: 279/279
Of these with suff. SNPs per cM> 220:               177/279
2     13
11    12
6     12
13    11
3     11
1     11
9     10
16     9
21     9
10     9
20     9
5      8
4      8
12     7
14     6
8      6
7      6
18     6
22     6
15     4
17     4
Name: ch, dtype: int64
Saved 15 individual IBD pairs to: ./output/ibd_hazelton/ibd_ind.d220.tsv
CPU times: user 86.7 ms, sys: 3.91 ms, total: 90.6 ms
Wall time: 130 ms

Congrats, that closes the post-processing of IBD. Let’s have a look into the output data:

[18]:

df_ibd = pd.read_csv('./output/ibd_hazelton/ibd_ind.d220.tsv', sep="\t")
df_ibd

[18]:

	iid1	iid2	max_IBD	sum_IBD>8	n_IBD>8	sum_IBD>12	n_IBD>12	sum_IBD>16	n_IBD>16	sum_IBD>20	n_IBD>20
0	I12438	I30300	283.700303	3336.351792	21.0	3336.351792	21.0	3336.351792	21.0	3336.351792	21.0
1	I12440	I30300	268.787893	3318.813892	21.0	3318.813892	21.0	3318.813892	21.0	3318.813892	21.0
2	I12440	I12438	176.751903	1660.115506	21.0	1650.284006	20.0	1650.284006	20.0	1650.284006	20.0
3	I12439	I12440	153.428499	1628.101396	21.0	1605.868303	19.0	1605.868303	19.0	1605.868303	19.0
4	I12439	I30300	98.273498	920.435097	13.0	920.435097	13.0	920.435097	13.0	920.435097	13.0
5	I12439	I12438	93.381101	476.404998	10.0	460.227589	8.0	460.227589	8.0	460.227589	8.0
6	I12896	I30300	12.916911	41.172510	4.0	12.916911	1.0	0.000000	0.0	0.000000	0.0
7	I12896	I21390	11.861992	40.497494	4.0	0.000000	0.0	0.000000	0.0	0.000000	0.0
8	I12438	I12896	13.538799	22.422002	2.0	13.538799	1.0	0.000000	0.0	0.000000	0.0
9	I12439	I12896	10.088903	18.503708	2.0	0.000000	0.0	0.000000	0.0	0.000000	0.0
10	I21390	I30300	14.905101	14.905101	1.0	14.905101	1.0	0.000000	0.0	0.000000	0.0
11	I12439	I21390	11.116302	11.116302	1.0	0.000000	0.0	0.000000	0.0	0.000000	0.0
12	I12440	I12896	10.285699	10.285699	1.0	0.000000	0.0	0.000000	0.0	0.000000	0.0
13	I12438	I21390	0.000000	0.000000	0.0	0.000000	0.0	0.000000	0.0	0.000000	0.0
14	I12440	I21390	0.000000	0.000000	0.0	0.000000	0.0	0.000000	0.0	0.000000	0.0

Oh, there are a couple of individual pairs with ample IBD. Some of them in all chromosomes!

Congrats - you have now learned how to call IBD segments and to post-process the output.

The next step is visualizing the output. The vignette notebook ./plot_IBD.ipynb will walk you through that step.