Vignette: Calling IBD with ancIBD

This notebook walks you through the steps to call IBD segments with ancIBD. It assumes one has data in hdf5 format, including genetic map and ideally also allele frequency data. For how to produce such a .hdf5 file from an imputed VCF, please see the vignette notebook create_hdf5_from_vcf.ipynb.

[1]:
import sys as sys
import matplotlib.cm as cm
import pandas as pd
import os as os

### Set working directory to your ancIBD vignette folder
path = "/n/groups/reich/hringbauer/git/hapBLOCK/notebook/vignette/"

os.chdir(path)  # Set the right Path (in line with Atom default)
print(os.getcwd())

### The following Code sets the working directory to your ancIBD code
### Only uncomment if you want to use not the pip installed package
#sys.path.insert(0,"/n/groups/reich/hringbauer/git/hapBLOCK/package/")  # hack to get development package first in path
/n/groups/reich/hringbauer/git/hapBLOCK/notebook/vignette

Code to run IBD calling

Note: For a quick test run for a pairs of indivdiuals and a specific chromosome including a visualization of the posterior, see the vignette notebook ./plot_IBD.ipynb. This visual test is always recommended to verify whether your data is sound and everything works as expected.

Here, the walk-through is through the calling of IBD in a full data set including multiple individuals and chromosomes.

[2]:
from ancIBD.run import hapBLOCK_chroms

Specify list of Indivdiuals to screen for IBD

This list is the list of indivdiuals that will be loaded into memory by ancIBD.

Please Remember: ancIBD is data-hungry, and only works for ancient genomes with >600,000 1240k SNPs covered (or WGS data with >200,000 1240k SNPs covered). If you run it for indivdiuals with less coverage, you will still receive output, but that output is not trustworthy, with little power to call IBD and high false positive rates.

[3]:
iids = ["I12439", "I12440", "I12438",
        "I12896", "I21390", "I30300"] # The six Hazelton Indivdiuals of the example data.

1) Run ancIBD on your sample to output all IBD segments

The function hapBLOCK_chroms calls IBD of an input hdf5. It outputs a dataframe of IBD, which below the function is saved into a specified output folder.

Important parameters varying from application to application are: - folder_in: The path of the hdf5 files used for IBD calling. The format is so that everything up to the chrosome number is specified. - iids: All iids to load from the hdf. Has to match the IIDs in the sample field of the HDF5 - run_iids: Which pairs to run [[iid1, iid2], …] If this parameter is left empty, then all pairs are run. - p_col: This specifies which field to use for reference allele frequencies (that should be encoded into your HDF5 file). You can set ‘variants/RAF’ - which are the 1000G reference allele frequencies, or ‘variants/AF_ALL’, which encodes the allele frequencies calculated from the HDF5 file (using all high confidence genotypes). If your dataset is small, the 1000G allele frequencies are a better choice as there will be large sampling error in the sample allele frequencies - and for most Eurasian aDNA 1000G allele frequencies are proven to work well. If your data set in the h5 is very big (100s or even 1000s of samples), it might be better to use the sample specific allele frequencies.

The other parameters specifiying the various modes of ancIBD, and the parameters are default values. E.g. ibd_in, ibd_out, ibd_jump control the jump rates of the underlying HMM (in rates per Morgan), and the various _model parameter specificy various modes to load the data (l_model), to specify the HMM (e_model: Emission, t_model: Transition) and run the ancIBD HMM (h_model).

The default values and modes are the recommend parameters for typical human aDNA data, but power users can modify those advanced settings.

[4]:
%%time

for ch in range(1,23):
    df_ibd = hapBLOCK_chroms(folder_in='./data/hdf5/example_hazelton_chr',
                             iids=iids, run_iids=[],
                             ch=ch, folder_out='./output/ibd_hazelton/',
                             output=False, prefix_out='', logfile=False,
                             l_model='h5', e_model='haploid_gl2', h_model='FiveStateScaled', t_model='standard',
                             p_col='variants/RAF',
                             ibd_in=1, ibd_out=10, ibd_jump=400,
                             min_cm=6, cutoff_post=0.99, max_gap=0.0075,
                             processes=1)
CPU times: user 7.8 s, sys: 314 ms, total: 8.12 s
Wall time: 8.81 s

Congrats, now you have all the IBD segments called! Notice the speed of the IBD caller: All pairs of six indivduals (15 in total) and all chromosomes only took few seconds to run.

2) Combine IBD calls from all chromosomes

Now we combine the IBD calls from each chromosome into one overall file. The reason why this is done as seperate function is to allow for parallelization of the above function (i.e. as an array submission on a scientific cluster).

[5]:
from ancIBD.IO.ind_ibd import combine_all_chroms
[6]:
combine_all_chroms(chs=range(1,23),
                   folder_base='./output/ibd_hazelton/ch',
                   path_save='./output/ibd_hazelton/ch_all.tsv')
Chromosome 1; Loaded 22 IBD
Chromosome 2; Loaded 15 IBD
Chromosome 3; Loaded 15 IBD
Chromosome 4; Loaded 10 IBD
Chromosome 5; Loaded 10 IBD
Chromosome 6; Loaded 13 IBD
Chromosome 7; Loaded 14 IBD
Chromosome 8; Loaded 7 IBD
Chromosome 9; Loaded 20 IBD
Chromosome 10; Loaded 9 IBD
Chromosome 11; Loaded 12 IBD
Chromosome 12; Loaded 11 IBD
Chromosome 13; Loaded 12 IBD
Chromosome 14; Loaded 10 IBD
Chromosome 15; Loaded 11 IBD
Chromosome 16; Loaded 10 IBD
Chromosome 17; Loaded 12 IBD
Chromosome 18; Loaded 11 IBD
Chromosome 19; Loaded 19 IBD
Chromosome 20; Loaded 15 IBD
Chromosome 21; Loaded 11 IBD
Chromosome 22; Loaded 10 IBD
Saved 279 IBD to ./output/ibd_hazelton/ch_all.tsv.

3) Postprocess output into single summary table

For easy screening for IBD between pairs, the function create_ind_ibd_df produces a summary table. Each row is one pair of individuals, and there are columns for summary statistics: - max_ibd: Maximum Length of IBD - sum_IBD>x: The total length of all IBD segments longer than x Morgan - n_IBD>x: The total number of all IBD segments longer than x Morgan

By default, these are recorded for >8,>12,>16 and >20 Morgan. This can be changed with the min_cms keyword.

The function also does post-processing of trustworthy IBD blocks. Most importantly, only IBD with at least a certain SNP density are kept. The reason for this is that areas of low SNP density (such as regions with large gaps of SNPs) are very prone to false positives.

Note: Only pairs with at least one IBD are recorded (in the above >6 cM). So if a pair of indivdiuals is missing means that this pair of indivdiuals does not have any shared IBD segments. The reason for this omission is that in large samples, most indivdiual pairs will have 0 IBD - thus there would be a large memory requirement but for little informational gain.

[16]:
from ancIBD.IO.ind_ibd import create_ind_ibd_df
[17]:
%%time
df_res = create_ind_ibd_df(ibd_data = './output/ibd_hazelton/ch_all.tsv',
                      min_cms = [8, 12, 16, 20], snp_cm = 220, min_cm = 5, sort_col = 0,
                      savepath = "./output/ibd_hazelton/ibd_ind.d220.tsv")
> 5 cM: 279/279
Of these with suff. SNPs per cM> 220:               177/279
2     13
11    12
6     12
13    11
3     11
1     11
9     10
16     9
21     9
10     9
20     9
5      8
4      8
12     7
14     6
8      6
7      6
18     6
22     6
15     4
17     4
Name: ch, dtype: int64
Saved 15 individual IBD pairs to: ./output/ibd_hazelton/ibd_ind.d220.tsv
CPU times: user 86.7 ms, sys: 3.91 ms, total: 90.6 ms
Wall time: 130 ms

Congrats, that closes the post-processing of IBD. Let’s have a look into the output data:

[18]:
df_ibd = pd.read_csv('./output/ibd_hazelton/ibd_ind.d220.tsv', sep="\t")
df_ibd
[18]:
iid1 iid2 max_IBD sum_IBD>8 n_IBD>8 sum_IBD>12 n_IBD>12 sum_IBD>16 n_IBD>16 sum_IBD>20 n_IBD>20
0 I12438 I30300 283.700303 3336.351792 21.0 3336.351792 21.0 3336.351792 21.0 3336.351792 21.0
1 I12440 I30300 268.787893 3318.813892 21.0 3318.813892 21.0 3318.813892 21.0 3318.813892 21.0
2 I12440 I12438 176.751903 1660.115506 21.0 1650.284006 20.0 1650.284006 20.0 1650.284006 20.0
3 I12439 I12440 153.428499 1628.101396 21.0 1605.868303 19.0 1605.868303 19.0 1605.868303 19.0
4 I12439 I30300 98.273498 920.435097 13.0 920.435097 13.0 920.435097 13.0 920.435097 13.0
5 I12439 I12438 93.381101 476.404998 10.0 460.227589 8.0 460.227589 8.0 460.227589 8.0
6 I12896 I30300 12.916911 41.172510 4.0 12.916911 1.0 0.000000 0.0 0.000000 0.0
7 I12896 I21390 11.861992 40.497494 4.0 0.000000 0.0 0.000000 0.0 0.000000 0.0
8 I12438 I12896 13.538799 22.422002 2.0 13.538799 1.0 0.000000 0.0 0.000000 0.0
9 I12439 I12896 10.088903 18.503708 2.0 0.000000 0.0 0.000000 0.0 0.000000 0.0
10 I21390 I30300 14.905101 14.905101 1.0 14.905101 1.0 0.000000 0.0 0.000000 0.0
11 I12439 I21390 11.116302 11.116302 1.0 0.000000 0.0 0.000000 0.0 0.000000 0.0
12 I12440 I12896 10.285699 10.285699 1.0 0.000000 0.0 0.000000 0.0 0.000000 0.0
13 I12438 I21390 0.000000 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.0
14 I12440 I21390 0.000000 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.0

Oh, there are a couple of individual pairs with ample IBD. Some of them in all chromosomes!

Congrats - you have now learned how to call IBD segments and to post-process the output.

The next step is visualizing the output. The vignette notebook ./plot_IBD.ipynb will walk you through that step.