Preparing Input for ancIBD: VCF->HDF5

This Vignette Notebook runs you through the steps for producing the input that ancIBD needs, a so called hdf5 file.

Starting Point: A imputed and phased VCF

The IBD software starts from an externally imputed and phased VCF. ancIBD is optimized to work with data that is output from GLIMPSE1 or GLIMPSE2 and was imputed using the publicly available 1000G reference haplotype panel. Note that GLIMPSE output phased and imputed bcfs separately (in folder GLIMPSE/phased and GLIMIPSE/ligated, respectively). For ancIBD, however, the two bcfs need to be combined (and converted to vcf). You can use the following bash command to merge the phased and imputed bcfs.

bcftools annotate -a ./GLIMPSE_ligated/YOUR_BCF_FILE -c FORMAT/GP ./GLIMPSE_phased/YOUR_BCF_FILE -Oz -o OUTPUT_COMBINED_VCF_FILE

Transforming VCF to HDF5

The notebook here showcases how to produce a HDF5 file from the Glimpse output VCF.

Requirement of input VCF:

Importantly, the input VCF should have two fields:

  • GT: Diploid Genotype (the most likely imputed diploid genotype)

  • GP: Genotype probabilities

These two fields are in the standard output of GLIMPSE. They get also transformed to the HDF5 file, and these data is key for a successful run of ancIBD.

[ ]:
### First do Imports
import sys as sys
import os
import matplotlib.cm as cm
import pandas as pd

###
# Edit the following path to your vignette folder:
path = "/n/groups/reich/hringbauer/git/hapBLOCK/notebook/vignette/"
os.chdir(path)  # Set the right Path (in line with Atom default)
print(os.getcwd())
###

Code to Transform HDF5 to VCF

The funcion ancIBD.IO.prepare_h5.vcf_to_1240K_hdf runs the transformation from VCF and outputs a hdf5 for 1240k SNPs.

Input and Output:

Generally, data is organized per chromosomes with one file for each chromosome.

The example of vcf_to_1240k_hdf below transforms the full example vcf data in ./data/vcf.raw/ into hdf5 files suitable for ancIBD into ./data/hdf5/

Parameters

One needs to set paths for the intermediate files:

  • path_vcf: Path of an intermediate VCF file - which is internally filtered to 1240k data

  • path_h5: Path of the output HDF5 files

  • marker_path: Path of the 1240k SNPs to use (a simple table provided with the example data)

  • map_path: Path of the map file to use (eigenstrat .snp provided with the example data, it has the map data included)

  • af_path (optional): Path of allele frequencies to merge into hdf5 file

The data for the 1240k SNPs (marker_path), for the linkage map (map_path) and allele frequencies (af_path) are provided with the vignette, and only needs to be changed for custom SNP sets.

Below you find the function call for a single chromosome. You can write a loop (see below), or also use an array job to run this function on a cluster. The latter will save a lot of runtime on big input data as it runs in parallel.

[3]:
%%time

from ancIBD.IO.prepare_h5 import vcf_to_1240K_hdf
ch = 22
#base_path = f"/n/groups/reich/hringbauer/git/hapBLOCK"
vcf_to_1240K_hdf(in_vcf_path = f"./data/vcf.raw/example_hazelton_chr{ch}.vcf.gz",
                 path_vcf = f"./data/vcf.1240k/example_hazelton_chr{ch}.vcf",
                 path_h5 = f"./data/hdf5/example_hazelton_chr{ch}.h5",
                 marker_path = f"./data/filters/snps_bcftools_ch{ch}.csv",
                 map_path = f"./data/v51.1_1240k.snp",
                 af_path = f"./data/afs/v51.1_1240k_AF_ch{ch}.tsv",
                 col_sample_af = "",
                 buffer_size=20000, chunk_width=8, chunk_length=20000,
                 ch=ch)
Print downsampling to 1240K...
Finished BCF tools filtering.
Deleting previous HDF5 file at path_h5: ./data/hdf5/example_hazelton_chr22.h5...
Converting to HDF5...
Finished conversion to hdf5!
Merging in LD Map..
Lifting LD Map from eigenstrat to HDF5...
Loaded 15483 variants.
Loaded 6 individuals.
Loaded 16420 Chr.22 1240K SNPs.
Intersection 15408 out of 15483 HDF5 SNPs
Interpolating 75 variants.
Finished Chromosome 22.
Adding map to HDF5...
Intersection 15408 out of 15483 target HDF5 SNPs. 75 SNPs set to AF=0.5
Transformation complete! Find new hdf5 file at: ./data/hdf5/example_hazelton_chr22.h5

CPU times: user 6.48 s, sys: 415 ms, total: 6.89 s
Wall time: 8.59 s

Loop Glimpse vcf -> hdf5 over all chromosomes

This is the same as the run for a single chromsome above, but now looped over multiple chromsomes.

Note

The runtime can go into the minutes or even hours for larger datasets. Parallelization of this transformation (e.g. via a parralel array job on a cluster) can speed up that time if needed.

[4]:
%%time

chs = range(1,23)

for ch in chs:
    #base_path = f"/n/groups/reich/hringbauer/git/hapBLOCK"
    vcf_to_1240K_hdf(in_vcf_path = f"./data/vcf.raw/example_hazelton_chr{ch}.vcf.gz",
                     path_vcf = f"./data/vcf.1240k/example_hazelton_chr{ch}.vcf.gz",
                     path_h5 = f"./data/hdf5/example_hazelton_chr{ch}.h5",
                     marker_path = f"./data/filters/snps_bcftools_ch{ch}.csv",
                     map_path = f"./data/v51.1_1240k.snp",
                     af_path = f"./data/afs/v51.1_1240k_AF_ch{ch}.tsv",
                     col_sample_af = "",
                     buffer_size=20000, chunk_width=8, chunk_length=20000,
                     ch=ch)
Print downsampling to 1240K...
Finished BCF tools filtering.
Deleting previous HDF5 file at path_h5: ./data/hdf5/example_hazelton_chr1.h5...
Converting to HDF5...
Finished conversion to hdf5!
Merging in LD Map..
Lifting LD Map from eigenstrat to HDF5...
Loaded 88408 variants.
Loaded 6 individuals.
Loaded 93166 Chr.1 1240K SNPs.
Intersection 88115 out of 88408 HDF5 SNPs
Interpolating 293 variants.
Finished Chromosome 1.
Adding map to HDF5...
Intersection 88115 out of 88408 target HDF5 SNPs. 293 SNPs set to AF=0.5
Transformation complete! Find new hdf5 file at: ./data/hdf5/example_hazelton_chr1.h5

Print downsampling to 1240K...
Finished BCF tools filtering.
Deleting previous HDF5 file at path_h5: ./data/hdf5/example_hazelton_chr2.h5...
Converting to HDF5...
Finished conversion to hdf5!
Merging in LD Map..
Lifting LD Map from eigenstrat to HDF5...
Loaded 93875 variants.
Loaded 6 individuals.
Loaded 98657 Chr.2 1240K SNPs.
Intersection 93471 out of 93875 HDF5 SNPs
Interpolating 404 variants.
Finished Chromosome 2.
Adding map to HDF5...
Intersection 93471 out of 93875 target HDF5 SNPs. 404 SNPs set to AF=0.5
Transformation complete! Find new hdf5 file at: ./data/hdf5/example_hazelton_chr2.h5

Print downsampling to 1240K...
Finished BCF tools filtering.
Deleting previous HDF5 file at path_h5: ./data/hdf5/example_hazelton_chr3.h5...
Converting to HDF5...
Finished conversion to hdf5!
Merging in LD Map..
Lifting LD Map from eigenstrat to HDF5...
Loaded 77345 variants.
Loaded 6 individuals.
Loaded 81416 Chr.3 1240K SNPs.
Intersection 77013 out of 77345 HDF5 SNPs
Interpolating 332 variants.
Finished Chromosome 3.
Adding map to HDF5...
Intersection 77013 out of 77345 target HDF5 SNPs. 332 SNPs set to AF=0.5
Transformation complete! Find new hdf5 file at: ./data/hdf5/example_hazelton_chr3.h5

Print downsampling to 1240K...
Finished BCF tools filtering.
Deleting previous HDF5 file at path_h5: ./data/hdf5/example_hazelton_chr4.h5...
Converting to HDF5...
Finished conversion to hdf5!
Merging in LD Map..
Lifting LD Map from eigenstrat to HDF5...
Loaded 68518 variants.
Loaded 6 individuals.
Loaded 71634 Chr.4 1240K SNPs.
Intersection 68254 out of 68518 HDF5 SNPs
Interpolating 264 variants.
Finished Chromosome 4.
Adding map to HDF5...
Intersection 68254 out of 68518 target HDF5 SNPs. 264 SNPs set to AF=0.5
Transformation complete! Find new hdf5 file at: ./data/hdf5/example_hazelton_chr4.h5

Print downsampling to 1240K...
Finished BCF tools filtering.
Deleting previous HDF5 file at path_h5: ./data/hdf5/example_hazelton_chr5.h5...
Converting to HDF5...
Finished conversion to hdf5!
Merging in LD Map..
Lifting LD Map from eigenstrat to HDF5...
Loaded 69063 variants.
Loaded 6 individuals.
Loaded 74004 Chr.5 1240K SNPs.
Intersection 68899 out of 69063 HDF5 SNPs
Interpolating 164 variants.
Finished Chromosome 5.
Adding map to HDF5...
Intersection 68899 out of 69063 target HDF5 SNPs. 164 SNPs set to AF=0.5
Transformation complete! Find new hdf5 file at: ./data/hdf5/example_hazelton_chr5.h5

Print downsampling to 1240K...
Finished BCF tools filtering.
Deleting previous HDF5 file at path_h5: ./data/hdf5/example_hazelton_chr6.h5...
Converting to HDF5...
Finished conversion to hdf5!
Merging in LD Map..
Lifting LD Map from eigenstrat to HDF5...
Loaded 75347 variants.
Loaded 6 individuals.
Loaded 78867 Chr.6 1240K SNPs.
Intersection 75059 out of 75347 HDF5 SNPs
Interpolating 288 variants.
Finished Chromosome 6.
Adding map to HDF5...
Intersection 75059 out of 75347 target HDF5 SNPs. 288 SNPs set to AF=0.5
Transformation complete! Find new hdf5 file at: ./data/hdf5/example_hazelton_chr6.h5

Print downsampling to 1240K...
Finished BCF tools filtering.
Deleting previous HDF5 file at path_h5: ./data/hdf5/example_hazelton_chr7.h5...
Converting to HDF5...
Finished conversion to hdf5!
Merging in LD Map..
Lifting LD Map from eigenstrat to HDF5...
Loaded 59603 variants.
Loaded 6 individuals.
Loaded 62595 Chr.7 1240K SNPs.
Intersection 59324 out of 59603 HDF5 SNPs
Interpolating 279 variants.
Finished Chromosome 7.
Adding map to HDF5...
Intersection 59324 out of 59603 target HDF5 SNPs. 279 SNPs set to AF=0.5
Transformation complete! Find new hdf5 file at: ./data/hdf5/example_hazelton_chr7.h5

Print downsampling to 1240K...
Finished BCF tools filtering.
Deleting previous HDF5 file at path_h5: ./data/hdf5/example_hazelton_chr8.h5...
Converting to HDF5...
Finished conversion to hdf5!
Merging in LD Map..
Lifting LD Map from eigenstrat to HDF5...
Loaded 60828 variants.
Loaded 6 individuals.
Loaded 63916 Chr.8 1240K SNPs.
Intersection 60530 out of 60828 HDF5 SNPs
Interpolating 298 variants.
Finished Chromosome 8.
Adding map to HDF5...
Intersection 60530 out of 60828 target HDF5 SNPs. 298 SNPs set to AF=0.5
Transformation complete! Find new hdf5 file at: ./data/hdf5/example_hazelton_chr8.h5

Print downsampling to 1240K...
Finished BCF tools filtering.
Deleting previous HDF5 file at path_h5: ./data/hdf5/example_hazelton_chr9.h5...
Converting to HDF5...
Finished conversion to hdf5!
Merging in LD Map..
Lifting LD Map from eigenstrat to HDF5...
Loaded 50546 variants.
Loaded 6 individuals.
Loaded 52765 Chr.9 1240K SNPs.
Intersection 50307 out of 50546 HDF5 SNPs
Interpolating 239 variants.
Finished Chromosome 9.
Adding map to HDF5...
Intersection 50307 out of 50546 target HDF5 SNPs. 239 SNPs set to AF=0.5
Transformation complete! Find new hdf5 file at: ./data/hdf5/example_hazelton_chr9.h5

Print downsampling to 1240K...
Finished BCF tools filtering.
Deleting previous HDF5 file at path_h5: ./data/hdf5/example_hazelton_chr10.h5...
Converting to HDF5...
Finished conversion to hdf5!
Merging in LD Map..
Lifting LD Map from eigenstrat to HDF5...
Loaded 58610 variants.
Loaded 6 individuals.
Loaded 61131 Chr.10 1240K SNPs.
Intersection 58364 out of 58610 HDF5 SNPs
Interpolating 246 variants.
Finished Chromosome 10.
Adding map to HDF5...
Intersection 58364 out of 58610 target HDF5 SNPs. 246 SNPs set to AF=0.5
Transformation complete! Find new hdf5 file at: ./data/hdf5/example_hazelton_chr10.h5

Print downsampling to 1240K...
Finished BCF tools filtering.
Deleting previous HDF5 file at path_h5: ./data/hdf5/example_hazelton_chr11.h5...
Converting to HDF5...
Finished conversion to hdf5!
Merging in LD Map..
Lifting LD Map from eigenstrat to HDF5...
Loaded 54590 variants.
Loaded 6 individuals.
Loaded 57163 Chr.11 1240K SNPs.
Intersection 54365 out of 54590 HDF5 SNPs
Interpolating 225 variants.
Finished Chromosome 11.
Adding map to HDF5...
Intersection 54365 out of 54590 target HDF5 SNPs. 225 SNPs set to AF=0.5
Transformation complete! Find new hdf5 file at: ./data/hdf5/example_hazelton_chr11.h5

Print downsampling to 1240K...
Finished BCF tools filtering.
Deleting previous HDF5 file at path_h5: ./data/hdf5/example_hazelton_chr12.h5...
Converting to HDF5...
Finished conversion to hdf5!
Merging in LD Map..
Lifting LD Map from eigenstrat to HDF5...
Loaded 53737 variants.
Loaded 6 individuals.
Loaded 56133 Chr.12 1240K SNPs.
Intersection 53528 out of 53737 HDF5 SNPs
Interpolating 209 variants.
Finished Chromosome 12.
Adding map to HDF5...
Intersection 53528 out of 53737 target HDF5 SNPs. 209 SNPs set to AF=0.5
Transformation complete! Find new hdf5 file at: ./data/hdf5/example_hazelton_chr12.h5

Print downsampling to 1240K...
Finished BCF tools filtering.
Deleting previous HDF5 file at path_h5: ./data/hdf5/example_hazelton_chr13.h5...
Converting to HDF5...
Finished conversion to hdf5!
Merging in LD Map..
Lifting LD Map from eigenstrat to HDF5...
Loaded 38927 variants.
Loaded 6 individuals.
Loaded 40441 Chr.13 1240K SNPs.
Intersection 38774 out of 38927 HDF5 SNPs
Interpolating 153 variants.
Finished Chromosome 13.
Adding map to HDF5...
Intersection 38774 out of 38927 target HDF5 SNPs. 153 SNPs set to AF=0.5
Transformation complete! Find new hdf5 file at: ./data/hdf5/example_hazelton_chr13.h5

Print downsampling to 1240K...
Finished BCF tools filtering.
Deleting previous HDF5 file at path_h5: ./data/hdf5/example_hazelton_chr14.h5...
Converting to HDF5...
Finished conversion to hdf5!
Merging in LD Map..
Lifting LD Map from eigenstrat to HDF5...
Loaded 35885 variants.
Loaded 6 individuals.
Loaded 37903 Chr.14 1240K SNPs.
Intersection 35744 out of 35885 HDF5 SNPs
Interpolating 141 variants.
Finished Chromosome 14.
Adding map to HDF5...
Intersection 35744 out of 35885 target HDF5 SNPs. 141 SNPs set to AF=0.5
Transformation complete! Find new hdf5 file at: ./data/hdf5/example_hazelton_chr14.h5

Print downsampling to 1240K...
Finished BCF tools filtering.
Deleting previous HDF5 file at path_h5: ./data/hdf5/example_hazelton_chr15.h5...
Converting to HDF5...
Finished conversion to hdf5!
Merging in LD Map..
Lifting LD Map from eigenstrat to HDF5...
Loaded 34280 variants.
Loaded 6 individuals.
Loaded 35991 Chr.15 1240K SNPs.
Intersection 34159 out of 34280 HDF5 SNPs
Interpolating 121 variants.
Finished Chromosome 15.
Adding map to HDF5...
Intersection 34159 out of 34280 target HDF5 SNPs. 121 SNPs set to AF=0.5
Transformation complete! Find new hdf5 file at: ./data/hdf5/example_hazelton_chr15.h5

Print downsampling to 1240K...
Finished BCF tools filtering.
Deleting previous HDF5 file at path_h5: ./data/hdf5/example_hazelton_chr16.h5...
Converting to HDF5...
Finished conversion to hdf5!
Merging in LD Map..
Lifting LD Map from eigenstrat to HDF5...
Loaded 34335 variants.
Loaded 6 individuals.
Loaded 36000 Chr.16 1240K SNPs.
Intersection 34138 out of 34335 HDF5 SNPs
Interpolating 198 variants.
Finished Chromosome 16.
Adding map to HDF5...
Intersection 34138 out of 34335 target HDF5 SNPs. 197 SNPs set to AF=0.5
Transformation complete! Find new hdf5 file at: ./data/hdf5/example_hazelton_chr16.h5

Print downsampling to 1240K...
Finished BCF tools filtering.
Deleting previous HDF5 file at path_h5: ./data/hdf5/example_hazelton_chr17.h5...
Converting to HDF5...
Finished conversion to hdf5!
Merging in LD Map..
Lifting LD Map from eigenstrat to HDF5...
Loaded 28892 variants.
Loaded 6 individuals.
Loaded 30733 Chr.17 1240K SNPs.
Intersection 28794 out of 28892 HDF5 SNPs
Interpolating 98 variants.
Finished Chromosome 17.
Adding map to HDF5...
Intersection 28794 out of 28892 target HDF5 SNPs. 98 SNPs set to AF=0.5
Transformation complete! Find new hdf5 file at: ./data/hdf5/example_hazelton_chr17.h5

Print downsampling to 1240K...
Finished BCF tools filtering.
Deleting previous HDF5 file at path_h5: ./data/hdf5/example_hazelton_chr18.h5...
Converting to HDF5...
Finished conversion to hdf5!
Merging in LD Map..
Lifting LD Map from eigenstrat to HDF5...
Loaded 33846 variants.
Loaded 6 individuals.
Loaded 35327 Chr.18 1240K SNPs.
Intersection 33720 out of 33846 HDF5 SNPs
Interpolating 126 variants.
Finished Chromosome 18.
Adding map to HDF5...
Intersection 33720 out of 33846 target HDF5 SNPs. 126 SNPs set to AF=0.5
Transformation complete! Find new hdf5 file at: ./data/hdf5/example_hazelton_chr18.h5

Print downsampling to 1240K...
Finished BCF tools filtering.
Deleting previous HDF5 file at path_h5: ./data/hdf5/example_hazelton_chr19.h5...
Converting to HDF5...
Finished conversion to hdf5!
Merging in LD Map..
Lifting LD Map from eigenstrat to HDF5...
Loaded 18092 variants.
Loaded 6 individuals.
Loaded 19273 Chr.19 1240K SNPs.
Intersection 18018 out of 18092 HDF5 SNPs
Interpolating 74 variants.
Finished Chromosome 19.
Adding map to HDF5...
Intersection 18018 out of 18092 target HDF5 SNPs. 74 SNPs set to AF=0.5
Transformation complete! Find new hdf5 file at: ./data/hdf5/example_hazelton_chr19.h5

Print downsampling to 1240K...
Finished BCF tools filtering.
Deleting previous HDF5 file at path_h5: ./data/hdf5/example_hazelton_chr20.h5...
Converting to HDF5...
Finished conversion to hdf5!
Merging in LD Map..
Lifting LD Map from eigenstrat to HDF5...
Loaded 28940 variants.
Loaded 6 individuals.
Loaded 30377 Chr.20 1240K SNPs.
Intersection 28827 out of 28940 HDF5 SNPs
Interpolating 113 variants.
Finished Chromosome 20.
Adding map to HDF5...
Intersection 28827 out of 28940 target HDF5 SNPs. 113 SNPs set to AF=0.5
Transformation complete! Find new hdf5 file at: ./data/hdf5/example_hazelton_chr20.h5

Print downsampling to 1240K...
Finished BCF tools filtering.
Deleting previous HDF5 file at path_h5: ./data/hdf5/example_hazelton_chr21.h5...
Converting to HDF5...
Finished conversion to hdf5!
Merging in LD Map..
Lifting LD Map from eigenstrat to HDF5...
Loaded 15707 variants.
Loaded 6 individuals.
Loaded 16727 Chr.21 1240K SNPs.
Intersection 15640 out of 15707 HDF5 SNPs
Interpolating 67 variants.
Finished Chromosome 21.
Adding map to HDF5...
Intersection 15640 out of 15707 target HDF5 SNPs. 67 SNPs set to AF=0.5
Transformation complete! Find new hdf5 file at: ./data/hdf5/example_hazelton_chr21.h5

Print downsampling to 1240K...
Finished BCF tools filtering.
Deleting previous HDF5 file at path_h5: ./data/hdf5/example_hazelton_chr22.h5...
Converting to HDF5...
Finished conversion to hdf5!
Merging in LD Map..
Lifting LD Map from eigenstrat to HDF5...
Loaded 15483 variants.
Loaded 6 individuals.
Loaded 16420 Chr.22 1240K SNPs.
Intersection 15408 out of 15483 HDF5 SNPs
Interpolating 75 variants.
Finished Chromosome 22.
Adding map to HDF5...
Intersection 15408 out of 15483 target HDF5 SNPs. 75 SNPs set to AF=0.5
Transformation complete! Find new hdf5 file at: ./data/hdf5/example_hazelton_chr22.h5

CPU times: user 2min 33s, sys: 6.26 s, total: 2min 39s
Wall time: 4min 48s

Next Steps

This is it, congratulations! Now you have the output data needed for ancIBD. Now you can continue with documentation for calling IBD.