ancIBD.IO.prepare_h5

Functions to prepare HDF5 files from imputed VCFs. Author: Harald Ringbauer, 2021

Module Contents

Functions

save_1240kmarkers([snp1240k_path, marker_path, ch])

Save all 1240K markers of an eigenstrat .snp file.

save_1240_1000g_kmarkers([ch, snp_path, marker_path])

Save all 1240K and 1000G markers of an eigenstrat .snp file.

bctools_filter_vcf([in_vcf_path, out_vcf_path, ...])

Same as PLINK, but with bcftools and filtering directly via marker positions.

bctools_filter_vcf_allvariants([in_vcf_path, ...])

Same as PLINK, but with bcftools and filtering directly via marker positions.

merge_vcfs([in_vcf_paths, out_vcf_path])

Merges a set of VCFs into one VCF.

vcf_to_1240K_hdf([in_vcf_path, path_vcf, path_h5, ...])

Convert Ali's vcf to 1240K hdf5.

ancIBD.IO.prepare_h5.save_1240kmarkers(snp1240k_path='', marker_path='', ch=0)

Save all 1240K markers of an eigenstrat .snp file to marker_path. ch: Chromosome to keep. If null, no chromosome filter is applied and the markers of all chromosomes are saved.
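As a rough illustration of the marker extraction described above, the sketch below reads an eigenstrat .snp file and writes chromosome/position pairs to a marker file. It is not the ancIBD implementation: the helper name write_marker_positions, the tab-separated output layout, and treating ch=0 as "keep all chromosomes" are assumptions made for this example. It assumes the usual six-column eigenstrat .snp layout (SNP ID, chromosome, genetic position, physical position, reference allele, alternative allele).

```python
import os
import tempfile


def write_marker_positions(snp_path, marker_path, ch=0):
    """Write 'chromosome<TAB>position' lines for SNPs in an eigenstrat .snp file.

    Hypothetical helper, not part of ancIBD. ch=0 keeps every chromosome
    (an assumption of this sketch); any other value keeps only that chromosome.
    Returns the number of markers written.
    """
    n_written = 0
    with open(snp_path) as f_in, open(marker_path, "w") as f_out:
        for line in f_in:
            fields = line.split()
            if len(fields) < 4:
                continue  # skip blank or malformed lines
            chrom, pos = fields[1], fields[3]  # chromosome, physical position
            if ch and chrom != str(ch):
                continue  # marker lies on a chromosome we do not want
            f_out.write(f"{chrom}\t{pos}\n")
            n_written += 1
    return n_written


# Tiny made-up .snp file, for demonstration only
demo_snp = (
    "rs1 3 0.010 1000 A G\n"
    "rs2 3 0.020 2000 C T\n"
    "rs3 4 0.005 1500 G A\n"
)
tmp_dir = tempfile.mkdtemp()
snp_file = os.path.join(tmp_dir, "demo.snp")
marker_file = os.path.join(tmp_dir, "markers_ch3.csv")
with open(snp_file, "w") as f:
    f.write(demo_snp)

n_ch3 = write_marker_positions(snp_file, marker_file, ch=3)
with open(marker_file) as f:
    marker_lines = f.read().splitlines()
```

A two-column chromosome/position file of this kind is the layout bcftools accepts as a targets file, which is why the sketch writes tabs rather than commas.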

ancIBD.IO.prepare_h5.save_1240_1000g_kmarkers(ch=3, snp_path='', marker_path='')

Save all 1240K and 1000G markers of an eigenstrat .snp file to marker_path. Loads Ali's path file. snp_path: Where to find the SNPs plus their types.

ancIBD.IO.prepare_h5.bctools_filter_vcf(in_vcf_path='', out_vcf_path='', marker_path='')

Same as PLINK, but with bcftools and filtering directly via marker positions. filter_iids: Whether to use the .csv with individuals to extract. Checks whether out_vcf_path ends in .gz or .vcf and compresses the output in the former case.
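The suffix check on out_vcf_path could be sketched as follows. This is an illustration under stated assumptions, not the code ancIBD runs: build_bcftools_filter_cmd is a hypothetical helper, and the page does not confirm which bcftools options the library passes. The sketch only assembles a bcftools view command using -T (targets file), -O (output type) and -o (output file); it does not execute it.

```python
def build_bcftools_filter_cmd(in_vcf_path, out_vcf_path, marker_path):
    """Assemble a 'bcftools view' command that keeps only the listed positions.

    Hypothetical helper, not part of ancIBD. The output type follows the
    suffix of out_vcf_path: '.gz' gives compressed VCF, '.vcf' gives plain VCF.
    """
    if out_vcf_path.endswith(".gz"):
        out_type = "z"  # bgzip-compressed VCF
    elif out_vcf_path.endswith(".vcf"):
        out_type = "v"  # uncompressed VCF
    else:
        raise ValueError("out_vcf_path must end in .gz or .vcf")
    return [
        "bcftools", "view",
        "-O", out_type,      # output type chosen from the suffix
        "-o", out_vcf_path,  # output file
        "-T", marker_path,   # tab-separated chromosome/position targets
        in_vcf_path,
    ]


cmd_gz = build_bcftools_filter_cmd("in.bcf", "out.vcf.gz", "markers.csv")
cmd_vcf = build_bcftools_filter_cmd("in.bcf", "out.vcf", "markers.csv")
```

Returning the command as a list keeps the sketch free of side effects; a caller could pass it to subprocess.run(cmd, check=True) once bcftools is available on the PATH.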

ancIBD.IO.prepare_h5.bctools_filter_vcf_allvariants(in_vcf_path='', out_vcf_path='', marker_path='')

Same as PLINK, but with bcftools and filtering directly via marker positions. filter_iids: Whether to use the .csv with individuals to extract.

ancIBD.IO.prepare_h5.merge_vcfs(in_vcf_paths=[], out_vcf_path='')

Merges a set of VCFs into one VCF. in_vcf_paths: List of VCFs to merge. out_vcf_path: Path of the output VCF.

ancIBD.IO.prepare_h5.vcf_to_1240K_hdf(in_vcf_path='/n/groups/reich/ali/WholeGenomeImputation/imputed/v43.4/chr3.bcf', path_vcf='./data/vcf/1240k_v43/ch3.vcf.gz', path_h5='./data/hdf5/1240k_v43/ch3.h5', marker_path='./data/filters/ho_snps_bcftools_ch3.csv', map_path='/n/groups/reich/DAVID/V43/V43.5/v43.5.snp', af_path='', col_sample_af='AF_SAMPLE', chunk_length=10000, chunk_width=8, buffer_size=20000, ch=3)

Convert Ali’s vcf to 1240K hdf5.

Parameters:
  • in_vcf_path (str) – Input VCF file (i.e., output from GLIMPSE)

  • path_vcf (str) – A filtered vcf of in_vcf_path that contains only 1240k sites.

  • path_h5 (str) – Path of the output HDF5 file

  • marker_path (str) – Path to file containing the SNPs to downsample to. If marker_path is empty, no SNP filtering is done.

  • map_path (str) – Path to eigenstrat SNP file containing the genetic map, which is merged into the hdf5 field “variants/MAP”. If map_path is empty, no genetic map is merged in.

  • af_path (str) – Path to tab-separated table containing allele frequencies. These are merged into the hdf5 field “variants/AF_ALL”. If no such path is given, no allele frequencies are merged in.

  • col_sample_af (str) – The hdf5 column name for the allele frequency calculated from the sample. If left empty, no such column is calculated or added.
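To make the col_sample_af parameter more concrete, here is a minimal sketch of an allele frequency calculated from the sample itself. How ancIBD computes this value is not specified on this page; the function name sample_allele_frequency, the diploid biallelic encoding (0 = reference, 1 = alternative) and the use of -1 for missing calls are all assumptions of this example.

```python
import math


def sample_allele_frequency(genotypes):
    """Alternative-allele frequency at one variant from diploid genotype calls.

    Hypothetical helper, not part of ancIBD. genotypes is a list of
    (allele1, allele2) pairs with 0 = reference, 1 = alternative and
    -1 = missing. Missing alleles are ignored; NaN is returned when
    no allele is called at all.
    """
    called = [a for gt in genotypes for a in gt if a >= 0]
    if not called:
        return math.nan
    return sum(called) / len(called)


# Four individuals at one variant, one of them with a missing genotype
demo_genotypes = [(0, 1), (1, 1), (-1, -1), (0, 0)]
af_sample = sample_allele_frequency(demo_genotypes)  # 3 alt alleles of 6 called
```

Ignoring missing calls in the denominator keeps the estimate from being biased towards zero at variants with low coverage.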