ancIBD.IO.batch_run

Various functions to prepare parameters for a batched run on a cluster. The workflow splits individuals into batches, runs all pairwise batch combinations with ancIBD, and then collects the results. The functions here split the input into batches in a standardized way. Author: Harald Ringbauer, 2023

Module Contents

Functions

get_iids([path_meta, min_snps])

Return list of iids to run

get_batch_idcs(i, batch_size)

Return the index of the batch and the within-batch index

get_batch_pair_idx(i, batch_nr)

Return the indices of the two batches to run

create_savepath([folder_base, ch, b1, b2, output])

Create the savepath in standardized output format

clean_double([run_idc, output])

Removes self comparisons as well as double comparisons.

get_idx_batch(b, batch_size)

Return all the indices of individuals of a batch

get_unique_iid_pairs(iids, b1, b2, batch_size)

Get List of unique iid pairs to run.

get_run_lists_batch([i, k, batch_size, output])

Get IID list and Run lists for batches of samples.

save_ibd_df(df_ibd, savepath[, create])

Saves IBD Dataframe

get_run_params_from_i(i[, metapath, batch_size, ...])

Return the run parameters for run i

join_chromosomes(base_path[, chs, file_out, output])

Join different Chromosomes together and save output.

to_ind_df_batch(b1, b2[, folder_out, chs, min_cms, ...])

Post-process a batch of individuals.

to_ind_df_batches([batches, folder_out, chs, min_cms, ...])

Runs multiple combinations of batches (wrapper of the single-batch function) and then combines the processed batches.

to_ibd_df_batches([batches, folder_out, chs, min_cms, ...])

Runs multiple combinations of batches (wrapper of the single-batch function) and then combines the processed batches.

print_runid_missing([b, folder_out, output])

Finds and prints indices of missing output (chXX.tsv) for batchwise runs.

find_output_missing([metapath, folder_out, ...])

Return a list of all run numbers that are missing.

get_batch_nr(n_iids[, batchsize, n_chr])

Get the number of jobs to submit to cluster.

ancIBD.IO.batch_run.get_iids(path_meta='/n/groups/reich/hringbauer/git/yamnaya/data/meta_v2.tsv', min_snps=600000)

Return list of iids to run

ancIBD.IO.batch_run.get_batch_idcs(i, batch_size)

Return the index of the batch and the within-batch index
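To make the index mapping concrete, here is a minimal sketch. It assumes batches are contiguous blocks of `batch_size` individuals and that all indices are zero-based; `batch_indices_sketch` is a hypothetical stand-in written for illustration, not the library's code.

```python
def batch_indices_sketch(i, batch_size):
    """Map a flat individual index to (batch index, within-batch index).

    Hypothetical helper; assumes contiguous, zero-based batches.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # divmod returns the floor quotient and the remainder in one step
    return divmod(i, batch_size)


print(batch_indices_sketch(0, 400))    # (0, 0)
print(batch_indices_sketch(805, 400))  # (2, 5)
```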

ancIBD.IO.batch_run.get_batch_pair_idx(i, batch_nr)

Return the indices of the two batches to run, using only triangular comparisons
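"Triangular comparisons" means that each unordered pair of batches is run only once. The sketch below illustrates one way a run number could map to such a pair. It assumes zero-based batch indices, that a batch is also compared with itself (so that individuals within one batch are covered), and a row-wise enumeration order; all three are assumptions, and `batch_pair_sketch` is a hypothetical name.

```python
from itertools import combinations_with_replacement


def batch_pair_sketch(i, batch_nr):
    """Return the i-th batch pair (b1, b2) with b1 <= b2.

    Hypothetical helper; the enumeration order is an assumption.
    """
    # All unordered pairs, including a batch paired with itself
    pairs = list(combinations_with_replacement(range(batch_nr), 2))
    if not 0 <= i < len(pairs):
        raise IndexError(f"run index {i} outside 0..{len(pairs) - 1}")
    return pairs[i]


# With 3 batches there are 3 * 4 / 2 = 6 pairs to run
for i in range(6):
    print(i, batch_pair_sketch(i, 3))
```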

ancIBD.IO.batch_run.create_savepath(folder_base='', ch=1, b1=1, b2=2, output=True)

Create the savepath in standardized output format

ancIBD.IO.batch_run.clean_double(run_idc=[], output=True)

Removes self-comparisons as well as duplicate comparisons. Input: list of pairwise IIDs to run. Returns the cleaned list.
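The cleaning step can be sketched as follows. It assumes the input is a list of two-element pairs and that (A, B) and (B, A) count as the same comparison; `clean_pairs_sketch` is a hypothetical illustration and may differ from the library's behavior, for example in which duplicate is kept.

```python
def clean_pairs_sketch(pairs):
    """Drop self-comparisons and order-insensitive duplicates.

    Hypothetical helper; keeps the first occurrence of each pair.
    """
    seen = set()
    cleaned = []
    for a, b in pairs:
        if a == b:
            continue  # self-comparison
        key = frozenset((a, b))  # order-insensitive identity of the pair
        if key in seen:
            continue  # already scheduled, possibly in reverse order
        seen.add(key)
        cleaned.append((a, b))
    return cleaned


pairs = [("A", "B"), ("B", "A"), ("A", "A"), ("B", "C")]
print(clean_pairs_sketch(pairs))  # [('A', 'B'), ('B', 'C')]
```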

ancIBD.IO.batch_run.get_idx_batch(b, batch_size)

Return all the indices of individuals of a batch

ancIBD.IO.batch_run.get_unique_iid_pairs(iids, b1, b2, batch_size)

Get the list of unique IID pairs to run. Returns the list of unique pairs and the list of all IIDs.
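As an illustration of how the batch indices and the unique pairs could fit together, consider the sketch below. It assumes zero-based contiguous batches, that a batch run against itself yields all within-batch pairs, and that two different batches yield all cross pairs. The helper names and the extra `n_total` argument (used to truncate the last batch) are hypothetical and not part of the documented signatures.

```python
from itertools import combinations, product


def idx_batch_sketch(b, batch_size, n_total):
    """Indices of the individuals in batch b (last batch may be shorter)."""
    start = b * batch_size
    return list(range(start, min(start + batch_size, n_total)))


def unique_pairs_sketch(iids, b1, b2, batch_size):
    """Return (unique IID pairs, all IIDs involved) for batches b1 and b2."""
    idx1 = idx_batch_sketch(b1, batch_size, len(iids))
    idx2 = idx_batch_sketch(b2, batch_size, len(iids))
    if b1 == b2:
        # Within one batch: every unordered pair exactly once
        idx_pairs = combinations(idx1, 2)
    else:
        # Across two batches: every cross pair
        idx_pairs = product(idx1, idx2)
    pairs = [(iids[i], iids[j]) for i, j in idx_pairs]
    batch_iids = [iids[i] for i in sorted(set(idx1) | set(idx2))]
    return pairs, batch_iids


iids = ["I0", "I1", "I2", "I3", "I4"]
print(unique_pairs_sketch(iids, 0, 0, batch_size=2))
print(unique_pairs_sketch(iids, 0, 2, batch_size=2))
```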

ancIBD.IO.batch_run.get_run_lists_batch(i=67, k=3500, batch_size=400, output=True)

Get the IID list and run lists for batches of samples.
i: Run number
k: Total number of individuals
batch_size: Number of individuals in one batch
Returns the batch indices.

ancIBD.IO.batch_run.save_ibd_df(df_ibd, savepath, create=True)

Saves IBD Dataframe

ancIBD.IO.batch_run.get_run_params_from_i(i, metapath='./data/iid_lists/iid_ibd_eurasia_v1.tsv', batch_size=400, min_snps=0, output=True, folder_out='/n/groups/reich/hringbauer/git/ibd_euro/output/ibd/v1/')

Return the run parameters for run i.
min_snps: Minimum number of SNPs covered (for potential filtering on the meta file)
Returns iids, run_iids, and the output folder.

ancIBD.IO.batch_run.join_chromosomes(base_path, chs=range(1, 23), file_out='ch_all.tsv', output=True)

Join the different chromosomes together and save the output. Returns the joined dataframe.
file_out: If given, save the joined file under that name in the base_path folder

ancIBD.IO.batch_run.to_ind_df_batch(b1, b2, folder_out='/n/groups/reich/hringbauer/git/ibd_euro/output/ibd/v1/', chs=range(1, 23), min_cms=[8, 12, 16, 20], snp_cm=220, min_cm=8, output=False)

Post-process a batch of individuals. Returns the individual IBD dataframe.

ancIBD.IO.batch_run.to_ind_df_batches(batches=8, folder_out='/n/groups/reich/hringbauer/git/ibd_euro/output/ibd/v1/', chs=range(1, 23), min_cms=[8, 12, 16, 20], snp_cm=220, min_cm=8, output=False, savepath='')

Runs multiple combinations of batches (wrapper of the single-batch function) and then combines the processed batches. Post-processes IBD to an individual summary dataframe. Returns the merged IBD dataframe.
batches: If int, create all possible combinations. Otherwise, needs to be an array [n,2] of all pairs to run.
savepath: If given, save the IBD dataframe there
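The two accepted forms of `batches` can be illustrated with a short sketch. It assumes that an integer expands to all unordered pairs of zero-based batch indices, including a batch paired with itself; whether the library starts at 0 and includes self-pairs is an assumption. `expand_batches_sketch` is a hypothetical name.

```python
from itertools import combinations_with_replacement


def expand_batches_sketch(batches):
    """Normalize `batches` to a list of [b1, b2] pairs.

    Hypothetical helper; the integer expansion rule is an assumption.
    """
    if isinstance(batches, int):
        # Integer: build every unordered pair of batch indices
        return [list(p) for p in combinations_with_replacement(range(batches), 2)]
    # Otherwise: expect an [n, 2] collection of explicit pairs
    pairs = [list(p) for p in batches]
    if any(len(p) != 2 for p in pairs):
        raise ValueError("each entry of `batches` must hold two batch indices")
    return pairs


print(expand_batches_sketch(2))                 # [[0, 0], [0, 1], [1, 1]]
print(expand_batches_sketch([(0, 1), (2, 2)]))  # [[0, 1], [2, 2]]
```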

ancIBD.IO.batch_run.to_ibd_df_batches(batches=8, folder_out='/n/groups/reich/hringbauer/git/ibd_euro/output/ibd/v1/', chs=range(1, 23), min_cms=[8, 12, 16, 20], snp_cm=220, min_cm=8, output=False, savepath='')

Runs multiple combinations of batches (wrapper of the single-batch function) and then combines the processed batches. Post-processes IBD to an individual summary dataframe. Returns the merged IBD dataframe.
batches: If int, create all possible combinations. Otherwise, needs to be an array [n,2] of all pairs to run.
savepath: If given, save the IBD dataframe there

ancIBD.IO.batch_run.print_runid_missing(b=1, folder_out='', output=False)

Finds and prints the indices of missing output (chXX.tsv) for batchwise runs. Returns the list of missing indices. Ideal for rerunning batch scripts. Uses C-style (zero-based) indexing, as would be used in a submission script.
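A simplified sketch of such a check is shown below. It assumes a single folder holding one `ch{N}.tsv` file per chromosome and reports zero-based positions in the chromosome list; the real folder layout per batch pair is not described here, so both the layout and the name `missing_run_indices_sketch` are assumptions.

```python
import os
import tempfile


def missing_run_indices_sketch(folder, chs=range(1, 23)):
    """Return zero-based indices of chromosomes whose output file is absent.

    Hypothetical helper; assumes files are named ch{N}.tsv inside `folder`.
    """
    missing = []
    for idx, ch in enumerate(chs):
        path = os.path.join(folder, f"ch{ch}.tsv")
        if not os.path.exists(path):
            missing.append(idx)
    return missing


# Demo on a temporary folder where chromosomes 3 and 5 have no output
with tempfile.TemporaryDirectory() as tmp:
    for ch in (1, 2, 4):
        open(os.path.join(tmp, f"ch{ch}.tsv"), "w").close()
    print(missing_run_indices_sketch(tmp, chs=range(1, 6)))  # [2, 4]
```

The returned indices can be fed back into an array-job submission so that only the failed runs are repeated.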

ancIBD.IO.batch_run.find_output_missing(metapath='', folder_out='', batch_size=400, rge=[10, 20])

Return a list of all run numbers that are missing.
metapath: Path to the .tsv of IIDs run for IBD screening [str]
folder_out: Output folder [str]
batch_size: Number of individuals run per batch [int]

ancIBD.IO.batch_run.get_batch_nr(n_iids, batchsize=400, n_chr=22)

Get the number of jobs to submit to the cluster. Returns n [int].
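For a rough sense of scale, the job count can be estimated as sketched below. The sketch assumes one job per chromosome and per unordered batch pair, with a batch also paired with itself; this counting rule is an assumption and `job_count_sketch` is a hypothetical name.

```python
import math


def job_count_sketch(n_iids, batchsize=400, n_chr=22):
    """Estimate the number of cluster jobs.

    Hypothetical helper; assumes jobs = batch pairs x chromosomes.
    """
    n_batches = math.ceil(n_iids / batchsize)
    # Unordered pairs including self-pairs: n * (n + 1) / 2
    n_pairs = n_batches * (n_batches + 1) // 2
    return n_pairs * n_chr


# 3500 individuals -> 9 batches -> 45 batch pairs -> 990 jobs
print(job_count_sketch(3500))  # 990
```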