variantplaner ¶
VariantPlaner is a toolkit for managing many variants without requiring much CPU or RAM.
It converts vcf to parquet, annotations to parquet, and parquet back to vcf.
It also builds a file structure that allows fast variant database queries.
Modules:
-
cli
–Module containing the command-line entry point functions.
-
exception
–Exceptions that can be raised by VariantPlaner.
-
generate
–Functions to generate information.
-
io
–Module that manages input parsing and output serialization.
-
normalization
–Functions used to normalize data.
-
objects
–Module that stores variantplaner objects.
-
struct
–Generated data structures for easy integration.
Classes:
-
Annotations
–Object to manage lazyframe as Annotations.
-
ContigsLength
–Store contigs -> length information.
-
Genotypes
–Object to manage lazyframe as Genotypes.
-
Pedigree
–Object to manage lazyframe as Pedigree.
-
Variants
–Object to manage lazyframe as Variants.
-
Vcf
–Object to manage lazyframe as Vcf.
-
VcfHeader
–Object that parses and stores vcf information.
-
VcfParsingBehavior
–Enumeration used to control the behavior of IntoLazyFrame.
Annotations ¶
Annotations()
Bases: LazyFrame
Object to manage lazyframe as Annotations.
Methods:
-
minimal_schema
–Get minimal schema of annotations polars.LazyFrame.
Source code in src/variantplaner/objects/annotations.py
minimal_schema classmethod
¶
Get minimal schema of annotations polars.LazyFrame.
Source code in src/variantplaner/objects/annotations.py
ContigsLength ¶
ContigsLength()
Store contigs -> length information.
Methods:
-
from_path
–Fill the object from the file pointed to by a pathlib.Path.
-
from_vcf_header
–Fill the object from a VcfHeader.
Source code in src/variantplaner/objects/contigs_length.py
from_path ¶
Fill the object from the file pointed to by a pathlib.Path.
Argument: path: path of input file
Returns: Number of contig lines seen
Source code in src/variantplaner/objects/contigs_length.py
from_vcf_header ¶
Fill the object from a VcfHeader.
Argument: header: VcfHeader
Returns: Number of contig lines seen
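As an illustration of what filling a contig -> length map from header lines can involve, here is a minimal standalone sketch. It is not variantplaner's implementation: the function name `contigs_length` and the regular expression are assumptions, and only the `##contig=<ID=...,length=...>` line layout comes from the VCF format itself.

```python
import re

# Matches VCF header lines such as: ##contig=<ID=chr1,length=248956422>
CONTIG_RE = re.compile(r"^##contig=<(?P<fields>.*)>\s*$")


def contigs_length(header_lines):
    """Return ({contig name: length}, number of contig lines seen).

    Hypothetical helper written for illustration only.
    """
    lengths = {}
    seen = 0
    for line in header_lines:
        match = CONTIG_RE.match(line)
        if match is None:
            continue  # not a contig line
        seen += 1
        # Split "ID=chr1,length=248956422" into key/value pairs.
        fields = dict(
            item.split("=", 1) for item in match["fields"].split(",") if "=" in item
        )
        if "ID" in fields and "length" in fields:
            lengths[fields["ID"]] = int(fields["length"])
    return lengths, seen
```

The naive comma split is enough for simple contig lines; quoted values that contain commas would need a real parser.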
Source code in src/variantplaner/objects/contigs_length.py
Genotypes ¶
Genotypes(data: LazyFrame | None = None)
Bases: LazyFrame
Object to manage lazyframe as Genotypes.
Methods:
-
minimal_schema
–Get minimal schema of genotypes polars.LazyFrame.
-
samples_names
–Get the list of sample names.
Source code in src/variantplaner/objects/genotypes.py
minimal_schema classmethod
¶
Get minimal schema of genotypes polars.LazyFrame.
Source code in src/variantplaner/objects/genotypes.py
Pedigree ¶
Pedigree()
Bases: LazyFrame
Object to manage lazyframe as Pedigree.
Methods:
-
from_path
–Read a pedigree file in polars.LazyFrame.
-
minimal_schema
–Get schema of pedigree polars.LazyFrame.
-
to_path
–Write pedigree polars.LazyFrame in ped format.
Source code in src/variantplaner/objects/pedigree.py
from_path ¶
from_path(input_path: Path) -> None
Read a pedigree file in polars.LazyFrame.
Parameters:
-
input_path
(Path
) –Path to pedigree file.
Returns:
-
None
–None; the object itself is populated as a polars.LazyFrame that contains ped information ('family_id', 'personal_id', 'father_id', 'mother_id', 'sex', 'affected')
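To make the six-column layout concrete, the following standalone sketch reads ped lines into plain dictionaries keyed by the column names listed above. It does not use polars or variantplaner; `read_ped` is a hypothetical helper, and whitespace-separated fields and the skipping of `#` comment lines are assumptions.

```python
# Column names as listed in the documentation above.
PED_COLUMNS = ("family_id", "personal_id", "father_id", "mother_id", "sex", "affected")


def read_ped(lines):
    """Parse ped lines into a list of dicts (illustrative helper, not the library API)."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank and comment lines (assumption)
        values = line.split()  # fields assumed whitespace-separated
        rows.append(dict(zip(PED_COLUMNS, values[:6])))
    return rows
```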
Source code in src/variantplaner/objects/pedigree.py
minimal_schema classmethod
¶
Get schema of pedigree polars.LazyFrame.
Source code in src/variantplaner/objects/pedigree.py
to_path ¶
to_path(output_path: Path) -> None
Write pedigree polars.LazyFrame in ped format.
Warning: This function performs polars.LazyFrame.collect before writing the csv; this can have a significant impact on memory usage.
Parameters:
-
lf
–LazyFrame contains pedigree information.
-
output_path
(Path
) –Path where pedigree information is written.
Returns:
-
None
–None
Source code in src/variantplaner/objects/pedigree.py
Variants ¶
Variants(data: LazyFrame | None = None)
Bases: LazyFrame
Object to manage lazyframe as Variants.
Methods:
-
minimal_schema
–Get schema of variants polars.LazyFrame.
Source code in src/variantplaner/objects/variants.py
minimal_schema classmethod
¶
Get schema of variants polars.LazyFrame.
Source code in src/variantplaner/objects/variants.py
Vcf ¶
Vcf()
Object to manage lazyframe as Vcf.
Methods:
-
add_genotypes
–Add genotypes information in vcf.
-
annotations
–Get annotations of vcf.
-
from_path
–Populate Vcf object with vcf file.
-
genotypes
–Get genotype of vcf.
-
schema
–Get schema of Vcf polars.LazyFrame.
-
set_variants
–Set variants of vcf.
-
variants
–Get variants of vcf.
Source code in src/variantplaner/objects/vcf.py
add_genotypes ¶
add_genotypes(genotypes_lf: Genotypes) -> None
Add genotypes information in vcf.
Source code in src/variantplaner/objects/vcf.py
annotations ¶
annotations(
select_info: set[str] | None = None,
) -> Annotations
Get annotations of vcf.
Source code in src/variantplaner/objects/vcf.py
from_path ¶
from_path(
path: Path,
chr2len_path: Path | None,
behavior: VcfParsingBehavior = NOTHING,
) -> None
Populate Vcf object with vcf file.
Source code in src/variantplaner/objects/vcf.py
genotypes ¶
genotypes() -> Genotypes
Get genotype of vcf.
Source code in src/variantplaner/objects/vcf.py
schema classmethod
¶
Get schema of Vcf polars.LazyFrame.
Source code in src/variantplaner/objects/vcf.py
VcfHeader ¶
VcfHeader()
Object that parse and store vcf information.
Methods:
-
build_metadata
–Generate metadata associated with vcf_header.
-
column_name
–Get an iterator of correct column name.
-
format_parser
–Generate a list of polars.Expr to extract genotypes information.
-
from_files
–Populate VcfHeader object with the content of a header-only file.
-
from_lines
–Extract all header information of vcf lines.
-
info_parser
–Generate a list of polars.Expr to extract variants information.
Attributes:
-
contigs
(Iterator[str]
) –Get an iterator over the lines that contain chromosome information.
-
samples_index
(dict[str, int] | None
) –Read vcf header to generate an association map between sample name and index.
Source code in src/variantplaner/objects/vcf_header.py
contigs cached
property
¶
Get an iterator over the lines that contain chromosome information.
Returns: String iterator
samples_index cached
property
¶
Read vcf header to generate an association map between sample name and index.
Args: header: Header string.
Returns: Map that associates each sample name with its sample index.
Raises: NotVcfHeaderError: If no line starts with '#CHR'
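The association between sample names and indexes can be sketched without the library. In the VCF format, the `#CHROM` line holds nine fixed tab-separated columns (`#CHROM` through `FORMAT`) followed by one column per sample. The function below is a hypothetical stand-in for the property: it raises a plain `ValueError` where the documentation names `NotVcfHeaderError`, and returning `None` when no sample is present is an assumption based on the `dict[str, int] | None` type shown above.

```python
FIXED_COLUMNS = 9  # #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT


def samples_index(header_lines):
    """Map each sample name to its 0-based position (illustrative sketch)."""
    for line in header_lines:
        if line.startswith("#CHROM"):
            columns = line.rstrip("\n").split("\t")
            samples = columns[FIXED_COLUMNS:]
            if not samples:
                return None  # header without genotype columns (assumption)
            return {name: index for index, name in enumerate(samples)}
    # Stand-in for the NotVcfHeaderError named in the documentation.
    raise ValueError("no line starts with '#CHROM'")
```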
build_metadata ¶
Generate metadata associated with vcf_header.
Args: select_columns: Output only columns in this list.
Returns: A map associating each column name with its corresponding header line
Source code in src/variantplaner/objects/vcf_header.py
column_name ¶
Get an iterator of correct column name.
Returns: String iterator
Source code in src/variantplaner/objects/vcf_header.py
format_parser ¶
Generate a list of polars.Expr to extract genotypes information.
Warning: Float values can't be converted for the moment; they are stored as String to keep information.
Args: header: Line of vcf header. input_path: Path to vcf file. select_format: List of target format fields.
Returns: A dict linking each format id to a pipeable function built with polars.Expr
Raises: NotVcfHeaderError: If no line starts with '#CHR'
Source code in src/variantplaner/objects/vcf_header.py
from_files ¶
from_files(path: Path) -> None
Populate VcfHeader object with the content of a header-only file.
Args: path: Path of file
Returns: None
Source code in src/variantplaner/objects/vcf_header.py
from_lines ¶
Extract all header information of vcf lines.
The header is made of the lines between the start of the file and the first line that starts with '#CHROM' or does not start with '#'.
Args: lines: Iterator of lines
Returns: None
Raises: NotAVcfHeader: If a line does not start with '#'. NotAVcfHeader: If no line starts with '#CHROM'.
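The reading rule described above can be sketched as a small standalone function. This is an illustration under assumptions, not the method's source: `read_header` is a hypothetical name, and the local `NotAVcfHeader` class only stands in for the exception named in the text.

```python
class NotAVcfHeader(Exception):
    """Local stand-in for the exception named in the documentation."""


def read_header(lines):
    """Collect header lines up to and including the '#CHROM' line."""
    header = []
    for line in lines:
        if not line.startswith("#"):
            # A data line was reached before any '#CHROM' line.
            raise NotAVcfHeader("a line does not start with '#'")
        header.append(line.rstrip("\n"))
        if line.startswith("#CHROM"):
            return header  # the column line closes the header
    raise NotAVcfHeader("no line starts with '#CHROM'")
```

Because the function returns as soon as it meets `#CHROM`, the variant records that follow are never consumed from the iterator.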
Source code in src/variantplaner/objects/vcf_header.py
info_parser ¶
Generate a list of polars.Expr to extract variants information.
Args: header: Line of vcf header. input_path: Path to vcf file. select_info: List of target info fields.
Returns: List of polars.Expr to parse info columns.
Raises: NotVcfHeaderError: If no line starts with '#CHR'
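Before any expression can be built, the INFO declarations have to be read from the header. The sketch below shows that first step only, in plain Python rather than polars.Expr. `info_fields` is a hypothetical helper, the regular expression assumes the usual `ID`, `Number`, `Type` key order of `##INFO` lines, and the mapping to Python types is an illustrative choice.

```python
import re

# Assumes the common key order: ##INFO=<ID=...,Number=...,Type=...,Description="...">
INFO_RE = re.compile(
    r"^##INFO=<ID=(?P<id>[^,]+),Number=(?P<number>[^,]+),Type=(?P<type>[^,>]+)"
)

# Illustrative mapping from VCF value types to Python types.
VCF_TYPES = {"Integer": int, "Float": float, "String": str, "Character": str, "Flag": bool}


def info_fields(header_lines, select_info=None):
    """Return {info id: (Number, python type)} for the selected INFO fields."""
    fields = {}
    for line in header_lines:
        match = INFO_RE.match(line)
        if match is None:
            continue  # not an INFO declaration
        if select_info is not None and match["id"] not in select_info:
            continue  # field not requested
        fields[match["id"]] = (match["number"], VCF_TYPES.get(match["type"], str))
    return fields
```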
Source code in src/variantplaner/objects/vcf_header.py
VcfParsingBehavior ¶
Bases: IntFlag
Enumeration used to control the behavior of IntoLazyFrame.
Attributes:
any2string ¶
Convert an int to a string. Used for temp file creation.
Source code in src/variantplaner/__init__.py
exception ¶
Exceptions that can be raised by VariantPlaner.
Classes:
-
NoContigsLengthInformationError
–Exception raised if contigs length information is found neither in the vcf nor in a companion file.
-
NoGTError
–Exception raised if the genotype polars.LazyFrame does not contain a gt column.
-
NoGenotypeError
–Exception raised if the vcf file does not seem to contain genotype information.
-
NotAVCFError
–Exception raised if the file read does not seem to be a vcf; generally it does not contain a line that starts with '#CHROM'.
-
NotAVariantCsvError
–Exception raised if a csv file should contain variant information but its column names do not match the minimal requirement.
-
NotVcfHeaderError
–Exception raised if the header isn't compatible with vcf.
NoContigsLengthInformationError ¶
NoContigsLengthInformationError()
Bases: Exception
Exception raised if contigs length information is found neither in the vcf nor in a companion file.
Source code in src/variantplaner/exception.py
NoGTError ¶
NoGTError(message: str)
Bases: Exception
Exception raised if the genotype polars.LazyFrame does not contain a gt column.
Source code in src/variantplaner/exception.py
NoGenotypeError ¶
NoGenotypeError()
Bases: Exception
Exception raised if the vcf file does not seem to contain genotype information.
Source code in src/variantplaner/exception.py
NotAVCFError ¶
NotAVCFError(path: Path)
Bases: Exception
Exception raised if the file read does not seem to be a vcf; generally it does not contain a line that starts with '#CHROM'.
Source code in src/variantplaner/exception.py
NotAVariantCsvError ¶
NotAVariantCsvError(path: Path)
Bases: Exception
Exception raised if a csv file should contain variant information but its column names do not match the minimal requirement.
Source code in src/variantplaner/exception.py
generate ¶
Functions to generate information.
Functions:
-
transmission
–Compute how each variant is transmitted to the index case.
-
transmission_ped
–Compute the transmission of each variant.
transmission ¶
transmission(
genotypes_lf: LazyFrame,
index_names: tuple[str],
mother_names: tuple[str | None] = (None,),
father_names: tuple[str | None] = (None,),
) -> DataFrame | None
Compute how each variant is transmitted to the index case.
Parameters:
-
genotypes_lf
(LazyFrame
) –Genotypes polars.LazyFrame; a gt column is required.
-
index_names
–Sample names of index cases.
-
mother_names
–Sample names of mothers.
-
father_names
–Sample names of fathers.
Returns:
-
DataFrame | None
–polars.DataFrame with transmission information, including genotyping information for index, mother and father. If any of them isn't present, its value is set to polars.Null. The transmission column contains a string: concat(chr(index_gt + 33), chr(mother_gt + 33), chr(father_gt + 33)). For example, the transmission
#~!
means a homozygous diploid variant that is not present in the father, with no information about the mother.
Raises:
-
NoGTError
–If genotypes_lf does not contain a gt column.
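The transmission string can be reproduced in a few lines of plain Python. This is a sketch of the encoding formula given above, not the library's polars implementation. `transmission_code` is a hypothetical name, and encoding a missing genotype as `~` is an assumption inferred from the documented `#~!` example.

```python
MISSING = "~"  # assumption: taken from the documented example "#~!"


def transmission_code(index_gt, mother_gt=None, father_gt=None):
    """Encode three genotype counts as one character each: chr(gt + 33)."""
    return "".join(
        MISSING if gt is None else chr(gt + 33)
        for gt in (index_gt, mother_gt, father_gt)
    )


# Homozygous index (gt = 2), unknown mother, absent in father (gt = 0).
print(transmission_code(2, None, 0))
```

The offset of 33 places a genotype of 0 on `!`, the first printable non-space ASCII character, so every code stays readable in a text column.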
Source code in src/variantplaner/generate.py
transmission_ped ¶
transmission_ped(
genotypes_lf: LazyFrame, pedigree_lf: LazyFrame
) -> DataFrame | None
Compute the transmission of each variant.
Warning: only the first sample with two parents is considered.
Parameters:
-
genotypes_lf
(LazyFrame
) –Genotypes polars.LazyFrame; a gt column is required.
-
pedigree_lf
(LazyFrame
) –Pedigree polars.LazyFrame.
Returns:
-
DataFrame | None
–DataFrame with transmission information
Raises:
-
NoGTError
–If genotypes_lf does not contain a gt column.
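The warning above means a single trio is selected from the pedigree. A minimal standalone sketch of that selection follows. `first_trio` is a hypothetical helper working on plain dictionaries, and treating `"0"` as a missing parent follows the usual ped convention, which is assumed here rather than confirmed for variantplaner.

```python
def first_trio(pedigree_rows, missing=("0", "", None)):
    """Return (index, mother, father) for the first sample with two known parents."""
    for row in pedigree_rows:
        if row["father_id"] not in missing and row["mother_id"] not in missing:
            return row["personal_id"], row["mother_id"], row["father_id"]
    return None  # no sample has both parents
```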
Source code in src/variantplaner/generate.py
normalization ¶
Functions used to normalize data.
Functions:
-
add_id_part
–Add column id part.
-
add_variant_id
–Add a column id of variants.
add_id_part ¶
add_id_part(
lf: LazyFrame, number_of_bits: int = 8
) -> LazyFrame
Add column id part.
If id is a large variant id, id_part is set to 255; for other values, the 8 most significant position bits are used.
Parameters:
-
lf
(LazyFrame
) –polars.LazyFrame containing an id column.
Returns:
-
LazyFrame
–polars.LazyFrame with column id_part added
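The partition rule can be sketched with integer bit operations. This is an interpretation, not the library's code: it assumes a 64-bit unsigned id whose leftmost bit flags a large variant, and it takes the `number_of_bits` bits that follow that flag as the partition. The name `id_part` is reused here for a hypothetical scalar function.

```python
def id_part(variant_id, number_of_bits=8):
    """Compute a partition number from a 64-bit variant id (illustrative sketch)."""
    if variant_id >> 63 == 1:
        # Large variant: all partition bits set, 255 when number_of_bits is 8.
        return (1 << number_of_bits) - 1
    # Drop the flag bit, then keep the most significant number_of_bits bits.
    return (variant_id & ((1 << 63) - 1)) >> (63 - number_of_bits)
```

Because the high bits of a short variant id are said to derive from the position, variants that are close on the genome would tend to fall in the same partition under this reading.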
Source code in src/variantplaner/normalization.py
add_variant_id ¶
add_variant_id(
lf: LazyFrame, chrom2length: LazyFrame
) -> LazyFrame
Add a column id of variants.
Id computation is based on
Two different algorithms are used to calculate the variant identifier, depending on the cumulative length of the reference and alternative sequences.
If the cumulative length of the reference and alternative sequences is short, the leftmost bit of the id is set to 0, then a unique 63-bit hash of the variant is calculated.
If the cumulative length of the reference and alternative sequences is long, the leftmost bit of the id is set to 1, followed by a hash (the hash function used in Firefox) of the chromosome, position, reference and alternative sequence, without its right-most bit.
If lf.columns contains SVTYPE and SVLEN, variants whose alt matches the regex <([^:]+).*> with a captured group equal to SVTYPE have their alt replaced by the concatenation of SVTYPE and the first SVLEN value.
Parameters:
-
lf
(LazyFrame
) –polars.LazyFrame contains: chr, pos, ref, alt columns.
-
chrom2length
(LazyFrame
) –polars.LazyFrame containing chr and length columns.
Returns:
-
LazyFrame
–polars.LazyFrame with chr column normalized
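The structural variant rule is easier to follow on a scalar example. The sketch below applies the documented regex to a single alt value in plain Python. `normalize_sv_alt` is a hypothetical helper, and joining SVTYPE and SVLEN with no separator is an assumption about what "concatenation" means here.

```python
import re

# Regex quoted in the documentation: the group captures the text before any ':'.
SV_ALT_RE = re.compile(r"<([^:]+).*>")


def normalize_sv_alt(alt, svtype, svlen):
    """Replace a symbolic alt such as <DEL> by SVTYPE + first SVLEN value."""
    match = SV_ALT_RE.fullmatch(alt)
    if match is None or not svlen or match.group(1) != svtype:
        return alt  # not a symbolic allele matching SVTYPE: keep as is
    return f"{svtype}{svlen[0]}"
```

Replacing the symbolic allele this way gives two deletions of different lengths at the same position distinct alt values.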
Source code in src/variantplaner/normalization.py
vcf ¶
Read and write vcf file.
Functions:
-
build_rename_column
–A helper function to generate rename column dict for variantplaner.io.vcf.lazyframe_in_vcf function parameter.
-
lazyframe_in_vcf
–Write polars.LazyFrame in vcf format.
build_rename_column ¶
build_rename_column(
chromosome: str,
pos: str,
identifier: str,
ref: str,
alt: str,
qual: str | None = ".",
filter_col: str | None = ".",
info: list[tuple[str, str]] | None = None,
format_string: str | None = None,
sample: dict[str, dict[str, str]] | None = None,
) -> RenameCol
A helper function to generate rename column dict for variantplaner.io.vcf.lazyframe_in_vcf function parameter.
Returns:
-
RenameCol
–A rename column dictionary.
Source code in src/variantplaner/io/vcf.py
lazyframe_in_vcf ¶
lazyframe_in_vcf(
lf: LazyFrame,
output_path: Path,
/,
vcf_header: VcfHeader | None = None,
renaming: RenameCol = DEFAULT_RENAME,
) -> None
Write polars.LazyFrame in vcf format.
Warning: This function performs polars.LazyFrame.collect before writing the vcf; this can have a significant impact on memory usage.
Parameters:
-
lf
(LazyFrame
) –LazyFrame containing the information to write.
-
output_path
(Path
) –Path where the vcf is written.
Returns:
-
None
–None
Source code in src/variantplaner/io/vcf.py
struct ¶
genotypes ¶
Functions related to genotype structuring.
Functions:
-
hive
–Read all genotype parquet files and use their information to generate a hive-like structure, based on the 63rd to 55th bits (inclusive) of the variant id, with genotype information.
hive ¶
hive(
paths: list[Path],
output_prefix: Path,
threads: int,
file_per_thread: int,
*,
append: bool,
number_of_bits: int = 8,
) -> None
Read all genotype parquet files and use their information to generate a hive-like structure, based on the 63rd to 55th bits (inclusive) of the variant id, with genotype information.
The actual number of threads used is equal to \(min(threads, len(paths))\).
The output layout looks like: {output_prefix}/id_part=[0..2.pow(number_of_bits)]/0.parquet.
Parameters:
-
paths
(list[Path]
) –list of files you want to reorganize
-
output_prefix
(Path
) –prefix of hive
-
threads
(int
) –number of multiprocessing threads to run
-
file_per_thread
(int
) –number of files managed per multiprocessing thread
Returns:
-
None
–None
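The output layout and the thread count described above can be made concrete with a short sketch. Both helpers are hypothetical and create no files. The sketch also assumes the upper bound `2.pow(number_of_bits)` is exclusive, which gives 256 partitions for the default of 8 bits.

```python
from pathlib import Path


def hive_layout(output_prefix, number_of_bits=8):
    """List the partition files expected under output_prefix (no file is created)."""
    return [
        Path(output_prefix) / f"id_part={part}" / "0.parquet"
        for part in range(2**number_of_bits)  # upper bound assumed exclusive
    ]


def real_threads(threads, paths):
    """Number of workers actually used: min(threads, len(paths))."""
    return min(threads, len(paths))
```

With 3 input files and 8 requested threads, `real_threads` gives 3, since extra workers would have no file to process.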
Source code in src/variantplaner/struct/genotypes.py
variants ¶
Functions related to vcf structuring.
Functions:
-
merge
–Merge multiple parquet variants files into one file.
merge ¶
merge(
paths: list[Path],
output_prefix: Path,
memory_limit: int = 10000000000,
polars_threads: int = 4,
*,
append: bool,
) -> None
Merge multiple parquet variants files into one file.
This function generates temporary files. By default they are written in /tmp,
but you can control where they are written by setting the TMPDIR, TEMP or TMP environment variable.
Parameters:
-
paths
(list[Path]
) –List of files you want chunked.
-
output
–Path where variants are written.
-
memory_limit
(int
, default:10000000000
) –Size of each chunk in bytes.
Returns:
-
None
–None
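TMPDIR, TEMP and TMP are the variables that Python's standard tempfile module consults, in that order. Assuming merge relies on that module (this page does not confirm it), the temporary directory can be redirected before the call as sketched below. `use_temp_directory` is a hypothetical helper.

```python
import os
import tempfile


def use_temp_directory(directory):
    """Point Python's tempfile module at directory and return the effective path."""
    os.environ["TMPDIR"] = str(directory)
    # tempfile caches its choice; resetting it forces a new lookup of TMPDIR.
    tempfile.tempdir = None
    return tempfile.gettempdir()
```

The directory must already exist and be writable, otherwise tempfile falls back to the next candidate location. Setting the variable in the shell before starting the process avoids the cache reset.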
Source code in src/variantplaner/struct/variants.py