API documentation
The PascalX package comes with the following user modules:
- Genome (PascalX.genome)
- Genescorer (PascalX.genescorer)
- Pathwayscorer (PascalX.pathway)
- Xscorer (PascalX.xscorer)
- Genexprscorer (PascalX.genexpr)
Internal modules:
Genome
- class PascalX.genome.genome[source]
This class handles the genome annotation. It provides functionality for import of data from text files and automatic download of annotation data from ensembl.org.
- gene_info(gene)[source]
Prints the loaded information for given gene
- Parameters
gene (string) – Symbol of gene to query
- get_ensembl_annotation(filename, genetype='protein_coding', version='GRCh38')[source]
Gene annotation download function for ensembl.org BioMart data
- Parameters
filename (string) – File to store downloaded annotation in
genetype (string) – Comma separated list of BioMart genetypes to download (protein_coding, pseudogene, …)
version (string) – Genome assembly version (GRCh37 | GRCh38)
Example:
from PascalX.genome import genome
G = genome()
G.get_ensembl_annotation('ensemble_hg38.txt')
- load_genome(file, ccol=1, cid=0, csymb=5, cstx=2, cetx=3, cs=4, cb=None, chrStart=0, splitchr='\t', NAgeneid='n/a', useNAgenes=False, header=False)[source]
Imports gene annotation from text file
- Parameters
file (text) – File to load
ccol (int) – Column containing chromosome number
cid (int) – Column containing gene id
csymb (int) – Column containing gene symbol
cstx (int) – Column containing transcription start
cetx (int) – Column containing transcription end
cs (int) – Column containing strand
cb (int) – Column containing band (None if not supplied)
chrStart (int) – Number of leading characters to skip in ccol
splitchr (string) – Character used to separate columns in text file
NAgeneid (string) – Identifier for not available gene id
useNAgenes (bool) – Import genes without gene id
header (bool) – First line is header
- Internal:
_GENEID (dict) - Contains the raw data with gene id (cid) as key
_GENESYMB (dict) - Mapping from gene symbols (csymb) to gene ids (cid)
_GENEIDtoSYMB (dict) - Mapping from gene ids (cid) to gene symbols (csymb)
_CHR (dict) - Mapping from chromosomes to list of gene symbols
_BAND (dict) - Mapping from band to list of gene symbols
_SKIPPED (dict) - Genes (cid) which could not be imported
Note
A unique gene id is automatically generated for n/a gene ids if useNAgenes=True.
Note
Identical gene ids occurring in more than one row are merged into a single gene if they lie on the same chromosome and the positional gap is < 1 Mb. The strand is taken from the longer segment.
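The column parameters and the merge rule described above can be illustrated with a small standalone sketch. It is not PascalX's implementation: the helper names parse_annotation and merge_rows are hypothetical, column indices are assumed to be 0-based (as the default cid=0 suggests), and the handling of rows that cannot be merged (recorded in a separate list) is an assumption.

```python
def parse_annotation(lines, ccol=1, cid=0, csymb=5, cstx=2, cetx=3, cs=4,
                     chrStart=0, splitchr='\t', header=False):
    """Pick the annotation fields of each row by (assumed 0-based) column index."""
    rows = []
    for i, line in enumerate(lines):
        if header and i == 0:
            continue  # skip the header line
        f = line.rstrip('\n').split(splitchr)
        rows.append({
            'id': f[cid],
            'chr': f[ccol][chrStart:],  # drop leading characters such as "chr"
            'symbol': f[csymb],
            'start': int(f[cstx]),
            'end': int(f[cetx]),
            'strand': f[cs],
        })
    return rows


def merge_rows(rows, max_gap=1_000_000):
    """Merge rows sharing a gene id if on the same chromosome and gap < max_gap."""
    merged, not_merged = {}, []
    for r in rows:
        g = merged.get(r['id'])
        if g is None:
            merged[r['id']] = dict(r, longest=r['end'] - r['start'])
            continue
        gap = max(r['start'] - g['end'], g['start'] - r['end'], 0)
        if r['chr'] != g['chr'] or gap >= max_gap:
            not_merged.append(r['id'])  # assumption: such rows are set aside
            continue
        length = r['end'] - r['start']
        if length > g['longest']:  # strand follows the longest segment seen
            g['strand'], g['longest'] = r['strand'], length
        g['start'] = min(g['start'], r['start'])
        g['end'] = max(g['end'], r['end'])
    return merged, not_merged


example = [
    "ENSG1\tchr1\t100\t200\t+\tGENEA",
    "ENSG1\tchr1\t5000\t9000\t-\tGENEA",
    "ENSG2\tchr2\t10\t20\t+\tGENEB",
]
genes, not_merged = merge_rows(parse_annotation(example, chrStart=3))
print(genes['ENSG1']['start'], genes['ENSG1']['end'], genes['ENSG1']['strand'])
```

Here chrStart=3 strips the "chr" prefix, and the two ENSG1 rows are merged into one gene spanning both segments.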
Genescorer
- class PascalX.genescorer.genescorer[source]
Genescorer base class
- load_refpanel(filename, parallel=1, keepfile=None, qualityT=100, SNPonly=False, chrlist=None)[source]
Sets the reference panel to use
- Parameters
filename (string) – /path/filename (without .chr#.db ending)
parallel (int) – Number of cores to use for parallel import of reference panel
keepfile – File with sample ids (one per line) to keep (only for .vcf)
qualityT – Quality threshold for variant to keep (only for .vcf)
SNPonly – Import only SNPs (only for .vcf)
chrlist (list) – List of chromosomes to import. (None to import 1-22)
Note
One file per chromosome with ending .chr#.db required (#: 1-22). If imported reference panel is not present, PascalX will automatically try to import from .chr#.tped.gz or .chr#.vcf.gz files.
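To make the file naming convention from the note concrete, the following sketch lists the file names that would be looked for, given a filename prefix. The helper refpanel_candidates is hypothetical and only builds strings; the order in which the .tped.gz and .vcf.gz fallbacks are tried is an assumption.

```python
def refpanel_candidates(prefix, chrlist=None):
    """Return, per chromosome, the imported .db file and the raw fallback files."""
    chromosomes = list(range(1, 23)) if chrlist is None else list(chrlist)
    return {
        c: [
            f"{prefix}.chr{c}.db",       # already imported reference panel
            f"{prefix}.chr{c}.tped.gz",  # raw data fallback
            f"{prefix}.chr{c}.vcf.gz",   # raw data fallback
        ]
        for c in chromosomes
    }


for name in refpanel_candidates("/path/EUR.1KG.GRCh38", chrlist=[21])[21]:
    print(name)
```

The prefix "/path/EUR.1KG.GRCh38" is only a placeholder; it corresponds to the filename argument without the .chr#.db ending.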
- load_genome(file, ccol=1, cid=0, csymb=5, cstx=2, cetx=3, cs=4, cb=None, chrStart=0, splitchr='\t', NAgeneid='n/a', useNAgenes=False, header=False)[source]
Imports gene annotation from text file
- Parameters
file (text) – File to load
ccol (int) – Column containing chromosome number
cid (int) – Column containing gene id
csymb (int) – Column containing gene symbol
cstx (int) – Column containing transcription start
cetx (int) – Column containing transcription end
cs (int) – Column containing strand
cb (int) – Column containing band (None if not supplied)
chrStart (int) – Number of leading characters to skip in ccol
splitchr (string) – Character used to separate columns in text file
NAgeneid (string) – Identifier for not available gene id
useNAgenes (bool) – Import genes without gene id
header (bool) – First line is header
- Internal:
_GENEID (dict) - Contains the raw data with gene id (cid) as key
_GENESYMB (dict) - Mapping from gene symbols (csymb) to gene ids (cid)
_GENEIDtoSYMB (dict) - Mapping from gene ids (cid) to gene symbols (csymb)
_CHR (dict) - Mapping from chromosomes to list of gene symbols
_BAND (dict) - Mapping from band to list of gene symbols
_SKIPPED (dict) - Genes (cid) which could not be imported
Note
A unique gene id is automatically generated for n/a gene ids if useNAgenes=True.
Note
Identical gene ids occurring in more than one row are merged into a single gene if they lie on the same chromosome and the positional gap is < 1 Mb. The strand is taken from the longer segment.
- load_mapping(file, gcol=0, rcol=1, wcol=None, delimiter='\t', a1col=None, a2col=None, bcol=None, pcol=None, pfilter=1, header=False, joint=True, symbol=False)[source]
Loads a SNP to gene mapping
- Parameters
file (string) – File to load
gcol (int) – Column with gene id
rcol (int) – Column with SNP id
wcol (int) – Column with weight
a1col (int) – Column of alternate allele (None for ignoring alleles)
a2col (int) – Column of reference allele (None for ignoring alleles)
bcol (int) – Column with additional weight
delimiter (string) – Character used to separate columns
pfilter (float) – Only include rows with wcol < pfilter
header (bool) – Header present
joint (bool) – Use mapping SNPs and gene window based SNPs
symbol (bool) – True: Gene id is a gene symbol; False: Gene id is an ensembl gene id
Note
A loaded mapping takes precedence over a loaded positional gene annotation
The mapping data is stored statically (same for all class initializations)
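The effect of gcol, rcol, wcol and pfilter can be sketched with a plain-Python reader. This is an illustration of the parameter semantics as documented above, not the library's code; the function read_mapping and its return structure are hypothetical.

```python
def read_mapping(lines, gcol=0, rcol=1, wcol=None, delimiter='\t',
                 pfilter=1, header=False):
    """Collect (SNP id, weight) pairs per gene id, keeping rows with weight < pfilter."""
    mapping = {}
    for i, line in enumerate(lines):
        if header and i == 0:
            continue
        f = line.rstrip('\n').split(delimiter)
        weight = float(f[wcol]) if wcol is not None else None
        if weight is not None and not weight < pfilter:
            continue  # row filtered out by pfilter
        mapping.setdefault(f[gcol], []).append((f[rcol], weight))
    return mapping


rows = ["ENSG1\trs1\t0.01", "ENSG1\trs2\t0.5", "ENSG2\trs3\t0.001"]
print(read_mapping(rows, wcol=2, pfilter=0.05))
```

With pfilter=0.05 the second row is dropped, so ENSG1 keeps only rs1.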
- load_GWAS(file, rscol=0, pcol=1, bcol=None, a1col=None, a2col=None, delimiter=None, header=False, NAid='NA', log10p=False, cutoff=1e-300)[source]
Load GWAS summary statistics p-values
- Parameters
file (string) – File containing the GWAS summary statistics data, either a plain text file or gzip-compressed with ending .gz
rscol (int) – Column of SNP ids
pcol (int) – Column of p-values
bcol (int) – Column of betas (optional)
a1col (int) – Column of alternate allele (None for ignoring)
a2col (int) – Column of reference allele (None for ignoring)
delimiter (String) – Split character
header (bool) – Header present
NAid (String) – Code for not available (rows are ignored)
log10p (bool) – p-values are given -log10 transformed
cutoff (float) – Cutoff value for p-values (None for no cutoff)
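How NAid, log10p and cutoff interact can be sketched as follows. The reader read_gwas is a hypothetical stand-in, and it treats cutoff as a lower floor for p-values. That is an assumption: the parameter description above does not say whether very small p-values are clipped or discarded.

```python
def read_gwas(lines, rscol=0, pcol=1, delimiter=None, header=False,
              NAid='NA', log10p=False, cutoff=1e-300):
    """Read SNP id -> p-value, undoing a -log10 transform if requested."""
    pvalues = {}
    for i, line in enumerate(lines):
        if header and i == 0:
            continue
        f = line.rstrip('\n').split(delimiter)  # None splits on any whitespace
        if f[pcol] == NAid or f[rscol] == NAid:
            continue  # rows marked as not available are ignored
        p = float(f[pcol])
        if log10p:
            p = 10.0 ** (-p)  # the column holds -log10(p)
        if cutoff is not None and p < cutoff:
            p = cutoff  # assumption: cutoff acts as a floor
        pvalues[f[rscol]] = p
    return pvalues


print(read_gwas(["rs1 3.0", "rs2 NA", "rs3 400"], log10p=True))
```

The third row shows why a cutoff is useful: 10 to the power of -400 underflows to zero in double precision.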
- matchAlleles(SNPonly=False)[source]
Matches alleles between loaded GWAS and reference panel (SNPs with non matching alleles are removed)
- Parameters
SNPonly (bool) – Keep only SNPs
- save_scores(file)[source]
Save computed gene scores
- Parameters
file (string) – Filename to store data
- load_scores(file, gcol=0, pcol=1, header=False)[source]
Load computed gene scores
- Parameters
file (string) – Filename of data to load
gcol (int) – Column with gene symbol
pcol (int) – Column with p-value
header (bool) – File contains a header (True|False)
- get_topscores(N=10)[source]
Prints and returns the top gene scores
- Parameters
N (int) – Number of top scores to show
- Returns
Ordered list of top scores
- Return type
list
- get_geneinfo(gene)[source]
Shows details of the loaded annotation for a gene
- Parameters
gene (String) – Gene symbol
- plot_Manhattan(ScoringResult=None, region=None, sigLine=0, logsigThreshold=0, labelSig=True, labelList=[], style='colorful')[source]
Produces a Manhattan plot
- Parameters
ScoringResult (list) – List of gene,p-value pairs [[‘Gene’,p-value],…] (if None, generate from internal _SCORES)
region (str) – Band region to plot
sigLine (float) – Draws a horizontal line at p-value (if > 0)
logsigThreshold (float) – Significance threshold above which to label genes (log(p) values)
labelSig (bool) – Label genes above significance threshold
labelList (list) – List of gene names to label
style (string) – Design of the plot (‘classic’ | ‘colorful’)
- score_chr(chrs, unloadRef=False, method='saddle', mode='auto', reqacc=1e-100, intlimit=100000, parallel=1, nobar=False, autorescore=False, keep_idx=None)[source]
Perform gene scoring for full chromosomes
- Parameters
chrs (list) – List of chromosomes to score.
unloadRef (bool) – If True, keep reference data for only one chromosome in memory per core
method (string) – Method to use to evaluate tail probability (‘auto’, ‘davies’, ‘ruben’, ‘satterthwaite’, ‘pearson’, ‘saddle’)
mode (string) – Precision mode to use (‘’, ‘128b’, ‘100d’, ‘auto’)
reqacc (float) – requested accuracy
intlimit (int) – Max # integration terms to use
parallel (int) – # of cores to use
nobar (bool) – Do not show progress bar
autorescore (bool) – Automatically try to re-score failed genes via Pearson’s algorithm
- score_all(parallel=1, method='saddle', mode='auto', reqacc=1e-100, intlimit=100000, nobar=False, autorescore=False, keep_idx=None)[source]
Perform full gene scoring
- Parameters
parallel (int) – # of cores to use
method (string) – Method to use to evaluate tail probability (‘auto’, ‘davies’, ‘ruben’, ‘satterthwaite’, ‘pearson’, ‘saddle’)
mode (string) – Precision mode to use (‘’, ‘128b’, ‘100d’, ‘auto’)
reqacc (float) – requested accuracy
intlimit (int) – Max # integration terms to use
nobar (bool) – Do not show progress bar
autorescore (bool) – Automatically try to re-score failed genes via Pearson’s algorithm
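The methods documented above can be combined into a gene scoring run roughly as follows. This is a sketch built only from the signatures on this page. The call order (reference panel, annotation, GWAS, scoring), the wrapper function score_genes and its argument names are assumptions, and the file arguments are placeholders to be supplied by the caller.

```python
def score_genes(refpanel_prefix, annotation_file, gwas_file):
    """Score all genes with the chi2 sum scorer and print the top hits."""
    # Imported here so that the sketch can be loaded without PascalX installed.
    from PascalX import genescorer

    scorer = genescorer.chi2sum(window=50000, varcutoff=0.99, MAF=0.05)
    # Expects refpanel_prefix.chr#.db (or raw .tped.gz / .vcf.gz files to import).
    scorer.load_refpanel(refpanel_prefix, parallel=1)
    # Column indices follow the documented defaults of load_genome.
    scorer.load_genome(annotation_file, ccol=1, cid=0, csymb=5, cstx=2, cetx=3, cs=4)
    # SNP ids in column 0, p-values in column 1.
    scorer.load_GWAS(gwas_file, rscol=0, pcol=1, header=False)
    result = scorer.score_all(parallel=1, method='saddle', mode='auto')
    scorer.get_topscores(N=10)
    return result
```

The returned object is what rescore and activateFails refer to as RESULT.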
- class PascalX.genescorer.chi2sum(window=50000, varcutoff=0.99, MAF=0.05, genome=None, gpu=False)[source]
Implementation of chi2 sum based genescorer
- score(gene, parallel=1, unloadRef=False, method='saddle', mode='auto', reqacc=1e-100, intlimit=1000000, nobar=False, autorescore=False, keep_idx=None)[source]
Performs gene scoring for a given list of gene symbols
- Parameters
gene (list) – gene symbols to score.
parallel (int) – # of cores to use
unloadRef (bool) – If True, keep reference data for only one chromosome in memory per core
method (string) – Method to use to evaluate tail probability (‘auto’, ‘davies’, ‘ruben’, ‘satterthwaite’, ‘pearson’, ‘saddle’)
mode (string) – Precision mode to use (‘’, ‘128b’, ‘100d’, ‘auto’)
reqacc (float) – requested accuracy
intlimit (int) – Max # integration terms to use
nobar (bool) – Do not show progress bar
autorescore (bool) – Automatically try to re-score failed genes via Pearson’s algorithm
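As background for the chi2 sum scorer, the following sketch shows the statistic in the simplest possible setting: SNP p-values are converted to chi2 values with one degree of freedom and summed. It assumes independent SNPs, in which case the sum of n terms follows a chi2 distribution with n degrees of freedom. With correlated SNPs the null distribution is a weighted chi2 sum instead, which is what the tail probability methods listed above are for. The helper names are hypothetical, and the closed-form tail is only given for even n to stay within the standard library.

```python
import math
from statistics import NormalDist


def chi2_1_from_p(p):
    """Inverse survival function of chi2 with 1 dof: x with P(X > x) = p."""
    z = NormalDist().inv_cdf(p / 2.0)  # two-sided normal quantile
    return z * z


def chi2_sum(pvalues):
    """Test statistic: sum of the chi2-transformed SNP p-values."""
    return sum(chi2_1_from_p(p) for p in pvalues)


def chi2_sf_even_dof(x, dof):
    """Survival function of chi2 with a positive even number of dof."""
    if dof <= 0 or dof % 2:
        raise ValueError("this closed form needs a positive even dof")
    half = x / 2.0
    term = total = 1.0
    for j in range(1, dof // 2):
        term *= half / j  # builds (x/2)^j / j!
        total += term
    return math.exp(-half) * total


snp_p = [0.05, 0.01]
stat = chi2_sum(snp_p)
print(stat, chi2_sf_even_dof(stat, len(snp_p)))
```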
- activateFails(RESULT)[source]
Helper method to force activate failed genes to success genes
- Parameters
RESULT (list) – Return of scoring function
Warning
Only use if you know what you are doing!
- rescore(RESULT, method='pearson', mode='auto', reqacc=1e-100, intlimit=100000, parallel=1, nobar=False, keep_idx=None)[source]
Function to re-score only the failed gene scorings of a previous scoring run with different scorer settings.
- Parameters
RESULT (list) – Return of one of the gene scoring methods
parallel (int) – # of cores to use
method (string) – Method to use to evaluate tail probability (‘auto’, ‘davies’, ‘ruben’, ‘satterthwaite’, ‘pearson’, ‘saddle’)
mode (string) – Precision mode to use (‘’, ‘128b’, ‘100d’, ‘auto’)
reqacc (float) – requested accuracy
intlimit (int) – Max # integration terms to use
nobar (bool) – Do not show progress bar
- score_gene_bulk_chr(chrs, gene, data, unloadRef=False, method='saddle', mode='auto', reqacc=1e-100, intlimit=100000, autorescore=False)[source]
Perform scoring in bulk for supplied set of SNPs
- Parameters
chrs (int) – Chromosome number the supplied SNPs are located on
gene (string) – Gene symbol for the SNPs
data (list) – List of SNP data in format [ [rsid1,rsid2,…], [GWASid1, GWASid2,…], M ] with M a p-value matrix (rows: GWAS, cols: rsid)
unloadRef (bool) – Remove loaded reference data from memory
method (string) – Method to use to evaluate tail probability (‘auto’, ‘davies’, ‘ruben’, ‘satterthwaite’, ‘pearson’, ‘saddle’)
mode (string) – Precision mode to use (‘’, ‘128b’, ‘100d’, ‘auto’)
reqacc (float) – requested accuracy
intlimit (int) – Max # integration terms to use
autorescore (bool) – Automatically try to re-score failed genes via Pearson’s algorithm
- plot_genesnps(G, show_correlation=False, mark_window=False, tickspacing=10, color='limegreen', corrcmap=None)[source]
Plots the SNP p-values for a list of genes and the genotypic SNP-SNP correlation matrix
- Parameters
G (list) – List of gene symbols
show_correlation (bool) – Plot the corresponding SNP-SNP correlation matrix
tickspacing (int) – Spacing of ticks for correlation plot
color (color) – Color for SNP associations
corrcmap (cmap) – Colormap to use for correlation plot (None for default)
- class PascalX.genescorer.wchi2sum(window=50000, varcutoff=0.99, MAF=0.05, genome=None, gpu=False)[source]
Implementation of weighted chi2 sum based genescorer
Note
SNP weights have to be supplied via Mapper
- score(gene, parallel=1, unloadRef=False, method='saddle', mode='auto', reqacc=1e-100, intlimit=1000000, nobar=False, autorescore=False, keep_idx=None)[source]
Performs gene scoring for a given list of gene symbols
- Parameters
gene (list) – gene symbols to score.
parallel (int) – # of cores to use
unloadRef (bool) – If True, keep reference data for only one chromosome in memory per core
method (string) – Method to use to evaluate tail probability (‘auto’, ‘davies’, ‘ruben’, ‘satterthwaite’, ‘pearson’, ‘saddle’)
mode (string) – Precision mode to use (‘’, ‘128b’, ‘100d’, ‘auto’)
reqacc (float) – requested accuracy
intlimit (int) – Max # integration terms to use
nobar (bool) – Do not show progress bar
autorescore (bool) – Automatically try to re-score failed genes via Pearson’s algorithm
Pathwayscorer
- class PascalX.pathway.chi2rank(genescorer, mergedist=100000, fuse=True)[source]
Pathway scoring via chi2 of ranked gene p-values. Nearby genes can be merged to form meta-genes
- Parameters
genescorer (genescorer) – The initialized genescorer to use to re-compute fused genes
mergedist (int) – Maximum distance between genes to merge
fuse (bool) – Fuse nearby genes to meta-genes
- score(modules, method='saddle', mode='auto', reqacc=1e-100, parallel=1, nobar=False, genes_only=False, chrs_only=None, autorescore=True)[source]
Scores a set of pathways/modules
- Parameters
modules (list) – List of modules to score
samples (int) – # of random gene sets to draw
method (string) – Method to use to evaluate tail probability (‘auto’, ‘davies’, ‘ruben’, ‘satterthwaite’, ‘pearson’, ‘saddle’)
mode (string) – Precision mode to use (‘’, ‘128b’, ‘100d’, ‘auto’)
reqacc (float) – requested accuracy
autorescore (bool) – Automatically try to re-score failed genes
nobar (bool) – Do not show progress bar
genes_only (bool) – Compute only (fused)-genescores (accessible via genescorer method)
chrs_only (list) – Only consider genes on listed chromosomes. None for all
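The idea of scoring a pathway via the chi2 of ranked gene p-values can be sketched in a few lines. The details below are assumptions made for illustration (rank r among N genes mapped to r/(N+1), no gene fusion, helper names hypothetical) and need not match what chi2rank does internally. Under the null, the resulting statistic for m genes would be compared against a chi2 distribution with m degrees of freedom.

```python
from statistics import NormalDist


def rank_uniform(gene_p):
    """Replace gene p-values by their rank divided by (N + 1)."""
    ordered = sorted(gene_p, key=gene_p.get)  # most significant gene first
    n = len(ordered)
    return {g: (i + 1) / (n + 1) for i, g in enumerate(ordered)}


def module_statistic(module, gene_p):
    """Sum of chi2(1)-transformed rank p-values over the module's scored genes."""
    u = rank_uniform(gene_p)
    stat, m = 0.0, 0
    for g in module:
        if g in u:  # genes without a score are skipped
            z = NormalDist().inv_cdf(u[g] / 2.0)
            stat += z * z
            m += 1
    return stat, m


scores = {'A': 1e-8, 'B': 0.2, 'C': 0.5, 'D': 0.9}
print(module_statistic(['A', 'B', 'X'], scores))
```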
- get_sigpathways(RESULT, cutoff=0.0001)[source]
Prints significant pathways in the result set
- Parameters
RESULT (list) – Return of a pathwayscorer
cutoff (float) – Significance threshold to print pathways
- load_modules(file, ncol=0, fcol=2, symbol=True)[source]
Load modules from tab separated file
- Parameters
file (string) – path/filename
ncol (int) – Column with name of module
fcol (int) – Column with first gene (symbol) in module. The remaining genes must follow, separated by tabs (\t)
symbol (bool) – Genes are given as gene symbols (False requires genome to be set in genescorer)
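The expected module file layout (module name in column ncol, genes from column fcol onwards, all tab-separated) can be illustrated with a small standalone parser. The function read_modules and the returned [name, genes] structure are hypothetical and serve only to show the layout.

```python
def read_modules(lines, ncol=0, fcol=2):
    """Parse tab-separated module rows into [name, [gene, ...]] entries."""
    modules = []
    for line in lines:
        f = line.rstrip('\n').split('\t')
        if len(f) <= fcol:
            continue  # no genes listed in this row
        genes = [g for g in f[fcol:] if g]  # drop empty trailing fields
        modules.append([f[ncol], genes])
    return modules


rows = ["PATHWAY_1\tsome description\tGENE1\tGENE2\tGENE3\n"]
print(read_modules(rows))
```

With the defaults ncol=0 and fcol=2, column 1 is free for a description or link.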
- set_genescorer(S)[source]
Set the genescorer to use for re-computing p-values of fused genes
- Parameters
S (genescorer) – The initialized genescorer to use
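A pathway scoring run using the chi2rank methods documented above might look like the following sketch. It uses only the signatures on this page. That load_modules returns the loaded modules in a form accepted by score is an assumption, as is the wrapper score_pathways itself; scorer is expected to be an initialized genescorer with reference panel, annotation and GWAS already loaded.

```python
def score_pathways(scorer, module_file):
    """Score the modules in module_file and print the significant ones."""
    # Imported here so that the sketch can be loaded without PascalX installed.
    from PascalX import pathway

    pscorer = pathway.chi2rank(scorer, mergedist=100000, fuse=True)
    # Assumption: the loaded modules are returned and can be passed to score().
    modules = pscorer.load_modules(module_file, ncol=0, fcol=2, symbol=True)
    result = pscorer.score(modules, method='saddle', mode='auto', parallel=1)
    pscorer.get_sigpathways(result, cutoff=1e-4)
    return result
```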
- class PascalX.pathway.chi2perm(genescorer, mergedist=100000, fuse=True)[source]
Pathway scoring via testing summed inverse chi2 transformed gene p-values against equally sized random samples of gene sets.
- Parameters
genescorer (genescorer) – The initialized genescorer to use to re-compute fused genes
mergedist (int) – Maximum distance between genes to merge
fuse (bool) – Fuse nearby genes to meta-genes
Note
Genes in the background gene sets are NOT fused.
- score(modules, samples=100000, method='saddle', mode='auto', reqacc=1e-100, parallel=1, nobar=False, autorescore=True)[source]
Scores a set of pathways/modules
- Parameters
modules (list) – List of modules to score
samples (int) – # of random gene sets to draw
method (string) – Method to use to evaluate tail probability (‘auto’, ‘davies’, ‘ruben’, ‘satterthwaite’, ‘pearson’, ‘saddle’)
mode (string) – Precision mode to use (‘’, ‘128b’, ‘100d’, ‘auto’)
reqacc (float) – requested accuracy
nobar (bool) – Do not show progress bar
autorescore (bool) – Automatically try to re-score failed genes
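The permutation approach can be sketched as follows: gene p-values are transformed to chi2 values, summed over the module, and compared with the sums of randomly drawn gene sets of the same size. This is a minimal illustration, assuming no gene fusion and a standard (hits + 1) / (samples + 1) estimate; the function names and the seed parameter are hypothetical and not part of PascalX.

```python
import random
from statistics import NormalDist


def chi2_1_from_p(p):
    """Inverse survival function of chi2 with 1 dof."""
    z = NormalDist().inv_cdf(p / 2.0)
    return z * z


def perm_pvalue(module, gene_p, samples=1000, seed=0):
    """Empirical p-value of the module's chi2 sum against random gene sets."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    stats = {g: chi2_1_from_p(p) for g, p in gene_p.items()}
    members = [g for g in module if g in stats]
    observed = sum(stats[g] for g in members)
    background = list(stats)
    hits = 0
    for _ in range(samples):
        draw = rng.sample(background, len(members))  # equally sized random set
        if sum(stats[g] for g in draw) >= observed:
            hits += 1
    return (hits + 1) / (samples + 1)


genes = {f"G{i}": 0.5 for i in range(18)}
genes.update({"TOP1": 1e-10, "TOP2": 1e-9})
print(perm_pvalue(["TOP1", "TOP2"], genes, samples=200))
```

The smallest attainable p-value is 1 / (samples + 1), which is why the documented default for samples is large.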
- get_sigpathways(RESULT, cutoff=0.0001)[source]
Prints significant pathways in the result set
- Parameters
RESULT (list) – Return of a pathwayscorer
cutoff (float) – Significance threshold to print pathways
- load_modules(file, ncol=0, fcol=2, symbol=True)[source]
Load modules from tab separated file
- Parameters
file (string) – path/filename
ncol (int) – Column with name of module
fcol (int) – Column with first gene (symbol) in module. The remaining genes must follow, separated by tabs (\t)
symbol (bool) – Genes are given as gene symbols (False requires genome to be set in genescorer)
- set_genescorer(S)[source]
Set the genescorer to use for re-computing p-values of fused genes
- Parameters
S (genescorer) – The initialized genescorer to use
Xscorer
- class PascalX.xscorer.zsum(window=50000, varcutoff=0.99, MAF=0.05, leftTail=False, gpu=False)[source]
This class implements the cross scorer based on SNP coherence over gene windows.
- get_topscores(N=10)[source]
Prints and returns the top gene scores
- Parameters
N (int) – Number of top scores to show
- Returns
Ordered list of top scores
- Return type
list
- jointlyRank(E_A, E_B)[source]
Jointly QQ normalizes the p-values of two GWAS
- Parameters
E_A (str) – Identifier of first GWAS
E_B (str) – Identifier of second GWAS
- jointlyRank_mapper(E_A, E_B, invert=False)[source]
Jointly QQ normalizes the p-values of GWAS and Mapper
- Parameters
E_A (str) – Identifier of first GWAS
E_B (str) – Identifier of MAP
- load_GWAS(file, rscol=0, pcol=1, bcol=2, a1col=None, a2col=None, idcol=None, name='GWAS', delimiter=None, NAid='n/a', header=False, threshold=1, mincutoff=0.0, rank=False, SNPonly=False, log10p=False)[source]
Load GWAS summary statistics p-values and betas
- Parameters
file (string) – File containing the GWAS summary statistics data, either a plain text file or gzip-compressed with ending .gz
rscol (int) – Column of SNP ids
pcol (int) – Column of p-values
bcol (int) – Column of betas
a1col (int) – Column of alternate allele (None for ignoring alleles)
a2col (int) – Column of reference allele (None for ignoring alleles)
idcol (int) – Column of identifiers, if several different GWAS in one file
name (string) – Identifier code for GWAS (needs to be unique)
delimiter (String) – Split character
header (bool) – Header present
NAid (String) – Code for not available (rows are ignored)
threshold (float) – Only load data with p-value < threshold
SNPonly (bool) – Load only SNPs (only if a1col and a2col are specified)
log10p (bool) – p-values are -log10 transformed
Note
The loaded GWAS data is shared between different xscorer instances!
- load_genome(file, ccol=1, cid=0, csymb=5, cstx=2, cetx=3, cs=4, cb=None, chrStart=0, splitchr='\t', NAgeneid='n/a', useNAgenes=False, header=False)[source]
Imports gene annotation from text file
- Parameters
file (text) – File to load
ccol (int) – Column containing chromosome number
cid (int) – Column containing gene id
csymb (int) – Column containing gene symbol
cstx (int) – Column containing transcription start
cetx (int) – Column containing transcription end
cs (int) – Column containing strand
chrStart (int) – Number of leading characters to skip in ccol
splitchr (string) – Character used to separate columns in text file
NAgeneid (string) – Identifier for not available gene id
useNAgenes (bool) – Import genes without gene id
header (bool) – First line is header
- Internal:
_GENEID (dict) - Contains the raw data with gene id (cid) as key
_GENESYMB (dict) - Mapping from gene symbols (csymb) to gene ids (cid)
_GENEIDtoSYMB (dict) - Mapping from gene ids (cid) to gene symbols (csymb)
_CHR (dict) - Mapping from chromosomes to list of gene symbols
_BAND (dict) - Mapping from band to list of gene symbols
_SKIPPED (dict) - Genes (cid) which could not be imported
Note
A unique gene id is automatically generated for n/a gene ids if useNAgenes=True.
Note
Identical gene ids occurring in more than one row are merged into a single gene if they lie on the same chromosome and the positional gap is < 1 Mb. The strand is taken from the longer segment.
- load_mapping(file, name='MAP', gcol=0, rcol=1, wcol=None, bcol=None, delimiter='\t', a1col=None, a2col=None, pfilter=1, header=False, joint=True)[source]
Loads a SNP to gene mapping
- Parameters
file (string) – File to load
name (string) – Identifier code for mapping (needs to be unique)
gcol (int) – Column with gene id
rcol (int) – Column with SNP id
wcol (int) – Column with weight
bcol (int) – Column with additional weight
delimiter (string) – Character used to separate columns
a1col (int) – Column of alternate allele (None for ignoring alleles)
a2col (int) – Column of reference allele (None for ignoring alleles)
pfilter (float) – Only include rows with wcol < pfilter
header (bool) – Header present
joint (bool) – Use mapping SNPs and gene window based SNPs
- load_refpanel(filename, parallel=1, keepfile=None, qualityT=100, SNPonly=False, chrlist=None)[source]
Sets the reference panel to use
- Parameters
filename (string) – /path/filename (without .chr#.db ending)
parallel (int) – Number of cores to use for parallel import of reference panel
keepfile – File with sample ids (one per line) to keep (only for .vcf)
qualityT – Quality threshold for variant to keep (only for .vcf)
SNPonly – Import only SNPs (only for .vcf)
chrlist (list) – List of chromosomes to import. (None to import 1-22)
Note
One file per chromosome with ending .chr#.db required (#: 1-22). If imported reference panel is not present, PascalX will automatically try to import from .chr#.tped.gz or .chr#.vcf.gz files.
- load_scores(file, gcol=0, pcol=1, header=False)[source]
Load computed gene scores
- Parameters
file (string) – Filename of data to load
gcol (int) – Column with gene symbol
pcol (int) – Column with p-value
header (bool) – File contains a header (True|False)
- matchAlleles(E_A, E_B, matchRefPanel=False)[source]
Matches alleles between two GWAS (SNPs with non matching alleles are removed)
- Parameters
E_A (str) – Identifier of first GWAS
E_B (str) – Identifier of second GWAS
matchRefPanel (bool) – Match also alleles to reference panel
Note
Currently, matchRefPanel=True requires sufficient memory to load all reference panel indices into memory.
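What matching alleles between two GWAS involves can be sketched on plain dictionaries. The sketch removes SNPs whose allele pairs disagree, as described above. Flipping the sign of the effect size when the two alleles are merely swapped is a common harmonization step and is an assumption here, not something this page states. The function match_alleles and its data layout are hypothetical.

```python
def match_alleles(gwas_a, gwas_b):
    """Harmonize B to A's allele orientation for SNPs present in both.

    Each GWAS is a dict: SNP id -> (a1, a2, beta).
    Returns (harmonized B entries, list of removed SNP ids).
    """
    harmonized, removed = {}, []
    for rs, (a1, a2, _) in gwas_a.items():
        if rs not in gwas_b:
            continue  # nothing to compare against
        b1, b2, beta_b = gwas_b[rs]
        if (b1, b2) == (a1, a2):
            harmonized[rs] = (a1, a2, beta_b)
        elif (b1, b2) == (a2, a1):
            harmonized[rs] = (a1, a2, -beta_b)  # assumption: swapped alleles flip beta
        else:
            removed.append(rs)  # non-matching alleles are removed
    return harmonized, removed


A = {'rs1': ('A', 'G', 0.1), 'rs2': ('C', 'T', 0.2), 'rs3': ('A', 'C', 0.3)}
B = {'rs1': ('A', 'G', 0.5), 'rs2': ('T', 'C', 0.4), 'rs3': ('G', 'T', 0.1)}
print(match_alleles(A, B))
```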
- matchAlleles_mapper(E_A, E_B, matchRefPanel=False)[source]
Matches alleles between GWAS and Mapper loaded data (SNPs with non matching alleles are removed)
- Parameters
E_A (str) – Identifier of GWAS
E_B (str) – Identifier of MAP
matchRefPanel (bool) – Match also with reference panel alleles
- plot_genesnps(G, E_A, E_B, rank=False, zscore=False, show_correlation=False, mark_window=False, MAF=None, tickspacing=10, pcolor='limegreen', ncolor='darkviolet', corrcmap=None)[source]
Plots the SNP p-values for a list of genes and the genotypic SNP-SNP correlation matrix
- Parameters
G (list) – List of gene symbols
show_correlation (bool) – Plot the corresponding SNP-SNP correlation matrix
mark_window (bool) – Mark the gene transcription start and end positions
MAF (float) – MAF filter (None for value set in class)
tickspacing (int) – Spacing of ticks
pcolor (color) – Color for positive SNP associations
ncolor (color) – Color for negative SNP associations
corrcmap (cmap) – Colormap to use for correlation plot (None for default)
- save_scores(file)[source]
Save computed gene scores
- Parameters
file (string) – Filename to store data
- score(gene, E_A=None, E_B=None, threshold=1, parallel=1, method=None, mode=None, nobar=False, reqacc=None, autorescore=False, pcorr=0)[source]
Performs cross scoring for a given list of gene symbols
- Parameters
gene (list) – gene symbols to score.
E_A (str) – First GWAS to use
E_B (str) – Second GWAS to use
parallel (int) – # of cores to use
unloadRef (bool) – If True, keep reference data for only one chromosome in memory per core
method (string) – Not used
mode (string) – Not used
reqacc (float) – Not used
intlimit (int) – Not used
threshold (bool) – Threshold p-value to reqacc
nobar (bool) – Do not show progress bar
autorescore (bool) – Automatically try to re-score failed genes
pcorr (float) – Sample overlap correction factor
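The notion of SNP coherence between two GWAS can be illustrated with signed z-scores: each SNP's p-value is turned into a z-score whose sign follows the effect size, and the products of the two GWAS' z-scores are summed over the SNPs in a gene window. This is a conceptual sketch under that assumption, ignoring SNP correlation and the null distribution entirely. The helper names are hypothetical and the exact statistic used by zsum may differ.

```python
import math
from statistics import NormalDist


def signed_z(p, beta):
    """z-score with magnitude from the two-sided p-value and sign from beta."""
    z = -NormalDist().inv_cdf(p / 2.0)  # non-negative magnitude
    return math.copysign(z, beta)


def coherence_sum(snps_a, snps_b):
    """Sum of z_A * z_B over SNPs present in both GWAS.

    Each argument is a dict: SNP id -> (p-value, beta).
    """
    shared = sorted(snps_a.keys() & snps_b.keys())
    return sum(signed_z(*snps_a[rs]) * signed_z(*snps_b[rs]) for rs in shared)


gwas_a = {'rs1': (0.05, 0.3), 'rs2': (0.01, -0.2)}
gwas_b = {'rs1': (0.05, 0.1), 'rs2': (0.20, -0.4)}
print(coherence_sum(gwas_a, gwas_b))  # positive: effects point the same way
```

A positive sum indicates effects in the same direction and a negative sum opposite directions.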
- score_all(E_A=None, E_B=None, threshold=1, parallel=1, pcorr=0, method=None, mode=None, nobar=False, reqacc=None)[source]
Performs cross scoring for all gene symbols
- Parameters
E_A (str) – First GWAS to use
E_B (str) – Second GWAS to use
parallel (int) – # of cores to use
pcorr (float) – Sample overlap correction factor
method (string) – Not used
mode (string) – Not used
reqacc (float) – Not used
nobar (bool) – Do not show progress bar
- score_chr(E_A, E_B, chrs=['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22'], threshold=1, parallel=1, pcorr=0, nobar=False)[source]
Performs cross scoring for gene symbols on given chromosomes
- Parameters
E_A (str) – First GWAS to use
E_B (str) – Second GWAS to use
chrs (list) – Chromosomes to score.
parallel (int) – # of cores to use
pcorr (float) – Sample overlap correction factor
nobar (bool) – Do not show progress bar
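Putting the zsum methods together, a cross scoring run could look like the sketch below. It relies only on the signatures documented above. The call order, the wrapper cross_score, the identifiers 'A' and 'B', and the assumed file layout (SNP id, p-value, beta, alleles in columns 0 to 4, with a header line) are all illustrative assumptions.

```python
def cross_score(refpanel_prefix, annotation_file, gwas_file_a, gwas_file_b):
    """Cross score all genes for two GWAS with the zsum scorer."""
    # Imported here so that the sketch can be loaded without PascalX installed.
    from PascalX import xscorer

    X = xscorer.zsum(window=50000, varcutoff=0.99, MAF=0.05, leftTail=False)
    X.load_refpanel(refpanel_prefix, parallel=1)
    X.load_genome(annotation_file)
    # Assumed column layout: rsid, p-value, beta, allele 1, allele 2.
    X.load_GWAS(gwas_file_a, name='A', rscol=0, pcol=1, bcol=2,
                a1col=3, a2col=4, header=True)
    X.load_GWAS(gwas_file_b, name='B', rscol=0, pcol=1, bcol=2,
                a1col=3, a2col=4, header=True)
    X.matchAlleles('A', 'B')   # remove SNPs with non-matching alleles
    X.jointlyRank('A', 'B')    # joint QQ normalization of the p-values
    result = X.score_all(E_A='A', E_B='B', parallel=1)
    X.get_topscores(N=10)
    return result
```

Since loaded GWAS data is shared between xscorer instances, the name identifiers need to be unique across the whole session.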
- class PascalX.xscorer.rsum(window=50000, varcutoff=0.99, MAF=0.05, leftTail=False, gpu=False)[source]
This class implements the ratio cross scorer based on SNP coherence/variance over gene windows.
- get_topscores(N=10)[source]
Prints and returns the top gene scores
- Parameters
N (int) – Number of top scores to show
- Returns
Ordered list of top scores
- Return type
list
- jointlyRank(E_A, E_B)[source]
Jointly QQ normalizes the p-values of two GWAS
- Parameters
E_A (str) – Identifier of first GWAS
E_B (str) – Identifier of second GWAS
- jointlyRank_mapper(E_A, E_B, invert=False)[source]
Jointly QQ normalizes the p-values of GWAS and Mapper
- Parameters
E_A (str) – Identifier of first GWAS
E_B (str) – Identifier of MAP
- load_GWAS(file, rscol=0, pcol=1, bcol=2, a1col=None, a2col=None, idcol=None, name='GWAS', delimiter=None, NAid='n/a', header=False, threshold=1, mincutoff=0.0, rank=False, SNPonly=False, log10p=False)[source]
Load GWAS summary statistics p-values and betas
- Parameters
file (string) – File containing the GWAS summary statistics data, either a plain text file or gzip-compressed with ending .gz
rscol (int) – Column of SNP ids
pcol (int) – Column of p-values
bcol (int) – Column of betas
a1col (int) – Column of alternate allele (None for ignoring alleles)
a2col (int) – Column of reference allele (None for ignoring alleles)
idcol (int) – Column of identifiers, if several different GWAS in one file
name (string) – Identifier code for GWAS (needs to be unique)
delimiter (String) – Split character
header (bool) – Header present
NAid (String) – Code for not available (rows are ignored)
threshold (float) – Only load data with p-value < threshold
SNPonly (bool) – Load only SNPs (only if a1col and a2col are specified)
log10p (bool) – p-values are -log10 transformed
Note
The loaded GWAS data is shared between different xscorer instances!
- load_genome(file, ccol=1, cid=0, csymb=5, cstx=2, cetx=3, cs=4, cb=None, chrStart=0, splitchr='\t', NAgeneid='n/a', useNAgenes=False, header=False)[source]
Imports gene annotation from text file
- Parameters
file (text) – File to load
ccol (int) – Column containing chromosome number
cid (int) – Column containing gene id
csymb (int) – Column containing gene symbol
cstx (int) – Column containing transcription start
cetx (int) – Column containing transcription end
cs (int) – Column containing strand
chrStart (int) – Number of leading characters to skip in ccol
splitchr (string) – Character used to separate columns in text file
NAgeneid (string) – Identifier for not available gene id
useNAgenes (bool) – Import genes without gene id
header (bool) – First line is header
- Internal:
_GENEID (dict) - Contains the raw data with gene id (cid) as key
_GENESYMB (dict) - Mapping from gene symbols (csymb) to gene ids (cid)
_GENEIDtoSYMB (dict) - Mapping from gene ids (cid) to gene symbols (csymb)
_CHR (dict) - Mapping from chromosomes to list of gene symbols
_BAND (dict) - Mapping from band to list of gene symbols
_SKIPPED (dict) - Genes (cid) which could not be imported
Note
A unique gene id is automatically generated for n/a gene ids if useNAgenes=True.
Note
Identical gene ids occurring in more than one row are merged into a single gene if they lie on the same chromosome and the positional gap is < 1 Mb. The strand is taken from the longer segment.
- load_mapping(file, name='MAP', gcol=0, rcol=1, wcol=None, bcol=None, delimiter='\t', a1col=None, a2col=None, pfilter=1, header=False, joint=True)[source]
Loads a SNP to gene mapping
- Parameters
file (string) – File to load
name (string) – Identifier code for mapping (needs to be unique)
gcol (int) – Column with gene id
rcol (int) – Column with SNP id
wcol (int) – Column with weight
bcol (int) – Column with additional weight
delimiter (string) – Character used to separate columns
a1col (int) – Column of alternate allele (None for ignoring alleles)
a2col (int) – Column of reference allele (None for ignoring alleles)
pfilter (float) – Only include rows with wcol < pfilter
header (bool) – Header present
joint (bool) – Use mapping SNPs and gene window based SNPs
- load_refpanel(filename, parallel=1, keepfile=None, qualityT=100, SNPonly=False, chrlist=None)[source]
Sets the reference panel to use
- Parameters
filename (string) – /path/filename (without .chr#.db ending)
parallel (int) – Number of cores to use for parallel import of reference panel
keepfile – File with sample ids (one per line) to keep (only for .vcf)
qualityT – Quality threshold for variant to keep (only for .vcf)
SNPonly – Import only SNPs (only for .vcf)
chrlist (list) – List of chromosomes to import. (None to import 1-22)
Note
One file per chromosome with ending .chr#.db required (#: 1-22). If imported reference panel is not present, PascalX will automatically try to import from .chr#.tped.gz or .chr#.vcf.gz files.
- load_scores(file, gcol=0, pcol=1, header=False)[source]
Load computed gene scores
- Parameters
file (string) – Filename of data to load
gcol (int) – Column with gene symbol
pcol (int) – Column with p-value
header (bool) – File contains a header (True|False)
- matchAlleles(E_A, E_B, matchRefPanel=False)[source]
Matches alleles between two GWAS (SNPs with non matching alleles are removed)
- Parameters
E_A (str) – Identifier of first GWAS
E_B (str) – Identifier of second GWAS
matchRefPanel (bool) – Match also alleles to reference panel
Note
Currently, matchRefPanel=True requires sufficient memory to load all reference panel indices into memory.
- matchAlleles_mapper(E_A, E_B, matchRefPanel=False)[source]
Matches alleles between GWAS and Mapper loaded data (SNPs with non matching alleles are removed)
- Parameters
E_A (str) – Identifier of GWAS
E_B (str) – Identifier of MAP
matchRefPanel (bool) – Match also with reference panel alleles
- plot_genesnps(G, E_A, E_B, rank=False, zscore=False, show_correlation=False, mark_window=False, MAF=None, tickspacing=10, pcolor='limegreen', ncolor='darkviolet', corrcmap=None)[source]
Plots the SNP p-values for a list of genes and the genotypic SNP-SNP correlation matrix
- Parameters
G (list) – List of gene symbols
show_correlation (bool) – Plot the corresponding SNP-SNP correlation matrix
mark_window (bool) – Mark the gene transcription start and end positions
MAF (float) – MAF filter (None for value set in class)
tickspacing (int) – Spacing of ticks
pcolor (color) – Color for positive SNP associations
ncolor (color) – Color for negative SNP associations
corrcmap (cmap) – Colormap to use for correlation plot (None for default)
- save_scores(file)[source]
Save computed gene scores
- Parameters
file (string) – Filename to store data
- score(gene, E_A=None, E_B=None, threshold=1, parallel=1, method=None, mode=None, nobar=False, reqacc=None, autorescore=False, pcorr=0)[source]
Performs cross scoring for a given list of gene symbols
- Parameters
gene (list) – gene symbols to score.
E_A (str) – First GWAS to use
E_B (str) – Second GWAS to use
parallel (int) – # of cores to use
unloadRef (bool) – Keep only reference data for one chromosome in memory (True, False) per core
method (string) – Not used
mode (string) – Not used
reqacc (float) – Not used
intlimit (int) – Not used
threshold (bool) – Threshold p-value to reqacc
nobar (bool) – Do not show progress bar
autorescore (bool) – Automatically try to re-score failed genes
pcorr (float) – Sample overlap correction factor
- score_all(E_A=None, E_B=None, threshold=1, parallel=1, pcorr=0, method=None, mode=None, nobar=False, reqacc=None)[source]
Performs cross scoring for all gene symbols
- Parameters
E_A (str) – First GWAS to use
E_B (str) – Second GWAS to use
parallel (int) – # of cores to use
pcorr (float) – Sample overlap correction factor
method (string) – Not used
mode (string) – Not used
reqacc (float) – Not used
nobar (bool) – Do not show progress bar
- score_chr(E_A, E_B, chrs=['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22'], threshold=1, parallel=1, pcorr=0, nobar=False)[source]
Performs cross scoring for gene symbols on given chromosomes
- Parameters
E_A (str) – First GWAS to use
E_B (str) – Second GWAS to use
chrs (list) – Chromosomes to score.
parallel (int) – # of cores to use
pcorr (float) – Sample overlap correction factor
nobar (bool) – Do not show progress bar
Genexprscorer
- class PascalX.genexpr.genexpr[source]
- get_GTEX_expr(filename)[source]
Downloads and imports GTEx v8 data.
- Parameters
filename (string) – Filename to store downloaded GTEX data
Note
The import may take several hours.
- load_expr(filename)[source]
Loads the GTEx data for usage
- Parameters
filename (string) – File to load
- load_genome(file, ccol=1, cid=0, csymb=5, cstx=2, cetx=3, cs=4, chrStart=0, splitchr='\t', NAgeneid='n/a', useNAgenes=False, header=False)[source]
Imports gene annotation from text file
- Parameters
file (text) – File to load
ccol (int) – Column containing chromosome number
cid (int) – Column containing gene id
csymb (int) – Column containing gene symbol
cstx (int) – Column containing transcription start
cetx (int) – Column containing transcription end
cs (int) – Column containing strand
chrStart (int) – Number of leading characters to skip in ccol
splitchr (string) – Character used to separate columns in text file
NAgeneid (string) – Identifier for not available gene id
useNAgenes (bool) – Import genes without gene id
header (bool) – First line is header
- Internal:
_GENEID (dict) - Contains the raw data with gene id (cid) as key
_GENESYMB (dict) - Mapping from gene symbols (csymb) to gene ids (cid)
_GENEIDtoSYMB (dict) - Mapping from gene ids (cid) to gene symbols (csymb)
_CHR (dict) - Mapping from chromosomes to list of gene symbols
_SKIPPED (dict) - Genes (cid) which could not be imported
Note
A unique gene id is automatically generated for n/a gene ids if useNAgenes=True.
Note
Identical gene ids occurring in more than one row are merged into a single gene if they are on the same chromosome and the positional gap is < 1 Mb. The strand is taken from the longer segment.
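The merge rule in the last note can be sketched in a few lines. This is an illustration only, not the PascalX code: the function `merge_segments`, the dictionary layout, and the definition of the gap as the distance between the end of the first segment and the start of the second are assumptions.

```python
MAX_GAP = 1_000_000  # 1 Mb threshold from the note


def merge_segments(a, b, max_gap=MAX_GAP):
    """Merge two rows sharing a gene id, or return None if the rule does not apply.

    Illustrative helper; each segment is a dict with chrom, start, end, strand.
    """
    if a["chrom"] != b["chrom"]:
        return None  # different chromosomes are never merged
    first, second = sorted((a, b), key=lambda s: s["start"])
    gap = max(0, second["start"] - first["end"])  # assumed gap definition
    if gap >= max_gap:
        return None
    longer = max((a, b), key=lambda s: s["end"] - s["start"])
    return {
        "chrom": a["chrom"],
        "start": min(a["start"], b["start"]),
        "end": max(a["end"], b["end"]),
        "strand": longer["strand"],  # strand of the longer segment
    }


# Made-up coordinates.
seg1 = {"chrom": "1", "start": 1000, "end": 5000, "strand": "+"}
seg2 = {"chrom": "1", "start": 6000, "end": 7000, "strand": "-"}
seg3 = {"chrom": "1", "start": 2_000_000, "end": 2_001_000, "strand": "-"}
merged = merge_segments(seg1, seg2)
print(merged)
```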
SNP database
- class PascalX.snpdb.db[source]
Class for handling storage of the raw genotype data. The data is indexed and stored for each chromosome individually as a zlib-compressed pickle. The indexing allows fast random access via SNP ids or positions.
- open(filename)[source]
Opens the storage file. A new file is created if it does not exist.
- Parameters
filename (string) – Name to use for the storage file
- insert(data)[source]
Stores set of rows into the file storage
- Parameters
data (dict) – First storage key is the key in the dictionary. Second storage key is the first element of the inner list.
Warning
After all insert calls are done, the close function has to be called once to make the index persistent.
- get(pos)[source]
Returns all stored data for a set of SNPs indexed via positions
- Parameters
pos (list) – Positions of SNPs to retrieve
- getSNPatPos(pos)[source]
Returns SNP id at position
- Parameters
pos (list) – Positions of SNPs to retrieve
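The storage idea described for this class (zlib-compressed pickles plus an index for random access) can be illustrated with a toy in-memory version. `TinyStore` is hypothetical and much simpler than `PascalX.snpdb.db`; the byte layout, the method names, and the two index dictionaries are assumptions made for the sketch.

```python
import io
import pickle
import zlib


class TinyStore:
    """Toy store: zlib-compressed pickled rows plus two lookup indices.

    Illustrative only; not the PascalX storage format.
    """

    def __init__(self):
        self._buf = io.BytesIO()  # stands in for the storage file
        self._by_first = {}       # dictionary key -> (offset, length)
        self._by_second = {}      # first list element -> (offset, length)

    def insert(self, data):
        # data: {first_key: [second_key, ...payload]}
        for key, row in data.items():
            blob = zlib.compress(pickle.dumps(row))
            offset = self._buf.seek(0, io.SEEK_END)  # append at the end
            self._buf.write(blob)
            self._by_first[key] = (offset, len(blob))
            self._by_second[row[0]] = (offset, len(blob))

    def _read(self, entry):
        offset, length = entry
        self._buf.seek(offset)
        return pickle.loads(zlib.decompress(self._buf.read(length)))

    def get_by_first(self, keys):
        return [self._read(self._by_first[k]) for k in keys]

    def get_by_second(self, keys):
        return [self._read(self._by_second[k]) for k in keys]


# Made-up rows; which key is the SNP id and which the position is an assumption.
store = TinyStore()
store.insert({"rs1": [1000, "A", "G"], "rs2": [2000, "C", "T"]})
print(store.get_by_second([1000]))
```

Only the rows that are requested get decompressed, which is what makes lookups by either key cheap.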
Reference panel
- class PascalX.refpanel.refpanel[source]
- load_pos_reference(cr, keep_idx=None)[source]
Returns a snpdb object for a chromosome and a sorted list of SNP positions on the chromosome
- Parameters
cr (int) – Chromosome number
- load_snp_reference(cr, keep_idx=None)[source]
Returns a snpdb object for a chromosome
- Parameters
cr (int) – Chromosome number
- set_refpanel(filename, parallel=1, keepfile=None, qualityT=100, SNPonly=False, chrlist=None, sourcefilename=None, regEx=None, nobar=True)[source]
Sets the reference panel to use
- Parameters
filename (string) – /path/filename (without .chr#.db ending)
parallel (int) – Number of cores to use for parallel import of reference panel
keepfile (string) – [only for .vcf] File with sample ids (one per line) to keep. None to keep all.
qualityT (int) – [only for .vcf] Quality threshold for variant to keep (None to ignore)
SNPonly (bool) – [only for .vcf] Load only SNPs
chrlist (list) – List of chromosomes to import. (None to import 1-22)
sourcefilename (string) – /path/filename (without .chr#. ending) of .tped | .vcf files. None to use same as filename
regEx (string) – Regular expression to filter sample ids. First capture group is kept. [only for .vcf]
nobar (bool) – Do not show progress bar (the bar updates only when a chromosome has finished)
Note
One file per chromosome with ending .chr#.db required (#: 1-22). If imported reference panel is not present, PascalX will automatically try to import from .chr#.tped.gz or .chr#.vcf.gz files.
Note
Alleles (under .vcf import) are stored internally in the order [ALT,REF].
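The `regEx` parameter keeps the first capture group of each sample id. The sketch below shows what that means with Python's `re` module. The helper `filter_sample_ids`, the use of `re.search`, the dropping of non-matching ids, and the sample ids themselves are assumptions, not PascalX behaviour.

```python
import re


def filter_sample_ids(sample_ids, pattern):
    """Return the first capture group of every id that matches (illustrative)."""
    rx = re.compile(pattern)
    kept = []
    for sid in sample_ids:
        m = rx.search(sid)  # assumption: search rather than full match
        if m:
            kept.append(m.group(1))  # first capture group is kept
    return kept


# Made-up ids in a duplicated "ID_ID" style.
ids = ["HG00096_HG00096", "NA12878_NA12878", "badid"]
cleaned = filter_sample_ids(ids, r"^([A-Z]{2}\d+)_")
print(cleaned)
```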
Mapper
- class PascalX.mapper.mapper(genome=None)[source]
- load_mapping(file, gcol=0, rcol=1, wcol=None, a1col=None, a2col=None, bcol=None, pcol=None, delimiter='\t', pfilter=1, header=False, symbol=False)[source]
Loads a SNP to gene mapping
- Parameters
file (string) – File to load
gcol (int) – Column with gene id
rcol (int) – Column with SNP id
wcol (int) – Column with weight (None for none)
a1col (int) – Column of alternate allele (None for ignoring alleles)
a2col (int) – Column of reference allele (None for ignoring alleles)
bcol (int) – Column with additional weight (None for none)
pcol (int) – Column with p-value (if None, the p-value is taken from .load_GWAS data)
delimiter (string) – Character used to separate columns
header (bool) – Header present
pfilter (float) – Only include rows with pcol < pfilter
symbol (bool) – Gene ids are gene symbols (requires genome to be set on init)
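To show how the column parameters and `pfilter` fit together, here is a minimal plain-Python sketch that reads a SNP to gene mapping from tab-separated text. It is not the PascalX implementation: `read_mapping`, the returned dictionary layout, and the sample rows are assumptions for illustration.

```python
import csv
import io


def read_mapping(handle, gcol=0, rcol=1, pcol=None, delimiter="\t",
                 pfilter=1, header=False):
    """Collect SNP ids per gene id, keeping rows with p-value < pfilter (illustrative)."""
    reader = csv.reader(handle, delimiter=delimiter)
    if header:
        next(reader, None)  # skip the header line
    mapping = {}
    for row in reader:
        if not row:
            continue  # ignore empty lines
        if pcol is not None and not float(row[pcol]) < pfilter:
            continue  # only include rows with pcol < pfilter
        mapping.setdefault(row[gcol], []).append(row[rcol])
    return mapping


# Made-up mapping rows: gene id, SNP id, p-value.
text = io.StringIO("ENSG01\trs1\t0.001\nENSG01\trs2\t0.2\nENSG02\trs3\t0.04\n")
mapping = read_mapping(text, gcol=0, rcol=1, pcol=2, pfilter=0.05)
print(mapping)
```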