API documentation

The PascalX package comes with the following user modules:

Genome (PascalX.genome)
Genescorer (PascalX.genescorer)
Pathwayscorer (PascalX.pathway)
Xscorer (PascalX.xscorer)
Genexprscorer (PascalX.genexpr)

Internal modules:

SNPdb (PascalX.snpdb)
RefPanel (PascalX.refpanel)
Mapper (PascalX.mapper)

Genome

class PascalX.genome.genome[source]

This class handles the genome annotation. It provides functionality for import of data from text files and automatic download of annotation data from ensembl.org.

gene_info(gene)[source]

Prints the loaded information for given gene

Parameters: gene (string) – Symbol of gene to query

get_ensembl_annotation(filename, genetype='protein_coding', version='GRCh38')[source]

Gene annotation download function for ensembl.org BioMart data

Parameters

filename (string) – File to store downloaded annotation in
genetype (string) – Comma separated list of BioMart genetypes to download (protein_coding, pseudogene, …)
version (string) – GRCh37 | GRCh38

Example:

from PascalX.genome import genome
G = genome()
G.get_ensembl_annotation('ensemble_hg38.txt')

load_genome(file, ccol=1, cid=0, csymb=5, cstx=2, cetx=3, cs=4, cb=None, chrStart=0, splitchr='\t', NAgeneid='n/a', useNAgenes=False, header=False)[source]

Imports gene annotation from text file

Parameters

file (text) – File to load
ccol (int) – Column containing chromosome number
cid (int) – Column containing gene id
csymb (int) – Column containing gene symbol
cstx (int) – Column containing transcription start
cetx (int) – Column containing transcription end
cs (int) – Column containing strand
cb (int) – Column containing band (None if not supplied)
chrStart (int) – Number of leading characters to skip in ccol
splitchr (string) – Character used to separate columns in text file
NAgeneid (string) – Identifier for not available gene id
useNAgenes (bool) – Import genes without gene id
header (bool) – First line is header

Internal:

_GENEID (dict) - Contains the raw data with gene id (cid) as key
_GENESYMB (dict) - Mapping from gene symbols (csymb) to gene ids (cid)
_GENEIDtoSYMB (dict) - Mapping from gene ids (cid) to gene symbols (csymb)
_CHR (dict) - Mapping from chromosomes to list of gene symbols
_BAND (dict) - Mapping from band to list of gene symbols
_SKIPPED (dict) - Genes (cid) which could not be imported
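
The column parameters above can be illustrated with a small stand-alone parser. This is a minimal sketch, not PascalX's implementation: the function name parse_annotation, the tuple layout of the stored rows, and the "NA_<n>" id scheme for genes without an id are assumptions made for illustration only.

```python
def parse_annotation(lines, ccol=1, cid=0, csymb=5, cstx=2, cetx=3, cs=4,
                     chrStart=0, splitchr="\t", NAgeneid="n/a",
                     useNAgenes=False, header=False):
    """Illustrative column-indexed parser mirroring the load_genome parameters."""
    geneid, genesymb, by_chr, skipped = {}, {}, {}, {}
    rows = list(lines)[1:] if header else list(lines)
    na_counter = 0
    for row in rows:
        cols = row.rstrip("\n").split(splitchr)
        gid, symbol = cols[cid], cols[csymb]
        if gid == NAgeneid:
            if not useNAgenes:
                skipped[symbol] = cols      # gene could not be imported
                continue
            na_counter += 1
            gid = f"NA_{na_counter}"        # hypothetical unique id scheme
        chrom = cols[ccol][chrStart:]       # drop leading characters, e.g. "chr"
        geneid[gid] = (chrom, int(cols[cstx]), int(cols[cetx]), cols[cs], symbol)
        genesymb[symbol] = gid
        by_chr.setdefault(chrom, []).append(symbol)
    return geneid, genesymb, by_chr, skipped


rows = [
    "ENSG01\tchr1\t1000\t5000\t+\tGENEA",
    "n/a\tchr2\t200\t900\t-\tGENEB",
]
geneid, genesymb, by_chr, skipped = parse_annotation(rows, chrStart=3)
print(geneid, genesymb, by_chr, sorted(skipped))
```

With chrStart=3 the "chr" prefix is skipped, so the chromosome is stored as "1". The second row is skipped because its gene id equals NAgeneid and useNAgenes is False.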

Note

A unique gene id is automatically generated for n/a gene ids if useNAgenes=True.

Note

Identical gene ids occurring in more than one row are merged into a single gene if they are on the same chromosome and the positional gap is < 1Mb. The strand is taken from the longer segment.
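
The merge rule in the note can be sketched as follows. This is only one reading of the rule, under stated assumptions: the helper merge_segments is hypothetical, "longer segment" is taken to mean the longest original row, and rows that cannot be merged are simply kept apart (how PascalX treats them is not stated here).

```python
MAX_GAP = 1_000_000  # 1 Mb, as stated in the note


def merge_segments(segments):
    """Merge rows that share one gene id.

    segments: list of (chrom, start, end, strand) tuples.
    Rows on the same chromosome whose gap is below MAX_GAP are merged;
    the strand of the longest original row is kept.
    """
    merged = []
    longest = []  # length of the longest original row inside each merged entry
    for chrom, start, end, strand in sorted(segments):
        if merged:
            m_chrom, m_start, m_end, m_strand = merged[-1]
            if chrom == m_chrom and start - m_end < MAX_GAP:
                if end - start > longest[-1]:
                    m_strand, longest[-1] = strand, end - start
                merged[-1] = (m_chrom, m_start, max(m_end, end), m_strand)
                continue
        merged.append((chrom, start, end, strand))
        longest.append(end - start)
    return merged


print(merge_segments([("1", 100, 200, "+"), ("1", 300, 1000, "-")]))
```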

Genescorer

class PascalX.genescorer.genescorer[source]

Genescorer base class

load_refpanel(filename, parallel=1, keepfile=None, qualityT=100, SNPonly=False, chrlist=None)[source]

Sets the reference panel to use

Parameters

filename (string) – /path/filename (without .chr#.db ending)
parallel (int) – Number of cores to use for parallel import of reference panel
keepfile – File with sample ids (one per line) to keep (only for .vcf)
qualityT – Quality threshold for variant to keep (only for .vcf)
SNPonly – Import only SNPs (only for .vcf)
chrlist (list) – List of chromosomes to import. (None to import 1-22)

Note

One file per chromosome with ending .chr#.db required (#: 1-22). If imported reference panel is not present, PascalX will automatically try to import from .chr#.tped.gz or .chr#.vcf.gz files.
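
The file naming convention in the note can be made concrete with a short helper. This is a sketch only: both function names are hypothetical, and the order in which the two import sources are listed is an assumption, not documented behaviour.

```python
def refpanel_files(prefix, chrlist=None):
    """Return the expected .chr#.db file name per chromosome.

    chrlist=None mirrors the documented default of chromosomes 1-22.
    """
    chromosomes = chrlist if chrlist is not None else range(1, 23)
    return {str(c): f"{prefix}.chr{c}.db" for c in chromosomes}


def import_candidates(prefix, chrom):
    """Raw files from which a missing .db file could be imported."""
    return [f"{prefix}.chr{chrom}.tped.gz", f"{prefix}.chr{chrom}.vcf.gz"]


print(refpanel_files("/data/EUR", chrlist=[21, 22]))
print(import_candidates("/data/EUR", 21))
```

The prefix "/data/EUR" is a placeholder. It corresponds to the filename argument, which is given without the .chr#.db ending.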

load_genome(file, ccol=1, cid=0, csymb=5, cstx=2, cetx=3, cs=4, cb=None, chrStart=0, splitchr='\t', NAgeneid='n/a', useNAgenes=False, header=False)[source]

Imports gene annotation from text file

Parameters

file (text) – File to load
ccol (int) – Column containing chromosome number
cid (int) – Column containing gene id
csymb (int) – Column containing gene symbol
cstx (int) – Column containing transcription start
cetx (int) – Column containing transcription end
cs (int) – Column containing strand
cb (int) – Column containing band (None if not supplied)
chrStart (int) – Number of leading characters to skip in ccol
splitchr (string) – Character used to separate columns in text file
NAgeneid (string) – Identifier for not available gene id
useNAgenes (bool) – Import genes without gene id
header (bool) – First line is header

Internal:

_GENEID (dict) - Contains the raw data with gene id (cid) as key
_GENESYMB (dict) - Mapping from gene symbols (csymb) to gene ids (cid)
_GENEIDtoSYMB (dict) - Mapping from gene ids (cid) to gene symbols (csymb)
_CHR (dict) - Mapping from chromosomes to list of gene symbols
_BAND (dict) - Mapping from band to list of gene symbols
_SKIPPED (dict) - Genes (cid) which could not be imported

Note

A unique gene id is automatically generated for n/a gene ids if useNAgenes=True.

Note

Identical gene ids occurring in more than one row are merged into a single gene if they are on the same chromosome and the positional gap is < 1Mb. The strand is taken from the longer segment.

load_mapping(file, gcol=0, rcol=1, wcol=None, delimiter='\t', a1col=None, a2col=None, bcol=None, pcol=None, pfilter=1, header=False, joint=True, symbol=False)[source]

Loads a SNP to gene mapping

Parameters

file (string) – File to load
gcol (int) – Column with gene id
rcol (int) – Column with SNP id
wcol (int) – Column with weight
a1col (int) – Column of alternate allele (None for ignoring alleles)
a2col (int) – Column of reference allele (None for ignoring alleles)
bcol (int) – Column with additional weight
delimiter (string) – Character used to separate columns
pfilter (float) – Only include rows with wcol < pfilter
header (bool) – Header present
joint (bool) – Use mapping SNPs and gene window based SNPs
symbol (bool) – True: Gene id is a gene symbol; False: Gene id is an ensembl gene id

Note

A loaded mapping takes precedence over a loaded positional gene annotation
The mapping data is stored statically (same for all class initializations)
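
How gcol, rcol, wcol and pfilter interact can be sketched with a stand-alone reader. The function read_mapping and its return structure (gene id mapped to a list of (SNP id, weight) pairs) are assumptions for illustration and do not reflect PascalX's internal storage.

```python
def read_mapping(lines, gcol=0, rcol=1, wcol=None, delimiter="\t",
                 pfilter=1, header=False):
    """Illustrative SNP-to-gene mapping reader.

    Rows are kept only if their weight column is below pfilter,
    following the pfilter description above.
    """
    mapping = {}
    rows = list(lines)[1:] if header else list(lines)
    for row in rows:
        cols = row.rstrip("\n").split(delimiter)
        weight = float(cols[wcol]) if wcol is not None else None
        if weight is not None and not weight < pfilter:
            continue  # filtered out: wcol >= pfilter
        mapping.setdefault(cols[gcol], []).append((cols[rcol], weight))
    return mapping


rows = ["G1\trs1\t0.01", "G1\trs2\t0.9", "G2\trs3\t0.2"]
print(read_mapping(rows, wcol=2, pfilter=0.5))
```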

load_GWAS(file, rscol=0, pcol=1, bcol=None, a1col=None, a2col=None, delimiter=None, header=False, NAid='NA', log10p=False, cutoff=1e-300)[source]

Load GWAS summary statistics p-values

Parameters

file (string) – File containing the GWAS summary statistics data. Either as textfile or gzip compressed with ending .gz
rscol (int) – Column of SNP ids
pcol (int) – Column of p-values
bcol (int) – Column of betas (optional)
a1col (int) – Column of alternate allele (None for ignoring)
a2col (int) – Column of reference allele (None for ignoring)
delimiter (String) – Split character
header (bool) – Header present
NAid (String) – Code for not available (rows are ignored)
log10p (bool) – p-values are given -log10 transformed
cutoff (float) – Cutoff value for p-values (None for no cutoff)
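
The effect of NAid, log10p and cutoff can be illustrated with a small reader for in-memory lines. This is a sketch under assumptions: read_gwas is a hypothetical helper, and treating cutoff as a lower bound to which smaller p-values are raised is a guess at the intended meaning.

```python
def read_gwas(lines, rscol=0, pcol=1, delimiter=None, header=False,
              NAid="NA", log10p=False, cutoff=1e-300):
    """Illustrative summary-statistics reader returning {snp id: p-value}."""
    pvalues = {}
    rows = list(lines)[1:] if header else list(lines)
    for row in rows:
        # delimiter=None splits on any run of whitespace
        cols = row.rstrip("\n").split(delimiter)
        if NAid in (cols[rscol], cols[pcol]):
            continue                      # rows marked not available are ignored
        p = float(cols[pcol])
        if log10p:
            p = 10.0 ** (-p)              # undo the -log10 transform
        if cutoff is not None:
            p = max(p, cutoff)            # assumption: clip very small p-values
        pvalues[cols[rscol]] = p
    return pvalues


lines = ["SNP P", "rs1 0.05", "rs2 NA", "rs3 0"]
print(read_gwas(lines, header=True))
```

Clipping avoids p-values of exactly zero, which cannot be transformed to a finite test statistic.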

matchAlleles(SNPonly=False)[source]

Matches alleles between loaded GWAS and reference panel (SNPs with non matching alleles are removed)

Parameters: SNPonly (bool) – Keep only SNPs

rank()[source]

QQ normalizes the p-values of the loaded GWAS

rank_mapper()[source]

QQ normalizes the p-values of the loaded Mapper
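
A common way to QQ normalize p-values is to replace each p-value by its rank divided by n + 1, which makes the values uniformly spaced on (0, 1). The sketch below shows this variant; whether PascalX uses exactly this formula is an assumption, and qq_normalize is a hypothetical helper.

```python
def qq_normalize(pvalues):
    """Replace each p-value by rank / (n + 1); smallest p-value gets rank 1.

    pvalues: dict mapping SNP id to p-value.
    Ties keep their insertion order because sorted() is stable.
    """
    ordered = sorted(pvalues, key=pvalues.get)
    n = len(ordered)
    return {snp: (i + 1) / (n + 1) for i, snp in enumerate(ordered)}


print(qq_normalize({"a": 0.5, "b": 1e-8, "c": 0.03}))
```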

save_GWAS(file)[source]

Save GWAS p-values

Parameters: file (string) – Filename to store data

save_scores(file)[source]

Save computed gene scores

Parameters: file (string) – Filename to store data

load_scores(file, gcol=0, pcol=1, header=False)[source]

Load computed gene scores

Parameters

file (string) – Filename of data to load
gcol (int) – Column with gene symbol
pcol (int) – Column with p-value
header (bool) – File contains a header (True|False)

get_topscores(N=10)[source]

Prints and returns the top gene scores

Parameters: N (int) – # to show
Returns: Ordered list of top scores
Return type: list

get_geneinfo(gene)[source]

Shows details of the loaded annotation for a gene

Parameters: gene (String) – Gene symbol

plot_Manhattan(ScoringResult=None, region=None, sigLine=0, logsigThreshold=0, labelSig=True, labelList=[], style='colorful')[source]

Produces a Manhattan plot

Parameters

ScoringResult (list) – List of gene, p-value pairs [['Gene', p-value], …] (if None, generated from internal _SCORES)
region (str) – Band region to plot
sigLine (float) – Draws a horizontal line at p-value (if > 0)
logsigThreshold (float) – Significance threshold above which to label genes (log(p) values)
labelSig (bool) – Label genes above significance threshold
labelList (list) – List of gene names to label
style (string) – Design of the plot ('classic' | 'colorful')

clean()[source]

Removes scores obtained from previous runs

score_chr(chrs, unloadRef=False, method='saddle', mode='auto', reqacc=1e-100, intlimit=100000, parallel=1, nobar=False, autorescore=False, keep_idx=None)[source]

Perform gene scoring for full chromosomes

Parameters

chrs (list) – List of chromosomes to score.
unloadRef (bool) – Keep reference data for only one chromosome per core in memory (True | False)
method (string) – Method to use to evaluate tail probability ('auto', 'davies', 'ruben', 'satterthwaite', 'pearson', 'saddle')
mode (string) – Precision mode to use ('', '128b', '100d', 'auto')
reqacc (float) – Requested accuracy
intlimit (int) – Max # integration terms to use
parallel (int) – # of cores to use
nobar (bool) – Do not show progress bar
autorescore (bool) – Automatically try to re-score failed genes via Pearson’s algorithm

score_all(parallel=1, method='saddle', mode='auto', reqacc=1e-100, intlimit=100000, nobar=False, autorescore=False, keep_idx=None)[source]

Perform full gene scoring

Parameters

parallel (int) – # of cores to use
method (string) – Method to use to evaluate tail probability ('auto', 'davies', 'ruben', 'satterthwaite', 'pearson', 'saddle')
mode (string) – Precision mode to use ('', '128b', '100d', 'auto')
reqacc (float) – Requested accuracy
intlimit (int) – Max # integration terms to use
nobar (bool) – Do not show progress bar
autorescore (bool) – Automatically try to re-score failed genes via Pearson’s algorithm

class PascalX.genescorer.chi2sum(window=50000, varcutoff=0.99, MAF=0.05, genome=None, gpu=False)[source]

Implementation of chi2 sum based genescorer
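
The statistic behind a chi2 sum scorer can be sketched in a few lines: each SNP p-value is converted back to a chi2 value with one degree of freedom, and the values inside the gene window are summed. This is an illustration of the general idea only, using hypothetical helper names. It does not evaluate the tail probability of the sum, which depends on the SNP-SNP correlation.

```python
from statistics import NormalDist


def chi2_from_p(p):
    """Inverse chi2 (1 degree of freedom) survival function for 0 < p <= 1.

    Uses the identity chi2_1 = z**2 with z the standard normal quantile at p/2.
    """
    z = NormalDist().inv_cdf(p / 2.0)
    return z * z


def chi2_sum(pvalues):
    """Sum of the chi2-transformed SNP p-values of one gene window."""
    return sum(chi2_from_p(p) for p in pvalues)


print(round(chi2_from_p(0.05), 4))
print(chi2_sum([0.05, 0.5, 1.0]))
```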

score(gene, parallel=1, unloadRef=False, method='saddle', mode='auto', reqacc=1e-100, intlimit=1000000, nobar=False, autorescore=False, keep_idx=None)[source]

Performs gene scoring for a given list of gene symbols

Parameters

gene (list) – gene symbols to score.
parallel (int) – # of cores to use
unloadRef (bool) – Keep reference data for only one chromosome per core in memory (True | False)
method (string) – Method to use to evaluate tail probability ('auto', 'davies', 'ruben', 'satterthwaite', 'pearson', 'saddle')
mode (string) – Precision mode to use ('', '128b', '100d', 'auto')
reqacc (float) – Requested accuracy
intlimit (int) – Max # integration terms to use
nobar (bool) – Do not show progress bar
autorescore (bool) – Automatically try to re-score failed genes via Pearson’s algorithm
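
A typical gene scoring run chains the methods documented in this section. The wrapper below is a sketch using only the signatures listed here. The function name score_genes and all file arguments are placeholders, the header=True setting for the GWAS file is an assumption about the input, and the structure of the returned result is not described here, so it is passed through unchanged.

```python
def score_genes(refpanel_prefix, annotation_file, gwas_file, genes):
    """Score a list of gene symbols with the chi2sum genescorer."""
    # Imported lazily so that defining this helper does not require PascalX.
    from PascalX import genescorer

    scorer = genescorer.chi2sum(window=50000, varcutoff=0.99, MAF=0.05)
    # Prefix without the .chr#.db ending
    scorer.load_refpanel(refpanel_prefix, parallel=1)
    # Column indices are the documented defaults of load_genome
    scorer.load_genome(annotation_file, ccol=1, cid=0, csymb=5,
                       cstx=2, cetx=3, cs=4, header=False)
    # Assumes SNP ids in column 0, p-values in column 1 and a header line
    scorer.load_GWAS(gwas_file, rscol=0, pcol=1, header=True)
    return scorer.score(genes, parallel=1, method="saddle", mode="auto")
```

It could be called as score_genes("/path/refpanel", "annotation.tsv", "gwas.tsv.gz", ["GENEA"]), with all paths and the gene symbol being placeholders.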

activateFails(RESULT)[source]

Helper method to force activate failed genes to success genes

Parameters: RESULT (list) – Return of scoring function

Warning

Only use if you know what you are doing!

rescore(RESULT, method='pearson', mode='auto', reqacc=1e-100, intlimit=100000, parallel=1, nobar=False, keep_idx=None)[source]

Function to re-score only the failed gene scorings of a previous scoring run with different scorer settings.

Parameters

RESULT (list) – Return of one of the gene scoring methods
parallel (int) – # of cores to use
method (string) – Method to use to evaluate tail probability ('auto', 'davies', 'ruben', 'satterthwaite', 'pearson', 'saddle')
mode (string) – Precision mode to use ('', '128b', '100d', 'auto')
reqacc (float) – Requested accuracy
intlimit (int) – Max # integration terms to use
nobar – Do not show progress bar

score_gene_bulk_chr(chrs, gene, data, unloadRef=False, method='saddle', mode='auto', reqacc=1e-100, intlimit=100000, autorescore=False)[source]

Perform scoring in bulk for supplied set of SNPs

Parameters

chrs (int) – Chromosome number the supplied SNPs are located on
gene (string) – Gene symbol for the SNPs
data (list) – List of SNP data in format [ [rsid1,rsid2,…], [GWASid1, GWASid2,…], M ] with M a p-value matrix (rows: GWAS, cols: rsid)
unloadRef (bool) – Remove loaded reference data from memory
method (string) – Method to use to evaluate tail probability ('auto', 'davies', 'ruben', 'satterthwaite', 'pearson', 'saddle')
mode (string) – Precision mode to use ('', '128b', '100d', 'auto')
reqacc (float) – Requested accuracy
intlimit (int) – Max # integration terms to use
autorescore (bool) – Automatically try to re-score failed genes via Pearson’s algorithm
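
The layout of the data argument is easiest to see in a small example. The sketch below builds the three-element list with made-up SNP and GWAS identifiers and checks that the matrix shape agrees with the two id lists. A nested list stands in for M; whether PascalX expects a NumPy array instead is not stated here.

```python
rsids = ["rs1", "rs2", "rs3"]          # column labels of M
gwas_ids = ["traitA", "traitB"]        # row labels of M
pvalue_matrix = [
    [0.01, 0.20, 0.50],                # p-values of traitA for rs1..rs3
    [0.03, 0.70, 0.90],                # p-values of traitB for rs1..rs3
]
data = [rsids, gwas_ids, pvalue_matrix]


def check_bulk_data(data):
    """Return True if M has one row per GWAS and one column per rsid."""
    rsids, gwas_ids, matrix = data
    return (len(matrix) == len(gwas_ids)
            and all(len(row) == len(rsids) for row in matrix))


print(check_bulk_data(data))
```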

plot_genesnps(G, show_correlation=False, mark_window=False, tickspacing=10, color='limegreen', corrcmap=None)[source]

Plots the SNP p-values for a list of genes and the genotypic SNP-SNP correlation matrix

Parameters

G (list) – List of gene symbols
show_correlation (bool) – Plot the corresponding SNP-SNP correlation matrix
tickspacing (int) – Spacing of ticks for correlation plot
color (color) – Color for SNP associations
corrcmap (cmap) – Colormap to use for correlation plot (None for default)

test_gene_assocdir(gene, epsilon=1e-08)[source]

Tests for directional association of the gene (Requires betas of GWAS)

Parameters

gene (str) – Gene symbol to test
epsilon (float) – Regularization parameter

class PascalX.genescorer.wchi2sum(window=50000, varcutoff=0.99, MAF=0.05, genome=None, gpu=False)[source]

Implementation of weighted chi2 sum based genescorer

Note

SNP weights have to be supplied via Mapper

score(gene, parallel=1, unloadRef=False, method='saddle', mode='auto', reqacc=1e-100, intlimit=1000000, nobar=False, autorescore=False, keep_idx=None)[source]

Performs gene scoring for a given list of gene symbols

Parameters

gene (list) – gene symbols to score.
parallel (int) – # of cores to use
unloadRef (bool) – Keep reference data for only one chromosome per core in memory (True | False)
method (string) – Method to use to evaluate tail probability ('auto', 'davies', 'ruben', 'satterthwaite', 'pearson', 'saddle')
mode (string) – Precision mode to use ('', '128b', '100d', 'auto')
reqacc (float) – Requested accuracy
intlimit (int) – Max # integration terms to use
nobar (bool) – Do not show progress bar
autorescore (bool) – Automatically try to re-score failed genes via Pearson’s algorithm

Pathwayscorer

class PascalX.pathway.chi2rank(genescorer, mergedist=100000, fuse=True)[source]

Pathway scoring via chi2 of ranked gene p-values. Nearby genes can be merged to form meta-genes

Parameters

genescorer (genescorer) – The initialized genescorer to use to re-compute fused genes
mergedist (int) – Maximum distance between genes to merge
fuse (bool) – Fuse nearby genes to meta-genes
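
The idea of scoring a pathway via chi2 of ranked gene p-values can be sketched as follows: gene p-values are replaced by their rank divided by n + 1, transformed to chi2 values with one degree of freedom, and summed over the pathway members. This is a simplified illustration with hypothetical helper names. The rank formula is an assumption, gene fusion is left out, and the final comparison of the sum against a chi2 distribution with as many degrees of freedom as scored genes is not computed.

```python
from statistics import NormalDist


def rank_to_chi2(gene_pvalues):
    """Map each gene to the chi2 (1 dof) value of its rank-normalized p-value."""
    ordered = sorted(gene_pvalues, key=gene_pvalues.get)
    n = len(ordered)
    chi2 = {}
    for i, gene in enumerate(ordered):
        u = (i + 1) / (n + 1)                 # rank-based uniform p-value
        z = NormalDist().inv_cdf(u / 2.0)
        chi2[gene] = z * z
    return chi2


def pathway_statistic(module_genes, chi2):
    """Return (sum of chi2 values, number of scored genes) for one module."""
    scored = [g for g in module_genes if g in chi2]
    return sum(chi2[g] for g in scored), len(scored)


chi2 = rank_to_chi2({"A": 1e-6, "B": 0.2, "C": 0.8})
print(pathway_statistic(["A", "C", "Z"], chi2))
```

Genes without a score, such as "Z" in the example, are ignored in this sketch.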

score(modules, method='saddle', mode='auto', reqacc=1e-100, parallel=1, nobar=False, genes_only=False, chrs_only=None, autorescore=True)[source]

Scores a set of pathways/modules

Parameters

modules (list) – List of modules to score
method (string) – Method to use to evaluate tail probability ('auto', 'davies', 'ruben', 'satterthwaite', 'pearson', 'saddle')
mode (string) – Precision mode to use ('', '128b', '100d', 'auto')
reqacc (float) – Requested accuracy
autorescore (bool) – Automatically try to re-score failed genes
nobar (bool) – Do not show progress bar
genes_only (bool) – Compute only (fused)-genescores (accessible via genescorer method)
chrs_only (list) – Only consider genes on listed chromosomes. None for all

get_sigpathways(RESULT, cutoff=0.0001)[source]

Prints significant pathways in the result set

Parameters

RESULT (list) – Return of a pathwayscorer
cutoff (float) – Significance threshold to print pathways

load_modules(file, ncol=0, fcol=2, symbol=True)[source]

Load modules from tab separated file

Parameters

file (string) – path/filename
ncol (int) – Column with name of module
fcol (int) – Column with first gene (symbol) in module. Remaining genes have to follow, tab (\t) separated
symbol (bool) – Genes are given as gene symbols (False requires genome to be set in genescorer)
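
The expected module file layout can be illustrated with a stand-alone reader: one module per line, the module name in column ncol, and the member genes from column fcol onwards. The function read_modules and its return structure are assumptions for illustration, not PascalX's own.

```python
def read_modules(lines, ncol=0, fcol=2):
    """Illustrative reader for a tab separated module file.

    Returns a list of [module name, list of genes] entries.
    """
    modules = []
    for row in lines:
        cols = row.rstrip("\n").split("\t")
        if len(cols) <= fcol:
            continue                          # line without any gene
        genes = [g for g in cols[fcol:] if g]  # drop empty trailing fields
        modules.append([cols[ncol], genes])
    return modules


rows = ["PATH1\tdescription\tGENEA\tGENEB", "PATH2\tdescription\tGENEC"]
print(read_modules(rows))
```

With the default fcol=2, column 1 can hold free text such as a description or URL.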

set_genescorer(S)[source]

Set the genescorer to use for re-computing p-values of fused genes

Parameters: S (genescorer) – The initialized genescorer to use
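
The chi2rank methods above could be combined into a pathway scoring run roughly as below. This is a sketch with a hypothetical wrapper name and placeholder arguments. It assumes that load_modules returns the loaded modules in the form expected by score, which is not stated in this section.

```python
def score_pathways(scorer, module_file, cutoff=1e-4):
    """Score all modules in module_file with an initialized genescorer."""
    # Imported lazily so that defining this helper does not require PascalX.
    from PascalX import pathway

    pscorer = pathway.chi2rank(scorer, mergedist=100000, fuse=True)
    # Assumption: load_modules returns the list of modules to pass to score()
    modules = pscorer.load_modules(module_file, ncol=0, fcol=2, symbol=True)
    result = pscorer.score(modules, method="saddle", mode="auto", parallel=1)
    # Prints the pathways below the significance cutoff
    pscorer.get_sigpathways(result, cutoff=cutoff)
    return result
```

The scorer argument is a genescorer that already has a reference panel, a genome annotation and GWAS data loaded.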

class PascalX.pathway.chi2perm(genescorer, mergedist=100000, fuse=True)[source]

Pathway scoring via testing summed inverse chi2 transformed gene p-values against equally sized random samples of gene sets.

Parameters

genescorer (genescorer) – The initialized genescorer to use to re-compute fused genes
mergedist (int) – Maximum distance between genes to merge
fuse (bool) – Fuse nearby genes to meta-genes

Note

Genes in the background gene sets are NOT fused.
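
The permutation test described above can be sketched with the standard library. The example is a simplified illustration under assumptions: the helper names are hypothetical, the background is taken to be all scored genes, genes are not fused, and the empirical p-value uses the common (hits + 1) / (samples + 1) estimator.

```python
import random
from statistics import NormalDist


def chi2_from_p(p):
    """Inverse chi2 (1 dof) transform of a p-value in (0, 1]."""
    z = NormalDist().inv_cdf(p / 2.0)
    return z * z


def permutation_pvalue(module_genes, gene_pvalues, samples=1000, seed=0):
    """Compare a module's chi2 sum with equally sized random gene sets."""
    rng = random.Random(seed)
    chi2 = {g: chi2_from_p(p) for g, p in gene_pvalues.items()}
    members = [g for g in module_genes if g in chi2]
    observed = sum(chi2[g] for g in members)
    background = list(chi2)
    hits = 0
    for _ in range(samples):
        drawn = rng.sample(background, len(members))
        if sum(chi2[g] for g in drawn) >= observed:
            hits += 1
    return (hits + 1) / (samples + 1)


gene_pvalues = {f"G{i}": (i + 1) / 21 for i in range(20)}
print(permutation_pvalue(["G0", "G1"], gene_pvalues))
```

The number of random draws sets the resolution of the result: with samples draws the smallest reachable p-value is 1 / (samples + 1).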

score(modules, samples=100000, method='saddle', mode='auto', reqacc=1e-100, parallel=1, nobar=False, autorescore=True)[source]

Scores a set of pathways/modules

Parameters

modules (list) – List of modules to score
samples (int) – # of random gene sets to draw
method (string) – Method to use to evaluate tail probability ('auto', 'davies', 'ruben', 'satterthwaite', 'pearson', 'saddle')
mode (string) – Precision mode to use ('', '128b', '100d', 'auto')
reqacc (float) – Requested accuracy
nobar (bool) – Do not show progress bar
autorescore (bool) – Automatically try to re-score failed genes

get_sigpathways(RESULT, cutoff=0.0001)[source]

Prints significant pathways in the result set

Parameters

RESULT (list) – Return of a pathwayscorer
cutoff (float) – Significance threshold to print pathways

load_modules(file, ncol=0, fcol=2, symbol=True)[source]

Load modules from tab separated file

Parameters

file (string) – path/filename
ncol (int) – Column with name of module
fcol (int) – Column with first gene (symbol) in module. Remaining genes have to follow, tab (\t) separated
symbol (bool) – Genes are given as gene symbols (False requires genome to be set in genescorer)

set_genescorer(S)[source]

Set the genescorer to use for re-computing p-values of fused genes

Parameters: S (genescorer) – The initialized genescorer to use

Xscorer

class PascalX.xscorer.zsum(window=50000, varcutoff=0.99, MAF=0.05, leftTail=False, gpu=False)[source]

This class implements the cross scorer based on SNP coherence over gene windows.
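
One plausible reading of "SNP coherence" is the sum over SNPs of the product of signed z-scores from the two GWAS, where the sign comes from the effect size (beta). The sketch below illustrates that reading only. It is an assumption, not taken from the text above, and the helper names are hypothetical.

```python
import math
from statistics import NormalDist


def signed_z(p, beta):
    """Two-sided p-value and effect direction converted to a signed z-score."""
    magnitude = -NormalDist().inv_cdf(p / 2.0)   # non-negative for 0 < p <= 1
    return math.copysign(magnitude, beta)


def coherence_statistic(snps_a, snps_b):
    """Sum of z_A * z_B over the SNPs present in both GWAS.

    snps_*: dict mapping SNP id to (p-value, beta).
    """
    shared = sorted(set(snps_a) & set(snps_b))
    return sum(signed_z(*snps_a[s]) * signed_z(*snps_b[s]) for s in shared)


gwas_a = {"rs1": (0.05, 0.3), "rs2": (0.01, -0.1)}
gwas_b = {"rs1": (0.05, 0.2), "rs2": (0.02, -0.4)}
print(coherence_statistic(gwas_a, gwas_b))
```

In this sketch, effects pointing in the same direction in both GWAS give a positive contribution and opposite directions give a negative one.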

get_topscores(N=10)[source]

Prints and returns the top gene scores

Parameters: N (int) – # to show
Returns: Ordered list of top scores
Return type: list

jointlyRank(E_A, E_B)[source]

Jointly QQ normalizes the p-values of two GWAS

Parameters

E_A (str) – Identifier of first GWAS
E_B (str) – Identifier of second GWAS

jointlyRank_mapper(E_A, E_B, invert=False)[source]

Jointly QQ normalizes the p-values of GWAS and Mapper

Parameters

E_A (str) – Identifier of first GWAS
E_B (str) – Identifier of MAP

load_GWAS(file, rscol=0, pcol=1, bcol=2, a1col=None, a2col=None, idcol=None, name='GWAS', delimiter=None, NAid='n/a', header=False, threshold=1, mincutoff=0.0, rank=False, SNPonly=False, log10p=False)[source]

Load GWAS summary statistics p-values and betas

Parameters

file (string) – File containing the GWAS summary statistics data. Either as textfile or gzip compressed with ending .gz
rscol (int) – Column of SNP ids
pcol (int) – Column of p-values
bcol (int) – Column of betas
a1col (int) – Column of alternate allele (None for ignoring alleles)
a2col (int) – Column of reference allele (None for ignoring alleles)
idcol – Column of identifiers, if several different GWAS in one file
name – Identifier code for GWAS (needs to be unique)
delimiter (String) – Split character
header (bool) – Header present
NAid (String) – Code for not available (rows are ignored)
threshold (float) – Only load data with p-value < threshold
SNPonly (bool) – Load only SNPs (only if a1col and a2col are specified)
log10p (bool) – p-values are -log10 transformed

Note

The loaded GWAS data is shared between different xscorer instances!

load_genome(file, ccol=1, cid=0, csymb=5, cstx=2, cetx=3, cs=4, cb=None, chrStart=0, splitchr='\t', NAgeneid='n/a', useNAgenes=False, header=False)[source]

Imports gene annotation from text file

Parameters

file (text) – File to load
ccol (int) – Column containing chromosome number
cid (int) – Column containing gene id
csymb (int) – Column containing gene symbol
cstx (int) – Column containing transcription start
cetx (int) – Column containing transcription end
cs (int) – Column containing strand
cb (int) – Column containing band (None if not supplied)
chrStart (int) – Number of leading characters to skip in ccol
splitchr (string) – Character used to separate columns in text file
NAgeneid (string) – Identifier for not available gene id
useNAgenes (bool) – Import genes without gene id
header (bool) – First line is header

Internal:

_GENEID (dict) - Contains the raw data with gene id (cid) as key
_GENESYMB (dict) - Mapping from gene symbols (csymb) to gene ids (cid)
_GENEIDtoSYMB (dict) - Mapping from gene ids (cid) to gene symbols (csymb)
_CHR (dict) - Mapping from chromosomes to list of gene symbols
_BAND (dict) - Mapping from band to list of gene symbols
_SKIPPED (dict) - Genes (cid) which could not be imported

Note

A unique gene id is automatically generated for n/a gene ids if useNAgenes=True.

Note

Identical gene ids occurring in more than one row are merged into a single gene if they are on the same chromosome and the positional gap is < 1Mb. The strand is taken from the longer segment.

load_mapping(file, name='MAP', gcol=0, rcol=1, wcol=None, bcol=None, delimiter='\t', a1col=None, a2col=None, pfilter=1, header=False, joint=True)[source]

Loads a SNP to gene mapping

Parameters

file (string) – File to load
name (string) – Identifier code for mapping (needs to be unique)
gcol (int) – Column with gene id
rcol (int) – Column with SNP id
wcol (int) – Column with weight
bcol (int) – Column with additional weight
delimiter (string) – Character used to separate columns
a1col (int) – Column of alternate allele (None for ignoring alleles)
a2col (int) – Column of reference allele (None for ignoring alleles)
pfilter (float) – Only include rows with wcol < pfilter
header (bool) – Header present
joint (bool) – Use mapping SNPs and gene window based SNPs

load_refpanel(filename, parallel=1, keepfile=None, qualityT=100, SNPonly=False, chrlist=None)[source]

Sets the reference panel to use

Parameters

filename (string) – /path/filename (without .chr#.db ending)
parallel (int) – Number of cores to use for parallel import of reference panel
keepfile – File with sample ids (one per line) to keep (only for .vcf)
qualityT – Quality threshold for variant to keep (only for .vcf)
SNPonly – Import only SNPs (only for .vcf)
chrlist (list) – List of chromosomes to import. (None to import 1-22)

Note

One file per chromosome with ending .chr#.db required (#: 1-22). If imported reference panel is not present, PascalX will automatically try to import from .chr#.tped.gz or .chr#.vcf.gz files.

load_scores(file, gcol=0, pcol=1, header=False)[source]

Load computed gene scores

Parameters

file (string) – Filename of data to load
gcol (int) – Column with gene symbol
pcol (int) – Column with p-value
header (bool) – File contains a header (True|False)

matchAlleles(E_A, E_B, matchRefPanel=False)[source]

Matches alleles between two GWAS (SNPs with non matching alleles are removed)

Parameters

E_A (str) – Identifier of first GWAS
E_B (str) – Identifier of second GWAS
matchRefPanel (bool) – Match also alleles to reference panel

Note

Currently, matchRefPanel=True requires sufficient memory to load all reference panel indices into memory.
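
Allele matching between two GWAS can be sketched as follows. The example is an illustration under assumptions: match_alleles is a hypothetical helper, each SNP carries (a1, a2, beta), and SNPs whose alleles are swapped between the two GWAS are re-aligned by flipping the sign of beta. Whether PascalX re-aligns or removes swapped SNPs is not stated here.

```python
def match_alleles(gwas_a, gwas_b):
    """Keep only SNPs whose alleles agree between two GWAS.

    gwas_*: dict mapping SNP id to (a1, a2, beta).
    Returns the filtered copies of both inputs.
    """
    kept_a, kept_b = {}, {}
    for snp in sorted(set(gwas_a) & set(gwas_b)):
        a1, a2, beta_a = gwas_a[snp]
        b1, b2, beta_b = gwas_b[snp]
        if (a1, a2) == (b1, b2):
            kept_a[snp] = (a1, a2, beta_a)
            kept_b[snp] = (b1, b2, beta_b)
        elif (a1, a2) == (b2, b1):
            # Assumption: swapped alleles are re-aligned by flipping beta
            kept_a[snp] = (a1, a2, beta_a)
            kept_b[snp] = (a1, a2, -beta_b)
        # Any other combination counts as non-matching and is removed
    return kept_a, kept_b


gwas_a = {"rs1": ("A", "G", 0.1), "rs2": ("C", "T", 0.2), "rs3": ("A", "C", 0.3)}
gwas_b = {"rs1": ("A", "G", 0.5), "rs2": ("T", "C", 0.4), "rs3": ("A", "T", 0.6)}
print(match_alleles(gwas_a, gwas_b))
```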

matchAlleles_mapper(E_A, E_B, matchRefPanel=False)[source]

Matches alleles between GWAS and Mapper loaded data (SNPs with non matching alleles are removed)

Parameters

E_A (str) – Identifier of GWAS
E_B (str) – Identifier of MAP
matchRefPanel (bool) – Match also with reference panel alleles

plot_genesnps(G, E_A, E_B, rank=False, zscore=False, show_correlation=False, mark_window=False, MAF=None, tickspacing=10, pcolor='limegreen', ncolor='darkviolet', corrcmap=None)[source]

Plots the SNP p-values for a list of genes and the genotypic SNP-SNP correlation matrix

Parameters

G (list) – List of gene symbols
show_correlation (bool) – Plot the corresponding SNP-SNP correlation matrix
mark_window (bool) – Mark the gene transcription start and end positions
MAF (float) – MAF filter (None for value set in class)
tickspacing (int) – Spacing of ticks
pcolor (color) – Color for positive SNP associations
ncolor (color) – Color for negative SNP associations
corrcmap (cmap) – Colormap to use for correlation plot (None for default)

save_scores(file)[source]

Save computed gene scores

Parameters: file (string) – Filename to store data

score(gene, E_A=None, E_B=None, threshold=1, parallel=1, method=None, mode=None, nobar=False, reqacc=None, autorescore=False, pcorr=0)[source]

Performs cross scoring for a given list of gene symbols

Parameters

gene (list) – gene symbols to score.
E_A (str) – First GWAS to use
E_B (str) – Second GWAS to use
parallel (int) – # of cores to use
unloadRef (bool) – Keep only reference data for one chromosome in memory (True, False) per core
method (string) – Not used
mode (string) – Not used
reqacc (float) – Not used
intlimit (int) – Not used
threshold (bool) – Threshold p-value to reqacc
nobar (bool) – Do not show progress bar
autorescore (bool) – Automatically try to re-score failed genes
pcorr (float) – Sample overlap correction factor
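
A cross scoring run between two GWAS could chain the zsum methods documented in this section as below. This is a sketch using only the listed signatures. The wrapper name cross_score, the identifiers "A" and "B", the allele column positions (3 and 4) and header=True are placeholders that depend on the input files.

```python
def cross_score(refpanel_prefix, annotation_file, gwas_a_file, gwas_b_file, genes):
    """Cross score a list of gene symbols between two GWAS."""
    # Imported lazily so that defining this helper does not require PascalX.
    from PascalX import xscorer

    X = xscorer.zsum(window=50000, varcutoff=0.99, MAF=0.05, leftTail=False)
    X.load_refpanel(refpanel_prefix, parallel=1)
    X.load_genome(annotation_file, ccol=1, cid=0, csymb=5, cstx=2, cetx=3, cs=4)
    # Each GWAS needs a unique name; allele columns are needed for matching
    X.load_GWAS(gwas_a_file, name="A", rscol=0, pcol=1, bcol=2,
                a1col=3, a2col=4, header=True)
    X.load_GWAS(gwas_b_file, name="B", rscol=0, pcol=1, bcol=2,
                a1col=3, a2col=4, header=True)
    X.matchAlleles("A", "B")     # remove SNPs with non matching alleles
    X.jointlyRank("A", "B")      # jointly QQ normalize both GWAS
    return X.score(genes, E_A="A", E_B="B", parallel=1)
```

Because loaded GWAS data is shared between xscorer instances, the names passed to load_GWAS should not clash with data loaded elsewhere in the same session.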

score_all(E_A=None, E_B=None, threshold=1, parallel=1, pcorr=0, method=None, mode=None, nobar=False, reqacc=None)[source]

Performs cross scoring for all gene symbols

Parameters

E_A (str) – First GWAS to use
E_B (str) – Second GWAS to use
parallel (int) – # of cores to use
pcorr (float) – Sample overlap correction factor
method (string) – Not used
mode (string) – Not used
reqacc (float) – Not used
nobar (bool) – Do not show progress bar

score_chr(E_A, E_B, chrs=['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22'], threshold=1, parallel=1, pcorr=0, nobar=False)[source]

Performs cross scoring for gene symbols on given chromosomes

Parameters

E_A (str) – First GWAS to use
E_B (str) – Second GWAS to use
chrs (list) – Chromosomes to score.
parallel (int) – # of cores to use
pcorr (float) – Sample overlap correction factor
nobar (bool) – Do not show progress bar

score_map(E_B, parallel=1, nobar=False, pcorr=0)[source]

Performs cross scoring for gene symbols given by mapper

Parameters

parallel (int) – # of cores to use
E_B (string) – Identifier of loaded MAP

class PascalX.xscorer.rsum(window=50000, varcutoff=0.99, MAF=0.05, leftTail=False, gpu=False)[source]

This class implements the ratio cross scorer based on SNP coherence/variance over gene windows.

get_topscores(N=10)[source]

Prints and returns the top gene scores

Parameters: N (int) – # to show
Returns: Ordered list of top scores
Return type: list

jointlyRank(E_A, E_B)[source]

Jointly QQ normalizes the p-values of two GWAS

Parameters

E_A (str) – Identifier of first GWAS
E_B (str) – Identifier of second GWAS

jointlyRank_mapper(E_A, E_B, invert=False)[source]

Jointly QQ normalizes the p-values of GWAS and Mapper

Parameters

E_A (str) – Identifier of first GWAS
E_B (str) – Identifier of MAP

load_GWAS(file, rscol=0, pcol=1, bcol=2, a1col=None, a2col=None, idcol=None, name='GWAS', delimiter=None, NAid='n/a', header=False, threshold=1, mincutoff=0.0, rank=False, SNPonly=False, log10p=False)[source]

Load GWAS summary statistics p-values and betas

Parameters

file (string) – File containing the GWAS summary statistics data. Either as textfile or gzip compressed with ending .gz
rscol (int) – Column of SNP ids
pcol (int) – Column of p-values
bcol (int) – Column of betas
a1col (int) – Column of alternate allele (None for ignoring alleles)
a2col (int) – Column of reference allele (None for ignoring alleles)
idcol – Column of identifiers, if several different GWAS in one file
name – Identifier code for GWAS (needs to be unique)
delimiter (String) – Split character
header (bool) – Header present
NAid (String) – Code for not available (rows are ignored)
threshold (float) – Only load data with p-value < threshold
SNPonly (bool) – Load only SNPs (only if a1col and a2col are specified)
log10p (bool) – p-values are -log10 transformed

Note

The loaded GWAS data is shared between different xscorer instances!

load_genome(file, ccol=1, cid=0, csymb=5, cstx=2, cetx=3, cs=4, cb=None, chrStart=0, splitchr='\t', NAgeneid='n/a', useNAgenes=False, header=False)[source]

Imports gene annotation from text file

Parameters

file (text) – File to load
ccol (int) – Column containing chromosome number
cid (int) – Column containing gene id
csymb (int) – Column containing gene symbol
cstx (int) – Column containing transcription start
cetx (int) – Column containing transcription end
cs (int) – Column containing strand
cb (int) – Column containing band (None if not supplied)
chrStart (int) – Number of leading characters to skip in ccol
splitchr (string) – Character used to separate columns in text file
NAgeneid (string) – Identifier for not available gene id
useNAgenes (bool) – Import genes without gene id
header (bool) – First line is header

Internal:

_GENEID (dict) - Contains the raw data with gene id (cid) as key
_GENESYMB (dict) - Mapping from gene symbols (csymb) to gene ids (cid)
_GENEIDtoSYMB (dict) - Mapping from gene ids (cid) to gene symbols (csymb)
_CHR (dict) - Mapping from chromosomes to list of gene symbols
_BAND (dict) - Mapping from band to list of gene symbols
_SKIPPED (dict) - Genes (cid) which could not be imported

Note

A unique gene id is automatically generated for n/a gene ids if useNAgenes=True.

Note

Identical gene ids occurring in more than one row are merged into a single gene if they are on the same chromosome and the positional gap is < 1Mb. The strand is taken from the longer segment.

load_mapping(file, name='MAP', gcol=0, rcol=1, wcol=None, bcol=None, delimiter='\t', a1col=None, a2col=None, pfilter=1, header=False, joint=True)[source]

Loads a SNP to gene mapping

Parameters

file (string) – File to load
name (string) – Identifier code for mapping (needs to be unique)
gcol (int) – Column with gene id
rcol (int) – Column with SNP id
wcol (int) – Column with weight
bcol (int) – Column with additional weight
delimiter (string) – Character used to separate columns
a1col (int) – Column of alternate allele (None for ignoring alleles)
a2col (int) – Column of reference allele (None for ignoring alleles)
pfilter (float) – Only include rows with wcol < pfilter
header (bool) – Header present
joint (bool) – Use mapping SNPs and gene window based SNPs

load_refpanel(filename, parallel=1, keepfile=None, qualityT=100, SNPonly=False, chrlist=None)[source]

Sets the reference panel to use

Parameters

filename (string) – /path/filename (without .chr#.db ending)
parallel (int) – Number of cores to use for parallel import of reference panel
keepfile – File with sample ids (one per line) to keep (only for .vcf)
qualityT – Quality threshold for variant to keep (only for .vcf)
SNPonly – Import only SNPs (only for .vcf)
chrlist (list) – List of chromosomes to import. (None to import 1-22)

Note

One file per chromosome with ending .chr#.db required (#: 1-22). If imported reference panel is not present, PascalX will automatically try to import from .chr#.tped.gz or .chr#.vcf.gz files.

load_scores(file, gcol=0, pcol=1, header=False)[source]

Load computed gene scores

Parameters

file (string) – Filename of data to load
gcol (int) – Column with gene symbol
pcol (int) – Column with p-value
header (bool) – File contains a header (True|False)

matchAlleles(E_A, E_B, matchRefPanel=False)[source]

Matches alleles between two GWAS (SNPs with non matching alleles are removed)

Parameters

E_A (str) – Identifier of first GWAS
E_B (str) – Identifier of second GWAS
matchRefPanel (bool) – Match also alleles to reference panel

Note

Currently, matchRefPanel=True requires sufficient memory to load all reference panel indices into memory.

matchAlleles_mapper(E_A, E_B, matchRefPanel=False)[source]

Matches alleles between GWAS and Mapper loaded data (SNPs with non matching alleles are removed)

Parameters

E_A (str) – Identifier of GWAS
E_B (str) – Identifier of MAP
matchRefPanel (bool) – Match also with reference panel alleles

plot_genesnps(G, E_A, E_B, rank=False, zscore=False, show_correlation=False, mark_window=False, MAF=None, tickspacing=10, pcolor='limegreen', ncolor='darkviolet', corrcmap=None)[source]

Plots the SNP p-values for a list of genes and the genotypic SNP-SNP correlation matrix

Parameters

G (list) – List of gene symbols
show_correlation (bool) – Plot the corresponding SNP-SNP correlation matrix
mark_window (bool) – Mark the gene transcription start and end positions
MAF (float) – MAF filter (None for value set in class)
tickspacing (int) – Spacing of ticks
pcolor (color) – Color for positive SNP associations
ncolor (color) – Color for negative SNP associations
corrcmap (cmap) – Colormap to use for correlation plot (None for default)

save_scores(file)[source]

Save computed gene scores

Parameters: file (string) – Filename to store data

score(gene, E_A=None, E_B=None, threshold=1, parallel=1, method=None, mode=None, nobar=False, reqacc=None, autorescore=False, pcorr=0)[source]

Performs cross scoring for a given list of gene symbols

Parameters

gene (list) – gene symbols to score.
E_A (str) – First GWAS to use
E_B (str) – Second GWAS to use
parallel (int) – # of cores to use
unloadRef (bool) – Keep only reference data for one chromosome in memory per core (True|False)
method (string) – Not used
mode (string) – Not used
reqacc (float) – Not used
intlimit (int) – Not used
threshold (bool) – Threshold p-value to reqacc
nobar (bool) – Do not show progress bar
autorescore (bool) – Automatically try to re-score failed genes
pcorr (float) – Sample overlap correction factor

score_all(E_A=None, E_B=None, threshold=1, parallel=1, pcorr=0, method=None, mode=None, nobar=False, reqacc=None)[source]

Performs cross scoring for all gene symbols

Parameters

E_A (str) – First GWAS to use
E_B (str) – Second GWAS to use
parallel (int) – # of cores to use
pcorr (float) – Sample overlap correction factor
method (string) – Not used
mode (string) – Not used
reqacc (float) – Not used
nobar (bool) – Do not show progress bar

score_chr(E_A, E_B, chrs=['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22'], threshold=1, parallel=1, pcorr=0, nobar=False)[source]

Performs cross scoring for gene symbols on given chromosomes

Parameters

E_A (str) – First GWAS to use
E_B (str) – Second GWAS to use
chrs (list) – Chromosomes to score.
parallel (int) – # of cores to use
pcorr (float) – Sample overlap correction factor
nobar (bool) – Do not show progress bar
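
Example (sketch): according to the signature above, the default value of chrs is a list of the autosomes 1 to 22 given as strings, not integers. A value of the same shape, or a subset of it, could be built as follows; the variable names are illustrative only.

```python
# Same shape as the default in the signature: '1' ... '22' as strings
autosomes = [str(c) for c in range(1, 23)]

# A subset, e.g. to restrict scoring to chromosomes 20-22
subset = [c for c in autosomes if int(c) >= 20]

print(autosomes)
print(subset)
```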

score_map(E_B, parallel=1, nobar=False, pcorr=0)[source]

Performs cross scoring for gene symbols given by mapper

Parameters

parallel (int) – # of cores to use
E_B (string) – Identifier of loaded MAP

Genexprscorer

class PascalX.genexpr.genexpr[source]

get_GTEX_expr(filename)[source]

Downloads GTEx v8 data and imports it.

Parameters: filename (string) – Filename to store downloaded GTEx data

Note

The import may take several hours.

load_expr(filename)[source]

Loads the GTEx data for usage

Parameters: filename (string) – File to load

load_genome(file, ccol=1, cid=0, csymb=5, cstx=2, cetx=3, cs=4, chrStart=0, splitchr='\t', NAgeneid='n/a', useNAgenes=False, header=False)[source]

Imports gene annotation from text file

Parameters

file (text) – File to load
ccol (int) – Column containing chromosome number
cid (int) – Column containing gene id
csymb (int) – Column containing gene symbol
cstx (int) – Column containing transcription start
cetx (int) – Column containing transcription end
cs (int) – Column containing strand
chrStart (int) – Number of leading characters to skip in ccol
splitchr (string) – Character used to separate columns in text file
NAgeneid (string) – Identifier for not available gene id
useNAgenes (bool) – Import genes without gene id
header (bool) – First line is header

Internal:

_GENEID (dict) - Contains the raw data with gene id (cid) as key
_GENESYMB (dict) - Mapping from gene symbols (csymb) to gene ids (cid)
_GENEIDtoSYMB (dict) - Mapping from gene ids (cid) to gene symbols (csymb)
_CHR (dict) - Mapping from chromosomes to list of gene symbols
_SKIPPED (dict) - Genes (cid) which could not be imported

Note

A unique gene id is automatically generated for n/a gene ids if useNAgenes=True.

Note

Identical gene ids occurring in more than one row are merged into a single gene if they are on the same chromosome and the positional gap is < 1Mb. The strand is taken from the longer segment.
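
Example (sketch): one possible reading of the merge rule in the note above, written for two rows. Segment, merge_pair and MAX_GAP are hypothetical names, not PascalX internals. The sketch assumes that 1Mb means 10^6 base pairs and that the gap is measured from the end of the earlier segment to the start of the later one.

```python
from dataclasses import dataclass

MAX_GAP = 1_000_000  # assumption: "1Mb" taken as 10^6 base pairs


@dataclass
class Segment:
    gene_id: str
    chrom: str
    start: int
    end: int
    strand: str


def merge_pair(a, b):
    """Return the merged segment, or None if the merge rule does not apply."""
    if a.gene_id != b.gene_id or a.chrom != b.chrom:
        return None
    first, second = sorted((a, b), key=lambda s: s.start)
    gap = second.start - first.end  # zero or negative if the rows overlap
    if gap >= MAX_GAP:
        return None
    # The strand is taken from the longer of the two segments
    longer = max((a, b), key=lambda s: s.end - s.start)
    return Segment(a.gene_id, a.chrom, first.start, max(a.end, b.end), longer.strand)


# Made-up rows sharing one gene id
near_a = Segment("GENE_X", "1", 1_000, 5_000, "+")
near_b = Segment("GENE_X", "1", 200_000, 210_000, "-")
far = Segment("GENE_X", "1", 5_000_000, 5_001_000, "+")

merged = merge_pair(near_a, near_b)
not_merged = merge_pair(near_a, far)
print(merged)
print(not_merged)
```

How PascalX handles rows that cannot be merged is not stated here; the sketch simply returns None in that case.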

chi2rank(pathways, fuse=True)[source]

Calculates tissue enrichment scores for set of genes using GTEx data

Parameters

pathways (list) – List of pathways to score [['name', ['gene1', 'gene2', …]], …]
fuse (bool) – Fuse nearby genes

plot_genexpr(genes, tzscore=False, cbar_pos=(0.0, 0.0, 0.01, 0.5))[source]

Plots gene expression matrix for list of genes

Parameters

genes (list) – List of genes
tzscore (bool) – Plot z-scores computed over tissues per gene (True|False)
cbar_pos (list) – Position coordinates of color bar
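
Example (sketch): a z-score over tissues per gene, as referred to by tzscore, standardizes each gene's row of expression values. The helper zscore_rows and the expression values are made up. Using the population standard deviation and mapping constant rows to 0 are assumptions of this sketch, not documented PascalX behavior.

```python
from statistics import mean, pstdev


def zscore_rows(expr):
    """Standardize each gene's expression values across tissues."""
    out = {}
    for gene, values in expr.items():
        mu, sigma = mean(values), pstdev(values)
        # Constant rows have zero spread; map them to 0.0 instead of dividing by zero
        out[gene] = [(v - mu) / sigma if sigma > 0 else 0.0 for v in values]
    return out


# Made-up expression values: gene -> one value per tissue
expr = {"geneA": [1.0, 2.0, 3.0], "geneB": [5.0, 5.0, 5.0]}
z = zscore_rows(expr)
print(z)
```

After this transformation each non-constant row has mean 0 and unit variance, so genes with very different expression levels become comparable across tissues.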

SNP database

class PascalX.snpdb.db[source]

Class for handling storage of the raw genotype data. The data is indexed and stored for each chromosome individually as zlib compressed pickle. The indexing allows fast random access via SNP ids or positions.

open(filename)[source]

Opens the storage file. A new file is created if it does not exist.

Parameters: filename (string) – Name to use for the storage file

insert(data)[source]

Stores set of rows into the file storage

Parameters: data (dict) – First storage key is the key in the dictionary. Second storage key is the first element of the inner list.

Warning

After all insert calls are done, the close function has to be called once to make the index persistent.

get(pos)[source]

Returns all stored data for a set of SNPs indexed via positions

Parameters: pos (list) – Positions of SNPs to retrieve

getSNPatPos(pos)[source]

Returns SNP id at position

Parameters: pos (list) – Positions of SNPs to retrieve

getPosatSNPs(snpids)[source]: Returns the position corresponding to each given SNP id. Warning: Inefficient.

getSNPs(snps)[source]

Returns all stored data for a set of SNPs indexed via SNP ids

Parameters: snps (list) – Ids of SNPs to retrieve

getSNPKeys()[source]: Returns the SNP ids in storage

getKeys()[source]: Returns SNP positions in storage

getSortedKeys()[source]: Returns a sorted list of SNP positions in storage

close()[source]: Closes open storage file.

Warning

After all inserts are done this function has to be called once to re-generate the index and close the storage file.
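
Example (sketch): the storage idea described for this class (zlib compressed pickles plus an index for random access by position or SNP id) can be illustrated with a toy store. TinyStore is hypothetical and is not the PascalX.snpdb.db implementation. The sketch assumes that insert receives a dictionary mapping a position to a list whose first element is the SNP id, and it keeps the index in memory only, whereas the real class makes the index persistent on close.

```python
import os
import pickle
import tempfile
import zlib


class TinyStore:
    """Toy position-indexed store; NOT the PascalX.snpdb.db implementation."""

    def __init__(self, path):
        self._fh = open(path, "w+b")
        self._by_pos = {}  # position -> (file offset, compressed length)
        self._by_snp = {}  # SNP id -> position

    def insert(self, data):
        # data: position -> [snp_id, payload, ...] (assumed layout)
        self._fh.seek(0, os.SEEK_END)
        for pos, row in data.items():
            blob = zlib.compress(pickle.dumps(row))
            self._by_pos[pos] = (self._fh.tell(), len(blob))
            self._by_snp[row[0]] = pos
            self._fh.write(blob)

    def get(self, positions):
        rows = []
        for pos in positions:
            offset, length = self._by_pos[pos]
            self._fh.seek(offset)  # jump straight to the record
            rows.append(pickle.loads(zlib.decompress(self._fh.read(length))))
        return rows

    def get_snps(self, snp_ids):
        return self.get([self._by_snp[s] for s in snp_ids])

    def close(self):
        self._fh.close()


with tempfile.TemporaryDirectory() as tmp:
    store = TinyStore(os.path.join(tmp, "chr1.store"))
    # Made-up positions, SNP ids, and genotype payloads
    store.insert({1050: ["rs_a", [0, 1, 2]], 2300: ["rs_b", [2, 2, 0]]})
    by_pos = store.get([2300])
    by_id = store.get_snps(["rs_a"])
    store.close()

print(by_pos)
print(by_id)
```

Because each record is compressed separately and located through the index, a lookup reads only the bytes of the requested record instead of decompressing the whole file.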

Reference panel

class PascalX.refpanel.refpanel[source]

load_pos_reference(cr, keep_idx=None)[source]

Returns a snpdb object for a chromosome and a sorted list of SNP positions on the chromosome

Parameters: cr (int) – Chromosome number

load_snp_reference(cr, keep_idx=None)[source]

Returns a snpdb object for a chromosome

Parameters: cr (int) – Chromosome number

set_refpanel(filename, parallel=1, keepfile=None, qualityT=100, SNPonly=False, chrlist=None, sourcefilename=None, regEx=None, nobar=True)[source]

Sets the reference panel to use

Parameters

filename (string) – /path/filename (without .chr#.db ending)
parallel (int) – Number of cores to use for parallel import of reference panel
keepfile (string) – [only for .vcf] File with sample ids (one per line) to keep. None to keep all.
qualityT (int) – [only for .vcf] Quality threshold for variant to keep (None to ignore)
SNPonly (bool) – [only for .vcf] Load only SNPs
chrlist (list) – List of chromosomes to import. (None to import 1-22)
sourcefilename (string) – /path/filename (without .chr#. ending) of .tped | .vcf files. None to use same as filename
regEx (string) – Regular expression to filter sample ids. First capture group is kept. [only for .vcf]
nobar (bool) – Do not show progress bar (the bar updates only when a chromosome has finished)

Note

One file per chromosome with ending .chr#.db is required (#: 1-22). If the imported reference panel is not present, PascalX will automatically try to import it from .chr#.tped.gz or .chr#.vcf.gz files.

Note

Alleles (under .vcf import) are stored internally in the order [ALT,REF].
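
Example (sketch): the regEx parameter is described as a regular expression whose first capture group is kept. How such a pattern behaves can be tried out with Python's re module before passing it on. The helper filter_sample_ids and the sample ids are made up. Using re.search and dropping ids that do not match are assumptions of this sketch.

```python
import re


def filter_sample_ids(sample_ids, pattern):
    """Keep the first capture group of every id that matches the pattern."""
    regex = re.compile(pattern)
    kept = []
    for sid in sample_ids:
        match = regex.search(sid)
        if match:
            kept.append(match.group(1))  # first capture group only
    return kept


# Made-up sample ids with a prefix that should be stripped
samples = ["0_HG00096", "0_HG00097", "control-1"]
ids = filter_sample_ids(samples, r"_(HG\d+)$")
print(ids)
```

The pattern needs at least one pair of parentheses, since group(1) refers to the first capture group.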

getSNPtoChrMap()[source]: Returns a dictionary mapping SNP id to corresponding chromosome number

getChrSNPs(cr)[source]

Returns SNP ids for a chromosome

Parameters: cr (int) – Chromosome number

Mapper

class PascalX.mapper.mapper(genome=None)[source]

load_mapping(file, gcol=0, rcol=1, wcol=None, a1col=None, a2col=None, bcol=None, pcol=None, delimiter='\t', pfilter=1, header=False, symbol=False)[source]

Loads a SNP to gene mapping

Parameters

file (string) – File to load
gcol (int) – Column with gene id
rcol (int) – Column with SNP id
wcol (int) – Column with weight (None for none)
a1col (int) – Column of alternate allele (None for ignoring alleles)
a2col (int) – Column of reference allele (None for ignoring alleles)
bcol (int) – Column with additional weight (None for none)
pcol (int) – Column with p-value (if None, the p-value is taken from the .load_GWAS data)
delimiter (string) – Character used to separate columns
header (bool) – Header present
pfilter (float) – Only include rows with pcol < pfilter
symbol (bool) – Gene id are gene symbols (requires genome to be set on init)
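
Example (sketch): how the column-index parameters above relate to a mapping file can be shown with a small stand-alone reader. read_mapping is a hypothetical helper that mirrors some of the parameter names; it is not the PascalX loader, and the file content, gene ids, and SNP ids are made up.

```python
import csv
import io
from collections import defaultdict


def read_mapping(text, gcol=0, rcol=1, wcol=None, pcol=None,
                 delimiter="\t", pfilter=1, header=False):
    """Parse gene-to-SNP rows; illustration only, not PascalX code."""
    mapping = defaultdict(list)
    rows = csv.reader(io.StringIO(text), delimiter=delimiter)
    if header:
        next(rows, None)  # skip the header line
    for row in rows:
        if not row:
            continue
        # Only include rows with pcol < pfilter
        if pcol is not None and not float(row[pcol]) < pfilter:
            continue
        weight = float(row[wcol]) if wcol is not None else None
        mapping[row[gcol]].append((row[rcol], weight))
    return dict(mapping)


# Made-up tab-separated content: gene id, SNP id, weight, p-value
text = (
    "gene\tsnp\tweight\tp\n"
    "GENE1\trs1\t0.8\t0.001\n"
    "GENE1\trs2\t0.1\t0.6\n"
    "GENE2\trs3\t-0.4\t0.02\n"
)
mapping = read_mapping(text, wcol=2, pcol=3, pfilter=0.05, header=True)
print(mapping)
```

With these settings the row for rs2 is skipped because its p-value is not below pfilter. All column indices are zero-based, matching the defaults gcol=0 and rcol=1.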