API documentation

The PascalX package comes with the following user modules:


Genome

class PascalX.genome.genome[source]

This class handles the genome annotation. It provides functionality for import of data from text files and automatic download of annotation data from ensembl.org.

gene_info(gene)[source]

Prints the loaded information for the given gene

Parameters

gene (string) – Symbol of gene to query

get_ensembl_annotation(filename, genetype='protein_coding', version='GRCh38')[source]

Gene annotation download function for ensembl.org BioMart data

Parameters
  • filename (string) – File to store downloaded annotation in

  • genetype (string) – Comma separated list of BioMart genetypes to download (protein_coding, pseudogene, …)

  • version (string) – GRCh37 | GRCh38

Example:

from PascalX.genome import genome
G = genome()
G.get_ensembl_annotation('ensembl_hg38.txt')

load_genome(file, ccol=1, cid=0, csymb=5, cstx=2, cetx=3, cs=4, cb=None, chrStart=0, splitchr='\t', NAgeneid='n/a', useNAgenes=False, header=False)[source]

Imports gene annotation from text file

Parameters
  • file (text) – File to load

  • ccol (int) – Column containing chromosome number

  • cid (int) – Column containing gene id

  • csymb (int) – Column containing gene symbol

  • cstx (int) – Column containing transcription start

  • cetx (int) – Column containing transcription end

  • cs (int) – Column containing strand

  • cb (int) – Column containing band (None if not supplied)

  • chrStart (int) – Number of leading characters to skip in ccol

  • splitchr (string) – Character used to separate columns in text file

  • NAgeneid (string) – Identifier for not available gene id

  • useNAgenes (bool) – Import genes without gene id

  • header (bool) – First line is header

Internal:
  • _GENEID (dict) - Contains the raw data with gene id (cid) as key

  • _GENESYMB (dict) - Mapping from gene symbols (csymb) to gene ids (cid)

  • _GENEIDtoSYMB (dict) - Mapping from gene ids (cid) to gene symbols (csymb)

  • _CHR (dict) - Mapping from chromosomes to list of gene symbols

  • _BAND (dict) - Mapping from band to list of gene symbols

  • _SKIPPED (dict) - Genes (cid) which could not be imported
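As a rough illustration of how the column arguments relate to these lookup tables, the following standalone sketch parses a tab-separated annotation in the default column layout and builds dictionaries analogous to those listed above. It does not call PascalX; the function name `parse_annotation`, the tuple layout and the gene ids are assumptions made for this example only.

```python
import io

def parse_annotation(handle, ccol=1, cid=0, csymb=5, cstx=2, cetx=3, cs=4,
                     chrStart=0, splitchr="\t", header=False):
    """Hypothetical parser (not PascalX code) mirroring the load_genome columns."""
    geneid, genesymb, geneid_to_symb, chrom = {}, {}, {}, {}
    for n, line in enumerate(handle):
        if header and n == 0:
            continue  # skip the header row
        cols = line.rstrip("\r\n").split(splitchr)
        c = cols[ccol][chrStart:]  # drop leading characters such as "chr"
        gid, symb = cols[cid], cols[csymb]
        geneid[gid] = (c, int(cols[cstx]), int(cols[cetx]), cols[cs], symb)
        genesymb[symb] = gid            # gene symbol -> gene id
        geneid_to_symb[gid] = symb      # gene id -> gene symbol
        chrom.setdefault(c, []).append(symb)  # chromosome -> gene symbols
    return geneid, genesymb, geneid_to_symb, chrom

# Made-up rows: gene id, chromosome, start, end, strand, symbol
text = (
    "ENSG0001\tchr1\t1000\t5000\t+\tGENEA\n"
    "ENSG0002\tchr2\t2000\t9000\t-\tGENEB\n"
)
GENEID, GENESYMB, GENEIDtoSYMB, CHR = parse_annotation(io.StringIO(text), chrStart=3)
print(GENESYMB["GENEA"], CHR["2"])
```

Here `chrStart=3` strips the `chr` prefix so that chromosomes are keyed as `1` and `2`.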

Note

A unique gene id is automatically generated for n/a gene ids if useNAgenes=True.

Note

Identical gene ids occurring in more than one row are merged into a single gene if they lie on the same chromosome and the positional gap is < 1 Mb. The strand is taken from the longer segment.
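The merge rule in this note can be sketched in a few lines. This is an illustration only, not the PascalX implementation: the function name `merge_segments` is hypothetical, and returning None for rows that cannot be merged is an assumption, since the handling of such rows is not described here.

```python
def merge_segments(segments, max_gap=1_000_000):
    """Merge rows that share one gene id (hypothetical helper).

    segments: list of (chromosome, start, end, strand) tuples for ONE gene id.
    Returns the merged tuple, or None if the rows cannot be merged.
    """
    chrom, start, end, strand = segments[0]
    longest = end - start
    for c, s, e, st in segments[1:]:
        if c != chrom:
            return None  # rows on different chromosomes are not merged
        gap = max(start, s) - min(end, e)  # negative if the segments overlap
        if gap >= max_gap:
            return None  # positional gap of 1 Mb or more: not merged
        if e - s > longest:
            longest, strand = e - s, st  # strand follows the longer segment
        start, end = min(start, s), max(end, e)
    return chrom, start, end, strand

merged = merge_segments([("1", 1000, 5000, "+"), ("1", 200000, 260000, "-")])
print(merged)
```

In this example the second row is longer, so the merged gene spans 1000 to 260000 and takes the strand of the second row.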


Genescorer

class PascalX.genescorer.genescorer[source]

Genescorer base class

load_refpanel(filename, parallel=1, keepfile=None, qualityT=100, SNPonly=False, chrlist=None)[source]

Sets the reference panel to use

Parameters
  • filename (string) – /path/filename (without .chr#.db ending)

  • parallel (int) – Number of cores to use for parallel import of reference panel

  • keepfile – File with sample ids (one per line) to keep (only for .vcf)

  • qualityT – Quality threshold for variant to keep (only for .vcf)

  • SNPonly – Import only SNPs (only for .vcf)

  • chrlist (list) – List of chromosomes to import. (None to import 1-22)

Note

One file per chromosome with ending .chr#.db required (#: 1-22). If imported reference panel is not present, PascalX will automatically try to import from .chr#.tped.gz or .chr#.vcf.gz files.
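To make the file naming convention concrete, the short sketch below lists, for each chromosome, the file names that the note describes: the imported `.chr#.db` file first, followed by the two raw formats. The helper name `refpanel_candidates` and the path prefix are hypothetical, and nothing here touches PascalX or the file system.

```python
def refpanel_candidates(filename, chrlist=None):
    """List the per-chromosome file names implied by the naming convention."""
    chromosomes = chrlist if chrlist is not None else range(1, 23)  # None: 1-22
    endings = (".db", ".tped.gz", ".vcf.gz")  # imported panel first, then raw sources
    return {
        str(c): [f"{filename}.chr{c}{ending}" for ending in endings]
        for c in chromosomes
    }

# "/path/EUR.1KG" is a placeholder prefix, passed without the .chr#.db ending
paths = refpanel_candidates("/path/EUR.1KG", chrlist=[21, 22])
for chromosome, names in paths.items():
    print(chromosome, names)
```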

load_genome(file, ccol=1, cid=0, csymb=5, cstx=2, cetx=3, cs=4, cb=None, chrStart=0, splitchr='\t', NAgeneid='n/a', useNAgenes=False, header=False)[source]

Imports gene annotation from text file

Parameters
  • file (text) – File to load

  • ccol (int) – Column containing chromosome number

  • cid (int) – Column containing gene id

  • csymb (int) – Column containing gene symbol

  • cstx (int) – Column containing transcription start

  • cetx (int) – Column containing transcription end

  • cs (int) – Column containing strand

  • cb (int) – Column containing band (None if not supplied)

  • chrStart (int) – Number of leading characters to skip in ccol

  • splitchr (string) – Character used to separate columns in text file

  • NAgeneid (string) – Identifier for not available gene id

  • useNAgenes (bool) – Import genes without gene id

  • header (bool) – First line is header

Internal:
  • _GENEID (dict) - Contains the raw data with gene id (cid) as key

  • _GENESYMB (dict) - Mapping from gene symbols (csymb) to gene ids (cid)

  • _GENEIDtoSYMB (dict) - Mapping from gene ids (cid) to gene symbols (csymb)

  • _CHR (dict) - Mapping from chromosomes to list of gene symbols

  • _BAND (dict) - Mapping from band to list of gene symbols

  • _SKIPPED (dict) - Genes (cid) which could not be imported

Note

A unique gene id is automatically generated for n/a gene ids if useNAgenes=True.

Note

Identical gene ids occurring in more than one row are merged into a single gene if they lie on the same chromosome and the positional gap is < 1 Mb. The strand is taken from the longer segment.

load_mapping(file, gcol=0, rcol=1, wcol=None, delimiter='\t', a1col=None, a2col=None, bcol=None, pcol=None, pfilter=1, header=False, joint=True, symbol=False)[source]

Loads a SNP to gene mapping

Parameters
  • file (string) – File to load

  • gcol (int) – Column with gene id

  • rcol (int) – Column with SNP id

  • wcol (int) – Column with weight

  • a1col (int) – Column of alternate allele (None for ignoring alleles)

  • a2col (int) – Column of reference allele (None for ignoring alleles)

  • bcol (int) – Column with additional weight

  • delimiter (string) – Character used to separate columns

  • pfilter (float) – Only include rows with wcol < pfilter

  • header (bool) – Header present

  • joint (bool) – Use mapping SNPs and gene window based SNPs

  • symbol (bool) – True: Gene id is a gene symbol; False: Gene id is an ensembl gene id

Note

  • A loaded mapping takes precedence over a loaded positional gene annotation

  • The mapping data is stored statically (same for all class initializations)

load_GWAS(file, rscol=0, pcol=1, bcol=None, a1col=None, a2col=None, delimiter=None, header=False, NAid='NA', log10p=False, cutoff=1e-300)[source]

Load GWAS summary statistics p-values

Parameters
  • file (string) – File containing the GWAS summary statistics data. Either as textfile or gzip compressed with ending .gz

  • rscol (int) – Column of SNP ids

  • pcol (int) – Column of p-values

  • bcol (int) – Column of betas (optional)

  • a1col (int) – Column of alternate allele (None for ignoring)

  • a2col (int) – Column of reference allele (None for ignoring)

  • delimiter (String) – Split character

  • header (bool) – Header present

  • NAid (String) – Code for not available (rows are ignored)

  • log10p (bool) – p-values are given -log10 transformed

  • cutoff (float) – Cutoff value for p-values (None for no cutoff)
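The interplay of `NAid`, `log10p` and `cutoff` can be illustrated with a small standalone reader. This is a sketch under stated assumptions, not PascalX code: the function name `read_pvalues` is hypothetical, and the reading of `cutoff` as a floor (smaller p-values are raised to the cutoff) is an assumption that the parameter description does not confirm.

```python
import io

def read_pvalues(handle, rscol=0, pcol=1, delimiter=None, header=False,
                 NAid="NA", log10p=False, cutoff=1e-300):
    """Hypothetical reader (not PascalX code) following the load_GWAS arguments."""
    pvalues = {}
    for n, line in enumerate(handle):
        if header and n == 0:
            continue  # skip the header row
        cols = line.rstrip("\r\n").split(delimiter)  # None splits on any whitespace
        if len(cols) <= max(rscol, pcol) or cols[pcol] == NAid:
            continue  # incomplete rows and rows marked as not available are ignored
        p = float(cols[pcol])
        if log10p:
            p = 10.0 ** (-p)  # undo the -log10 transformation
        if cutoff is not None:
            p = max(p, cutoff)  # ASSUMPTION: smaller p-values are raised to the cutoff
        pvalues[cols[rscol]] = p
    return pvalues

# Made-up summary statistics: SNP id and p-value separated by whitespace
text = "rs1 0.5\nrs2 NA\nrs3 1e-320\n"
pvalues = read_pvalues(io.StringIO(text))
print(pvalues)
```

For a gzip compressed file, a handle opened with `gzip.open(path, "rt")` from the standard library could be passed instead of the in-memory text used here.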

matchAlleles(SNPonly=False)[source]

Matches alleles between loaded GWAS and reference panel (SNPs with non matching alleles are removed)

Parameters

SNPonly (bool) – Keep only SNPs

rank()[source]

QQ normalizes the p-values of loaded GWAS
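A common way to QQ normalize p-values is to replace each value by its rank-based quantile, so that the result is evenly spread over (0, 1) while the ordering is preserved. The sketch below shows this idea; the function name `qq_normalize`, the choice of denominator (n + 1) and the handling of ties are assumptions, as the exact convention used by PascalX is not stated here.

```python
def qq_normalize(pvalues):
    """Map p-values to evenly spaced quantiles while preserving their order."""
    order = sorted(pvalues, key=pvalues.get)  # SNP ids from smallest to largest p
    n = len(order)
    return {snp: (rank + 1) / (n + 1) for rank, snp in enumerate(order)}

# Made-up p-values for three SNPs
ranked = qq_normalize({"rs1": 0.20, "rs2": 1e-8, "rs3": 0.03})
print(ranked)
```

After this transformation only the ordering of the original p-values matters; their magnitudes are discarded.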

rank_mapper()[source]

QQ normalizes the p-values of loaded Mapper

save_GWAS(file)[source]

Save GWAS p-values

Parameters

file (string) – Filename to store data

save_scores(file)[source]

Save computed gene scores

Parameters

file (string) – Filename to store data

load_scores(file, gcol=0, pcol=1, header=False)[source]

Load computed gene scores

Parameters
  • file (string) – Filename of data to load

  • gcol (int) – Column with gene symbol

  • pcol (int) – Column with p-value

  • header (bool) – File contains a header (True|False)

get_topscores(N=10)[source]

Prints and returns the top gene scores

Parameters

N (int) – # to show

Returns

Ordered list of top scores

Return type

list

get_geneinfo(gene)[source]

Shows details of the loaded annotation for a gene

Parameters

gene (String) – Gene symbol

plot_Manhattan(ScoringResult=None, region=None, sigLine=0, logsigThreshold=0, labelSig=True, labelList=[], style='colorful')[source]

Produces a Manhattan plot

Parameters
  • ScoringResult (list) – List of gene, p-value pairs [['Gene', p-value], …] (if None, generate from internal _SCORES)

  • region (str) – Band region to plot

  • sigLine (float) – Draws a horizontal line at p-value (if > 0)

  • logsigThreshold (float) – Significance threshold above which to label genes (log(p) values)

  • labelSig (bool) – Label genes above significance threshold

  • labelList (list) – List of gene names to label

  • style (string) – Design of the plot ('classic' | 'colorful')

clean()[source]

Removes scores obtained from previous runs

score_chr(chrs, unloadRef=False, method='saddle', mode='auto', reqacc=1e-100, intlimit=100000, parallel=1, nobar=False, autorescore=False, keep_idx=None)[source]

Perform gene scoring for full chromosomes

Parameters
  • chrs (list) – List of chromosomes to score.

  • unloadRef (bool) – Keep only reference data for one chromosome in memory (True, False) per core

  • method (string) – Method to use to evaluate tail probability ('auto', 'davies', 'ruben', 'satterthwaite', 'pearson', 'saddle')

  • mode (string) – Precision mode to use ('', '128b', '100d', 'auto')

  • reqacc (float) – requested accuracy

  • intlimit (int) – Max # integration terms to use

  • parallel (int) – # of cores to use

  • nobar (bool) – Do not show progress bar

  • autorescore (bool) – Automatically try to re-score failed genes via Pearson’s algorithm

score_all(parallel=1, method='saddle', mode='auto', reqacc=1e-100, intlimit=100000, nobar=False, autorescore=False, keep_idx=None)[source]

Perform full gene scoring

Parameters
  • parallel (int) – # of cores to use

  • method (string) – Method to use to evaluate tail probability ('auto', 'davies', 'ruben', 'satterthwaite', 'pearson', 'saddle')

  • mode (string) – Precision mode to use ('', '128b', '100d', 'auto')

  • reqacc (float) – requested accuracy

  • intlimit (int) – Max # integration terms to use

  • nobar (bool) – Do not show progress bar

  • autorescore (bool) – Automatically try to re-score failed genes via Pearson’s algorithm

class PascalX.genescorer.chi2sum(window=50000, varcutoff=0.99, MAF=0.05, genome=None, gpu=False)[source]

Implementation of chi2 sum based genescorer
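As a minimal sketch of the quantity a chi2 sum scorer builds on, the code below converts two-sided SNP p-values into chi2 values with one degree of freedom and sums them. The function names are hypothetical and the p-values are made up. The evaluation of the tail probability of this sum (the `method` argument of the scoring functions) is not reproduced here.

```python
from statistics import NormalDist

def chi2_from_p(p):
    """Inverse chi2 (1 degree of freedom) transform of a two-sided p-value."""
    z = NormalDist().inv_cdf(p / 2.0)  # lower-tail quantile, valid for 0 < p < 1
    return z * z

def chi2_sum(snp_pvalues):
    """Sum of the per-SNP chi2 values of one gene window."""
    return sum(chi2_from_p(p) for p in snp_pvalues)

# Made-up p-values of the SNPs in one gene window
stat = chi2_sum([0.05, 0.5, 0.001])
print(stat)
```

Using the lower-tail quantile `inv_cdf(p / 2)` instead of `inv_cdf(1 - p / 2)` avoids a loss of precision for very small p-values; the squared value is the same.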

score(gene, parallel=1, unloadRef=False, method='saddle', mode='auto', reqacc=1e-100, intlimit=1000000, nobar=False, autorescore=False, keep_idx=None)[source]

Performs gene scoring for a given list of gene symbols

Parameters
  • gene (list) – gene symbols to score.

  • parallel (int) – # of cores to use

  • unloadRef (bool) – Keep only reference data for one chromosome in memory (True, False) per core

  • method (string) – Method to use to evaluate tail probability ('auto', 'davies', 'ruben', 'satterthwaite', 'pearson', 'saddle')

  • mode (string) – Precision mode to use ('', '128b', '100d', 'auto')

  • reqacc (float) – requested accuracy

  • intlimit (int) – Max # integration terms to use

  • nobar (bool) – Do not show progress bar

  • autorescore (bool) – Automatically try to re-score failed genes via Pearson’s algorithm

activateFails(RESULT)[source]

Helper method to force activate failed genes to success genes

Parameters

RESULT (list) – Return of scoring function

Warning

Only use if you know what you are doing!

rescore(RESULT, method='pearson', mode='auto', reqacc=1e-100, intlimit=100000, parallel=1, nobar=False, keep_idx=None)[source]

Function to re-score only the failed gene scorings of a previous scoring run with different scorer settings.

Parameters
  • RESULT (list) – Return of one of the gene scoring methods

  • parallel (int) – # of cores to use

  • method (string) – Method to use to evaluate tail probability ('auto', 'davies', 'ruben', 'satterthwaite', 'pearson', 'saddle')

  • mode (string) – Precision mode to use ('', '128b', '100d', 'auto')

  • reqacc (float) – requested accuracy

  • intlimit (int) – Max # integration terms to use

  • nobar (bool) – Do not show progress bar

score_gene_bulk_chr(chrs, gene, data, unloadRef=False, method='saddle', mode='auto', reqacc=1e-100, intlimit=100000, autorescore=False)[source]

Perform scoring in bulk for supplied set of SNPs

Parameters
  • chrs (int) – Chromosome number the supplied SNPs are located on

  • gene (string) – Gene symbol for the SNPs

  • data (list) – List of SNP data in format [ [rsid1, rsid2, …], [GWASid1, GWASid2, …], M ] with M a p-value matrix (rows: GWAS, cols: rsid)

  • unloadRef (bool) – Remove loaded reference data from memory

  • method (string) – Method to use to evaluate tail probability ('auto', 'davies', 'ruben', 'satterthwaite', 'pearson', 'saddle')

  • mode (string) – Precision mode to use ('', '128b', '100d', 'auto')

  • reqacc (float) – requested accuracy

  • intlimit (int) – Max # integration terms to use

  • autorescore (bool) – Automatically try to re-score failed genes via Pearson’s algorithm
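The layout of the `data` argument can be illustrated with a toy example. The identifiers and p-values below are placeholders, a nested list stands in for the matrix M (whether a NumPy array is expected is not stated above), and `check_bulk_data` is a hypothetical helper that only verifies the shape.

```python
rsids = ["rs1", "rs2", "rs3"]        # placeholder SNP ids (columns of M)
gwas_ids = ["traitA", "traitB"]      # placeholder GWAS identifiers (rows of M)
M = [
    [0.01, 0.20, 0.50],              # p-values of traitA for rs1, rs2, rs3
    [0.30, 0.04, 0.90],              # p-values of traitB for rs1, rs2, rs3
]
data = [rsids, gwas_ids, M]

def check_bulk_data(data):
    """Check that data matches [rsids, gwas_ids, M] with one row of M per GWAS."""
    rsids, gwas_ids, matrix = data
    if len(matrix) != len(gwas_ids):
        return False  # rows of M correspond to GWAS
    return all(len(row) == len(rsids) for row in matrix)  # columns correspond to rsids

print(check_bulk_data(data))
```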

plot_genesnps(G, show_correlation=False, mark_window=False, tickspacing=10, color='limegreen', corrcmap=None)[source]

Plots the SNP p-values for a list of genes and the genotypic SNP-SNP correlation matrix

Parameters
  • G (list) – List of gene symbols

  • show_correlation (bool) – Plot the corresponding SNP-SNP correlation matrix

  • mark_window (bool) – Mark the gene transcription start and end positions

  • tickspacing (int) – Spacing of ticks for correlation plot

  • color (color) – Color for SNP associations

  • corrcmap (cmap) – Colormap to use for correlation plot (None for default)

test_gene_assocdir(gene, epsilon=1e-08)[source]

Tests for directional association of the gene (Requires betas of GWAS)

Parameters
  • gene (str) – Gene symbol to test

  • epsilon (float) – Regularization parameter

class PascalX.genescorer.wchi2sum(window=50000, varcutoff=0.99, MAF=0.05, genome=None, gpu=False)[source]

Implementation of weighted chi2 sum based genescorer

Note

SNP weights have to be supplied via Mapper

score(gene, parallel=1, unloadRef=False, method='saddle', mode='auto', reqacc=1e-100, intlimit=1000000, nobar=False, autorescore=False, keep_idx=None)[source]

Performs gene scoring for a given list of gene symbols

Parameters
  • gene (list) – gene symbols to score.

  • parallel (int) – # of cores to use

  • unloadRef (bool) – Keep only reference data for one chromosome in memory (True, False) per core

  • method (string) – Method to use to evaluate tail probability ('auto', 'davies', 'ruben', 'satterthwaite', 'pearson', 'saddle')

  • mode (string) – Precision mode to use ('', '128b', '100d', 'auto')

  • reqacc (float) – requested accuracy

  • intlimit (int) – Max # integration terms to use

  • nobar (bool) – Do not show progress bar

  • autorescore (bool) – Automatically try to re-score failed genes via Pearson’s algorithm

Pathwayscorer

class PascalX.pathway.chi2rank(genescorer, mergedist=100000, fuse=True)[source]

Pathway scoring via chi2 of ranked gene p-values. Nearby genes can be merged to form meta-genes

Parameters
  • genescorer (genescorer) – The initialized genescorer to use to re-compute fused genes

  • mergedist (int) – Maximum distance between genes to merge

  • fuse (bool) – Fuse nearby genes to meta-genes
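The idea of scoring a pathway via the chi2 of ranked gene p-values can be sketched as follows: gene p-values are ranked to uniform quantiles, transformed to chi2 values with one degree of freedom, summed over the pathway members, and compared against a chi2 distribution with as many degrees of freedom as members. This is a simplified illustration, not the PascalX implementation: gene fusion is omitted, all function names are hypothetical, and the rank convention (n + 1 in the denominator) is an assumption.

```python
import math
from statistics import NormalDist

def chi2_sf(x, k):
    """Upper tail probability of a chi2 distribution with integer k degrees of freedom."""
    y = x / 2.0
    if k % 2:  # odd k: start from Q(1/2, y) = erfc(sqrt(y))
        a, q = 0.5, math.erfc(math.sqrt(y))
    else:      # even k: start from Q(1, y) = exp(-y)
        a, q = 1.0, math.exp(-y)
    while a < k / 2.0:
        # recurrence Q(a + 1, y) = Q(a, y) + y^a exp(-y) / Gamma(a + 1)
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q

def pathway_pvalue(gene_pvalues, pathway):
    """Rank gene p-values, chi2 transform them and sum over the pathway members."""
    genes = sorted(gene_pvalues, key=gene_pvalues.get)
    n = len(genes)
    uniform = {g: (i + 1) / (n + 1) for i, g in enumerate(genes)}  # ranked p-values
    members = [g for g in pathway if g in uniform]
    stat = sum(NormalDist().inv_cdf(uniform[g] / 2.0) ** 2 for g in members)
    return chi2_sf(stat, len(members))

# Made-up gene p-values: G1 is the most significant gene, G9 the least
gene_pvalues = {f"G{i}": i / 100 for i in range(1, 10)}
p_top = pathway_pvalue(gene_pvalues, ["G1", "G2"])
p_bottom = pathway_pvalue(gene_pvalues, ["G8", "G9"])
print(p_top, p_bottom)
```

A pathway made of the highest ranked genes receives a smaller p-value than one made of the lowest ranked genes, which is the behaviour this kind of test is meant to capture.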

score(modules, method='saddle', mode='auto', reqacc=1e-100, parallel=1, nobar=False, genes_only=False, chrs_only=None, autorescore=True)[source]

Scores a set of pathways/modules

Parameters
  • modules (list) – List of modules to score

  • method (string) – Method to use to evaluate tail probability ('auto', 'davies', 'ruben', 'satterthwaite', 'pearson', 'saddle')

  • mode (string) – Precision mode to use ('', '128b', '100d', 'auto')

  • reqacc (float) – requested accuracy

  • autorescore (bool) – Automatically try to re-score failed genes

  • nobar (bool) – Do not show progress bar

  • genes_only (bool) – Compute only (fused)-genescores (accessible via genescorer method)

  • chrs_only (list) – Only consider genes on listed chromosomes. None for all

get_sigpathways(RESULT, cutoff=0.0001)[source]

Prints significant pathways in the result set

Parameters
  • RESULT (list) – Return of a pathwayscorer

  • cutoff (float) – Significance threshold to print pathways

load_modules(file, ncol=0, fcol=2, symbol=True)[source]

Load modules from tab separated file

Parameters
  • file (string) – path/filename

  • ncol (int) – Column with name of module

  • fcol (int) – Column with first gene (symbol) in module. The remaining genes have to follow, separated by tabs

  • symbol (bool) – Genes are given as gene symbols (False requires genome to be set in genescorer)
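The expected file layout can be illustrated with a small standalone reader: one module per line, the module name in column `ncol`, and the member genes from column `fcol` onwards. The function name `read_modules`, the returned structure and the example rows are assumptions for this sketch; the structure returned by PascalX itself is not described here.

```python
import io

def read_modules(handle, ncol=0, fcol=2):
    """Hypothetical reader for a tab-separated module file (not PascalX code)."""
    modules = []
    for line in handle:
        cols = line.rstrip("\r\n").split("\t")
        if len(cols) <= fcol:
            continue  # no gene columns on this line
        modules.append([cols[ncol], cols[fcol:]])  # [module name, list of genes]
    return modules

# Made-up file content: name, an extra column, then the member genes
text = (
    "PATHWAY_X\tsource\tGENEA\tGENEB\tGENEC\n"
    "PATHWAY_Y\tsource\tGENED\n"
)
modules = read_modules(io.StringIO(text))
print(modules)
```

With the defaults `ncol=0` and `fcol=2`, the second column is skipped, which matches files that carry a description or URL between the module name and the gene list.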

set_genescorer(S)[source]

Set the genescorer to use for re-computing p-values of fused genes

Parameters

S (genescorer) – The initialized genescorer to use

class PascalX.pathway.chi2perm(genescorer, mergedist=100000, fuse=True)[source]

Pathway scoring via testing summed inverse chi2 transformed gene p-values against equally sized random samples of gene sets.

Parameters
  • genescorer (genescorer) – The initialized genescorer to use to re-compute fused genes

  • mergedist (int) – Maximum distance between genes to merge

  • fuse (bool) – Fuse nearby genes to meta-genes

Note

Genes in the background gene sets are NOT fused.
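The permutation idea behind this class can be sketched with the standard library alone: the summed chi2 values of the pathway genes are compared with the sums obtained from randomly drawn gene sets of the same size. This is an illustration under stated assumptions, not the PascalX implementation: gene fusion is omitted, the function names are hypothetical, and the add-one estimate of the empirical p-value is a choice made for this sketch.

```python
import random
from statistics import NormalDist

def chi2_from_p(p):
    """Inverse chi2 (1 degree of freedom) transform of a p-value."""
    return NormalDist().inv_cdf(p / 2.0) ** 2

def permutation_pvalue(gene_pvalues, pathway, samples=1000, seed=0):
    """Empirical p-value of the pathway sum against random gene sets of equal size."""
    rng = random.Random(seed)  # fixed seed for reproducibility of the sketch
    chi2 = {g: chi2_from_p(p) for g, p in gene_pvalues.items()}
    members = [g for g in pathway if g in chi2]
    observed = sum(chi2[g] for g in members)
    background = list(chi2)
    hits = 0
    for _ in range(samples):
        draw = rng.sample(background, len(members))  # random set of the same size
        if sum(chi2[g] for g in draw) >= observed:
            hits += 1
    return (hits + 1) / (samples + 1)  # add-one estimate avoids a p-value of zero

# Made-up gene p-values: g0 is the most significant gene, g19 the least
gene_pvalues = {f"g{i}": (i + 1) / 21 for i in range(20)}
p_top = permutation_pvalue(gene_pvalues, ["g0", "g1"])
p_bottom = permutation_pvalue(gene_pvalues, ["g18", "g19"])
print(p_top, p_bottom)
```

The resolution of such an empirical p-value is limited by the number of draws, which is why the `samples` argument of the scoring method matters for small p-values.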

score(modules, samples=100000, method='saddle', mode='auto', reqacc=1e-100, parallel=1, nobar=False, autorescore=True)[source]

Scores a set of pathways/modules

Parameters
  • modules (list) – List of modules to score

  • samples (int) – # of random gene sets to draw

  • method (string) – Method to use to evaluate tail probability ('auto', 'davies', 'ruben', 'satterthwaite', 'pearson', 'saddle')

  • mode (string) – Precision mode to use ('', '128b', '100d', 'auto')

  • reqacc (float) – requested accuracy

  • nobar (bool) – Do not show progress bar

  • autorescore (bool) – Automatically try to re-score failed genes

get_sigpathways(RESULT, cutoff=0.0001)[source]

Prints significant pathways in the result set

Parameters
  • RESULT (list) – Return of a pathwayscorer

  • cutoff (float) – Significance threshold to print pathways

load_modules(file, ncol=0, fcol=2, symbol=True)[source]

Load modules from tab separated file

Parameters
  • file (string) – path/filename

  • ncol (int) – Column with name of module

  • fcol (int) – Column with first gene (symbol) in module. The remaining genes have to follow, separated by tabs

  • symbol (bool) – Genes are given as gene symbols (False requires genome to be set in genescorer)

set_genescorer(S)[source]

Set the genescorer to use for re-computing p-values of fused genes

Parameters

S (genescorer) – The initialized genescorer to use

Xscorer

class PascalX.xscorer.zsum(window=50000, varcutoff=0.99, MAF=0.05, leftTail=False, gpu=False)[source]

This class implements the cross scorer based on SNP coherence over gene windows.
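One simple way to quantify SNP coherence between two GWAS over a gene window is to sum the products of signed z-scores of the shared SNPs. The sketch below illustrates this idea only; it is an assumption that it resembles the statistic used by this class, the function names and data are made up, and the null distribution of the sum is not evaluated.

```python
from statistics import NormalDist

def signed_z(p, beta):
    """z-score with the magnitude given by the p-value and the sign given by beta."""
    z = abs(NormalDist().inv_cdf(p / 2.0))
    return z if beta >= 0 else -z

def coherence(gwas_a, gwas_b):
    """Sum of z-score products over SNPs present in both GWAS (illustrative only)."""
    shared = gwas_a.keys() & gwas_b.keys()
    return sum(signed_z(*gwas_a[s]) * signed_z(*gwas_b[s]) for s in shared)

# Made-up data: SNP id -> (p-value, beta)
A = {"rs1": (0.001, 0.3), "rs2": (0.01, -0.2), "rs3": (0.5, 0.1)}
B_same = {"rs1": (0.002, 0.4), "rs2": (0.02, -0.1)}      # effects point the same way
B_opposite = {"rs1": (0.002, -0.4), "rs2": (0.02, 0.1)}  # effects point the other way
print(coherence(A, B_same), coherence(A, B_opposite))
```

A positive sum indicates that the effects tend to point in the same direction in both GWAS, a negative sum that they tend to point in opposite directions.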

get_topscores(N=10)[source]

Prints and returns the top gene scores

Parameters

N (int) – # to show

Returns

Ordered list of top scores

Return type

list

jointlyRank(E_A, E_B)[source]

Jointly QQ normalizes the p-values of two GWAS

Parameters
  • E_A (str) – Identifier of first GWAS

  • E_B (str) – Identifier of second GWAS

jointlyRank_mapper(E_A, E_B, invert=False)[source]

Jointly QQ normalizes the p-values of GWAS and Mapper

Parameters
  • E_A (str) – Identifier of first GWAS

  • E_B (str) – Identifier of MAP

load_GWAS(file, rscol=0, pcol=1, bcol=2, a1col=None, a2col=None, idcol=None, name='GWAS', delimiter=None, NAid='n/a', header=False, threshold=1, mincutoff=0.0, rank=False, SNPonly=False, log10p=False)[source]

Load GWAS summary statistics p-values and betas

Parameters
  • file (string) – File containing the GWAS summary statistics data. Either as textfile or gzip compressed with ending .gz

  • rscol (int) – Column of SNP ids

  • pcol (int) – Column of p-values

  • bcol (int) – Column of betas

  • a1col (int) – Column of alternate allele (None for ignoring alleles)

  • a2col (int) – Column of reference allele (None for ignoring alleles)

  • idcol – Column of identifiers, if several different GWAS in one file

  • name – Identifier code for GWAS (needs to be unique)

  • delimiter (String) – Split character

  • header (bool) – Header present

  • NAid (String) – Code for not available (rows are ignored)

  • threshold (float) – Only load data with p-value < threshold

  • SNPonly (bool) – Load only SNPs (only if a1col and a2col is specified)

  • log10p (bool) – p-values are -log10 transformed

Note

The loaded GWAS data is shared between different xscorer instances!

load_genome(file, ccol=1, cid=0, csymb=5, cstx=2, cetx=3, cs=4, cb=None, chrStart=0, splitchr='\t', NAgeneid='n/a', useNAgenes=False, header=False)[source]

Imports gene annotation from text file

Parameters
  • file (text) – File to load

  • ccol (int) – Column containing chromosome number

  • cid (int) – Column containing gene id

  • csymb (int) – Column containing gene symbol

  • cstx (int) – Column containing transcription start

  • cetx (int) – Column containing transcription end

  • cs (int) – Column containing strand

  • chrStart (int) – Number of leading characters to skip in ccol

  • splitchr (string) – Character used to separate columns in text file

  • NAgeneid (string) – Identifier for not available gene id

  • useNAgenes (bool) – Import genes without gene id

  • header (bool) – First line is header

Internal:
  • _GENEID (dict) - Contains the raw data with gene id (cid) as key

  • _GENESYMB (dict) - Mapping from gene symbols (csymb) to gene ids (cid)

  • _GENEIDtoSYMB (dict) - Mapping from gene ids (cid) to gene symbols (csymb)

  • _CHR (dict) - Mapping from chromosomes to list of gene symbols

  • _BAND (dict) - Mapping from band to list of gene symbols

  • _SKIPPED (dict) - Genes (cid) which could not be imported

Note

A unique gene id is automatically generated for n/a gene ids if useNAgenes=True.

Note

Identical gene ids occurring in more than one row are merged into a single gene if they lie on the same chromosome and the positional gap is < 1 Mb. The strand is taken from the longer segment.

load_mapping(file, name='MAP', gcol=0, rcol=1, wcol=None, bcol=None, delimiter='\t', a1col=None, a2col=None, pfilter=1, header=False, joint=True)[source]

Loads a SNP to gene mapping

Parameters
  • file (string) – File to load

  • name (string) – Identifier code for mapping (needs to be unique)

  • gcol (int) – Column with gene id

  • rcol (int) – Column with SNP id

  • wcol (int) – Column with weight

  • bcol (int) – Column with additional weight

  • delimiter (string) – Character used to separate columns

  • a1col (int) – Column of alternate allele (None for ignoring alleles)

  • a2col (int) – Column of reference allele (None for ignoring alleles)

  • pfilter (float) – Only include rows with wcol < pfilter

  • header (bool) – Header present

  • joint (bool) – Use mapping SNPs and gene window based SNPs

load_refpanel(filename, parallel=1, keepfile=None, qualityT=100, SNPonly=False, chrlist=None)[source]

Sets the reference panel to use

Parameters
  • filename (string) – /path/filename (without .chr#.db ending)

  • parallel (int) – Number of cores to use for parallel import of reference panel

  • keepfile – File with sample ids (one per line) to keep (only for .vcf)

  • qualityT – Quality threshold for variant to keep (only for .vcf)

  • SNPonly – Import only SNPs (only for .vcf)

  • chrlist (list) – List of chromosomes to import. (None to import 1-22)

Note

One file per chromosome with ending .chr#.db required (#: 1-22). If imported reference panel is not present, PascalX will automatically try to import from .chr#.tped.gz or .chr#.vcf.gz files.

load_scores(file, gcol=0, pcol=1, header=False)[source]

Load computed gene scores

Parameters
  • file (string) – Filename of data to load

  • gcol (int) – Column with gene symbol

  • pcol (int) – Column with p-value

  • header (bool) – File contains a header (True|False)

matchAlleles(E_A, E_B, matchRefPanel=False)[source]

Matches alleles between two GWAS (SNPs with non matching alleles are removed)

Parameters
  • E_A (str) – Identifier of first GWAS

  • E_B (str) – Identifier of second GWAS

  • matchRefPanel (bool) – Match also alleles to reference panel

Note

Currently, matchRefPanel=True requires sufficient memory to load all reference panel indices into memory.

matchAlleles_mapper(E_A, E_B, matchRefPanel=False)[source]

Matches alleles between GWAS and Mapper loaded data (SNPs with non matching alleles are removed)

Parameters
  • E_A (str) – Identifier of GWAS

  • E_B (str) – Identifier of MAP

  • matchRefPanel (bool) – Match also with reference panel alleles

plot_genesnps(G, E_A, E_B, rank=False, zscore=False, show_correlation=False, mark_window=False, MAF=None, tickspacing=10, pcolor='limegreen', ncolor='darkviolet', corrcmap=None)[source]

Plots the SNP p-values for a list of genes and the genotypic SNP-SNP correlation matrix

Parameters
  • G (list) – List of gene symbols

  • show_correlation (bool) – Plot the corresponding SNP-SNP correlation matrix

  • mark_window (bool) – Mark the gene transcription start and end positions

  • MAF (float) – MAF filter (None for value set in class)

  • tickspacing (int) – Spacing of ticks

  • pcolor (color) – Color for positive SNP associations

  • ncolor (color) – Color for negative SNP associations

  • corrcmap (cmap) – Colormap to use for correlation plot (None for default)

save_scores(file)[source]

Save computed gene scores

Parameters

file (string) – Filename to store data

score(gene, E_A=None, E_B=None, threshold=1, parallel=1, method=None, mode=None, nobar=False, reqacc=None, autorescore=False, pcorr=0)[source]

Performs cross scoring for a given list of gene symbols

Parameters
  • gene (list) – gene symbols to score.

  • E_A (str) – First GWAS to use

  • E_B (str) – Second GWAS to use

  • parallel (int) – # of cores to use

  • method (string) – Not used

  • mode (string) – Not used

  • reqacc (float) – Not used

  • threshold (bool) – Threshold p-value to reqacc

  • nobar (bool) – Do not show progress bar

  • autorescore (bool) – Automatically try to re-score failed genes

  • pcorr (float) – Sample overlap correction factor

score_all(E_A=None, E_B=None, threshold=1, parallel=1, pcorr=0, method=None, mode=None, nobar=False, reqacc=None)[source]

Performs cross scoring for all gene symbols

Parameters
  • E_A (str) – First GWAS to use

  • E_B (str) – Second GWAS to use

  • parallel (int) – # of cores to use

  • pcorr (float) – Sample overlap correction factor

  • method (string) – Not used

  • mode (string) – Not used

  • reqacc (float) – Not used

  • nobar (bool) – Do not show progress bar

score_chr(E_A, E_B, chrs=['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22'], threshold=1, parallel=1, pcorr=0, nobar=False)[source]

Performs cross scoring for gene symbols on given chromosomes

Parameters
  • E_A (str) – First GWAS to use

  • E_B (str) – Second GWAS to use

  • chrs (list) – Chromosomes to score.

  • parallel (int) – # of cores to use

  • pcorr (float) – Sample overlap correction factor

  • nobar (bool) – Do not show progress bar

score_map(E_B, parallel=1, nobar=False, pcorr=0)[source]

Performs cross scoring for gene symbols given by mapper

Parameters
  • parallel (int) – # of cores to use

  • E_B (string) – Identifier of loaded MAP

class PascalX.xscorer.rsum(window=50000, varcutoff=0.99, MAF=0.05, leftTail=False, gpu=False)[source]

This class implements the ratio cross scorer based on SNP coherence/variance over gene windows.

get_topscores(N=10)[source]

Prints and returns the top gene scores

Parameters

N (int) – # to show

Returns

Ordered list of top scores

Return type

list

jointlyRank(E_A, E_B)[source]

Jointly QQ normalizes the p-values of two GWAS

Parameters
  • E_A (str) – Identifier of first GWAS

  • E_B (str) – Identifier of second GWAS

jointlyRank_mapper(E_A, E_B, invert=False)[source]

Jointly QQ normalizes the p-values of GWAS and Mapper

Parameters
  • E_A (str) – Identifier of first GWAS

  • E_B (str) – Identifier of MAP

load_GWAS(file, rscol=0, pcol=1, bcol=2, a1col=None, a2col=None, idcol=None, name='GWAS', delimiter=None, NAid='n/a', header=False, threshold=1, mincutoff=0.0, rank=False, SNPonly=False, log10p=False)[source]

Load GWAS summary statistics p-values and betas

Parameters
  • file (string) – File containing the GWAS summary statistics data. Either as textfile or gzip compressed with ending .gz

  • rscol (int) – Column of SNP ids

  • pcol (int) – Column of p-values

  • bcol (int) – Column of betas

  • a1col (int) – Column of alternate allele (None for ignoring alleles)

  • a2col (int) – Column of reference allele (None for ignoring alleles)

  • idcol – Column of identifiers, if several different GWAS in one file

  • name – Identifier code for GWAS (needs to be unique)

  • delimiter (String) – Split character

  • header (bool) – Header present

  • NAid (String) – Code for not available (rows are ignored)

  • threshold (float) – Only load data with p-value < threshold

  • SNPonly (bool) – Load only SNPs (only if a1col and a2col is specified)

  • log10p (bool) – p-values are -log10 transformed

Note

The loaded GWAS data is shared between different xscorer instances!

load_genome(file, ccol=1, cid=0, csymb=5, cstx=2, cetx=3, cs=4, cb=None, chrStart=0, splitchr='\t', NAgeneid='n/a', useNAgenes=False, header=False)[source]

Imports gene annotation from text file

Parameters
  • file (text) – File to load

  • ccol (int) – Column containing chromosome number

  • cid (int) – Column containing gene id

  • csymb (int) – Column containing gene symbol

  • cstx (int) – Column containing transcription start

  • cetx (int) – Column containing transcription end

  • cs (int) – Column containing strand

  • chrStart (int) – Number of leading characters to skip in ccol

  • splitchr (string) – Character used to separate columns in text file

  • NAgeneid (string) – Identifier for not available gene id

  • useNAgenes (bool) – Import genes without gene id

  • header (bool) – First line is header

Internal:
  • _GENEID (dict) - Contains the raw data with gene id (cid) as key

  • _GENESYMB (dict) - Mapping from gene symbols (csymb) to gene ids (cid)

  • _GENEIDtoSYMB (dict) - Mapping from gene ids (cid) to gene symbols (csymb)

  • _CHR (dict) - Mapping from chromosomes to list of gene symbols

  • _BAND (dict) - Mapping from band to list of gene symbols

  • _SKIPPED (dict) - Genes (cid) which could not be imported

Note

A unique gene id is automatically generated for n/a gene ids if useNAgenes=True.

Note

Identical gene ids occurring in more than one row are merged into a single gene if they lie on the same chromosome and the positional gap is < 1 Mb. The strand is taken from the longer segment.

load_mapping(file, name='MAP', gcol=0, rcol=1, wcol=None, bcol=None, delimiter='\t', a1col=None, a2col=None, pfilter=1, header=False, joint=True)[source]

Loads a SNP to gene mapping

Parameters
  • file (string) – File to load

  • name (string) – Identifier code for mapping (needs to be unique)

  • gcol (int) – Column with gene id

  • rcol (int) – Column with SNP id

  • wcol (int) – Column with weight

  • bcol (int) – Column with additional weight

  • delimiter (string) – Character used to separate columns

  • a1col (int) – Column of alternate allele (None for ignoring alleles)

  • a2col (int) – Column of reference allele (None for ignoring alleles)

  • pfilter (float) – Only include rows with wcol < pfilter

  • header (bool) – Header present

  • joint (bool) – Use mapping SNPs and gene window based SNPs

load_refpanel(filename, parallel=1, keepfile=None, qualityT=100, SNPonly=False, chrlist=None)[source]

Sets the reference panel to use

Parameters
  • filename (string) – /path/filename (without .chr#.db ending)

  • parallel (int) – Number of cores to use for parallel import of reference panel

  • keepfile – File with sample ids (one per line) to keep (only for .vcf)

  • qualityT – Quality threshold for variant to keep (only for .vcf)

  • SNPonly – Import only SNPs (only for .vcf)

  • chrlist (list) – List of chromosomes to import. (None to import 1-22)

Note

One file per chromosome with ending .chr#.db is required (#: 1-22). If the imported reference panel is not present, PascalX automatically tries to import it from .chr#.tped.gz or .chr#.vcf.gz files.
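
Example: the file naming convention described in the note can be sketched as follows. The helper names and the order in which the source files are tried are assumptions made for illustration. PascalX resolves these files internally.

```python
from pathlib import Path


def expected_refpanel_files(prefix, chrlist=None):
    """Map each chromosome to the .chr#.db file expected for `prefix`."""
    # None stands for chromosomes 1-22, as in the chrlist parameter above.
    chromosomes = chrlist if chrlist is not None else range(1, 23)
    return {c: Path(f"{prefix}.chr{c}.db") for c in chromosomes}


def find_import_source(prefix, chromosome):
    """Return the first existing raw source file for a chromosome, if any."""
    # Assumption: .tped.gz is tried before .vcf.gz.
    for ending in (".tped.gz", ".vcf.gz"):
        candidate = Path(f"{prefix}.chr{chromosome}{ending}")
        if candidate.exists():
            return candidate
    return None


files = expected_refpanel_files("/data/refpanel/EUR")  # hypothetical path
missing = [c for c, path in files.items() if not path.exists()]
```

The list missing then holds the chromosomes for which an import from a .tped.gz or .vcf.gz file would be needed.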

load_scores(file, gcol=0, pcol=1, header=False)[source]

Load computed gene scores

Parameters
  • file (string) – Filename of data to load

  • gcol (int) – Column with gene symbol

  • pcol (int) – Column with p-value

  • header (bool) – File contains a header (True|False)
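
Example: a minimal sketch of the file layout implied by the parameters above. The function read_scores is hypothetical and not part of PascalX. The tab delimiter is an assumption, since load_scores does not expose one.

```python
import csv
import io


def read_scores(handle, gcol=0, pcol=1, header=False, delimiter="\t"):
    """Read gene symbol / p-value pairs from a delimited text file."""
    reader = csv.reader(handle, delimiter=delimiter)
    if header:
        next(reader, None)  # skip the header line
    scores = {}
    for row in reader:
        if not row:
            continue  # ignore empty lines
        scores[row[gcol]] = float(row[pcol])
    return scores


# In-memory stand-in for a scores file with a header line.
example = io.StringIO("gene\tpvalue\nGENE_A\t0.0123\nGENE_B\t4.5e-06\n")
scores = read_scores(example, header=True)
```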

matchAlleles(E_A, E_B, matchRefPanel=False)[source]

Matches alleles between two GWAS (SNPs with non matching alleles are removed)

Parameters
  • E_A (str) – Identifier of first GWAS

  • E_B (str) – Identifier of second GWAS

  • matchRefPanel (bool) – Match also alleles to reference panel

Note

Currently, matchRefPanel=True requires enough memory to hold all reference panel indices at once.
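
Example: the idea of allele matching can be sketched with plain dictionaries. This is an illustration only. The data layout (SNP id mapped to an allele pair) and the function name are assumptions, and whether PascalX treats a swapped allele pair as a match is not stated here, so the sketch reports swapped pairs separately instead of deciding.

```python
def match_alleles(gwas_a, gwas_b):
    """Compare allele pairs of SNPs present in both data sets.

    gwas_a, gwas_b: dict mapping SNP id -> (allele1, allele2).
    Returns a dict SNP id -> "same" or "swapped"; SNPs whose alleles
    do not match in either orientation are left out.
    """
    kept = {}
    for snp in gwas_a.keys() & gwas_b.keys():
        a, b = gwas_a[snp], gwas_b[snp]
        if a == b:
            kept[snp] = "same"
        elif a == (b[1], b[0]):
            kept[snp] = "swapped"
        # Any other combination is a mismatch and is dropped.
    return kept


gwas_a = {"rs1": ("A", "G"), "rs2": ("C", "T"), "rs3": ("A", "C")}
gwas_b = {"rs1": ("A", "G"), "rs2": ("T", "C"), "rs3": ("G", "T")}
matched = match_alleles(gwas_a, gwas_b)
```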

matchAlleles_mapper(E_A, E_B, matchRefPanel=False)[source]

Matches alleles between GWAS and Mapper loaded data (SNPs with non matching alleles are removed)

Parameters
  • E_A (str) – Identifier of GWAS

  • E_B (str) – Identifier of MAP

  • matchRefPanel (bool) – Match also with reference panel alleles

plot_genesnps(G, E_A, E_B, rank=False, zscore=False, show_correlation=False, mark_window=False, MAF=None, tickspacing=10, pcolor='limegreen', ncolor='darkviolet', corrcmap=None)[source]

Plots the SNP p-values for a list of genes and the genotypic SNP-SNP correlation matrix

Parameters
  • G (list) – List of gene symbols

  • show_correlation (bool) – Plot the corresponding SNP-SNP correlation matrix

  • mark_window (bool) – Mark the gene transcription start and end positions

  • MAF (float) – MAF filter (None for value set in class)

  • tickspacing (int) – Spacing of ticks

  • pcolor (color) – Color for positive SNP associations

  • ncolor (color) – Color for negative SNP associations

  • corrcmap (cmap) – Colormap to use for correlation plot (None for default)

save_scores(file)[source]

Save computed gene scores

Parameters

file (string) – Filename to store data

score(gene, E_A=None, E_B=None, threshold=1, parallel=1, method=None, mode=None, nobar=False, reqacc=None, autorescore=False, pcorr=0)[source]

Performs cross scoring for a given list of gene symbols

Parameters
  • gene (list) – gene symbols to score.

  • E_A (str) – First GWAS to use

  • E_B (str) – Second GWAS to use

  • parallel (int) – # of cores to use

  • unloadRef (bool) – Keep only reference data for one chromosome in memory (True, False) per core

  • method (string) – Not used

  • mode (string) – Not used

  • reqacc (float) – Not used

  • intlimit (int) – Not used

  • threshold (bool) – Threshold p-value to reqacc

  • nobar (bool) – Do not show progress bar

  • autorescore (bool) – Automatically try to re-score failed genes

  • pcorr (float) – Sample overlap correction factor

score_all(E_A=None, E_B=None, threshold=1, parallel=1, pcorr=0, method=None, mode=None, nobar=False, reqacc=None)[source]

Performs cross scoring for all gene symbols

Parameters
  • E_A (str) – First GWAS to use

  • E_B (str) – Second GWAS to use

  • parallel (int) – # of cores to use

  • pcorr (float) – Sample overlap correction factor

  • method (string) – Not used

  • mode (string) – Not used

  • reqacc (float) – Not used

  • nobar (bool) – Do not show progress bar

score_chr(E_A, E_B, chrs=['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22'], threshold=1, parallel=1, pcorr=0, nobar=False)[source]

Performs cross scoring for gene symbols on given chromosomes

Parameters
  • E_A (str) – First GWAS to use

  • E_B (str) – Second GWAS to use

  • chrs (list) – Chromosomes to score.

  • parallel (int) – # of cores to use

  • pcorr (float) – Sample overlap correction factor

  • nobar (bool) – Do not show progress bar

score_map(E_B, parallel=1, nobar=False, pcorr=0)[source]

Performs cross scoring for gene symbols given by mapper

Parameters
  • parallel (int) – # of cores to use

  • E_B (string) – Identifier of loaded MAP


Genexprscorer

class PascalX.genexpr.genexpr[source]
get_GTEX_expr(filename)[source]

Downloads GTEx v8 data and imports it.

Parameters

filename (string) – Filename to store downloaded GTEX data

Note

The import may take several hours.

load_expr(filename)[source]

Loads the GTEx data for usage

Parameters

filename (string) – File to load

load_genome(file, ccol=1, cid=0, csymb=5, cstx=2, cetx=3, cs=4, chrStart=0, splitchr='\t', NAgeneid='n/a', useNAgenes=False, header=False)[source]

Imports gene annotation from text file

Parameters
  • file (text) – File to load

  • ccol (int) – Column containing chromosome number

  • cid (int) – Column containing gene id

  • csymb (int) – Column containing gene symbol

  • cstx (int) – Column containing transcription start

  • cetx (int) – Column containing transcription end

  • cs (int) – Column containing strand

  • chrStart (int) – Number of leading characters to skip in ccol

  • splitchr (string) – Character used to separate columns in text file

  • NAgeneid (string) – Identifier for not available gene id

  • useNAgenes (bool) – Import genes without gene id

  • header (bool) – First line is header

Internal:
  • _GENEID (dict) - Contains the raw data with gene id (cid) as key

  • _GENESYMB (dict) - Mapping from gene symbols (csymb) to gene ids (cid)

  • _GENEIDtoSYMB (dict) - Mapping from gene ids (cid) to gene symbols (csymb)

  • _CHR (dict) - Mapping from chromosomes to list of gene symbols

  • _SKIPPED (dict) - Genes (cid) which could not be imported

Note

A unique gene id is automatically generated for n/a gene ids if useNAgenes=True.

Note

Identical gene ids occurring in more than one row are merged into a single gene if they are on the same chromosome and the positional gap is < 1Mb. The strand is taken from the longer segment.

chi2rank(pathways, fuse=True)[source]

Calculates tissue enrichment scores for set of genes using GTEx data

Parameters
  • pathways (list) – List of pathways to score [ [‘name’,[‘gene1’,’gene2’,…]],…]

  • fuse (bool) – Fuse nearby genes

plot_genexpr(genes, tzscore=False, cbar_pos=(0.0, 0.0, 0.01, 0.5))[source]

Plots gene expression matrix for list of genes

Parameters
  • genes (list) – list of genes

  • tzscore (bool) – Compute z-scores over tissues per gene (True|False)

  • cbar_pos (list) – Position coordinates of color bar
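
Example: a z-score over tissues per gene, as selected by tzscore, can be sketched as below. The input layout and the use of the population standard deviation are assumptions. PascalX may normalize differently.

```python
from statistics import mean, pstdev


def zscore_over_tissues(expr):
    """Standardize each gene's expression values across tissues.

    expr: dict mapping gene -> list of expression values, one per tissue.
    """
    result = {}
    for gene, values in expr.items():
        mu = mean(values)
        sigma = pstdev(values)  # assumption: population standard deviation
        # A gene with constant expression gets z-scores of zero.
        result[gene] = [(v - mu) / sigma if sigma > 0 else 0.0 for v in values]
    return result


z = zscore_over_tissues({"GENE_A": [1.0, 2.0, 3.0], "GENE_B": [5.0, 5.0, 5.0]})
```

After this transformation each gene has mean zero across tissues, so rows of the expression matrix become comparable regardless of their absolute expression level.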


SNP database

class PascalX.snpdb.db[source]

Class for handling storage of the raw genotype data. The data is indexed and stored for each chromosome individually as zlib compressed pickle. The indexing allows fast random access via SNP ids or positions.

open(filename)[source]

Opens the storage file. A new file is created if it does not exist.

Parameters

filename (string) – Name to use for the storage file

insert(data)[source]

Stores set of rows into the file storage

Parameters

data (dict) – First storage key is the key in the dictionary. Second storage key is the first element of the inner list.

Warning

Once all insert calls are done, the close function has to be called once to make the index persistent.

get(pos)[source]

Returns all stored data for a set of SNPs indexed via positions

Parameters

pos (list) – Positions of SNPs to retrieve

getSNPatPos(pos)[source]

Returns SNP id at position

Parameters

pos (list) – Positions of SNPs to retrieve

getPosatSNPs(snpids)[source]

Returns the position corresponding to a SNP id

Warning

Inefficient.

getSNPs(snps)[source]

Returns all stored data for a set of SNPs indexed via SNP ids

Parameters

snps (list) – Ids of SNPs to retrieve

getSNPKeys()[source]

Returns the SNP ids in storage

getKeys()[source]

Returns SNP positions in storage

getSortedKeys()[source]

Returns a sorted list of SNP positions in storage

close()[source]

Closes open storage file.

Warning

After all inserts are done this function has to be called once to re-generate the index and close the storage file.
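
Example: the storage idea described for this class (zlib compressed pickles with an index for random access by position or SNP id) can be illustrated with a small stand-alone sketch. The class TinyStore, its file layout, and the .idx side file are hypothetical. They are not the format PascalX uses on disk.

```python
import os
import pickle
import tempfile
import zlib


class TinyStore:
    """Append-only store of zlib-compressed pickles with two indices."""

    def __init__(self, filename):
        self._filename = filename
        self._pos_index = {}  # position -> (offset, length) of the record
        self._snp_index = {}  # SNP id -> position
        self._file = open(filename, "w+b")

    def insert(self, data):
        # data: dict position -> list whose first element is the SNP id
        for pos, row in data.items():
            blob = zlib.compress(pickle.dumps(row))
            self._file.seek(0, os.SEEK_END)
            self._pos_index[pos] = (self._file.tell(), len(blob))
            self._snp_index[row[0]] = pos
            self._file.write(blob)

    def _read(self, pos):
        offset, length = self._pos_index[pos]
        self._file.seek(offset)
        return pickle.loads(zlib.decompress(self._file.read(length)))

    def get(self, positions):
        return [self._read(pos) for pos in positions]

    def get_snps(self, snp_ids):
        return [self._read(self._snp_index[s]) for s in snp_ids]

    def close(self):
        # The index only becomes persistent here, hence the warning above.
        self._file.close()
        with open(self._filename + ".idx", "wb") as fh:
            pickle.dump((self._pos_index, self._snp_index), fh)


path = os.path.join(tempfile.mkdtemp(), "demo.db")
store = TinyStore(path)
store.insert({12345: ["rs1", "A", "G"], 67890: ["rs2", "C", "T"]})
by_pos = store.get([67890])
by_id = store.get_snps(["rs1"])
store.close()
```

Because each record is compressed on its own and located through the index, a single SNP can be read without decompressing the rest of the file.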


Reference panel

class PascalX.refpanel.refpanel[source]
load_pos_reference(cr, keep_idx=None)[source]

Returns a snpdb object for a chromosome and a sorted list of SNP positions on the chromosome

Parameters

cr (int) – Chromosome number
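
Example: a sorted list of SNP positions allows a gene window to be selected by binary search. The sketch below uses only the standard library. The helper name and the example positions are made up for illustration.

```python
from bisect import bisect_left, bisect_right


def positions_in_window(sorted_positions, start, end):
    """Return all positions p with start <= p <= end."""
    lo = bisect_left(sorted_positions, start)   # first index with p >= start
    hi = bisect_right(sorted_positions, end)    # first index with p > end
    return sorted_positions[lo:hi]


positions = [1050, 2300, 5100, 7800, 9900]
window = positions_in_window(positions, 2000, 8000)
```

The selected positions could then be passed to the get function of the snpdb object to retrieve the stored genotype data.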

load_snp_reference(cr, keep_idx=None)[source]

Returns a snpdb object for a chromosome

Parameters

cr (int) – Chromosome number

set_refpanel(filename, parallel=1, keepfile=None, qualityT=100, SNPonly=False, chrlist=None, sourcefilename=None, regEx=None, nobar=True)[source]

Sets the reference panel to use

Parameters
  • filename (string) – /path/filename (without .chr#.db ending)

  • parallel (int) – Number of cores to use for parallel import of reference panel

  • keepfile (string) – [only for .vcf] File with sample ids (one per line) to keep. None to keep all.

  • qualityT (int) – [only for .vcf] Quality threshold for variant to keep (None to ignore)

  • SNPonly (bool) – [only for .vcf] Load only SNPs

  • chrlist (list) – List of chromosomes to import. (None to import 1-22)

  • sourcefilename (string) – /path/filename (without .chr#. ending) of .tped | .vcf files. None to use same as filename

  • regEx (string) – Regular expression to filter sample ids. First capture group is kept. [only for .vcf]

  • nobar (bool) – Do not show progress bar (the bar updates only when a chromosome has finished)

Note

One file per chromosome with ending .chr#.db is required (#: 1-22). If the imported reference panel is not present, PascalX automatically tries to import it from .chr#.tped.gz or .chr#.vcf.gz files.

Note

Alleles (under .vcf import) are stored internally in the order [ALT,REF].
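
Example: the effect of the regEx parameter (the first capture group is kept) can be sketched with the standard re module. The sample ids and the pattern are made up. That ids which do not match are dropped, and that the pattern is applied with search, are assumptions of this sketch.

```python
import re


def filter_sample_ids(sample_ids, pattern):
    """Keep the first capture group of every sample id matching pattern."""
    regex = re.compile(pattern)
    kept = []
    for sample_id in sample_ids:
        match = regex.search(sample_id)
        if match:
            kept.append(match.group(1))
    return kept


ids = ["0_HG00096", "0_HG00097", "control"]
kept = filter_sample_ids(ids, r"^0_(HG\d+)$")
```

Here the prefix 0_ is stripped from the two matching ids, and the id without a match is not kept.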

getSNPtoChrMap()[source]

Returns a dictionary mapping SNP id to corresponding chromosome number

getChrSNPs(cr)[source]

Returns SNP ids for a chromosome

Parameters

cr (int) – Chromosome number


Mapper

class PascalX.mapper.mapper(genome=None)[source]
load_mapping(file, gcol=0, rcol=1, wcol=None, a1col=None, a2col=None, bcol=None, pcol=None, delimiter='\t', pfilter=1, header=False, symbol=False)[source]

Loads a SNP to gene mapping

Parameters
  • file (string) – File to load

  • gcol (int) – Column with gene id

  • rcol (int) – Column with SNP id

  • wcol (int) – Column with weight (None for none)

  • a1col (int) – Column of alternate allele (None for ignoring alleles)

  • a2col (int) – Column of reference allele (None for ignoring alleles)

  • bcol (int) – Column with additional weight (None for none)

  • pcol (int) – Column with p-value (if None, the p-value is taken from the .load_GWAS data)

  • delimiter (string) – Character used to separate columns

  • header (bool) – Header present

  • pfilter (float) – Only include rows with pcol < pfilter

  • symbol (bool) – Gene ids are gene symbols (requires genome to be set on init)
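
Example: the column parameters above describe a delimited text file with one SNP to gene assignment per row. A minimal sketch of reading such a file, including the pfilter rule, is given below. The function read_mapping and the example identifiers are hypothetical and only illustrate the file layout.

```python
import csv
import io


def read_mapping(handle, gcol=0, rcol=1, pcol=None, delimiter="\t",
                 pfilter=1, header=False):
    """Collect SNP ids per gene id from a delimited mapping file."""
    reader = csv.reader(handle, delimiter=delimiter)
    if header:
        next(reader, None)  # skip the header line
    mapping = {}
    for row in reader:
        if not row:
            continue  # ignore empty lines
        # Keep only rows with a p-value below pfilter (if pcol is given).
        if pcol is not None and not float(row[pcol]) < pfilter:
            continue
        mapping.setdefault(row[gcol], []).append(row[rcol])
    return mapping


# In-memory stand-in for a mapping file with columns gene, SNP, p-value.
example = io.StringIO(
    "gene\tsnp\tp\n"
    "GENE_A\trs123\t0.001\n"
    "GENE_A\trs456\t0.2\n"
    "GENE_B\trs789\t0.03\n"
)
mapping = read_mapping(example, pcol=2, pfilter=0.05, header=True)
```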