Module kitchen.recipes

Functions for processing and manipulating .h5ad objects and automated processing of scRNA-seq data

Functions

def NMF_decoupler_ORA(adata, net, top_n_genes=None, max_FDRpval=0.05, out_dir='./', save_prefix='', save_output=False, get_ora_df_kwargs={}, decoupler_dotplot_facet_kwargs={})

Quickly process cNMF loadings through decoupler ORA and create plot for initial look at pathways

decoupler tools used: * "ORA" for biological pathways

Parameters

adata : anndata.AnnData
Object containing cNMF results in adata.uns["cnmf_markers"]
net : pd.DataFrame
Network dataframe required to run ORA. e.g. msigdb where msigdb is a pd.DataFrame from dc.get_resource.
top_n_genes : int or None, optional (default=None)
If None, use entire adata.uns["cnmf_markers"] dataframe. If an integer, select first top_n_genes rows of adata.uns["cnmf_markers"].
max_FDRpval : float or None, optional (default=0.05)
FDR p-value cutoff for plotting significant ORA pathways. If None, don't filter and show top 20 terms by FDR p-value.
out_dir : str, optional (default="./")
Path to directory to save plots to
save_prefix : str, optional (default="")
String to prepend to output plots to make names unique
save_output : bool, optional (default=False)
If True, save output dataframe to out_dir/save_prefix_NMF_ORA.csv
get_ora_df_kwargs : dict, optional (default={})
Keyword arguments to pass to dc.get_ora_df. e.g. source, target, n_background.
decoupler_dotplot_facet_kwargs : dict, optional (default={})
Keyword arguments to pass to decoupler_dotplot_facet. e.g. top_n, cmap, dpi

Returns

enr_pvals : pd.DataFrame
ORA output as dataframe

Saves decoupler ORA dotplots to out_dir/.
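
For intuition about what an over-representation analysis tests, the sketch below computes a one-sided p-value as a hypergeometric tail probability, which is equivalent to a one-sided Fisher exact test. It illustrates the general technique only and is not decoupler's implementation. The function name ora_pvalue and its arguments are hypothetical; n_background is named after the keyword listed above.

```python
from math import comb


def ora_pvalue(n_overlap, n_query, n_set, n_background):
    """P(overlap >= n_overlap) when n_query genes are drawn at random
    from n_background genes, n_set of which belong to the gene set."""
    max_overlap = min(n_query, n_set)
    # count the draws that give at least n_overlap genes from the set
    favourable = sum(
        comb(n_set, k) * comb(n_background - n_set, n_query - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return favourable / comb(n_background, n_query)


# 50 top genes, 8 of them in a 200-gene pathway, 20000 background genes
p = ora_pvalue(n_overlap=8, n_query=50, n_set=200, n_background=20000)
print(f"p = {p:.3g}")
```

A small p-value means the top genes of a factor overlap the pathway more than expected by chance; the FDR cutoff max_FDRpval is then applied to the corrected values.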

def PCA_decoupler_GSEA(adata, net, n_pcs=None, max_FDRpval=0.05, out_dir='./', save_prefix='', save_output=False, plot=False, get_gsea_df_kwargs={}, decoupler_dotplot_facet_kwargs={})

Quickly process PCA loadings through decoupler GSEA and create plot for initial look at pathways

decoupler tools used: * "GSEA" for biological pathways

Parameters

adata : anndata.AnnData
Object containing PCA results in adata.varm["PCs"]
net : pd.DataFrame
Network dataframe required to run GSEA. e.g. msigdb where msigdb is a pd.DataFrame from dc.get_resource.
max_FDRpval : float or None, optional (default=0.05)
FDR p-value cutoff for plotting significant GSEA pathways. If None, don't filter and show top 20 terms by FDR p-value.
out_dir : str, optional (default="./")
Path to directory to save plots to
save_prefix : str, optional (default="")
String to prepend to output plots to make names unique
save_output : bool, optional (default=False)
If True, save output dataframe to out_dir/save_prefix_PCA_GSEA.csv
get_gsea_df_kwargs : dict, optional (default={})
Keyword arguments to pass to dc.get_gsea_df. e.g. source, target, times, min_n.
decoupler_dotplot_facet_kwargs : dict, optional (default={})
Keyword arguments to pass to decoupler_dotplot_facet. e.g. top_n, cmap, dpi

Returns

enr_pvals : pd.DataFrame
GSEA output as dataframe

Saves decoupler GSEA dotplots to out_dir/.

Examples

    nets = fetch_decoupler_resources(
        resources=["msigdb", "progeny", "collectri", "liana"],
        genome="human",
    )
    enr_pvals = PCA_decoupler_GSEA(
        adata=a,
        n_pcs=3,
        net=nets["msigdb"],
        max_FDRpval=0.05,
        out_dir="plots/",
        save_prefix="",
        save_output=False,
        get_gsea_df_kwargs={
            "source": "geneset",
            "target": "genesymbol",
            "times": 500,
            "min_n": 5,
        },
    )
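
As background on the statistic behind GSEA, the following sketch computes a classic weighted running-sum enrichment score for one gene set over one ranked gene list (for example, the loadings of a single PC). It is a generic illustration under that assumption, not the code decoupler runs, and it omits the permutation step that produces p-values. The function name enrichment_score is hypothetical.

```python
def enrichment_score(ranked_genes, stats, gene_set):
    """Running-sum enrichment score.

    ranked_genes must already be sorted by descending stats
    (e.g. PC loadings); stats holds the matching values.
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_miss = len(ranked_genes) - sum(hits)
    hit_weight = sum(abs(s) for s, h in zip(stats, hits) if h)
    if n_miss == 0 or hit_weight == 0:
        return 0.0  # score is undefined without both hits and misses

    running, best = 0.0, 0.0
    for s, is_hit in zip(stats, hits):
        # step up (weighted) on a hit, step down (uniform) on a miss
        running += abs(s) / hit_weight if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


genes = ["A", "B", "C", "D"]
loadings = [4.0, 3.0, -1.0, -2.0]
print(enrichment_score(genes, loadings, {"A", "B"}))  # set at the top of the ranking
print(enrichment_score(genes, loadings, {"C", "D"}))  # set at the bottom
```

A positive score indicates that the gene set is concentrated at the top of the ranking, a negative score that it sits at the bottom.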

def calculate_cell_proportions(obs_df, celltype_column, groupby, celltype_subset=None)

Calculates the proportion of each cell type within each group

Parameters

obs_df : pd.DataFrame
Dataframe containing single cells as rows and metadata columns
celltype_column : str
Column of obs_df containing cell type labels for proportion calculation
groupby : str
Column of obs_df to group by for proportions (e.g. sample or patient)
celltype_subset : list of str, optional (default=None)
List of cell types in obs_df.celltype_column to keep in output. If None, return proportions of all celltypes

Returns

props_df : pd.DataFrame
Dataframe with values of obs_df.groupby as rows and values of obs_df.celltype_column (or celltype_subset) as columns
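
The shape of the output can be reproduced with plain pandas. The sketch below is an approximation of what the function returns, not its source code; it assumes proportions are computed over all cells in each group before any cell type subset is applied, and the toy column names "patient" and "celltype" are made up for illustration.

```python
import pandas as pd

obs_df = pd.DataFrame(
    {
        "patient": ["P1", "P1", "P1", "P1", "P2", "P2"],
        "celltype": ["T", "T", "B", "Myeloid", "T", "B"],
    }
)

# rows = groups, columns = cell types, each row sums to 1
props_df = pd.crosstab(obs_df["patient"], obs_df["celltype"], normalize="index")

# optional subset of cell types (applied after normalisation in this sketch)
celltype_subset = ["T", "B"]
props_subset = props_df[celltype_subset]
print(props_subset)
```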

def cc_score(adata, layer=None, seed=18, verbose=True)

Calculates cell cycle scores and implied phase for each observation

Parameters

adata : anndata.AnnData
object containing transformed and normalized (arcsinh or log1p) counts in 'layer'.
layer : str, optional (default=None)
key from adata.layers to use for cc phase calculation. Default None to use .X
seed : int, optional (default=18)
random state for PCA, neighbors graph and clustering
verbose : bool, optional (default=True)
print updates to console

Returns

adata is edited in place to add 'G2M_score', 'S_score', and 'phase' to .obs
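
The 'phase' label is typically derived from the two scores. The sketch below shows the rule used by scanpy's sc.tl.score_genes_cell_cycle; whether cc_score applies exactly this rule is an assumption, and the helper name assign_phase is hypothetical.

```python
def assign_phase(s_score, g2m_score):
    """Implied cell cycle phase from S and G2M scores."""
    # both scores negative: the cell expresses neither programme
    if s_score < 0 and g2m_score < 0:
        return "G1"
    # otherwise the larger score wins; ties fall back to S
    return "G2M" if g2m_score > s_score else "S"


for s, g2m in [(-0.2, -0.1), (0.3, 0.1), (0.1, 0.4)]:
    print(s, g2m, assign_phase(s, g2m))
```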
 
def cellranger2(adata, expected=1500, upper_quant=0.99, lower_prop=0.1, label='CellRanger_2', verbose=True)

Labels cells using "knee point" method from CellRanger 2.1

Parameters

adata : anndata.AnnData
object containing unfiltered counts
expected : int, optional (default=1500)
estimated number of real cells expected in dataset
upper_quant : float, optional (default=0.99)
upper quantile of real cells to test
lower_prop : float, optional (default=0.1)
proportion of the total counts at the upper_quant quantile of expected cells to use as the total counts threshold
label : str, optional (default="CellRanger_2")
how to name .obs column containing output
verbose : bool, optional (default=True)
print updates to console

Returns

adata edited in place to add .obs[label] binary label
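
How the three numeric parameters interact can be sketched in plain Python. This follows the commonly described Cell Ranger 2 heuristic (rank barcodes by total counts, read off a baseline near the top of the expected cells, keep everything above a fraction of it) and is an assumption about this function's internals, not a copy of them. The helper name knee_point_labels is hypothetical and the input is assumed to be non-empty.

```python
def knee_point_labels(total_counts, expected=1500, upper_quant=0.99, lower_prop=0.1):
    """Return (labels, threshold) for a list of per-barcode total counts."""
    ranked = sorted(total_counts, reverse=True)
    # rank of the barcode at the upper quantile of the expected cells
    idx = int(round(expected * (1 - upper_quant)))
    idx = min(idx, len(ranked) - 1)
    baseline = ranked[idx]
    # barcodes within lower_prop of the baseline are called cells
    threshold = baseline * lower_prop
    return [c >= threshold for c in total_counts], threshold


# 100 real cells with ~1000 counts each and 900 near-empty droplets
counts = [1000] * 100 + [5] * 900
labels, threshold = knee_point_labels(counts, expected=100)
print(threshold, sum(labels))
```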
 
def cellranger3(adata, init_counts=15000, min_umi_frac_of_median=0.01, min_umis_nonambient=500, max_adj_pvalue=0.01)

Labels cells using "emptydrops" method from CellRanger 3.0

Parameters

adata : anndata.AnnData
object containing unfiltered counts
init_counts : int, optional (default=15000)
initial total counts threshold for calling cells
min_umi_frac_of_median : float, optional (default=0.01)
minimum total counts for testing barcodes as fraction of median counts for initially labeled cells
min_umis_nonambient : float, optional (default=500)
minimum total counts for testing barcodes
max_adj_pvalue : float, optional (default=0.01)
maximum p-value for cell calling after B-H correction

Returns

adata edited in place to add .obs["CellRanger_3"] binary label and .obs["CellRanger_3_ll"] log-likelihoods for tested barcodes
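
The B-H (Benjamini-Hochberg) correction referred to by max_adj_pvalue can be written in a few lines. The sketch below is a generic implementation of the procedure, offered for reference; it is not taken from this module, and the function name benjamini_hochberg is hypothetical.

```python
def benjamini_hochberg(pvals):
    """Return B-H adjusted p-values in the same order as the input."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # walk from the largest p-value down so adjusted values stay monotone
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


pvals = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(pvals)
max_adj_pvalue = 0.03
is_cell = [q <= max_adj_pvalue for q in adj]
print(adj, is_cell)
```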
 
def dim_reduce(adata, layer=None, use_rep=None, clust_resolution=1.0, paga=True, seed=18, verbose=True)

Reduces dimensions of single-cell dataset using standard methods

Parameters

adata : anndata.AnnData
object containing preprocessed counts matrix
layer : str, optional (default=None)
layer to use; default None for .X
use_rep : str, optional (default=None)
.obsm key to use for neighbors graph instead of PCA; default None, generate new PCA from layer
clust_resolution : float, optional (default=1.0)
resolution as fraction on [0.0, 1.0] for leiden clustering. default 1.0
paga : bool, optional (default=True)
run PAGA to seed UMAP embedding
seed : int, optional (default=18)
random state for PCA, neighbors graph and clustering
verbose : bool, optional (default=True)
print updates to console

Returns

adata is edited in place, adding PCA, neighbors graph, PAGA, and UMAP
 
def plot_genes(adata, de_method='t-test_overestim_var', layer='log1p_norm', groupby='leiden', key_added='rank_genes_groups', ambient=False, plot_type=None, n_genes=5, dendrogram=True, cmap='Greys', figsize_scale=1.0, save_to='de.png', verbose=True, **kwargs)

Calculates and plots rank_genes_groups results

Parameters

adata : anndata.AnnData
object containing preprocessed and dimension-reduced counts matrix
de_method : str, optional (default="t-test_overestim_var")
one of "t-test", "t-test_overestim_var", "wilcoxon"
layer : str, optional (default="log1p_norm")
one of adata.layers to use for DEG analysis. Recommended to use 'raw_counts' if de_method=='wilcoxon' and 'log1p_norm' if de_method=='t-test'.
groupby : str, optional (default="leiden")
adata.obs key to group cells by
key_added : str, optional (default="rank_genes_groups")
adata.uns key to add DEG results to
ambient : bool, optional (default=False)
include ambient genes as a group in the plot/output dictionary
plot_type : str, optional (default=None)
One of "dotplot", "matrixplot", "dotmatrix", "stacked_violin", or "heatmap". If None, don't plot, just return DEGs as dictionary.
n_genes : int, optional (default=5)
number of top genes per group to show
dendrogram : bool, optional (default=True)
show dendrogram of cluster similarity
cmap : str, optional (default="Greys")
matplotlib colormap for dots
figsize_scale : float, optional (default=1.0)
scale dimensions of the figure
save_to : str, optional (default="de.png")
string to add to plot name using scanpy plot defaults
verbose : bool, optional (default=True)
print updates to console
**kwargs : optional
keyword args to add to custom_heatmap()

Returns

markers : dict
dictionary of top n_genes DEGs per group
myplot : matplotlib.Figure
custom_heatmap object if plot_type!=None

def plot_genes_cnmf(adata, plot_type='heatmap', groupby='leiden', attr='varm', keys='cnmf_spectra', indices=None, n_genes=5, dendrogram=True, figsize_scale=1.0, cmap='Greys', save_to='de_cnmf.png', **kwargs)

Calculates and plots top cNMF gene loadings

Parameters

adata : anndata.AnnData
object containing preprocessed and dimension-reduced counts matrix
plot_type : str, optional (default="heatmap")
One of "dotplot", "matrixplot", "dotmatrix", "stacked_violin", or "heatmap".
groupby : str, optional (default="leiden")
.obs key to group cells by
attr : str {"var", "obs", "uns", "varm", "obsm"}, optional (default="varm")
attribute of adata that contains the score
keys : str or list of str, optional (default="cnmf_spectra")
scores to look up an array from the attribute of adata
indices : list of int, optional (default=None)
column indices of keys for which to plot (e.g. [0,1,2] for first three keys)
n_genes : int, optional (default=5)
number of top genes per group to show
dendrogram : bool, optional (default=True)
show dendrogram of cluster similarity
figsize_scale : float, optional (default=1.0)
scale dimensions of the figure
cmap : str, optional (default="Greys")
valid color map for the plot
save_to : str, optional (default="de_cnmf.png")
string to add to plot name using scanpy plot defaults
**kwargs : optional
keyword args to add to sc.pl.matrixplot, sc.pl.dotplot, or sc.pl.heatmap

Returns

markers : dict
dictionary of top n_genes features per NMF factor
myplot : matplotlib.Figure
custom_heatmap object
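
The markers dictionary can be pictured as the n_genes highest-loading genes for each selected factor. The sketch below shows that selection on a small genes-by-factors matrix held as nested lists; it is an assumption about how the top genes are chosen, and the helper name top_genes_per_factor and the use of integer factor indices as dictionary keys are hypothetical.

```python
def top_genes_per_factor(gene_names, spectra, n_genes=5, indices=None):
    """spectra[i][j] is the loading of gene i on factor j."""
    n_factors = len(spectra[0])
    if indices is None:
        indices = range(n_factors)
    markers = {}
    for j in indices:
        # rank genes by their loading on factor j, highest first
        ranked = sorted(
            range(len(gene_names)), key=lambda i: spectra[i][j], reverse=True
        )
        markers[j] = [gene_names[i] for i in ranked[:n_genes]]
    return markers


genes = ["A", "B", "C"]
spectra = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.3]]
print(top_genes_per_factor(genes, spectra, n_genes=2))
print(top_genes_per_factor(genes, spectra, n_genes=2, indices=[1]))
```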

def scDEG_decoupler_GSEA(adata, uns_key, net, max_FDRpval=0.05, out_dir='./', save_prefix='', save_output=False, get_gsea_df_kwargs={}, decoupler_dotplot_facet_kwargs={})

Quickly process DEGs from scRNA dataset through decoupler GSEA and create plot for initial look at pathways

decoupler tools used: * "GSEA" for biological pathways

Parameters

adata : anndata.AnnData
Object containing rank_genes_groups results in adata.uns
uns_key : str
Key to adata.uns containing rank_genes_groups results
net : pd.DataFrame
Network dataframe required to run GSEA. e.g. msigdb where msigdb is a pd.DataFrame from dc.get_resource.
max_FDRpval : float, optional (default=0.05)
FDR p-value cutoff for using genes and plotting significant GSEA pathways.
out_dir : str, optional (default="./")
Path to directory to save plots to
save_prefix : str, optional (default="")
String to prepend to output plots to make names unique
save_output : bool, optional (default=False)
If True, save output dataframe to out_dir/save_prefix_NMF_GSEA.csv
get_gsea_df_kwargs : dict, optional (default={})
Keyword arguments to pass to dc.get_gsea_df. e.g. source, target, times, min_n, seed.
decoupler_dotplot_facet_kwargs : dict, optional (default={})
Keyword arguments to pass to decoupler_dotplot_facet. e.g. top_n, cmap, dpi

Returns

enr_pvals : pd.DataFrame
GSEA output as dataframe

Saves decoupler GSEA dotplots to out_dir/.

Examples

    nets = fetch_decoupler_resources(
        resources=["msigdb", "progeny", "collectri", "liana"],
        genome="human",
    )
    frame = scDEG_decoupler_GSEA(
        a,
        uns_key="rank_genes_groups_{}".format(key),
        net=nets["msigdb"],
        max_FDRpval=0.05,
        out_dir="plots/",
        save_prefix=f"HALLMARK_scATAC_{key}_",
        save_output=False,
        get_gsea_df_kwargs={
            "source": "geneset",
            "target": "genesymbol",
            "times": 200,
            "min_n": 5,
            "seed": 8,
        },
        decoupler_dotplot_facet_kwargs={
            "top_n": 10,
            "dpi": 200,
            "cmap": "Reds_r",
            "figsize_scale": 1.2,
        },
    )

def scDEG_decoupler_ORA(adata, uns_key, net, max_FDRpval=0.05, min_logFC=0.5849625007211562, out_dir='./', save_prefix='', save_output=False, get_ora_df_kwargs={}, decoupler_dotplot_facet_kwargs={})

Quickly process DEGs from scRNA dataset through decoupler ORA and create plot for initial look at pathways

decoupler tools used: * "ORA" for biological pathways

Parameters

adata : anndata.AnnData
Object containing rank_genes_groups results in adata.uns
uns_key : str
Key to adata.uns containing rank_genes_groups results
net : pd.DataFrame
Network dataframe required to run ORA. e.g. msigdb where msigdb is a pd.DataFrame from dc.get_resource.
max_FDRpval : float, optional (default=0.05)
FDR p-value cutoff for using genes and plotting significant ORA pathways.
min_logFC : float, optional (default=0.5849625007211562, i.e. log2(1.5))
logFC cutoff for using genes and plotting significant ORA pathways.
out_dir : str, optional (default="./")
Path to directory to save plots to
save_prefix : str, optional (default="")
String to prepend to output plots to make names unique
save_output : bool, optional (default=False)
If True, save output dataframe to out_dir/save_prefix_NMF_ORA.csv
get_ora_df_kwargs : dict, optional (default={})
Keyword arguments to pass to dc.get_ora_df. e.g. source, target, n_background.
decoupler_dotplot_facet_kwargs : dict, optional (default={})
Keyword arguments to pass to decoupler_dotplot_facet. e.g. top_n, cmap, dpi

Returns

enr_pvals : pd.DataFrame
ORA output as dataframe

Saves decoupler ORA dotplots to out_dir/.
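
The default min_logFC in the signature equals log2(1.5), i.e. a 1.5-fold change on the log2 scale. The sketch below shows how the two cutoffs might be combined to pick genes for ORA. The field names pvals_adj and logfoldchanges follow scanpy's rank_genes_groups output; the helper name select_degs, the inclusive comparisons, and the restriction to up-regulated genes are assumptions rather than documented behaviour.

```python
import math

MIN_LOGFC = math.log2(1.5)  # 1.5-fold change on the log2 scale


def select_degs(degs, max_FDRpval=0.05, min_logFC=MIN_LOGFC):
    """Keep genes passing both the FDR and the log fold change cutoff."""
    return [
        d["names"]
        for d in degs
        if d["pvals_adj"] <= max_FDRpval and d["logfoldchanges"] >= min_logFC
    ]


degs = [
    {"names": "GENE1", "pvals_adj": 0.001, "logfoldchanges": 2.0},
    {"names": "GENE2", "pvals_adj": 0.200, "logfoldchanges": 3.0},  # fails FDR
    {"names": "GENE3", "pvals_adj": 0.010, "logfoldchanges": 0.3},  # fails logFC
]
print(select_degs(degs))
```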

def score_gene_signatures(adata, signatures_dict, sig_subset=None, layer=None)

Score gene signatures in AnnData using sc.tl.score_genes

Parameters

adata : anndata.AnnData
Object containing gene expression data for scoring
signatures_dict : dict
Dictionary of signature names (keys) and constituent genes (values), with gene names matching adata.var_names
sig_subset : list, optional (default=None)
Subset of keys from signatures_dict to attempt scoring. If None, score all signatures from signatures_dict
layer : str, optional (default=None)
Layer key from adata.layers to use for scoring. If None, use adata.X.

Returns

adata is edited with signature scores added to .obs

failed_sigs : list
List of signatures not scored properly
scored_sigs : list
List of signatures scored successfully

def subset_adata(adata, subset, verbose=True)

Subsets AnnData object on one or more .obs columns

Columns should contain 0/False for cells to throw out, and 1/True for cells to keep. Keeps union of all labels provided in subset.

Parameters

adata : anndata.AnnData
the data
subset : str or list of str
adata.obs labels to use for subsetting. Labels must be binary (0, "0", False, "False" to toss - 1, "1", True, "True" to keep). Multiple labels will keep the union of cells flagged by any label.
verbose : bool, optional (default=True)
print updates to console

Returns

adata : anndata.AnnData
new anndata object as subset of adata
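
The union behaviour described in the summary can be illustrated without AnnData. In the sketch below, .obs is stood in for by a dictionary of equal-length columns, and a cell is kept when any of the given labels marks it as keep. The helper name union_mask and the toy column names are hypothetical, and this is an illustration of the rule rather than the function's source.

```python
KEEP_VALUES = {1, "1", True, "True"}


def union_mask(obs, subset):
    """Boolean mask that is True where any label in subset says keep."""
    if isinstance(subset, str):
        subset = [subset]
    n_cells = len(next(iter(obs.values())))
    return [
        any(obs[label][i] in KEEP_VALUES for label in subset)
        for i in range(n_cells)
    ]


obs = {
    "CellRanger_2": [1, 0, 0, 1],
    "manual_keep": ["True", "True", "False", "False"],
}
print(union_mask(obs, "CellRanger_2"))
print(union_mask(obs, ["CellRanger_2", "manual_keep"]))
```

Only the third cell is dropped when both labels are given, because it is the only one flagged by neither.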