Module `kitchen.recipes`

Functions for processing and manipulating .h5ad objects and automated processing of scRNA-seq data

Functions

def NMF_decoupler_ORA(adata, net, top_n_genes=None, max_FDRpval=0.05, out_dir='./', save_prefix='', save_output=False, get_ora_df_kwargs={}, decoupler_dotplot_facet_kwargs={})

Quickly process cNMF loadings through decoupler ORA and create plot for initial look at pathways

decoupler tools used:

* "ORA" for biological pathways

Parameters

adata : anndata.AnnData: Object containing cNMF results in adata.uns["cnmf_markers"]
net : pd.DataFrame: Network dataframe required to run ORA. e.g. msigdb where msigdb is a pd.DataFrame from dc.get_resource.
top_n_genes : int or None, optional (default=None): If None, use entire adata.uns["cnmf_markers"] dataframe. If an integer, select first top_n_genes rows of adata.uns["cnmf_markers"].
max_FDRpval : float or None, optional (default=0.05): FDR p-value cutoff for plotting significant ORA pathways. If None, don't filter and show top 20 terms by FDR p-value.
out_dir : str, optional (default="./"): Path to directory to save plots to
save_prefix : str, optional (default=""): String to prepend to output plots to make names unique
save_output : bool, optional (default=False): If True, save output dataframe to out_dir/save_prefix_NMF_ORA.csv
get_ora_df_kwargs : dict, optional (default={}): Keyword arguments to pass to dc.get_ora_df. e.g. source, target, n_background.
decoupler_dotplot_facet_kwargs : dict, optional (default={}): Keyword arguments to pass to decoupler_dotplot_facet. e.g. top_n, cmap, dpi

Returns

enr_pvals : pd.DataFrame: ORA output as dataframe

Saves decoupler ORA dotplots to out_dir/.
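
The max_FDRpval behavior described above (filter at the cutoff, or fall back to the top 20 terms when None) can be sketched as follows. This is an illustration only, not the module's code: the helper name select_terms_for_plot, the "FDR p-value" column name, and the toy term names are assumptions.

```python
import pandas as pd


def select_terms_for_plot(ora_df, max_FDRpval=0.05, top_n=20, fdr_col="FDR p-value"):
    """Hypothetical helper: choose which enrichment terms get plotted."""
    if max_FDRpval is None:
        # No cutoff: keep the top_n terms with the smallest FDR p-values
        return ora_df.nsmallest(top_n, fdr_col)
    # Otherwise keep only terms at or below the cutoff
    return ora_df[ora_df[fdr_col] <= max_FDRpval]


# Toy enrichment table (made-up terms and values)
ora_df = pd.DataFrame(
    {
        "Term": ["TERM_A", "TERM_B", "TERM_C"],
        "FDR p-value": [0.001, 0.2, 0.04],
    }
)

significant = select_terms_for_plot(ora_df, max_FDRpval=0.05)
unfiltered = select_terms_for_plot(ora_df, max_FDRpval=None)
print(significant)
print(unfiltered)
```

Whether the real function uses a strict or inclusive comparison at the cutoff is not stated in the documentation; the sketch assumes inclusive.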

def PCA_decoupler_GSEA(adata, net, n_pcs=None, max_FDRpval=0.05, out_dir='./', save_prefix='', save_output=False, plot=False, get_gsea_df_kwargs={}, decoupler_dotplot_facet_kwargs={})

Quickly process PCA loadings through decoupler GSEA and create plot for initial look at pathways

decoupler tools used:

* "GSEA" for biological pathways

Parameters

adata : anndata.AnnData: Object containing PCA results in adata.varm["PCs"]
net : pd.DataFrame: Network dataframe required to run GSEA. e.g. msigdb where msigdb is a pd.DataFrame from dc.get_resource.
n_pcs : int or None, optional (default=None): Number of principal components from adata.varm["PCs"] to run through GSEA.
max_FDRpval : float or None, optional (default=0.05): FDR p-value cutoff for plotting significant GSEA pathways. If None, don't filter and show top 20 terms by FDR p-value.
out_dir : str, optional (default="./"): Path to directory to save plots to
save_prefix : str, optional (default=""): String to prepend to output plots to make names unique
save_output : bool, optional (default=False): If True, save output dataframe to out_dir/save_prefix_PCA_GSEA.csv
get_gsea_df_kwargs : dict, optional (default={}): Keyword arguments to pass to dc.get_gsea_df. e.g. source, target, times, min_n.
decoupler_dotplot_facet_kwargs : dict, optional (default={}): Keyword arguments to pass to decoupler_dotplot_facet. e.g. top_n, cmap, dpi

Returns

enr_pvals : pd.DataFrame: GSEA output as dataframe

Saves decoupler GSEA dotplots to out_dir/.

Examples

    nets = fetch_decoupler_resources(
        resources=["msigdb", "progeny", "collectri", "liana"],
        genome="human",
    )
    enr_pvals = PCA_decoupler_GSEA(
        adata=a,
        n_pcs=3,
        net=nets["msigdb"],
        max_FDRpval=0.05,
        out_dir="plots/",
        save_prefix="",
        save_output=False,
        get_gsea_df_kwargs={
            "source": "geneset",
            "target": "genesymbol",
            "times": 500,
            "min_n": 5,
        },
    )

def calculate_cell_proportions(obs_df, celltype_column, groupby, celltype_subset=None)

Calculates the proportion of each cell type within each group of cells

Parameters

obs_df : pd.DataFrame: Dataframe containing single cells as rows and metadata columns
celltype_column : str: Column of obs_df containing cell type labels for proportion calculation
groupby : str: Column of obs_df to group by for proportions (e.g. sample or patient)
celltype_subset : list of str, optional (default=None): List of cell types in obs_df.celltype_column to keep in output. If None, return proportions of all celltypes

Returns

props_df : pd.DataFrame: Dataframe with values of obs_df.groupby as rows and values of obs_df.celltype_column (or celltype_subset) as columns
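
The shape of props_df can be illustrated with a small pandas sketch. This is not the module's implementation: the function name cell_proportions_sketch is hypothetical, and it assumes proportions are normalized within each groupby value (each row sums to 1 before subsetting).

```python
import pandas as pd


def cell_proportions_sketch(obs_df, celltype_column, groupby, celltype_subset=None):
    """Hypothetical stand-in showing the expected output layout."""
    # Count cells per (group, cell type), then divide each row by its total
    props = pd.crosstab(obs_df[groupby], obs_df[celltype_column], normalize="index")
    if celltype_subset is not None:
        # Keep only the requested cell types as columns
        props = props[celltype_subset]
    return props


# Toy metadata table: one row per cell (made-up labels)
obs_df = pd.DataFrame(
    {
        "sample": ["s1", "s1", "s1", "s1", "s2", "s2"],
        "celltype": ["T", "T", "B", "Myeloid", "B", "B"],
    }
)

props_df = cell_proportions_sketch(obs_df, "celltype", "sample")
props_subset = cell_proportions_sketch(obs_df, "celltype", "sample", celltype_subset=["T", "B"])
print(props_df)
```

Rows are the values of the groupby column and columns are cell types, matching the Returns description above.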

def cc_score(adata, layer=None, seed=18, verbose=True)

Calculates cell cycle scores and implied phase for each observation

Parameters

adata : anndata.AnnData: object containing transformed and normalized (arcsinh or log1p) counts in 'layer'.
layer : str, optional (default=None): key from adata.layers to use for cc phase calculation. Default None to use .X
seed : int, optional (default=18): random state for PCA, neighbors graph and clustering
verbose : bool, optional (default=True): print updates to console

Returns

adata is edited in place to add 'G2M_score', 'S_score', and 'phase' to .obs
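
How 'phase' can be implied from the two scores is sketched below. It assumes cc_score follows the convention used by scanpy's score_genes_cell_cycle (S by default, G2M when the G2M score is higher, G1 when both scores are negative); the documentation above does not confirm this, and the function name assign_phase is hypothetical.

```python
def assign_phase(s_score, g2m_score):
    """Hypothetical helper: infer a cell cycle phase from S and G2M scores."""
    # Neither program is active: call the cell G1
    if s_score < 0 and g2m_score < 0:
        return "G1"
    # G2M program dominates
    if g2m_score > s_score:
        return "G2M"
    # Otherwise default to S
    return "S"


# Made-up (S_score, G2M_score) pairs for three cells
phases = [assign_phase(s, g2m) for s, g2m in [(0.3, 0.1), (0.1, 0.4), (-0.2, -0.1)]]
print(phases)
```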

def cellranger2(adata, expected=1500, upper_quant=0.99, lower_prop=0.1, label='CellRanger_2', verbose=True)

Labels cells using "knee point" method from CellRanger 2.1

Parameters

adata : anndata.AnnData: object containing unfiltered counts
expected : int, optional (default=1500): estimated number of real cells expected in dataset
upper_quant : float, optional (default=0.99): upper quantile of real cells to test
lower_prop : float, optional (default=0.1): fraction of the upper_quant total counts value to use as the total counts threshold
label : str, optional (default="CellRanger_2"): how to name .obs column containing output
verbose : bool, optional (default=True): print updates to console

Returns

adata edited in place to add .obs[label] binary label
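
The "knee point" rule can be sketched in plain Python, following the published description of Cell Ranger 2 cell calling: take the top `expected` barcodes by total counts, find their upper_quant quantile, and keep every barcode above lower_prop times that value. The helper names below are hypothetical, and the module's own quantile handling may differ.

```python
def quantile(ascending_vals, q):
    """Linear-interpolation quantile of an ascending list."""
    pos = q * (len(ascending_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ascending_vals) - 1)
    frac = pos - lo
    return ascending_vals[lo] * (1 - frac) + ascending_vals[hi] * frac


def knee_point_labels(total_counts, expected=1500, upper_quant=0.99, lower_prop=0.1):
    """Hypothetical sketch: return (per-barcode labels, counts threshold)."""
    # Top `expected` barcodes by total counts, re-sorted ascending for the quantile
    top = sorted(sorted(total_counts, reverse=True)[:expected])
    # Threshold is a fraction of the upper quantile of those barcodes
    threshold = lower_prop * quantile(top, upper_quant)
    return [count > threshold for count in total_counts], threshold


# Toy data: 100 cell-like barcodes and 900 near-empty droplets
total_counts = [1000] * 100 + [5] * 900
labels, threshold = knee_point_labels(total_counts, expected=100)
print(sum(labels), threshold)
```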

def cellranger3(adata, init_counts=15000, min_umi_frac_of_median=0.01, min_umis_nonambient=500, max_adj_pvalue=0.01)

Labels cells using "emptydrops" method from CellRanger 3.0

Parameters

adata : anndata.AnnData: object containing unfiltered counts
init_counts : int, optional (default=15000): initial total counts threshold for calling cells
min_umi_frac_of_median : float, optional (default=0.01): minimum total counts for testing barcodes as fraction of median counts for initially labeled cells
min_umis_nonambient : float, optional (default=500): minimum total counts for testing barcodes
max_adj_pvalue : float, optional (default=0.01): maximum p-value for cell calling after B-H correction

Returns

adata edited in place to add .obs["CellRanger_3"] binary label
and .obs["CellRanger_3_ll"] log-likelihoods for tested barcodes
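
One plausible reading of how min_umi_frac_of_median and min_umis_nonambient combine is that a barcode is tested only if its total counts reach the larger of the two limits. This is an assumption about the interplay of the two parameters, not something the documentation above states, and the function name is hypothetical.

```python
import statistics


def min_counts_for_testing(initial_cell_counts, min_umi_frac_of_median=0.01, min_umis_nonambient=500):
    """Hypothetical sketch: lower total-counts bound for barcodes to test."""
    # Median total counts of the barcodes labeled as cells in the initial pass
    median_counts = statistics.median(initial_cell_counts)
    # Assumed rule: the stricter (larger) of the two minimums applies
    return max(min_umis_nonambient, median_counts * min_umi_frac_of_median)


# Made-up total counts for initially called cells
low_depth = min_counts_for_testing([20000, 30000, 40000])
high_depth = min_counts_for_testing([80000, 100000, 120000])
print(low_depth, high_depth)
```

Under this reading, the absolute floor dominates for shallow datasets and the median-based fraction dominates for deeply sequenced ones.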

def dim_reduce(adata, layer=None, use_rep=None, clust_resolution=1.0, paga=True, seed=18, verbose=True)

Reduces dimensions of single-cell dataset using standard methods

Parameters

adata : anndata.AnnData: object containing preprocessed counts matrix
layer : str, optional (default=None): layer to use; default None for .X
use_rep : str, optional (default=None): .obsm key to use for neighbors graph instead of PCA; default None, generate new PCA from layer
clust_resolution : float, optional (default=1.0): resolution as fraction on [0.0, 1.0] for leiden clustering. default 1.0
paga : bool, optional (default=True): run PAGA to seed UMAP embedding
seed : int, optional (default=18): random state for PCA, neighbors graph and clustering
verbose : bool, optional (default=True): print updates to console

Returns

adata is edited in place, adding PCA, neighbors graph, PAGA, and UMAP

def plot_genes(adata, de_method='t-test_overestim_var', layer='log1p_norm', groupby='leiden', key_added='rank_genes_groups', ambient=False, plot_type=None, n_genes=5, dendrogram=True, cmap='Greys', figsize_scale=1.0, save_to='de.png', verbose=True, **kwargs)

Calculates and plots rank_genes_groups results

Parameters

adata : anndata.AnnData: object containing preprocessed and dimension-reduced counts matrix
de_method : str, optional (default="t-test_overestim_var"): one of "t-test", "t-test_overestim_var", "wilcoxon"
layer : str, optional (default="log1p_norm"): one of adata.layers to use for DEG analysis. Recommended to use 'raw_counts' if de_method=='wilcoxon' and 'log1p_norm' if de_method=='t-test'.
groupby : str, optional (default="leiden"): adata.obs key to group cells by
key_added : str, optional (default="rank_genes_groups"): adata.uns key to add DEG results to
ambient : bool, optional (default=False): include ambient genes as a group in the plot/output dictionary
plot_type : str, optional (default=None): One of "dotplot", "matrixplot", "dotmatrix", "stacked_violin", or "heatmap". If None, don't plot, just return DEGs as dictionary.
n_genes : int, optional (default=5): number of top genes per group to show
dendrogram : bool, optional (default=True): show dendrogram of cluster similarity
cmap : str, optional (default="Greys"): matplotlib colormap for dots
figsize_scale : float, optional (default=1.0): scale dimensions of the figure
save_to : str, optional (default="de.png"): string to add to plot name using scanpy plot defaults
verbose : bool, optional (default=True): print updates to console
**kwargs : optional: keyword args to add to custom_heatmap()

Returns

markers : dict: dictionary of top n_genes DEGs per group
myplot : matplotlib.Figure: custom_heatmap object if plot_type!=None

def plot_genes_cnmf(adata, plot_type='heatmap', groupby='leiden', attr='varm', keys='cnmf_spectra', indices=None, n_genes=5, dendrogram=True, figsize_scale=1.0, cmap='Greys', save_to='de_cnmf.png', **kwargs)

Calculates and plots top cNMF gene loadings

Parameters

adata : anndata.AnnData: object containing preprocessed and dimension-reduced counts matrix
plot_type : str, optional (default="heatmap"): One of "dotplot", "matrixplot", "dotmatrix", "stacked_violin", or "heatmap".
groupby : str, optional (default="leiden"): .obs key to group cells by
attr : str {"var", "obs", "uns", "varm", "obsm"}, optional (default="varm"): attribute of adata that contains the score
keys : str or list of str, optional (default="cnmf_spectra"): scores to look up an array from the attribute of adata
indices : list of int, optional (default=None): column indices of keys for which to plot (e.g. [0,1,2] for first three keys)
n_genes : int, optional (default=5): number of top genes per group to show
dendrogram : bool, optional (default=True): show dendrogram of cluster similarity
figsize_scale : float, optional (default=1.0): scale dimensions of the figure
cmap : str, optional (default="Greys"): valid color map for the plot
save_to : str, optional (default="de_cnmf.png"): string to add to plot name using scanpy plot defaults
**kwargs : optional: keyword args to add to sc.pl.matrixplot, sc.pl.dotplot, or sc.pl.heatmap

Returns

markers : dict: dictionary of top n_genes features per NMF factor
myplot : matplotlib.Figure: custom_heatmap object

def scDEG_decoupler_GSEA(adata, uns_key, net, max_FDRpval=0.05, out_dir='./', save_prefix='', save_output=False, get_gsea_df_kwargs={}, decoupler_dotplot_facet_kwargs={})

Quickly process DEGs from scRNA dataset through decoupler GSEA and create plot for initial look at pathways

decoupler tools used:

* "GSEA" for biological pathways

Parameters

adata : anndata.AnnData: Object containing rank_genes_groups results in adata.uns
uns_key : str: Key to adata.uns containing rank_genes_groups results
net : pd.DataFrame: Network dataframe required to run GSEA. e.g. msigdb where msigdb is a pd.DataFrame from dc.get_resource.
max_FDRpval : float, optional (default=0.05): FDR p-value cutoff for using genes and plotting significant GSEA pathways.
out_dir : str, optional (default="./"): Path to directory to save plots to
save_prefix : str, optional (default=""): String to prepend to output plots to make names unique
save_output : bool, optional (default=False): If True, save output dataframe to out_dir/save_prefix_NMF_GSEA.csv
get_gsea_df_kwargs : dict, optional (default={}): Keyword arguments to pass to dc.get_gsea_df. e.g. source, target, times, min_n, seed.
decoupler_dotplot_facet_kwargs : dict, optional (default={}): Keyword arguments to pass to decoupler_dotplot_facet. e.g. top_n, cmap, dpi

Returns

enr_pvals : pd.DataFrame: GSEA output as dataframe

Saves decoupler GSEA dotplots to out_dir/.

Examples

    nets = fetch_decoupler_resources(
        resources=["msigdb", "progeny", "collectri", "liana"],
        genome="human",
    )
    frame = scDEG_decoupler_GSEA(
        a,
        uns_key="rank_genes_groups_{}".format(key),
        net=nets["msigdb"],
        max_FDRpval=0.05,
        out_dir="plots/",
        save_prefix=f"HALLMARK_{key}_",
        save_output=False,
        get_gsea_df_kwargs={
            "source": "geneset",
            "target": "genesymbol",
            "times": 200,
            "min_n": 5,
            "seed": 8,
        },
        decoupler_dotplot_facet_kwargs={
            "top_n": 10,
            "dpi": 200,
            "cmap": "Reds_r",
            "figsize_scale": 1.2,
        },
    )

def scDEG_decoupler_ORA(adata, uns_key, net, max_FDRpval=0.05, min_logFC=0.5849625007211562, out_dir='./', save_prefix='', save_output=False, get_ora_df_kwargs={}, decoupler_dotplot_facet_kwargs={})

Quickly process DEGs from scRNA dataset through decoupler ORA and create plot for initial look at pathways

decoupler tools used:

* "ORA" for biological pathways

Parameters

adata : anndata.AnnData: Object containing rank_genes_groups results in adata.uns
uns_key : str: Key to adata.uns containing rank_genes_groups results
net : pd.DataFrame: Network dataframe required to run ORA. e.g. msigdb where msigdb is a pd.DataFrame from dc.get_resource.
max_FDRpval : float, optional (default=0.05): FDR p-value cutoff for using genes and plotting significant ORA pathways.
min_logFC : float, optional (default=0.5849625007211562, i.e. log2(1.5)): logFC cutoff for using genes and plotting significant ORA pathways.
out_dir : str, optional (default="./"): Path to directory to save plots to
save_prefix : str, optional (default=""): String to prepend to output plots to make names unique
save_output : bool, optional (default=False): If True, save output dataframe to out_dir/save_prefix_NMF_ORA.csv
get_ora_df_kwargs : dict, optional (default={}): Keyword arguments to pass to dc.get_ora_df. e.g. source, target, n_background.
decoupler_dotplot_facet_kwargs : dict, optional (default={}): Keyword arguments to pass to decoupler_dotplot_facet. e.g. top_n, cmap, dpi

Returns

enr_pvals : pd.DataFrame: ORA output as dataframe

Saves decoupler ORA dotplots to out_dir/.
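
The default min_logFC in the signature equals log2(1.5), i.e. a 1.5-fold change on the log2 scale. The sketch below checks that arithmetic and shows one way DEGs might be filtered before ORA. The filter_degs helper, the record keys (borrowed from scanpy's rank_genes_groups output), the inclusive comparisons, and the gene names are all assumptions for illustration.

```python
import math

# A 1.5-fold change expressed as a log2 fold change
default_min_logFC = math.log2(1.5)


def filter_degs(degs, max_FDRpval=0.05, min_logFC=default_min_logFC):
    """Hypothetical helper: keep genes passing both the FDR and logFC cutoffs."""
    return [
        d["names"]
        for d in degs
        if d["pvals_adj"] <= max_FDRpval and d["logfoldchanges"] >= min_logFC
    ]


# Made-up DEG records for one group
degs = [
    {"names": "GENE_A", "logfoldchanges": 1.3, "pvals_adj": 0.001},
    {"names": "GENE_B", "logfoldchanges": 0.4, "pvals_adj": 0.001},  # fails logFC
    {"names": "GENE_C", "logfoldchanges": 2.0, "pvals_adj": 0.3},  # fails FDR
]

kept = filter_degs(degs)
print(default_min_logFC, kept)
```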

def score_gene_signatures(adata, signatures_dict, method='seurat', sig_subset=None, layer=None)

Score gene signatures in AnnData using sc.tl.score_genes or dc.run_gsva

Parameters

adata : anndata.AnnData: Object containing gene expression data for scoring
signatures_dict : dict: Dictionary of signature names (keys) and constituent genes (values), with gene names matching adata.var_names
method : literal ['seurat','gsva'], optional (default='seurat'): Method to use for scoring. 'seurat':sc.tl.score_genes; 'gsva':dc.run_gsva.
sig_subset : list, optional (default=None): Subset of keys from signatures_dict to attempt scoring. If None, score all signatures from signatures_dict
layer : str, optional (default=None): Layer key from adata.layers to use for scoring. If None, use adata.X.

Returns

adata is edited with signature scores added to .obs

failed_sigs : list: List of signatures not scored properly
scored_sigs : list: List of signatures scored successfully

def subset_adata(adata, subset, verbose=True)

Subsets AnnData object on one or more .obs columns

Columns should contain 0/False for cells to throw out, and 1/True for cells to keep. Keeps union of all labels provided in subset.

Parameters

adata : anndata.AnnData: the data
subset : str or list of str: adata.obs labels to use for subsetting. Labels must be binary (0, "0", False, "False" to toss - 1, "1", True, "True" to keep). Multiple labels will keep union.
verbose : bool, optional (default=True): print updates to console

Returns

adata : anndata.AnnData: new anndata object as subset of adata
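
The union behavior described above (a cell is kept if any of the given labels marks it as a keeper) can be sketched without AnnData. The helper name union_mask and the toy label columns are hypothetical; the real function operates on adata.obs.

```python
# Values treated as "keep", per the parameter description above
KEEP_VALUES = (1, "1", True, "True")


def union_mask(label_columns):
    """Hypothetical helper: True for cells kept by at least one label.

    label_columns maps a label name to a list with one value per cell.
    """
    n_cells = len(next(iter(label_columns.values())))
    return [
        any(column[i] in KEEP_VALUES for column in label_columns.values())
        for i in range(n_cells)
    ]


# Toy labels for four cells, mixing the accepted encodings
label_columns = {
    "CellRanger_2": [1, 0, 0, 1],
    "manual_keep": ["False", "True", "False", "True"],
}

mask = union_mask(label_columns)
print(mask)
```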