Module kitchen.recipes
Functions for processing and manipulating .h5ad objects and automated processing of scRNA-seq data
Functions
def NMF_decoupler_ORA(adata, net, top_n_genes=None, max_FDRpval=0.05, out_dir='./', save_prefix='', save_output=False, get_ora_df_kwargs={}, decoupler_dotplot_facet_kwargs={})-
Quickly process cNMF loadings through decoupler ORA and create a plot for an initial look at pathways.
decoupler tools used:
* "ORA" for biological pathways
Parameters
adata:anndata.AnnData- Object containing cNMF results in adata.uns["cnmf_markers"]
net:pd.DataFrame- Network dataframe required to run ORA, e.g. msigdb, where msigdb is a pd.DataFrame from dc.get_resource.
top_n_genes:int or None, optional(default=None)- If None, use the entire adata.uns["cnmf_markers"] dataframe. If an integer, select the first top_n_genes rows of adata.uns["cnmf_markers"].
max_FDRpval:float or None, optional(default=0.05)- FDR p-value cutoff for plotting significant ORA pathways. If None, don't filter and show top 20 terms by FDR p-value.
out_dir:str, optional(default="./")- Path to directory to save plots to
save_prefix:str, optional(default="")- String to prepend to output plots to make names unique
save_output:bool, optional(default=False)- If True, save output dataframe to out_dir/save_prefix_NMF_ORA.csv
get_ora_df_kwargs:dict, optional(default={})- Keyword arguments to pass to dc.get_ora_df, e.g. source, target, n_background.
decoupler_dotplot_facet_kwargs:dict, optional(default={})- Keyword arguments to pass to decoupler_dotplot_facet, e.g. top_n, cmap, dpi
Returns
enr_pvals:pd.DataFrame- ORA output as dataframe
Saves
decoupler ORA dotplots to out_dir/.
def PCA_decoupler_GSEA(adata, net, n_pcs=None, max_FDRpval=0.05, out_dir='./', save_prefix='', save_output=False, plot=False, get_gsea_df_kwargs={}, decoupler_dotplot_facet_kwargs={})-
Quickly process PCA loadings through decoupler GSEA and create a plot for an initial look at pathways.
decoupler tools used:
* "GSEA" for biological pathways
Parameters
adata:anndata.AnnData- Object containing PCA results in adata.varm["PCs"]
net:pd.DataFrame- Network dataframe required to run GSEA, e.g. msigdb, where msigdb is a pd.DataFrame from dc.get_resource.
max_FDRpval:float or None, optional(default=0.05)- FDR p-value cutoff for plotting significant GSEA pathways. If None, don't filter and show top 20 terms by FDR p-value.
out_dir:str, optional(default="./")- Path to directory to save plots to
save_prefix:str, optional(default="")- String to prepend to output plots to make names unique
save_output:bool, optional(default=False)- If True, save output dataframe to out_dir/save_prefix_PCA_GSEA.csv
get_gsea_df_kwargs:dict, optional(default={})- Keyword arguments to pass to dc.get_gsea_df, e.g. source, target, times.
decoupler_dotplot_facet_kwargs:dict, optional(default={})- Keyword arguments to pass to decoupler_dotplot_facet, e.g. top_n, cmap, dpi
Returns
enr_pvals:pd.DataFrame- GSEA output as dataframe
Saves
decoupler GSEA dotplots to out_dir/.
Examples
    nets = fetch_decoupler_resources(
        resources=["msigdb", "progeny", "collectri", "liana"],
        genome="human",
    )
    enr_pvals = PCA_decoupler_GSEA(
        adata=a,
        n_pcs=3,
        net=nets["msigdb"],
        max_FDRpval=0.05,
        out_dir="plots/",
        save_prefix="",
        save_output=False,
        get_gsea_df_kwargs={
            "source": "geneset",
            "target": "genesymbol",
            "times": 500,
            "min_n": 5,
        },
    )
def calculate_cell_proportions(obs_df, celltype_column, groupby, celltype_subset=None)-
Calculate proportions of each cell type in a column
Parameters
obs_df:pd.DataFrame- Dataframe containing single cells as rows and metadata columns
celltype_column:str- Column of obs_df containing cell type labels for proportion calculation
groupby:str- Column of obs_df to group by for proportions (e.g. sample or patient)
celltype_subset:list of str, optional(default=None)- List of cell types in obs_df[celltype_column] to keep in output. If None, return proportions of all cell types
Returns
props_df:pd.DataFrame- Dataframe with values of obs_df[groupby] as rows and values of obs_df[celltype_column] (or celltype_subset) as columns
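The shape of props_df described above can be illustrated with a small pandas sketch. This is not the kitchen source: cell_proportions_sketch is a hypothetical stand-in, and it assumes that proportions are computed over all cells in each group before celltype_subset is applied, which the documentation does not state.

```python
import pandas as pd


def cell_proportions_sketch(obs_df, celltype_column, groupby, celltype_subset=None):
    """Hypothetical stand-in for calculate_cell_proportions (not the kitchen source)."""
    # Count cells per (group, cell type) and divide each row by its row total
    props_df = pd.crosstab(
        obs_df[groupby], obs_df[celltype_column], normalize="index"
    )
    if celltype_subset is not None:
        # Assumption: subsetting only drops columns; the remaining values are
        # still fractions of ALL cells in the group
        props_df = props_df[celltype_subset]
    return props_df


# Toy metadata: six cells from two samples
obs_df = pd.DataFrame(
    {
        "sample": ["s1", "s1", "s1", "s1", "s2", "s2"],
        "celltype": ["T", "T", "B", "Myeloid", "B", "B"],
    }
)
props_df = cell_proportions_sketch(obs_df, "celltype", "sample")
subset_df = cell_proportions_sketch(
    obs_df, "celltype", "sample", celltype_subset=["T"]
)
```

Here each row of props_df corresponds to one sample and sums to 1, while subset_df keeps only the "T" column.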
def cc_score(adata, layer=None, seed=18, verbose=True)-
Calculates cell cycle scores and implied phase for each observation
Parameters
adata:anndata.AnnData- object containing transformed and normalized (arcsinh or log1p) counts in 'layer'.
layer:str, optional(default=None)- key from adata.layers to use for cc phase calculation. Default None to use .X
seed:int, optional(default=18)- random state for PCA, neighbors graph and clustering
verbose:bool, optional(default=True)- print updates to console
Returns
adata is edited in place to add 'G2M_score', 'S_score', and 'phase' to .obs
def cellranger2(adata, expected=1500, upper_quant=0.99, lower_prop=0.1, label='CellRanger_2', verbose=True)-
Labels cells using "knee point" method from CellRanger 2.1
Parameters
adata:anndata.AnnData- object containing unfiltered counts
expected:int, optional(default=1500)- estimated number of real cells expected in dataset
upper_quant:float, optional(default=0.99)- upper quantile of real cells to test
lower_prop:float, optional(default=0.1)- fraction of the upper_quant total counts (among the expected cells) to use as the total counts threshold for calling cells
label:str, optional(default="CellRanger_2")- how to name .obs column containing output
verbose:bool, optional(default=True)- print updates to console
Returns
adata edited in place to add .obs[label] binary label
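How expected, upper_quant, and lower_prop interact may be easier to see in code. The following is a minimal NumPy sketch of the knee point heuristic as commonly described for Cell Ranger 2: take the top `expected` barcodes by total counts, find their upper quantile, and call every barcode above a fixed fraction of that value. The function name knee_point_labels is hypothetical, and the exact comparison used by cellranger2 is an assumption.

```python
import numpy as np


def knee_point_labels(total_counts, expected=1500, upper_quant=0.99, lower_prop=0.1):
    """Hypothetical sketch of knee-point cell calling (not the kitchen source)."""
    total_counts = np.asarray(total_counts, dtype=float)
    # Top `expected` barcodes ranked by total counts (descending)
    top = np.sort(total_counts)[::-1][:expected]
    # Robust estimate of the maximum: upper quantile of those barcodes
    robust_max = np.quantile(top, upper_quant)
    # Barcodes above a fixed fraction of the robust maximum are called cells
    threshold = lower_prop * robust_max
    labels = (total_counts > threshold).astype(int)
    return labels, threshold


# Toy example: four high-count barcodes and four near-empty droplets
counts = np.array([1000, 900, 800, 700, 50, 20, 5, 1])
labels, threshold = knee_point_labels(counts, expected=4)
```

In this toy example the threshold falls near 100 counts, so only the four high-count barcodes are labeled 1.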
def cellranger3(adata, init_counts=15000, min_umi_frac_of_median=0.01, min_umis_nonambient=500, max_adj_pvalue=0.01)-
Labels cells using "emptydrops" method from CellRanger 3.0
Parameters
adata:anndata.AnnData- object containing unfiltered counts
init_counts:int, optional(default=15000)- initial total counts threshold for calling cells
min_umi_frac_of_median:float, optional(default=0.01)- minimum total counts for testing barcodes as fraction of median counts for initially labeled cells
min_umis_nonambient:float, optional(default=500)- minimum total counts for testing barcodes
max_adj_pvalue:float, optional(default=0.01)- maximum p-value for cell calling after B-H correction
Returns
adata edited in place to add .obs["CellRanger_3"] binary label and .obs["CellRanger_3_ll"] log-likelihoods for tested barcodes
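The max_adj_pvalue cutoff applies to p-values after Benjamini-Hochberg (B-H) correction. As a rough illustration of that one step, here is a standalone NumPy sketch of the B-H adjustment. The helper benjamini_hochberg is hypothetical, and whether cellranger3 compares with <= or < is an assumption.

```python
import numpy as np


def benjamini_hochberg(pvals):
    """Return B-H adjusted p-values in the original order (illustrative sketch)."""
    pvals = np.asarray(pvals, dtype=float)
    n = pvals.size
    order = np.argsort(pvals)
    # Scale the i-th smallest p-value by n / i (ranks i = 1..n)
    ranked = pvals[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity from the largest rank downward and cap at 1
    ranked = np.minimum(np.minimum.accumulate(ranked[::-1])[::-1], 1.0)
    adjusted = np.empty(n)
    adjusted[order] = ranked
    return adjusted


# Toy p-values for five tested barcodes
pvals = [0.001, 0.008, 0.039, 0.041, 0.6]
adj = benjamini_hochberg(pvals)
# Assumption: barcodes at or below the cutoff are called cells
is_cell = adj <= 0.01
```

With these toy values only the first barcode survives the default cutoff of 0.01, even though two raw p-values are below 0.01.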
def dim_reduce(adata, layer=None, use_rep=None, clust_resolution=1.0, paga=True, seed=18, verbose=True)-
Reduces dimensions of single-cell dataset using standard methods
Parameters
adata:anndata.AnnData- object containing preprocessed counts matrix
layer:str, optional(default=None)- layer to use; default None for .X
use_rep:str, optional(default=None)- .obsm key to use for neighbors graph instead of PCA; default None, generate new PCA from layer
clust_resolution:float, optional(default=1.0)- resolution as fraction on [0.0, 1.0] for leiden clustering. default 1.0
paga:bool, optional(default=True)- run PAGA to seed UMAP embedding
seed:int, optional(default=18)- random state for PCA, neighbors graph and clustering
verbose:bool, optional(default=True)- print updates to console
Returns
adata is edited in place, adding PCA, neighbors graph, PAGA, and UMAP
def plot_genes(adata, de_method='t-test_overestim_var', layer='log1p_norm', groupby='leiden', key_added='rank_genes_groups', ambient=False, plot_type=None, n_genes=5, dendrogram=True, cmap='Greys', figsize_scale=1.0, save_to='de.png', verbose=True, **kwargs)-
Calculates and plots rank_genes_groups results
Parameters
adata:anndata.AnnData- object containing preprocessed and dimension-reduced counts matrix
de_method:str, optional(default="t-test_overestim_var")- one of "t-test", "t-test_overestim_var", "wilcoxon"
layer:str, optional(default="log1p_norm")- one of adata.layers to use for DEG analysis. Recommended to use 'raw_counts' if de_method=='wilcoxon' and 'log1p_norm' if de_method=='t-test'.
groupby:str, optional(default="leiden")- adata.obs key to group cells by
key_added:str, optional(default="rank_genes_groups")- adata.uns key to add DEG results to
ambient:bool, optional(default=False)- include ambient genes as a group in the plot/output dictionary
plot_type:str, optional(default=None)- One of "dotplot", "matrixplot", "dotmatrix", "stacked_violin", or "heatmap". If None, don't plot, just return DEGs as dictionary.
n_genes:int, optional(default=5)- number of top genes per group to show
dendrogram:bool, optional(default=True)- show dendrogram of cluster similarity
cmap:str, optional(default="Greys")- matplotlib colormap for dots
figsize_scale:float, optional(default=1.0)- scale dimensions of the figure
save_to:str, optional(default="de.png")- string to add to plot name using scanpy plot defaults
verbose:bool, optional(default=True)- print updates to console
**kwargs:optional- keyword args to add to custom_heatmap()
Returns
markers:dict- dictionary of top n_genes DEGs per group
myplot:matplotlib.Figure- custom_heatmap object if plot_type!=None
def plot_genes_cnmf(adata, plot_type='heatmap', groupby='leiden', attr='varm', keys='cnmf_spectra', indices=None, n_genes=5, dendrogram=True, figsize_scale=1.0, cmap='Greys', save_to='de_cnmf.png', **kwargs)-
Calculates and plots top cNMF gene loadings
Parameters
adata:anndata.AnnData- object containing preprocessed and dimension-reduced counts matrix
plot_type:str, optional(default="heatmap")- One of "dotplot", "matrixplot", "dotmatrix", "stacked_violin", or "heatmap".
groupby:str, optional(default="leiden")- .obs key to group cells by
attr:str {"var", "obs", "uns", "varm", "obsm"}, optional(default="varm")- attribute of adata that contains the score
keys:strorlistofstr, optional(default="cnmf_spectra")- scores to look up an array from the attribute of adata
indices:listofint, optional(default=None)- column indices of keys for which to plot (e.g. [0,1,2] for first three keys)
n_genes:int, optional(default=5)- number of top genes per group to show
dendrogram:bool, optional(default=True)- show dendrogram of cluster similarity
figsize_scale:float, optional(default=1.0)- scale dimensions of the figure
cmap:str, optional(default="Greys")- valid color map for the plot
save_to:str, optional(default="de_cnmf.png")- string to add to plot name using scanpy plot defaults
**kwargs:optional- keyword args to add to sc.pl.matrixplot, sc.pl.dotplot, or sc.pl.heatmap
Returns
markers:dict- dictionary of top n_genes features per NMF factor
myplot:matplotlib.Figure- custom_heatmap object
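To make the markers return value concrete, the sketch below picks the top n_genes features per factor from a genes-by-factors loadings array. It only illustrates the idea: top_features_per_factor and the "factor_<j>" key names are hypothetical, and the real function reads its scores from the adata attribute named by attr and keys.

```python
import numpy as np


def top_features_per_factor(loadings, gene_names, n_genes=5, indices=None):
    """Hypothetical sketch: top-loaded genes for each column of a loadings matrix."""
    loadings = np.asarray(loadings)
    gene_names = np.asarray(gene_names)
    if indices is None:
        # Default to every factor (column) in the matrix
        indices = range(loadings.shape[1])
    markers = {}
    for j in indices:
        # argsort is ascending, so reverse it and keep the first n_genes
        top = np.argsort(loadings[:, j])[::-1][:n_genes]
        markers[f"factor_{j}"] = gene_names[top].tolist()
    return markers


# Toy loadings: four genes (rows) by two factors (columns)
genes = ["A", "B", "C", "D"]
spectra = np.array(
    [
        [0.9, 0.0],
        [0.1, 0.7],
        [0.5, 0.2],
        [0.0, 0.8],
    ]
)
markers = top_features_per_factor(spectra, genes, n_genes=2)
```

Passing indices=[0] would restrict the output to the first factor, mirroring the role of the indices parameter above.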
def scDEG_decoupler_GSEA(adata, uns_key, net, max_FDRpval=0.05, out_dir='./', save_prefix='', save_output=False, get_gsea_df_kwargs={}, decoupler_dotplot_facet_kwargs={})-
Quickly process DEGs from scRNA dataset through decoupler GSEA and create a plot for an initial look at pathways.
decoupler tools used:
* "GSEA" for biological pathways
Parameters
adata:anndata.AnnData- Object containing rank_genes_groups results in adata.uns
uns_key:str- Key to adata.uns containing rank_genes_groups results
net:pd.DataFrame- Network dataframe required to run GSEA, e.g. msigdb, where msigdb is a pd.DataFrame from dc.get_resource.
max_FDRpval:float, optional(default=0.05)- FDR p-value cutoff for using genes and plotting significant GSEA pathways.
out_dir:str, optional(default="./")- Path to directory to save plots to
save_prefix:str, optional(default="")- String to prepend to output plots to make names unique
save_output:bool, optional(default=False)- If True, save output dataframe to out_dir/save_prefix_NMF_GSEA.csv
get_gsea_df_kwargs:dict, optional(default={})- Keyword arguments to pass to dc.get_gsea_df, e.g. source, target, times.
decoupler_dotplot_facet_kwargs:dict, optional(default={})- Keyword arguments to pass to decoupler_dotplot_facet, e.g. top_n, cmap, dpi
Returns
enr_pvals:pd.DataFrame- GSEA output as dataframe
Saves
decoupler GSEA dotplots to out_dir/.
Examples
    nets = fetch_decoupler_resources(
        resources=["msigdb", "progeny", "collectri", "liana"],
        genome="human",
    )
    frame = scDEG_decoupler_GSEA(
        a,
        uns_key="rank_genes_groups_{}".format(key),
        net=nets["msigdb"],
        max_FDRpval=0.05,
        out_dir="plots/",
        save_prefix=f"HALLMARK_{key}_",
        save_output=False,
        get_gsea_df_kwargs={
            "source": "geneset",
            "target": "genesymbol",
            "times": 200,
            "min_n": 5,
            "seed": 8,
        },
        decoupler_dotplot_facet_kwargs={
            "top_n": 10,
            "dpi": 200,
            "cmap": "Reds_r",
            "figsize_scale": 1.2,
        },
    )
def scDEG_decoupler_ORA(adata, uns_key, net, max_FDRpval=0.05, min_logFC=0.5849625007211562, out_dir='./', save_prefix='', save_output=False, get_ora_df_kwargs={}, decoupler_dotplot_facet_kwargs={})-
Quickly process DEGs from scRNA dataset through decoupler ORA and create a plot for an initial look at pathways.
decoupler tools used:
* "ORA" for biological pathways
Parameters
adata:anndata.AnnData- Object containing rank_genes_groups results in adata.uns
uns_key:str- Key to adata.uns containing rank_genes_groups results
net:pd.DataFrame- Network dataframe required to run ORA, e.g. msigdb, where msigdb is a pd.DataFrame from dc.get_resource.
max_FDRpval:float, optional(default=0.05)- FDR p-value cutoff for using genes and plotting significant ORA pathways.
min_logFC:float, optional(default=0.5849625007211562, which equals log2(1.5))- logFC cutoff for using genes and plotting significant ORA pathways.
out_dir:str, optional(default="./")- Path to directory to save plots to
save_prefix:str, optional(default="")- String to prepend to output plots to make names unique
save_output:bool, optional(default=False)- If True, save output dataframe to out_dir/save_prefix_NMF_ORA.csv
get_ora_df_kwargs:dict, optional(default={})- Keyword arguments to pass to dc.get_ora_df, e.g. source, target, n_background.
decoupler_dotplot_facet_kwargs:dict, optional(default={})- Keyword arguments to pass to decoupler_dotplot_facet, e.g. top_n, cmap, dpi
Returns
enr_pvals:pd.DataFrame- ORA output as dataframe
Saves
decoupler ORA dotplots to out_dir/.
def score_gene_signatures(adata, signatures_dict, method='seurat', sig_subset=None, layer=None)-
Score gene signatures in AnnData using sc.tl.score_genes or dc.run_gsva
Parameters
adata:anndata.AnnData- Object containing gene expression data for scoring
signatures_dict:dict- Dictionary of signature names (keys) and constituent genes (values), with gene names matching adata.var_names
method:literal ['seurat','gsva'], optional(default='seurat')- Method to use for scoring. 'seurat': sc.tl.score_genes; 'gsva': dc.run_gsva.
sig_subset:list, optional(default=None)- Subset of keys from signatures_dict to attempt scoring. If None, score all signatures from signatures_dict
layer:str, optional(default=None)- Layer key from adata.layers to use for scoring. If None, use adata.X.
Returns
adata is edited with signature scores added to .obs
failed_sigs:list- List of signatures not scored properly
scored_sigs:list- List of signatures scored successfully
def subset_adata(adata, subset, verbose=True)-
Subsets AnnData object on one or more .obs columns
Columns should contain 0/False for cells to throw out, and 1/True for cells to keep. Keeps union of all labels provided in subset.
Parameters
adata:anndata.AnnData- the data
subset:str or list of str- adata.obs labels to use for subsetting. Labels must be binary (0, "0", False, "False" to toss; 1, "1", True, "True" to keep). Multiple labels will keep the union, i.e. cells flagged to keep in at least one label.
verbose:bool, optional(default=True)- print updates to console
Returns
adata:anndata.AnnData- new anndata object as subset of adata
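The "union" behavior described in the summary can be pictured as OR-ing one keep mask per label column. The pandas sketch below works on a plain metadata dataframe rather than an AnnData object. The helper union_mask is hypothetical, and comparing string representations is a simplification chosen for this example, not necessarily how subset_adata matches values.

```python
import pandas as pd


def union_mask(obs, subset):
    """Hypothetical sketch: keep cells flagged in at least one binary label column."""
    if isinstance(subset, str):
        subset = [subset]
    mask = pd.Series(False, index=obs.index)
    for col in subset:
        # Comparing as strings treats 1, "1", True, and "True" as keep values
        mask |= obs[col].astype(str).isin(["1", "True"])
    return mask


# Toy metadata with one integer label and one string label
obs = pd.DataFrame(
    {
        "pass_qc": [1, 0, 1, 0],
        "is_epithelial": ["True", "True", "False", "False"],
    },
    index=["c1", "c2", "c3", "c4"],
)
mask = union_mask(obs, ["pass_qc", "is_epithelial"])
kept = obs[mask]
```

Only c4, which is flagged to toss in both columns, is dropped. With an AnnData object the same mask could be applied as adata[mask.values, :].copy().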