Module kitchen.recipes
Functions for processing and manipulating .h5ad objects and automated processing of scRNA-seq data
Functions
def NMF_decoupler_ORA(adata, net, top_n_genes=None, max_FDRpval=0.05, out_dir='./', save_prefix='', save_output=False, get_ora_df_kwargs={}, decoupler_dotplot_facet_kwargs={})
-
Quickly process cNMF loadings through decoupler ORA and create plots for an initial look at pathways
decoupler tools used:
* "ORA" for biological pathways
Parameters
adata
:anndata.AnnData
- Object containing cNMF results in
adata.uns["cnmf_markers"]
net
:pd.DataFrame
- Network dataframe required to run ORA, e.g. msigdb, where msigdb is a pd.DataFrame from dc.get_resource.
top_n_genes
:int
or None
, optional(default=None)
- If None, use the entire adata.uns["cnmf_markers"] dataframe. If an integer, select the first top_n_genes rows of adata.uns["cnmf_markers"].
max_FDRpval
:float
or None
, optional(default=0.05)
- FDR p-value cutoff for plotting significant ORA pathways. If None, don't filter and show the top 20 terms by FDR p-value.
out_dir
:str
, optional(default="./")
- Path to directory to save plots to
save_prefix
:str
, optional(default="")
- String to prepend to output plots to make names unique
save_output
:bool
, optional(default=False)
- If True, save the output dataframe to out_dir/save_prefix_NMF_ORA.csv
get_ora_df_kwargs
:dict
, optional(default={})
- Keyword arguments to pass to dc.get_ora_df, e.g. source, target, n_background.
decoupler_dotplot_facet_kwargs
:dict
, optional(default={})
- Keyword arguments to pass to decoupler_dotplot_facet, e.g. top_n, cmap, dpi
Returns
enr_pvals
:pd.DataFrame
- ORA output as dataframe
Saves decoupler ORA dotplots to out_dir/.
def PCA_decoupler_GSEA(adata, net, n_pcs=None, max_FDRpval=0.05, out_dir='./', save_prefix='', save_output=False, plot=False, get_gsea_df_kwargs={}, decoupler_dotplot_facet_kwargs={})
-
Quickly process PCA loadings through decoupler GSEA and create plots for an initial look at pathways
decoupler tools used:
* "GSEA" for biological pathways
Parameters
adata
:anndata.AnnData
- Object containing PCA results in
adata.varm["PCs"]
net
:pd.DataFrame
- Network dataframe required to run GSEA, e.g. msigdb, where msigdb is a pd.DataFrame from dc.get_resource.
max_FDRpval
:float
or None
, optional(default=0.05)
- FDR p-value cutoff for plotting significant GSEA pathways. If None, don't filter and show the top 20 terms by FDR p-value.
out_dir
:str
, optional(default="./")
- Path to directory to save plots to
save_prefix
:str
, optional(default="")
- String to prepend to output plots to make names unique
save_output
:bool
, optional(default=False)
- If True, save the output dataframe to out_dir/save_prefix_PCA_GSEA.csv
get_gsea_df_kwargs
:dict
, optional(default={})
- Keyword arguments to pass to dc.get_gsea_df, e.g. source, target, times.
decoupler_dotplot_facet_kwargs
:dict
, optional(default={})
- Keyword arguments to pass to decoupler_dotplot_facet, e.g. top_n, cmap, dpi
Returns
enr_pvals
:pd.DataFrame
- GSEA output as dataframe
Saves decoupler GSEA dotplots to out_dir/.
Examples
nets = fetch_decoupler_resources(
    resources=["msigdb", "progeny", "collectri", "liana"],
    genome="human",
)
enr_pvals = PCA_decoupler_GSEA(
    adata=a,
    n_pcs=3,
    net=nets["msigdb"],
    max_FDRpval=0.05,
    out_dir="plots/",
    save_prefix="",
    save_output=False,
    get_gsea_df_kwargs={
        "source": "geneset",
        "target": "genesymbol",
        "times": 500,
        "min_n": 5,
    },
)
def calculate_cell_proportions(obs_df, celltype_column, groupby, celltype_subset=None)
-
Calculates proportions of each cell type in a column, per group
Parameters
obs_df
:pd.DataFrame
- Dataframe containing single cells as rows and metadata columns
celltype_column
:str
- Column of obs_df containing cell type labels for proportion calculation
groupby
:str
- Column of obs_df to group by for proportions (e.g. sample or patient)
celltype_subset
:list
of str
, optional(default=None)
- List of cell types in obs_df.celltype_column to keep in output. If None, return proportions of all cell types
Returns
props_df
:pd.DataFrame
- Dataframe with values of obs_df.groupby as rows and values of obs_df.celltype_column (or celltype_subset) as columns
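The shape of props_df described above can be illustrated with plain pandas. This is a minimal sketch, not the function's actual implementation: the toy data and column names are made up, and whether the real function subsets cell types before or after normalizing is not stated here, so the sketch normalizes first and then selects columns.

```python
import pandas as pd

# Toy stand-in for adata.obs: one row per cell (all values are made up)
obs_df = pd.DataFrame(
    {
        "sample": ["s1", "s1", "s1", "s1", "s2", "s2"],
        "celltype": ["T", "T", "B", "Myeloid", "B", "B"],
    }
)

# Rows are groups, columns are cell types, and each row sums to 1
props_df = pd.crosstab(obs_df["sample"], obs_df["celltype"], normalize="index")

# Assumed behavior of celltype_subset: keep only the requested columns
celltype_subset = ["T", "B"]
props_subset = props_df[celltype_subset]
print(props_subset)
```

Because normalization happens before the column selection, rows of props_subset no longer need to sum to 1.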
def cc_score(adata, layer=None, seed=18, verbose=True)
-
Calculates cell cycle scores and implied phase for each observation
Parameters
adata
:anndata.AnnData
- object containing transformed and normalized (arcsinh or log1p) counts in 'layer'.
layer
:str
, optional(default=None)
- key from adata.layers to use for cc phase calculation. Default None to use .X
seed
:int
, optional(default=18)
- random state for PCA, neighbors graph and clustering
verbose
:bool
, optional(default=True)
- print updates to console
Returns
adata is edited in place to add 'G2M_score', 'S_score', and 'phase' to .obs
def cellranger2(adata, expected=1500, upper_quant=0.99, lower_prop=0.1, label='CellRanger_2', verbose=True)
-
Labels cells using "knee point" method from CellRanger 2.1
Parameters
adata
:anndata.AnnData
- object containing unfiltered counts
expected
:int
, optional(default=1500)
- estimated number of real cells expected in dataset
upper_quant
:float
, optional(default=0.99)
- upper quantile of real cells to test
lower_prop
:float
, optional(default=0.1)
- proportion of the upper_quant total counts value to use as the total counts threshold
label
:str
, optional(default="CellRanger_2")
- how to name .obs column containing output
verbose
:bool
, optional(default=True)
- print updates to console
Returns
adata edited in place to add .obs[label] binary label
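How the three numeric parameters interact can be sketched with numpy. This follows the commonly described CellRanger 2 "knee point" rule (take the top `expected` barcodes by total counts, find their `upper_quant` quantile, multiply by `lower_prop`). The helper name knee_point_threshold is hypothetical, and the strict `>` comparison is an assumption rather than something this page states.

```python
import numpy as np

def knee_point_threshold(total_counts, expected=1500, upper_quant=0.99, lower_prop=0.1):
    """Hypothetical helper: total counts threshold under the knee point rule."""
    # Rank barcodes from highest to lowest total counts
    ranked = np.sort(np.asarray(total_counts, dtype=float))[::-1]
    # Keep the top `expected` barcodes, assumed to be real cells
    top = ranked[:expected]
    # Threshold is a fraction of the upper quantile of those barcodes
    return np.quantile(top, upper_quant) * lower_prop

# Toy total counts per barcode (made-up values)
total_counts = np.array([1000, 900, 800, 50, 20, 5, 3, 1])
thresh = knee_point_threshold(total_counts, expected=3)

# Binary label analogous to .obs[label]
is_cell = total_counts > thresh
print(thresh, is_cell)
```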
def cellranger3(adata, init_counts=15000, min_umi_frac_of_median=0.01, min_umis_nonambient=500, max_adj_pvalue=0.01)
-
Labels cells using "emptydrops" method from CellRanger 3.0
Parameters
adata
:anndata.AnnData
- object containing unfiltered counts
init_counts
:int
, optional(default=15000)
- initial total counts threshold for calling cells
min_umi_frac_of_median
:float
, optional(default=0.01)
- minimum total counts for testing barcodes as fraction of median counts for initially labeled cells
min_umis_nonambient
:float
, optional(default=500)
- minimum total counts for testing barcodes
max_adj_pvalue
:float
, optional(default=0.01)
- maximum p-value for cell calling after B-H correction
Returns
adata edited in place to add .obs["CellRanger_3"] binary label
and .obs["CellRanger_3_ll"] log-likelihoods for tested barcodes
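The max_adj_pvalue cutoff applies to p-values after Benjamini-Hochberg (B-H) correction. The following is a generic sketch of that correction, not the code used by cellranger3: the helper name bh_adjust and the toy p-values are hypothetical, and the `<=` comparison is an assumption.

```python
import numpy as np

def bh_adjust(pvals):
    """Hypothetical helper: Benjamini-Hochberg adjusted p-values."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Scale each sorted p-value by n / rank
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.minimum(adjusted, 1.0)
    # Return values in the original input order
    out = np.empty(n)
    out[order] = adjusted
    return out

# Toy p-values for five tested barcodes (made-up values)
pvals = [0.001, 0.008, 0.039, 0.041, 0.6]
adj = bh_adjust(pvals)

# Barcodes passing the default max_adj_pvalue of 0.01
is_cell = adj <= 0.01
print(adj, is_cell)
```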
def dim_reduce(adata, layer=None, use_rep=None, clust_resolution=1.0, paga=True, seed=18, verbose=True)
-
Reduces dimensions of single-cell dataset using standard methods
Parameters
adata
:anndata.AnnData
- object containing preprocessed counts matrix
layer
:str
, optional(default=None)
- layer to use; default None for .X
use_rep
:str
, optional(default=None)
- .obsm key to use for neighbors graph instead of PCA; default None, generate new PCA from layer
clust_resolution
:float
, optional(default=1.0)
- resolution as fraction on [0.0, 1.0] for leiden clustering. default 1.0
paga
:bool
, optional(default=True)
- run PAGA to seed UMAP embedding
seed
:int
, optional(default=18)
- random state for PCA, neighbors graph and clustering
verbose
:bool
, optional(default=True)
- print updates to console
Returns
adata is edited in place, adding PCA, neighbors graph, PAGA, and UMAP
def plot_genes(adata, de_method='t-test_overestim_var', layer='log1p_norm', groupby='leiden', key_added='rank_genes_groups', ambient=False, plot_type=None, n_genes=5, dendrogram=True, cmap='Greys', figsize_scale=1.0, save_to='de.png', verbose=True, **kwargs)
-
Calculates and plots rank_genes_groups results
Parameters
adata
:anndata.AnnData
- object containing preprocessed and dimension-reduced counts matrix
de_method
:str
, optional(default="t-test_overestim_var")
- one of "t-test", "t-test_overestim_var", "wilcoxon"
layer
:str
, optional(default="log1p_norm")
- one of adata.layers to use for DEG analysis. Recommended to use 'raw_counts' if de_method=='wilcoxon' and 'log1p_norm' if de_method=='t-test'.
groupby
:str
, optional(default="leiden")
- adata.obs key to group cells by
key_added
:str
, optional(default="rank_genes_groups")
- adata.uns key to add DEG results to
ambient
:bool
, optional(default=False)
- include ambient genes as a group in the plot/output dictionary
plot_type
:str
, optional(default=None)
- One of "dotplot", "matrixplot", "dotmatrix", "stacked_violin", or "heatmap". If None, don't plot, just return DEGs as dictionary.
n_genes
:int
, optional(default=5)
- number of top genes per group to show
dendrogram
:bool
, optional(default=True)
- show dendrogram of cluster similarity
cmap
:str
, optional(default="Greys")
- matplotlib colormap for dots
figsize_scale
:float
, optional(default=1.0)
- scale dimensions of the figure
save_to
:str
, optional(default="de.png")
- string to add to plot name using scanpy plot defaults
verbose
:bool
, optional(default=True)
- print updates to console
**kwargs
:optional
- keyword args to add to
custom_heatmap()
Returns
markers
:dict
- dictionary of top n_genes DEGs per group
myplot
:matplotlib.Figure
- custom_heatmap object if plot_type is not None
def plot_genes_cnmf(adata, plot_type='heatmap', groupby='leiden', attr='varm', keys='cnmf_spectra', indices=None, n_genes=5, dendrogram=True, figsize_scale=1.0, cmap='Greys', save_to='de_cnmf.png', **kwargs)
-
Calculates and plots top cNMF gene loadings
Parameters
adata
:anndata.AnnData
- object containing preprocessed and dimension-reduced counts matrix
plot_type
:str
, optional(default="heatmap")
- One of "dotplot", "matrixplot", "dotmatrix", "stacked_violin", or "heatmap".
groupby
:str
, optional(default="leiden")
- .obs key to group cells by
attr
:str {"var", "obs", "uns", "varm", "obsm"}
, optional(default="varm")
- attribute of adata that contains the score
keys
:str
or list
of str
, optional(default="cnmf_spectra")
- scores to look up an array from the attribute of adata
indices
:list
of int
, optional(default=None)
- column indices of keys for which to plot (e.g. [0,1,2] for first three keys)
n_genes
:int
, optional(default=5)
- number of top genes per group to show
dendrogram
:bool
, optional(default=True)
- show dendrogram of cluster similarity
figsize_scale
:float
, optional(default=1.0)
- scale dimensions of the figure
cmap
:str
, optional(default="Greys")
- valid color map for the plot
save_to
:str
, optional(default="de_cnmf.png")
- string to add to plot name using scanpy plot defaults
**kwargs
:optional
- keyword args to add to sc.pl.matrixplot, sc.pl.dotplot, or sc.pl.heatmap
Returns
markers
:dict
- dictionary of top n_genes features per NMF factor
myplot
:matplotlib.Figure
- custom_heatmap object
def scDEG_decoupler_GSEA(adata, uns_key, net, max_FDRpval=0.05, out_dir='./', save_prefix='', save_output=False, get_gsea_df_kwargs={}, decoupler_dotplot_facet_kwargs={})
-
Quickly process DEGs from scRNA dataset through decoupler GSEA and create plots for an initial look at pathways
decoupler tools used:
* "GSEA" for biological pathways
Parameters
adata
:anndata.AnnData
- Object containing rank_genes_groups results in
adata.uns
uns_key
:str
- Key to adata.uns containing rank_genes_groups results
net
:pd.DataFrame
- Network dataframe required to run GSEA, e.g. msigdb, where msigdb is a pd.DataFrame from dc.get_resource.
max_FDRpval
:float
, optional(default=0.05)
- FDR p-value cutoff for using genes and plotting significant GSEA pathways.
out_dir
:str
, optional(default="./")
- Path to directory to save plots to
save_prefix
:str
, optional(default="")
- String to prepend to output plots to make names unique
save_output
:bool
, optional(default=False)
- If True, save the output dataframe to out_dir/save_prefix_NMF_GSEA.csv
get_gsea_df_kwargs
:dict
, optional(default={})
- Keyword arguments to pass to dc.get_gsea_df, e.g. source, target, times.
decoupler_dotplot_facet_kwargs
:dict
, optional(default={})
- Keyword arguments to pass to decoupler_dotplot_facet, e.g. top_n, cmap, dpi
Returns
enr_pvals
:pd.DataFrame
- GSEA output as dataframe
Saves decoupler GSEA dotplots to out_dir/.
Examples
nets = fetch_decoupler_resources(
    resources=["msigdb", "progeny", "collectri", "liana"],
    genome="human",
)
frame = scDEG_decoupler_GSEA(
    a,
    uns_key="rank_genes_groups_{}".format(key),
    net=nets["msigdb"],
    max_FDRpval=0.05,
    out_dir="plots/",
    save_prefix=f"HALLMARK_scATAC_{key}_",
    save_output=False,
    get_gsea_df_kwargs={
        "source": "geneset",
        "target": "genesymbol",
        "times": 200,
        "min_n": 5,
        "seed": 8,
    },
    decoupler_dotplot_facet_kwargs={
        "top_n": 10,
        "dpi": 200,
        "cmap": "Reds_r",
        "figsize_scale": 1.2,
    },
)
def scDEG_decoupler_ORA(adata, uns_key, net, max_FDRpval=0.05, min_logFC=0.5849625007211562, out_dir='./', save_prefix='', save_output=False, get_ora_df_kwargs={}, decoupler_dotplot_facet_kwargs={})
-
Quickly process DEGs from scRNA dataset through decoupler ORA and create plots for an initial look at pathways
decoupler tools used:
* "ORA" for biological pathways
Parameters
adata
:anndata.AnnData
- Object containing rank_genes_groups results in
adata.uns
uns_key
:str
- Key to adata.uns containing rank_genes_groups results
net
:pd.DataFrame
- Network dataframe required to run ORA, e.g. msigdb, where msigdb is a pd.DataFrame from dc.get_resource.
max_FDRpval
:float
, optional(default=0.05)
- FDR p-value cutoff for using genes and plotting significant ORA pathways.
min_logFC
:float
, optional(default=0.5849625007211562)
- logFC cutoff for using genes and plotting significant ORA pathways. The default equals log2(1.5).
out_dir
:str
, optional(default="./")
- Path to directory to save plots to
save_prefix
:str
, optional(default="")
- String to prepend to output plots to make names unique
save_output
:bool
, optional(default=False)
- If True, save the output dataframe to out_dir/save_prefix_NMF_ORA.csv
get_ora_df_kwargs
:dict
, optional(default={})
- Keyword arguments to pass to dc.get_ora_df, e.g. source, target, n_background.
decoupler_dotplot_facet_kwargs
:dict
, optional(default={})
- Keyword arguments to pass to decoupler_dotplot_facet, e.g. top_n, cmap, dpi
Returns
enr_pvals
:pd.DataFrame
- ORA output as dataframe
Saves decoupler ORA dotplots to out_dir/.
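The min_logFC default in the signature of scDEG_decoupler_ORA, 0.5849625007211562, is log2(1.5), which corresponds to a 1.5-fold change if the logFC values are log2 fold changes (an assumption here). The sketch below checks that arithmetic and applies both cutoffs to a made-up DEG list; whether the real function filters on signed or absolute logFC, and with strict or non-strict comparisons, is not stated, so treat the filter as illustrative only.

```python
import math

# The signature default corresponds to a 1.5-fold change on a log2 scale
min_logFC = math.log2(1.5)
max_FDRpval = 0.05
print(min_logFC)

# Toy DEG table (gene names and values are made up)
degs = [
    {"gene": "GeneA", "logFC": 1.20, "padj": 0.001},
    {"gene": "GeneB", "logFC": 0.40, "padj": 0.001},
    {"gene": "GeneC", "logFC": 2.00, "padj": 0.200},
]

# Assumed filter: keep genes passing both the FDR and logFC cutoffs
selected = [
    d["gene"]
    for d in degs
    if d["padj"] <= max_FDRpval and d["logFC"] >= min_logFC
]
print(selected)
```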
def score_gene_signatures(adata, signatures_dict, sig_subset=None, layer=None)
-
Score gene signatures in AnnData using
sc.tl.score_genes
Parameters
adata
:anndata.AnnData
- Object containing gene expression data for scoring
signatures_dict
:dict
- Dictionary of signature names (keys) and constituent genes (values), with gene
names matching
adata.var_names
sig_subset
:list
, optional(default=None)
- Subset of keys from signatures_dict to attempt scoring. If None, score all signatures from signatures_dict
layer
:str
, optional(default=None)
- Layer key from adata.layers to use for scoring. If None, use adata.X.
Returns
adata is edited, with signature scores added to .obs
failed_sigs
:list
- List of signatures not scored properly
scored_sigs
:list
- List of signatures scored successfully
def subset_adata(adata, subset, verbose=True)
-
Subsets AnnData object on one or more .obs columns
Columns should contain 0/False for cells to throw out, and 1/True for cells to keep. Keeps union of all labels provided in subset.
Parameters
adata
:anndata.AnnData
- the data
subset
:str
or list
of str
- adata.obs labels to use for subsetting. Labels must be binary (0, "0", False, "False" to toss; 1, "1", True, "True" to keep). Multiple labels will keep the union.
verbose
:bool
, optional(default=True)
- print updates to console
Returns
adata
:anndata.AnnData
- new anndata object as subset of
adata
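The union rule described for subset_adata can be sketched on a plain .obs-like dataframe. This is an illustration under stated assumptions, not the function's source: the column names and values are made up, and comparing string representations is a simplification chosen here so that 1, "1", True and "True" all count as "keep".

```python
import pandas as pd

# Toy stand-in for adata.obs with two binary label columns (made-up values)
obs = pd.DataFrame(
    {"keep_a": [1, 0, 0, 1], "keep_b": ["True", "True", "False", "False"]},
    index=["cell1", "cell2", "cell3", "cell4"],
)
subset = ["keep_a", "keep_b"]

# Union: a cell is kept if it is flagged in at least one column
keep_mask = pd.Series(False, index=obs.index)
for col in subset:
    # Compare string representations so 1, "1", True and "True" all match
    keep_mask |= obs[col].astype(str).isin(["1", "True"])

kept = obs.index[keep_mask].tolist()
print(kept)
```

With an AnnData object, a boolean mask like this would typically be applied as adata[keep_mask.values, :].copy() to obtain a new object.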