Tools Module
The tools module provides utility functions that support analytical workflows in single-cell and multi-omics data processing. These functions serve as building blocks for higher-level analysis, offering efficient, reusable operations for common computational tasks.
Highly Variable Genes (HVG)
The module includes methods for identifying highly variable genes, which are crucial for downstream analyses such as clustering and dimensionality reduction. These functions help select informative genes by computing variability metrics across cells.
Principal Component Analysis (PCA)
PCA is a widely used dimensionality reduction technique that captures the directions of greatest variance in the dataset. The tools module provides an efficient implementation for computing PCA, letting users reduce data complexity while preserving important biological signals.
Clustering
The module supports clustering techniques to group cells based on gene expression patterns. These methods help uncover underlying cellular heterogeneity and identify distinct cell populations, facilitating biological interpretation of single-cell data.
Together, these functions make it easier to perform robust, reproducible analyses across different stages of research.
Juscan.Tl.clustering!
Juscan.Tl.highly_variable_genes
Juscan.Tl.highly_variable_genes!
Juscan.Tl.log_transform!
Juscan.Tl.logp1_transform!
Juscan.Tl.pca!
Juscan.Tl.subset_to_hvg!
Juscan.Tl.umap!
highly variable genes
Juscan.Tl.subset_to_hvg! — Function
subset_to_hvg!(adata::AnnData;
layer::Union{String,Nothing} = nothing,
n_top_genes::Int=2000,
batch_key::Union{String,Nothing} = nothing,
span::Float64=0.3,
verbose::Bool=true
)
Calculates highly variable genes with highly_variable_genes! and subsets the AnnData object to the calculated HVGs. For a description of the input arguments, see the documentation of highly_variable_genes!.
Arguments
- adata: AnnData object
Keyword arguments
- layer: optional; which layer to use for calculating the HVGs. The function assumes this is a layer of counts. If layer is not provided, adata.X is used.
- n_top_genes: optional; desired number of highly variable genes. Default: 2000.
- batch_key: optional; key under which to look for the batch indices in adata.obs. If not provided, the data is treated as one batch.
- span: span to use in the loess fit for the mean-variance local regression. See the Loess.jl docs for details.
- verbose: whether to print information on the current status.
Returns
The adata object, subset to the calculated HVGs both in the count matrix or layer used for the HVG calculation and in the adata.var dictionary.
Juscan.Tl.highly_variable_genes — Function
highly_variable_genes(adata::AnnData;
layer::Union{String,Nothing} = nothing,
n_top_genes::Int=2000,
batch_key::Union{String,Nothing} = nothing,
span::Float64=0.3
)
Computes highly variable genes per batch, following the scanpy and Seurat v3 workflows, and returns a dictionary with the information on the joint HVGs. For the in-place version, see highly_variable_genes!.
More specifically, it is a Julia re-implementation of the corresponding scanpy function. For implementation details, please check the scanpy/Seurat documentation or the source code of the lower-level _highly_variable_genes_seurat_v3 function in this package. Results are almost identical to those of the scanpy function. The differences have been traced back to the local regression for the mean-variance relationship implemented in the Loess.jl package, which differs slightly from the corresponding Python implementation.
Arguments
- adata: AnnData object
Keyword arguments
- layer: optional; which layer to use for calculating the HVGs. The function assumes this is a layer of counts. If layer is not provided, adata.X is used.
- n_top_genes: optional; desired number of highly variable genes. Default: 2000.
- batch_key: optional; key under which to look for the batch indices in adata.obs. If not provided, the data is treated as one batch.
- span: span to use in the loess fit for the mean-variance local regression. See the Loess.jl docs for details.
- replace_hvgs: whether to replace the HVG information if HVGs have already been calculated. If false, the new values are added with a "_1" suffix. Default: true.
- verbose: whether to print information on the current status.
Returns
A dictionary with information on the highly variable genes, containing the following keys:
- highly_variable: vector of Bools indicating which genes are highly variable
- highly_variable_rank: rank of the highly variable genes according to (corrected) variance
- means: vector with the mean of each gene
- variances: vector with the variance of each gene
- variances_norm: normalized variance of each gene
- highly_variable_nbatches: if there are batches in the dataset, the number of batches in which each highly variable gene was detected as highly variable
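To make the shape of the returned dictionary more concrete, the following is a deliberately simplified sketch that ranks genes by their variance-to-mean ratio and fills a dictionary with some of the keys listed above. It is not the package's method: the loess-based variance normalization and the per-batch handling are omitted, and the helper name simple_hvg and the use of -1 for unselected genes are assumptions made only for this illustration.

```julia
using Statistics

# Hypothetical helper, not part of Juscan: flags the n_top_genes genes with the
# largest variance-to-mean ratio. The loess fit of the mean-variance trend used
# by the Seurat v3 flavor is intentionally left out of this sketch.
function simple_hvg(counts::AbstractMatrix{<:Real}; n_top_genes::Int=2000)
    n_genes = size(counts, 2)                  # rows are cells, columns are genes
    n_top = min(n_top_genes, n_genes)
    means = vec(mean(counts; dims=1))
    variances = vec(var(counts; dims=1))
    # Variance-to-mean ratio; genes with zero mean get a score of zero
    score = [m > 0 ? v / m : 0.0 for (m, v) in zip(means, variances)]
    order = sortperm(score; rev=true)          # gene indices, most variable first
    highly_variable = falses(n_genes)
    highly_variable[order[1:n_top]] .= true
    ranks = fill(-1, n_genes)                  # -1 marks genes that were not selected
    ranks[order[1:n_top]] .= 1:n_top
    return Dict(
        "highly_variable" => highly_variable,
        "highly_variable_rank" => ranks,
        "means" => means,
        "variances" => variances,
    )
end

counts = [1 0 5; 1 0 0; 1 2 7; 1 0 0]          # 4 cells x 3 genes
hvg = simple_hvg(counts; n_top_genes=2)
hvg["highly_variable"]                         # genes 2 and 3 are flagged
```

The first gene is constant across cells, so its variance is zero and it is the one left out when two genes are requested.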
Juscan.Tl.highly_variable_genes! — Function
highly_variable_genes!(adata::AnnData;
layer::Union{String,Nothing} = nothing,
n_top_genes::Int=2000,
batch_key::Union{String,Nothing} = nothing,
span::Float64=0.3,
replace_hvgs::Bool=true,
verbose::Bool=false
)
Computes highly variable genes per batch in-place, following the scanpy and Seurat v3 workflows. This version adds a dictionary with information on the highly variable genes directly to adata.var and returns the modified AnnData object. For details, see the not-in-place version (?highly_variable_genes).
principal component analysis
Juscan.Tl.pca! — Function
pca!(adata::Muon.AnnData; layer="log_transformed", n_pcs=1000, key_added="pca", verbose=true)
Performs Principal Component Analysis (PCA) on the specified layer of an AnnData object and stores the result in adata.obsm.
If the specified layer is missing, the function will attempt to log-transform a normalized layer, or normalize and log-transform the raw counts if needed.
Arguments
- adata::Muon.AnnData: The annotated data object on which to perform PCA.
Keyword Arguments
- layer::String = "log_transformed": The data layer to use for PCA. Defaults to "log_transformed".
- n_pcs::Int = 1000: The number of principal components to compute. Automatically clipped to the smallest matrix dimension if too large.
- key_added::String = "pca": The key under which to store the PCA result in adata.obsm.
- verbose::Bool = true: Whether to print progress messages.
Returns
The modified AnnData object with PCA results stored in adata.obsm[key_added].
Notes
- This function performs automatic preprocessing if the requested layer is not present.
- The PCA is computed via SVD on standardized data.
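The two notes above, together with the clipping of n_pcs, can be illustrated with a small standard-library sketch of PCA via SVD on standardized data. This is an assumption-laden illustration rather than the package's code: the helper name pca_sketch, the dense input matrix, and the handling of constant columns are all choices made here.

```julia
using LinearAlgebra, Statistics

# Hypothetical helper, not part of Juscan: PCA via SVD on standardized data.
function pca_sketch(X::AbstractMatrix{<:Real}; n_pcs::Int=1000)
    n_pcs = min(n_pcs, minimum(size(X)))       # clip to the smallest matrix dimension
    mu = mean(X; dims=1)
    sigma = std(X; dims=1)
    sigma[sigma .== 0] .= 1                    # constant columns stay at zero after centering
    Z = (X .- mu) ./ sigma                     # standardize each gene (column)
    F = svd(Z)                                 # Z = U * Diagonal(S) * Vt
    scores = F.U[:, 1:n_pcs] .* F.S[1:n_pcs]'  # cell coordinates in PC space
    return scores, F.Vt[1:n_pcs, :]            # scores and loadings
end

X = [2.0 0.0 1.0; 4.0 1.0 1.0; 6.0 0.0 1.0; 8.0 1.0 1.0]   # 4 cells x 3 genes
scores, loadings = pca_sketch(X; n_pcs=1000)
size(scores)   # (4, 3): n_pcs was clipped from 1000 to 3
```

Multiplying the left singular vectors by the singular values gives the per-cell coordinates, which is the kind of matrix one would expect to find under adata.obsm[key_added].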
Juscan.Tl.umap! — Function
umap!(adata::Muon.AnnData; layer="log_transformed", use_pca=nothing, n_pcs=100, key_added="umap", verbose=true, kwargs...)
Computes a UMAP embedding from the data in the specified layer or PCA representation and stores the result in adata.obsm.
If PCA is requested via use_pca, it will be computed automatically if not already present.
Arguments
- adata::Muon.AnnData: The annotated data object on which to compute UMAP.
Keyword Arguments
- layer::String = "log_transformed": The layer to use as input if use_pca is not specified.
- use_pca::Union{String, Nothing} = nothing: If specified, use this key in adata.obsm as PCA input. If the key is missing, PCA will be computed.
- n_pcs::Int = 100: Number of principal components to use if PCA needs to be computed.
- key_added::String = "umap": The key under which to store the UMAP embedding.
- verbose::Bool = true: Whether to print progress messages.
- kwargs...: Additional keyword arguments passed to UMAP.UMAP_().
Returns
The modified AnnData object with UMAP results stored in:
- adata.obsm[key_added]: The UMAP embedding.
- adata.obsm["knns"]: K-nearest neighbors matrix.
- adata.obsm["knn_dists"]: KNN distance matrix.
- adata.obsp["fuzzy_neighbor_graph"]: Fuzzy graph representation.
Notes
- Automatically performs normalization and log transformation if necessary.
- Uses the UMAP.jl package under the hood.
Juscan.Tl.log_transform! — Function
log_transform!(adata::Muon.AnnData; layer="normalized", key_added="log_transformed", verbose=false)
Applies a log transformation to the specified data layer of an AnnData object and stores the result in adata.layers.
If the specified layer is missing, the function defaults to applying a log(1 + x) transformation to adata.X.
Arguments
- adata::Muon.AnnData: The data object to transform.
Keyword Arguments
- layer::String = "normalized": The layer to transform. Must exist in adata.layers.
- key_added::String = "log_transformed": The key to store the result under in adata.layers.
- verbose::Bool = false: Whether to print messages during the process.
Returns
The modified AnnData object with the log-transformed data added to adata.layers[key_added].
Notes
- The transformation is log(x + ϵ), where ϵ is a small constant to avoid log(0).
- For the default fallback behavior, see logp1_transform!.
Juscan.Tl.logp1_transform! — Function
logp1_transform!(adata::Muon.AnnData; layer=nothing, key_added="log1_transformed", verbose=false)
Applies a log(1 + x) transformation to the specified layer or the main data matrix adata.X in an AnnData object. The result is stored in adata.layers[key_added].
Arguments
- adata::Muon.AnnData: The annotated data object to transform.
Keyword Arguments
- layer::Union{String, Nothing} = nothing: The name of the data layer to transform. If nothing or the layer is missing, uses adata.X.
- key_added::AbstractString = "log1_transformed": The name under which to store the transformed result in adata.layers.
- verbose::Bool = false: Whether to print transformation messages.
Returns
The modified AnnData object with the log(1 + x) transformed data stored in adata.layers[key_added].
Notes
- This transformation is commonly used to stabilize variance and reduce the effect of outliers.
- Compared to log_transform!, this version adds 1 to the data before taking the logarithm, making it more robust to zero entries.
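The difference between the two transformations shows up most clearly at zero entries. The sketch below uses plain matrices rather than AnnData objects; the helper names log_eps and log_plus1 and the value ϵ = 1e-8 are assumptions for illustration only, since the constant used by log_transform! is not stated here.

```julia
# Hypothetical helpers, not part of Juscan; the value of ϵ is an assumption.
log_eps(X; ϵ=1e-8) = log.(X .+ ϵ)      # log(x + ϵ): zeros map to large negative values
log_plus1(X) = log1p.(X)               # log(1 + x): zeros stay at zero

X = [0.0 1.0; 9.0 99.0]
log_plus1(X)[1, 1]     # 0.0
log_eps(X)[1, 1]       # log(1e-8), about -18.42
```

With log(1 + x), a zero count remains exactly zero, whereas log(x + ϵ) turns it into a large negative value whose size depends entirely on the choice of ϵ.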
clustering
Juscan.Tl.clustering! — Function
clustering!(data::Muon.AnnData; kwargs...)
Performs clustering on a Muon.AnnData object and stores the results in place.
This function supports modularity-based graph clustering ("mc") and K-means clustering ("km"). By default, modularity clustering is applied using a shared nearest neighbor (SNN) graph constructed from PCA-reduced data.
Arguments
- data::Muon.AnnData: The annotated data matrix to cluster. The object will be modified in place.
Keyword Arguments
- method::AbstractString = "mc": Clustering method. Options are "mc" (modularity clustering) or "km" (K-means).
- reduction::Union{AbstractString, Symbol} = :auto: Dimensionality reduction method to use. Accepts "pca" or "harmony". When :auto, it defaults to "pca" (support for "harmony" is planned).
- use_pca::Union{AbstractString, Integer} = "pca_cut": Number of principal components to use, or the key in obsm specifying a PCA representation.
- tree_K::Integer = 20: Number of neighbors to use when building the SNN graph. Relevant only for "mc" clustering.
- resolution::Union{Symbol, Real, AbstractRange} = :auto: Resolution(s) for modularity optimization. :auto uses 0.2:0.1:2.0.
- cluster_K::Union{Nothing, Integer} = nothing: Number of clusters for K-means. If nothing, it will be auto-determined.
- cluster_K_max::Union{Nothing, Integer} = 30: Maximum number of clusters to try for automatic K-means clustering.
- dist::AbstractString = "Euclidean": Distance metric for K-means, e.g., "Euclidean".
- network::AbstractString = "SNN": Type of graph network to construct. Currently supports "SNN".
- random_starts_number::Integer = 10: Number of random initializations for clustering (modularity clustering only).
- iter_number::Integer = 10: Maximum number of iterations for the clustering optimization.
- prune::AbstractFloat = 1/15: Pruning factor for the graph. Must be between 0 and 1.
- seed::Integer = -1: Random seed. Use a negative number to skip setting the seed.
Returns
Nothing. The clustering results are stored in the obs field of the input AnnData object, typically under a key like "clusters".
Example
using Juscan
clustering!(adata; method="mc", use_pca="X_pca", resolution=1.0)
Notes
- "harmony" support is planned but not yet implemented.
- The "mc" method builds a neighbor graph and performs community detection; "km" performs K-means clustering in the reduced space.
- The method parameter only supports "mc" and "km"; invalid inputs trigger a warning.
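As a rough illustration of what a shared nearest neighbor graph with a pruning factor can look like, the sketch below builds one from a small coordinate matrix using only the standard library. The Jaccard weighting of neighbor sets, the brute-force distance computation, and the helper name snn_sketch are assumptions made for this example; how the package constructs its graph from tree_K and prune may differ in detail.

```julia
using LinearAlgebra

# Hypothetical helper, not part of Juscan: builds a shared-nearest-neighbor (SNN)
# weight matrix from cell coordinates (rows are cells), using Jaccard overlap.
function snn_sketch(coords::AbstractMatrix{<:Real}; k::Int=20, prune::Real=1/15)
    n = size(coords, 1)
    k = min(k, n)
    # Brute-force pairwise Euclidean distances; fine for a small illustration
    D = [norm(coords[i, :] .- coords[j, :]) for i in 1:n, j in 1:n]
    # Each cell's k nearest neighbors (the cell itself is included, distance 0)
    neighbors = [Set(sortperm(D[i, :])[1:k]) for i in 1:n]
    W = zeros(n, n)
    for i in 1:n, j in (i + 1):n
        shared = length(intersect(neighbors[i], neighbors[j]))
        jaccard = shared / (2k - shared)       # |A ∩ B| / |A ∪ B| with |A| = |B| = k
        if jaccard >= prune                    # drop weak edges below the pruning cutoff
            W[i, j] = W[j, i] = jaccard
        end
    end
    return W
end

# Two well-separated groups of three cells each
coords = [0.0 0.0; 0.1 0.0; 0.0 0.1; 5.0 5.0; 5.1 5.0; 5.0 5.1]
W = snn_sketch(coords; k=3)
```

In this toy input, cells within a group share all of their neighbors and receive full-weight edges, while cells from different groups share none and stay unconnected. A community detection step such as modularity optimization would then operate on a graph of this kind.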