Tools Module

The tools module (Juscan.Tl) provides utility functions for single-cell and multi-omics data processing: highly variable gene selection, log transformation, PCA, UMAP, and clustering. These functions operate on an AnnData object and serve as reusable building blocks for higher-level analysis.

Highly Variable Genes (HVG)

The module includes methods for identifying highly variable genes, which are crucial for downstream analyses such as clustering and dimensionality reduction. These functions help select informative genes by computing variability metrics across cells.

Principal Component Analysis (PCA)

PCA is a widely used dimensionality reduction technique that captures the most significant variations in the dataset. The tools module provides efficient implementations for computing PCA, enabling users to reduce data complexity while preserving important biological signals.

Clustering

The module supports clustering techniques to group cells based on gene expression patterns. These methods help uncover underlying cellular heterogeneity and identify distinct cell populations, facilitating biological interpretation of single-cell data.

By providing these fundamental tools, the tools module enhances data processing workflows, making it easier to perform robust, reproducible analyses across different stages of research.

highly variable genes

Juscan.Tl.subset_to_hvg! Function
subset_to_hvg!(adata::AnnData;
    layer::Union{String,Nothing} = nothing,
    n_top_genes::Int=2000,
    batch_key::Union{String,Nothing} = nothing,
    span::Float64=0.3,
    verbose::Bool=true
)

Calculates highly variable genes with highly_variable_genes! and subsets the AnnData object to the calculated HVGs. For a description of the input arguments, see highly_variable_genes!.

Arguments

  • adata: AnnData object

Keyword arguments

  • layer: optional; which layer to use for calculating the HVGs. Function assumes this is a layer of counts. If layer is not provided, adata.X is used.
  • n_top_genes: optional; desired number of highly variable genes. Default: 2000.
  • batch_key: optional; key where to look for the batch indices in adata.obs. If not provided, data is treated as one batch.
  • span: span to use in the loess fit for the mean-variance local regression. See the Loess.jl docs for details.
  • verbose: whether or not to print info on current status

Returns

  • the adata object, subset to the calculated HVGs both in the count matrix or layer used for the HVG calculation and in adata.var.
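The subsetting step can be pictured with plain arrays. The following is a minimal sketch, not the package's implementation: it assumes a cells x genes count matrix and a Bool mask like the one stored under highly_variable, and all variable names are made up for illustration.

```julia
# Hypothetical toy data: 4 cells x 5 genes of counts, plus per-gene names.
counts = [0 3 1 0 7;
          2 0 1 0 9;
          0 5 1 1 0;
          1 0 1 0 4]
gene_names = ["g1", "g2", "g3", "g4", "g5"]

# Bool mask of the kind stored under the `highly_variable` key (values assumed).
highly_variable = [false, true, false, false, true]

# Keep only the flagged gene columns and the matching per-gene annotations.
counts_hvg = counts[:, highly_variable]
gene_names_hvg = gene_names[highly_variable]
```

Both the matrix and the annotations are indexed with the same mask, so they stay aligned after subsetting.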
Juscan.Tl.highly_variable_genes Function
highly_variable_genes(adata::AnnData;
    layer::Union{String,Nothing} = nothing,
    n_top_genes::Int=2000,
    batch_key::Union{String,Nothing} = nothing,
    span::Float64=0.3
    )

Computes highly variable genes per batch, following the scanpy and Seurat v3 workflows, and returns a dictionary with the information on the joint HVGs. For the in-place version, see highly_variable_genes!.

More specifically, it is a Julia re-implementation of the corresponding scanpy function. For implementation details, please check the scanpy/Seurat documentation or the source code of the lower-level _highly_variable_genes_seurat_v3 function in this package. Results are almost identical to those of the scanpy function. The differences have been traced back to the local regression for the mean-variance relationship implemented in the Loess.jl package, which differs slightly from the corresponding Python implementation.

Arguments

  • adata: AnnData object

Keyword arguments

  • layer: optional; which layer to use for calculating the HVGs. Function assumes this is a layer of counts. If layer is not provided, adata.X is used.
  • n_top_genes: optional; desired number of highly variable genes. Default: 2000.
  • batch_key: optional; key where to look for the batch indices in adata.obs. If not provided, data is treated as one batch.
  • span: span to use in the loess fit for the mean-variance local regression. See the Loess.jl docs for details.
  • replace_hvgs: whether to replace existing HVG information if HVGs have already been calculated. If false, the new values are added with a "_1" suffix. Default: true. This keyword appears only in the highly_variable_genes! signature.
  • verbose: whether or not to print info on current status

Returns

  • a dictionary containing information on the highly variable genes, with the following keys:
    • highly_variable: vector of Bools indicating which genes are highly variable
    • highly_variable_rank: rank of the highly variable genes according to (corrected) variance
    • means: vector with means of each gene
    • variances: vector with variances of each gene
    • variances_norm: normalized variances of each gene
    • highly_variable_nbatches: if there are batches in the dataset, logs the number of batches in which each highly variable gene was actually detected as highly variable.
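To make the returned keys concrete, here is a simplified sketch of how such a dictionary could be assembled for a single batch. It is not the Seurat v3 procedure: the loess-based variance normalization is replaced by a plain variance-to-mean ratio as a stand-in, the cells x genes orientation is assumed, and the rank is given for every gene here, which may differ from what the package stores.

```julia
using Statistics

# Toy count matrix, cells x genes (orientation assumed).
counts = Float64[0 3 1 0 7;
                 2 0 1 0 9;
                 0 5 1 1 0;
                 1 0 1 0 4]
n_top_genes = 2

means = vec(mean(counts; dims=1))
variances = vec(var(counts; dims=1))

# Stand-in for the loess-based normalization: variance divided by mean.
variances_norm = variances ./ means

# Order genes by decreasing normalized variance and flag the top n.
order = sortperm(variances_norm; rev=true)
highly_variable = falses(length(means))
highly_variable[order[1:n_top_genes]] .= true

# Position of each gene in that ordering (1 = most variable).
ranks = invperm(order)

hvg_info = Dict(
    "highly_variable" => collect(highly_variable),
    "highly_variable_rank" => ranks,
    "means" => means,
    "variances" => variances,
    "variances_norm" => variances_norm,
)
```

With several batches, the same flags would be computed per batch and summed per gene to obtain a count comparable to highly_variable_nbatches.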
Juscan.Tl.highly_variable_genes! Function
highly_variable_genes!(adata::AnnData;
    layer::Union{String,Nothing} = nothing,
    n_top_genes::Int=2000,
    batch_key::Union{String,Nothing} = nothing,
    span::Float64=0.3,
    replace_hvgs::Bool=true,
    verbose::Bool=false
    )

Computes highly variable genes per batch in place, following the scanpy and Seurat v3 workflows. This version adds a dictionary containing information on the highly variable genes directly to adata.var and returns the modified AnnData object. For details, see the not-in-place version, highly_variable_genes.


principal component analysis

Juscan.Tl.pca! Function
pca!(adata::Muon.AnnData; layer="log_transformed", n_pcs=1000, key_added="pca", verbose=true)

Performs Principal Component Analysis (PCA) on the specified layer of an AnnData object and stores the result in adata.obsm.

If the specified layer is missing, the function will attempt to log-transform a normalized layer, or normalize and log-transform the raw counts if needed.

Arguments

  • adata::Muon.AnnData: The annotated data object on which to perform PCA.

Keyword Arguments

  • layer::String = "log_transformed": The data layer to use for PCA. Defaults to "log_transformed".
  • n_pcs::Int = 1000: The number of principal components to compute. Automatically clipped to the smallest matrix dimension if too large.
  • key_added::String = "pca": The key under which to store the PCA result in adata.obsm.
  • verbose::Bool = true: Whether to print progress messages.

Returns

The modified AnnData object with PCA results stored in adata.obsm[key_added].

Notes

  • This function performs automatic preprocessing if the requested layer is not present.
  • The PCA is computed via SVD on standardized data.
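The two notes above, together with the clipping of n_pcs, can be sketched with the standard library alone. This is an illustration under assumptions (cells x genes input, per-gene standardization, a made-up toy matrix), not the code of pca! itself.

```julia
using LinearAlgebra, Statistics

# Toy log-transformed matrix, cells x genes (orientation assumed).
X = [1.0 2.0 0.5;
     0.0 1.5 1.0;
     2.0 0.5 0.0;
     1.5 1.0 2.5;
     0.5 3.0 1.5]
n_pcs = 1000

# Clip the requested number of components to the smallest matrix dimension.
k = min(n_pcs, minimum(size(X)))

# Standardize each gene (column) to zero mean and unit variance.
Z = (X .- mean(X; dims=1)) ./ std(X; dims=1)

# Thin SVD: Z == F.U * Diagonal(F.S) * F.Vt
F = svd(Z)

# Cell coordinates in principal-component space, one column per component.
scores = F.U[:, 1:k] * Diagonal(F.S[1:k])
```

The same scores are obtained by projecting Z onto the right singular vectors, Z * F.V[:, 1:k]. A gene with zero variance would need special handling before the division.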
Juscan.Tl.umap! Function
umap!(adata::Muon.AnnData; layer="log_transformed", use_pca=nothing, n_pcs=100, key_added="umap", verbose=true, kwargs...)

Computes a UMAP embedding from the data in the specified layer or PCA representation and stores the result in adata.obsm.

If PCA is requested via use_pca, it will be computed automatically if not already present.

Arguments

  • adata::Muon.AnnData: The annotated data object on which to compute UMAP.

Keyword Arguments

  • layer::String = "log_transformed": The layer to use as input if use_pca is not specified.
  • use_pca::Union{String, Nothing} = nothing: If specified, use this key in adata.obsm as PCA input. If the key is not present in adata.obsm, PCA is computed first.
  • n_pcs::Int = 100: Number of principal components to use if PCA needs to be computed.
  • key_added::String = "umap": The key under which to store the UMAP embedding.
  • verbose::Bool = true: Whether to print progress messages.
  • kwargs...: Additional keyword arguments passed to UMAP.UMAP_().

Returns

The modified AnnData object with UMAP results stored in:

  • adata.obsm[key_added]: The UMAP embedding.
  • adata.obsm["knns"]: K-nearest neighbors matrix.
  • adata.obsm["knn_dists"]: KNN distance matrix.
  • adata.obsp["fuzzy_neighbor_graph"]: Fuzzy graph representation.

Notes

  • Automatically performs normalization and log transformation if necessary.
  • Uses the UMAP.jl package under the hood.
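For orientation on what the "knns" and "knn_dists" entries hold, the sketch below builds such matrices by brute force. It is an assumption-laden illustration: the k x n layout (one column per cell), the column-per-cell input, and the exact search are choices made here, and UMAP.jl may use a different, approximate neighbor search.

```julia
using LinearAlgebra

# Toy input: features x cells (column-per-cell layout assumed here).
X = [0.0 0.1 1.0 1.1 5.0;
     0.0 0.0 1.0 0.9 5.0]
k = 2
n = size(X, 2)

knns = zeros(Int, k, n)           # neighbor indices, one column per cell
knn_dists = zeros(Float64, k, n)  # matching Euclidean distances

for j in 1:n
    # Distance from cell j to every cell.
    d = [norm(X[:, i] - X[:, j]) for i in 1:n]
    d[j] = Inf                    # exclude the cell itself
    idx = sortperm(d)[1:k]        # k closest cells, nearest first
    knns[:, j] = idx
    knn_dists[:, j] = d[idx]
end
```

Each column of knns lists the k nearest cells of one cell, and the same position in knn_dists gives the corresponding distance.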
Juscan.Tl.log_transform! Function
log_transform!(adata::Muon.AnnData; layer="normalized", key_added="log_transformed", verbose=false)

Applies a log transformation to the specified data layer of an AnnData object and stores the result in adata.layers.

If the specified layer is missing, the function defaults to applying a log(1 + x) transformation to adata.X.

Arguments

  • adata::Muon.AnnData: The data object to transform.

Keyword Arguments

  • layer::String = "normalized": The layer to transform. If it is not found in adata.layers, the fallback to a log(1 + x) transformation of adata.X applies.
  • key_added::String = "log_transformed": The key to store the result under in adata.layers.
  • verbose::Bool = false: Whether to print messages during the process.

Returns

The modified AnnData object with the log-transformed data added to adata.layers[key_added].

Notes

  • The transformation is log(x + ϵ), where ϵ is a small constant to avoid log(0).
  • For default fallback behavior, see logp1_transform!().
Juscan.Tl.logp1_transform! Function
logp1_transform!(adata::Muon.AnnData; layer=nothing, key_added="log1_transformed", verbose=false)

Applies a log(1 + x) transformation to the specified layer or the main data matrix adata.X in an AnnData object. The result is stored in adata.layers[key_added].

Arguments

  • adata::Muon.AnnData: The annotated data object to transform.

Keyword Arguments

  • layer::Union{String, Nothing} = nothing: The name of the data layer to transform. If nothing or the layer is missing, uses adata.X.
  • key_added::AbstractString = "log1_transformed": The name under which to store the transformed result in adata.layers.
  • verbose::Bool = false: Whether to print transformation messages.

Returns

The modified AnnData object with the log(1 + x) transformed data stored in adata.layers[key_added].

Notes

  • This transformation is commonly used to stabilize variance and reduce the effect of outliers.
  • Compared to log_transform!, this version adds 1 to the data before taking the logarithm, making it more robust to zero entries.
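The difference between the two transformations is easiest to see on a matrix that contains zeros. The sketch below uses plain Julia broadcasting; the value of ϵ is a hypothetical choice, since the constant used by log_transform! is not stated here.

```julia
# Toy normalized counts containing zeros.
X = [0.0 1.0 9.0;
     3.0 0.0 99.0]

# log(1 + x): zeros map exactly to zero, so the result stays non-negative.
X_log1p = log1p.(X)

# log(x + ϵ): zeros map to log(ϵ), a large negative number.
ϵ = 1e-8   # hypothetical value, chosen only for illustration
X_logeps = log.(X .+ ϵ)
```

For large entries the two results are close, while for zero entries they differ strongly, which is why the choice matters for sparse count data.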

clustering

Juscan.Tl.clustering! Function
clustering!(data::Muon.AnnData; kwargs...)

Perform clustering on a Muon.AnnData object and store the results in place.

This function supports modularity-based graph clustering ("mc") and K-means clustering ("km"). By default, modularity clustering is applied using a shared nearest neighbor (SNN) graph constructed from PCA-reduced data.

Arguments

  • data::Muon.AnnData: The annotated data matrix to cluster. The object will be modified in place.

Keyword Arguments

  • method::AbstractString = "mc": Clustering method. Options are "mc" (modularity clustering) or "km" (K-means).
  • reduction::Union{AbstractString, Symbol} = :auto: Dimensionality reduction method to use. Accepts "pca" or "harmony". When :auto, it defaults to "pca" (support for "harmony" is planned).
  • use_pca::Union{AbstractString, Integer} = "pca_cut": Number of principal components to use, or the key in obsm specifying a PCA representation.
  • tree_K::Integer = 20: Number of neighbors to use when building the SNN graph. Relevant only for "mc" clustering.
  • resolution::Union{Symbol, Real, AbstractRange} = :auto: Resolution(s) for modularity optimization. :auto uses 0.2:0.1:2.0.
  • cluster_K::Union{Nothing, Integer} = nothing: Number of clusters for K-means. If nothing, it will be auto-determined.
  • cluster_K_max::Union{Nothing, Integer} = 30: Maximum number of clusters to try for automatic K-means clustering.
  • dist::AbstractString = "Euclidean": Distance metric for K-means, e.g., "Euclidean".
  • network::AbstractString = "SNN": Type of graph network to construct. Currently supports "SNN".
  • random_starts_number::Integer = 10: Number of random initializations for clustering (modularity clustering only).
  • iter_number::Integer = 10: Maximum number of iterations for the clustering optimization.
  • prune::AbstractFloat = 1/15: Pruning factor for the graph. Must be between 0 and 1.
  • seed::Integer = -1: Random seed. Use a negative number to skip setting the seed.

Returns

Nothing. The clustering results are stored in the obs field of the input AnnData object, typically under a key like "clusters".

Example

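using Juscan

# adata is an existing AnnData object; "X_pca" must be a key in adata.obsm holding a PCA representation
Juscan.Tl.clustering!(adata; method="mc", use_pca="X_pca", resolution=1.0)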

Notes

  • "harmony" support is planned but not yet implemented.
  • The "mc" method builds a neighbor graph and performs community detection; "km" performs K-means clustering in reduced space.
  • The method parameter only supports "mc" and "km"; any other value produces a warning.
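The graph-building step of the "mc" method can be illustrated with a small shared nearest neighbor (SNN) sketch. It assumes the common definition of the SNN weight as the Jaccard index of two cells' neighbor sets, with edges below the pruning threshold set to zero. Whether the package uses exactly this definition is an assumption, and the neighbor lists are invented for the example.

```julia
# Hypothetical k-nearest-neighbor lists (each list includes the cell itself).
neighbors = [
    [1, 2, 3],
    [2, 1, 3],
    [3, 2, 1],
    [4, 5, 3],
    [5, 4, 6],
    [6, 5, 4],
]
# Larger than the documented default of 1/15 so that pruning is visible
# on this tiny example.
prune = 0.25
n = length(neighbors)

# SNN weight = |shared neighbors| / |all neighbors of either cell|.
snn = zeros(Float64, n, n)
for i in 1:n, j in (i + 1):n
    shared = length(intersect(neighbors[i], neighbors[j]))
    total = length(union(neighbors[i], neighbors[j]))
    w = shared / total
    if w >= prune        # edges weaker than the threshold stay at zero
        snn[i, j] = w
        snn[j, i] = w
    end
end
```

Cells 1 and 4 share one neighbor out of five (weight 0.2), so their edge is pruned, while cells 4 and 5 keep an edge of weight 0.5. Community detection would then run on this weighted graph.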