Preprocessing Module

Preprocessing is a crucial step in single-cell and multi-omics data analysis. This module provides functions for quality control, filtering, and normalization to ensure that datasets are clean and ready for downstream analyses.

Quality control (QC) metrics help identify low-quality cells and genes, such as those with extremely high or low counts, high dropout rates, or disproportionate expression levels. By computing these metrics, users can apply appropriate thresholds to retain only reliable data.

Filtering enables users to remove unwanted cells or genes based on QC criteria, ensuring that low-quality features do not affect downstream analyses.

Normalization methods adjust for differences in sequencing depth and technical variation, allowing meaningful comparisons across cells and conditions.

These preprocessing functions ensure that data is well-structured and suitable for clustering, dimensionality reduction, and other analytical tasks.

quality control metrics

Juscan.Pp.calculate_qc_metricsFunction
calculate_qc_metrics!(adata; expr_type="counts", var_type="genes", qc_vars=String[], 
                      percent_top=[50, 100, 200, 500], layer=nothing, use_raw=false, 
                      use_log1p=true, parallel=nothing) -> Tuple{DataFrame, DataFrame}

Calculate quality control metrics and return.

Arguments

  • adata: Muon.AnnData Annotated data matrix containing single-cell or multi-omics data.
  • expr_type: String (default: "counts") The type of expression data to use, such as raw counts or normalized values.
  • var_type: String (default: "genes") Specifies the variable type, e.g., "genes" for gene expression.
  • qc_vars: Union{Vector{String}, String} (default: String[]) A list of quality control variables to consider. If a single string is provided, it will be converted to a vector.
  • percent_top: Union{Vector{Int}, Nothing} (default: [50, 100, 200, 500]) A list of top feature (e.g., gene) counts to compute the percentage of total counts for each cell.
  • layer: Union{String, Nothing} (default: nothing) Specifies the layer in adata to use. If nothing, the default matrix is used.
  • use_raw: Bool (default: false) Whether to use the raw counts matrix (adata.raw) instead of the processed data.
  • use_log1p: Bool (default: true) Whether to apply log1p transformation to the data before computing quality control metrics.
  • parallel: Union{Bool, Nothing} (default: nothing) (Deprecated) This argument is ignored but retained for backward compatibility.

Returns

  • A tuple of two DataFrames.DataFrame:
    1. The first DataFrame contains per-cell quality control metrics.
    2. The second DataFrame contains per-gene quality control metrics.
source
Juscan.Pp.calculate_qc_metrics!Function
calculate_qc_metrics!(adata; expr_type="counts", var_type="genes", qc_vars=String[], 
                      percent_top=[50, 100, 200, 500], layer=nothing, use_raw=false, 
                      use_log1p=true, parallel=nothing) -> Nothing

Calculate and store quality control (QC) metrics directly in the obs and var attributes of adata.

Arguments

  • adata: Muon.AnnData Annotated data matrix containing single-cell or multi-omics data. The QC metrics will be stored in adata.obs (per-cell) and adata.var (per-gene).
  • expr_type: String (default: "counts") The type of expression data to use, such as raw counts or normalized values.
  • var_type: String (default: "genes") Specifies the variable type, e.g., "genes" for gene expression.
  • qc_vars: Union{Vector{String}, String} (default: String[]) A list of quality control variables to consider. If a single string is provided, it will be converted to a vector.
  • percent_top: Union{Vector{Int}, Nothing} (default: [50, 100, 200, 500]) A list of top feature (e.g., gene) counts to compute the percentage of total counts for each cell.
  • layer: Union{String, Nothing} (default: nothing) Specifies the layer in adata to use. If nothing, the default matrix is used.
  • use_raw: Bool (default: false) Whether to use the raw counts matrix (adata.raw) instead of the processed data.
  • use_log1p: Bool (default: true) Whether to apply log1p transformation to the data before computing quality control metrics.
  • parallel: Union{Bool, Nothing} (default: nothing) (Deprecated) This argument is ignored but retained for backward compatibility.

Returns

  • Nothing: This function modifies adata in place by adding QC metrics to adata.obs and adata.var.

Notes

  • Unlike calculate_qc_metrics, which returns QC metrics as DataFrame objects, this function directly updates adata without returning any values.
source

sub functions

Juscan.Pp.describe_obsFunction
describe_obs(adata; expr_type="counts", var_type="genes", qc_vars=String[], 
             percent_top=[50, 100, 200, 500], layer=nothing, use_raw=false, 
             use_log1p=true, X=nothing, parallel=nothing) -> DataFrames.DataFrame

Compute per-cell quality control (QC) metrics from the expression matrix and return them as a DataFrame.

Arguments

  • adata: Muon.AnnData Annotated data matrix containing single-cell or multi-omics data.
  • expr_type: String (default: "counts") The type of expression data to use, such as raw counts or normalized values.
  • var_type: String (default: "genes") Specifies the variable type, e.g., "genes" for gene expression.
  • qc_vars: Union{Vector{String}, String} (default: String[]) A list of quality control variables to consider. If a single string is provided, it will be converted to a vector.
  • percent_top: Union{Vector{Int}, Nothing} (default: [50, 100, 200, 500]) A list of top feature (e.g., gene) counts to compute the percentage of total counts for each cell.
  • layer: Union{String, Nothing} (default: nothing) Specifies the layer in adata to use. If nothing, the default matrix is used.
  • use_raw: Bool (default: false) Whether to use the raw counts matrix (adata.raw) instead of the processed data.
  • use_log1p: Bool (default: true) Whether to apply log1p transformation to the data before computing quality control metrics.
  • X: Union{Nothing, SparseMatrixCSC} (default: nothing) The expression matrix to use. If nothing, the matrix is determined by _choose_mtx_rep.
  • parallel: Union{Bool, Nothing} (default: nothing) (Deprecated) This argument is ignored but retained for backward compatibility.

Returns

  • DataFrames.DataFrame: A DataFrame containing per-cell QC metrics, including:
    • "n_$(var_type)_by_$(expr_type)": Number of detected features per cell.
    • "log1p_n_$(var_type)_by_$(expr_type)": Log-transformed number of features per cell.
    • "total_$(expr_type)": Total expression count per cell.
    • "log1p_total_$(expr_type)": Log-transformed total expression count per cell.
    • "pct_$(expr_type)_in_top_N_$(var_type)": Percentage of expression from the top N features.
    • "total_$(expr_type)_$(qc_var)": Total expression for each QC variable per cell.
    • "log1p_total_$(expr_type)_$(qc_var)": Log-transformed total expression for each QC variable per cell.
    • "pct_$(expr_type)_$(qc_var)": Percentage of total expression contributed by each QC variable.

Notes

  • This function computes per-cell QC metrics based on expression data. For per-gene metrics, see describe_var.
source
Juscan.Pp.describe_obs!Function
describe_obs!(adata; expr_type="counts", var_type="genes", qc_vars=String[], 
              percent_top=[50, 100, 200, 500], layer=nothing, use_raw=false, 
              use_log1p=true, X=nothing, parallel=nothing) -> Nothing

Compute per-cell quality control (QC) metrics and store them in adata.obs.

Arguments

  • adata: Muon.AnnData Annotated data matrix containing single-cell or multi-omics data.
  • expr_type: String (default: "counts") The type of expression data to use, such as raw counts or normalized values.
  • var_type: String (default: "genes") Specifies the variable type, e.g., "genes" for gene expression.
  • qc_vars: Union{Vector{String}, String} (default: String[]) A list of quality control variables to consider. If a single string is provided, it will be converted to a vector.
  • percent_top: Union{Vector{Int}, Nothing} (default: [50, 100, 200, 500]) A list of top feature (e.g., gene) counts to compute the percentage of total counts for each cell.
  • layer: Union{String, Nothing} (default: nothing) Specifies the layer in adata to use. If nothing, the default matrix is used.
  • use_raw: Bool (default: false) Whether to use the raw counts matrix (adata.raw) instead of the processed data.
  • use_log1p: Bool (default: true) Whether to apply log1p transformation to the data before computing quality control metrics.
  • X: Union{Nothing, SparseMatrixCSC} (default: nothing) The expression matrix to use. If nothing, the matrix is determined by _choose_mtx_rep.
  • parallel: Union{Bool, Nothing} (default: nothing) (Deprecated) This argument is ignored but retained for backward compatibility.

Returns

  • Nothing: This function modifies adata in place by adding per-cell QC metrics to adata.obs.

Notes

  • This function is similar to describe_obs but modifies adata directly. It computes QC metrics using describe_obs and stores the results in adata.obs.
source
Juscan.Pp.describe_varFunction
describe_var(adata; expr_type="counts", var_type="genes", layer=nothing, 
             use_raw=false, use_log1p=true, X=nothing) -> DataFrame

Compute per-gene quality control (QC) metrics and return as a DataFrame.

Arguments

  • adata: Muon.AnnData Annotated data matrix containing single-cell or multi-omics data.
  • expr_type: String (default: "counts") The type of expression data to use, such as raw counts or normalized values.
  • var_type: String (default: "genes") Specifies the variable type, e.g., "genes" for gene expression.
  • layer: Union{String, Nothing} (default: nothing) Specifies the layer in adata to use. If nothing, the default matrix is used.
  • use_raw: Bool (default: false) Whether to use the raw counts matrix (adata.raw) instead of the processed data.
  • use_log1p: Bool (default: true) Whether to apply log1p transformation to the data before computing quality control metrics.
  • X: Union{Nothing, SparseMatrixCSC} (default: nothing) The expression matrix to use. If nothing, the matrix is determined by _choose_mtx_rep.

Returns

  • DataFrames.DataFrame: A DataFrame where each row represents a gene (or variable), containing the following QC metrics:
    • "n_cells_by_<expr_type>": Number of cells in which each gene is detected.
    • "mean_<expr_type>": Mean expression level of each gene.
    • "log1p_mean_<expr_type>": Log-transformed mean expression level.
    • "pct_dropout_by_<expr_type>": Percentage of cells in which each gene is not detected.
    • "total_<expr_type>": Total expression level of each gene.
    • "log1p_total_<expr_type>": Log-transformed total expression level.

Notes

  • This function calculates quality control metrics per gene (or other variables).
source
Juscan.Pp.describe_var!Function
describe_var!(adata; expr_type="counts", var_type="genes", layer=nothing, 
              use_raw=false, use_log1p=true, X=nothing) -> Nothing

Compute per-gene quality control (QC) metrics and store them in adata.var.

Arguments

  • adata: Muon.AnnData Annotated data matrix containing single-cell or multi-omics data.
  • expr_type: String (default: "counts") The type of expression data to use, such as raw counts or normalized values.
  • var_type: String (default: "genes") Specifies the variable type, e.g., "genes" for gene expression.
  • layer: Union{String, Nothing} (default: nothing) Specifies the layer in adata to use. If nothing, the default matrix is used.
  • use_raw: Bool (default: false) Whether to use the raw counts matrix (adata.raw) instead of the processed data.
  • use_log1p: Bool (default: true) Whether to apply log1p transformation to the data before computing quality control metrics.
  • X: Union{Nothing, SparseMatrixCSC} (default: nothing) The expression matrix to use. If nothing, the matrix is determined by _choose_mtx_rep.

Returns

  • Nothing: This function modifies adata in place by adding per-gene QC metrics to adata.var.

Notes

  • This function is similar to describe_var but modifies adata directly. It computes QC metrics using describe_var and stores the results in adata.var.
source

filter cells and genes

filter cells

Juscan.Pp.filter_cellsFunction
filter_cells(
  data::Muon.AnnData;
  min_counts=nothing, min_genes=nothing,
  max_counts=nothing, max_genes=nothing,
  copy=false) -> Union{Tuple{BitVector, Vector}, Muon.AnnData}

Filters cells based on the given threshold criteria and returns either a subset mask and count vector or a new filtered Muon.AnnData object.

Arguments

  • data::Muon.AnnData: The input single-cell data.
  • min_counts::Union{Int, Nothing}: Minimum total counts per cell.
  • min_genes::Union{Int, Nothing}: Minimum number of genes expressed per cell.
  • max_counts::Union{Int, Nothing}: Maximum total counts per cell.
  • max_genes::Union{Int, Nothing}: Maximum number of genes expressed per cell.
  • copy::Bool: If true, returns a filtered copy; otherwise, returns a mask and count vector.

Returns

  • If copy == false, returns a tuple (cell_subset::BitVector, number_per_cell::Vector).
  • If copy == true, returns a filtered Muon.AnnData object.
source
filter_cells(
  data::AbstractMatrix;
  min_counts=nothing, min_genes=nothing,
  max_counts=nothing, max_genes=nothing
) -> Tuple{BitVector, Vector}

Filters cells from a count matrix based on the given threshold criteria.

Arguments

  • data::AbstractMatrix: Gene expression count matrix with cells as rows.
  • min_counts::Union{Int, Nothing}: Minimum total counts per cell.
  • min_genes::Union{Int, Nothing}: Minimum number of genes expressed per cell.
  • max_counts::Union{Int, Nothing}: Maximum total counts per cell.
  • max_genes::Union{Int, Nothing}: Maximum number of genes expressed per cell.

Returns

  • cell_subset::BitVector: A mask indicating cells that pass the filter.
  • number_per_cell::Vector: A vector of counts or expressed gene numbers per cell.
source
Juscan.Pp.filter_cells!Function
filter_cells!(
  data::Muon.AnnData;
  min_counts=nothing, min_genes=nothing,
  max_counts=nothing, max_genes=nothing
) -> Nothing

Filters cells in-place in a Muon.AnnData object.

Arguments

  • data::Muon.AnnData: The input single-cell data.
  • min_counts::Union{Int, Nothing}: Minimum total counts per cell.
  • min_genes::Union{Int, Nothing}: Minimum number of genes expressed per cell.
  • max_counts::Union{Int, Nothing}: Maximum total counts per cell.
  • max_genes::Union{Int, Nothing}: Maximum number of genes expressed per cell.

Returns

  • Nothing. The filtering is applied in-place.
source

filter genes

Juscan.Pp.filter_genesFunction
filter_genes(
  data::Muon.AnnData;
  min_counts=nothing, min_cells=nothing,
  max_counts=nothing, max_cells=nothing,
  copy=false
) -> Union{Tuple{BitVector, Vector}, Muon.AnnData}

Filters genes based on the given threshold criteria and returns either a subset mask and count vector or a new filtered Muon.AnnData object.

Arguments

  • data::Muon.AnnData: The input single-cell data.
  • min_counts::Union{Int, Nothing}: Minimum total counts per gene.
  • min_cells::Union{Int, Nothing}: Minimum number of cells expressing the gene.
  • max_counts::Union{Int, Nothing}: Maximum total counts per gene.
  • max_cells::Union{Int, Nothing}: Maximum number of cells expressing the gene.
  • copy::Bool: If true, returns a filtered copy; otherwise, returns a mask and count vector.

Returns

  • If copy == false, returns a tuple (gene_subset::BitVector, number_per_gene::Vector).
  • If copy == true, returns a filtered Muon.AnnData object.
source
filter_genes(
  data::AbstractMatrix;
  min_counts=nothing, min_cells=nothing,
  max_counts=nothing, max_cells=nothing
) -> Tuple{BitVector, Vector}

Filters genes from a count matrix based on the given threshold criteria.

Arguments

  • data::AbstractMatrix: Gene expression count matrix with genes as columns.
  • min_counts::Union{Int, Nothing}: Minimum total counts per gene.
  • min_cells::Union{Int, Nothing}: Minimum number of cells expressing the gene.
  • max_counts::Union{Int, Nothing}: Maximum total counts per gene.
  • max_cells::Union{Int, Nothing}: Maximum number of cells expressing the gene.

Returns

  • gene_subset::BitVector: A mask indicating genes that pass the filter.
  • number_per_gene::Vector: A vector of counts or expressed cell numbers per gene.
source
Juscan.Pp.filter_genes!Function
filter_genes!(
  data::Muon.AnnData;
  min_counts=nothing, min_cells=nothing,
  max_counts=nothing, max_cells=nothing
) -> Nothing

Filters genes in-place in a Muon.AnnData object.

Arguments

  • data::Muon.AnnData: The input single-cell data.
  • min_counts::Union{Int, Nothing}: Minimum total counts per gene.
  • min_cells::Union{Int, Nothing}: Minimum number of cells expressing the gene.
  • max_counts::Union{Int, Nothing}: Maximum total counts per gene.
  • max_cells::Union{Int, Nothing}: Maximum number of cells expressing the gene.

Returns

  • Nothing. The filtering is applied in-place.
source

normalization

Juscan.Pp.normalize_totalFunction
normalize_total(
    adata::AnnData;
    target_sum::Union{Real, Nothing}=nothing,
    exclude_highly_expressed::Bool=false,
    max_fraction::Float64=0.05,
    key_added::Union{String, Nothing}=nothing,
    layer::Union{String, Nothing}=nothing,
    layers::Union{String, Vector{String}, Nothing}=nothing,
    layer_norm::Union{String, Nothing}=nothing,
    copy::Bool=false
) -> Union{AnnData, Dict{String, Any}}

Normalize total counts per cell to a target sum.

Arguments

  • adata::AnnData: The single-cell dataset to be normalized.
  • target_sum::Union{Real, Nothing}=nothing: Target total count per cell. If nothing, normalization is performed to the median of counts per cell.
  • exclude_highly_expressed::Bool=false: If true, highly expressed genes are excluded from the normalization factor calculation.
  • max_fraction::Float64=0.05: A gene is considered highly expressed if it accounts for more than max_fraction of the total counts in any cell.
  • key_added::Union{String, Nothing}=nothing: If provided, the computed normalization factors are stored in adata.obs[key_added].
  • layer::Union{String, Nothing}=nothing: The layer to normalize. If nothing, the main count matrix (adata.X) is used.
  • layers::Union{String, Vector{String}, Nothing}=nothing: Specifies multiple layers to normalize. If "all", all layers in adata.layers are normalized.
  • layer_norm::Union{String, Nothing}=nothing: Defines the normalization reference for additional layers. Can be "after", "X", or nothing.
  • copy::Bool=false: If true, returns a copy of adata with normalized counts; otherwise, modifies adata in place.

Returns

  • If copy=true, returns a new AnnData object with normalized counts.
  • If copy=false, modifies adata in place and returns a dictionary containing:
    • "X": The normalized count matrix.
    • "norm_factor": The computed normalization factors.
    • Additional layers if layers is specified.

Notes

  • If exclude_highly_expressed=true, genes that exceed max_fraction in any cell are ignored when computing normalization factors.
  • Cells with zero counts will generate a warning.
  • The function supports layer-wise normalization if layers is specified.

Example

adata = AnnData(X)
normalize_total(adata; target_sum=10000)
source
Juscan.Pp.normalize_total!Function
normalize_total!(
    adata::AnnData;
    target_sum::Union{Real, Nothing}=nothing,
    exclude_highly_expressed::Bool=false,
    max_fraction::Float64=0.05,
    key_added::Union{String, Nothing}=nothing,
    layer::Union{String, Nothing}=nothing,
    layers::Union{String, Vector{String}, Nothing}=nothing
) -> Nothing

Normalize total counts per cell in-place.

Arguments

  • adata::AnnData: The single-cell dataset to be normalized.
  • target_sum::Union{Real, Nothing}=nothing: Target total count per cell. If nothing, normalization is performed to the median of counts per cell.
  • exclude_highly_expressed::Bool=false: If true, highly expressed genes are excluded from the normalization factor calculation.
  • max_fraction::Float64=0.05: A gene is considered highly expressed if it accounts for more than max_fraction of the total counts in any cell.
  • key_added::Union{String, Nothing}=nothing: If provided, the computed normalization factors are stored in adata.obs[key_added].
  • layer::Union{String, Nothing}=nothing: The layer to normalize. If nothing, the main count matrix (adata.X) is used.
  • layers::Union{String, Vector{String}, Nothing}=nothing: Specifies multiple layers to normalize. If "all", all layers in adata.layers are normalized.

Returns

  • This function modifies adata in place and does not return a new object.

Notes

  • If exclude_highly_expressed=true, genes that exceed max_fraction in any cell are ignored when computing normalization factors.
  • Cells with zero counts will generate a warning.
  • The function supports layer-wise normalization if layers is specified.

Example

adata = AnnData(X)
normalize_total!(adata; target_sum=10000)
source