Preprocessing Module

Preprocessing is a crucial step in single-cell and multi-omics data analysis. This module provides functions for quality control, filtering, and normalization to ensure that datasets are clean and ready for downstream analyses.

Quality control (QC) metrics help identify low-quality cells and genes, such as those with extremely high or low counts, high dropout rates, or disproportionate expression levels. By computing these metrics, users can apply appropriate thresholds to retain only reliable data.

Filtering enables users to remove unwanted cells or genes based on QC criteria, ensuring that low-quality features do not affect downstream analyses.

Normalization methods adjust for differences in sequencing depth and technical variation, allowing meaningful comparisons across cells and conditions.

These preprocessing functions ensure that data is well-structured and suitable for clustering, dimensionality reduction, and other analytical tasks.

Juscan.Pp.calculate_qc_metrics
Juscan.Pp.calculate_qc_metrics!
Juscan.Pp.describe_obs
Juscan.Pp.describe_obs!
Juscan.Pp.describe_var
Juscan.Pp.describe_var!
Juscan.Pp.filter_cells
Juscan.Pp.filter_cells!
Juscan.Pp.filter_genes
Juscan.Pp.filter_genes!
Juscan.Pp.normalize_total
Juscan.Pp.normalize_total!

quality control metrics

Juscan.Pp.calculate_qc_metrics — Function

calculate_qc_metrics!(adata; expr_type="counts", var_type="genes", qc_vars=String[], 
                      percent_top=[50, 100, 200, 500], layer=nothing, use_raw=false, 
                      use_log1p=true, parallel=nothing) -> Tuple{DataFrame, DataFrame}

Calculate quality control metrics and return.

Arguments

adata: Muon.AnnData Annotated data matrix containing single-cell or multi-omics data.
expr_type: String (default: "counts") The type of expression data to use, such as raw counts or normalized values.
var_type: String (default: "genes") Specifies the variable type, e.g., "genes" for gene expression.
qc_vars: Union{Vector{String}, String} (default: String[]) A list of quality control variables to consider. If a single string is provided, it will be converted to a vector.
percent_top: Union{Vector{Int}, Nothing} (default: [50, 100, 200, 500]) A list of top feature (e.g., gene) counts to compute the percentage of total counts for each cell.
layer: Union{String, Nothing} (default: nothing) Specifies the layer in adata to use. If nothing, the default matrix is used.
use_raw: Bool (default: false) Whether to use the raw counts matrix (adata.raw) instead of the processed data.
use_log1p: Bool (default: true) Whether to apply log1p transformation to the data before computing quality control metrics.
parallel: Union{Bool, Nothing} (default: nothing) (Deprecated) This argument is ignored but retained for backward compatibility.

Returns

A tuple of two DataFrames.DataFrame:
1. The first DataFrame contains per-cell quality control metrics.
2. The second DataFrame contains per-gene quality control metrics.

source

Juscan.Pp.calculate_qc_metrics! — Function

calculate_qc_metrics!(adata; expr_type="counts", var_type="genes", qc_vars=String[], 
                      percent_top=[50, 100, 200, 500], layer=nothing, use_raw=false, 
                      use_log1p=true, parallel=nothing) -> Nothing

Calculate and store quality control (QC) metrics directly in the obs and var attributes of adata.

Arguments

adata: Muon.AnnData Annotated data matrix containing single-cell or multi-omics data. The QC metrics will be stored in adata.obs (per-cell) and adata.var (per-gene).
expr_type: String (default: "counts") The type of expression data to use, such as raw counts or normalized values.
var_type: String (default: "genes") Specifies the variable type, e.g., "genes" for gene expression.
qc_vars: Union{Vector{String}, String} (default: String[]) A list of quality control variables to consider. If a single string is provided, it will be converted to a vector.
percent_top: Union{Vector{Int}, Nothing} (default: [50, 100, 200, 500]) A list of top feature (e.g., gene) counts to compute the percentage of total counts for each cell.
layer: Union{String, Nothing} (default: nothing) Specifies the layer in adata to use. If nothing, the default matrix is used.
use_raw: Bool (default: false) Whether to use the raw counts matrix (adata.raw) instead of the processed data.
use_log1p: Bool (default: true) Whether to apply log1p transformation to the data before computing quality control metrics.
parallel: Union{Bool, Nothing} (default: nothing) (Deprecated) This argument is ignored but retained for backward compatibility.

Returns

Nothing: This function modifies adata in place by adding QC metrics to adata.obs and adata.var.

Notes

Unlike calculate_qc_metrics, which returns QC metrics as DataFrame objects, this function directly updates adata without returning any values.

source

sub functions

Juscan.Pp.describe_obs — Function

describe_obs(adata; expr_type="counts", var_type="genes", qc_vars=String[], 
             percent_top=[50, 100, 200, 500], layer=nothing, use_raw=false, 
             use_log1p=true, X=nothing, parallel=nothing) -> DataFrames.DataFrame

Compute per-cell quality control (QC) metrics from the expression matrix and return them as a DataFrame.

Arguments

adata: Muon.AnnData Annotated data matrix containing single-cell or multi-omics data.
expr_type: String (default: "counts") The type of expression data to use, such as raw counts or normalized values.
var_type: String (default: "genes") Specifies the variable type, e.g., "genes" for gene expression.
qc_vars: Union{Vector{String}, String} (default: String[]) A list of quality control variables to consider. If a single string is provided, it will be converted to a vector.
percent_top: Union{Vector{Int}, Nothing} (default: [50, 100, 200, 500]) A list of top feature (e.g., gene) counts to compute the percentage of total counts for each cell.
layer: Union{String, Nothing} (default: nothing) Specifies the layer in adata to use. If nothing, the default matrix is used.
use_raw: Bool (default: false) Whether to use the raw counts matrix (adata.raw) instead of the processed data.
use_log1p: Bool (default: true) Whether to apply log1p transformation to the data before computing quality control metrics.
X: Union{Nothing, SparseMatrixCSC} (default: nothing) The expression matrix to use. If nothing, the matrix is determined by _choose_mtx_rep.
parallel: Union{Bool, Nothing} (default: nothing) (Deprecated) This argument is ignored but retained for backward compatibility.

Returns

DataFrames.DataFrame: A DataFrame containing per-cell QC metrics, including:
- "n_$(var_type)_by_$(expr_type)": Number of detected features per cell.
- "log1p_n_$(var_type)_by_$(expr_type)": Log-transformed number of features per cell.
- "total_$(expr_type)": Total expression count per cell.
- "log1p_total_$(expr_type)": Log-transformed total expression count per cell.
- "pct_$(expr_type)_in_top_N_$(var_type)": Percentage of expression from the top N features.
- "total_$(expr_type)_$(qc_var)": Total expression for each QC variable per cell.
- "log1p_total_$(expr_type)_$(qc_var)": Log-transformed total expression for each QC variable per cell.
- "pct_$(expr_type)_$(qc_var)": Percentage of total expression contributed by each QC variable.

Notes

This function computes per-cell QC metrics based on expression data. For per-gene metrics, see describe_var.

source

Juscan.Pp.describe_obs! — Function

describe_obs!(adata; expr_type="counts", var_type="genes", qc_vars=String[], 
              percent_top=[50, 100, 200, 500], layer=nothing, use_raw=false, 
              use_log1p=true, X=nothing, parallel=nothing) -> Nothing

Compute per-cell quality control (QC) metrics and store them in adata.obs.

Arguments

adata: Muon.AnnData Annotated data matrix containing single-cell or multi-omics data.
expr_type: String (default: "counts") The type of expression data to use, such as raw counts or normalized values.
var_type: String (default: "genes") Specifies the variable type, e.g., "genes" for gene expression.
qc_vars: Union{Vector{String}, String} (default: String[]) A list of quality control variables to consider. If a single string is provided, it will be converted to a vector.
percent_top: Union{Vector{Int}, Nothing} (default: [50, 100, 200, 500]) A list of top feature (e.g., gene) counts to compute the percentage of total counts for each cell.
layer: Union{String, Nothing} (default: nothing) Specifies the layer in adata to use. If nothing, the default matrix is used.
use_raw: Bool (default: false) Whether to use the raw counts matrix (adata.raw) instead of the processed data.
use_log1p: Bool (default: true) Whether to apply log1p transformation to the data before computing quality control metrics.
X: Union{Nothing, SparseMatrixCSC} (default: nothing) The expression matrix to use. If nothing, the matrix is determined by _choose_mtx_rep.
parallel: Union{Bool, Nothing} (default: nothing) (Deprecated) This argument is ignored but retained for backward compatibility.

Returns

Nothing: This function modifies adata in place by adding per-cell QC metrics to adata.obs.

Notes

This function is similar to describe_obs but modifies adata directly. It computes QC metrics using describe_obs and stores the results in adata.obs.
- For per-gene QC metrics, see describe_var!.

source

Juscan.Pp.describe_var — Function

describe_var(adata; expr_type="counts", var_type="genes", layer=nothing, 
             use_raw=false, use_log1p=true, X=nothing) -> DataFrame

Compute per-gene quality control (QC) metrics and return as a DataFrame.

Arguments

adata: Muon.AnnData Annotated data matrix containing single-cell or multi-omics data.
expr_type: String (default: "counts") The type of expression data to use, such as raw counts or normalized values.
var_type: String (default: "genes") Specifies the variable type, e.g., "genes" for gene expression.
layer: Union{String, Nothing} (default: nothing) Specifies the layer in adata to use. If nothing, the default matrix is used.
use_raw: Bool (default: false) Whether to use the raw counts matrix (adata.raw) instead of the processed data.
use_log1p: Bool (default: true) Whether to apply log1p transformation to the data before computing quality control metrics.
X: Union{Nothing, SparseMatrixCSC} (default: nothing) The expression matrix to use. If nothing, the matrix is determined by _choose_mtx_rep.

Returns

DataFrames.DataFrame: A DataFrame where each row represents a gene (or variable), containing the following QC metrics:
- "n_cells_by_<expr_type>": Number of cells in which each gene is detected.
- "mean_<expr_type>": Mean expression level of each gene.
- "log1p_mean_<expr_type>": Log-transformed mean expression level.
- "pct_dropout_by_<expr_type>": Percentage of cells in which each gene is not detected.
- "total_<expr_type>": Total expression level of each gene.
- "log1p_total_<expr_type>": Log-transformed total expression level.

Notes

This function calculates quality control metrics per gene (or other variables).
- To store these metrics directly in adata.var, use describe_var!.
- To compute per-cell QC metrics, see describe_obs.

source

Juscan.Pp.describe_var! — Function

describe_var!(adata; expr_type="counts", var_type="genes", layer=nothing, 
              use_raw=false, use_log1p=true, X=nothing) -> Nothing

Compute per-gene quality control (QC) metrics and store them in adata.var.

Arguments

adata: Muon.AnnData Annotated data matrix containing single-cell or multi-omics data.
expr_type: String (default: "counts") The type of expression data to use, such as raw counts or normalized values.
var_type: String (default: "genes") Specifies the variable type, e.g., "genes" for gene expression.
layer: Union{String, Nothing} (default: nothing) Specifies the layer in adata to use. If nothing, the default matrix is used.
use_raw: Bool (default: false) Whether to use the raw counts matrix (adata.raw) instead of the processed data.
use_log1p: Bool (default: true) Whether to apply log1p transformation to the data before computing quality control metrics.
X: Union{Nothing, SparseMatrixCSC} (default: nothing) The expression matrix to use. If nothing, the matrix is determined by _choose_mtx_rep.

Returns

Nothing: This function modifies adata in place by adding per-gene QC metrics to adata.var.

Notes

This function is similar to describe_var but modifies adata directly. It computes QC metrics using describe_var and stores the results in adata.var.
- For per-cell QC metrics, see describe_obs!.

source

filter cells and genes

filter cells

Juscan.Pp.filter_cells — Function

filter_cells(
  data::Muon.AnnData;
  min_counts=nothing, min_genes=nothing,
  max_counts=nothing, max_genes=nothing,
  copy=false) -> Union{Tuple{BitVector, Vector}, Muon.AnnData}

Filters cells based on the given threshold criteria and returns either a subset mask and count vector or a new filtered Muon.AnnData object.

Arguments

data::Muon.AnnData: The input single-cell data.
min_counts::Union{Int, Nothing}: Minimum total counts per cell.
min_genes::Union{Int, Nothing}: Minimum number of genes expressed per cell.
max_counts::Union{Int, Nothing}: Maximum total counts per cell.
max_genes::Union{Int, Nothing}: Maximum number of genes expressed per cell.
copy::Bool: If true, returns a filtered copy; otherwise, returns a mask and count vector.

Returns

If copy == false, returns a tuple (cell_subset::BitVector, number_per_cell::Vector).
If copy == true, returns a filtered Muon.AnnData object.

source

filter_cells(
  data::AbstractMatrix;
  min_counts=nothing, min_genes=nothing,
  max_counts=nothing, max_genes=nothing
) -> Tuple{BitVector, Vector}

Filters cells from a count matrix based on the given threshold criteria.

Arguments

data::AbstractMatrix: Gene expression count matrix with cells as rows.
min_counts::Union{Int, Nothing}: Minimum total counts per cell.
min_genes::Union{Int, Nothing}: Minimum number of genes expressed per cell.
max_counts::Union{Int, Nothing}: Maximum total counts per cell.
max_genes::Union{Int, Nothing}: Maximum number of genes expressed per cell.

Returns

cell_subset::BitVector: A mask indicating cells that pass the filter.
number_per_cell::Vector: A vector of counts or expressed gene numbers per cell.

source

Juscan.Pp.filter_cells! — Function

filter_cells!(
  data::Muon.AnnData;
  min_counts=nothing, min_genes=nothing,
  max_counts=nothing, max_genes=nothing
) -> Nothing

Filters cells in-place in a Muon.AnnData object.

Arguments

data::Muon.AnnData: The input single-cell data.
min_counts::Union{Int, Nothing}: Minimum total counts per cell.
min_genes::Union{Int, Nothing}: Minimum number of genes expressed per cell.
max_counts::Union{Int, Nothing}: Maximum total counts per cell.
max_genes::Union{Int, Nothing}: Maximum number of genes expressed per cell.

Returns

Nothing. The filtering is applied in-place.

source

filter genes

Juscan.Pp.filter_genes — Function

filter_genes(
  data::Muon.AnnData;
  min_counts=nothing, min_cells=nothing,
  max_counts=nothing, max_cells=nothing,
  copy=false
) -> Union{Tuple{BitVector, Vector}, Muon.AnnData}

Filters genes based on the given threshold criteria and returns either a subset mask and count vector or a new filtered Muon.AnnData object.

Arguments

data::Muon.AnnData: The input single-cell data.
min_counts::Union{Int, Nothing}: Minimum total counts per gene.
min_cells::Union{Int, Nothing}: Minimum number of cells expressing the gene.
max_counts::Union{Int, Nothing}: Maximum total counts per gene.
max_cells::Union{Int, Nothing}: Maximum number of cells expressing the gene.
copy::Bool: If true, returns a filtered copy; otherwise, returns a mask and count vector.

Returns

If copy == false, returns a tuple (gene_subset::BitVector, number_per_gene::Vector).
If copy == true, returns a filtered Muon.AnnData object.

source

filter_genes(
  data::AbstractMatrix;
  min_counts=nothing, min_cells=nothing,
  max_counts=nothing, max_cells=nothing
) -> Tuple{BitVector, Vector}

Filters genes from a count matrix based on the given threshold criteria.

Arguments

data::AbstractMatrix: Gene expression count matrix with genes as columns.
min_counts::Union{Int, Nothing}: Minimum total counts per gene.
min_cells::Union{Int, Nothing}: Minimum number of cells expressing the gene.
max_counts::Union{Int, Nothing}: Maximum total counts per gene.
max_cells::Union{Int, Nothing}: Maximum number of cells expressing the gene.

Returns

gene_subset::BitVector: A mask indicating genes that pass the filter.
number_per_gene::Vector: A vector of counts or expressed cell numbers per gene.

source

Juscan.Pp.filter_genes! — Function

filter_genes!(
  data::Muon.AnnData;
  min_counts=nothing, min_cells=nothing,
  max_counts=nothing, max_cells=nothing
) -> Nothing

Filters genes in-place in a Muon.AnnData object.

Arguments

data::Muon.AnnData: The input single-cell data.
min_counts::Union{Int, Nothing}: Minimum total counts per gene.
min_cells::Union{Int, Nothing}: Minimum number of cells expressing the gene.
max_counts::Union{Int, Nothing}: Maximum total counts per gene.
max_cells::Union{Int, Nothing}: Maximum number of cells expressing the gene.

Returns

Nothing. The filtering is applied in-place.

source

normalization

Juscan.Pp.normalize_total — Function

normalize_total(
    adata::AnnData;
    target_sum::Union{Real, Nothing}=nothing,
    exclude_highly_expressed::Bool=false,
    max_fraction::Float64=0.05,
    key_added::Union{String, Nothing}=nothing,
    layer::Union{String, Nothing}=nothing,
    layers::Union{String, Vector{String}, Nothing}=nothing,
    layer_norm::Union{String, Nothing}=nothing,
    copy::Bool=false
) -> Union{AnnData, Dict{String, Any}}

Normalize total counts per cell to a target sum.

Arguments

adata::AnnData: The single-cell dataset to be normalized.
target_sum::Union{Real, Nothing}=nothing: Target total count per cell. If nothing, normalization is performed to the median of counts per cell.
exclude_highly_expressed::Bool=false: If true, highly expressed genes are excluded from the normalization factor calculation.
max_fraction::Float64=0.05: A gene is considered highly expressed if it accounts for more than max_fraction of the total counts in any cell.
key_added::Union{String, Nothing}=nothing: If provided, the computed normalization factors are stored in adata.obs[key_added].
layer::Union{String, Nothing}=nothing: The layer to normalize. If nothing, the main count matrix (adata.X) is used.
layers::Union{String, Vector{String}, Nothing}=nothing: Specifies multiple layers to normalize. If "all", all layers in adata.layers are normalized.
layer_norm::Union{String, Nothing}=nothing: Defines the normalization reference for additional layers. Can be "after", "X", or nothing.
copy::Bool=false: If true, returns a copy of adata with normalized counts; otherwise, modifies adata in place.

Returns

If copy=true, returns a new AnnData object with normalized counts.
If copy=false, modifies adata in place and returns a dictionary containing:
- "X": The normalized count matrix.
- "norm_factor": The computed normalization factors.
- Additional layers if layers is specified.

Notes

If exclude_highly_expressed=true, genes that exceed max_fraction in any cell are ignored when computing normalization factors.
Cells with zero counts will generate a warning.
The function supports layer-wise normalization if layers is specified.

Example

adata = AnnData(X)
normalize_total(adata; target_sum=10000)

source

Juscan.Pp.normalize_total! — Function

normalize_total!(
    adata::AnnData;
    target_sum::Union{Real, Nothing}=nothing,
    exclude_highly_expressed::Bool=false,
    max_fraction::Float64=0.05,
    key_added::Union{String, Nothing}=nothing,
    layer::Union{String, Nothing}=nothing,
    layers::Union{String, Vector{String}, Nothing}=nothing
) -> Nothing

Normalize total counts per cell in-place.

Arguments

adata::AnnData: The single-cell dataset to be normalized.
target_sum::Union{Real, Nothing}=nothing: Target total count per cell. If nothing, normalization is performed to the median of counts per cell.
exclude_highly_expressed::Bool=false: If true, highly expressed genes are excluded from the normalization factor calculation.
max_fraction::Float64=0.05: A gene is considered highly expressed if it accounts for more than max_fraction of the total counts in any cell.
key_added::Union{String, Nothing}=nothing: If provided, the computed normalization factors are stored in adata.obs[key_added].
layer::Union{String, Nothing}=nothing: The layer to normalize. If nothing, the main count matrix (adata.X) is used.
layers::Union{String, Vector{String}, Nothing}=nothing: Specifies multiple layers to normalize. If "all", all layers in adata.layers are normalized.

Returns

This function modifies adata in place and does not return a new object.

Notes

If exclude_highly_expressed=true, genes that exceed max_fraction in any cell are ignored when computing normalization factors.
Cells with zero counts will generate a warning.
The function supports layer-wise normalization if layers is specified.

Example

adata = AnnData(X)
normalize_total!(adata; target_sum=10000)

source