Preprocessing Module
Preprocessing is a crucial step in single-cell and multi-omics data analysis. This module provides functions for quality control, filtering, and normalization to ensure that datasets are clean and ready for downstream analyses.
Quality control (QC) metrics help identify low-quality cells and genes, such as those with extremely high or low counts, high dropout rates, or disproportionate expression levels. By computing these metrics, users can apply appropriate thresholds to retain only reliable data.
Filtering enables users to remove unwanted cells or genes based on QC criteria, ensuring that low-quality features do not affect downstream analyses.
Normalization methods adjust for differences in sequencing depth and technical variation, allowing meaningful comparisons across cells and conditions.
These preprocessing functions ensure that data is well-structured and suitable for clustering, dimensionality reduction, and other analytical tasks.
Juscan.Pp.calculate_qc_metrics
Juscan.Pp.calculate_qc_metrics!
Juscan.Pp.describe_obs
Juscan.Pp.describe_obs!
Juscan.Pp.describe_var
Juscan.Pp.describe_var!
Juscan.Pp.filter_cells
Juscan.Pp.filter_cells!
Juscan.Pp.filter_genes
Juscan.Pp.filter_genes!
Juscan.Pp.normalize_total
Juscan.Pp.normalize_total!
quality control metrics
Juscan.Pp.calculate_qc_metrics
— Functioncalculate_qc_metrics!(adata; expr_type="counts", var_type="genes", qc_vars=String[],
percent_top=[50, 100, 200, 500], layer=nothing, use_raw=false,
use_log1p=true, parallel=nothing) -> Tuple{DataFrame, DataFrame}
Calculate quality control metrics and return.
Arguments
adata
:Muon.AnnData
Annotated data matrix containing single-cell or multi-omics data.expr_type
:String
(default:"counts"
) The type of expression data to use, such as raw counts or normalized values.var_type
:String
(default:"genes"
) Specifies the variable type, e.g.,"genes"
for gene expression.qc_vars
:Union{Vector{String}, String}
(default:String[]
) A list of quality control variables to consider. If a single string is provided, it will be converted to a vector.percent_top
:Union{Vector{Int}, Nothing}
(default:[50, 100, 200, 500]
) A list of top feature (e.g., gene) counts to compute the percentage of total counts for each cell.layer
:Union{String, Nothing}
(default:nothing
) Specifies the layer inadata
to use. Ifnothing
, the default matrix is used.use_raw
:Bool
(default:false
) Whether to use the raw counts matrix (adata.raw
) instead of the processed data.use_log1p
:Bool
(default:true
) Whether to applylog1p
transformation to the data before computing quality control metrics.parallel
:Union{Bool, Nothing}
(default:nothing
) (Deprecated) This argument is ignored but retained for backward compatibility.
Returns
- A tuple of two
DataFrames.DataFrame
:- The first
DataFrame
contains per-cell quality control metrics. - The second
DataFrame
contains per-gene quality control metrics.
- The first
Juscan.Pp.calculate_qc_metrics!
— Functioncalculate_qc_metrics!(adata; expr_type="counts", var_type="genes", qc_vars=String[],
percent_top=[50, 100, 200, 500], layer=nothing, use_raw=false,
use_log1p=true, parallel=nothing) -> Nothing
Calculate and store quality control (QC) metrics directly in the obs
and var
attributes of adata
.
Arguments
adata
:Muon.AnnData
Annotated data matrix containing single-cell or multi-omics data. The QC metrics will be stored inadata.obs
(per-cell) andadata.var
(per-gene).expr_type
:String
(default:"counts"
) The type of expression data to use, such as raw counts or normalized values.var_type
:String
(default:"genes"
) Specifies the variable type, e.g.,"genes"
for gene expression.qc_vars
:Union{Vector{String}, String}
(default:String[]
) A list of quality control variables to consider. If a single string is provided, it will be converted to a vector.percent_top
:Union{Vector{Int}, Nothing}
(default:[50, 100, 200, 500]
) A list of top feature (e.g., gene) counts to compute the percentage of total counts for each cell.layer
:Union{String, Nothing}
(default:nothing
) Specifies the layer inadata
to use. Ifnothing
, the default matrix is used.use_raw
:Bool
(default:false
) Whether to use the raw counts matrix (adata.raw
) instead of the processed data.use_log1p
:Bool
(default:true
) Whether to applylog1p
transformation to the data before computing quality control metrics.parallel
:Union{Bool, Nothing}
(default:nothing
) (Deprecated) This argument is ignored but retained for backward compatibility.
Returns
Nothing
: This function modifiesadata
in place by adding QC metrics toadata.obs
andadata.var
.
Notes
- Unlike
calculate_qc_metrics
, which returns QC metrics asDataFrame
objects, this function directly updatesadata
without returning any values.
sub functions
Juscan.Pp.describe_obs
— Functiondescribe_obs(adata; expr_type="counts", var_type="genes", qc_vars=String[],
percent_top=[50, 100, 200, 500], layer=nothing, use_raw=false,
use_log1p=true, X=nothing, parallel=nothing) -> DataFrames.DataFrame
Compute per-cell quality control (QC) metrics from the expression matrix and return them as a DataFrame
.
Arguments
adata
:Muon.AnnData
Annotated data matrix containing single-cell or multi-omics data.expr_type
:String
(default:"counts"
) The type of expression data to use, such as raw counts or normalized values.var_type
:String
(default:"genes"
) Specifies the variable type, e.g.,"genes"
for gene expression.qc_vars
:Union{Vector{String}, String}
(default:String[]
) A list of quality control variables to consider. If a single string is provided, it will be converted to a vector.percent_top
:Union{Vector{Int}, Nothing}
(default:[50, 100, 200, 500]
) A list of top feature (e.g., gene) counts to compute the percentage of total counts for each cell.layer
:Union{String, Nothing}
(default:nothing
) Specifies the layer inadata
to use. Ifnothing
, the default matrix is used.use_raw
:Bool
(default:false
) Whether to use the raw counts matrix (adata.raw
) instead of the processed data.use_log1p
:Bool
(default:true
) Whether to applylog1p
transformation to the data before computing quality control metrics.X
:Union{Nothing, SparseMatrixCSC}
(default:nothing
) The expression matrix to use. Ifnothing
, the matrix is determined by_choose_mtx_rep
.parallel
:Union{Bool, Nothing}
(default:nothing
) (Deprecated) This argument is ignored but retained for backward compatibility.
Returns
DataFrames.DataFrame
: ADataFrame
containing per-cell QC metrics, including:"n_$(var_type)_by_$(expr_type)"
: Number of detected features per cell."log1p_n_$(var_type)_by_$(expr_type)"
: Log-transformed number of features per cell."total_$(expr_type)"
: Total expression count per cell."log1p_total_$(expr_type)"
: Log-transformed total expression count per cell."pct_$(expr_type)_in_top_N_$(var_type)"
: Percentage of expression from the topN
features."total_$(expr_type)_$(qc_var)"
: Total expression for each QC variable per cell."log1p_total_$(expr_type)_$(qc_var)"
: Log-transformed total expression for each QC variable per cell."pct_$(expr_type)_$(qc_var)"
: Percentage of total expression contributed by each QC variable.
Notes
- This function computes per-cell QC metrics based on expression data. For per-gene metrics, see
describe_var
.
Juscan.Pp.describe_obs!
— Functiondescribe_obs!(adata; expr_type="counts", var_type="genes", qc_vars=String[],
percent_top=[50, 100, 200, 500], layer=nothing, use_raw=false,
use_log1p=true, X=nothing, parallel=nothing) -> Nothing
Compute per-cell quality control (QC) metrics and store them in adata.obs
.
Arguments
adata
:Muon.AnnData
Annotated data matrix containing single-cell or multi-omics data.expr_type
:String
(default:"counts"
) The type of expression data to use, such as raw counts or normalized values.var_type
:String
(default:"genes"
) Specifies the variable type, e.g.,"genes"
for gene expression.qc_vars
:Union{Vector{String}, String}
(default:String[]
) A list of quality control variables to consider. If a single string is provided, it will be converted to a vector.percent_top
:Union{Vector{Int}, Nothing}
(default:[50, 100, 200, 500]
) A list of top feature (e.g., gene) counts to compute the percentage of total counts for each cell.layer
:Union{String, Nothing}
(default:nothing
) Specifies the layer inadata
to use. Ifnothing
, the default matrix is used.use_raw
:Bool
(default:false
) Whether to use the raw counts matrix (adata.raw
) instead of the processed data.use_log1p
:Bool
(default:true
) Whether to applylog1p
transformation to the data before computing quality control metrics.X
:Union{Nothing, SparseMatrixCSC}
(default:nothing
) The expression matrix to use. Ifnothing
, the matrix is determined by_choose_mtx_rep
.parallel
:Union{Bool, Nothing}
(default:nothing
) (Deprecated) This argument is ignored but retained for backward compatibility.
Returns
Nothing
: This function modifiesadata
in place by adding per-cell QC metrics toadata.obs
.
Notes
- This function is similar to
describe_obs
but modifiesadata
directly. It computes QC metrics usingdescribe_obs
and stores the results inadata.obs
.- For per-gene QC metrics, see
describe_var!
.
- For per-gene QC metrics, see
Juscan.Pp.describe_var
— Functiondescribe_var(adata; expr_type="counts", var_type="genes", layer=nothing,
use_raw=false, use_log1p=true, X=nothing) -> DataFrame
Compute per-gene quality control (QC) metrics and return as a DataFrame.
Arguments
adata
:Muon.AnnData
Annotated data matrix containing single-cell or multi-omics data.expr_type
:String
(default:"counts"
) The type of expression data to use, such as raw counts or normalized values.var_type
:String
(default:"genes"
) Specifies the variable type, e.g.,"genes"
for gene expression.layer
:Union{String, Nothing}
(default:nothing
) Specifies the layer inadata
to use. Ifnothing
, the default matrix is used.use_raw
:Bool
(default:false
) Whether to use the raw counts matrix (adata.raw
) instead of the processed data.use_log1p
:Bool
(default:true
) Whether to applylog1p
transformation to the data before computing quality control metrics.X
:Union{Nothing, SparseMatrixCSC}
(default:nothing
) The expression matrix to use. Ifnothing
, the matrix is determined by_choose_mtx_rep
.
Returns
DataFrames.DataFrame
: A DataFrame where each row represents a gene (or variable), containing the following QC metrics:"n_cells_by_<expr_type>"
: Number of cells in which each gene is detected."mean_<expr_type>"
: Mean expression level of each gene."log1p_mean_<expr_type>"
: Log-transformed mean expression level."pct_dropout_by_<expr_type>"
: Percentage of cells in which each gene is not detected."total_<expr_type>"
: Total expression level of each gene."log1p_total_<expr_type>"
: Log-transformed total expression level.
Notes
- This function calculates quality control metrics per gene (or other variables).
- To store these metrics directly in
adata.var
, usedescribe_var!
. - To compute per-cell QC metrics, see
describe_obs
.
- To store these metrics directly in
Juscan.Pp.describe_var!
— Functiondescribe_var!(adata; expr_type="counts", var_type="genes", layer=nothing,
use_raw=false, use_log1p=true, X=nothing) -> Nothing
Compute per-gene quality control (QC) metrics and store them in adata.var
.
Arguments
adata
:Muon.AnnData
Annotated data matrix containing single-cell or multi-omics data.expr_type
:String
(default:"counts"
) The type of expression data to use, such as raw counts or normalized values.var_type
:String
(default:"genes"
) Specifies the variable type, e.g.,"genes"
for gene expression.layer
:Union{String, Nothing}
(default:nothing
) Specifies the layer inadata
to use. Ifnothing
, the default matrix is used.use_raw
:Bool
(default:false
) Whether to use the raw counts matrix (adata.raw
) instead of the processed data.use_log1p
:Bool
(default:true
) Whether to applylog1p
transformation to the data before computing quality control metrics.X
:Union{Nothing, SparseMatrixCSC}
(default:nothing
) The expression matrix to use. Ifnothing
, the matrix is determined by_choose_mtx_rep
.
Returns
Nothing
: This function modifiesadata
in place by adding per-gene QC metrics toadata.var
.
Notes
- This function is similar to
describe_var
but modifiesadata
directly. It computes QC metrics usingdescribe_var
and stores the results inadata.var
.- For per-cell QC metrics, see
describe_obs!
.
- For per-cell QC metrics, see
filter cells and genes
filter cells
Juscan.Pp.filter_cells
— Functionfilter_cells(
data::Muon.AnnData;
min_counts=nothing, min_genes=nothing,
max_counts=nothing, max_genes=nothing,
copy=false) -> Union{Tuple{BitVector, Vector}, Muon.AnnData}
Filters cells based on the given threshold criteria and returns either a subset mask and count vector or a new filtered Muon.AnnData
object.
Arguments
data::Muon.AnnData
: The input single-cell data.min_counts::Union{Int, Nothing}
: Minimum total counts per cell.min_genes::Union{Int, Nothing}
: Minimum number of genes expressed per cell.max_counts::Union{Int, Nothing}
: Maximum total counts per cell.max_genes::Union{Int, Nothing}
: Maximum number of genes expressed per cell.copy::Bool
: Iftrue
, returns a filtered copy; otherwise, returns a mask and count vector.
Returns
- If
copy == false
, returns a tuple(cell_subset::BitVector, number_per_cell::Vector)
. - If
copy == true
, returns a filteredMuon.AnnData
object.
filter_cells(
data::AbstractMatrix;
min_counts=nothing, min_genes=nothing,
max_counts=nothing, max_genes=nothing
) -> Tuple{BitVector, Vector}
Filters cells from a count matrix based on the given threshold criteria.
Arguments
data::AbstractMatrix
: Gene expression count matrix with cells as rows.min_counts::Union{Int, Nothing}
: Minimum total counts per cell.min_genes::Union{Int, Nothing}
: Minimum number of genes expressed per cell.max_counts::Union{Int, Nothing}
: Maximum total counts per cell.max_genes::Union{Int, Nothing}
: Maximum number of genes expressed per cell.
Returns
cell_subset::BitVector
: A mask indicating cells that pass the filter.number_per_cell::Vector
: A vector of counts or expressed gene numbers per cell.
Juscan.Pp.filter_cells!
— Functionfilter_cells!(
data::Muon.AnnData;
min_counts=nothing, min_genes=nothing,
max_counts=nothing, max_genes=nothing
) -> Nothing
Filters cells in-place in a Muon.AnnData
object.
Arguments
data::Muon.AnnData
: The input single-cell data.min_counts::Union{Int, Nothing}
: Minimum total counts per cell.min_genes::Union{Int, Nothing}
: Minimum number of genes expressed per cell.max_counts::Union{Int, Nothing}
: Maximum total counts per cell.max_genes::Union{Int, Nothing}
: Maximum number of genes expressed per cell.
Returns
Nothing
. The filtering is applied in-place.
filter genes
Juscan.Pp.filter_genes
— Functionfilter_genes(
data::Muon.AnnData;
min_counts=nothing, min_cells=nothing,
max_counts=nothing, max_cells=nothing,
copy=false
) -> Union{Tuple{BitVector, Vector}, Muon.AnnData}
Filters genes based on the given threshold criteria and returns either a subset mask and count vector or a new filtered Muon.AnnData
object.
Arguments
data::Muon.AnnData
: The input single-cell data.min_counts::Union{Int, Nothing}
: Minimum total counts per gene.min_cells::Union{Int, Nothing}
: Minimum number of cells expressing the gene.max_counts::Union{Int, Nothing}
: Maximum total counts per gene.max_cells::Union{Int, Nothing}
: Maximum number of cells expressing the gene.copy::Bool
: Iftrue
, returns a filtered copy; otherwise, returns a mask and count vector.
Returns
- If
copy == false
, returns a tuple(gene_subset::BitVector, number_per_gene::Vector)
. - If
copy == true
, returns a filteredMuon.AnnData
object.
filter_genes(
data::AbstractMatrix;
min_counts=nothing, min_cells=nothing,
max_counts=nothing, max_cells=nothing
) -> Tuple{BitVector, Vector}
Filters genes from a count matrix based on the given threshold criteria.
Arguments
data::AbstractMatrix
: Gene expression count matrix with genes as columns.min_counts::Union{Int, Nothing}
: Minimum total counts per gene.min_cells::Union{Int, Nothing}
: Minimum number of cells expressing the gene.max_counts::Union{Int, Nothing}
: Maximum total counts per gene.max_cells::Union{Int, Nothing}
: Maximum number of cells expressing the gene.
Returns
gene_subset::BitVector
: A mask indicating genes that pass the filter.number_per_gene::Vector
: A vector of counts or expressed cell numbers per gene.
Juscan.Pp.filter_genes!
— Functionfilter_genes!(
data::Muon.AnnData;
min_counts=nothing, min_cells=nothing,
max_counts=nothing, max_cells=nothing
) -> Nothing
Filters genes in-place in a Muon.AnnData
object.
Arguments
data::Muon.AnnData
: The input single-cell data.min_counts::Union{Int, Nothing}
: Minimum total counts per gene.min_cells::Union{Int, Nothing}
: Minimum number of cells expressing the gene.max_counts::Union{Int, Nothing}
: Maximum total counts per gene.max_cells::Union{Int, Nothing}
: Maximum number of cells expressing the gene.
Returns
Nothing
. The filtering is applied in-place.
normalization
Juscan.Pp.normalize_total
— Functionnormalize_total(
adata::AnnData;
target_sum::Union{Real, Nothing}=nothing,
exclude_highly_expressed::Bool=false,
max_fraction::Float64=0.05,
key_added::Union{String, Nothing}=nothing,
layer::Union{String, Nothing}=nothing,
layers::Union{String, Vector{String}, Nothing}=nothing,
layer_norm::Union{String, Nothing}=nothing,
copy::Bool=false
) -> Union{AnnData, Dict{String, Any}}
Normalize total counts per cell to a target sum.
Arguments
adata::AnnData
: The single-cell dataset to be normalized.target_sum::Union{Real, Nothing}=nothing
: Target total count per cell. Ifnothing
, normalization is performed to the median of counts per cell.exclude_highly_expressed::Bool=false
: Iftrue
, highly expressed genes are excluded from the normalization factor calculation.max_fraction::Float64=0.05
: A gene is considered highly expressed if it accounts for more thanmax_fraction
of the total counts in any cell.key_added::Union{String, Nothing}=nothing
: If provided, the computed normalization factors are stored inadata.obs[key_added]
.layer::Union{String, Nothing}=nothing
: The layer to normalize. Ifnothing
, the main count matrix (adata.X
) is used.layers::Union{String, Vector{String}, Nothing}=nothing
: Specifies multiple layers to normalize. If"all"
, all layers inadata.layers
are normalized.layer_norm::Union{String, Nothing}=nothing
: Defines the normalization reference for additional layers. Can be"after"
,"X"
, ornothing
.copy::Bool=false
: Iftrue
, returns a copy ofadata
with normalized counts; otherwise, modifiesadata
in place.
Returns
- If
copy=true
, returns a newAnnData
object with normalized counts. - If
copy=false
, modifiesadata
in place and returns a dictionary containing:"X"
: The normalized count matrix."norm_factor"
: The computed normalization factors.- Additional layers if
layers
is specified.
Notes
- If
exclude_highly_expressed=true
, genes that exceedmax_fraction
in any cell are ignored when computing normalization factors. - Cells with zero counts will generate a warning.
- The function supports layer-wise normalization if
layers
is specified.
Example
adata = AnnData(X)
normalize_total(adata; target_sum=10000)
Juscan.Pp.normalize_total!
— Functionnormalize_total!(
adata::AnnData;
target_sum::Union{Real, Nothing}=nothing,
exclude_highly_expressed::Bool=false,
max_fraction::Float64=0.05,
key_added::Union{String, Nothing}=nothing,
layer::Union{String, Nothing}=nothing,
layers::Union{String, Vector{String}, Nothing}=nothing
) -> Nothing
Normalize total counts per cell in-place.
Arguments
adata::AnnData
: The single-cell dataset to be normalized.target_sum::Union{Real, Nothing}=nothing
: Target total count per cell. Ifnothing
, normalization is performed to the median of counts per cell.exclude_highly_expressed::Bool=false
: Iftrue
, highly expressed genes are excluded from the normalization factor calculation.max_fraction::Float64=0.05
: A gene is considered highly expressed if it accounts for more thanmax_fraction
of the total counts in any cell.key_added::Union{String, Nothing}=nothing
: If provided, the computed normalization factors are stored inadata.obs[key_added]
.layer::Union{String, Nothing}=nothing
: The layer to normalize. Ifnothing
, the main count matrix (adata.X
) is used.layers::Union{String, Vector{String}, Nothing}=nothing
: Specifies multiple layers to normalize. If"all"
, all layers inadata.layers
are normalized.
Returns
- This function modifies
adata
in place and does not return a new object.
Notes
- If
exclude_highly_expressed=true
, genes that exceedmax_fraction
in any cell are ignored when computing normalization factors. - Cells with zero counts will generate a warning.
- The function supports layer-wise normalization if
layers
is specified.
Example
adata = AnnData(X)
normalize_total!(adata; target_sum=10000)