A critical step when analyzing RNA-seq data is normalization. Normalization removes systematic biases that arise from variability in experimental conditions, sample collection and preparation, and machine parameters, etc., while preserving the variation in gene expression that occurs because of biologically relevant changes in transcription at the same time.
RNA-seq Normalization implements seven normalization methods, among which five are existing methods (DESeq, TMM, RPM, PoissonSeq and UQ) for fresh frozen (FF) RNA-seq data. Two novel normalization methods, MIXnorm and SMIXnorm, can be applied to both FF and formalin-fixed paraffin-embedded (FFPE) RNA-seq data.
All methods take raw count matrix as input and reture the normalized expression matrix as a downloadable csv file. Offline versions of MIXnorm and SMIXnom are available from the Download page. The probability of being expressed for each gene and the proportion of expressed genes are also returned from the offline version.
MIXnorm is based on a two-component mixture model, which intends to capture major characteristics of the FFPE RNA-seq data accurately. MIXnorm takes the raw count data as input and output the normalized expression matrix in log scale.
SMIXnorm is a simplified version of MIXnorm. We recommend using SMIXnorm for FFPE RNA-seq data normalization for faster computation when the number of samples is larger than 25.
DESeq normalizes the raw data by using (estimated) scaling factors and is implemented in R package DESeq. The scaling factors are calculated by the median ratio of gene counts to its geometric mean across samples.
TMM is implemented in edgeR package. The scaling factor is estimated by the weighted mean of M values after trimmed away the extreme M values.
PoissonSeq is implemented from PoissonSeq package. PoissonSeq models raw RNA-seq count data by a Poisson log-linear model. The normalization is performed implicitly.
The RPM is perhaps the most straight forward normalization method. RPM estimates the scaling factor by the total number of reads for each sample.
The UQ normalization estimates the scaling factor by the upper quartile of read counts across all genes for each sample.