SMART-total cfRNA-seq Data Analysis

1.1 Trim adaptor

cutadapt --pair-filter any  -q 30,30 \
            -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
            --trim-n -m 16  -o >(gzip -c > {output.fastq1}) -p >(gzip -c > {output.fastq2}) \
            {input.fastq1} {input.fastq2} > {log} 2>&1

1.2 Trim GC oligo introduced in template switching

For reverse stranded (first strand) libraries, set strandness to reverse, for forward stranded (second strand) libraries, set strandness to forward
Take a look at metadata.txt for strand information
Read pairs shorter than 30 nt are discarded

Input should be compressed by gzip, named as {sample_id}_{1,2}.fastq.gz

python bin/trimGC.py -s {strandness} -o {output_preffix} -i {input_preffix} > {log} 2>&1

2.1 Reads alignment

Sequentially align reads to spikeIn,UniVec,rRNA,hg38,and circRNA using the following command
- For spikeIn,UniVec,rRNA,and circRNA alignment, set seedPerWindowNmax to 20
- For hg38 alignment, set seedPerWindowNmax to 50
- Note unmapped reads of STAR 2.5.4a (also called STAR 2.5.3a_modified) may out of order in multi-thread alignment (https://github.com/alexdobin/STAR/issues/222), the latest version of STAR is recommended. Alternatively, unmapped fastq files should be repaired (using repair.sh in bbmap suite for example) before downstream analysis, or the genome alignment rate could be extremely low.
```
STAR --genomeDir {sequenceIndex} \
        --readFilesIn {input.reads1} {input.reads2} \
        --runThreadN 4 \
        --outFileNamePrefix {output_prefix} \
        --outSAMtype BAM Unsorted \
        --outReadsUnmapped Fastx \
        --readFilesCommand gzip -d -c \
        --outSAMmultNmax 1 \
        --seedPerWindowNmax {seedPerWindowNmax}
```

2.2 Remove duplication in circRNA.bam and genome.bam with picard tools

java -jar {picardDir}/picard.jar \
            MarkDuplicates REMOVE_DUPLICATES=true \
            ASSUME_SORT_ORDER=queryname \
            I={input.bam} \
            O={output.bam} \
            M={output.metrics} \
            READ_NAME_REGEX=null

2.3 Get intron spanning reads from genome aligned reads

If any mate in a read pair contains N in its CIGAR string, this fragment was considered as intron-spanning
```
bash bin/getIntron-spanning.sh {inbam} {outbam} > {log} 2>&1
```

3.1 Assign reads in genome bam file to certain genomic regions

The priority of the assignment is defined in config/priority.txt

bed files corresponding to regions in config/priority.txt should be present in {beddir}, named as {region}.bed, in bed6 format

# For forward stranded libraries, '-S' in bedtools coverage -counts -S -sorted -a - -b ${beddir}/${region}.bed should be replaced by '-s'
bash bin/sequential.assign.long.sh {bam} {outdir} {beddir}

3.2 Quantify gene and circRNA expression

# Quantify gene expression
# For forward stranded libraries,strandness=1, for reverse stranded libraries, strandness=2 
featureCounts -O -t exon -g gene_id -M -s {strandness} -p -a {gtf} -o {counts} {bam} > {log}
# Quantify circRNA expression
# Strandness: forward, reverse
bin/count_reads.py count_circrna -s {strandness} --paired-end -i {inbam} -o {output}

4.1 Calculate PSI scores

## Running rMATs (rMATS.4.0.2/rMATS-turbo-Linux-UCS4)
python2 {rmats_path} --b1 {pos_bam_paths} --b2 {neg_bam_paths} --gtf {gtfFile} --od {outdir} -t paired --readLength 150
## Summarize rMATs results for {splicing_type} (one of MXE A3SS A5SS SE RI)
python3 bin/summarize-splicing.py --input {splicing_type}.MATS.JC.txt  --outdir outdir --type {splicing_type} --method JC  --pos pos_ids.txt  --neg  neg_ids.txt

4.2 Calculate PDUI scores

## Sort bam file by coordinate
samtools sort  -o {bamUnsorted} {bamSorted}
samtools index  {bamSorted}

## Calculate coverage, strand in +,-
bedtools genomecov -strand {strand}  -split -ibam {bamSorted} \
            | LC_COLLATE=C sort -T {temp_dir} -k1,1 -k2,2n > {bedgraph}
bedGraphToBigWig {bedgraph} {chrom_sizes} {bigwig}

## Merge Coverage of + strand and - strand
bigWigMerge {bigwigPosStrand} {bigwigNegStrand} {bedgraphMerged}
sort -k1,1 -k2,2n {bedgraphMerged} > {bedgraphMergedSorted}
bedGraphToBigWig {bedgraphMergedSorted}  {chrom_sizes} {bigwigMerged}
bigWigToWig {bigwigMerged} {wigMerged}

## Run DaPar
{DaParsDir}/src/DaPars_main.py {config}

## Summarize results
python  bin/parseDapar.py -i {input} -c {config} -l {long} -s {short} -p {PDUI}

4.3 Calcualte RNA editing level by gene

## Identify recurrent RNA editing sites
## For each sample, run
{RNAEditorDir}/RNAEditor.py  -i  {fastq_1} {fastq_2} -c {config}

# Calculate detection recurrency for each editing sites:  bin/recurrent-editing.py

# Filter intergenic editing sites: bin/get-intragene-editing-sites.sh

## Calculate coverage for each sample at recurrently edited sites, include samples which no editing events were reported by RNAEditor
samtools mpileup -l  editing-sites-pos-list.txt -o ${output} --reference ${ref} ${bam}

# Parse pileup result (get reads number support editing/support not editing)
python bin/cal-coverage.py -i {pileup} -o {coverage}

# Summarize the coverage by genomic positions: bin/summarize-editing-coverage.py

# Calculate editing level for each gene and each editng site: bin/editing-level.py 

4.4 Classification of unmapped reads

## Classify unmapped reads with kraken2
kraken2 --db {kraken2db}  --unclassified-out {outprefix}  --report {report} --paired --use-names {unmapped_1} {unmapped_2}

## Summarize kraken2 result
## Genus level data was used for classification 
python bin/summarize-kraken.py -i {report} -l {taxoLevel} -o {output}

5.1 Filtering

See bin/filter.py

Gene expression data: retained genes with CPM > 2 for more than 50% of samples in at least one class

bin/filter.py -i {input} -o {output} --stratify metadata/sample_classes.txt --proportion 0.5 --stratify_key label --threshold 2 --pass_gene_ids  {detected_gene} --method by_value

PSI scores and PDUI scores: in each class, the values of more than 80% of samples should not be null

python bin/filter.py --input {input} -o {output} --stratify metadata/sample_classes.txt --stratify_key label --collapse intersection --pass_gene_ids passed-ids.txt --proportion 0.8 --method by_na

RNA editing level: genes with more than 50% of samples with fewer than 6 reads supporting the editing level calculation were filtered out.

bin/filter.py --input {coverage-by-gene} -o {coverage-by-gene-filtered} --stratify metadata/sample_classes.txt --stratify_key label --collapse intersection --pass_gene_ids {gene-passed-filter} --threshold 6 --proportion 0.5 --method by_value

5.2 Differential analysis:

For counts data ( expression counts and kraken2 counts ), see default method (edger-glmlrt) in bin/differential_expression.R
- –positive-ids/–negative-ids: path of file contain sample ids, one id per line
```
Rscript bin/differential_expression.R \
-i {input}   \
--method edger_glmlrt \
--positive-ids {pos-ids} \
--negative-ids {neg-ids} \
-o {output}
```

For analysis of ratio based data (psi, PDUI and editing level), see bin/ranksum.py

python bin/ranksum.py --input {matrix} --output {diff-table} --pos_ids {pos-ids} --neg_ids {neg-ids}