1.) Starting from the Cleveland dataset (the raw fastq files), preliminary QC was done with fastqc on the raw fastq files. The results are here, along with the raw fastq files in case we have to deposit them: http://intron.ucsc.edu/jen/jan2019/Cleveland/

2.) bbmap clumpify was used to remove duplicates and clump the reads (essentially default parameters). The bbmap version is 37.90, which comes with clumpify, so I think the clumpify version is the same. The script is here: http://intron.ucsc.edu/jen/april_23/clump

3.) Cutadapt v1.11 was used to remove adapters and to filter reads based on presence of the adapter, error rate, minimum length, and minimum overlap between read and adapter, with parameters -O 8 -n 2 -m 23 -e 0.11 --discard-untrimmed. The script is here: http://intron.ucsc.edu/jen/april_23/cutadapt_smSMIT2.pl

4.) Alignment to sacCer3 was done using hisat2-2.1.0 with parameters --no-mixed --max-intronlen 10000.
Alignment statistics: http://intron.ucsc.edu/jen/april_23/alignment_stats
Script: http://intron.ucsc.edu/jen/april_23/raphisat2
Aligned reads: http://intron.ucsc.edu/jen/april_23/alignedreads/

5.) The intron annotations were used unchanged from Tara and are contained here: http://intron.ucsc.edu/jen/april_23/SMIT_accessory_files.zip
Those annotation files annotate gene, intron, and terminal exon length. The R script that generates the splicing files, as described in Tara's paper, is here: http://intron.ucsc.edu/jen/april_23/2SMIT_processing_local.R
This produced the files with the splicing values as well as these plots: http://intron.ucsc.edu/jen/jan2019/processed_data100/smit_curves/panels/
In addition to the SMIT_accessory_files, the list of the 62 relevant genes that were in the study is an input. It is here: http://intron.ucsc.edu/jen/april_23/genelist62
The inputs are the sacCer3 gene annotations, the mapped-read bed files, the primed genes, and intronless genes as controls. Only reads associated with the primed genes and the intronless control genes are used.
Spliced and unspliced reads are counted by position relative to the 3'ss. The raw splice value is (spliced count)/(unspliced count + spliced count). The distribution and probability of insert length are determined from the data. The raw splice value is then normalized using the position, the insert length, and the lengths of the gene products (spliced and unspliced), as well as the insert length probability function. The script for this is in SMIT_accessory_files/smitData.R

6.) The splicing values were binned by 20 nt, plotted, and used to calculate a paired Wilcoxon p-value to determine whether the difference between conditions was significant. The binned values were also used to create a score from the difference between conditions for each gene, over the region from the start of signal to 200 nt after. (The Wilcoxon two-sample test is also called the Mann-Whitney test.) This is the last step.
Recent plots: http://intron.ucsc.edu/jen/april_23/finorm2200/
Script: http://intron.ucsc.edu/jen/april_23/finorm2200/normedaauc200.pl
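For illustration only, here is a minimal Python sketch of the raw splice value and the 20 nt binning described above. It is not the code in smitData.R or normedaauc200.pl. The function names, the dictionary inputs (position relative to the 3'ss mapped to a read count), and the choice to average the per-position values within each bin are all assumptions made for this example, and the insert-length normalization is not included.

```python
from collections import defaultdict


def raw_splice_values(spliced, unspliced):
    """Raw splice value per position: spliced / (unspliced + spliced).

    Both arguments are hypothetical dicts mapping a position relative to
    the 3'ss to a read count. Positions with no reads at all are skipped.
    """
    values = {}
    for pos in set(spliced) | set(unspliced):
        s = spliced.get(pos, 0)
        u = unspliced.get(pos, 0)
        if s + u > 0:
            values[pos] = s / (s + u)
    return values


def bin_values(values, bin_width=20):
    """Average the per-position values in bins of bin_width nt.

    Assumption: bins are summarized by their mean. Keys of the result are
    bin start positions (0, 20, 40, ... for the default width).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for pos, value in values.items():
        start = (pos // bin_width) * bin_width
        sums[start] += value
        counts[start] += 1
    return {start: sums[start] / counts[start] for start in sorted(sums)}


if __name__ == "__main__":
    # Made-up counts, for demonstration only.
    spliced = {5: 1, 15: 3, 25: 6}
    unspliced = {5: 3, 15: 1, 25: 2}
    raw = raw_splice_values(spliced, unspliced)
    print(bin_values(raw))
```

With binned values like these for two conditions, the per-bin pairs for a gene are what a paired test would compare.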