Nadia Atallah
Purdue University Center for Cancer Research
• Consulting
• Project Work & Data Analysis
• Method Development
• Study Design
• Integrate data with public domain data
• Training
• Aid in Grant Writing
• Aid in Manuscript Preparation
Services provided
• Ideally a bioinformatician working 40 hours a week will have 2-3 projects to work on at once.
• Difficult to manage workload
  • When is a project really finished?
  • Often a project will be "dormant" for months and then the PI will contact you upon beginning the publication
• Take excellent notes; this will make revisiting old projects far easier
• Back up data
• Manage expectations: very important!
  • Do not make empty promises
  • Do not blindly say something can be done without knowing
  • Do not pretend to know everything
  • Give realistic timelines
• Communication is incredibly important, especially for maintaining positive relationships
Project Management
• After I get a request:
  1. Initial consultation
  2. Upon receiving data, put project in queue
  3. Begin analysis. Update PI after each major step is completed.
  4. Email results with brief description
  5. Type formal report
  6. Meet PI to go over report, address questions
  7. Perform any additional analyses
  8. Aid in writing of the manuscript
  9. Data deposition
Project Management
• Communicating across disciplines
• Giving accurate timelines
• Communicating results in a manner which is accurate and understandable
• Explaining limitations of the technology
• Dealing with poor experimental design
• Fast moving field - must keep learning
  • Read
  • Classes
  • Book study groups
  • Conferences – meet as many people in the field as possible
  • Try new software for fun
• Software is often difficult to use
Challenges of the Job
• Clear methods description
• Specific types of analyses
• Whether a project can be completed by a certain date (e.g., for a grant submission)
• Specific types of graphs, figures
• All data, including intermediate files
• Help for you/your student/your postdoc – depends on the bioinformatician
• Scripts – depends on the bioinformatician
What is reasonable to ask from a Bioinformatician
• To move things around on your Excel spreadsheets for you
• To be your graphic designer
• To use an improper statistical design in a current analysis because it will be consistent with your old dataset
• To get something done with one day's notice…
• To try to "do something to find significant results"
• To calculate significance without replicates
• To fix poor experimental design
What not to ask me for…
• Have some level of background knowledge of the technology you are using (terminology, etc.)
• Be able to clearly describe your experimental goals
• Know the biological side of your experimental design
• Know your budget
• Understand that some relatively simple ideas can take time to perform
• Have examples/pictures of specific graphs, custom analyses
Working with your Bioinformatician
• Standard Analysis
  • Often involves the use of standardized pipelines
  • Examples: bulk RNA-Seq analysis, ChIP-seq analysis
  • Even when running a pipeline, I have to modify code to fit the project at hand
• Semi-standard Analysis
  • No high-quality standardized pipelines
  • Often have to try multiple software packages, edit existing software, write small scripts
  • Examples: lncRNA identification, single-cell RNA-Seq
• Novel Analysis
  • Much greater time commitment
  • Writing code, figuring out best algorithms to use, optimizing running time
  • Can take months (or longer)
Various levels of engagement
Standard Analysis – all steps
• Check data quality (FastQC)
• Trim & filter reads, remove adapters (Trimmomatic)
• Align reads to reference genome (TopHat)
• Count reads aligning to each gene (HTSeq)
• Unsupervised clustering
• Differential expression analysis (edgeR)
• GO enrichment analysis (DAVID)
• Pathway analysis (DAVID/IPA)
Novel Analysis – 3 steps out of 7
• Check data quality (R, Unix)
• Break reads into multiple segments (custom Perl script)
• Map using shell script; try using BBMap
• Write custom script using Needleman-Wunsch algorithm
• Write a quasi-global alignment package
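The novel-analysis workflow mentions a custom script built on the Needleman-Wunsch algorithm. The slides do not show that script, so the following is only a textbook sketch of Needleman-Wunsch global alignment with a linear gap penalty. The function name and the scoring values (match=1, mismatch=-1, gap=-1) are illustrative assumptions, not the parameters used in the project.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align strings a and b; return (score, aligned_a, aligned_b).

    Scoring values are illustrative defaults, not taken from the slides.
    """
    n, m = len(a), len(b)

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j = n, m
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1

    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


score, aligned_a, aligned_b = needleman_wunsch("ACGT", "AGT")
print(score, aligned_a, aligned_b)
```

The table fill is O(len(a) × len(b)) in time and memory, which is one reason the slide lists "optimizing running time" as part of a novel analysis.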
Collaboration
• Provide a quality analysis, often involving the development of novel methods
• Clear goal in mind, sometimes with the understanding that the goals may change
• Requires significant time, commitment (personal engagement), and interest
• Authorship is generally talked about up-front
Service
• Provide a quality, standard analysis
• Clear deliverables
• Most projects fall under this category
• Authorship is not guaranteed – up to the PI and also dependent on the bioinformatician's contribution
Service vs Collaboration
Using Supercomputers is Necessary for a Timely Analysis
• To map all sequence reads from a cell to the human genome: ~4.4 min on Conte versus 2 h 28 min on a MacBook Pro for 1 cell.
• One project had ~550 cells to process. On Conte: ~1.7 days. On MacBook Pro: ~56 days
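The totals above follow directly from the per-cell timings. A minimal sketch of the arithmetic, assuming the cells are processed one after another (the variable and function names are illustrative):

```python
MINUTES_PER_DAY = 60 * 24

# Per-cell mapping times quoted on the slide.
conte_min_per_cell = 4.4            # ~4.4 min on Conte
macbook_min_per_cell = 2 * 60 + 28  # 2 h 28 min on a MacBook Pro
n_cells = 550                       # ~550 cells in the project


def total_days(minutes_per_cell, cells):
    """Total wall-clock days if cells are mapped serially."""
    return minutes_per_cell * cells / MINUTES_PER_DAY


conte_days = total_days(conte_min_per_cell, n_cells)
macbook_days = total_days(macbook_min_per_cell, n_cells)
speedup = macbook_min_per_cell / conte_min_per_cell

print(f"Conte:   {conte_days:.1f} days")
print(f"MacBook: {macbook_days:.1f} days")
print(f"Speedup: {speedup:.1f}x")
```

Under this serial assumption the per-cell speedup is roughly 34-fold, which is what turns an analysis of about two months into one of under two days.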
• Data should be stored in multiple places
• Get the raw data and analysis files! It is risky to not have a copy of your data!
• Storing your data takes space and therefore money
  • $150 per TB/year
• Cost of service is not just time. We pay for storage, nodes, software, and IT support.
• For how long should a bioinformatics core/sequencing center store data?
Storing Data
What should I tell the sequencing center I want?
• Depth, number of lanes
• Multiplexing
• Single-end versus paired-end
• Which RNA species am I interested in sequencing?
• Strand-specific?
• Length of reads
• Poly A selection or ribodepletion
• Quality control: quality measurements before sequencing, sometimes quality information from after sequencing
• Differs depending on where you get your data sequenced – make sure you know what they are giving you!
  • Trimmed? Adapters removed? Are reads aligned?
• Ask what kit was used to prepare the libraries
• What instrument was used, what kind of selection was used to exclude unwanted data (example: polyA selection for RNA-seq)
  • This information is necessary for publication!
• Commands/parameters/software used in processing the data
• Always get the raw data!
What to expect from your sequencing Center
RNA extraction, purification, and quality assessment
• RIN = RNA integrity number
• Generally, RIN scores >8 are good, depending on the organism
• Important to use high RIN score samples, particularly when sequencing small RNAs, to be sure you aren't simply selecting degraded RNAs
[Figure: RNA quality trace with 18S and 28S peaks labeled]
Data Cleaning: a Multistep Process
• Remove adapters
  • Removes adapter sequences
• Remove contamination
  • Remove contamination from fastq files
• Trim reads
  • Trim reads based on quality
• Separate reads
  • Separate reads into paired and unpaired
Make sure you know where you are in the pipeline and what you have been given by your sequencing center!
Quality Control – Per Base Sequence Quality
[Figure: per base sequence quality plots, before trimming and after trimming]
File formats: FASTQ files – what we get back from the sequencing center
• This is usually the format your data is in when sequencing is complete
• Text files
• Contains both sequence and base quality information
  • Phred score = Q = -10 log10(P)
  • P is the base-calling error probability
  • Integer scores converted to ASCII characters
• Example:
@ILLUMINA:188:C03MYACXX:4:1101:3001:1999 1:N:0:CGATGT
TACTTGTTACAGGCAATACGAGCAGCTTCCAAAGCTTCACTAGAGACATTTTCTTTCTCCCAACTCACAAGATGAACACAAAATGGAAACT
+
1=DDFFFHHHHHJJDGHHHIJIJIIJJIJIIIGIIGJIIIJCHEIIJGIJJIJIIJIJIFGGGGGIJIFFBEFDC>@@BB?A9@3;@(553>@>C(59:?
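To make the Phred encoding concrete, here is a small sketch that decodes the quality string from the example record and converts scores to error probabilities using Q = -10 log10(P). It assumes the Phred+33 ASCII offset, which is consistent with this string containing characters below "@" but is not stated on the slide. The function names are illustrative.

```python
import math

# Quality line copied from the example FASTQ record.
QUAL = "1=DDFFFHHHHHJJDGHHHIJIJIIJJIJIIIGIIGJIIIJCHEIIJGIJJIJIIJIJIFGGGGGIJIFFBEFDC>@@BB?A9@3;@(553>@>C(59:?"


def decode_quality(qual, offset=33):
    """Convert ASCII quality characters to integer Phred scores.

    offset=33 (Phred+33) is an assumption about how the file was encoded.
    """
    return [ord(ch) - offset for ch in qual]


def phred_to_error_prob(q):
    """Invert Q = -10 * log10(P) to get the base-calling error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Q = -10 * log10(P)."""
    return -10 * math.log10(p)


scores = decode_quality(QUAL)
print("first base:", scores[0], "-> P(error) =", round(phred_to_error_prob(scores[0]), 4))
print("lowest score:", min(scores), "highest score:", max(scores))
print("mean score:", round(sum(scores) / len(scores), 1))
```

A score of 30 corresponds to a 1 in 1,000 chance that the base call is wrong, and a score of 20 to 1 in 100, which is why quality trimming focuses on the low-scoring bases that tend to appear toward the end of a read.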
File formats: FASTA files
• Text file with sequences (amino acid or nucleotides)
• First line per sequence begins with > and information about sequence
• Example:
>comp2_c0_seq1
GCGAGATGATTCTCCGGTTGAATCAGATCCAGAGGCATGTATATATCGTCTGCAAAATGCTAGAAACCCTCATGTGTGTAATGCAGTGCATTCATGAAAACCTTGTAAGCTCACGTGTCGCTGACTGTCTGAGAACCGACTCGCTAATGTTCCATGGAGTGGCTGCATACATCACAGATTGTGATTCCAGGTTGCGAGACTATTTGCAGGATGCATGCGAGCTGATTGCCTATTCCTTCTACTTCTTAAATAAAGTAAGAGC
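Because FASTA is plain text with ">" marking each header, it can be read with a few lines of code. The sketch below is a minimal reader, not a replacement for an established parsing library. The function name and the two demo records are made up for illustration.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA text handle.

    Sequences wrapped over several lines are joined into one string.
    """
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical demo input: one wrapped nucleotide record, one protein record.
demo = StringIO(">seq_a demo record\nGCGAGATGAT\nTCTCCGG\n>seq_b\nMKTAYIAK\n")
records = list(read_fasta(demo))
for name, seq in records:
    print(name, len(seq))
```

Reading from a file works the same way: pass the object returned by `open(path)` instead of the `StringIO` demo.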
File formats: BAM and SAM files
• SAM file is a tab-delimited text file that contains sequence alignment information
• This is what you get after aligning reads to the genome
• BAM files are simply the binary version (compressed and indexable) of SAM files, so they are smaller
• Example:
[Figure: example SAM file showing header lines (begin with "@") followed by the alignment section]
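The two parts of a SAM file (header lines starting with "@", then one tab-delimited alignment per line) can be separated with a short script. This is a sketch only: the SAM text below is a made-up two-read example, the names `Alignment`, `parse_sam`, and `is_unmapped` are illustrative, and real work is usually done with samtools or a dedicated library. The column order follows the eleven mandatory fields of the SAM specification.

```python
from typing import List, NamedTuple, Tuple


class Alignment(NamedTuple):
    """A subset of the 11 mandatory SAM fields."""
    qname: str   # read name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # CIGAR string
    seq: str     # read sequence
    qual: str    # base qualities (ASCII encoded)


def parse_sam(text: str) -> Tuple[List[str], List[Alignment]]:
    """Split SAM text into header lines and parsed alignment records."""
    headers, alignments = [], []
    for line in text.splitlines():
        if not line:
            continue
        if line.startswith("@"):
            headers.append(line)
            continue
        f = line.split("\t")
        # Columns 7-9 (RNEXT, PNEXT, TLEN) are skipped in this sketch.
        alignments.append(
            Alignment(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5], f[9], f[10])
        )
    return headers, alignments


def is_unmapped(aln: Alignment) -> bool:
    """Flag bit 0x4 marks a read that did not align."""
    return bool(aln.flag & 0x4)


# Hypothetical SAM content, for illustration only.
SAM_TEXT = "\n".join([
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:1000",
    "read1\t0\tchr1\t101\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTTTGGGGCC\tIIIIIIIIII",
])

headers, alignments = parse_sam(SAM_TEXT)
mapped = [a.qname for a in alignments if not is_unmapped(a)]
print(len(headers), "header lines;", len(alignments), "reads;", "mapped:", mapped)
```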
• Background information about your system
• Timeline
• What are your goals? Have a clear idea of your goals/hypothesis
• Is there a specific genome version you want us to use?
• Is there old data you want to compare the current data with?
• What comparisons do you want to make?
• Experimental design:
  • How many replicates? How were replicates treated/grown?
  • Is there any potential for batch effects?
What do you need to tell your bioinformatician?
• Understanding significance of results (or lack of)
• Understanding the analysis
• Communication
• Understanding what bioinformaticians/statisticians do
• Knowing what is/is not possible
• What experiments to do/technology to use
• Limitations of the technology used
• Experimental Design
Common Issues I see Amongst Users
• To pool or not to pool samples…
• No replication
• How many reads (lanes) to sequence
• Paired-end versus single-end
• What is a biological vs technical replicate? Can you have a true biological replicate from a cell line?
• ChIP protocol – be careful to perfect and optimize protocol for your system and conditions
  • Common issues: not enough sample used, poor antibody specificity, sonication time, improper controls
Common Experimental Design Problems
• Ask sequencing center how many reads/lane they get per run
• Reads needed depends on experimental objectives
  • Differential gene expression? Get enough counts of each transcript such that accurate statistical inferences can be made
  • De novo transcriptome assembly? Maximize coverage of rare transcripts and transcriptional isoforms
  • Annotation?
  • Alternative splicing analysis?
How many reads/lanes should I sequence?
1) Liu Y., et al. RNA-seq differential expression studies: more sequence or more replication? Bioinformatics 30(3):301-304 (2014).
2) Liu Y., et al. Evaluating the impact of sequencing depth on transcriptome profiling in human adipose. PLoS One 8(6):e66883 (2013).
3) Bentley, D. R., et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456, 53–59 (2008).
4) Rozowsky, J., et al. PeakSeq enables systematic scoring of ChIP-seq experiments relative to controls. Nature Biotech. 27, 65–75 (2009).
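Once the sequencing center reports its reads per lane, the lane count is simple arithmetic. The sketch below assumes samples are pooled evenly and that no reads are lost to filtering. Every number in it is a hypothetical placeholder, not a recommendation from the slides or the references above.

```python
import math


def reads_per_sample(reads_per_lane, lanes, samples):
    """Average reads each sample receives when pooled evenly across lanes."""
    if samples <= 0:
        raise ValueError("samples must be positive")
    return reads_per_lane * lanes / samples


def lanes_needed(target_reads_per_sample, samples, reads_per_lane):
    """Smallest whole number of lanes that meets the per-sample target."""
    return math.ceil(target_reads_per_sample * samples / reads_per_lane)


# Hypothetical inputs: ask your sequencing center for the real yield.
yield_per_lane = 300_000_000   # reads per lane (placeholder)
n_samples = 12                 # multiplexed libraries (placeholder)
target = 30_000_000            # desired reads per sample (placeholder)

print(reads_per_sample(yield_per_lane, 1, n_samples))   # reads per sample on one lane
print(lanes_needed(target, n_samples, yield_per_lane))  # lanes required for the target
```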
Keep in mind caveats…
• If you have zero counts it does not necessarily mean that a gene is not expressed at all
  • Especially in single-cell RNA-seq
• RNA and protein expression profiles do not always correlate well
  • Correlations vary wildly between RNA and protein expression
  • Depends on category of gene
  • Correlation coefficient distributions were found to be bimodal between gene expression and protein data (one group of gene products had a mean correlation of 0.71; the other had a mean correlation of 0.28)
    • Shankavaram et al., 2007
Don'ts
• Don't forget to include enough info to ensure reproducibility
• Don't use incorrect terminology
• Don't publish plots that are peripheral or that no one else publishes
  • Examples: FastQC plots
  • Note: there is nothing wrong with making these available, but putting them in the manuscript is not advised
Dos
• Include version numbers for all software used
• Look at other RNA-seq papers in similar tier journals prior to publishing
• Deposit all data
• Include adequate experimental design information
  • Read number, sequencing platform, read length, genome version and build, kits used in library construction
• Can look at ENCODE standards for many common analyses: https://www.encodeproject.org/data-standards/
Publishing Large Datasets
• Necessary for publishing in many journals
• Necessary for NIH-supported studies
• Good scientific practice!
• Most commonly used databases to submit to:
  • Gene Expression Omnibus (GEO) – any large-scale gene expression dataset
  • Sequence Read Archive (SRA) – high throughput sequencing data
  • dbGaP – microarray data from clinical studies; requires controlled access
• These databases are very useful for submitting data to as well as for data mining
• You can submit your data, obtain an accession number, and still delay making the data publicly available until publication of your manuscript
Data Deposition
GEO dataset
• The GEO Accession display of a project generally gives multiple types of information:
  • Status (when data became public)
  • Title
  • Organism
  • Experimental Type
  • Summary
    • Background
    • Methods
    • Results
    • Conclusions
  • Overall design
  • Contributors
  • Citation
  • Downloads
Data Download from GEO
• MINiML files are the same as SOFT, but in XML format
• Series matrix TXT files are tab-delimited value-matrix files. Can be imported into Excel
• https://www.ncbi.nlm.nih.gov/geo/info/submission.html
• Fill out a metadata spreadsheet (format will be dependent on the type of data you plan to submit), then submit raw and processed data files using an FTP server
  • I use FileZilla
  • Instructions online
To submit to GEO
Questions?