In the pipeline – Part 1: ‘Plan, plan, and plan some more’

So you’ve decided it’s time to finally get around to starting that sequencing project. But before you aimlessly leap into it and generate terabytes of sequencing data, just STOP.

It’s far too tempting to rush into sequencing projects for a number of reasons. Maybe you need to get it done quickly to spend some left-over grant money or use up some reagents. Isn’t everyone doing genomics these days? How hard can it be? But trust me on this one – before breaking out the Qiagen columns (if you’re loaded), lung-melting organic solvents (if you’re not), or partaking in the world’s most annoying game of -80 hide-and-seek with your new least favourite sample, take a step back.

This article will be the first in a series that will take you from planning your genomics project all the way through to analysing your sequence data and plotting some nice figures. One thing to bear in mind for this series is that, as you might have already found out if you have spent any time on Biostars or Stack Overflow, there are hundreds of ways to do even the most basic of genomics projects. What I hope to cover in this series is by no means the ‘best’ way of doing things (not that this even exists), but rather one way of doing things, with the aim of pointing you in the right direction to finding something that works for your project.

What is the question?

I think it’s clear that if you were designing a study to measure how rapidly many different plant families across the world grow then you would need a different dataset than if you wanted to determine the variation in the growth rate of a specific plant species in one location throughout the year. I mean, it’s important that you collect the right data to address your specific question, right? Well, the idea is the same for genomics studies, since we want to avoid the unnecessary and, to be honest, somewhat unscientific collection of large, costly genomic datasets that turn out to be of no use.

I know that lots of genomics work seems like post hoc investigative science, and in some ways that can’t be avoided, but starting out with an idea of the kind of question(s) you want to answer is really important. The type of question you want to ask is intrinsically linked to the structure of your study, such as the number of samples from each species that you will need to sequence, and the way in which you sequence them. Many genomics studies can be placed along a spectrum from macroevolutionary studies aiming to investigate broad evolutionary patterns to more focussed investigations into the allele frequencies within a single population, family, or even individual. Here are some examples:

Macroevolution: Perhaps your question is about the evolution of a taxonomically broad and genetically divergent clade – you might want to understand: 1) how are species related to one another? 2) Has the speciation and diversification rate across the clade been consistent or heterogenous? To answer these questions, you need to think about how broadly you can sample the clade of interest. The number of samples per species might not be that important (since you’re trying to describe the overall evolutionary relationship between species and so one representative of each may suffice), but the taxonomic breadth of species in your dataset might be critical. Similarly, you might not need to use the whole genome of each species since comparing across a number of conserved regions might help address your question.

Some examples of macroevolution studies: A phylogeny of reef fishes (Westneat and Alfaro, 2005) or an investigation into the evolution of the Lake Malawi Cichlid radiation (Malinsky et al., 2018).

Microevolution: In contrast, you might want to understand the population genetics of a specific species or population. In this scenario your study might aim to: 1) determine how different populations of the same species differ from one another, e.g. with respect to their genetic diversity or distinctiveness, 2) determine the degree of homogenization or gene flow between the populations, or 3) identify which regions of the genome are associated with specific traits. To understand these finer scale patterns, you should forgo taxonomic breadth of sampling in exchange for a higher number of samples within your focal species/population(s). After all, it is often harder to detect subtle changes between closely related species than it is to identify substantial genetic variation between distantly related taxa as in the example above.

Some examples of microevolution studies: Convergent evolution in adaptation to extreme environments in Poeciliid fishes (Greenway et al., 2020) or recombination rate variation and how it impacts introgression between butterflies (Martin et al., 2019).

Resources: Other projects aim to produce genetic resources to complement other investigations or to allow comparative studies. In this case you might want to use sequence data to 1) produce a linkage map (which allows us to understand the physical structure of a genome by estimating the rate of recombination between different genetic markers. This approach often requires a large lab cross to be carried out so that we can tell if markers recombine apart more frequently, and are located far apart, or are rarely recombined apart, and are located close to one another), 2) assemble the genome for a specific focal species, or 3) do one/both of these for multiple taxa and compare them to find structural or functional differences. These types of projects are often very technical, and the analyses used to do them often develop quickly – so getting input from those with prior experience before you undertake a project like this can save a lot of pain later on. Also, resources, and what is needed to produce them, vary substantially by organism, and so thinking in advance about what is known about your focal study system is critical – for example you should really have an estimate of how big the genome is, what its karyotype is (the number of chromosomes), and whether you can produce crosses or inbred lines (which can be used to reduce heterozygosity and make genome assembly much easier). Here you might need very few individuals (unless you’re doing a linkage map, where the more offspring from your cross the better), but the quality of the sample (i.e. how degraded it might be), or the ability to carry out labwork in a specific way (such as not having to mechanically shear tissue to extract DNA) might be very important.

Some examples of resource-based studies: Genome assemblies for Atlantic salmon (Lien et al., 2016), Alpine whitefish (De‐Kayne, Zoller and Feulner, 2020), and strawberry (Edger et al., 2019).

What do other people do to answer similar questions?

At this point it’s time to hit the literature and see what summary statistics or analyses are available to address your questions. These might change, in some cases substantially, between you sequencing and actually analysing your data (the development of bioinformatics tools moves quicker in some areas than in others) but it will also help you understand how your sampling should work or whether some questions are just no-gos altogether. For example – if you wanted to understand which genetic variants/SNPs are associated with behaviour in an extremely rare species using some kind of GWAS approach then finding and sampling the many (>100 and often close to 1000) individuals that are typically used for these kinds of studies might not be feasible (it’s important to think hard about the ethics of sampling rare populations for genetic work). Alternatively, if you planned to do population genetics on a species with a massive genome (pine trees/salamanders I’m looking at you!) using whole-genome re-sequencing data, then it’s better to find out early on that it might be unfeasible and/or prohibitively expensive due to the massive amounts of data you would need to produce. Instead you might want to focus on methods that use reduced representation sequencing or exome capture.

Time to check your budget and pick a platform

Most of us are constrained by our budget in some way. Whilst the cost of sequencing has been dropping, the rate of decrease seems to be unfortunately slowing (Wetterstrand – NIH). I think that Illumina is largely responsible for this since they dominate the sequencing market, so it’s important to try to get the best value possible when it comes to collecting this undeniably expensive data. This doesn’t always mean going for the cheapest option, because as I’ll outline, there are drawbacks to some of these approaches, but it means not sequencing for the sake of it. In terms of sequencing cost the three main variables are 1) the number of individuals to be sequenced, 2) the sequencing depth, and 3) the sequencing technology used.

Since you often pay for a specific ‘lane’ of sequencing (a physical well on a flow cell that gets filled with a mix of your prepared DNA and is put into the sequencing machine) the number of individuals per se does not dictate the price, but the more individuals you mix together and sequence in a single lane, the lower the coverage you get per individual (since the output in terms of DNA bases is consistent for a single lane). The sequencing depth used for different genomics projects varies massively (in case all this talk of depth/coverage is new – if your study organism’s genome is 1 gigabase in size and on average you produce 10 gigabases of sequence per individual across your sequencing run then you have sequenced at 10X depth/coverage). Whilst population genetics questions are tending towards using lower and lower coverage, even around 1X (Ros-Freixedes et al., 2020), at the other end of the spectrum, short read genome assemblies often use over 100X coverage for a single individual (which can be necessary for a species with a complex genome structure like the strawberry genome where a mind and wallet-blowing 455X coverage was used; Edger et al., 2019). It’s therefore important to check what kind of depth is needed for your specific question. Whilst hard and fast rules on coverage don’t really exist, a higher coverage is usually needed to make confident structural variant calls than confident SNP calls and the optimal depth even differs between genotyping tools. Additionally, higher coverage can offset the higher error rate that comes with specific sequencing technologies. This brings us on to the task of picking a sequencing platform. Again, reading the literature and seeing what other people did to address similar questions to yours is really useful, but before you do that here’s a short overview of the different options:
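If the depth/coverage arithmetic above is easier to follow as code, here is a minimal Python sketch of it. The function names and the 100 gigabase lane shared by 20 individuals are made up for illustration (they are not figures for any real platform), and it assumes every individual in a lane gets an equal share of the reads, which is rarely exactly true in practice.

```python
GIGABASE = 10**9  # number of bases in one gigabase


def sequencing_depth(bases_per_individual, genome_size):
    """Average depth/coverage = bases sequenced for one individual / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return bases_per_individual / genome_size


def depth_per_individual(lane_output, n_individuals, genome_size):
    """Depth each individual gets when one lane is shared equally (an assumption)."""
    if n_individuals < 1:
        raise ValueError("n_individuals must be at least 1")
    return sequencing_depth(lane_output / n_individuals, genome_size)


# The example from the text: 10 gigabases of sequence for a 1 gigabase genome.
example_depth = sequencing_depth(10 * GIGABASE, 1 * GIGABASE)

# Hypothetical lane: 100 gigabases of output shared by 20 individuals.
shared_depth = depth_per_individual(100 * GIGABASE, 20, 1 * GIGABASE)

print(f"{example_depth:.0f}X, {shared_depth:.0f}X")  # 10X, 5X
```

Doubling the number of individuals in that hypothetical lane would halve the depth each one gets, which is the trade-off to weigh up against the depth your question needs.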

Figure: A summary of the main sequencing platforms (short and long read) and their features. All information is also given in the main text.

Short read: Illumina is the go-to short-read sequencing platform and is used regardless of whether you aim to sequence the whole genome, or only a proportion, using a reduced representation approach like restriction site associated DNA sequencing (RAD; where enzymes are used to cut DNA resulting in only a small fraction of the genome being sequenced). Illumina short reads range from 50 to 300 bp long, are highly accurate (something in the order of 0.1% error) and options exist for both single-end sequencing (where you sequence from only one end of a fragment) and paired-end sequencing (where sequencing is carried out from both ends of a fragment, leaving you with a ‘pair’ of reads with an unsequenced ‘insert’ between the two reads). A number of different Illumina platforms exist and vary in their output, cost, and, importantly, the actual technology used for sequencing. These platforms include MiniSeq, MiSeq, HiSeq, NextSeq, and NovaSeq (as Illumina develops new platforms others are gradually discontinued, but since sequencing centres lag behind I’m certain some are offering each of these services). Although each can be tailored to output different amounts of data (and therefore sequencing depth/coverage – remember that total base pairs sequenced per individual / genome size = sequencing depth/coverage per individual), traditionally Mini/MiSeq runs are used to produce smaller data volumes and NovaSeq the largest. The MiniSeq platform can output up to 7.5 Gb of data whereas a full S4 NovaSeq run can output up to 3000 Gb (roughly 1 byte of data = 4 bases; Stephens et al., 2015). Picking the optimal Illumina platform is something of a balancing act but a discussion with your sequencing centre should give you a good idea, since you can use the number of individuals you want to sequence and the sequencing depth/coverage you need to back calculate the total output necessary.
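To make that back calculation concrete, here is a rough Python sketch. The 96 individuals, 15X target depth, and 1 Gb genome are invented for illustration, and the two platform outputs are the ‘up to’ maximum figures quoted above, so treat the result as a best case to discuss with your sequencing centre rather than a quote.

```python
import math

# Maximum outputs per run in gigabases, as quoted in the text above.
MAX_OUTPUT_GB = {"MiniSeq": 7.5, "NovaSeq S4": 3000.0}


def required_output_gb(n_individuals, target_depth, genome_size_gb):
    """Total sequence needed = individuals x depth x genome size (in Gb)."""
    return n_individuals * target_depth * genome_size_gb


def runs_needed(required_gb, run_output_gb):
    """Whole runs needed, assuming every run reaches its maximum output."""
    return math.ceil(required_gb / run_output_gb)


# Hypothetical project: 96 individuals at 15X for a 1 Gb genome.
needed = required_output_gb(n_individuals=96, target_depth=15, genome_size_gb=1.0)

for platform, output_gb in MAX_OUTPUT_GB.items():
    print(f"{platform}: {runs_needed(needed, output_gb)} run(s) for {needed:.0f} Gb")
```

With these made-up numbers the project needs 1440 Gb in total, which fits comfortably on one full S4 NovaSeq run but would take an impractical 192 MiniSeq runs.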

Although all of these platforms use dye-based sequencing (where different fluorescent molecules are attached to different nucleotides and, as a complementary strand to your DNA fragment is synthesised, each base emits a differently coloured light signal as it is added) there are still some differences between them. Older platforms used a 4-colour system where each base had its own unique light signal. In contrast, the newer NextSeq and NovaSeq use a two-dye system, meaning that different bases are called for each of the two dyes alone (T or C), both dyes together (A), as well as the absence of a fluorescent signal (G). Although this generally holds up well and is designed to lower sequencing complexity and cost, it can cause problems – especially if you plan to combine your new data with old data sequenced on a different platform. So when picking an Illumina platform it’s worth thinking about whether your data will stand alone (in which case it doesn’t really matter) or if you will definitely be combining it with older datasets (in which case you should be aware of the platforms involved and check out a more detailed look at the challenges posed by changing sequencing platforms in De-Kayne et al., 2020).
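The two-dye logic described above can be written out as a tiny lookup table. This is only a toy Python illustration of the idea and not how Illumina’s base-calling software works: the names `dye_1` and `dye_2` are placeholders, and which dye corresponds to T and which to C is assigned arbitrarily here.

```python
# Toy lookup: (dye_1 detected, dye_2 detected) -> called base.
# Which dye maps to T and which to C is an arbitrary choice for this sketch.
TWO_DYE_CALLS = {
    (True, False): "T",   # only dye_1 detected
    (False, True): "C",   # only dye_2 detected
    (True, True): "A",    # both dyes detected
    (False, False): "G",  # no fluorescent signal at all
}


def call_base(dye_1_on, dye_2_on):
    """Return the base implied by the presence/absence of the two dyes."""
    return TWO_DYE_CALLS[(bool(dye_1_on), bool(dye_2_on))]


def call_read(signals):
    """Call one base per sequencing cycle from (dye_1, dye_2) signal pairs."""
    return "".join(call_base(d1, d2) for d1, d2 in signals)


normal_read = call_read([(True, False), (True, True), (False, True), (False, False)])
dark_read = call_read([(False, False)] * 5)  # five cycles with no signal

print(normal_read, dark_read)  # TACG GGGGG
```

Because ‘no signal’ and ‘G’ look identical in this scheme, cycles where the signal is lost get called as G, which is one commonly reported way that two-dye data can differ from older four-colour data.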

Long read: PacBio and Oxford Nanopore Technologies (ONT) are the two most common long read sequencing platforms. They are typically used when you are trying to investigate larger genomic features such as structural variants (which can span kilobases and therefore cannot be spanned by short reads) or for producing assemblies of whole genomes or regions of interest. Although the error rate of these technologies is traditionally higher than that of Illumina (around 15% for PacBio and ONT – although new PacBio HiFi and new ONT library prep/downstream analysis tools claim to have reduced this error considerably), they are often used at higher depths or to address questions where base-level accuracy isn’t key. Knowing whether an inversion is present or not, or spanning a repetitive region between two fragments (contigs) in a genome assembly, might not require high base-call accuracy, unlike when you are explicitly trying to identify nucleotide polymorphisms e.g. SNPs. PacBio runs are carried out using their ‘SMRT’ cells and although there are different library preps required for the different chemistries they offer, this, at least from my experience, is almost always done by the people operating the machine so there is a bit less to think about. The only thing you should discuss is how many SMRT cells you will need to use to get the output coverage per individual that you require (it’s also worth checking the actual amount of DNA you need to provide them with since these approaches sometimes need a lot).

For ONT this is similar and although a single cell can output around 30 Gb of data you can use as many cells as needed to get the coverage. Interestingly, although PacBio tends to be at the costly end of platforms, ONT can be very cheap to run and many of the people I know who use it own the sequencing device themselves so require no sequencing centre support at all. Read lengths for these platforms vary widely but can be up to 100s of kilobases and so higher quality, non-fragmented input DNA is required. Highly sheared DNA from challenging DNA extractions or sample degradation will massively impact the quality of the data you get back (perhaps here is a good time to introduce the ‘junk in – junk out’ mantra for genomics projects which will be popping up throughout this series).

Don’t forget about experimental design!

It’s time to actually pick individuals for sequencing, and to divide them up (if you need to) across lanes/batches. For the individuals you select to sequence you might want to consider: 1) do they have corresponding metadata (this could include sampling location/time or corresponding phenotypic measurements – this data is sometimes not collected or lost, especially in big groups, but can be a massive asset to have later down the road to troubleshoot weird patterns in your data or to support other analyses), and 2) are the samples of sufficient quality/condition? Samples that have all the relevant metadata and have been either preserved in a way that allows you to get good quality DNA without degradation, or can be sampled fresh, will be the least painful to deal with (a bonus if they are large enough that you can do some trial extractions to assess DNA concentration – with a Qubit, DNA quality – I still think a 1% agarose gel is the easiest way of checking this, and protein contamination – with a NanoDrop). Again, these are personal quality checks, but trust me, if you have the option to replace a dodgy sample before sequencing then take it and save your future self the headache.

With regards to deciding who gets grouped together for sequencing it’s best to think about ‘real’ experiments – try not to confound your ‘treatments’ with any other variables. This is because sequencing runs can sometimes leave batch effects, for example a bias of certain bases, a biased quality score, or even shared gaps in sequences (I have been told that sequencing machine errors and/or even physical knocks to the machine can result in a missing, ‘N’, base call at the same spot in all reads). If you’re sequencing both older and newer samples (e.g. from 2000 and from 2020) to find genetic differences between these populations then don’t separate them and sequence samples from each year on different lanes since this would confound any sequencing batch effects (from the sequencing run) with the ‘treatment’ which here is the sampling year. If you’re sequencing two populations across two lanes of sequencing or two sequencing batches then split them randomly across the two batches rather than by population. This way if there is any substantial library effect you have a chance of finding the reads/loci causing the problems and removing them since they should be shared by individuals of each population. Remember to tell your sequencing centre about the specific samples you want to sequence together otherwise you might end up with them grouped alphabetically/numerically.
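As a sketch of what ‘split them randomly across the batches rather than by population’ could look like in practice, here is one simple way to do it in Python. The sample names, group labels, and the `assign_to_lanes` helper are all hypothetical, and this is just one reasonable scheme (shuffle within each group, then deal samples out across lanes), not a standard tool.

```python
import random
from collections import defaultdict


def assign_to_lanes(samples, n_lanes, seed=0):
    """Randomly spread each group's samples evenly across lanes.

    samples: list of (sample_id, group) tuples, e.g. group = sampling year.
    Returns a dict mapping lane index -> list of (sample_id, group).
    """
    rng = random.Random(seed)  # fixed seed so the layout can be reproduced
    by_group = defaultdict(list)
    for sample_id, group in samples:
        by_group[group].append(sample_id)

    lanes = defaultdict(list)
    offset = 0  # carry on dealing where the previous group stopped
    for group in sorted(by_group):
        ids = by_group[group]
        rng.shuffle(ids)  # random order within each group
        for i, sample_id in enumerate(ids):
            lanes[(i + offset) % n_lanes].append((sample_id, group))
        offset += len(ids)
    return dict(lanes)


# Hypothetical design: 8 samples from 2000 and 8 from 2020 across 2 lanes.
samples = [(f"y2000_{i}", "2000") for i in range(8)]
samples += [(f"y2020_{i}", "2020") for i in range(8)]
lanes = assign_to_lanes(samples, n_lanes=2, seed=42)

for lane, members in sorted(lanes.items()):
    print(lane, sorted(sample_id for sample_id, _ in members))
```

Each lane ends up with four samples from each year, so any lane-specific batch effect is shared by both ‘treatments’ rather than confounded with one of them. The resulting list is also exactly what you would want to send to your sequencing centre.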

So, now’s the time to collect your samples, do your extractions, and check their quality. Then you’re ready for sequencing library preparation, or, if you’re lucky, to just send them to the sequencing centre – good luck!

For some additional reading on how each of the different sequencing technologies works, be sure to check out the Illumina, PacBio, and Oxford Nanopore Technologies company websites.

In the pipeline – Part 2: ‘So you got your sequence data back…now what’ will be coming out soon with tips on how to process and assess the quality of your newly sequenced DNA.

References:

De-Kayne, R. et al. (2020) ‘Sequencing platform shifts provide opportunities but pose challenges for combining genomic datasets’, Molecular ecology resources. doi: 10.1111/1755-0998.13309

De‐Kayne, R., Zoller, S. and Feulner, P. G. D. (2020) ‘A de novo chromosome‐level genome assembly of Coregonus sp. “Balchen”: One representative of the Swiss Alpine whitefish radiation’, Molecular ecology resources, 20(4), pp. 1093–1109

Edger, P. P. et al. (2019) ‘Origin and evolution of the octoploid strawberry genome’, Nature genetics, 51(3), pp. 541–547

Greenway, R. et al. (2020) ‘Convergent evolution of conserved mitochondrial pathways underlies repeated adaptation to extreme environments’, Proceedings of the National Academy of Sciences of the United States of America, 117(28), pp. 16424–16430

Lien, S. et al. (2016) ‘The Atlantic salmon genome provides insights into rediploidization’, Nature, 533(7602), pp. 200–205

Malinsky, M. et al. (2018) ‘Whole-genome sequences of Malawi cichlids reveal multiple radiations interconnected by gene flow’, Nature ecology & evolution, 2(12), pp. 1940–1955

Martin, S. H. et al. (2019) ‘Recombination rate variation shapes barriers to introgression across butterfly genomes’, PLoS biology, 17(2), p. e2006288

Ros-Freixedes, R. et al. (2020) ‘Accuracy of whole-genome sequence imputation using hybrid peeling in large pedigreed livestock populations’, Genetics, selection, evolution: GSE, 52(1), p. 17

Stephens, Z. D. et al. (2015) ‘Big data: astronomical or genomical?’, PLoS biology, 13(7), p. e1002195

Westneat, M. W. and Alfaro, M. E. (2005) ‘Phylogenetic relationships and evolutionary history of the reef fish family Labridae’, Molecular phylogenetics and evolution, 36(2), pp. 370–390

Wetterstrand, K. A. ‘DNA Sequencing Costs: Data from the NHGRI Genome Sequencing Program (GSP)’. Available at: www.genome.gov/sequencingcostsdata