At some point or another, many of us have had to take the leap into the murky waters of bioinformatics. Whether it’s as an undergraduate, a master’s student, a PhD student, or even a seasoned ecologist or evolutionary biologist, there comes a time when dabbling in the analysis of sequence data turns into a substantial bioinformatics analysis or project. Although many people have some programming or statistical experience beforehand (maybe you took a stats class at uni that used R), the variety of languages, programs, and levels of tool documentation can make jumping into a bioinformatics project a daunting task. Here are 10 top tips for getting started in bioinformatics that I hope you, or someone you’re mentoring, might find useful:
1. Talk to people about the question
One of my favourite things about academia is the chance to talk to so many people, all with different backgrounds and perspectives, about themselves and the research they do. In general, I think people in academia should talk more, whether it be in a coffee room, at lunch, at conferences and workshops, or over Zoom. Making the effort to talk to as many people as possible has been particularly valuable for me during the planning phases of different projects, when the experience of other people is invaluable. I personally don’t think this should be limited to people in your group; sometimes getting the perspective of someone completely disconnected from your field can be the most eye-opening. You’ll be amazed at how many suggestions you’ll get for new analysis approaches, papers to read, or even things not to do (sometimes the most helpful advice of all).
2. Ask how others did it
On top of chatting with anyone who will listen, seeking out people who have already done similar analyses is a great way of getting more targeted, analysis-specific advice. At some point, once you’ve committed to an approach, you maybe don’t want to hear “Oh, I wouldn’t have used long-read sequencing for that question”, but you do want to know how to move forward. Looking up papers that did similar things to yours and seeing if you can connect with the people behind them is the way to go. In my experience, conferences are where this works best, but generally people are flattered if you reach out to say you loved their paper and would appreciate an opportunity to chat about their work. These conversations might get technical, but you’ll be amazed how much insight you can get from someone who has already run the analyses/pipelines you want to run, especially if you can give them specifics about your data, such as the number of samples you have, how the populations are structured, or even any weird quirks of your system. Generally, I’ve found people to be very helpful in this scenario, and it can save you a lot of headaches down the line, like “oh, it turns out you can’t use this tool if you have fewer than 10 samples” or “ah, looks like you can’t use the output of that program as input for this one”.
3. Don’t reinvent the wheel
Although I still see it happening all the time, it’s important to remember that you don’t win any prizes for rewriting a program that already exists. My general rule is that if a tool has been published and the approach does what you want, then USE IT! Not only is this a nice way to avoid the pre-submission panic that you made a mistake in some code three years ago, but in many cases it makes your research more reproducible (don’t forget to cite the tool and its version number though!) and makes it even easier to pass this knowledge on to others. It’s a lot easier to send someone a link to a GitHub page and maybe some example commands than to panic-write documentation for a badly written script you whipped up in a hurry, initiating an endless back and forth of tweaks and bug fixes.
4. Read the manual
Once you have found the approach you will take and installed the program(s), it may seem tempting to start throwing your data at it. But before you do, I urge you to stop and take some time to read through the manual. I’m not saying you need to know every parameter you can tweak, but rather to read up on the underlying approach and the logic behind what the program does. It’s a lot better to find out the underlying assumptions and quirks of the tool, and decide whether these are appropriate for your analyses, BEFORE showing your PI the graph they have always dreamed of - believe me, they will never unsee what you showed them! If there is a half-decent manual (and if there isn’t, then you can move on to tip #3 - thoughts and prayers), you can also look for example commands. These are super useful: as well as helping you check that the program is installed correctly (don’t underestimate the number of times this is the first stumbling block), you can often run them on any test datasets, or even your own data, to get an idea of the workflow and the input/output formats (one of the most reliably frustrating parts of any bioinformatics project).
5. Check your code
Copy and paste is one of the tenets of programming, whether people like it or not, but before you go wildly Ctrl+C/Ctrl+V-ing your way through your project, make sure to check that the code examples you find on GitHub, BioStars, or StackOverflow actually do what you need/want. Sometimes example code can be very use-case specific, and shoehorning it into sequence analysis can lead to some weird artefacts. For example, it’s important to be careful when altering things like VCF/SAM/BAM files, all of which need specific formats and properties to be parsed correctly by other programs. You don’t want to find out down the road that the script you used to filter some lines or change some sample names effectively corrupted your file.
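To make this concrete, here is a hedged sketch of how an innocent-looking one-liner can quietly corrupt a VCF - the toy file, variants, and filter statuses below are all invented for the example:

```shell
# A toy VCF, invented for this example (real files are far bigger)
printf '##fileformat=VCFv4.2\n' > toy.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> toy.vcf
printf 'chr1\t100\t.\tA\tT\t50\tPASS\t.\n' >> toy.vcf
printf 'chr1\t200\t.\tG\tC\t10\tLowQual\t.\n' >> toy.vcf

# Naive: keep only PASS variants - but this silently drops every header
# line, leaving a file many downstream tools will refuse to parse
grep PASS toy.vcf > broken.vcf

# Safer: explicitly pass header lines (starting with '#') through,
# then apply the filter to the FILTER column (column 7)
awk '/^#/ || $7 == "PASS"' toy.vcf > filtered.vcf
```

A quick `head broken.vcf` versus `head filtered.vcf` shows the difference; checking your output like this after every “quick” edit is cheap insurance.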
6. One step at a time
If it’s your first time running through a pipeline, try to break it down into its constituent parts. That way you can run through the most basic pipeline you can think of, one step at a time, before going back and adding more realistic filtering/parameters. Taking the pipeline/analysis one step at a time will let you work out any kinks or weird steps before you dive right in with a huge, complex dataset. As with reading the manual, this will also let you look for (or write) any necessary scripts ahead of time, for file conversion or filtering for example, meaning that when you reach that step with your ‘real’ full dataset you can breeze on by.
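As a minimal sketch of this way of working with everyday command-line tools (the sample table, its columns, and the thresholds are all made up for the illustration), you might run each step separately, inspect its output, and only then chain them together:

```shell
# A made-up sample table (name, mean coverage), tab-separated
printf 'sampleA\t12\npopB_sampleB\t3\npopB_sampleC\t45\n' > samples.tsv

# Step 1: keep only popB samples, then eyeball the result
grep '^popB_' samples.tsv > step1_popB.tsv
head step1_popB.tsv

# Step 2: keep samples with coverage of at least 10 (column 2)
awk '$2 >= 10' step1_popB.tsv > step2_filtered.tsv
head step2_filtered.tsv

# Only once each step looks right, chain them for the full dataset
grep '^popB_' samples.tsv | awk '$2 >= 10' > final.tsv
```

The intermediate files are throwaway, but being able to `head` each one is exactly how you catch a step that isn’t doing what you thought before the full dataset arrives.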
7. Document your code
FOR THE LOVE OF BASH, make sure you document your code well. You don’t need to be a coding wizard or have a GitHub wiki with links to each part of the analysis (although this is great and will also help in the long run), but what is necessary is a well-maintained document of what you did. Not only is this a great learning tool for yourself, since you’ll probably need to run similar things again and can refer back, but it makes interactions with other researchers, and even your future self, much easier. There are few things more frustrating late in a project than not being able to remember whether you already tried running something on a specific sample or with a specific set of parameters. With well-maintained, commented code this is as simple as Ctrl+F, and it lets you re-run important steps in a few minutes rather than having to code from scratch. It is also super useful for the inevitable moment in two years’ time when your boss asks you to add those 3 samples to your analysis, and you’ll thank yourself 100 times over that you don’t have to re-figure out where those files are and how you ran that analysis. A bonus tip here: assume with every analysis that this will not be the last time you run it - I’ve yet to work on a project where this wasn’t the case, and thinking this way helps with the motivation to write well-documented code.
8. Google it
Before firing off that email to your supervisor saying ‘I can’t get the program to work because this weird “Error:...” message has popped up’, stop and google that exact error. You’ll probably find yourself on the software page, BioStars, StackOverflow, or a GitHub issues page, and in the best-case scenario someone has already had that error and posted a solution. In the worst case, you can give the person who will help you solve the error far more information, which will make them much happier to help. It is also good to check a few basic things when faced with an error in your bioinformatics project, to make life easiest for whoever helps you: 1) are you in the right directory? 2) is the program installed properly and working - often you can check this by trying to get the ‘help’ information to print, e.g. with something like ‘program_name -h’; 3) is the error coming from the program/script you’re running, or from the computer/cluster? 4) is there anything in your input file - does it have a size > 0 bytes, and have you looked inside it? Luckily, in this day and age it’s very unlikely that on your first few forays into bioinformatics you’ll come up against a never-before-seen error, so make use of your best friend Google before asking someone else to google it for you!
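Those four checks might look something like this at the command line - note that `my_tool` is a hypothetical program name and the toy `input.txt` is created here only so the sketch is self-contained:

```shell
# A stand-in input file so this sketch runs end-to-end
printf 'readA\nreadB\n' > input.txt

# 1) Are you in the right directory?
pwd

# 2) Is the program installed and on your PATH?
#    'my_tool' is a stand-in - swap in whatever you are actually running
command -v my_tool >/dev/null || echo "my_tool not found on PATH"
# my_tool -h    # most tools print their usage/help with -h or --help

# 3) Did the last command succeed? An exit status of 0 means yes
echo "exit status of last command: $?"

# 4) Is there anything in your input file?
test -s input.txt && echo "input.txt is non-empty"   # size > 0 bytes
head input.txt                                       # look inside it
```

None of these steps takes more than a few seconds, and together they rule out the most common “errors” that are really just a wrong directory or an empty file.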
9. Pick a couple of useful tools and learn them well
There are a load of common, but fussy, bioinformatics tasks - converting spaces to tabs, filtering out lines of a file, extracting columns, or calculating basic pop-gen statistics - that you’ll find yourself doing time and time again. Often there are hundreds of ways to do these, and it can be easy to get overwhelmed, especially if everyone you meet has a different approach. I think it’s a good idea to pick one or two ways of doing things and learn those languages/tools well - persevere with them until you know them inside out. Whether it’s AWK or sed one-liners or whole Python scripts, my advice would be to explore the options of whatever you’re using, since they are often very powerful (personally, I think AWK is underrated) and will likely help you next time you get stuck. The better you understand a small set of tools, and how they work, the easier these small steps in your pipeline will be.
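As a taster of the one-liners this buys you (the file and its columns are invented for the example), here are a few of the tasks above done with standard command-line tools:

```shell
# A made-up space-separated table: sample, coverage, population
printf 'sample1 12 popA\nsample2 3 popB\n' > stats.txt

# Convert spaces to tabs
tr ' ' '\t' < stats.txt > stats.tsv

# Extract the first and third columns
cut -f1,3 stats.tsv

# AWK: keep only lines where column 2 (coverage) is at least 10
awk '$2 >= 10' stats.tsv

# sed: rename a sample (prints the edited table to stdout)
sed 's/sample1/ind1/' stats.tsv
```

Any of these could be done several other ways; the point is to settle on one toolkit and get fluent enough that these steps stop slowing you down.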
10. Keep the goal in mind and enjoy the pay-off
If you’re the kind of person who finds genuine excitement, enjoyment, and fulfilment in optimising your code to the extreme or automating that sleek pipeline, then good for you, but this isn’t the reality for many people who do bioinformatics as part of their job. I think it’s more common than people admit to struggle with bioinformatics, finding it unfulfilling and a frustrating time sink. If you’re one of these people, then I think it’s totally fine to see bioinformatics approaches as tools (we don’t all need to marvel at the engineering of a pipette to use one effectively). One thing that I have found useful during particularly frustrating bits of an analysis pipeline is to focus on the end result - showing someone cool plots from my real dataset, or thinking of the story of the paper that will come out of the analysis. This way you’re not focussing on solving error after error, but treating it all as a means to an end. That also means celebrating and enjoying the pay-off when you finally get that program to run, or when you can plot your results at the end of an analysis. Take time to appreciate that you were able to overcome a barrier and problem-solve - this is always an achievement!
Well, that wraps it up. If you are just getting started with bash scripting, I wrote a bash refresher workshop which you can find HERE - and otherwise, good luck!