Contact Us

Use the form on the right to contact us.

You can edit the text in this area, and change where the contact form on the right submits to, by entering edit mode using the modes on the bottom right. 


Santa Cruz, CA

Vikram Baliga, PhD Candidate at UC Santa Cruz, member of the Mehta Lab. Areas of study: ecology, ontogeny, morphometrics, and comparative methods.

Using R to Download GenBank Sequences

Blog

Using R to Download GenBank Sequences

Vikram Baliga

In a recent project, I used gene sequences I found on GenBank to infer the phylogenetic relationships between a large assemblage of labrid fishes. In the nascent stages of this project, I remember being extremely frustrated that there was little documentation available online to show how sets of GenBank sequences could be downloaded into a single FASTA file. One case in which this might be useful is to combine sequences of a given gene for multiple species into the same file, and then use that file for alignment, modeling, and tree inference...etc.

In the interests of making life easier for anyone who wants to efficiently download multiple sequences from GenBank using R, I have made a short tutorial.

Files

readGenBankR.R - A script in R that contains all the necessary steps
COI_BaligaLaw2016.csv - A list of GenBank accession numbers for labrid COI sequences. This is what we used in Baliga & Law, 2016 (MPE)
CYTB_BaligaLaw2016.csv - As above, but for the CYTB gene

Tutorial

This tutorial requires that the user has already recorded the GenBank ID numbers for sequences in which he/she is interested. The tutorial makes use of the handy read.GenBank() and write.dna() functions available in the 'ape' package. I do not guarantee that this is the most efficient way to code all this in R, but that should hardly matter since the computation time is minimal.

The first step is to load (or install + load) the package 'ape' in R.

library(ape)
packageVersion("ape")

Since packages in R are constantly updated and re-tooled, I want to be clear that I used version 3.3 of ape to write all this. The packageVersion() function, built into R, is handy for assessing this.

Next, we'll import each sequence list from a pre-made .CSV (Comma-separated values) file. The GenBank sequences should be listed in one column, with a unique sequence number in each row. Rows can be blank (if they are pulled from an Excel sheet, as they commonly are) -- blank rows will just be skipped. For the labrid phylogeny paper, I created a separate CSV for each gene of interest. In this tutorial, I will provide the lists I used for obtaining the COI and CYTB genes.

Import each of these into R as separate objects. It is crucial to set stringsAsFactors=FALSE during the import. Modify the code below as necessary to read them from your working directory:

coi <- read.table("C:/MyWorkingDirectory/csv_lists/coi.csv", quote="\"", stringsAsFactors=FALSE)
cyt <- read.table("C:/MyWorkingDirectory/csv_lists/cytb.csv", quote="\"", stringsAsFactors=FALSE)

For simplicity, I named the COI list as "coi" and the CYTB list as "cyt" in my workspace. The next step is to convert these lists to character lists:

as.list(coi)$V1 -> coiL
as.list(cyt)$V1 -> cytL

Then use read.Genbank() to connect to the GenBank database and download the sequences. Set species.names=T to ensure the species name metadata is included.

coigen<-read.GenBank(coiL,species.names=T)
cytgen<-read.GenBank(cytL,species.names=T)

This will create two new objects, each with the class "DNAbin". You can type the name of each object (e.g. coigen) in R to view some interesting properties of each dataset (e.g. mean sequence length, base compositions...etc).

The problem now is that within each object, sequences are still listed by GenBank accession number. It would be much more convenient to replace the name of each list with the name of the species. This becomes even more crucial if you work with multiple gene sequences and want to use a supermatrix for tree inference: at some point you'll need to concatenate all the gene sequences for all the species (e.g. using a program like SequenceMatrix). This is really only feasible if each sequence is named according to the species from which it comes.

We'll now extract metadata (species' names and GenBank numbers) from each of these objects into user-friendly data.frame objects. From each new data.frame, we will extract the species' names and apply them to the sequences using the base R function attr():

names_coi <- data.frame(species = attr(coigen,"species"),
                      accs = names(coigen))
names(coigen) <- attr(coigen,"species")
names_cyt <- data.frame(species = attr(cytgen,"species"),
                      accs = names(cytgen))
names(cytgen) <- attr(cytgen,"species")

Now, 'coigen' and 'cytgen' are each modified to comprise lists of sequences that are named according to the species they represent. The two new objects, 'names_coi' and 'names_cyt' are pretty handy themselves if you'd like to verify that each sequence actually corresponds to the species of interest. Anyway, we're now ready to export each list into its own FASTA file (for further use in other programs) by using the write.dna() function in 'ape'.

setwd("C:/MyWorkingDirectory/FASTA_Files")
write.dna(coigen,"renamed_COI.fasta", format="fasta")
write.dna(cytgen,"renamed_CYTB.fasta", format="fasta")

And that's it. You should now have separate FASTA files for each gene. The FASTA files should now be ready for alignment using whichever program you prefer (For this project I used Geneious. Another cool program is MEGA). Within each FASTA file, sequences will now be named by the species from which they originate. I strongly recommend you scrutinize each file to verify that species' names are correct. Typos and/or synonymy of species' names can wreak havoc on your workflow, especially if you will be concatenating the sequences later on.