The community: BioPython overcomes limitations 

The BioPython community strikes again!

Peter Cock (@pjacock) and the BioPython community (biopython[AT]lists[DOT]open-bio[DOT]org) helped me fix my record-retrieval script for NCBI.

My current task was to fetch ESTs from nucest (“the collection of short single-read transcript sequences from GenBank”). The question: how do I get this result set (http://www.ncbi.nlm.nih.gov/nucest/?term=txid543769%5BOrganism:exp%5D) onto my HDD? I used a BioPython Tutorial example implementing the “History” feature of the Entrez Utilities to download the records, but I noticed that the number of sequences in the output was not consistent and that errors came back as XML tags embedded in the output!
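For context, the history-based loop from the tutorial looks roughly like this (a minimal sketch, assuming the same nucest query; the batch size and output filename are placeholders of mine):

from Bio import Entrez

Entrez.email = ""  # always tell NCBI who you are
search_handle = Entrez.esearch(db="nucest", term="txid543769[Organism:exp]",
                               usehistory="y")
search_results = Entrez.read(search_handle)
search_handle.close()
count = int(search_results["Count"])
webenv = search_results["WebEnv"]
query_key = search_results["QueryKey"]
batch_size = 1000
out_handle = open("ests.fasta", "w")
for start in range(0, count, batch_size):
    # page through the server-side result set held under WebEnv/QueryKey
    fetch_handle = Entrez.efetch(db="nucest", rettype="fasta", retmode="text",
                                 retstart=start, retmax=batch_size,
                                 webenv=webenv, query_key=query_key)
    out_handle.write(fetch_handle.read())
    fetch_handle.close()
out_handle.close()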

Then I decided to send the whole list of IDs via efetch, but I ran into the 10,000-record limitation. Otherwise efetch is amazing: it gave me very consistent results, with no XML error tags.
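In essence, that attempt was (a sketch; gi_list stands for the full ID list returned by esearch with retmax set to the total count):

# one efetch call with every GI at once; NCBI silently caps this
# at the first 10,000 records
fetch_handle = Entrez.efetch(db="nucest", rettype="fasta", retmode="text",
                             id=",".join(gi_list))
data = fetch_handle.read()
fetch_handle.close()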

I tried to split the list into chunks of 10,000 IDs, but that wasn’t successful either; the same thing happened, and I got only the first 10,000 records.

I emailed the gang, and Peter reminded me of the good old days (map gi_taxid, loop over the taxids to get all child nodes, and pull the records out of the 11 GB EST FASTA).
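That route would look something like this (a rough sketch under heavy assumptions: gi_taxid_nucl.dmp and the big EST FASTA are already on disk, the filenames are illustrative, and the FASTA headers carry old-style gi|…| identifiers):

from Bio import Entrez, SeqIO

Entrez.email = ""
# 1) the target taxid plus all child nodes, via the taxonomy subtree field
handle = Entrez.esearch(db="taxonomy", term="txid543769[Subtree]", retmax=100000)
taxids = set(Entrez.read(handle)["IdList"])
handle.close()
# 2) map GIs to taxids and keep the GIs that fall inside the subtree
wanted_gis = set()
for line in open("gi_taxid_nucl.dmp"):  # "gi<TAB>taxid" per line
    gi, taxid = line.split()
    if taxid in taxids:
        wanted_gis.add(gi)
# 3) pull the matching records out of the local EST FASTA
out_handle = open("subtree_ests.fasta", "w")
for record in SeqIO.parse("est.fasta", "fasta"):
    if record.id.split("|")[1] in wanted_gis:  # assumes gi|12345|... headers
        SeqIO.write(record, out_handle, "fasta")
out_handle.close()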

Peter went through my script and had the idea of sending specific GIs for each batch by indexing into gi_list. He sent each batch via epost, caught the returned WebEnv variable, and passed it back through efetch. It works perfectly; I really appreciate it.

#!/usr/bin/python
import sys

from Bio import Entrez

Entrez.email = ""  # always tell NCBI who you are
txid = int(sys.argv[1])
# Get the total record count for this taxid
prim_handle = Entrez.esearch(db="nucest", term="txid%i[Organism:exp]" % txid, retmax=20)
prim_record = Entrez.read(prim_handle)
prim_count = prim_record["Count"]
# Re-search with retmax set to the full count so IdList holds every GI
# (Biopython tutorial http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc119;
# add usehistory="y" to get WebEnv/QueryKey from the search instead)
search_handle = Entrez.esearch(db="nucest", term="txid%s[Organism:exp]" % txid, retmax=prim_count)
search_results = Entrez.read(search_handle)
search_handle.close()
gi_list = search_results["IdList"]
count = int(search_results["Count"])
print(count)
assert count == len(gi_list), len(gi_list)
#webenv = search_results["WebEnv"]      # only present when usehistory="y"
#query_key = search_results["QueryKey"]
out_fasta = "%s_txid%i_ct%i.fasta" % (sys.argv[2], txid, count)
out_handle = open(out_fasta, "a")
## Approach 1 (enhanced by Peter Cock): epost each batch of GIs, then efetch
## them back via WebEnv/QueryKey. The earlier history-based version left
## <ERROR>Unable to obtain query #1</ERROR> XML tags inside the FASTA.
batch_size = 1000
for start in range(0, count, batch_size):
    end = min(count, start + batch_size)
    batch = gi_list[start:end]
    print("Going to download record %i to %i using epost+efetch" % (start + 1, end))
    post_results = Entrez.read(Entrez.epost("nucest", id=",".join(batch)))
    webenv = post_results["WebEnv"]
    query_key = post_results["QueryKey"]
    fetch_handle = Entrez.efetch(db="nucest", rettype="fasta", retmode="text",
                                 webenv=webenv, query_key=query_key)
    data = fetch_handle.read()
    fetch_handle.close()
    out_handle.write(data)
## Approach 2: split the list into 10,000-GI chunks (the failed attempt; note
## the bugs: z[1] posts the same chunk on every pass, and WebEnv/QueryKey are
## read from search_results instead of from the epost result)
#def SplitList(list, chunk_size):
#    return [list[offs:offs + chunk_size] for offs in range(0, len(list), chunk_size)]
#z = SplitList(gi_list, 10000)
#for i in range(0, len(z)):
#    print(len(z[i]))
#    post_handle = Entrez.epost("nucest", rettype="fasta", retmode="text", id=",".join(z[1]))
#    webenv = search_results["WebEnv"]
#    query_key = search_results["QueryKey"]
#    fetch_handle = Entrez.efetch(db="nucest", rettype="fasta", retmode="text", webenv=webenv, query_key=query_key)
#    data = fetch_handle.read()
#    fetch_handle.close()
#    out_handle.write(data)
## Approach 3: the most consistent retrieval, but capped at 10,000 records
#fetch_handle = Entrez.efetch(db="nucest", rettype="fasta", retmode="text", webenv=webenv, query_key=query_key)
#data = fetch_handle.read()
#fetch_handle.close()
#out_handle.write(data)
out_handle.close()
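For completeness: the script expects the taxid and an output prefix on the command line, e.g. python get_txid_ests.py 543769 out (the filename is mine), and appends everything to out_txid543769_ct<count>.fasta.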
