Tagged: biopython

  • Mariam Rizkallah 3:58 pm on April 16, 2012 Permalink | Reply
    Tags: biopython, entrez, ncbi   

    The community: BioPython overcomes limitations 

    BioPython community strikes again!

    Peter Cock (@pjacock) and the BioPython community (biopython[AT]lists[DOT]open-bio[DOT]org) helped me fix my record-retrieval script for NCBI.

    My current task was to download the ESTs in nucest (“the collection of short single-read transcript sequences from GenBank”) for a given taxon — in other words, to get this query result (http://www.ncbi.nlm.nih.gov/nucest/?term=txid543769%5BOrganism:exp%5D) onto my HDD. I used the BioPython Tutorial example implementing the “History” feature of the Entrez Utilities to download them, but I noticed that the number of sequences returned was not consistent, and that errors came back as XML tags embedded in the output!

    Then I decided to send the whole list of IDs via efetch, but I ran into the 10,000-record limitation. Efetch itself was great: it gave me very consistent results with no XML error tags.

    I tried splitting the list into 10,000-ID chunks, but that wasn’t successful either; the same thing happened, and I got only the first 10,000 records.
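    For reference, splitting an ID list into fixed-size chunks is straightforward with slicing. This is a minimal, self-contained sketch of that step (the 10,000-record cap turned out to be in how the requests were submitted, not in the splitting itself):

    ```python
    def split_list(ids, chunk_size):
        """Split a list into consecutive chunks of at most chunk_size items."""
        return [ids[offs:offs + chunk_size] for offs in range(0, len(ids), chunk_size)]

    # 25,000 IDs in 10,000-ID chunks -> chunk lengths 10000, 10000, 5000
    chunks = split_list(list(range(25000)), 10000)
    ```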

    I emailed the gang, and Peter reminded me of the good old days (map gi_taxid, loop over the taxids to collect all child nodes, and pull the sequences out of the 11 GB EST FASTA).

    Peter went through my script and had the idea of sending specific GIs for each batch by indexing into gi_list. He submits them with epost, catches the returned WebEnv variable, and passes it back through efetch. It works perfectly! I really appreciate it.

    #!/usr/bin/python
    import sys
    from Bio import Entrez
    Entrez.email = ""
    txid = int(sys.argv[1])
    #get total record count
    prim_handle = Entrez.esearch(db="nucest", term="txid%i[Organism:exp]" % txid, retmax=20)
    prim_record = Entrez.read(prim_handle)
    prim_count = prim_record['Count']
    #get all IDs (Biopython tutorial http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc119)
    search_handle = Entrez.esearch(db="nucest", term="txid%i[Organism:exp]" % txid, retmax=prim_count) #, usehistory="y")
    search_results = Entrez.read(search_handle)
    search_handle.close()
    gi_list = search_results["IdList"]
    count = int(search_results["Count"])
    print count
    assert count == len(gi_list), len(gi_list)
    #webenv = search_results["WebEnv"]
    #query_key = search_results["QueryKey"]
    out_fasta = "%s_txid%i_ct%i.fasta" % (sys.argv[2], txid, count)
    out_handle = open(out_fasta, "a")
    ## Approach 1: epost each batch of GIs, then efetch via WebEnv (enhanced by Peter Cock);
    ## plain batched efetch put XML tags like <ERROR>Unable to obtain query #1</ERROR> in the FASTA
    batch_size = 1000
    for start in range(0, count, batch_size):
        end = min(count, start + batch_size)
        batch = gi_list[start:end]
        print "Going to download record %i to %i using epost+efetch" % (start + 1, end)
        post_results = Entrez.read(Entrez.epost("nucest", id=",".join(batch)))
        webenv = post_results["WebEnv"]
        query_key = post_results["QueryKey"]
        fetch_handle = Entrez.efetch(db="nucest", rettype="fasta", retmode="text",
                                     webenv=webenv, query_key=query_key)
        data = fetch_handle.read()
        fetch_handle.close()
        out_handle.write(data)
    ## Approach 2: split the ID list into 10,000-ID chunks (still returned only the first 10,000 records)
    #def split_list(lst, chunk_size):
    #    return [lst[offs:offs + chunk_size] for offs in range(0, len(lst), chunk_size)]
    #z = split_list(gi_list, 10000)
    #for i in range(0, len(z)):
    #    print len(z[i])
    #    post_results = Entrez.read(Entrez.epost("nucest", id=",".join(z[i])))
    #    webenv = post_results["WebEnv"]
    #    query_key = post_results["QueryKey"]
    #    fetch_handle = Entrez.efetch(db="nucest", rettype="fasta", retmode="text", webenv=webenv, query_key=query_key)
    #    data = fetch_handle.read()
    #    fetch_handle.close()
    #    out_handle.write(data)
    ## Approach 3: a single efetch -- the most consistent retrieval, but limited to 10,000 records
    #fetch_handle = Entrez.efetch(db="nucest", rettype="fasta", retmode="text", webenv=webenv, query_key=query_key)
    #data = fetch_handle.read()
    #fetch_handle.close()
    #out_handle.write(data)
    out_handle.close()

     
  • Mariam Rizkallah 2:50 pm on August 17, 2011 Permalink | Reply
    Tags: biopython, community-driven

    Extract sff for selected IDs into a child sff (version 2) 

    I thank Peter Cock, Brad Chapman, and Arnold Daniels for the development of the second version.

    #!/usr/bin/python
    
    from Bio import SeqIO
    import os, sys
    
    #Script to extract sequences into a child sff from a parent sff
    #Author: Mariam Rizkallah - August 16, 2011
    #argv[1] -> parent sff
    #argv[2] -> fasta or txt contains list of IDs
    #argv[3] -> child sff
    #Example: ./selectFromSff.py parent.sff new_fasta.fasta child.sff (p.s. chmod +x)
    
    #./selectFromSFF_args.py G5UAOIB07.sff seq.fna short_new.sff
    parent_sff = sys.argv[1]
    ext = os.path.splitext(sys.argv[2])[1] #Ref: http://www.jasny.net/articles/how-to-get-a-file-extension/
    if ext == ".fasta" or ext == ".fna":
    	selected_fasta = sys.argv[2]
    	#Ref: Peter Cock http://bit.ly/ngqPW7
    	idsList = set(r.id for r in
    			SeqIO.parse(selected_fasta, "fasta")
    			)
    else:
    	#Ref: http://permalink.gmane.org/gmane.comp.python.bio.general/5175
    	ids_file = open(sys.argv[2], "r")
    	idsList = set(line.rstrip() for line in ids_file) #strip trailing \n; a set for fast membership tests
    	ids_file.close()
    child_sff = sys.argv[3]
    
    #Ref: http://biopython.org/SRC/biopython/Bio/SeqIO/SffIO.py
    records = (record for record in
    		SeqIO.parse(parent_sff, "sff")
    		if record.id in idsList
    		)
    count = SeqIO.write(records, child_sff, "sff")
    print "Selected %i records" % count
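    The heart of the script is a generator expression that streams records and keeps only those whose ID is in the set. Stripped of the Biopython specifics, the pattern looks like this (toy `(id, sequence)` tuples stand in for SeqRecord objects):

    ```python
    def filter_records(records, wanted_ids):
        """Yield only the records whose ID is in wanted_ids (a set gives O(1) lookups)."""
        return (rec for rec in records if rec[0] in wanted_ids)

    # Toy data standing in for the parsed parent SFF.
    parent = [("r1", "ACGT"), ("r2", "TTGA"), ("r3", "GGCC")]
    selected = list(filter_records(parent, {"r1", "r3"}))
    # selected == [("r1", "ACGT"), ("r3", "GGCC")]
    ```

    Because the filter is a generator, records are streamed one at a time, so even a large parent SFF never has to fit in memory.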
     
  • Mariam Rizkallah 7:30 pm on August 16, 2011 Permalink | Reply
    Tags: biopython, extract, qual, sff

    Extract certain sequences from a parent sff into a child one 

    After excluding certain sequences from my FASTA file and generating a new FASTA, I really needed all the information from the qual file and the FASTA gathered back into an SFF. As a BioPython fan, I found it all in their tutorial.

    If you find it useful, share it. If you have any comments or bugs, please contact me.

    #!/usr/bin/python
    
    from Bio import SeqIO
    import sys
    
    #Script to extract sequences into a child sff from a parent sff
    #Author: Mariam Rizkallah - August 16, 2011
    #argv[1] -> parent sff
    #argv[2] -> fasta
    #argv[3] -> child sff
    #Example: ./selectFromSff.py parent.sff new_fasta.fasta child.sff (p.s. chmod +x)
    
    parent_sff = sys.argv[1]
    selected_fasta = sys.argv[2]
    child_sff = sys.argv[3]
    
    records = (record for record in
    		SeqIO.parse(parent_sff, "sff")
    		if record.id in (r.id for r in
    			SeqIO.parse(selected_fasta, "fasta")
    			)
    		)
    count = SeqIO.write(records, child_sff, "sff")
    print "Selected %i records" % count
     
    • Peter 12:34 am on August 17, 2011 Permalink | Reply

      Hi Mariam,

      This looks like it would work fine, but will re-load the FASTA file for every SFF entry. It would probably be much quicker to load the FASTA file once, building a list of IDs. Or better yet, make a set of IDs because checking set membership is faster than checking list membership. e.g.

      ids = set(r.id for r in SeqIO.parse(selected_fasta, "fasta"))
      records = (record for record in SeqIO.parse(parent_sff, "sff") if record.id in ids)

      I haven’t tested that, but the principle is sound. I’m assuming neither the SFF file nor the FASTA file is extremely big.

      Peter
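      Peter’s point about set membership can be seen directly: a set hashes its elements, so an `in` test is an average O(1) operation, while a list is scanned element by element. A small self-contained sketch with hypothetical read IDs:

      ```python
      # Build a set of 10,000 hypothetical read IDs; membership tests hash once
      # instead of scanning, so lookups stay fast regardless of the set's size.
      wanted = set("read%04d" % i for i in range(10000))

      hit = "read0042" in wanted   # True: ID 42 is in the set
      miss = "missing" in wanted   # False: not a generated ID
      ```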
