Tagged: python Toggle Comment Threads | Keyboard Shortcuts

  • Mariam Rizkallah 2:50 pm on August 17, 2011 Permalink | Reply
    Tags: , community-driven, , python   

    Extract sff for selected IDs into a child sff (version 2) 

    I thank Peter Cock, Brad Chapman, and Arnold Daniels for the development of the second version.

    #!/usr/bin/python
    
    from Bio import SeqIO
    import os, sys
    
    #Script to extract sequences into a child sff from a parent sff
    #Author: Mariam Rizkallah - August 16, 2011
    #argv[1] -> parent sff
    #argv[2] -> fasta or txt contains list of IDs
    #argv[3] -> child sff
    #Example: ./selectFromSff.py parent.sff new_fasta.fasta child.sff (p.s. chmod +x)
    
    #./selectFromSFF_args.py G5UAOIB07.sff seq.fna short_new.sff
    parent_sff = sys.argv[1]
    ext = os.path.splitext(sys.argv[2])[1] #Ref: http://www.jasny.net/articles/how-to-get-a-file-extension/
    if ext == ".fasta" or ext == ".fna":
    	selected_fasta = sys.argv[2]
    	#Ref: Peter Cock http://bit.ly/ngqPW7
    	idsList = set(r.id for r in
    			SeqIO.parse(selected_fasta, "fasta")
    			)
    else:
    	#Ref: http://permalink.gmane.org/gmane.comp.python.bio.general/5175
    	ids_list = sys.argv[2]
    	ids = open(ids_list, "r")
    	idsList = []
    	for i in ids:
    		idsList.append(i.rstrip()) #remove \n
    child_sff = sys.argv[3]
    
    #Ref: http://biopython.org/SRC/biopython/Bio/SeqIO/SffIO.py
    records = (record for record in
    		SeqIO.parse(parent_sff, "sff")
    		if record.id in idsList
    		)
    count = SeqIO.write(records, child_sff, "sff")
    print "Selected %i records" % count
    Advertisements
     
  • Mariam Rizkallah 7:30 pm on August 16, 2011 Permalink | Reply
    Tags: , extract, , python, qual, sff   

    Extract certain sequences from a parent sff into a child one 

    After excluding certain sequences from my fasta file and generating a new fasta, I readlly needed all the info from the qual and the fasta to be gathered again into the sff. As a BioPython fan, I really found it all in their tutorial.

    If you find it useful, share it. If you have any comments or bugs, please contact me.

    #!/usr/bin/python
    
    from Bio import SeqIO
    import sys
    
    #Script to extract sequences into a child sff from a parent sff
    #Author: Mariam Rizkallah - August 16, 2011
    #argv[1] -> parent sff
    #argv[2] -> fasta
    #argv[3] -> child sff
    #Example: ./selectFromSff.py parent.sff new_fasta.fasta child.sff (p.s. chmod +x)
    
    parent_sff = sys.argv[1]
    selected_fasta = sys.argv[2]
    child_sff = sys.argv[3]
    
    records = (record for record in
    		SeqIO.parse(parent_sff, "sff")
    		if record.id in (r.id for r in
    			SeqIO.parse(selected_fasta, "fasta")
    			)
    		)
    count = SeqIO.write(records, child_sff, "sff")
    print "Selected %i records" % count
    print "Selected %i records" % count
     
    • Peter 12:34 am on August 17, 2011 Permalink | Reply

      Hi Mariam,

      This looks like it would work fine, but will re-load the FASTA file for every SFF entry. It would probably be much quicker to load the FASTA file once, building a list of IDs. Or better yet, make a set of IDs because checking set membership is faster than checking list membership. e.g.

      ids = set(r.id for r in SeqIO.parse(selected_fasta, “fasta”))
      records = (record for record in SeqIO.parse(parent_sff, “sff”) if record.id in ids)

      I haven’t tested that, but the principle is sound. I’m assuming neither the SFF file nor the FASTA file is extremely big.

      Peter

c
Compose new post
j
Next post/Next comment
k
Previous post/Previous comment
r
Reply
e
Edit
o
Show/Hide comments
t
Go to top
l
Go to login
h
Show/Hide help
shift + esc
Cancel