Recent Updates Page 2 Toggle Comment Threads | Keyboard Shortcuts

  • Mariam Rizkallah 2:50 pm on August 17, 2011 Permalink | Reply
    Tags: , community-driven, ,   

    Extract sff for selected IDs into a child sff (version 2) 

    I thank Peter Cock, Brad Chapman, and Arnold Daniels for the development of the second version.

    #!/usr/bin/python
    
    from Bio import SeqIO
    import os, sys
    
    #Script to extract sequences into a child sff from a parent sff
    #Author: Mariam Rizkallah - August 16, 2011
    #argv[1] -> parent sff
    #argv[2] -> fasta or txt contains list of IDs
    #argv[3] -> child sff
    #Example: ./selectFromSff.py parent.sff new_fasta.fasta child.sff (p.s. chmod +x)
    
    #./selectFromSFF_args.py G5UAOIB07.sff seq.fna short_new.sff
    parent_sff = sys.argv[1]
    ext = os.path.splitext(sys.argv[2])[1] #Ref: http://www.jasny.net/articles/how-to-get-a-file-extension/
    if ext == ".fasta" or ext == ".fna":
    	selected_fasta = sys.argv[2]
    	#Ref: Peter Cock http://bit.ly/ngqPW7
    	idsList = set(r.id for r in
    			SeqIO.parse(selected_fasta, "fasta")
    			)
    else:
    	#Ref: http://permalink.gmane.org/gmane.comp.python.bio.general/5175
    	ids_list = sys.argv[2]
    	ids = open(ids_list, "r")
    	idsList = []
    	for i in ids:
    		idsList.append(i.rstrip()) #remove \n
    child_sff = sys.argv[3]
    
    #Ref: http://biopython.org/SRC/biopython/Bio/SeqIO/SffIO.py
    records = (record for record in
    		SeqIO.parse(parent_sff, "sff")
    		if record.id in idsList
    		)
    count = SeqIO.write(records, child_sff, "sff")
    print "Selected %i records" % count
    Advertisements
     
  • Mariam Rizkallah 7:30 pm on August 16, 2011 Permalink | Reply
    Tags: , extract, , , qual, sff   

    Extract certain sequences from a parent sff into a child one 

    After excluding certain sequences from my fasta file and generating a new fasta, I readlly needed all the info from the qual and the fasta to be gathered again into the sff. As a BioPython fan, I really found it all in their tutorial.

    If you find it useful, share it. If you have any comments or bugs, please contact me.

    #!/usr/bin/python
    
    from Bio import SeqIO
    import sys
    
    #Script to extract sequences into a child sff from a parent sff
    #Author: Mariam Rizkallah - August 16, 2011
    #argv[1] -> parent sff
    #argv[2] -> fasta
    #argv[3] -> child sff
    #Example: ./selectFromSff.py parent.sff new_fasta.fasta child.sff (p.s. chmod +x)
    
    parent_sff = sys.argv[1]
    selected_fasta = sys.argv[2]
    child_sff = sys.argv[3]
    
    records = (record for record in
    		SeqIO.parse(parent_sff, "sff")
    		if record.id in (r.id for r in
    			SeqIO.parse(selected_fasta, "fasta")
    			)
    		)
    count = SeqIO.write(records, child_sff, "sff")
    print "Selected %i records" % count
    print "Selected %i records" % count
     
    • Peter 12:34 am on August 17, 2011 Permalink | Reply

      Hi Mariam,

      This looks like it would work fine, but will re-load the FASTA file for every SFF entry. It would probably be much quicker to load the FASTA file once, building a list of IDs. Or better yet, make a set of IDs because checking set membership is faster than checking list membership. e.g.

      ids = set(r.id for r in SeqIO.parse(selected_fasta, “fasta”))
      records = (record for record in SeqIO.parse(parent_sff, “sff”) if record.id in ids)

      I haven’t tested that, but the principle is sound. I’m assuming neither the SFF file nor the FASTA file is extremely big.

      Peter

  • Mariam Rizkallah 1:57 pm on June 6, 2011 Permalink | Reply  

    I believe that I will run TY2842 once more in RAST and LB226692 as well. Then run them through phiSpy to check for new phages http://sourceforge.net/projects/phispy/

     
  • Mariam Rizkallah 2:15 am on June 5, 2011 Permalink | Reply
    Tags: assembly, bgi, coli, edwards, europe, , fastq, germany, O104:H4, outbreak, phage, rast, rtmg, sdsu, sequence   

    Can RTmg help annotating E. coli O104 (STEC O104:H4)? 

    June 2, 2011

    As soon as Beijing Genomics Institute has finished sequencing the E. coli strain that is responsible for the outbreak in Europe, BGI made the “draft assembly” available for download. I didn’t know the difference between a draft and the elegant assembled genome I used to work on while annotating phage capsid proteins, but I did believe that this outbreak idea is a phage thing, you know, “Cherchez le phage!” I wanted to annotate this E. coli.

    I suggested to Dr. Aziz that this “genome sequence” may benefit from RAST (Rapid Annotation using Subsystem Technology – http://rast.nmpdr.org), the tool developed by Dr. Edwards’ Lab and ANL for rapid and accurate annotation of bacterial and archaeal genomes. Dr. Aziz replied “Then what are you waiting for? Upload it to RAST”. Then he told me how the draft is in “FASTQ” format, a machine output format that can not be an input for RAST (and for Real-Time Metagenomics.web in particular http://edwards.sdsu.edu/rtmg/). Dr. Aziz explained to me that each “run/genome fragment” will be treated as metagenome, that’s why I should use RTMg.

    Step1: Convert sequences from fastq to fasta: I searched the web for the problem, then I found this one-line program in Perl by Robert Schmeider, PhD student at Dr. Edwards’ lab on his lab blog. I am very grateful, Robert!

    Step2: Uploading the draft in fasta format to RTMg as separate “runs”. I don’t know if it’s the right measure, or I should’ve collected all “runs” in one big fasta file and let RTMg process them.

    Step3:I made use of the “tab-separated” output of RTMg and saved it to a spreadsheet (I do apologize, but the spreadsheet on Google is not viewable, please download it to edit it). I saved the results sorted by “function” and by “3-level subsystem”.

    Step4: ??? I don’t know really. I do want to make use of RTMg data, but I don’t know where to start. I guess aligning the 5 runs as well as the strain vs other E. coli strains will be the answer. Community will take STEC O104:H4 down sooner than “it thought”.

    I quote Dr. Aziz here:

    “RTMg is mainly a tool for metagenomics analysis, and it treats every sequence read as an environmental gene tag (EGT) or simply as an independent fragment. It cannot be really used to explain a genome; however, it gives a really quick ideaof what each sequence looks like, of the ratio of different genes, of the most frequent protein-encoding genes in this strain, the density of phage proteins (for example, you can write a script to count the word phage and calculate its percent to the overall genome), etc.

    There are two main things needed though before this: 1) is to clean up the sequence fragments; 2) is to assemble them (using, for example, NEWBLER). To check the sequence and clean it up check: Prinseq and Tagcleaner then re-run. Another way to benefit from RTMg would be to compare that strain with another one sequenced exactly the same way, which is not relevant here.”

    June 4, 2011

    I thank Prof. Jonathan Eisen for sharing BGI assembly updates. Great effort by Nick Loman and many scientists as well.


     
  • Mariam Rizkallah 1:36 am on June 5, 2011 Permalink | Reply  

    Hello world! 

    Welcome to WordPress.com. After you read this, you should delete and write your own post, with a new title above. Or hit Add New on the left (of the admin dashboard) to start a fresh post.

    Here are some suggestions for your first post.

    1. You can find new ideas for what to blog about by reading the Daily Post.
    2. Add PressThis to your browser. It creates a new blog post for you about any interesting  page you read on the web.
    3. Make some changes to this page, and then hit preview on the right. You can alway preview any post or edit you before you share it to the world.
     
    • Mr WordPress 1:36 am on June 5, 2011 Permalink | Reply

      Hi, this is a comment.
      To delete a comment, just log in, and view the posts’ comments, there you will have the option to edit or delete them.

c
Compose new post
j
Next post/Next comment
k
Previous post/Previous comment
r
Reply
e
Edit
o
Show/Hide comments
t
Go to top
l
Go to login
h
Show/Hide help
shift + esc
Cancel