#!/usr/bin/python from Bio import SeqIO import os, sys #Script to extract sequences into a child sff from a parent sff #Author: Mariam Rizkallah - August 16, 2011 #argv -> parent sff #argv -> fasta or txt contains list of IDs #argv -> child sff #Example: ./selectFromSff.py parent.sff new_fasta.fasta child.sff (p.s. chmod +x) #./selectFromSFF_args.py G5UAOIB07.sff seq.fna short_new.sff parent_sff = sys.argv ext = os.path.splitext(sys.argv) #Ref: http://www.jasny.net/articles/how-to-get-a-file-extension/ if ext == ".fasta" or ext == ".fna": selected_fasta = sys.argv #Ref: Peter Cock http://bit.ly/ngqPW7 idsList = set(r.id for r in SeqIO.parse(selected_fasta, "fasta") ) else: #Ref: http://permalink.gmane.org/gmane.comp.python.bio.general/5175 ids_list = sys.argv ids = open(ids_list, "r") idsList =  for i in ids: idsList.append(i.rstrip()) #remove \n child_sff = sys.argv #Ref: http://biopython.org/SRC/biopython/Bio/SeqIO/SffIO.py records = (record for record in SeqIO.parse(parent_sff, "sff") if record.id in idsList ) count = SeqIO.write(records, child_sff, "sff") print "Selected %i records" % count
Tagged: fasta Toggle Comment Threads | Keyboard Shortcuts
After excluding certain sequences from my fasta file and generating a new fasta, I readlly needed all the info from the qual and the fasta to be gathered again into the sff. As a BioPython fan, I really found it all in their tutorial.
If you find it useful, share it. If you have any comments or bugs, please contact me.
#!/usr/bin/python from Bio import SeqIO import sys #Script to extract sequences into a child sff from a parent sff #Author: Mariam Rizkallah - August 16, 2011 #argv -> parent sff #argv -> fasta #argv -> child sff #Example: ./selectFromSff.py parent.sff new_fasta.fasta child.sff (p.s. chmod +x) parent_sff = sys.argv selected_fasta = sys.argv child_sff = sys.argv records = (record for record in SeqIO.parse(parent_sff, "sff") if record.id in (r.id for r in SeqIO.parse(selected_fasta, "fasta") ) ) count = SeqIO.write(records, child_sff, "sff") print "Selected %i records" % count print "Selected %i records" % count30.076292 31.208903
June 2, 2011
As soon as Beijing Genomics Institute has finished sequencing the E. coli strain that is responsible for the outbreak in Europe, BGI made the “draft assembly” available for download. I didn’t know the difference between a draft and the elegant assembled genome I used to work on while annotating phage capsid proteins, but I did believe that this outbreak idea is a phage thing, you know, “Cherchez le phage!” I wanted to annotate this E. coli.
I suggested to Dr. Aziz that this “genome sequence” may benefit from RAST (Rapid Annotation using Subsystem Technology – http://rast.nmpdr.org), the tool developed by Dr. Edwards’ Lab and ANL for rapid and accurate annotation of bacterial and archaeal genomes. Dr. Aziz replied “Then what are you waiting for? Upload it to RAST”. Then he told me how the draft is in “FASTQ” format, a machine output format that can not be an input for RAST (and for Real-Time Metagenomics.web in particular http://edwards.sdsu.edu/rtmg/). Dr. Aziz explained to me that each “run/genome fragment” will be treated as metagenome, that’s why I should use RTMg.
Step1: Convert sequences from fastq to fasta: I searched the web for the problem, then I found this one-line program in Perl by Robert Schmeider, PhD student at Dr. Edwards’ lab on his lab blog. I am very grateful, Robert!
Step2: Uploading the draft in fasta format to RTMg as separate “runs”. I don’t know if it’s the right measure, or I should’ve collected all “runs” in one big fasta file and let RTMg process them.
Step3:I made use of the “tab-separated” output of RTMg and saved it to a spreadsheet (I do apologize, but the spreadsheet on Google is not viewable, please download it to edit it). I saved the results sorted by “function” and by “3-level subsystem”.
Step4: ??? I don’t know really. I do want to make use of RTMg data, but I don’t know where to start. I guess aligning the 5 runs as well as the strain vs other E. coli strains will be the answer. Community will take STEC O104:H4 down sooner than “it thought”.
I quote Dr. Aziz here:
“RTMg is mainly a tool for metagenomics analysis, and it treats every sequence read as an environmental gene tag (EGT) or simply as an independent fragment. It cannot be really used to explain a genome; however, it gives a really quick ideaof what each sequence looks like, of the ratio of different genes, of the most frequent protein-encoding genes in this strain, the density of phage proteins (for example, you can write a script to count the word phage and calculate its percent to the overall genome), etc.
There are two main things needed though before this: 1) is to clean up the sequence fragments; 2) is to assemble them (using, for example, NEWBLER). To check the sequence and clean it up check: Prinseq and Tagcleaner then re-run. Another way to benefit from RTMg would be to compare that strain with another one sequenced exactly the same way, which is not relevant here.”
June 4, 2011