Updates from August, 2011 Toggle Comment Threads | Keyboard Shortcuts

  • Mariam Rizkallah 8:20 pm on August 26, 2011 Permalink | Reply
    Tags: NoteToSelf   

    It’s not about using the “hottest” “latest” update… it’s about “consistency” of software used by all parties.

     
  • Mariam Rizkallah 4:27 am on August 24, 2011 Permalink | Reply
    Tags: ctd, , shell   

    Perl vs shell scripting contest! 

    I have to write.. so I wouldn’t feel alone..

    Currently, we’re reading about the huge database “Comparative toxicogenomics database” (CTD). As an open source database, all tables are –kindly–available for download. We started to parse the comma-delimited tables whose fields are further delimited by “pipe”, where fields’ data may contain “,” also.

    As an expert in his field, our PI started to write a one-line program in perl to parse it, while an eager student like myself started to work using shell scripting. No need to say that our PI own the contest. But I want to share with you my trials.

    Dedicated to CTD!

    #first trial
    cut -f9 -d"," file.csv | sort | grep "\^" | cut -d"|" -f1,2 | uniq > file.txt #then I realized that there's >2 fields in delimited by "|", and "cut" is not really helping
    #successful trial
    awk -F"," '{for (i=1; i<NF; i++) {if ($i ~ /\^/) print $i;} }' file.csv | sort | uniq | awk -F"|" '{for (i=1; i<=NF; i++) print $i;}' | sort | uniq | sed -n 's/\^/ /gp' > file.txt
    
    #Update Sep 2, 2011
    #Parse the generated file:
    awk -F" " '{print $2;}' file.txt | sort | uniq | wc -l
    
    #Update Sep 3, 2011
    awk -F"," '{for (i=1; i<NF; i++) {if ($i ~ /\^/) print $i;} }' file.csv | sort | uniq | awk -F"|" '{for (i=1; i<=NF; i++) print $i;}' | sort | uniq > file.txt
    awk -F"^" '{print $2;}' file.txt | sort | uniq > file2.txt #Before replacement of "^", to preserve space-separated data
    awk '{print "("$i",",$i"),";}' file2.txt > file2_python_dict.dict #Wrong.. missing "'" #"," = space...
    awk '{print "("$i",",$i"),";}' file2.txt | sed -n "s/(/(\'/gp" | sed -n "s/)/\')/gp" | sed -n "s/,/\',/gp" | sed -n "s/)',/),/gp" | sed -n "s/ / '/gp" > python_dict.dict
    #There's a logical error -> '' before any space-containing line.
     
  • Mariam Rizkallah 2:50 pm on August 17, 2011 Permalink | Reply
    Tags: , community-driven, ,   

    Extract sff for selected IDs into a child sff (version 2) 

    I thank Peter Cock, Brad Chapman, and Arnold Daniels for the development of the second version.

    #!/usr/bin/python
    
    from Bio import SeqIO
    import os, sys
    
    #Script to extract sequences into a child sff from a parent sff
    #Author: Mariam Rizkallah - August 16, 2011
    #argv[1] -> parent sff
    #argv[2] -> fasta or txt contains list of IDs
    #argv[3] -> child sff
    #Example: ./selectFromSff.py parent.sff new_fasta.fasta child.sff (p.s. chmod +x)
    
    #./selectFromSFF_args.py G5UAOIB07.sff seq.fna short_new.sff
    parent_sff = sys.argv[1]
    ext = os.path.splitext(sys.argv[2])[1] #Ref: http://www.jasny.net/articles/how-to-get-a-file-extension/
    if ext == ".fasta" or ext == ".fna":
    	selected_fasta = sys.argv[2]
    	#Ref: Peter Cock http://bit.ly/ngqPW7
    	idsList = set(r.id for r in
    			SeqIO.parse(selected_fasta, "fasta")
    			)
    else:
    	#Ref: http://permalink.gmane.org/gmane.comp.python.bio.general/5175
    	ids_list = sys.argv[2]
    	ids = open(ids_list, "r")
    	idsList = []
    	for i in ids:
    		idsList.append(i.rstrip()) #remove \n
    child_sff = sys.argv[3]
    
    #Ref: http://biopython.org/SRC/biopython/Bio/SeqIO/SffIO.py
    records = (record for record in
    		SeqIO.parse(parent_sff, "sff")
    		if record.id in idsList
    		)
    count = SeqIO.write(records, child_sff, "sff")
    print "Selected %i records" % count
     
  • Mariam Rizkallah 7:30 pm on August 16, 2011 Permalink | Reply
    Tags: , extract, , , qual, sff   

    Extract certain sequences from a parent sff into a child one 

    After excluding certain sequences from my fasta file and generating a new fasta, I readlly needed all the info from the qual and the fasta to be gathered again into the sff. As a BioPython fan, I really found it all in their tutorial.

    If you find it useful, share it. If you have any comments or bugs, please contact me.

    #!/usr/bin/python
    
    from Bio import SeqIO
    import sys
    
    #Script to extract sequences into a child sff from a parent sff
    #Author: Mariam Rizkallah - August 16, 2011
    #argv[1] -> parent sff
    #argv[2] -> fasta
    #argv[3] -> child sff
    #Example: ./selectFromSff.py parent.sff new_fasta.fasta child.sff (p.s. chmod +x)
    
    parent_sff = sys.argv[1]
    selected_fasta = sys.argv[2]
    child_sff = sys.argv[3]
    
    records = (record for record in
    		SeqIO.parse(parent_sff, "sff")
    		if record.id in (r.id for r in
    			SeqIO.parse(selected_fasta, "fasta")
    			)
    		)
    count = SeqIO.write(records, child_sff, "sff")
    print "Selected %i records" % count
    print "Selected %i records" % count
     
    • Peter 12:34 am on August 17, 2011 Permalink | Reply

      Hi Mariam,

      This looks like it would work fine, but will re-load the FASTA file for every SFF entry. It would probably be much quicker to load the FASTA file once, building a list of IDs. Or better yet, make a set of IDs because checking set membership is faster than checking list membership. e.g.

      ids = set(r.id for r in SeqIO.parse(selected_fasta, “fasta”))
      records = (record for record in SeqIO.parse(parent_sff, “sff”) if record.id in ids)

      I haven’t tested that, but the principle is sound. I’m assuming neither the SFF file nor the FASTA file is extremely big.

      Peter

c
Compose new post
j
Next post/Next comment
k
Previous post/Previous comment
r
Reply
e
Edit
o
Show/Hide comments
t
Go to top
l
Go to login
h
Show/Hide help
shift + esc
Cancel