Updates from Mariam Rizkallah Toggle Comment Threads | Keyboard Shortcuts

  • Mariam Rizkallah 10:41 pm on September 29, 2013 Permalink | Reply
    Tags: anyconnect, cisco, openconnect, ubuntu   

    Free yourself and shift from Anyconnect to Openconnect!


  • Mariam Rizkallah 3:58 pm on April 16, 2012 Permalink | Reply
    Tags: , entrez, ncbi   

    The community: BioPython overcomes limitations 

    BioPython community strikes again!

    Peter Cock (@pjacock) and the BioPython community (biopython[AT]lists[DOT]open-bio[DOT]org) helped me fixing record retrieval script from NCBI.

    My “current” question was to get ESTs from nucest (“The collection of short single-read transcript sequences from GenBank”). I had one thought, how to get this (http://www.ncbi.nlm.nih.gov/nucest/?term=txid543769%5BOrganism:exp%5D) on my HDD!? I used BioPython Tutorial example to implement “History” feature of Entrez Utils to get them downloaded. I noticed that the output number of sequences is not consistent and that the error returns in XML tags within the output!

    Then, I decided to send the whole list of ids via efetch, but I faced the 10,000-record limitation. Efetch is amazing, got me very consistent results, no XML error tags..

    I tried to split the list into 10,000-id long lists, but wasn’t successful.. the same happened, I got only the first 10,000 records.

    I emailed the gang, and Peter reminded me with the good old days (to map gi_taxid, loop over the taxids to get all child nodes, and get them from the 11G-EST fasta).

    Peter went through my script and had an idea to send specific gis for the batches by pointing at gi_list indexes. He sent them using epost, then catched the WebEnv variable and sent it back through efetch. It’s perfect! I do appreciate.

    import sys
    from Bio import Entrez
    Entrez.email = “”
    txid = int(sys.argv[1])
    #get count
    prim_handle = Entrez.esearch(db=”nucest”,term=”txid%i[Organism:exp]” %(txid), retmax=20)
    prim_record = Entrez.read(prim_handle)
    prim_count = prim_record[‘Count’]
    #get max using history (Biopython tutorial http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc119)
    search_handle = Entrez.esearch(db=”nucest”,term=”txid%s[Organism:exp]” %(txid), retmax=prim_count)#, usehistory=”y”) #retmax=count
    search_results = Entrez.read(search_handle)
    gi_list = search_results[“IdList”]
    count = int(search_results[“Count”])
    print count
    assert count == len(gi_list), len(gi_list)
    #webenv = search_results[“WebEnv”]
    #query_key = search_results[“QueryKey”]
    out_fasta = “%s_txid%i_ct%i.fasta” %(sys.argv[2], txid, count)
    out_handle = open(out_fasta, “a”)
    ## Approach1: gets <XML> tags within the fasta file <ERROR>Unable to obtain query #1</ERROR> #Enhanced by Peter Cock
    batch_size = 1000
    for start in range(0,count,batch_size):
            end = min(count, start+batch_size)
            batch = gi_list[start:end]
            print “Going to download record %i to %i using epost+efetch” % (start+1, end)
            post_results = Entrez.read(Entrez.epost(“nucest”, id=”,”.join(batch)))
            webenv = post_results[“WebEnv”]
            query_key = post_results[“QueryKey”]
            fetch_handle = Entrez.efetch(db=”nucest”, rettype=”fasta”, retmode=”text”, webenv=webenv, query_key=query_key) #retstart=start, retmax=batch_size,
            data = fetch_handle.read()
    ## Approach2: split list [REF needed]
    #def SplitList( list, chunk_size ) :
    #        return [list[offs:offs+chunk_size] for offs in range(0, len(list), chunk_size)]
    #z = SplitList(gi_list, 10000)
    #for i in range(0, len(z)):
    #       print len(z[i])
    #       post_handle = Entrez.epost(“nucest”, rettype=”fasta”, retmode=”text”, id=”,”.join(z[1]))
    #       webenv = search_results[“WebEnv”]
    #       query_key = search_results[“QueryKey”]
    #       fetch_handle = Entrez.efetch(db=”nucest”, rettype=”fasta”, retmode=”text”, webenv=webenv, query_key=query_key)
    #       data = fetch_handle.read()
    #       fetch_handle.close()
    #       out_handle.write(data)
    ## Approach3: with most consistent retrieval but limited to 10000
    #fetch_handle = Entrez.efetch(db=”nucest”, rettype=”fasta”, retmode=”text”, webenv=webenv, query_key=query_key)
    #data = fetch_handle.read()

  • Mariam Rizkallah 12:14 pm on April 4, 2012 Permalink | Reply
    Tags: AUC, cycle, grads, mentorship, perish, publishing, research, thesis   

    Thesis Writing Process Workshop For students in Sciences and Engineering – AUC 

    The notes that I took at the “Thesis Writing Process Workshop” conducted by Drs. Sherif Aly and Ali Darwish at AUC.


    A) General tips:

    Why go for grad school, and which program to choose and which type of program (thesis vs. non-thesis)?

    The more you give, the more you receive

    Don’t work on your own.. try 3 topics, work in 2-researcher groups so whoever finishes first you two publish

    Zoom out – Understand the field at the fundamental level

    Intelligent creatures communicate with one way or another (Don’t live in isolation)

    B) Selecting a topic:

    Thesis: a dissertation in which you

    • independently address an important problem,
    • provide original contribution to knowledge
    • Re-Search: again (see what is obvious) and again (see the less obvious) and again (understand the phenomenon)…

    C) “If your mother tells you that she loves you, look for evidence!”

    Ibn-Al-Haitham, the father of scientific research, Make a hypothesis: Postulate -> test it and test the opposite->accept the results without assumptions

    Never accept facts or assumptions without verifying them.

    –          READ READ A LOT. Take notes.

    –          Start early about the topic

    –          Talk to others. Don’t insulate yourself, verbalize thoughts, logical arguments, questions.

    D) Where to look?

    • Look in dissertations rather than papers
    • Suggestions for future prospects
    • Conference presentations
    • Big scholars (key players in your field)
    • Landmark papers (close gap in your understanding, don’t reinvent the wheel)

    E) Factors affecting topic choice:

    • Your next phase (academia vs. industry)
    • Gut instinct (100% motivation vs. 100% intelligence) (completely dedicate yourself to science)
    • Don’t try for Nobel Prize!
    • Elegant and simple (simple things last)
    • Practice enlightened trial and error (Logical moves vs. fooling around “curiosity”)
    • Lose focus sometime
    • Advisor’s specialization
    • What’s hot is not in books
    • Discover what’s missing
    • Make analogy, borrow ideas from other domains (Cars and people)

    ** Robert Boice: “Write before you’re ready”

    G) How to pick a point:

    1- Understand existing system

    2- Decompose the problem

    3- Pick choices (mix and match, because you understand a problem at the fundamental level)

    H) Crash ideas:

    1-       Don’t shut idea because it’s insane (filter later)

    2-       Draw ideas -> physical form

    3-       Wish and wonder that you overcame a problem

    4-       Go to the beach.. relax

    5-       TIPS: Theory of inventive problem solving

    6-       Nature: “What kind of person would spend hours and hours thinking about electric field?” Maxwell

    7-      Have good command of tools

    8-      Read Bios (Wikipedia, what inspired El-Khawarizmi, what was unknown to him that made him go for the polynomial equations?)

    9-      Observe: what are people using and why?

    10-   Discover a gap in people’s knowledge

    11-    Intentionally screw equation up

    ** “Continue looking for needles in a haystack, don’t be satisfied with the first one”

    Choosing a topic is like choosing a partner (Love, commitment and dedication)!

    Proposing for your topic

    1) Formulate a proposal: Means that I am working on a justifiable topic.

    –          A research is as good as a proposal (committee, grant, conference …)

    –          What do you want to accomplish.. what are you after?

    –          Why I would like to change the way of things? Fire up the audience -> killer scenario

    –          How are you going to solve the problem? Don’t be worried that you don’t ready solution, this means that it is a “research topic”

    –          Audience: How scientifically you sound to tackle a problem? Read others work, homework, what’s been achieved? Focused? Very obvious things don’t need a scientific research (bang screen hard will break it #Fact)

    –          Look for gaps, break that assumptions

    2) Proposal:

    1-       Title: attractive, descriptive, suitable length

    2-       Introduction: put reader in context, funneling approach (1 paragraph for each level), give example to raise the question (data transfer), list key contributions, general methodology

    3-       Problem definition: 2 paragraphs, concise and clear

    4-       Thesis contribution statement (1-2 paragraphs, very accurate)

    5-       Literature review: recent, done homework [2012 published article = 2009 finished work], conference proceedings, look right, evaluate others work and classification.

    6-       Proposed approach: solid methodology, anyone else done this before, how are you going to evaluate myself (comparative, (automated design relatively to human designers), modifiable

    7-       Conclusion

    Research cycle

    ** One common thing among all great scientists: persistence, training, not intelligent: every time they fail, they re-try

    Being a disciple of good scientists -> tree of supervisors -> Master something

    Close the loops and complete the cycle:

    Crazy ideas/concepts (generated faster, but you need prioritization ) -> Plan/design -> Fund [non-profit]/pay students’ tuition -> implement ideas -> analyze idea (evaluate research, assess yourself, compare to other approaches) -> dissemination [publish, conference] [rejected from a certain journal != bad] -> when polluted with the ecosystem (take year-off, study abroad)  -> evaluate and recommend

    ** “Egypt’s past regime implemented the policy of “Man in the iron mask” locking creative people

    Last stage [Last month.. don’t leave the school/lab]: Play with parameter, graphs, and trial and error..

    If you’re running for PhD or postdoc, labs prefer to take a plug-and-play researcher (already had the experience, mastered the techniques, and published) ..

    True mentor says: “I will never put my name before your name unless I did the bulk of the work”

    Create a mind map: put what is in your head on a paper

    Thesis = proposal – a component (the not-sure-methodology)++

    The process of writing a thesis should be iterative, have a holistic view.. (Do your research and build on top of it a dissertation -> PhD)

    –          Arguments clear [enough information to reproduce and build on it]

    –          Tone appropriate

    –          Considerate to reads, explain to readers in details every step, Don’t shock the reader

    –          Follow a nice logical flow, roadmap Section1 (paragraph) 1.1…etc. Topic sentence..etc.

    Publishing your work [Publish or perish]

    1)      Make sure that your work is assess-able by others, and repeat-able

    3)       If you don’t publish, your thesis is a waste of time.. Great instructors were let when they don’t publish, you’re so weak, research was useless.

    4)       When: NOW.. write reviews about state-of-the-art.. mentors can say “write me something within the month and I send it out”

    5)       Where: Choosing a journal

    6)       Follow the instructions: Format…etc.

    ** Never say that authors work is incomplete, and here we …etc. -> Authors addressed those aspects, and we are addressing the following..

    “Neat work .. don’t get lost” “what is the significant if it’s done before”

    Plan (outline) -> very rough draft -> second draft -> send to advisor -> 3-4 iterations -> polished ->send to journal

    Tips to increase the publishability of your manuscript:

    –          Abstract: most important, written last (2 weeks vs. 2 months)

    –          Introduction: why, how and what

    –          Referencing milestones, clarity, advisor’s name/lab, sound conclusion and discussion

    –          State very clearly what’s novel

    –          Art of storytelling (consistency)

    –          Determination


    End of talk!

    What really make me feel extremely stupid is that most tips I get them all the time from mentors, friends.. Stupid! Why not implementing them!? Plus, speakers linked this to their own research experience, which added intimacy –sometimes despair (!)– to their talk.

  • Mariam Rizkallah 7:45 pm on February 24, 2012 Permalink | Reply  

    Is gi_taxid mapper by NCBI really enough!? 

    At some point of my career, I have to deal with NCBI raw data. Fortunately, I started this as soon as possible. What I really wanted to do, is to filter the gi’s of nr that have taxid in the gi_taxid_protein mapper created by NCBI.

    I tried this:

    awk -F”\t” ‘BEGIN {while ( i = getline < “gi.list”) ar[$i] = $1;} {if ($1 in ar) print $0;}’ gi_taxid_prot.dmp > gi_taxid_prot.filtered

    However, the numbers really concern me:

    16828865 nr entries, 47308513 gi_taxid_prot pairs, 16807310 gi_taxid_prot.filtered pairs. How come that nr has 21,555 entries with no gi_taxid mapping?

    I am not sure about that and I don’t know whether I can figure it out or not.

    • Mariam Rizkallah 8:52 pm on February 24, 2012 Permalink | Reply

      I filtered them the other way around, I extracted the gi entries with no matches, and tried to find out about couple of them from NCBI. It turned out that those sequences has been updated and got themselves new gis that happen to be in the gi_taxid mapper. Will now I can see that the mapper is updated while nr is not? I don’t know.
      awk -F”\t” ‘BEGIN {while ( i = getline “filtered”;} else {print $0 > “no_match”;}}’ gi &

  • Mariam Rizkallah 8:20 pm on February 17, 2012 Permalink | Reply
    Tags: blast, math101   

    Practicing exponents (http://www.aaamath.com/dec71ix2.htm).. I am into BLAST and have to know what’s the difference between Expectation value 10 and 1E-10? I got an e-val of 1e-148, is it good or bad?

    Now, I recalled what we had back in college. 0.00000304 = 3.04*10-6 (6 places).

    I will translate E =10 in my head for now 😉 That makes 3.04E-6 an alternative to the previous solutions.

    That makes me sort BLAST results like this: 1e-148 is better or worse than 8e-119? 1*10-148 or 8*10-119? I guess it’s about the exponent rather than the number itself (the base..? Can’t remember?)

    And for BLAST e-val cutoffs:10 = 1*10^1 = 1e+1
    however 1e-5 = 1*10^-5 = 0.00001

    Never to forget that the smaller is the most significant

  • Mariam Rizkallah 12:13 am on January 19, 2012 Permalink | Reply  


    Hello, BioPython World!
    The fact that BioPython 1.58 now supports abi format encouraged me to work on abi2fasta.py. I made 3 trials, one with SeqIO.convert(), but the problem is that it creates separate files for each abi, and I needed them in one fasta. Another way was to append on a single fasta. A third way is to parse the abi, get the info, then convert, but I was not sure whether it will return  a SeqRecord object or not. I tried SeqIO.to_dict() which is cool, but leaves you to deal with dicts. enumerate(SeqIO.parse()) is cool, however, I liked the to_dict() more! Yes! Then, I found that SeqIO.parse() already returns a SeqRecord object 🙂
    import sys
    import traceback #Dv
    import os
    from Bio import SeqIO
    from Bio.Seq import Seq
    from Bio.Alphabet import IUPAC
    from Bio.SeqRecord import SeqRecord
    __date__ =”$Jan 18, 2012 4:29:17 PM$”
        print “Welcome to abi2fasta.py”
        count = 0
        path = sys.argv[1]
        output_dir = sys.argv[2]
        output_file = sys.argv[3]
        handle_fasta = open(output_file+”.fasta”, “a”)
        handle_fastq = open(output_file+”.fastq”, “a”)
        handle_qual = open(output_file+”.qual”, “a”)
        ls = os.listdir(path)
        for file in ls:
            ext = os.path.splitext(file)[1]
            rt = os.path.splitext(file)[0]
            if ext == “.ab1”:
                for seq_record in SeqIO.parse(path+file, “abi”):
                    seq_record.id = path.strip(“/”)+’_’+seq_record.id+’_’+seq_record.name
                    #print seq_record.id, seq_record.name, seq_record.seq
                    #record = SeqRecord(seq=seq_record.seq, id=path.strip(“/”)+’_’+seq_record.id+’_’+seq_record.name, description=””)
                #Generator returns a dict {‘MySeq’: SeqRecord(seq=Seq(‘AATTGGCC’, IUPACUnambiguousDNA()), id=’MySeq’, name=’abi_name’, description=”, dbxrefs=[])}
                #record_dict = SeqIO.to_dict(SeqIO.parse(path+file, “abi”))
                #seq_record = str(record_dict.values()).strip(‘[]’)
                #for index, record in enumerate(SeqIO.parse(path+file, “abi”)):
                    #print “ID = %s, length %i, with %i features” % (record.id, len(record.seq), len(record.features))
                #append to one file
                #SeqIO.convert(path+file, “abi”, handle_fasta, “fasta”)
                #SeqIO.convert(path+file, “abi”, handle_fastq, “fastq”)
                #SeqIO.convert(path+file, “abi”, handle_qual, “qual”)
                #separate files
                #SeqIO.convert(path+file, “abi”, output_dir+rt+”.fastq”, “fastq”)
                #SeqIO.convert(path+file, “abi”, output_dir+rt+”.fasta”, “fasta”)
                count += 1
        print “Converted %i records” % count
    except traceback, t:
        print “[Usage]: ./abi2fasta abi_dir output_dir output_file”
        print print_tb(t)


  • Mariam Rizkallah 7:27 pm on January 3, 2012 Permalink | Reply  

    My second one-liner to change sequences identifiers 

    Well.. I should use biopython for that.

    First one-liner: sed ‘s/>/>region_/g’ region/assembly/454AllContigs.fna > 454AllContigs2.fna
    Second one-liner: awk ‘BEGIN {FS=”\|”} {if ($0 ~ “\>”) {print “>Ecoli_gi”$2;} else {print $0;}}’ Ecoli.fasta > Ecoli2.fasta

  • Mariam Rizkallah 5:44 pm on November 19, 2011 Permalink | Reply
    Tags: bioinformatics, GC content,   

    Welcome back, Perl (GC content) 

    I coded in Perl for 1-2 weeks in my life 7 months ago, then shifted to Python –and PHP for sometime– for the previous 7 months. Now, I am back to Perl –somehow! Started by this GC content calculator.

    #! /usr/bin/perl -w

    use strict;

    #Hadling args
    #print “My name is $0 \n”; #Script name
    #print “First arg is: $ARGV[0] \n”;
    #$num = $#ARGV + 1; print “How many args? $num \n”;
    #print “The full argument string was: @ARGV \n”;

    #Calc GC contenct

    sub calcgc {
    my $seq = $_[0];
    my $count = 0;
    $count++ while ($seq =~ m/[GC]/gi); #contains, case-insens-
    my $num = $count / length($seq);
    #my ($dec) = $num =~ /(\S{0,6})/;
    my $dec = sprintf(“%.3f”, $num);
    return $dec;

    #my $gc = calcgc($seq);

    print(“Welcome back, Perl\n”);

    my $a = $ARGV[0];
    open(IN, $a) || die “cannot open $a for reading: $!”;
    while (<IN>) { # read a line from file $a into $_
    #print $_; # print that line to STDOUT
    if (/^>/) {
    print $_;
    next if /^>/; #http://stackoverflow.com/questions/4119225/perl-regular-expressions-delete-line-if-it-starts-with
    #print $_;
    my $gc = calcgc($_);

  • Mariam Rizkallah 8:20 pm on August 26, 2011 Permalink | Reply
    Tags: NoteToSelf   

    It’s not about using the “hottest” “latest” update… it’s about “consistency” of software used by all parties.

  • Mariam Rizkallah 4:27 am on August 24, 2011 Permalink | Reply
    Tags: ctd, , shell   

    Perl vs shell scripting contest! 

    I have to write.. so I wouldn’t feel alone..

    Currently, we’re reading about the huge database “Comparative toxicogenomics database” (CTD). As an open source database, all tables are –kindly–available for download. We started to parse the comma-delimited tables whose fields are further delimited by “pipe”, where fields’ data may contain “,” also.

    As an expert in his field, our PI started to write a one-line program in perl to parse it, while an eager student like myself started to work using shell scripting. No need to say that our PI own the contest. But I want to share with you my trials.

    Dedicated to CTD!

    #first trial
    cut -f9 -d"," file.csv | sort | grep "\^" | cut -d"|" -f1,2 | uniq > file.txt #then I realized that there's >2 fields in delimited by "|", and "cut" is not really helping
    #successful trial
    awk -F"," '{for (i=1; i<NF; i++) {if ($i ~ /\^/) print $i;} }' file.csv | sort | uniq | awk -F"|" '{for (i=1; i<=NF; i++) print $i;}' | sort | uniq | sed -n 's/\^/ /gp' > file.txt
    #Update Sep 2, 2011
    #Parse the generated file:
    awk -F" " '{print $2;}' file.txt | sort | uniq | wc -l
    #Update Sep 3, 2011
    awk -F"," '{for (i=1; i<NF; i++) {if ($i ~ /\^/) print $i;} }' file.csv | sort | uniq | awk -F"|" '{for (i=1; i<=NF; i++) print $i;}' | sort | uniq > file.txt
    awk -F"^" '{print $2;}' file.txt | sort | uniq > file2.txt #Before replacement of "^", to preserve space-separated data
    awk '{print "("$i",",$i"),";}' file2.txt > file2_python_dict.dict #Wrong.. missing "'" #"," = space...
    awk '{print "("$i",",$i"),";}' file2.txt | sed -n "s/(/(\'/gp" | sed -n "s/)/\')/gp" | sed -n "s/,/\',/gp" | sed -n "s/)',/),/gp" | sed -n "s/ / '/gp" > python_dict.dict
    #There's a logical error -> '' before any space-containing line.
Compose new post
Next post/Next comment
Previous post/Previous comment
Show/Hide comments
Go to top
Go to login
Show/Hide help
shift + esc