Perl vs shell scripting contest! 

I have to write this down.. so I wouldn’t feel alone..

Currently, we’re reading about the huge Comparative Toxicogenomics Database (CTD). As an open-source database, all of its tables are – kindly – available for download. We started parsing the comma-delimited tables, whose fields are further delimited by a pipe (“|”), and whose field data may itself contain “,” as well.
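To make the format concrete, here is a made-up row in that shape (the IDs and names are invented for illustration, not actual CTD records): comma-delimited fields, one field packing “|”-delimited subfields, each subfield an accession^name pair.

```shell
# Hypothetical sample row (invented IDs, for illustration only).
row='MESH:D012345,"caffeine, anhydrous",GO:0006915^apoptosis|GO:0008219^cell death'

# Naive comma splitting breaks the quoted field that contains a comma:
printf '%s\n' "$row" | awk -F"," '{print NF}'   # prints 4, not 3
```

This is exactly why a plain `-F","` split is risky here: any field whose text contains a comma gets split in two.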

As an expert in his field, our PI started writing a Perl one-liner to parse it, while an eager student like myself went to work with shell scripting. Needless to say, our PI won the contest. But I want to share my trials with you.

Dedicated to CTD!

#first trial
cut -f9 -d"," file.csv | sort | grep "\^" | cut -d"|" -f1,2 | uniq > file.txt # then I realized there are more than 2 "|"-delimited subfields, and "cut" isn't really helping
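The problem with `cut -f1,2` is easy to show on a made-up field with three pipe-delimited subfields (invented IDs): whatever comes after the second “|” is silently dropped.

```shell
# "cut -f1,2" keeps only the first two "|"-delimited subfields;
# the third one is lost without warning:
printf '%s\n' 'GO:1^a|GO:2^b|GO:3^c' | cut -d"|" -f1,2
# GO:1^a|GO:2^b   (GO:3^c is gone)
```

Since the number of subfields varies per row, a fixed field list can never be right; that is what pushed me toward looping over fields in awk.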
#successful trial
awk -F"," '{for (i=1; i<=NF; i++) {if ($i ~ /\^/) print $i;} }' file.csv | sort | uniq | awk -F"|" '{for (i=1; i<=NF; i++) print $i;}' | sort | uniq | sed 's/\^/ /g' > file.txt # note i<=NF (i<NF skips the last field), and plain sed 's///g' (with "sed -n 's///gp'", subfields without "^" would be silently dropped)
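As a sanity check, here is the same pipeline run on a single made-up row (invented GO IDs); note the loop bound has to be `i<=NF`, and `sed` must run without `-n`/`p`, otherwise the last field and “^”-free subfields get dropped:

```shell
# Pull out "^"-containing fields, split them on "|", dedupe,
# then turn the "^" separator into a space:
printf '%s\n' 'X,GO:0006915^apoptosis|GO:0008219^cell death,Y' \
  | awk -F"," '{for (i=1; i<=NF; i++) {if ($i ~ /\^/) print $i;} }' \
  | sort | uniq \
  | awk -F"|" '{for (i=1; i<=NF; i++) print $i;}' \
  | sort | uniq \
  | sed 's/\^/ /g'
# GO:0006915 apoptosis
# GO:0008219 cell death
```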

#Update Sep 2, 2011
#Parse the generated file:
awk -F" " '{print $2;}' file.txt | sort | uniq | wc -l
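A quick self-contained check of the counting step, on two made-up rows that share a name:

```shell
# Count distinct names in column 2 of the accession/name file:
printf '%s\n' 'GO:0006915 apoptosis' 'GO:0008219 apoptosis' \
  | awk -F" " '{print $2;}' | sort | uniq | wc -l
# prints 1 (both rows carry the name "apoptosis")
```

One caveat: `$2` only captures the first word, so multi-word names like “cell death” get truncated here; the Sep 3 update below keeps the “^” separator precisely to avoid that.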

#Update Sep 3, 2011
awk -F"," '{for (i=1; i<=NF; i++) {if ($i ~ /\^/) print $i;} }' file.csv | sort | uniq | awk -F"|" '{for (i=1; i<=NF; i++) print $i;}' | sort | uniq > file.txt # i<=NF so the last field isn't skipped
awk -F"^" '{print $2;}' file.txt | sort | uniq > file2.txt # split on "^" before replacing it, so names containing spaces stay intact
awk '{print "("$i",",$i"),";}' file2.txt > file2_python_dict.dict # wrong: the "'" quotes are missing (the "," between print arguments just comes out as a space)
awk '{print "("$i",",$i"),";}' file2.txt | sed -n "s/(/(\'/gp" | sed -n "s/)/\')/gp" | sed -n "s/,/\',/gp" | sed -n "s/)',/),/gp" | sed -n "s/ / '/gp" > python_dict.dict
#There's still a logical error: names that contain a space end up with a stray '' at the start of the line.
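In hindsight, a single awk `printf` avoids the whole chain of sed patch-ups: quote each name once, with `\047` (octal for the single quote) so no shell-quoting gymnastics are needed. A sketch, assuming file2.txt holds one name per line (spaces allowed):

```shell
# Build "('name', 'name')," lines directly; \047 is the "'" character.
# $0 keeps the whole line, so multi-word names survive intact:
printf '%s\n' 'apoptosis' 'cell death' \
  | awk '{printf "(\047%s\047, \047%s\047),\n", $0, $0}'
# ('apoptosis', 'apoptosis'),
# ('cell death', 'cell death'),
```

With the real file it would be `awk '{printf ...}' file2.txt > python_dict.dict`. Names that themselves contain an apostrophe would still break the Python literal, but that case doesn't arise here.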