Is the gi_taxid mapping from NCBI really enough?

At some point in my career I was bound to deal with raw NCBI data, and fortunately I started early. What I wanted was to keep only those gi’s from nr that have a taxid in the gi_taxid_prot mapping file created by NCBI.

I tried this:

awk -F"\t" 'BEGIN { while ((getline < "gi.list") > 0) ar[$1] = 1 } $1 in ar' gi_taxid_prot.dmp > gi_taxid_prot.filtered

However, the numbers really concern me:

16828865 nr entries, 47308513 gi_taxid_prot pairs, 16807310 gi_taxid_prot.filtered pairs. How come nr has 21,555 entries (16828865 - 16807310) with no gi_taxid mapping?

I am not sure why that is, and I don’t know whether I can figure it out.
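One way to start digging is to list exactly which gi’s are missing instead of only counting them. The following is a minimal sketch, not a definitive method. It assumes gi.list holds one gi per line and gi_taxid_prot.dmp holds tab-separated gi and taxid columns; the function name unmapped_gis is made up for illustration.

```shell
# Print every GI from the list file that has no row in the gi-to-taxid dump.
# Usage: unmapped_gis GI_LIST GI_TAXID_DUMP   (hypothetical helper name)
# Assumes: GI_LIST has one GI per line; the dump is "gi<TAB>taxid" per line.
unmapped_gis() {
    awk -F'\t' '
        FILENAME == ARGV[1] { want[$1] = 1; next }  # first file: remember each GI
        ($1 in want)        { delete want[$1] }     # dump: this GI has a mapping
        END { for (gi in want) print gi }           # leftovers have no taxid
    ' "$1" "$2"
}
```

Running unmapped_gis gi.list gi_taxid_prot.dmp > gi.unmapped and then wc -l < gi.unmapped gives the number of unmapped gi’s (gi.unmapped is just an example file name). If gi.list has no duplicate gi’s and each gi appears at most once in the dump, that count should equal the 21,555 gap; if it does not, duplicates in one of the two files would be the first thing to check. Only the smaller gi list is held in memory, and the output order is not sorted.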
