unBLASTable sequence?- additional question

Francois Jeanmougin pingouin at crystal.u-strasbg.fr
Thu Aug 5 03:02:02 EST 1999

In article <852567C3.00695290.00 at 7crmta_md.ms.bd.com>,
	Bill_A_Nussbaumer at ms.bd.com writes:

> I hope I'm not hi-jacking this post, but I'm somewhat unfamiliar with the topic
> and have been following along.  Could someone explain to me what exactly defines
> a "low complexity" protein sequence.  Is it just the short length or the
> repetitious nature of the amino acids contained?

	According to http://www.ncbi.nlm.nih.gov/BLAST/blast_FAQs.html#LCR,
whose English is much more understandable:

<<Q: What is low-complexity sequence?

Regions with low-complexity sequence have an unusual composition and 
this can create problems in sequence similarity searching
(Wootton & Federhen, 1996). Low-complexity sequence can often be 
recognized by visual inspection. For example, the protein sequence
PPCDPPPPPKDKKKKDDGPP has low complexity and so does the nucleotide sequence
[...] used to remove low-complexity sequence because it can cause artifactual hits
(please see Q: After running a search why do I see a string
of "X"s (or "N"s) in my query sequence that I did not put there?).

In BLAST searches performed without a filter, often certain hits will be
reported with high scores only because of the presence of a
low-complexity region. Most often, this type of match cannot be thought
of as the result of homology shared by the sequences. Rather, it is
as if the low-complexity region is "sticky" and is pulling out many
sequences that are not truly related.  >>

Many more details are in Methods Enzymol 1996;266:554-71, I think.

Filters used by BLAST are Seg/Xnu or Dust. You can probably find the
corresponding documentation somewhere on the NCBI site. Seg is for proteins
and Dust is for nucleotides. See the "Filter" section of :

Anyway, this is one more heuristic added to BLAST and, as pointed out by
Andrew, it can sometimes be inappropriate. Such sequences may be
statistically unusable, or should be analyzed by other means.
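
To make the idea of "low complexity" and of masking with "X"s a bit more
concrete, here is a rough Python sketch. It is NOT Seg, Xnu or Dust; it is
just a sliding window that measures the Shannon entropy of the residue
composition and masks every window that falls at or below a cutoff. The
function names, the window length of 12 and the cutoff of 2.2 bits are my
own illustrative assumptions, chosen for protein sequences. For nucleotides
(4 letters, so at most 2 bits) you would need a lower cutoff and "N" as the
mask character.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy, in bits per residue, of the composition of `window`."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq, window=12, threshold=2.2, mask_char="X"):
    """Mask every residue covered by a window whose entropy is <= threshold.

    Hypothetical helper for illustration only: `window` and `threshold`
    are assumed values, not the parameters of any real BLAST filter.
    """
    seq = seq.upper()
    masked = list(seq)
    # Slide a fixed-length window along the sequence, one residue at a time.
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) <= threshold:
            # Low entropy means few distinct residues: mask the whole window.
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)


if __name__ == "__main__":
    # The protein example quoted from the FAQ, and a maximally diverse one.
    print(mask_low_complexity("PPCDPPPPPKDKKKKDDGPP"))
    print(mask_low_complexity("ACDEFGHIKLMNPQRSTVWY"))
```

With these assumed settings the FAQ example is masked over its whole length
(no window of 12 contains more than four distinct residues, so the entropy
never exceeds 2 bits), while the second sequence, which uses each of the 20
amino acids once, is left untouched. Sequences shorter than the window are
returned unchanged.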


