unBLASTable sequence?

Alessandro Guffanti ag3 at sanger.ac.uk
Thu Aug 5 08:08:41 EST 1999

Hallo Kenny. Your sequence contains a classic example of a "repetitive"
or "low entropy" region, which is filtered out by dedicated programs
*before* the actual BLAST database search takes place. This avoids too
many spurious matches in your output, for reasons you may find explained
below as far as protein sequences are concerned.
There is always a way to turn this filtering option off, and I suggest
you also try searching with the unfiltered sequence.
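
To make the idea of masking a "low entropy" region concrete, here is a
minimal Python sketch. It is only an illustration and not the algorithm
the NCBI server actually runs: it slides a fixed window along the
sequence, computes the Shannon entropy of the residue composition in each
window, and replaces every residue covered by a low-entropy window with X.
The function names, the window length of 12, the threshold of 2.2 bits and
the example query are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def window_entropy(window: str) -> float:
    """Shannon entropy (bits per residue) of a window's residue composition."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mask_low_entropy(seq: str, window: int = 12, threshold: float = 2.2) -> str:
    """Replace every residue covered by a low-entropy window with 'X'.

    The window length and threshold are illustrative assumptions.
    Sequences shorter than the window are returned unchanged.
    """
    seq = seq.upper()
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if window_entropy(seq[start:start + window]) < threshold:
            for i in range(start, start + window):
                masked[i] = "X"
    return "".join(masked)


# Hypothetical query: a poly-glutamine run flanked by ordinary sequence.
query = "MKTAYIAKQRQISFVKSHFSRQ" + "Q" * 15 + "LEERLGLIEVQAPILSRVGDGT"
print(mask_low_entropy(query))
```

A homopolymer run like the one above has an entropy close to zero, so it
comes back as a row of Xs, much like the one in your results. A few
flanking residues can be masked as well, because any window that overlaps
the run deeply enough also falls below the threshold.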

Best Wishes,


Paraic Kenny wrote:

> Hi all,
> I was trying to BLAST a protein sequence today at the NCBI blast server
> and in the results, part of my query sequence came up as a row of
> XXXXXXXXs even though it is directly identical to a sequence in the
> databases.

> Interspersed local regions of very simple amino acid composition are
> surprisingly abundant in protein sequences. These regions include different
> types of residue clusters, some of which contain homopolymers, short period
> repeats or a periodic mosaic of a few residue types. More than half of the
> sequences in the database contain at least one such region, and 14% of the
> amino acids occur in clusters of highly biased composition, called
> "low-complexity regions" (for a review, see Wootton, J. & Federhen, S.
> Computers Chem. 17, 149-163).
> Low-complexity segments confound database search algorithms in two ways.
> First, alignments involving these segments generally do not reflect
> actual structure and mutational history: the segments evidently evolved
> relatively rapidly by processes such as replication slippage and repeat
> expansion. Second, the residue composition of
> low-complexity segments is very different from that of the database as a
> whole. This is evident if all low-complexity segments in the database are
> grouped into a single class: a strong excess of alanine, glycine, proline,
> serine, glutamate and glutamine results. These statistical biases contrast
> with those that characterize the bulk of most query and database
> sequences, and on which score-based alignment statistics are founded. Thus
> the high scores of alignments of low-complexity segments are due primarily
> to their compositional biases and do not necessarily reflect significant
> positional similarity.
> With homology-based database search programs, low-complexity regions or
> repetitive segments in your query sequence can produce an enormous
> output, or the output size limits can cut off your output long before
> all the significant matches are displayed. You can avoid this by
> choosing one of the protein filters: SEG (eliminates low-complexity
> regions from your query sequence) or XNU (eliminates statistically
> significant short tandem repeats). The ALU-Protein filter eliminates
> any Alu sequences present in your protein. The presence of Alu elements
> in coding regions has been a controversial issue for some time, but it
> seems established that sequences containing such elements come from
> cloning artifacts or, more rarely, expressed pseudogenes (Claverie, J.-M.
> & Makalowski, W. Nature 371, 752). So, if the filtering highlights the
> presence of an Alu repeat in the coding part of your sequence, you should
> be extremely cautious with your data.
       Alessandro Guffanti - Informatics
The Sanger Centre, Wellcome Trust Genome Campus
  Hinxton, Cambridge CB10 1SA, United Kingdom
    phone: +1223-834244 * fax: +1223-494919
