Alu sequences within dbEST entries

Laurent Duret duret at misa.univ-lyon1.fr
Thu Dec 19 03:59:35 EST 1996


In article <598pas$1s2$1 at mhafc.production.compuserve.com>, Bob Obar <102063.2640 at CompuServe.COM> writes:
>  Several of the EST sequences I've been analyzing contain Alu 
>sequences (specifically, the warning that shows up in the Definition
>field is something like "similar to contains Alu repetitive 
>element;contains element L1 repetitive element."
>
>  Can anyone explain: 
>  A) Why these sequences should be present AT ALL in ESTs, which are 
>supposed to represent cDNAs;  

[...]
> Is there any meaning in the 
>presence of these sequences in some cDNAs but not others?


Repeats such as Alu have been found inserted not only in intergenic
sequences, but also within genes: within introns, 5'UTRs, 3'UTRs
and even - albeit less frequently - in coding regions. 
A simple (simplistic) model of selection can explain the distribution 
of repeated elements in a genome: repeated elements may insert themselves 
anywhere, but insertions that disrupt an essential function are
eliminated by selection. In other words, any insertion that does not
disrupt an essential function can be tolerated and fixed in the
population. This model can explain why repeated
elements are more common in intergenic regions or introns than
in UTRs, and more common in UTRs than in coding regions. Hence if you find
an Alu repeat within a gene, then it probably just means that
it does not affect the function of this gene (although there are also a
few cases where repeated elements have been shown to be involved
in the regulation of a gene... evolution is opportunistic:
if an insertion appears to be useful then it may be positively selected).
So, in the end, it is not surprising to find Alu repeats within
some mRNAs. 
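
To make the argument concrete, here is a toy calculation in Python.
All the numbers are invented for illustration (they are not
measurements), and the region names and the function
expected_density are hypothetical. It only shows that, if insertions
land at a uniform rate and a region-specific fraction of them is
tolerated, the density of retained repeats follows the tolerance.

```python
# Toy version of the "insert anywhere, selection removes the harmful
# ones" model. Every number here is an assumption, not real data.

# Assumed uniform rate of new insertions per kb, identical everywhere.
INSERTIONS_PER_KB = 10.0

# Assumed fraction of insertions that do NOT disrupt an essential
# function, and so can be tolerated, for each kind of region.
TOLERATED_FRACTION = {
    "intergenic or intron": 0.9,
    "UTR": 0.3,
    "coding": 0.01,
}


def expected_density(rate_per_kb, tolerated_fraction):
    """Expected number of retained insertions per kb in each region."""
    return {
        region: rate_per_kb * fraction
        for region, fraction in tolerated_fraction.items()
    }


density = expected_density(INSERTIONS_PER_KB, TOLERATED_FRACTION)
for region, value in density.items():
    print(f"{region}: {value:.2f} retained insertions per kb")
```

With any set of values ordered this way, coding regions end up with
the fewest repeats, but not necessarily with none.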

Moreover, EST sequences do not all correspond to functional mRNAs.
Any polyadenylated transcript can be found among ESTs (e.g. it is
likely that some pseudogenes are still transcribed). So it is
not surprising to find "junk transcripts" (by analogy to junk DNA)
among EST sequences.




>  B) What to do about them when aligning the ESTs and e.g. trying to
>make contigs from them?  

For similarity searches (BLAST, FASTA, ...), you can use the XBLAST
program to mask Alu (or other) repeats within sequences and thus
avoid the spurious matches with the thousands of Alu-containing
sequences. XBLAST is available by anonymous FTP at ncbi.nlm.nih.gov in 
/pub/jmc/xblast. For a discussion of this problem see Claverie & States
1993  Comput. Chem. 17:191-201 or Altschul et al. 1994 Nature 
Genet. 6:119-129.

Masking could also be suitable for contig assembly if the repeat is
not too long. Otherwise I don't know if there is any
simple solution.
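
The idea of masking before a similarity search can be sketched in a
few lines of Python. This is not XBLAST and does not detect repeats:
it assumes you already have the coordinates of the repeat (from some
earlier search) and simply overwrites those positions. The function
name mask_repeats, the mask character "N" and the example sequence
are all assumptions made for the illustration.

```python
def mask_repeats(seq, repeats, mask_char="N"):
    """Return seq with every (start, end) interval overwritten.

    Intervals are 0-based and half-open, like Python slices.
    The choice of mask_char is an assumption; use whatever your
    search program treats as "ignore this position".
    """
    masked = list(seq)
    for start, end in repeats:
        # Clamp to the sequence bounds so odd coordinates cannot raise.
        start = max(0, start)
        end = min(len(seq), end)
        for i in range(start, end):
            masked[i] = mask_char
    return "".join(masked)


# Made-up sequence with a made-up repeat annotated at positions 8..19.
est = "ACGTACGTGGCCGGGCGCGGTGACGTACGT"
print(mask_repeats(est, [(8, 20)]))
```

The masked sequence keeps its original length, so coordinates of any
hits found afterwards still refer to the unmasked EST.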


Hope this helps,

Laurent Duret

__________________________________________________________________________
Laurent Duret                           
Laboratoire BGBP - UMR CNRS 5558     Phone  : +33 472 44 80 00  p.34 39
Universite Claude Bernard - Lyon 1   FAX    : +33 478 89 27 19
43 Bd du 11 Novembre 1918            e-mail : duret at biomserv.univ-lyon1.fr
F-69622 Villeurbanne Cedex           ========================
France
__________________________________________________________________________




