In article <598pas$1s2$1 at mhafc.production.compuserve.com>, Bob Obar <102063.2640 at CompuServe.COM> writes:
> Several of the EST sequences I've been analyzing contain Alu
>sequences (specifically, the warning that shows up in the Definition
>field is something like "similar to contains Alu repetitive
>element;contains element L1 repetitive element."
> Can anyone explain:
> A) Why these sequences should be present AT ALL in ESTs, which are
>supposed to represent cDNAs;
[...]
> Is there any meaning in the
>presence of these sequences in some cDNAs but not others?
Repeats such as Alu have been found inserted not only in intergenic
sequences, but also within genes: within introns, 5'UTRs, 3'UTRs
and even - albeit less frequently - in coding regions.
A simple (simplistic) model of selection can explain the distribution
of repeated elements in a genome: repeated elements may insert themselves
anywhere, but insertions that disrupt an essential function are
eliminated by selection. In other words, any insertion that does not
disrupt an essential function can be tolerated and fixed in the
population. This model can explain why repeated elements are more
common in intergenic regions and introns than in UTRs, and rarest of
all in coding regions. Hence if you find
an Alu repeat within a gene, then it probably just means that
it does not affect the function of this gene (although there are also a
few cases where repeated elements have been shown to be involved
in the regulation of a gene... evolution is opportunistic:
if an insertion appears to be useful then it may be positively selected).
So, in the end, it is not surprising to find Alu repeats within
some mRNAs.
Moreover, EST sequences do not all correspond to functional mRNAs.
Any polyadenylated transcript can be found among ESTs (e.g. it is
likely that some pseudogenes are still transcribed). So it is
not surprising to find "junk transcripts" (by analogy to junk DNA)
among EST sequences.
> B) What to do about them when aligning the ESTs and e.g. trying to
>make contigs from them?
For similarity searches (BLAST, FASTA, ...), you can use the XBLAST
program to mask Alu (or other) repeats within sequences and thus
avoid the spurious matches with the thousands of Alu-containing
sequences. XBLAST is available by anonymous FTP at ncbi.nlm.nih.gov in
/pub/jmc/xblast. For a discussion of this problem see Claverie & States
1993 Comput. Chem. 17:191-201 or Altschul et al. 1994 Nature
Genet. 6:119-129.
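To make the idea of masking concrete, here is a minimal Python sketch. It is
not XBLAST and does not reproduce its options or output; it only shows the
principle, assuming you already know where the repeat lies in your EST (for
instance from a prior search against a collection of repeat sequences). The
function name, the example sequence and the coordinates are all made up for
illustration.

```python
def mask_intervals(seq, intervals, mask_char="N"):
    """Return seq with every listed interval overwritten by mask_char.

    Intervals are 0-based and half-open, i.e. (start, end) covers
    seq[start:end]. Coordinates falling outside the sequence are clipped.
    """
    chars = list(seq)
    for start, end in intervals:
        start = max(0, start)
        end = min(len(chars), end)
        for i in range(start, end):
            chars[i] = mask_char
    return "".join(chars)


# Hypothetical EST fragment with an assumed repeat at positions 4..27.
est = "ACGTGGCCGGGCGCGGTGGCTCACGCCTGTTTACG"
masked = mask_intervals(est, [(4, 28)])
print(masked)
```

The masked sequence keeps its original length, so positions reported by a
later search still refer to the original EST. Use "X" instead of "N" as
mask_char if the downstream program expects that.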
This solution could be suitable for contig assembly if the repeat is
not too long. Otherwise I don't know of any
simple solution.
Hope this helps,
Laurent Duret
__________________________________________________________________________
Laurent Duret
Laboratoire BGBP - UMR CNRS 5558 Phone : +33 472 44 80 00 p.34 39
Universite Claude Bernard - Lyon 1 FAX : +33 478 89 27 19
43 Bd du 11 Novembre 1918 e-mail : duret at biomserv.univ-lyon1.fr
F-69622 Villeurbanne Cedex ========================
France
__________________________________________________________________________