Random nucleotide sequence

Martin CHAN chi-wai genix at cuhk.edu.hk
Mon Sep 23 09:54:21 EST 2002


Simon Andrews wrote:
> Steve Hobbs wrote:
> 
>>Hello All,
>>
>>Does anyone know of a programme that will generate random nucleotide
>>sequences of defined length and %GC content within set limits?
> 
> 
> OK, I'll bite.  I'll probably find a use for this sometime in the future
> anyhow...
> 
> Just cut between the lines and put onto a system with Perl installed -
> watch out for any long lines which may be wrapped by either my or your
> newsreader.  
> 
> The command line is
> 
> program [uppergc] [lowergc] [upperlen] [lowerlen] [number of seqs]
> 
> and it will generate and appropriate number of sequences in multi-Fasta
> format.  Modifications to limits should be pretty straightforward.
> 
> Hope this helps
> 
> Simon.
> 
> ###### CUT HERE #######################
> #!/usr/bin/perl -w
> use strict;
> 
> my ($uppergc,$lowergc,$upperlen,$lowerlen,$seqno) = @ARGV;
> 
> # Make sure we've got all the command line options
> unless ($uppergc and $lowergc and $upperlen and $lowerlen and $seqno) {
>   die "Commandline is randomseq [uppergc] [lowergc] [upperlen]
> [lowerlen] [number of seqs]\n";
> }
> 
> # Check that all variables contain what
> # they should
> 
> unless (($uppergc =~ /^\d+/) and ($uppergc >=2) and ($uppergc <=100)){
>   die "UpperGC must be an integer between 2 and 100\n";
> }
> 
> unless (($lowergc =~ /^\d+/) and ($lowergc >=1) and ($lowergc <=99)){
>   die "LowerGC must be an integer between 1 and 99\n";
> }
> 
> unless ($uppergc > $lowergc) {
>   die "UpperGC must be greater than LowerGC\n";
> }
> 
> unless (($upperlen =~ /^\d+/) and ($upperlen >=20) and ($upperlen
> <=100000)){
>   die "UpperLen must be an integer between 20 and 100000\n";
> }
> 
> unless (($lowerlen =~ /^\d+/) and ($lowerlen >=10) and ($lowerlen
> <=99990)){
>   die "LowerLen must be an integer between 10 and 99990\n";
> }
> 
> unless ($upperlen > $lowerlen) {
>   die "UpperLen must be greater than LowerLen";
> }
> 
> unless (($seqno =~ /^\d+$/) and ($seqno >=1) and ($seqno <=100)) {
>   die "Number of seqs must be an integer between 1 and 100";
> }
> 
> # Set a loop going to produce the right
> # number of sequences
> 
> for (my $current_seq=1;$current_seq <=$seqno;++$current_seq){
> 
>   # Pick an actual GC
>   my $gc = int(rand($uppergc-$lowergc))+$lowergc;
> 
>   # and an actual length
>   my $length = int(rand($upperlen-$lowerlen))+$lowerlen;
> 
>   # So doing the maths
>   my $gc_bases = int(($length/100)*$gc);
>   my $at_bases = $length-$gc_bases;
> 
>   # Make and array with the correct number
>   # of bases and GC content (1=GC 0=AT)
>   my @sequence=((1)x$gc_bases,(0)x$at_bases);
> 
>   # Then shuffle it
>   shuffle(\@sequence);
> 
>   # Now put out a fasta sequence header
>   print ">sequence_$current_seq\n";
> 
>   # Keep a count so we can put line breaks
>   # in every 50bp
>   my $count = 0;
> 
> 
>   # Go through the sequence array randomly
>   # substituting G/C or A/T as appropriate
>   foreach (@sequence) {
>     if ($count == 50) {
>       print "\n";
>       $count=0;
>     }
> 
>     if ($_) {
>       if (rand >= 0.5) {print "G";}
>       else {print "C";}
>     }
> 
>     else {
>       if (rand >= 0.5) {print "A";}
>       else {print "T";}
>     }
> 
>     ++$count
>   }
> 
>   # Always put a break at the end
>   # of a sequence.
>   print "\n";
> 
> 
> }
> 
> # Just a standard Fisher-Yates shuffle
> sub shuffle {
>   my $array = shift;
>   my $i;
>   for ($i = @$array; --$i; ) {
>     my $j = int rand ($i+1);
>     next if $i == $j;
>     @$array[$i,$j] = @$array[$j,$i];
>   }
> }
> ############## CUT HERE #############################

The sub shuffle can be replaced by the module Algorithm::Numerical::Shuffle




More information about the Methods mailing list