self-modifying code [(Re: TOP 10 (names in comp.sci)]

Terje Mathisen Terje.Mathisen at
Sun Aug 9 13:49:57 EST 1998

Jeffrey S. Dutky wrote:
> Andrea Chen wrote:
> > > The Word Count utility I wrote in asm about 8 years ago (486-
> > > optimized) could handle a user-specified set of word and line
> > > separator characters, and it still ran faster than the disk
> > > transfer speed from a RAM-disk, i.e. more than 10MB/s
> >
> > This is interesting.  I'm not a hardware expert, but I thought
> > 8 years ago the memory cycle was something like 70 nanoseconds.
> > Assuming (and I believe it an incorrect assumption especially
> > with RAM disk) that you could get a 4 byte word up every cycle
> > this gives me something in the range 6  megabytes per second.
> Check your math: At 70ns per access you can get 14,257,714
> accesses per second, at 4 bytes per access yields about 57MB/s.
> Actually, the 70ns figure refers to the full cycle time to get
> a random word from memory. In systems with caches (which the
> 486-33 is probably NOT) you would be making page mode accesses

The 486 has an on-chip 8KB (unified C/D) cache, so I did get burst

> where you would get 4 words from memory in something like 130ns
> (70ns for the first word, 20ns for each of the remain words) so
> your peak bandwidth would be something like 123MB/s, with the
> average falling somewhere in between.
> > I haven't studied 486 architecture but unless it has something
> > like I described the operations to examine a dozen or so
> > delimiters would take something like a dozen CPU cycles.  Not
> > to mention the costs of noting the results and putting them
> > someplace useful (eg. back in memory.)
> >
> The 486 had a pretty low IPC rating, so you are probably off on
> that cycles figure by a factor of 5 or so. Of course this probably

No, that's the 386: The 486 could almost be treated like a 'classic
RISC' chip, because the core instruction set, including alu/load/store
all ran in a single cycle, and the load-use penalty was just a single
cycle (instruction).

It was actually quite simple to get close to 1.0 IPC on the 486.

> helps your point. The actual numbers should look something like
> 2 instructions for the delimiter lookup (we just use a 256 entry
> lookup table filled with boolean values to determine if a char is

Right, except text has enough correlation between successive characters,
that the 8K cache actually got very good hit rates when accessing a 64K
table. I.e. I could lookup a pair of characters at the same time.

> a delimiter) 1 instr. for the conditional branch, 1 instr. for

No tests, no branch, so word/line counts were incremented (mostly by
zero of course!) on every iteration.

> the recording (only needed about 1/5 of the time in english text)
> 1 instr. for getting the next char. and 2 instr. for the loop.
> This gives us 5.2 instructions per iteration, which would have
> been something like 25 cycles on a 486. On a modern processor

Your approach would take more instructions:

  mov bl,[si]
  inc si
  mov bl,separatorSet[bx]
  test bl,bl
   jnz not_in_word

; Start of a new word or guard at the end of the buffer?
  cmp si,di		; BufferEnd
   jae done

  inc cx		; Inc word count
  mov bl,[si]
  inc si
  mov bl,separatorSet[bx]
  test bl,bl
   jz not_in_word
 ENDM			; This macro will take 5 cycles/iter when inside a word
  mov bl,[si]
  inc si
  mov bl,separatorSet[bx]
  test bl,bl
   jz not_in_word
   jmp in_word

The code above hasn't been tested, but it should run in about 6-7
cycles/char, assuming there's normally just a single space char between
each word.

Counting lines as well would make the code quite a bit more complicated,
esp. if you want to increment the line count on either CR, or LF, or the
CRLF combination, i.e. do not count CRLF as two lines, but LFCR would
count as two lines.

The easiest idea is to use just LF as line separator, and then set the
top bit in that table entry. This would allow me to process up to 255
characters in a block, with just a single cycle overhead/char.

> it would be more like 6 or 7 cycles. If you unroll the loop you
> can get the number of instructions down much closer to 3.

As I show above, this isn't really possible, but OTOH the 486 gets 1.0
IPC on my code so the speed would still be OK. (I.e. probably about
4MB/s on a 486-33).

The WC code I finally ended up with got the basic iteration down to 4
instructions per pair of input chars, i.e. just 2 ops/char.

One of the instructions suffered a load penalty though, due to the need
for a segmented address, so theoretical speed would be 2.5 cycles/char.
I believe Mike Abrash measured actual throughput (including cache misses
etc) at a little under 3 cycles/char, or about 11-12 MB/s.

> > >
> > > The only real bottleneck on problems like these is the need
> > > for a (small) lookup table to do text classification.
> > Which would be an argument for at least some programmer
> > controlled associative memory.
> Replacing a simple lookup table (which executes in O(1) for all
> data) with an expensive associative memory is pure lunacy. The
> associative memory couldn't provide the result any faster than
> the lookup table (1 memory access) and it would cost much more
> than a much larger amount of normal memory. For the classes of


The lookup table can even give you most of the effect of associative
memory, by setting up a hash table. This is something which works _very
nicely_ in perl. :-)

- <Terje.Mathisen at>
Using self-discipline, see
"almost all programming can be viewed as an exercise in caching"

More information about the Neur-sci mailing list