[++] Re: self-modifying code (Re: TOP 10 names in comp.sci)

Terje Mathisen Terje.Mathisen at hda.hydro.com
Sun Aug 9 10:33:58 EST 1998


Andrea Chen wrote:
> 
> Terje Mathisen wrote:
> > Sorry, no again: The really fast TPC test results all come from very
> > well-balanced systems, speeding up just one part of it, like the
> > compilation of SQL statements, will not result in significant
> > improvements.
> 
>         People buy multiprocessors solely to run databases.  They don't do
> graphics, they don't do much numerical calculation (and if so decimal is
> more practical since it doesn't require 2 conversions and can be done
> relatively quickly with tables.)  My belief is that computers will (in
> the long run) evolve to increasingly resemble the human brain with many
> specialized parts each assigned to various tasks.

Sorry, I did not mean balanced as in 'can do all kinds of computing
equally well', rather the opposite:

A high-end TPC benchmark machine depends upon disk speed, number of disk
spindles, I/O channel bandwidth, RAM size & speed, CPU speed, CPU cache
size(s) and speed(s), etc.

All these parameters will be more or less equally saturated when running
the benchmark: if there were just one parameter acting as the bottleneck,
e.g. too-slow disks or disk controllers, you can rest assured that the
vendor would detect this and upgrade that component.

This makes sense because it would definitely reduce the cost per
operation per second.

> > The Word Count utility I wrote in asm about 8 years ago (486-optimized)
> > could handle a user-specified set of word and line separator characters,
> > and it still ran faster than the disk transfer speed from a RAM-disk,
> > i.e. more than 10MB/s on a 486-33.
> 
>         This is interesting.  I'm not a hardware expert, but I thought 8 years
> ago the memory cycle was something like 70 nanoseconds.  Assuming (and I
> believe it an incorrect assumption especially with RAM disk) that you
> could get a 4 byte word up every cycle this gives me something in the
> range 6  megabytes per second.

A 486-33 ran its memory at 33 MHz, so at 2 cycles/dword (cache line
burst transfer) you would get a peak rate of 66 MB/s.

Using more normal RAM chips, with 3 cycles/transfer, reduces this to
44 MB/s.

Since the 486 used a write-through cache, and didn't allocate on write,
a REP MOVSD block move would do exactly two bus transactions per dword
copied.

This would give a theoretical max speed of 22 MB/s; what I measured was
10 MB/s, as I stated previously.
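As a quick sanity check of those numbers, here is a trivial sketch that
just reproduces the arithmetic, using only the figures quoted above:

    /* Back-of-the-envelope check of the 486-33 bus bandwidth figures.
     * The cycle counts are the ones given above; the rest is arithmetic. */
    #include <stdio.h>

    int main(void)
    {
        double bus_mhz = 33.0;        /* 486-33: memory bus at CPU clock */
        double bytes_per_xfer = 4.0;  /* one dword per bus transaction */

        double burst  = bus_mhz * bytes_per_xfer / 2.0; /* 2 cycles/dword */
        double normal = bus_mhz * bytes_per_xfer / 3.0; /* 3 cycles/dword */

        /* Write-through cache, no allocate-on-write: a REP MOVSD copy
         * costs one read + one write transaction per dword, halving it. */
        double copy = normal / 2.0;

        printf("burst peak:  %.0f MB/s\n", burst);   /* 66 MB/s */
        printf("normal peak: %.0f MB/s\n", normal);  /* 44 MB/s */
        printf("copy peak:   %.0f MB/s\n", copy);    /* 22 MB/s */
        return 0;
    }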

>         I haven't studied 486 architecture but unless it has something like I
> described the operations to examine a dozen or so delimiters would take
> something like a dozen CPU cycles.  Not to mention the costs of noting
> the results and putting them someplace useful (eg. back in memory.)

It seems like you haven't written much hard-core optimized asm code:
I did state that I used lookup tables, and all intermediate results were
kept in registers, of course, so the algorithm ran like this:

  unroll 128 times:
    load the next two chars.
    lookup two 2-bit values (in a 64K lookup table) classifying
     each character in the pair (inside a word, separator, CR, LF)
    use the combination of the previous classification pair and
     the current to lookup a pair of increments in an 8K table
    increment both the line and the word count simultaneously
  end unroll

(Actually, the code was more or less the reverse of the snippet above:
by working on 3-4 pairs at the same time, I got rid of all Address
Generation Interlocks, which would otherwise have been caused by the
chained table lookups.)
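In C, the same table-driven idea looks roughly like this. This is a
simplified, per-character sketch rather than the pairwise, unrolled asm
version; the class codes, table sizes, and names are illustrative only:

    #include <stdio.h>
    #include <string.h>

    enum { IN_WORD = 0, SEP = 1, CR = 2, LF = 3 };  /* 2-bit char classes */

    static unsigned char cls[256];     /* byte -> class */
    static unsigned char winc[4][4];   /* (prev, cur) -> word increment */
    static unsigned char linc[4][4];   /* (prev, cur) -> line increment */

    static void init_tables(const char *separators)
    {
        memset(cls, IN_WORD, sizeof cls);
        for (const char *p = separators; *p; p++)
            cls[(unsigned char)*p] = SEP;
        cls['\r'] = CR;
        cls['\n'] = LF;

        for (int prev = 0; prev < 4; prev++)
            for (int cur = 0; cur < 4; cur++) {
                /* a word starts on a non-word -> word transition */
                winc[prev][cur] = (prev != IN_WORD && cur == IN_WORD);
                /* count a line on every LF */
                linc[prev][cur] = (cur == LF);
            }
    }

    static void count(const unsigned char *buf, size_t len,
                      unsigned long *words, unsigned long *lines)
    {
        unsigned prev = SEP;           /* start "outside" a word */
        for (size_t i = 0; i < len; i++) {
            unsigned cur = cls[buf[i]];
            *words += winc[prev][cur]; /* branch-free inner loop */
            *lines += linc[prev][cur];
            prev = cur;
        }
    }

    int main(void)
    {
        const char *text = "two words\nand three more\n";
        unsigned long words = 0, lines = 0;

        init_tables(" \t");
        count((const unsigned char *)text, strlen(text), &words, &lines);
        printf("%lu words, %lu lines\n", words, lines);  /* 5 words, 2 lines */
        return 0;
    }

The point is the same as in the asm: the per-character work is nothing
but table lookups and additions, with no branches in the inner loop.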

Terje

PS. On a 386, the best (optimal?) code I could come up with used
self-modifying code all over the place:

I generated 512 snippets of code on the fly, each aligned on a 128-byte
boundary, filling a 64K segment exactly (remember, 16-bit DOS code).

This way the top 8 bits of the address corresponded to the current byte
from the input stream, and the bit below that signified the current
inside/outside a word status.

On 486 and later CPUs this version loses badly, due to code cache
misses: even though the amount of code in each part was less than 20
bytes, since they were all 128-byte aligned, they could only use a small
subset of the 8K code cache.
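For illustration, here is a rough, portable C analogue of that dispatch
scheme: the real version jumped straight into generated code at
base + (byte << 8) + (in_word << 7), while this sketch stands in a table
of 512 function pointers for the 128-byte-aligned snippets. Names and
structure are illustrative only:

    #include <stdio.h>

    typedef struct { unsigned long words, lines; int in_word; } State;
    typedef void (*Snippet)(State *);

    static Snippet dispatch[512];      /* index = (byte << 1) | in_word */

    static void word_start(State *s)   { s->words++; s->in_word = 1; }
    static void stay_in_word(State *s) { (void)s; }
    static void leave_word(State *s)   { s->in_word = 0; }
    static void stay_outside(State *s) { (void)s; }
    static void newline_in(State *s)   { s->lines++; s->in_word = 0; }
    static void newline_out(State *s)  { s->lines++; }

    static void init_dispatch(void)
    {
        for (int c = 0; c < 256; c++) {
            int sep = (c == ' ' || c == '\t' || c == '\r' || c == '\n');
            dispatch[(c << 1) | 0] = sep ? stay_outside : word_start;
            dispatch[(c << 1) | 1] = sep ? leave_word   : stay_in_word;
            if (c == '\n') {
                dispatch[(c << 1) | 0] = newline_out;
                dispatch[(c << 1) | 1] = newline_in;
            }
        }
    }

    int main(void)
    {
        const char *text = "two words\nand three more\n";
        State s = { 0, 0, 0 };

        init_dispatch();
        for (const unsigned char *p = (const unsigned char *)text; *p; p++)
            dispatch[(*p << 1) | s.in_word](&s); /* one indirect jump/byte */

        printf("%lu words, %lu lines\n", s.words, s.lines);
        return 0;
    }

The crucial difference is that the 386 version had no table lookup at
all: the input byte and the state bit literally formed the address of
the next piece of code to execute.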

-- 
- <Terje.Mathisen at hda.hydro.com>
Using self-discipline, see http://www.eiffel.com/discipline
"almost all programming can be viewed as an exercise in caching"


