Performance tips for DEC ALPHA machines
Peter Rensis David
pdavid at leland.Stanford.EDU
Wed Jun 23 14:56:14 EST 1993
Since there has been some discussion of how to get the most out of
the DEC ALPHA machines, I am posting a set of tips from DEC on
how to get the most from your ALPHA. I haven't tried all of them,
so I can't comment on their utility. Good luck!
Peter David
------------------------------------------------------------------------------
Here are some tips for getting the most out of Alpha
----------------------------------------------------
Compiled by Andy Thomas and edited by Don Dossa
General Tips (Apply to BOTH OpenVMS and OSF/1)
----------------------------------------------
o Get a pass 3 chip!
Alpha OSF/1 and OpenVMS have their run time libraries written for
pass 3 Alpha chips. As a result, the RTL does not perform as well on
the pass 2 chip as the pass 3 chip.
Check the version of the alpha chip from the console:
>>> Show conf
CPU ................. DECchip 21064 P2.1   <---- This is pass 2.1
If you have a pass 2 chip, call field service and ask for
FCO KN15-AA-O001 (part no. EQ-01659-01).
o Latest compilers
Always check with Digital to ensure you have the latest releases of
the compilers and operating system.
o Alignment
It pays to avoid the Fortran INTEGER*2 and C short data types on
Alpha. Because the smallest unit of access in Alpha assembly is 32 bits,
accessing a 16-bit (or 8-bit) data type causes Alpha to issue LDx_U
instructions plus some bit-masking instructions to get at the data,
rather than the single LDL instruction needed for a 32-bit data type.
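As a rough illustration of the point above, the following C sketch sums the same values stored once as short and once as int. The function names are invented for this example and the code is only a sketch: on the Alpha chips described here the short version is the one that would pick up the extra LDx_U and masking work, while on other machines there may be no difference at all.

```c
#include <assert.h>
#include <stddef.h>

/* Sum n 16-bit values. Per the tip above, each element access needs
   extra extract/mask instructions on the Alpha chips discussed here. */
long sum_short(const short *v, size_t n)
{
    long total = 0;
    size_t i;

    assert(v != NULL || n == 0);
    for (i = 0; i < n; i++)
        total += v[i];
    return total;
}

/* Same sum over int values (32 bits on Alpha): one longword load
   per element, no masking. */
long sum_int(const int *v, size_t n)
{
    long total = 0;
    size_t i;

    assert(v != NULL || n == 0);
    for (i = 0; i < n; i++)
        total += v[i];
    return total;
}
```

The trade-off is memory: the int array is twice the size, so this is worth it mainly when the data is accessed far more often than it is moved around.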
o Data structure alignment
Naturally align your data within structures (common blocks) to at least
64 bit boundaries. Also try to ensure that alignment within the common
block is proper. DEC Fortran has an option to align fields within a
COMMON block, and will give you a warning at compile time if fields
are not aligned.
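The idea of natural alignment can be sketched in C as follows. The struct and function names are hypothetical. Note that C compilers normally insert padding on their own, unlike a Fortran COMMON block where fields are laid out in the order written, so in C the main visible effect of ordering fields largest-first is a smaller structure; the helper simply spells out what "naturally aligned" means.

```c
#include <assert.h>
#include <stddef.h>

/* Fields in arbitrary order: padding is needed before 'value' and
   'id' to keep them on their natural boundaries. */
struct record_mixed {
    char   tag;
    double value;
    short  count;
    int    id;
};

/* Same fields, largest first: each field already starts on a
   multiple of its own size. */
struct record_sorted {
    double value;
    int    id;
    short  count;
    char   tag;
};

/* Returns 1 if a field of 'size' bytes at byte 'offset' is
   naturally aligned, 0 otherwise. */
int is_naturally_aligned(size_t offset, size_t size)
{
    assert(size > 0);
    return offset % size == 0;
}
```

A check such as is_naturally_aligned(offsetof(struct record_sorted, id), sizeof(int)) can be dropped into a debug build to confirm a layout after fields are added or reordered.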
o Multiple source files
When using multiple source files consider using:
$ fortran main.for + sub2.for + sub3.for (VMS)
# f77 -o main.exe main.f sub2.f sub3.f (OSF/1)
This gives the compiler a bigger chunk of code to optimize. It also
reduces the amount of linkage generated, which improves image
activation because less disk I/O and page faulting are required as the
image is fixed up.
o Integer division
Integer division is performed by converting the integers to floating
point, dividing, and converting the result back to an integer. If
possible, store the value as a floating-point data type to avoid the
conversions, or avoid the division altogether.
This isn't as bad as it sounds since the compiler will perform tricks
for certain compile-time known integers that don't require more bits
of precision than the floating point format can provide. In the
general case, the code calls a libots routine written in MACRO-64 for
performance.
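One common way to "just avoid division" is to replace a divide and remainder inside a loop with running counters. The C sketch below is an illustration only (function names are invented here): both routines turn a linear index into a row and column, but the second never divides.

```c
#include <assert.h>

/* Fill row[] and col[] for k = 0..n-1 using one divide and one
   remainder per element. */
void index_with_divide(int n, int ncols, int *row, int *col)
{
    int k;

    assert(ncols > 0);
    for (k = 0; k < n; k++) {
        row[k] = k / ncols;
        col[k] = k % ncols;
    }
}

/* Same result using running counters and no division at all. */
void index_with_counters(int n, int ncols, int *row, int *col)
{
    int k, r = 0, c = 0;

    assert(ncols > 0);
    for (k = 0; k < n; k++) {
        row[k] = r;
        col[k] = c;
        if (++c == ncols) {   /* stepped past the last column */
            c = 0;
            r++;
        }
    }
}
```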
o Integer to Float (& vice versa) casting.
It is a good idea to avoid integer <-> float conversions. The current
Alpha CPU chip has no direct connection between its floating point
unit and its integer unit, which means such conversions require a
load/store operation and are thus less efficient.
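A typical place where such conversions hide is a loop that computes a floating-point value from its integer loop counter. The following C sketch (hypothetical function names, not taken from the DEC notes) keeps a floating-point copy of the counter so that nothing is converted inside the loop.

```c
#include <assert.h>

/* Sum of (i*h)^2 for i = 0..n-1; converts i to double on every pass. */
double sum_squares_convert(int n, double h)
{
    double total = 0.0;
    int i;

    assert(n >= 0);
    for (i = 0; i < n; i++) {
        double x = (double)i * h;   /* int -> float conversion here */
        total += x * x;
    }
    return total;
}

/* Same loop with a floating-point counter kept alongside the integer
   one, so no conversion is needed inside the loop. */
double sum_squares_noconvert(int n, double h)
{
    double total = 0.0;
    double fi = 0.0;                /* floating-point shadow of i */
    int i;

    assert(n >= 0);
    for (i = 0; i < n; i++) {
        double x = fi * h;
        total += x * x;
        fi += 1.0;
    }
    return total;
}
```

The shadow counter stays exact as long as the trip count is well below 2^53, which covers any realistic loop.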
o KAP Pre-Processors
Use the KAP pre-processors for Alpha; they are available for Fortran
and C on VMS and OSF/1 from your Digital sales representative. These
are the most cost-effective tools to improve your performance.
It's worth trying the KAP pre-processor without any qualifiers; this
will optimize away any seriously bad programming and can achieve
dramatic results.
Please refer to the KAP manuals for more details. To use KAP
effectively on ALPHA with the GEM compilers:
1) Make sure you are running on a pass 3 chip in your system; the KAP
2-d unrolling will only "kick in" on a pass 3 chip.
2) FKAP by default will not do 1-d unrolling since GEM already does it.
If results for GEM unrolling are not sufficient try:
a) FKAP/ur=n/ur2=m (where n=4/8/16, and m=200/300/400)
Use FKAP/ur=12/ur2=320 to utilize all 32 registers in
its 2-D unrolling.
b) FKAP/lc=blas will attempt to make DDOT calls to DXML.
You must install DXML V2.0.
c) FKAP/inline/inll=4/ind=2 will inline small routines.
d) FKAP/ag=a is used to stagger arrays near a power of 2 to
avoid cache collisions.
All of these points are important. Try them cumulatively, and leave a
switch on in the cases where KAP helped.
Consider inlining certain subroutines in deeply nested loops which
may be costing a lot in performance.
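For readers who have not seen loop unrolling written out, here is a hand-unrolled dot product in C. It is only an illustration of unrolling a single loop by four under the assumption of a simple reduction loop; it does not reproduce what KAP or GEM actually generate, and the function names are invented for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Straightforward dot product: one multiply-add per iteration. */
double dot_simple(const double *x, const double *y, size_t n)
{
    double s = 0.0;
    size_t i;

    assert(x != NULL && y != NULL);
    for (i = 0; i < n; i++)
        s += x[i] * y[i];
    return s;
}

/* Unrolled by 4 with independent partial sums; the second loop
   handles the n % 4 leftover elements. */
double dot_unrolled4(const double *x, const double *y, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;

    assert(x != NULL && y != NULL);
    for (; i + 4 <= n; i += 4) {
        s0 += x[i]     * y[i];
        s1 += x[i + 1] * y[i + 1];
        s2 += x[i + 2] * y[i + 2];
        s3 += x[i + 3] * y[i + 3];
    }
    for (; i < n; i++)
        s0 += x[i] * y[i];
    return (s0 + s1) + (s2 + s3);
}
```

Because the unrolled version adds the terms in a different order, its floating-point result can differ slightly from the simple loop on general data.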
With C code use the C pre-processor first.
VMS
$ CC/preprocess_only=xxx.pre xxx.c
$ kapc/<your_switches> xxx.pre
$ CC/<your_switches> xxx.cmp
OSF
# cpp89 xxx.c xxx.pre (see cpp or cpp89 for DEC C man page)
o Fast Math Libraries
Consider using:
/math=fast (VMS)
-math_library fast (OSF/1, DEC Fortran)
These libraries give faster routines for many common mathematical
functions like sqrt.
o DXML libraries
Install the DXML libraries. The KAP pre-processors will pick up the
more efficient DXML routines automatically. Tremendous performance
improvements have been seen using DXML.
o Avoid shareable images
Link files against ".OLB" object libraries rather than shareable
images whenever possible.
For a VMS fortran program do:
$ link/nosysshr mycode
rather than just:
$ link mycode
If you do use shareables, consider linking some of them /SECTION and
installing them /RESIDENT. This will not improve activation
performance but it should improve run time performance. You may want
your RTLs to be installed /RESIDENT if they aren't already.
OSF/1
# f77 -o myprog -O4 -non_shared myprog.f (-O4 and -non_shared should
be used together anyway)
The linker cannot do instruction replacement on calls to shareable
images. There is more linkage to read on activation and more Icache
misses because the top of the Icache tends to be more heavily utilized
than the bottom as the number of shareable images increases.
o Scrolling text
When possible, re-direct sys$output to a file. Code that causes a lot
of text output will slow you down because scrolling text in a terminal
window uses compute cycles.
A better thing to do is:
VMS
$ define sys$output results.lis
$ run myprog
$ deassign sys$output
$ type results.lis
OSF/1
# myprog > results.lis
# more results.lis
o Multi Dimensional array access
Make sure multi-dimensional arrays are traversed in 'natural' order
(column major, leftmost subscript varying fastest) for FORTRAN. Code
written without this in mind can often thrash the CPU's cache.
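The same rule applies to C with the subscripts reversed, since C stores arrays in row-major order. A minimal sketch, with array dimensions and function names chosen only for illustration:

```c
#include <assert.h>
#include <stddef.h>

#define NROWS 4
#define NCOLS 3

/* Inner loop runs over the last subscript: consecutive iterations
   touch adjacent memory in C's row-major layout. */
double sum_contiguous(double a[NROWS][NCOLS])
{
    double s = 0.0;
    int i, j;

    assert(a != NULL);
    for (i = 0; i < NROWS; i++)
        for (j = 0; j < NCOLS; j++)
            s += a[i][j];
    return s;
}

/* Inner loop runs over the first subscript: each step jumps NCOLS
   elements ahead, which is the cache-unfriendly order in C. */
double sum_strided(double a[NROWS][NCOLS])
{
    double s = 0.0;
    int i, j;

    assert(a != NULL);
    for (j = 0; j < NCOLS; j++)
        for (i = 0; i < NROWS; i++)
            s += a[i][j];
    return s;
}
```

In Fortran the roles are swapped: the loop over the first subscript should be the innermost one.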
o Tracking down LDx_U & STx_U instructions
Track these down by inserting assembly code into the listing file:
$ FORTRAN/LIST/MACHINE/NOOPT file.for (VMS)
# f77 -V -machine_code (OSF/1)
Are there any Unaligned instructions?
$ search file.lis _U /stat (VMS)
# grep _u file.l (note: lowercase 'u') (OSF/1)
If this search returns anything then look at the listing file and
back track from the LDx_U instruction to the problematic source
code line and then figure out what data or data structure is causing
the problem. This procedure is highly recommended, even if you think
the code is OK.
o Use a RAM disk to hold any files your code reads or writes.
If you are primarily concerned with the CPU performance of the
system, then it makes sense to read data and write results to a
RAM disk created by DECram (VMS) or look at
# man mfs (OSF/1)
Here is an example demonstrating how to generate a 6MByte /tmp area:
> mfs -s 12288 /dev/rz8a /tmp
> df /tmp
Filesystem 512-blks used avail capacity Mounted on
mfs:27903 11710 2 10536 0% /tmp
o Use a profiling tool
You may want to use VMS Monitor for system resources and PCA for
profiling the code under VMS. These tools may give you more information
on where to concentrate your tuning efforts.
Under OSF/1 you can use vmstat for system resources (a limited version
of monitor for OSF/1 is available on the Alpha Freeware disk), prof,
and pixie for profiles and cord for feedback. You should be able to
drive a lot of these tools using DEC Fuse.
VMS specific tips
-------------------
o FORTRAN compiler qualifiers
Some recommended fortran command qualifiers are:
$ fortran == "fortran/align=none/warn=(gen,align)/novms"+ -
"/assume=(noaccuracy)/math_library=fast"
This command sets up a global symbol for the fortran command that
includes some of the recommended qualifiers.
/novms -- don't assume VAX Fortran behaviour.
/assume=(noaccuracy) -- the compiler is free to reorder floating-point
operations based on algebraic identities (inverses, associativity,
and distribution). This allows the compiler to move divide operations
outside of loops and thus improves performance.
The default, ACCURACY_SENSITIVE, means the compiler uses only
naive scalar rules for calculations. This setting can prevent
some optimizations.
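The kind of rewrite this setting permits can also be done by hand. The C sketch below shows a divide being moved out of a loop; it is an illustration under the assumption of a simple scaling loop, with invented function names, and is not a description of what the DEC compilers emit.

```c
#include <assert.h>
#include <stddef.h>

/* One divide per element. */
void scale_divide(double *a, size_t n, double d)
{
    size_t i;

    assert(d != 0.0);
    for (i = 0; i < n; i++)
        a[i] = a[i] / d;
}

/* One divide in total, then one multiply per element. Results can
   differ from scale_divide in the last bit because 1/d is rounded. */
void scale_reciprocal(double *a, size_t n, double d)
{
    double r;
    size_t i;

    assert(d != 0.0);
    r = 1.0 / d;
    for (i = 0; i < n; i++)
        a[i] = a[i] * r;
}
```

That possible last-bit difference is exactly why the compiler will not make this change on its own under the accuracy-sensitive default.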
The /align=none and /warn=align qualifiers are for detecting alignment
problems. Consider changing them to /align=all instead.
For DEC C, use:
/assume=noaccuracy_sensitive -
/math_library=fast -
/ansi_alias - ! Specify that the code does not violate ANSI
C rules for aliasing.
/prefix=all - ! Forces compiler to prefix all functions
for the linker to properly load system
functions.
/reentrancy=none - ! Do not perform re-entrancy checks on RTL.
/plus_list_optimize -
/unsigned_char - ! Change the default char type to unsigned char.
/extern=strict_refdef - ! Affects the way externs are allocated in
memory. Strict puts them in same PSECT.
o Multi-User performance & Quantum
If you are running OpenVMS on a multi-user system and have compute
bound processes hanging around (say in a batch queue), you may notice
sluggish interactive responsiveness. In such cases it has been observed
that lowering the system parameter QUANTUM helps improve the
situation dramatically.
o SYSGEN and AUTHORIZE quotas
Ensure you have enough working set quota for the job you are trying
to run (AUTHORIZE wsquota,wsextent).
Be sure