Performance tips for DEC ALPHA machines

Peter Rensis David pdavid at leland.Stanford.EDU
Wed Jun 23 14:56:14 EST 1993


Since there has been some discussion of how to get the most out of 
the DEC ALPHA machines, I am posting a set of tips from DEC on 
how to get the most from your ALPHA.  I haven't tried all of them, 
so I can't comment on their utility.  Good luck!

Peter David


------------------------------------------------------------------------------



	Here are some tips for getting the most out of Alpha 
	----------------------------------------------------

		Compiled by Andy Thomas	and edited by Don Dossa


General Tips (Apply to BOTH OpenVMS and OSF/1)
----------------------------------------------

o Get a pass 3 chip!
	
	Alpha OSF/1 and OpenVMS have their run time libraries written for
	pass 3 Alpha chips. As a result, the RTL does not perform as well on 
	the pass 2 chip as the pass 3 chip.

	Check the version of the alpha chip from the console:
	>>> Show conf 

	CPU   ................. DECchip 21064 P2.1
						^
						|---- This is pass 2.1

	If you have a pass 2 chip, call field service and ask for 
	FCO KN15-AA-O001 (part no. EQ-01659-01).

o Latest compilers 

	Always check with Digital to ensure you have the latest releases of 
	the compilers and operating system.


o Alignment

	It pays to avoid the Fortran INTEGER*2 and C short data types on 
	Alpha. As the smallest unit of access in Alpha assembly is 32 bits, 
	accessing a 16-bit (or 8-bit) data type causes Alpha to issue LDx_U 
	instructions plus some bit-masking instructions to get at the data, 
	rather than a single LDL instruction for a 32-bit data type.
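
	One way to act on this tip, sketched in C: if the data arrives as
	16-bit values, widen it once into a 32-bit working array and run the
	hot loops on that copy. The function names are made up for this
	illustration, and the approach costs extra memory, so treat it as
	something to measure rather than a guaranteed win.

```c
#include <stddef.h>

/* Hypothetical helper: copy 16-bit samples into a 32-bit working array
 * once, so that the hot loops only touch int-sized elements. */
void widen_samples(const short *in, int *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i];
}

/* The hot loop then operates on the 32-bit copy only. */
long sum_samples(const int *work, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += work[i];
    return total;
}
```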

o Data structure alignment

	Naturally align your data within structures (common blocks) to at least
	64-bit boundaries, and check that each field within the common block
	is itself naturally aligned. DEC Fortran has an option to align fields 
	within a COMMON block, and will give you a warning at compile time if 
	fields are not aligned.
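
	For C structures the same idea can be sketched as follows. Both
	structs below are invented for illustration. Declaring the largest
	members first lets every field sit on its natural boundary with
	little or no padding; the exact sizes depend on the compiler and ABI,
	so none are quoted here.

```c
#include <stddef.h>

/* Fields in arbitrary order: the compiler has to insert padding so that
 * each member still lands on its natural boundary. */
struct mixed_order {
    char   flag;
    double value;
    short  code;
    int    count;
    char   tag;
};

/* The same fields, largest first: members pack without internal gaps. */
struct sorted_order {
    double value;
    int    count;
    short  code;
    char   flag;
    char   tag;
};
```

	Comparing sizeof() for the two structs, or printing offsetof() for
	each member, shows where the padding went.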
 
o Multiple source files

	When using multiple source files consider using:

	$ fortran main.for + sub2.for + sub3.for    (VMS)
	# f77 -o main.exe main.f  sub2.f  sub3.f    (OSF/1)

	This gives the compiler a bigger chunk of code to optimize and
	reduces the amount of linkage generated. Less linkage improves image
	activation, since less disk I/O and page faulting are required as the
	image is fixed up.

o Integer division

	Integer division is achieved by converting the integer to a float and 
	dividing and converting back to an integer. If possible, recode the 
	integer to a floating point data type to avoid the conversion or 
	just avoid division. 

	This isn't as bad as it sounds since the compiler will perform tricks
	for certain compile-time known integers that don't require more bits 
	of precision than the floating point format can provide.  In the 
	general case, the code calls a libots routine written in MACRO-64 for 
	performance.
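
	As an example of "just avoid division": code that turns a linear index
	into a (row, column) pair often divides on every iteration. A minimal
	sketch in C, with made-up function names and assuming ncols > 0,
	shows the same result produced with running counters instead.

```c
#include <stddef.h>

/* Straightforward version: one division and one remainder per element. */
void indices_with_division(int *row, int *col, size_t n, int ncols)
{
    for (size_t i = 0; i < n; i++) {
        row[i] = (int)(i / (size_t)ncols);
        col[i] = (int)(i % (size_t)ncols);
    }
}

/* Same result, using running counters instead of any division. */
void indices_with_counters(int *row, int *col, size_t n, int ncols)
{
    int r = 0, c = 0;
    for (size_t i = 0; i < n; i++) {
        row[i] = r;
        col[i] = c;
        if (++c == ncols) {   /* end of a row: wrap the column counter */
            c = 0;
            r++;
        }
    }
}
```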

 
o Integer to Float (& vice versa) casting.
 
	It is a good idea to avoid integer <-> float conversions. The current 
	Alpha CPU chip has no direct connection between its floating point 
	unit and its integer unit, which means such conversions require a 
	store/load round trip through memory and are thus less efficient.  
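
	A common source of such conversions is using a loop index in a
	floating point expression. One possible workaround, sketched in C
	with hypothetical function names, is to carry a floating point copy
	of the index alongside the integer one. A good optimizer may already
	do this for you, so check the listing before recoding by hand.

```c
#include <stddef.h>

/* Converts the integer index to double on every iteration. */
void grid_with_conversion(double *x, size_t n, double h)
{
    for (size_t i = 0; i < n; i++)
        x[i] = (double)i * h;
}

/* Carries a floating point copy of the index, so the loop body needs
 * no integer-to-float conversion. */
void grid_with_fp_counter(double *x, size_t n, double h)
{
    double fi = 0.0;
    for (size_t i = 0; i < n; i++) {
        x[i] = fi * h;
        fi += 1.0;   /* exact for any index a double can represent */
    }
}
```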

o KAP Pre-Processors

	Use the KAP pre-processors for Alpha; they are available for Fortran 
	and C on VMS and OSF/1 from your Digital sales representative. These
	are the most cost-effective tools to improve your performance.

	It's worth trying the KAP pre-processor without any qualifiers; this 
	can optimize away seriously bad programming and achieve dramatic 
	results.

	Please refer to the KAP manuals for more details. To use KAP 
	effectively on ALPHA with the GEM compilers:

	1) Make sure you are running on a pass 3 chip in your system; the KAP
	   2-d unrolling will only "kick in" on a pass 3 chip.

	2) FKAP by default will not do 1-d unrolling since GEM already does it.
	   If results for GEM unrolling are not sufficient try:

		a) FKAP/ur=n/ur2=m (where n=4/8/16, and m=200/300/400)
		   Use FKAP/ur=12/ur2=320 to utilize all 32 registers in 
		   its 2-D unrolling.
		b) FKAP/lc=blas will attempt to make DDOT calls to DXML.
		   You must install DXML V2.0.
		c) FKAP/inline/inll=4/ind=2 will inline small routines.
		d) FKAP/ag=a is used to stagger arrays near a power of 2 to
	  	   avoid cache collisions.
	
	All of these points are important. Try them cumulatively, and leave a
	switch on in the cases where KAP helped.
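
	For readers who have not met loop unrolling, here is a hand-written
	illustration in C of what a 1-d unroll by 4 looks like on a dot
	product (the operation DDOT computes). This is only a sketch of the
	general technique, not the code KAP or GEM actually generates, and
	the function name is made up. Note that the separate partial sums
	change the order of the additions, so results can differ slightly
	from the simple loop in the last few bits.

```c
#include <stddef.h>

/* Dot product unrolled by 4 with independent partial sums, so the
 * multiply-adds of one iteration do not depend on each other. */
double dot_unrolled4(const double *x, const double *y, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        s0 += x[i]     * y[i];
        s1 += x[i + 1] * y[i + 1];
        s2 += x[i + 2] * y[i + 2];
        s3 += x[i + 3] * y[i + 3];
    }
    for (; i < n; i++)        /* leftover elements when n % 4 != 0 */
        s0 += x[i] * y[i];

    return (s0 + s1) + (s2 + s3);
}
```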

	Consider inlining certain subroutines in deeply nested loops which 
	may be costing a lot in performance.

	With C code use the C pre-processor first.
	
	VMS
	
	$ CC/preprocess_only=xxx.pre xxx.c
	$ kapc/<your_switches> xxx.pre
	$ CC/<your_switches> xxx.cmp
	
	OSF/1

	# cpp89 xxx.c xxx.pre     (see the cpp or cpp89 man page for DEC C)


o Fast Math Libraries

	Consider using:

		/math=fast			(VMS)
		-math_library fast		(OSF/1, DEC Fortran)

	These libraries give faster routines for many common mathematical
	functions like sqrt.


o DXML libraries

	Install the DXML libraries. The KAP pre-processors will pick up the 
	more efficient DXML routines automatically. Tremendous performance
	improvements have been seen using DXML.


o Avoid shareable images

	Link files against ".OLB" object libraries rather than shareable
	images whenever possible.

	For a VMS Fortran program do:

		$ link/nosysshr mycode 
	
	rather than just:

		$ link mycode

    	If you do use shareables, consider linking some of them /SECTION and
    	installing them /RESIDENT.  This will not improve activation
    	performance but it should improve run time performance.  You may want 
	your RTLs to be installed /RESIDENT if they aren't already.  


	OSF/1

	# f77 -o myprog -O4 -non_shared myprog.f  (-O4 and -non_shared should
						  be used together anyway) 

    	The linker cannot do instruction replacement on calls to shareable
    	images.  There is more linkage to read on activation and more Icache 
	misses because the top of the Icache tends to be more heavily utilized
	than the bottom as the number of shareable images increases.


o Scrolling text

	When possible, re-direct sys$output to a file. Code that causes a lot
	of text output will slow you down because scrolling text in a terminal
	window uses compute cycles.

	A better thing to do is:

              VMS    				OSF/1

	$ define sys$output results.lis     # myprog > results.lis
	$ run myprog                        # more results.lis 
	$ deassign sys$output
	$ type results.lis


o Multi Dimensional array access

	Make sure multi-dimensional arrays are traversed in 'natural' order
	(column major) for FORTRAN. Code written without this in mind can often
	thrash the CPU's cache.
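
	C programmers should note that the natural order is reversed: C is
	row major, so the LAST subscript should vary fastest. A minimal
	sketch in C, using a flat array and hypothetical function names,
	contrasts the two loop orders. Both return the same sum; only the
	memory access pattern differs.

```c
#include <stddef.h>

/* Inner loop walks consecutive memory locations (natural order in C). */
double sum_inner_contiguous(const double *a, size_t nrows, size_t ncols)
{
    double total = 0.0;
    for (size_t i = 0; i < nrows; i++)
        for (size_t j = 0; j < ncols; j++)
            total += a[i * ncols + j];
    return total;
}

/* Inner loop jumps ncols elements at a time: same result, but each
 * access may touch a different cache line when ncols is large. */
double sum_inner_strided(const double *a, size_t nrows, size_t ncols)
{
    double total = 0.0;
    for (size_t j = 0; j < ncols; j++)
        for (size_t i = 0; i < nrows; i++)
            total += a[i * ncols + j];
    return total;
}
```

	In Fortran the roles are swapped: A(I,J) and A(I+1,J) are adjacent,
	so the first subscript belongs in the innermost loop.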


o Tracking down LDx_U & STx_U instructions

	Track these down by including the generated machine code in the 
	listing file:

	$ FORTRAN/LIST/MACHINE/NOOPT file.for    (VMS)
	# f77 -V -machine_code                   (OSF/1)


	Are there any Unaligned instructions?

	$ search file.lis _U /stat               (VMS)
	# grep _u file.l   (note: lowercase 'u') (OSF/1)

	If this search returns anything then look at the listing file and 
	back track from the LDx_U instruction to the problematic source 
	code line and then figure out what data or data structure is causing
	the problem. This procedure is highly recommended, even if you think 
	the code is OK.


o Use a RAM disk to hold any files your code reads or writes.

	If you are primarily concerned with the CPU performance of the 
	system, then it makes sense to read data and write results to a 
	RAM disk created by DECram (VMS) or look at

	# man mfs   (OSF/1)

	Here is an example demonstrating how to generate a 6 MByte /tmp area
	(12288 blocks of 512 bytes):

	> mfs -s 12288 /dev/rz8a /tmp
	> df /tmp
	Filesystem     512-blks        used       avail capacity  Mounted on
	mfs:27903         11710           2       10536     0%    /tmp


o Use a profiling tool

	You may want to use VMS Monitor for system resources and PCA for 
	profiling the code under VMS. These tools may give you more information
	on where to concentrate your tuning efforts.

	Under OSF/1 you can use vmstat for system resources (a limited version
	of monitor for OSF/1 is available on the Alpha Freeware disk), prof,
	and pixie for profiles and cord for feedback. You should be able to 
	drive a lot of these tools using DEC Fuse.


VMS specific tips
-----------------


o FORTRAN compiler qualifiers

	Some recommended fortran command qualifiers are:

	$ fortran == "fortran/align=none/warn=(gen,align)/novms"+ -
        "/assume=(noaccuracy)/math_library=fast"

	This command sets up a global symbol for the fortran command that 
	includes some of the recommended qualifiers.

	/novms -- don't assume VAX Fortran behaviour.

	/assume=(noaccuracy) -- the compiler is free to reorder floating-point 
	operations based on algebraic identities (inverses, associativity, 
	and distribution).  This allows the compiler to move divide operations 
	outside of loops, which improves performance but can change results 
	in the last few bits.

	The default, ACCURACY_SENSITIVE, means the compiler uses only
	naive scalar rules for calculations.  This setting can prevent
	some optimizations.
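
	To picture the kind of rewrite this permits, here is a hand-written
	sketch in C of moving a divide out of a loop by multiplying with a
	reciprocal. The function names are hypothetical and this is not
	actual compiler output. The two versions agree exactly when 1/d is
	exactly representable (d a power of two, for example) and may
	otherwise differ in the last bit, which is why the compiler will not
	do this under the default setting.

```c
#include <stddef.h>

/* One divide per element, as written in the source. */
void scale_by_division(double *y, const double *x, size_t n, double d)
{
    for (size_t i = 0; i < n; i++)
        y[i] = x[i] / d;
}

/* The divide is done once outside the loop; the loop body multiplies. */
void scale_by_reciprocal(double *y, const double *x, size_t n, double d)
{
    const double r = 1.0 / d;
    for (size_t i = 0; i < n; i++)
        y[i] = x[i] * r;
}
```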

	The /align=none and /warn=align are for detecting alignment problems.
	Consider changing these qualifiers to /align=all instead.

	For DEC C, use:

	/assume=noaccuracy_sensitive -
	/math_library=fast -
	/ansi_alias -            ! Specify that the code does not violate Ansi
				   C rules for aliasing.
	/prefix=all -            ! Forces compiler to prefix all functions
				   for the linker to properly load system 
				   functions.
	/reentrancy=none -       ! Do not perform re-entrancy checks on RTL.
	/plus_list_optimize -
	/unsigned_char -         ! Change default type of char to unsigned char.
	/extern=strict_refdef -  ! Affects the way externs are allocated in
				   memory. Strict puts them in same PSECT.


o Multi-User performance & Quantum

	If you are running OpenVMS on a multi-user system and have compute
	bound processes hanging around (say in a batch queue), you may notice
	sluggish interactive responsiveness. In such cases it has been observed
	that lowering the system parameter QUANTUM helps improve the 
	situation dramatically.
	

o SYSGEN and AUTHORIZE quotas

	Ensure you have enough working set quota for the job you are trying 
	to run (AUTHORIZE  wsquota,wsextent). 

	Be sure


