FOO[1,LMM] - www.SailDart.org

perm filename FOO[1,LMM] blob sn#635610 filedate 1982-01-18 generic text, type C, neo UTF8
COMMENT ⊗   VALID 00002 PAGES
C REC  PAGE   DESCRIPTION
C00001 00001
C00002 00002				LISP EVALUATION AND TIMING
C00029 ENDMK
C⊗;
			LISP EVALUATION AND TIMING

			    Larry M. Masinter
		     Xerox Palto Alto Research Center

                           Richard P. Gabriel
     Stanford University and Lawrence Livermore National Laboratory

Introduction

This paper describes the issues involved in evaluating and timing Lisp
systems. We explore the various levels at which quantitative statements can
be made about the performance of a Lisp system, giving examples from
existing implementations wherever possible. Our thesis is that `raw'
benchmarks are nearly useless without understanding what it is that each
benchmark tests. That is, the famous benchmark, the Takeuchi function [1],
tests function call, some control structure, and simple arithmetic.
This benchmark has, on occasion, been used to rank Lisp systems on
the basis of performance. However, this function hardly exercises the cases
of interest to almost all Lisp programmers, though function calling
is one of the key functionalities of Lisp.

We address `performance' evaluation as a sequence of statements about
an implementation on a number of distinct, but related, levels. We show
how statements on each level can have an effect on the evaluation on
a given Lisp implementation. We also discuss those aspects of a Lisp
implementation that we feel are important for evaluation, though different
users will weight different aspects more heavily than others.

A. Hardware level

At the lowest level, of course, are things like the machine clock speed and
memory bandwidth. You can get these out of most product description sheets.
These figures are not very helpful in determining the performance of a Lisp on
a machine, especially in a microcoded implementation, since the number of
cycles is as important as the cycle time.

I only have two comments to make at this level. First, the Dolphin processor has
some anomolies in its performance profile. For example, error correction is done
on a 64 bit quantity, so that storing a 2 word quantity (as is done, for example, to
"push" a value on the stack) takes significantly longer than to store a 4 word
quantity. We are arranging things throughout the system to quad-word align
data as appropriate to take advantage of the higher memory bandwidth possible
when the quad-word alignment is guaranteed. Second, in the current
architecture, a significant proportion of machine cycles (perhaps as high as 30%)
are spent refreshing the display and sampling the keyboard. This percentage will
be substantially reduced by realigning some of the display-controller data
structure and a change in the microcode/lisp interface to the keyboard handler.

B. Lisp "instruction" level

At the next level, I think you can make measurements which ARE indicative
of Lisp performance, for a number of operations. All of these numbers are
rough, but at least they are the right order of magnitude:

1. Function call + return

This is very important in determining Lisp performance. I estimate that
call+return accounts for about 1/3 of time spent on the Dolphins currently. The
current time for call+return on a Dolphin is about 100 microseconds average,
with a minimum of around 60 usec. This contrasts with a range in Interlisp-10
on a 2060 of about 3 usec for an internal call in a block (PUSHJ, POPJ +) to
around 30 usec for the shortest non-block compiled call (builds a frame in about
60 instructions) to around 100 usec (swapped function call to swapped function).
I've been quoted a 17 usec figure for Franz on a 11/780. Interlisp-VAX currently
is at around 30-40. 

We are currently revising the format of the stack and the function call
algorithms. We expect that the time for call will go from 60-100 usec
(minimum-average) down to the order of 15-20 usec, i.e., a factor of 5.

2. Variable/constant reference

In this I include things like time to pass a variable down as an argument,
reference constants, etc. For example, (SETQ X Y ) counts as 2 variable
references. It is hard to give more than order of magnitude figures for these sorts
of things, but this time is most directly correlated to the "MIPS" of the
instruction set. On a 2060, that seems to be on the order of .5 usec for Lisps
which have a good register allocator and 1 usec for those that use the stack a lot.
Figures for an 11/780 seem to be on the order of twice that. In the current
implementation of Interlisp-D, this seems to be very roughly around 5 usec.

One motivation for revising the stack format has been to allow use of the
hardware stack register to cache the top of the stack (previously difficult because
of non-alignment of stack blocks on quadword boundaries.) We expect that the
average time for these sorts of operations will go down to order 2-3 usec, i.e.,
factor of 2.

3. Data structure manipulation

There are three parts to this in Lisp: accessing data, storing into it, and making
new. For list cells, that is CAR, RPLACA/D and CONS. Many of the Interlisp
programs we are concerned with (e.g., KL-ONE and things like the window
manager), use user-defined datatypes more heavily is list structures, and
datatype access fits into this as well. In Interlisp-10 on a 2060, times for these are
as follows: CAR compiles into a HRRZ, which I believe is on the order of .5
usec.  RPLACA is either .5 usec (for FRPLACA) or 40-50 usec (function call +
type test).  CONS comes in at around 10 usec (average 20 PDP-10 instructions). I
imagine MacLisp timings are the same for CAR and RPLACA, but faster for
CONS. I imagine Franz doesn't do any type tests and does a move indirect (1-2
usec?). I don't know about Interlisp-VAX.

Times for CAR and CDR on a Dolphin are around 5 usec. RPLACA/D and CONS
are currently written in Lisp using arithmetic and memory reference
instructions. Time for CONS is on the order of 800 usec. (Could Pratt really have
believed that a CONS took a second when his "benchmark" did many conses and
also took a second?) Stack caching will improve CAR and CDR by probably
ONLY 30%. We are putting CONS and RPLACA/D into microcode, and expect
these times to go down to order 20 usec, more than an order of magnitude better. 

4. Arithmetic

Arithmetic in Lisp is complicated by several factors: the number representation
and box/unbox algorithms, "small" number range, etc. For example, Interlisp-10
has a particularly slow number-unboxing algorithm (it often does a UUO call to
the unboxer, which is 5-10 instructions). MacLisp is noted for its handling of
arithmetic on the 10, mainly because of PDL-numbers and the better
small-number-scheme. I have no idea about Franz; I imagine that Interlisp-VAX
won't be to bad for integer arithmetic, given their use of 31-bit immediate
integers with the numeric test on the sign bit.

The Dolphins have a very different performance profile in arithmetic, partly
because we have been moving things from BCPL to Lisp to microcode and are in
the middle of the transition. Integer addition, subtraction and multiplication in
the range +/-2↑16 is in microcode and is currently in on the order of 5 usec,
with the stack-caching buying about 30%.

DIVISION (i.e., IQUOTIENT and IREMAINDER) is currently implemented in Lisp
using integer subtraction! These were definitely unimportant in any of the
benchmarks we had run, and so we had allocated them a fairly low priority in
microcode-implementation-order. Putting them in microcode will speed them up
by orders of magnitude.

Mixed-mode arithmetic (TIMES rather than ITIMES etc.) currently go through a
function call and some additional mechanism, since they do not compile directly
into Lisp instructions.  We are planning to allocate opcodes for those, since it is
inexpensive in microcode space to do, and some folks apparently use them. This
would improve those functions by as much as a factor of five.

Floating point arithmetic is currently very slow in Interlisp-D, because it is
written in Lisp using integer arithmetic in the "small number" range to do
large-number-range mantissa arithmetic. We eventually may integrate the
Dolphin floating point microcode into Interlisp-D depending on the availability
of floating point hardware or microcode space to handle floating point, and
requirements. Few enough of the programs by our internal users depend heavily
on floating point that this has been low priority; we converted Interlisp-D to use
the IEEE floating point format to be compatible with future hardware or
microcode, so that the integration would not be difficult. Floating point
microcode would make at least an order of magnitude difference in floating
point operations; floating point hardware would be even faster.

5. Free variable lookup and binding of special variables

This is tricky because the performance profiles are very different depending on
whether you have deep or shallow binding. In shallow implementations, function
call and internal binding of SPECVAR variables is inflated because of the
additional work of swapping bindings. In deep binding implementations, there is
little or no additional overhead for specvar bindings, but free variable references
(which includes all variable references from the interpreter) are slower. Free
variable reference from compiled code can be improved substantially in two
ways: users explicitly adding GLOBALVARS declarations, and by caching free
variable pointers on the stack.

There are several reasons that I believe that deep binding is the right choice
(i.e., gives the best performance) for Interlisp, although there may be some
programs and benchmarks for which it is a loss. Interlisp by default compiles
user programs with all variables SPECIAL, but that free variable access is
relatively infrequent except from interpreted code. Most folks, if they care about
performance, compile their programs. A reasonable microcoded implementation of
free variable lookup at first reference and caching of a pointer to the actual
binding location, along with judicious use of GLOBALVARS can substantially
reduce the overhead of the free variable references that are found.

I don't have analytic figures for free variable lookup in Interlisp-D, except to say
that with the revised format of the stack and variable name-tables pointed to
from there that we expect at least a factor of two to four improvment in
the time for free variable lookup.

C. Major Lisp facilities

1. Garbage collection

There are three major variations on reclaiming unused storage in Lisp systems,
and many minor variations. 

The "mark and sweep" variety chases all active pointers, marking them as active,
and then put back on the free list any pointers which weren't marked.
Variations segment the address space in various ways and make optimizations
based on arguments about what segments can point to what others. For example,
Interlisp-10 (and I think MacLisp) keeps pages marked as "pure", "dirty", "user"
and can avoid chasing pointers from "system" pages which have not been
written on, or sweeping system pages.

The "(stop and) copy" variety copies active pointers from old-space to new-space,
and then swaps old and new. (Interlisp-VAX and ELISP use stop-and-copy). In
addition to segmentation schemes, there are copying garbage collectors which can
work continuously, given some hardware/microcode assist when doing memory
references. (Notably, the MIT Lisp machines and descendents use this variation.)

The "reference count" variety counts the number of references to each datum,
and when a reference count goes to 0, the datum can be immediately collected.
Interlisp-D uses a modified reference-count algorithm, where reference counts
are kept in a separate hash tables rather than with the datum and the case where
reference-count-equals-1 need is not stored; in addition, references from the
stack (e.g., as the result of a variable reference) , and to immediate data (atoms
and integers) are not counted. This requires periodically sweeping the stack to
find pointers from the stack, and then sweeping the reference count table to find
data whose reference count is 0. 

Performance profiles and other issues:

Both mark-and-sweep and stop-and-copy suffer in that the cost of collecting
garbage is proportional to the size of the address space rather than the amount of
garbage. The various segmentation schemes help out a bit, but the heuristics for
separating the address space work only in limited contexts. At least for large
Interlisp-10 programs, garbage collection time is quite significant not because
much garbage is being collected but because the active address space is large.

I do not understand exactly the heuristic used in MIT Lisp Machine Lisp, but it
is my belief that most users and benchmarks run with the garbage collector
turned off. 

Reference counting schemes suffer in that circular structures are not collected,
causing one periodially to have to perform a stop-and-copy operation; this hasn't
been necessary in any systems we've seen. I think this is because it is possible to
explicitly control the allocation of circular storage, and because "storage leaks"
are just as likely from sources which are shared among all gc algorithms, e.g.,
CONSing data onto the property list of a permanent atom.

Interlisp-D does a periodic background scan-and-free space while not otherwise
occupied, so that our experience with interactive use is that it is almost never
necessary to wait for a garbage collection. With the introduction of the
multiprocessing system next year, this will become even more true.

The current times for garbage collecting on a Dolphin is about .25 to .5 seconds
fixed overhead (to sweep the stack and the tables), plus about 800 microseconds
percell collected. We expect both "fixed" overhead to decrease by a factor of 4
and the time/cell to decrease by an order of magnitude, with a change in the
format of the reference count table and movement of the function which puts
cells back onto the free list into microcode. 

3. Interpreter

Performance in interpreters are determined by the variable binding mechanism
and the implementation of the interpreter. Interlisp-D currently implements the
interpreter in Lisp with little microcode assist. In addition, the current
frame-allocation algorithm invokes a serious performance penalty for calling
interpreted routines with too few or too many arguments. 

In addition to the general system performance improvement engendered by
call/return + variable reference, we have redesigned the interface to microcoded
free variable lookup for an additional factor of two performance
improvement for running interpreted code. The performance penalty for
too-many or too-few arguments will go away. 

4. Time for file access in lisp (READ/PRINT/LOAD) 

This has increasingly become apparent as an important piece of the "feel" of
Interlisp, and its importance is rising in our list of things to tune.  The following
items show up in our statistics on loading:

Getting characters from buffer (BIN): we plan to put this in microcode
eventually. This will result in an additional 10% improvement.

CONS: loading in fact does do much CONSing, so that the CONS improvements
will help LOAD by about 20%.

Disk scheduling and/or Leaf server response:  we plan to change the interaction
with Leaf  to improve sequential access to files by pre-requesting pages N+1 ...
when page N is read in sequential mode. The revised disk layout (i.e., without
partitions) not only makes better use of the disk, but will allow us to revise the
disk-read algorithm to better schedule disk arm movement. This changes, while
not scheduled to be complete before the end of the year, will probably result in
a performance improvement of at least a factor of 2 in the 50% of LOAD which
currently goes into waiting for Leaf and/or the Disk. This means a total
improvement of LOAD of a FACTOR OF 2-4. 

5. Compiler

The Interlisp-D compiler (also used by Interlisp-VAX) has a fairly involved
optimization phase, which unfortunately takes longer than the Interlisp-10
compiler. Those who work with benchmarks or hand-tuned code will only notice
that the compiler runs slower, but for the programs which I analyzed, which
used macros, records, etc considerably, the compiler's optimization pass was quite
important. 

That is, the compiler doesn't do as much for little benchmarks as it does for big
application programs, where coding style is considered more important than
strange constructs.

Improvements in the compiler algorithms are scheduled for later next year. Until
then, one must be content with the overall factor of 2-4 resulting from the new
call/variable reference formats.