frivolous exercise: lmbench numbers :>

mol-general@lists.maconlinux.org mol-general@lists.maconlinux.org
Sun, 15 Sep 2002 17:16:21 +0200


On Sat, Sep 14, 2002 at 04:44:31PM -0500, Rob Latham wrote:
> So, being a curious type, i wanted to see what effect MOL had on
> lmbench.  I collected some lmbench numbers from mac os 10.1.5 running
> natively and under linux (2.4.19-benh0).  The hardware is a g4-400MHz
> tibook with 640 MB RAM (512 MB allocated to os x). 
> 
> http://terizla.org/~robl/pbook/lmbench_mol.1
> 
> ( 'localhost' is os x running directly
>   'osx-mol'   is os x running via MOL )

lmbench is quite a useful tool. However, one should keep in mind
that what lmbench measures is essentially the overhead of various
kinds of context-switches. That is, one should not confuse lmbench
with real-world benchmarking (like kernel compilation time, startup
time of Netscape, etc.). The number of context switches for a typical
application is quite small, so it does not matter much (from the
perspective of the user) whether each context switch takes 1us
or 10us.

There are of course situations where low-overhead context switches
are essential. I would say they are quite important for web servers,
for instance.
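
To put rough numbers on the 1us versus 10us point, here is a small
back-of-the-envelope sketch. The switch rates are made-up illustrative
values (not measurements from lmbench or from any real machine), and the
function name is just something picked for this example.

```python
def switch_overhead_fraction(switches_per_second, cost_us):
    """Fraction of each wall-clock second spent on context switches."""
    return switches_per_second * cost_us * 1e-6


# Hypothetical switch rates, chosen only to illustrate the argument.
WORKLOADS = (
    ("desktop application", 1_000),
    ("busy web server", 50_000),
)

for label, rate in WORKLOADS:
    for cost_us in (1, 10):
        fraction = switch_overhead_fraction(rate, cost_us)
        print(f"{label}: {rate} switches/s at {cost_us} us each "
              f"-> {fraction:.1%} of CPU time")
```

With a low switch rate the difference between 1us and 10us stays in the
noise; with a high rate it becomes a visible share of the CPU.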

>   . os X *already* does really badly on lmbench: it's system call
>     overhead is just flat out high.

If one compares Linux with OS X, then there are significant differences,
yes (often closer to a factor of 10 than to a factor of 1).

A rather fun test is running

	time dd if=/dev/rdisk0 of=/dev/null bs=512 count=102400

in both Linux and Darwin (well, with /dev/rdisk0 replaced by /dev/hda
in Linux).
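
The point of the small block size is that the run is dominated by the
cost of 102400 read() calls rather than by disk throughput. A rough
sketch of the same idea, for those who would rather not read a raw disk
as root: it defaults to /dev/zero, which is my substitution and not part
of the dd test above. The function name is made up for this example, and
the interpreter adds its own per-call cost, so only compare numbers
produced by the same script on different systems.

```python
import os
import time


def time_small_reads(path="/dev/zero", block_size=512, count=102400):
    """Issue up to `count` read() calls of `block_size` bytes each.

    Returns (elapsed_seconds, total_bytes_read).
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        total = 0
        start = time.perf_counter()
        for _ in range(count):
            chunk = os.read(fd, block_size)
            if not chunk:  # end of file or device
                break
            total += len(chunk)
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return elapsed, total


if __name__ == "__main__":
    seconds, nbytes = time_small_reads()
    calls = nbytes // 512
    print(f"{calls} reads in {seconds:.3f} s "
          f"({seconds / max(calls, 1) * 1e6:.2f} us per read)")
```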

With respect to MOL, one should bear in mind that what lmbench measures
is precisely those situations where MOL has extra overhead compared 
to non-MOL OS X. MOL runs user-level PowerPC instructions at full speed
but the supervisor-level instructions (used exclusively by the Darwin kernel)
are sometimes emulated.

Having instrumented MOL quite extensively during the past weeks, I know
quite well the causes of the overhead. Some of my findings:

- The overhead from the MMU code is quite small. Further optimizations
in this area are not justified from a performance point of view.

- The decoding and emulation of supervisor-level instructions are
by far the biggest source of overhead in MOL/OSX.

- The usage of MMU-splitmod used to be the biggest source
of overhead but it is completely avoided now through on-the-fly
modifications of the Darwin kernel (the 'Acceleration for
MacOS X 10.x' lines that show up in the log).

I have also taken some measures to reduce the overhead in MOL due
to the second point:

i) MOL maps the emulated supervisor registers into the Darwin kernel
space in order to replace privileged instructions which just read
a privileged register (like mfmsr) with a simple load instruction.

ii) The 'mtmsr' implementation has been optimized, optimized and
reoptimized several times.
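
To illustrate point i), here is a minimal sketch of what rewriting an
'mfmsr' into a load could look like at the instruction-word level. This
is not MOL's actual code. The mapping address EMULATED_MSR_ADDR and the
function name are hypothetical; the sketch assumes the emulated MSR sits
in the top 32K of the address space, so that 'lwz rD, d(0)' can reach it
with a sign-extended 16-bit displacement and no base register. Only the
instruction encodings themselves are standard PowerPC.

```python
MFMSR_MASK = 0xFC1FFFFF  # every bit except the rD field
MFMSR_BITS = 0x7C0000A6  # primary opcode 31, extended opcode 83
LWZ_OPCODE = 0x80000000  # primary opcode 32 (lwz rD, d(rA))

# Hypothetical address where the emulated MSR is mapped (an assumption).
EMULATED_MSR_ADDR = 0xFFFFF000


def patch_mfmsr(word, msr_addr=EMULATED_MSR_ADDR):
    """Return `word` with 'mfmsr rD' rewritten to 'lwz rD, d(0)'.

    Any other instruction word is returned unchanged.
    """
    if (word & MFMSR_MASK) != MFMSR_BITS:
        return word
    rd = (word >> 21) & 0x1F
    disp = msr_addr & 0xFFFF
    # With rA = 0 the effective address is the sign-extended displacement,
    # so the target must be representable that way.
    extended = disp | 0xFFFF0000 if disp & 0x8000 else disp
    if extended != msr_addr:
        raise ValueError("address not reachable without a base register")
    return LWZ_OPCODE | (rd << 21) | disp


if __name__ == "__main__":
    mfmsr_r3 = 0x7C6000A6
    print(f"{mfmsr_r3:08X} -> {patch_mfmsr(mfmsr_r3):08X}")
```

The attraction is that the replacement is a single unprivileged
instruction of the same length, so nothing else in the kernel text has
to move.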

In conclusion the primary remaining source of overhead is due to the
emulation of the 'mtmsr' instruction. I have thought about replacing
it with a store and a 'msr_altered' call (thus eliminating the
decoding step). The problem is that it is difficult to find
a free GPR (and saving a GPR will typically lead to a
race).

Most 'mtmsr' instructions simply flip the EE bit in order to
prevent an exception in a critical section. One solution would be
to use a custom Darwin kernel where those usages could be
replaced with a fast, MOL-specific solution.
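
As a toy model of why the EE-only case is attractive: if a write to the
MSR changes nothing but EE, a plain store to the emulated register is
enough and the expensive handler can be skipped. The class and method
bodies below are hypothetical and only model the bookkeeping; they are
not taken from MOL. MSR_EE = 0x8000 is the standard bit position in the
32-bit PowerPC MSR.

```python
MSR_EE = 0x8000  # external-interrupt enable bit of the PowerPC MSR


class EmulatedCPU:
    """Toy model of emulated supervisor state (hypothetical, not MOL code)."""

    def __init__(self, msr=0):
        self.msr = msr
        self.slow_path_calls = 0  # how often the full handler ran

    def msr_altered(self, old_msr):
        # Stand-in for the expensive work (MMU context changes,
        # privilege-level changes and so on).
        self.slow_path_calls += 1

    def mtmsr(self, new_msr):
        old_msr = self.msr
        self.msr = new_msr
        if (old_msr ^ new_msr) & ~MSR_EE:
            # Something other than EE changed: take the slow path.
            self.msr_altered(old_msr)
        # Otherwise only EE (or nothing) changed and the store suffices.
        # A real implementation would also have to deliver any interrupt
        # that became pending while EE was clear.


def critical_section(cpu):
    """Disable and re-enable external interrupts, as kernel code does."""
    saved = cpu.msr
    cpu.mtmsr(saved & ~MSR_EE)
    # ... the protected work would go here ...
    cpu.mtmsr(saved)
```

In this model a critical section costs two stores and no handler calls,
which is the kind of saving a MOL-specific kernel could aim for.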

In short, it is perfectly possible to remove most of the overhead
if one uses a custom version of the Darwin kernel. There is the
problem of maintainability though (and I'm not sure Apple's shipped
kernels are exactly reproducible through a compilation).

Cheers,

/Samuel