Comments
Richard Davies wrote: The UK has a good crop of technology pioneers in cloud computing - for example ElasticHosts, FlexiScale, Flexiant, OnApp - and also some strong government initiatives such as G-Cloud. We will have to see whether this kind of technical leadership converts into swift mass-market adoption or not.
Cloud Expo on Google News


2008 West
DIAMOND SPONSOR:
Data Direct
SOA, WOA and Cloud Computing: The New Frontier for Data Services
PLATINUM SPONSORS:
Red Hat
The Opening of Virtualization
GOLD SPONSORS:
Appsense
User Environment Management – The Third Layer of the Desktop
Cordys
Cloud Computing for Business Agility
EMC
CMIS: A Multi-Vendor Proposal for a Service-Based Content Management Interoperability Standard
Freedom OSS
Practical SOA” Max Yankelevich
Intel
Architecting an Enterprise Service Router (ESR) – A Cost-Effective Way to Scale SOA Across the Enterprise
Sensedia
Return on Assests: Bringing Visibility to your SOA Strategy
Symantec
Managing Hybrid Endpoint Environments
VMWare
Game-Changing Technology for Enterprise Clouds and Applications
Click For 2008 West
Event Webcasts

2008 West
PLATINUM SPONSORS:
Appcelerator
Get ‘Rich’ Quick: Rapid Prototyping for RIA with ZERO Server Code
Keynote Systems
Designing for and Managing Performance in the New Frontier of Rich Internet Applications
GOLD SPONSORS:
ICEsoft
How Can AJAX Improve Homeland Security?
Isomorphic
Beyond Widgets: What a RIA Platform Should Offer
Oracle
REAs: Rich Enterprise Applications
Click For 2008 Event Webcasts
SYS-CON.TV
Top Links You Must Click On


Is Your JIT Telling You Lies?
Even the most sophisticated developer should avoid using micro-benchmarks as predictors of application performance

Java Development on Ulitzer

Writing meaningful Java benchmarks is a tricky business. It's well known that the Java Virtual Machine's just in time (JIT) compilation process means that running an application for a few seconds won't let you predict the performance of the application over hours or days of uptime. In spite of this, developers often rely on micro-benchmarks to set performance SLAs for their applications.

Micro-benchmarks test some small, discrete component of an application. They're usually written in an effort to benchmark a component considered critical to the app's overall performance. Here's a typical example, summing all the numbers from one to a specified limit:

long accumulatedTotal(int limit) {
long result=0L;
for (int i=1; i<=limit; i++) {
result += i;
}
return result;
}

How long does this method take to execute for different values of limit? On my 32-bit OpenSolaris machine using Java 6u16 the first call to this method shows a non-linear runtime for small values of limit:

The method is initially executed by an interpreter and the JIT compiler doesn't kick in until the JVM detects that the method is doing a lot of looping. It then takes some time to compile the method, during which time the loop continues to run under the interpreter. Once this process is complete, the JVM can jump into the compiled version of the method and the average runtime for the remaining iterations is greatly reduced. The upshot is that for small values of limit, the runtime is non-linear, but as limit gets much larger it becomes much more predictable.

Benchmarks like SPECjvm2008 know about this quirk in the behavior of the JVM and try to account for it. SPECjvm2008 repeatedly runs its benchmark code for a fixed "warm-up" period - typically 120 seconds. The warm-up time must be long enough to allow compilation of all the performance-critical methods in the benchmark code. You can observe the compilation process in the JVM by adding the Java command-line option -XX:+PrintCompilation. The benchmark then continues to execute for a "run" period - for SPECjvm2008 the default duration is 240 seconds - and the benchmark result is the number of iterations of the benchmark that were completed per minute of the run period.

But does this approach really predict application performance? As so often with the JVM, it turns out to be more complicated than you expect.

A modern JVM such as Sun's Java 6 contains a JIT compiler that is capable of generating native code every bit as efficient as that produced by an ahead-of-time (AOT) C++ compiler. Among the many optimizations performed by the JIT is method inlining. Calling small methods can incur a large overhead - for example, Java classes routinely contain getter and setter methods that simply read or write the value of a variable. The cost of calling a trivial method can be high relative to the work performed by the method, so the JIT will attempt to copy the body of a small method into its caller.

Consider a simple getter method like this:

Class Limit {
private int limit;
public Limit(int limit) {
this.limit = limit;
}
public int getLimit() {
return limit;
}
}

If this is called from a loop like this, where "o" is an instance of class Limit:

for (int i=0; i<o.getLimit(); i++) {
// Do something
}
then it would be nice to inline the getLimit() method so that the resulting code looks like this:
for (int i=0; i<o.limit; i++) {
// Do something
}

In an AOT compiler, this inlining might not be possible. For example, the object "o" might not be an instance of Limit but of some subclass of Limit that overrides the getLimit() method. Because it compiles with knowledge of runtime behavior and can throw away compiled code if necessary, the JIT can decide to inline Limit.getLimit(). If "o" is ever seen to be an instance of a subclass of Limit, then the compiled code can be discarded and the loop recompiled appropriately.

Let's take a look at the Arrays.sort(Object[] a) method. The javadocs say that all elements in the array must implement the Comparable interface. Arrays.sort() make many calls to the compareTo() method of the elements it is sorting. The compareTo() method in class Integer is very simple:

public int compareTo(Integer anotherInteger) {
int thisVal = this.value;
int anotherVal = anotherInteger.value;
return (thisVal<anotherVal ? -1 : (thisVal==anotherVal ? 0 : 1));
}

If you only ever call Arrays.sort() on arrays consisting entirely of Integers, the JIT compiler can detect this and inline the Integer.compareTo() method into the sort().

What happens if you later call Arrays.sort() to sort an array of instances of class Short? The compiled code performs a quick check on each array element to ensure that it is an Integer. When this check fails, the compiled code is discarded and Arrays.sort() is recompiled to account for the new set of observed array element types.

The updated compiled code will now inline both Integer.compareTo() and Short.compareTo(). The decision about which inlined method to execute will be taken by explicitly checking whether the array element is of class Integer or Short. Even with two different implementors of the Comparable interface, the JIT compiler is still able to perform method inlining.

Let's go even further and call Arrays.sort() on an array of instances of class Byte. If the JIT compiler now inlines Byte.compareTo() as well, we'll have three different inlined versions of this method in the Arrays.sort() code. This is all starting to get out of hand!

Now the JIT compiler changes strategy. It throws away all its previous inlining of the compareTo() method and does a traditional vtable dispatch just like a static compiler would do. For very small methods such as Integer.compareTo(), this can have a significant performance impact as the time spent calling and returning from such a tiny method may be far greater than the time spent executing its code.

The limitation that only two receivers of a virtual method dispatch can be inlined is known as the bimorphic inlining policy and is a poorly understood but significant limitation on JIT performance.

Listing 1 contains a benchmark program that shows this effect in action.

The test code calls Arrays.sort() five times. First it calls it on an array of Byte objects. This is simply a "warm up" and the results are ignored. It then calls Arrays.sort() on an array of Bytes, then on an array of Shorts, then an array of Integers, and finally Arrays.sort() is called for a second time on an array of Bytes.

Here's the result of running this benchmark using 32-bit Java 6u16 on my OpenSolaris machine. For this test the arrays to be sorted have 100,000 elements:

Java SortBench 100000
(WARMUP) BYTE SORT PERFORMANCE = 1778 operations per minute
(FIRST PASS) BYTE SORT PERFORMANCE = 2034 operations per minute
SHORT SORT PERFORMANCE = 1623 operations per minute
INTEGER SORT PERFORMANCE = 1261 operations per minute
(SECOND PASS) BYTE SORT PERFORMANCE = 1307 operations per minute

As you can see, the warm-up performance is - as expected - a bit lower than for the first pass of the byte sort test. However the performance of subsequent sort tests falls away quite sharply and the performance of the second byte sort test is only 64% of the performance seen first time round.

For complex server-side applications where it's likely that Arrays.sort() will be called on by many different classes, the real performance when sorting 100,000 random byte instances is likely to be about 1,307 operations per minute, not 2,034 operations per minute as the "standard" benchmarking approach suggests.

This limitation is also important in attempting to benchmark Scala code. Scalac - the Scala compiler - rewrites Scala code blocks as standard Java classes that implement a special Scala interface. This interface requires a method called apply() that executes the code block. As you create code blocks in Scala, you are actually creating tiny Java classes that implement an interface. This is the same as the scenario when calling Arrays.sort() with different implementors of Comparable.

Take a look at the Scala benchmark in Listing 2. The benchmark repeatedly sums all the numbers from 1 to 1 million for a specified period of time and then prints out the number of iterations completed and the accumulated result.

Notice that the same code block is used for all executions of the benchmark. This looks more straightforward than the Java case seen previously. But here's the output from this program:

scala Test
WARM-UP RUN : 7628 operations per minute
FIRST RUN : 7713 operations per minute
SECOND RUN : 6185 operations per minute
THIRD RUN : 6134 operations per minute

Even though the code block is identical in all four cases, each code block is handled inside its own unique class. Once again the effect of the bimorphic inlining policy in the JIT compiler is to generate more performant code for the runBench method when only one or two implementors of the interface have been seen.

The bimorphic inlining policy is a great example of why micro-benchmarking isn't just hard in Java - it's all but impossible. Even if you know about this issue and work around it, the probability is that there will be something else in the JVM to trip you up and give you phony results. Generating masses of micro-benchmark numbers can be worse than useless - they give the erroneous impression of scientific accuracy that can lead to performance SLAs that can't be met in production.

Bottom line - don't use micro-benchmarks as predictors of application performance. Run your app for real and profile, then profile again, and then profile some more!

About Paul Murray
Paul Murray is a freelance specialist in Java performance analysis. He was previously a Technology Fellow at a leading Wall Street Investment Bank where he provided Java performance and scalability consultancy to several thousand Java developers. Paul was also a Member of Technical Staff at Silicon Graphics, working on dynamic compiler implementation and porting Sun's JRE to the SGI IRIX platform.

In order to post a comment you need to be registered and logged in.

Register | Sign-in

Reader Feedback: Page 1 of 1

Enterprise Open Source Magazine Latest Stories . . .
Apache Deltacloud, the Red Hat-contributed ReSTful API that abstracts differences between clouds so services on any cloud can be managed – provided of course there’s a driver – has graduated from the Apache Foundation’s incubator and is now a full-fledged Top-Level Project (TLP). The...
With Cloud Expo 2012 New York (10th Cloud Expo) just four months away, what better time to start introducing you in greater detail to the distinguished individuals in our incredible Speaker Faculty for the technical and strategy sessions at the conference... We have technical and st...
AMD said late Tuesday that its chief sales officer Emilio Ghilardi had left the company and that CEO and president Rory Read is going to do his job while a replacement is sought. AMD didn’t say why Ghilardi left but it’s assumed Read wants his own people. Read is relatively new to th...
During the lifespan of M3 (Monitis Monitor Manager) there has always been something lacking – timers. M3 execution procedure was outlined in this previous article. The execution mentioned in the latter was a one-time-execution, whereas server monitoring requires periodic invocati...
Red Hat is putting its bought-in Gluster scale-out NAS storage technology, acquired in October, on the Amazon cloud. It’s styled Red Hat Virtual Storage Appliance for Amazon Web Services and other clouds are supposed to follow in short order.
A new episode of the screencast series is now available at the OpenNebula YouTube Channel. This screencast demonstrates the new easily-customizable self-service portal for cloud consumers. Its aim is to offer a simplified access to shared infrastructure for non-IT end users. The scree...
Subscribe to the World's Most Powerful Newsletters
Subscribe to Our Rss Feeds & Get Your SYS-CON News Live!
Click to Add our RSS Feeds to the Service of Your Choice:
Google Reader or Homepage Add to My Yahoo! Subscribe with Bloglines Subscribe in NewsGator Online
myFeedster Add to My AOL Subscribe in Rojo Add 'Hugg' to Newsburst from CNET News.com Kinja Digest View Additional SYS-CON Feeds
Publish Your Article! Please send it to editorial(at)sys-con.com!

Advertise on this site! Contact advertising(at)sys-con.com! 201 802-3021


SYS-CON Featured Whitepapers
ADS BY GOOGLE