We are currently formalizing our escape analysis algorithm for trace-trees with the help of Christian Stork. As part of this work Chris observed that we are overly pessimistic when deciding which loads we can eliminate through load propagation. So far we first ran an escape analysis to find out which allocations are contained (are not leaked), and then performed load propagation only for loads from these local allocations.This is sub-optimal for several reasons. First of all, the escape analysis pass has to detect two scenarios that permit values to escape our loop scope. On the one hand we have to track stores to non-captured memory (memory escape), and we have to flag values escaping into future loop iterations (loop escape). Both scenarios cause the allocation to be flagged as escaped. However, only the memory escape situation actually inhibits load propagation. While loop escaping allocations cannot be optimized through allocation hoisting, there is no reason not to perform load propagation on them.Secondly, and independently of the above observation, there is also no need to stop load propagation altogether just because a reference escapes to memory somewhere in the loop. Due to the tree shape of our intermediate representation we can actually keep performing load propagation until we see a reference escaping to memory, at which point we have to inhibit any further load propagations for this reference or any reference stored in the associated objects (and recursively all references stored there). This essentially kills load propagation for the rest of this trace, but for any trace “higher up” in the tree is not affected.Its important to kill not only the current reference, but any other allocation that can escape through it as well. For this every time we see a load we explicitly check whether the base reference is another load and follow this chain of loads all the way to the source. If any of this hops is an escaped object, we cannot perform load propagation. Previously we were able to cache this lookup. The slightly worse performance is well worth in this case though.In addition to being more precise, the new load propagation pass is also no longer dependent on escape analysis and we can move it to a much earlier point in the compilation pipeline. During escape analysis almost all loads are eliminated now, which also increases the quality of the escape analysis results.
November 8, 2007
September 20, 2007
Improvements to the Singleton Analysis Pass
Mason found a problem with the Singleton Analysis Pass. Objects that were allocated and then written to an array were not correctly flagged as non-singleton. Besides this rather obvious bug I also fixed the way context slots are handled. When trying to write up a formal proof of the algorithm for the PLDI paper I realized that a non-eliminated load from a context kills any and all allocations that flow into that contexts. Previously I was checking at the end of each trace whether allocations flow into a context and killed the allocation there. However, this is not sufficient. Instead, we have to collect for each context all allocations flowing into it along the loop edge and the kill all allocations if we observe a load from the context slot. To do this in linear time we collect a set of allocations flowing into to each context and a list of loads that touch contexts and then resolve and kill allocations after we have seen the entire tree.
The reason that a load kills the singleton properties is that it might see a previous state instead of a freshly initialized value (i.e. 0/0.0/null). The LoadElimination phase eliminates almost all same-iteration reads from a newly allocated objects, so as long an object is only used within the current iteration this will not inhibit the allocation from being detected as singleton. Only reading from it in the next iteration prevents this (because the LoadElimination doesn’t try to eliminate loads from context slots because it doesn’t track which allocations flow into context slots).
September 8, 2007
Singleton Allocation Analysis
I finished today the Singleton Allocation Analysis phase which was the last building block missing to enable Invariant Allocation Hoising (move NEWs out of loops, which is our equivalent of escape-analysis based stack allocation). The SAA pass goes over the tree in forward order and builds an interference graph between context slots and allocation instructions. For this we track all live allocation instructions in a set and add that set to the interference set of a context slot every time it is read. The resulting interference set denots for each slot the allocation instructions that context slot interfers with.
To decide which allocation is singleton (only one instance of it lives in the loop) we have to check two properties. First, if an allocation flows back along the backedge, the context slot it flows into must not flow into another context slot along the backedge in the subsequent iteration. This ensures that we don’t “carry over” instances into future loop iterations. In a second step we then check for each context slot where an allocation flows into along the backedge whether that context slot interfers with said allocation (using the interference graph above). If there is an interference it means that the newly allocated object and the object allocated in the previous iteration are alive at the same time, which precludes the allocation site from being singleton.
Allocations can be moved out of the loop if they are singleton and captured. I updated the Invariant Code Motion phase to do so accordingly. This should significantly speed up our JavaScript JIT which dynamically allocates a ton of small objects inside loops.
August 30, 2007
sun.misc.Unsafe fixed in J9
John Duimovich from IBM contacted me shortly after I posted here about the bug we found in J9’s implementation of the sun.misc.Unsafe interface. We were able to narrow down the issue to memory accesses though sun.misc.Unsafe with an odd offset and provided this description to John the same night. The next morning John and the J9 team identified and fixed the problem (a wrong code path was taken based on the least significant bit). It took IBM less than 24h to fix this issue, measured from the time I mentioned it on my blog. Thats pretty impressive I would say :)
John also reported of a vivid discussion amongst the J9 team whether using sun.misc.Unsafe is a good idea in the first place. The main reason for us to use it was that we independently optimize null, type and bounds checks and we wanted the HostVM’s JIT to not emit these checks if our optimizer already ensured them. Both, Hotspot and J9, breaking on the generated code forced us to reconsider. We are now using regular xALOAD/xASTORE and GET/PUT/FIELD/STATIC instructions to access memory instead of sun.misc.Unsafe. This incurs quite some overhead, unfortunately. If we are able to obtain a developer snapshot of J9 that contains the fix described above, we might be able to benchmark with J9. In parallel we are trying to nail down the issue in Hotspot. If neither option is available by the deadline we will go with the pure bytecode backend, which is not ideal but better than nothing.
August 28, 2007
sun.misc.Unsafe broken in Hotspot (and J9)
We went through a great deal of pain trying to get sun.misc.Unsafe to work properly in our VM. We use it extensively for the interpretative backends (JPL and BPL). It turns out that the C2 compiler (server JIT) of Hotspot generates incorrect machine code for sun.misc.Unsafe.putByte in certain conditions. We searched the Hotspot bug database and couldn’t find any entries on this issue yet. Bernd offered to put us in touch with someone from the C2 engineering team if we can reduce the problem enough to post it to the bug database. Right now the issue is reproducible by running a certain test case through our VM. To file a report we have to find out under what conditions exactly C2 fails.
We have observed the same bug on our main platform (Apple’s Hotspot 6.0preview and 5.0 ports), but also on Linux (Sun’s Hotspot 6.0). For the latter 32-bit and 64-bit x86 are affected. IBM’s J9 VM that ships with 5.0 and 6.0earlyaccess does not break, which increases our confidence that our use of sun.misc.Unsafe is correct, and the bug is in C2. On the downside, even J9 seems to break in certain cases (i.e. accessing a large array via sun.misc.Unsafe). To make a long story short, we had to part with the idea of using sun.misc.Unsafe. It works in the Hotspot interpreter and C1, but is pretty slow. Its broken in Hotspot C2, which is the only JIT that is reasonably fast when it comes to sun.misc.Unsafe. Even IBM’s J9 is pretty slow since it probably doesn’t support sun.misc.Unsafe as builtin intrensics.
We have revamped our backend to use xALOAD and GET/PUT/FIELD/STATIC for memory access. This does introduce some extra overhead since null and typechecks are now done redundantly, once by our guard, and then again by the memory access operation.
While rewriting the backend we also stumbled over a gigantic security loophole. By injecting a public version of sun.reflect.MagicAccessorImpl into the classpath before the standard library JDK jars we are able to selectively disable the verifier for certain classes. We don’t think that this works for applets since class loading is controlled more thoroughly in that environment.
August 19, 2007
Trace-wide Register Caching and Long Integers on x86
With its tiny register set of 7 32-bit integer registers, long integers are particularily difficult to implement efficiently on the x86. I tried a couple different implementations and settled for the following approach for now: A first level register allocator assigns virtual registers to instructions in the trace as we go bottom up to the root in reverse recording order. Each virtual register corresponds to an 8-byte cell on the stack addressed via ESP. Virtual registers are cached in physical registers in a round-robin scheme. An age counter is used to determine the oldest register when we need to spill one. This simplifies the code generator logic since there will be no competition for recently allocated registers (except in case of explicit spilling of specific registers, in which case a recently allocated register might be spilled again).
Just as we ensure that virtual registers match across trace boundaries, we have to do the same for the register cache that maps virtual registers to physical registers. For this I use a preference mechanism. If a virtual register needs a physical register and it has a preference register, we try to put it back into that register. Such a preference can be caused by the instruction form (i.e. a division always produces the result in EAX or EDX), or it can be that an instruction and the corresponding virtual register was assigned a particular register in the other trace. In this case we try to get it back to the same register by the time we arrive at the merge point. If that failed we need to spill misaligned virtual/physical register pairs, and they are reloaded into the right register.
For longs I settled on 3 sets of explicit register pairs (EAX:EDX, EBX:EBP, ESI:EDI) instead of arbitrary register pairs. This smaller set of registers reduces the number of potential combinations and increases the chance that we will manage to end up with the same cache configuration at merge points, reducing the amount of spill code. Also, the observant reader will have noticed that ECX is absent from these sets. This was chosen to aid code generation for 64-bit shifts, which need the shift amount to be in ECX. By never using ECX for long integers ECX can always safely be spilled without clobbering the 64-bit value itself, which would force a reload (we can’t operate on memory results in our compiler as a code generation simplification).
August 17, 2007
x86 backend
We have started working on a x86 backend for our compiler. Initially the backend will emit 32-bit x86 code, because our default VM (Apple’s Java 5 VM) is 32-bit only. Even the preview of Java 6 seems to be 32-bit only on Mac OSX, which is pretty disappointing for a supposedly 64-bit OS. We will switch over to 64-bit generation once we found a better host VM. We are still trying to figure out the best approach to do register allocation on traces with the very limited set of physical registers available on x86 (especially 32-bit). Our current backend assigns a virtual register to each instruction the trace using a simple virtual register allocator. On top of this first level allocation scheme we run a second register allocator that keeps up to 7 (EAX, ECX, EDX, EBX, EBP, ESI, EDI) virtual registers in physical registers. As for an x86 assembler, after initial attempts to roll our own we decided to use the Maxwell Assembler System from Sun Research. I don’t like the one-function-one-instruction-format API of Maxwell, but its a good start and since its open source maybe we will end up re-writing it to fit our needs.
July 11, 2007
Side Exit Handling
Michael and I were chasing an obscure bug the last two days that produced incorrect results in certain side exit conditions when the guard elimination pass was enabled. We ended up changing the side exit code to always re-execute any conditional branch that causes a side exit. This was already the case for implicit conditional checks that are the result of i.e. an array load instruction. If the null check or range check guards fail the instruction is re-executed by the interpreter. For conditional branches we previously simply side exited to the target of the conditonal branch. Instead we now always re-executed the conditional branch in question just as we do for loads and other instructions that have an implied check (i.e. null check).
June 29, 2007
Elimination of Initial Loads from Local Allocations
Allocating objects with NEW initializes the object state and sets all object variables to 0/0.0/null. When performing load elimination (where we match up stores into captured objects with subsequent loads), we previously didn’t eliminate these types of loads because they don’t have a matching store. I changed the load elimination pass to properly replace these loads with a CONSTANT instruction.
With this latest change all captured allocations should no longer be loaded from as long we can properly analyze them. We can’t always analyse array accesses, i.e. if the index used to access the array is variant. As long the latter is not the case, captured allocation sites will only be pointed to by stores, no reads. This means that if the allocation is singleton, we can actually hoist it out of the loop.
PPPJ’07 Paper Accepted
A paper describing the Java Virtual Machine Language (JVML) interpreter we have implemented as part of our trace tree just-in-time (JIT) compiler has been accepted to the ACM Conference on Principles and Practice of Programming in Java (PPPJ’07). What makes our JVML interpreter unique is the fact that its implemented in Java itself, just as the rest of our JIT compiler. Despite the use of a high-language such as Java, our interpreter is reasonably fast (slowdown 4-9 over the hand-optimized assembly language implementation used by Sun’s Hotspot VM). In addition, our interpreter is entirely type-safe and thus is not part of the trusted code base. This is an ideal approach to add an interpreter to a Java VM system that already supports compiled code execution but doesn’t have a builtin Java interpreter (such as IBM’s Jikes RVM).