If you've ever used anything that uses Java, you are probably well aware of its miserable startup performance. To take a quick example, running "javac" (which is largely implemented in Java) takes almost half a second just to display its usage instructions, even if you didn't give it any files to process, whereas gcc might compile your entire C program in a fraction of that time.
Further, if you have large numbers of Java applications running on a single computer, you will find that the memory usage of the system increases much more quickly than it would with the same number of (or even orders of magnitude more) native applications. Despite occasional claims that this is a "non-issue", there is a very simple reason for this, and systems like Android and .NET go to great lengths to solve it.
To understand this, let's first look at how a native application starts: its executable, and (transitively) all of its shared libraries, are "memory mapped" into the new process's virtual address space. The concept of "memory-mapped file I/O" is strongly wedded to that of operating systems with "virtual memory", and allows this operation to be "instantaneous".
So, as opposed to actually reading the file from disk, the pages of memory that will store it are marked in such a way as to associate them with the file they represent on disk. In this way, when one of these pages of memory is accessed, a "page fault" occurs, and only then does the operating system load that specific page from the file.
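To make this concrete, here is a minimal sketch (for a POSIX system, with error handling kept to a minimum) of memory mapping a file with mmap(): the mapping call itself returns almost immediately, and an individual page is only read from disk when it is first touched.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv) {
    if (argc != 2) return 1;

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) return 1;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) return 1;

    /* Nothing is read from disk here: the kernel just records the mapping. */
    unsigned char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) return 1;

    /* This first access triggers a page fault, and only then is this one
     * page of the file actually loaded from disk. */
    printf("first byte of %s: 0x%02x\n", argv[1], data[0]);

    munmap(data, st.st_size);
    close(fd);
    return 0;
}
```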
This also has a serious advantage with respect to memory usage: as the source of the memory is quite clear (the file on disk), if two processes memory map the same file (even at two different addresses in their private address spaces) they will be able to share the same physical memory pages. Further, if the system needs more memory, it can simply discard the pages and reload them from the file later if needed.
It additionally solves some of the most dire problems with respect to load time: code often has large numbers of global data structures, and native code compilers are able to statically arrange this data in the executable on disk so that, when mapped into memory, the process is "ready to go". Any changes a process later makes to this data trigger "copy-on-write", making the affected pages private to that process.
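As a tiny illustration (a sketch, not taken from any particular program), consider a global table in C: the compiler stores its initialized contents in the executable's data segment, and at runtime only the pages a process actually writes to get copied.

```c
/* The compiler emits this table, fully initialized, into the executable's
 * data segment on disk; at load time it is simply memory mapped in. */
static int lookup_table[1024] = { 42, 7, 13 };

int get_entry(int i) {
    return lookup_table[i];          /* reads share the mapped page */
}

void set_entry(int i, int value) {
    lookup_table[i] = value;         /* the first write copies the affected
                                        page (copy-on-write), making it
                                        private to this process */
}
```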
To make this work efficiently, it is important that as much of the binary as possible remain "unchanged" when it loads. This means that it is a serious problem when a binary that contains numerous function calls or global references, represented as absolute addresses of these data and code areas in memory, is loaded at an address other than the one it was compiled for, something that is difficult to guarantee.
When this happens, the dynamic loader will need to "relocate" the binary by iterating something called the "relocation table", making changes to the code as required to adjust addresses by the "slide" from the old location to the new one. Sometimes relocations are just "change these addresses", but more complex systems can support "modify these assembly instructions".
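In rough terms (the structures here are invented for illustration, not any real loader's format), applying relocations amounts to walking a table of offsets and adding the slide to each recorded address:

```c
#include <stddef.h>
#include <stdint.h>

/* A hypothetical relocation entry: "at this offset into the image, there is
 * an absolute address that needs adjusting". */
struct relocation {
    uintptr_t offset;
};

/* Adjust every recorded address by the "slide" (the difference between the
 * address the image was loaded at and the one it was compiled for). */
static void apply_relocations(uint8_t *image_base, intptr_t slide,
                              const struct relocation *relocs, size_t count) {
    for (size_t i = 0; i < count; i++) {
        uintptr_t *target = (uintptr_t *)(image_base + relocs[i].offset);
        /* Every page written here becomes dirty and private to the process. */
        *target += (uintptr_t)slide;
    }
}
```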
The result is that the careful preparation of the binary to be memory mapped and shared between processes is ruined: the relocated binary will contain numerous modifications and is forced not only to be loaded immediately (so the relocation logic can process it) but also to be copied nearly in its entirety into each new process's address space.
While some systems try to solve this by attempting global "whole-system" layout optimizations (to minimize the number of relocations that might have to be done at runtime), the solution typically used has been to modify the compiler to not rely on fixed addressing at all: instead, "program counter" (or "instruction pointer") relative addressing is used.
In these modes, function calls and data references are made using an offset from the currently executing code, rather than as an absolute address from the beginning of memory. The result is a binary that is "position-independent" and can be loaded at an arbitrary location with no changes at all. This feature is activated in gcc (a common C compiler) using -fPIC.
There is, however, a tradeoff: position-independent code is often slightly slower to execute, as it must use slightly more verbose machine code to perform this magic: rather than simply loading an address from memory, the code typically loads an offset and then adds the address at which it is currently executing before using the result.
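You can see this for yourself with a trivial C file; the exact instructions emitted depend on the architecture and compiler, so treat the comments below as a typical pattern rather than a guarantee.

```c
/* Build this file both ways (commands are illustrative):
 *
 *   cc -O2 -S addressing.c          # may use absolute addresses
 *   cc -O2 -fPIC -S addressing.c    # uses PC-relative / GOT-based addressing
 *
 * Without -fPIC, reading `counter` can be a single load from a fixed address;
 * with -fPIC, the compiler typically computes the address relative to the
 * current instruction pointer (possibly indirecting through the GOT first). */
int counter;

int read_counter(void) {
    return counter;
}
```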
Another problem that must be overcome is that of supporting shared libraries that can be independently updated and loaded at addresses unknown until runtime. This requires the dynamic loader to do symbol resolution, looking up (by string name) all cross-library function and data references and updating the code to reference the runtime-computed address.
The first problem is that these function calls (especially ones to very common libraries, such as libc) are scattered through the code, causing yet another situation where the entire binary would seemingly need to be loaded at startup and then copied into each process's address space to allow for the process-specific modifications.
This is instead solved by adding a level of indirection to the code: rather than directly jumping to the address of the other function (or directly loading from the address of shared data), you instead store all the addresses next to each other (possibly in the form of densely packed jump instructions) so that only a few pages of memory need to be modified (and copied) to cover thousands of symbol references.
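A rough sketch of the idea, using an explicit table of function pointers rather than the real PLT/GOT machinery (the names here are made up):

```c
#include <stdio.h>
#include <stdlib.h>

/* A generic function-pointer type, so unrelated functions can share a table;
 * each slot is cast back to its real type before being called. */
typedef void (*import_slot)(void);

/* One densely packed table of resolved addresses: the loader only needs to
 * write to these few pages, while the code pages that use the table remain
 * unmodified and therefore shared between processes. */
static import_slot import_table[] = {
    (import_slot)printf,   /* slot 0 */
    (import_slot)malloc,   /* slot 1 */
    /* ... thousands more in a real program ... */
};

void say_hello(void) {
    /* The call site is never patched: it always goes through the table. */
    ((int (*)(const char *, ...))import_table[0])("hello\n");
}
```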
Unfortunately, even just looking up all of these symbol references takes time: for each name, you must look through the symbol exports of other libraries to find a match. In an ideal universe, you wouldn't do this unless you actually needed to: modern binaries contain large amounts of functionality that are unlikely to be accessed on any particular run of the program.
Therefore, many systems support "lazy-linking", where the procedure linkage table is set up in such a manner that, if the entry in the table is currently blank, the dynamic linker is quickly asked to fill it in. Once the symbol has been located, the entry is updated in the table.
Further, in environments where the procedure linkage table is organized as the aforementioned densely-packed region of jumps, the jump itself is updated, allowing the code to not even have to check if the entry is blank the next time it is executed: jumping to the entry in the table will then jump directly to the function referred to by the symbol reference.
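The same idea can be sketched at the C level using dlsym() as the "resolver" (this illustrates the technique, not how the real procedure linkage table is implemented; RTLD_DEFAULT may require _GNU_SOURCE on glibc, and older systems need -ldl at link time):

```c
#define _GNU_SOURCE             /* for RTLD_DEFAULT on glibc */
#include <dlfcn.h>
#include <stdio.h>

typedef size_t (*strlen_fn)(const char *);

static size_t strlen_resolver(const char *s);

/* The "table entry": it starts out pointing at the resolver stub. */
static strlen_fn strlen_slot = strlen_resolver;

static size_t strlen_resolver(const char *s) {
    /* First call only: look the symbol up by name, then patch the slot so
     * that every later call through it goes straight to the real function. */
    strlen_slot = (strlen_fn)dlsym(RTLD_DEFAULT, "strlen");
    return strlen_slot(s);
}

int main(void) {
    printf("%zu\n", strlen_slot("first call: resolve, patch, then call"));
    printf("%zu\n", strlen_slot("second call: straight through the slot"));
    return 0;
}
```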
The next problem people have tried to solve is that of slow performance caused by looking up these names in the various symbol tables. While it is "easy" to see ways of increasing the performance when doing the dynamic linking of the entire binary at once (alphabetize both lists and do a merge), with lazy-linking this problem becomes more intricate.
A common solution to this is simply to "cheat": in most circumstances, the exact libraries that are being used at runtime are either known when the binary is being compiled or when the binary is installed. You can then "pre-link" the binary, providing "hints" as offsets into the symbol tables for where you "expect" to find the symbol: you then only need to search if the hint is incorrect.
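A sketch of the hint idea, with invented data structures: check the slot recorded at pre-link time first, and only fall back to a real search if the target library has changed since then.

```c
#include <stddef.h>
#include <string.h>

struct export_entry { const char *name; void *address; };

void *lookup_with_hint(const struct export_entry *exports, size_t count,
                       const char *name, size_t hint) {
    /* Fast path: the hint recorded at pre-link time is usually still right. */
    if (hint < count && strcmp(exports[hint].name, name) == 0)
        return exports[hint].address;

    /* Slow path: the library changed since pre-linking; search for real. */
    for (size_t i = 0; i < count; i++)
        if (strcmp(exports[i].name, name) == 0)
            return exports[i].address;
    return NULL;
}
```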
Now, with all of this in mind, we are ready to see how Apple decided to improve the load time of iOS. As the full set of libraries is known to Apple when they ship a build of iOS (developers are not allowed to ship their own with their applications, and updates to the OS are "all-or-nothing"), Apple takes pre-linking to the extreme, and ships a single massive library.
This library is called the "dyld shared cache" ("dyld" being the name of the dynamic loader on Darwin). It is a single >100MB binary that contains every shared library concatenated and pre-linked into a single file, stored in /System/Library/Caches/com.apple.dyld/.
Thanks to this preparation, that single massive file can be loaded by the dynamic loader with all of its internal symbols already optimized to require no relocations or symbol resolution. Only the symbols used by the specific application executable (as downloaded from the App Store) will need to be resolved, which is often a small fraction of the amount of code being executed.
Java takes a drastically different approach to how its applications are distributed. As there is no way to know ahead of time what kind of system a program will be run on, Java instead compiles to "bytecode": an instruction set designed for Java itself that must be compiled at runtime to the machine-specific instruction set for execution.
The result of this is that before you can use any code, you must first perform much of the work that, for native binaries, is done ahead of time by a static compiler. Work is constantly being done to figure out how to improve the performance of this runtime compilation, with a heavy weighting on how long the compilation itself takes: a 10% increase in compile time in exchange for a faster result is a costly choice.
A typical solution to this is to use multiple configurations of the compiler: when initially running the code, compile it using a very fast algorithm (one that often generates a slow-yet-reasonable result), and only put more effort into compiling a specific region of code later if it is being run often enough to warrant it (that is, if it is a "hot spot" in the code).
This scheme offers other benefits: during the initial runs of the purposely unoptimized build (which may even simply interpret the code rather than compile it), performance statistics can be kept about how the code is being used at runtime, allowing the eventual optimizing compile to tailor the code to the specific behavior of that device and input.
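The shape of this scheme can be sketched in a few lines of C (the threshold and function names are made up; real VMs are far more elaborate):

```c
#include <stdbool.h>

#define HOT_THRESHOLD 10000      /* made-up number; real VMs tune this */

struct method {
    long invocation_count;
    bool compiled;
    /* ... bytecode, profiling data, compiled entry point, and so on ... */
};

/* Called on every invocation of the method: run it the cheap way until it
 * proves hot, then hand it (and its gathered profile) to the optimizer. */
void invoke(struct method *m) {
    if (!m->compiled && ++m->invocation_count >= HOT_THRESHOLD) {
        /* compile_optimized(m);   -- hypothetical optimizing compiler */
        m->compiled = true;
    }
    /* Then either interpret m's bytecode or jump to its compiled code. */
}
```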
Unfortunately, this takes its toll on startup performance: while the eventual runtime performance may be quite fast (and could theoretically even exceed that of a statically compiled binary), the cost of these compilations (or the related cost of not compiling and falling back to an interpreter) causes the program to start out sluggish and only gradually warm up to full performance.
Meanwhile, there is a more insidious problem: as the resulting compiled code is process-specific (even if it happens to yield the same result), it will not be shared between multiple Java applications. This is a serious problem in environments where more than a small handful of Java processes will be running (such as on a handset whose operating system runs all of its applications in Java).
To help with both of these problems, .NET (an environment very similar to Java) offers a feature called "pre-JIT", where as a program is installed it is run through a more lengthy compilation session and the results are stored on disk. This allows the VM to get closer to the ideal of "memory map and run", which was the goal of all of the optimizations we saw being made to native code.
Another issue Java runs into is that every class is stored separately from every other and is considered a stand-alone component that can be replaced at runtime. Whereas it is quite uncommon for native programs to use more than a few hundred shared libraries, it is quite common for a complex Java application to have many tens of thousands of class files across all of the libraries it uses.
Every single one of these class files is then run through a process analogous to dynamic linking, as by-name references to methods on other classes have to be resolved at runtime. However, interestingly, class files are not well optimized for this use case: they are organized into "constant pools" containing unordered and variably sized entries that must be painstakingly parsed and re-hashed.
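To see why, here is a sketch (handling only a few of the real tag values) of what walking a constant pool involves: every entry starts with a tag byte and has a tag-specific, sometimes variable, size, so even finding the Nth entry means parsing all of the entries before it.

```c
#include <stdint.h>

#define CONSTANT_Utf8       1
#define CONSTANT_Class      7
#define CONSTANT_Methodref 10

/* Returns a pointer just past one constant-pool entry, or NULL for tags this
 * sketch does not handle. `p` points at the entry's tag byte. */
static const uint8_t *skip_entry(const uint8_t *p) {
    switch (*p) {
    case CONSTANT_Utf8: {
        uint16_t len = (uint16_t)((p[1] << 8) | p[2]);  /* big-endian u2 */
        return p + 3 + len;       /* tag + length + that many bytes */
    }
    case CONSTANT_Class:
        return p + 3;             /* tag + u2 name index */
    case CONSTANT_Methodref:
        return p + 5;             /* tag + u2 class index + u2 name-and-type */
    default:
        return NULL;              /* other tags omitted from this sketch */
    }
}
```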
The result is that a lot of work must be done each time the VM starts to build process-specific data structures that allow these parts to be more rapidly put together at runtime, resulting in both decreased startup performance and (again) increased memory usage (especially when running multiple Java processes, even if they are using the same underlying libraries).
This was a serious issue for the BlackBerry (which largely ran Java applications), so RIM developed a custom file format called "COD" (I wish I knew what this stood for) where all of the class files that made up an application would be stored together into a single larger file, with its internal references already resolved and improved data structures for symbol resolution.
Finally, to add insult to injury, Java class files are actually typically distributed compressed into zip files called "Java archives" or "jar files". The result is that not even the bytecode can be shared between multiple processes: instead, each process decompresses a private copy of the class files it is using, and only the compressed archives themselves are able to be memory mapped.
The most common solution to this problem does not require any special magic: one simply needs to decompress the Java archives into a shared classpath, allowing multiple instances of the JVM to memory map the class files and thereby share the memory pages backing them (which also, again, allows these pages to be reclaimed by the operating system, as they can be loaded back from the files).
Interestingly, many of these core issues have been ignored by Oracle's VM. While we have now discussed solutions that third parties have used in their virtual machines (whether for Java or for similar languages), Oracle's VM has concentrated on performance optimizations that can be applied after the code has been loaded, and has largely avoided attempting to improve the earlier stages of code loading.
The one exception to this is a tool called Java Quick Starter, which is installed by default along with the Oracle VM on Windows XP and Windows 2000. However, the sole purpose of this program is to keep the operating system's disk cache warm for the files required by the Java VM; this is effectively equivalent to dedicating part of your computer's RAM to storage of the Java Virtual Machine.
Of course, this is an optimization that would apply just as well to native code as it does to Java class files. Therefore, as of Windows Vista, this feature has now been included with the operating system as part of a feature called SuperFetch, which keeps detailed statistics on not just which applications you use, but when you tend to use them, allowing it to pre-fetch them into memory.
Which now brings us to how Android solves some of these problems. It first should be noted that Android does not use Java bytecodes, nor technically does it run a "JVM": instead, it uses an alternative stack known as Dalvik, storing its own "Dalvik bytecode" in "Dalvik executables" (or "dex files"). This alternative stack is the framework in which some of these problems are solved.
First, the DEX file format itself is similar to BlackBerry's COD files: all of the class files for a library are combined together into a single file with an improved internal structure that makes it easier to look up classes or methods from other libraries and applications, as well as allowing some internal references to be pre-resolved to their targets.
Secondly, when the application is installed to the device, a step reminiscent of .NET's pre-JIT occurs, as DEX files are "optimized" into ODEX files. This optimization does not actually run a JIT compiler (as until Android 2.2 Dalvik did not even have a JIT), but does resolve machine-specific offsets into function tables and object layouts in order to allow for a much more direct interpretation.
None of these optimizations, however, has yet solved a very severe problem with Java startup performance: that global shared data structures must be generated from scratch in each process, as (for many reasons, from platform independence to the garbage collector's implementation) they cannot be laid out by a compiler ahead of time into a binary that can simply be memory mapped.
It is this specific problem that Android solves by loading a single VM, called "zygote", when the machine turns on, loading (and initializing) all common classes into this process (which might also JIT compile them), and then using this one master VM as a template for all future VMs that the system needs to spawn: rather than making new ones from scratch, this common one is instead forked.
Forking a process on Unix allows both the new child and old parent process to share, in a copy-on-write fashion, all of the memory that had been allocated and initialized. This means that all of the work and storage that went into booting the VM, while unable to be paged back to disk (as would be possible with native code), is at least not duplicated between multiple processes.
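A minimal sketch of the forking idea (the setup function is a hypothetical stand-in for Zygote's class preloading, not Android's actual code):

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static void preload_everything(void) {
    /* Hypothetical stand-in for loading and initializing the common classes. */
}

int main(void) {
    preload_everything();               /* done once, in the "zygote" parent */

    for (int i = 0; i < 3; i++) {
        pid_t pid = fork();
        if (pid == 0) {
            /* Child: starts with all of the parent's initialized memory,
             * shared copy-on-write; it only pays for the pages it dirties. */
            printf("app %d running as pid %d\n", i, (int)getpid());
            _exit(0);
        }
    }
    while (wait(NULL) > 0) { /* reap the children */ }
    return 0;
}
```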
For the developer using both iOS and Android, it is then useful to point out that Zygote is in a way conceptually similar to the dyld shared cache; however, rather than the cache being computed by Apple and distributed with the operating system, and then shared by each process using memory mapping, it is computed by the system as the device boots and then the pages are shared by forking.
There is a difference: to Substrate, Android appears to have only a single process. On iOS, Substrate is loaded into each application as it begins (causing extensions to be reinitialized in each process); but on Android, Substrate is loaded only once (into Zygote). Indeed, even if you attempt to hook the process of forking a new VM, the VM doesn't take on its "identity" until well after it "starts".
Some developers may therefore find this mental shift to be awkward as they transition back and forth between these two platforms (leading to questions regarding hooking individual applications), but if you instead think of Substrate as generally attempting to make global modifications (and simply requiring different implementations and timing to make that happen), this awkwardness can be largely mitigated.