After many months of research, development and QA, Wasmi’s most significant update ever is finally ready for production use.
Wasmi is an efficient and versatile WebAssembly (Wasm) interpreter with a focus on embedded environments. It is an excellent choice for plugin systems, cloud hosts, and as a smart contract execution engine.
Wasmi intentionally mirrors the Wasmtime API on a best-effort basis, making it an ideal drop-in replacement or prototyping runtime.
Install Wasmi’s CLI tool via

```sh
cargo install wasmi_cli
```

or use it as a library via the `wasmi` crate.
Wasmi v0.32 comes with a new execution engine that utilizes register-based bytecode, enhancing its execution performance by a factor of up to 5. Additionally, its startup performance has been improved by several orders of magnitude thanks to lazy compilation and other new techniques.
The changelog for v0.32 is huge and the following sections will present the most significant changes.
Startup Performance
Wasmi is a rewriting interpreter, meaning that it rewrites the incoming WebAssembly bytecode into Wasmi’s own internal bytecode that is geared towards efficient execution performance.
This rewriting is what we call compilation or translation in an interpreter. This is not to be confused with compiling the Wasmi interpreter itself.
Why is translation speed important for Wasmi?
Fast translation enables a fast startup time which is the time spent until the first instruction is executed.
As an interpreter, Wasmi is naturally optimized for fast startup times, making it well-suited for translation-intensive workloads where the time required to translate a Wasm binary exceeds the time needed to execute it.
Conversely, compute-intensive workloads, where execution time surpasses translation time, are better handled by JIT-based Wasm runtimes such as Wasmtime, WAMR, or Wasmer. 1
Lazy Translation
Translation can be costly, especially with the new register-based bytecode. To address this, lazy translation has been implemented, translating only the parts of the Wasm binary necessary for execution.
Wasmi supports 3 different modes of translation:

- `Eager`: Code is eagerly validated and eagerly translated ahead of time.
  - Note: This is the default mode for Wasmi v0.32.
- `Lazy`: Code is lazily translated and lazily validated. 2
- `LazyTranslation`: Code is lazily translated but eagerly validated.
  - Note: While slower than `Lazy`, this fixes the problem with partially validated Wasm modules. 3
Usage as Library
```rust
let mut config = wasmi::Config::default();
// Only translate and validate Wasm functions upon first use:
config.compilation_mode(wasmi::CompilationMode::Lazy);
let engine = wasmi::Engine::new(&config);
```
Usage in Wasmi’s CLI
The Wasmi CLI now supports the command-line option `--compilation-mode=<mode>`, where `<mode>` is one of `eager`, `lazy`, or `lazy-translation`.
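For example, running a module with lazy translation could look like this (the positional `module.wasm` argument is illustrative; consult `wasmi_cli --help` for the exact argument layout):

```sh
wasmi_cli --compilation-mode=lazy module.wasm
```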
Unchecked Translation
Wasmi validates the Wasm binary, which accounts for roughly 20-40% of the total time spent during the startup phase.
However, some users may want to skip Wasm validation altogether since they know ahead of time that the Wasm binaries they use are pre-validated. This is now possible via the `unsafe fn Module::new_unchecked` API.
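A minimal sketch of what this could look like, assuming the API mirrors `Module::new`’s signature; the wrapper function and the guarantee that `wasm` was validated beforehand are illustrative:

```rust
use wasmi::{Engine, Error, Module};

fn load_prevalidated(engine: &Engine, wasm: &[u8]) -> Result<Module, Error> {
    // Safety: the caller must guarantee that `wasm` is valid WebAssembly,
    // e.g. because it was already validated by the toolchain that produced it.
    unsafe { Module::new_unchecked(engine, wasm) }
}
```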
Non-streaming Translation
Wasmi v0.31 and earlier always used streaming translation to process their Wasm input. However, in practice most users never made use of this, so in v0.32 Wasmi uses non-streaming translation by default, which yields yet another nice performance win.
Users who actually want streaming translation can simply use the new `Module::new_streaming` API for their needs.
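A sketch under the assumption that `Module::new_streaming` accepts a readable byte stream such as a `std::fs::File`; the file path is made up:

```rust
use std::fs::File;
use wasmi::{Engine, Module};

fn load_streaming(engine: &Engine, path: &str) -> Result<Module, Box<dyn std::error::Error>> {
    // Translate the Wasm module while its bytes are being read,
    // instead of buffering the entire file into memory first.
    let stream = File::open(path)?;
    Ok(Module::new_streaming(engine, stream)?)
}
```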
Linker Caching
The Wasmi `Linker` is used to define the set of host functions that a Wasm binary can use to communicate with the host. Oftentimes, dozens of host functions are defined, which can quickly become costly.
To address this, Wasmi now offers a `LinkerBuilder`, which makes it possible to efficiently instantiate new `Linker`s after the initial setup. 4
Benchmarks with 50 defined host functions have demonstrated a 120x speedup using this approach.
Benchmarks
By combining all of the techniques above, it is possible to speed up the startup time of Wasmi by several orders of magnitude compared to the previous Wasmi v0.31.
The newest versions of all Wasm runtimes have been used at the time of writing this article. 5
Currently, Winch only supports `x86_64` platforms and was therefore only tested on those systems.
Translation performance was benchmarked with the following Wasm binaries:

- ERC-20 (7 KB)
- Argon2 (61 KB)
- BZ (147 KB)
- Pulldown-Cmark (1.6 MB)
- Spidermonkey (4.2 MB)
- FFMPEG (19.3 MB)
Note: Wasmtime (Cranelift) timed out and Stitch failed to compile `ffmpeg.wasm`.
Translation Benchmarks: Conclusion
Wasmi and Wasm3 perform best by far due to their lazy compilation capabilities. As expected, optimizing JIT-based Wasm runtimes like Wasmtime and Wasmer perform worse in this context. Single-pass JITs, which are designed for fast startup, such as Winch and Wasmer Singlepass, are also significantly slower. Despite also using lazy translation, Stitch’s translation performance is not ideal. However, it is important to note that both Winch and Stitch are still in an experimental phase of their development and improvements are to be expected.
Execution Speed
For an execution engine, the speed of computation is naturally of paramount importance. Unfortunately, the old Wasmi v0.31 left much to be desired in this regard.
Register-Based Bytecode
The old Wasmi v0.31 internally used a stack-based intermediate representation (IR) to drive execution. This IR is similar to WebAssembly bytecode and thus allows for fast translation times.
Stack-based IRs generally use more instructions to represent the same problem as register-based IRs. 6 However, the performance of interpreters is mostly dictated by the dispatch of instructions. Hence, it is usually a good tradeoff to execute fewer instructions even if every executed instruction is more complex. 7
This is why, starting with version 0.32, Wasmi uses a register-based IR to drive its execution.
Memory Consumption
The new register-based IR was carefully designed to enhance execution performance and to minimize memory usage. Since the vast majority of a Wasm binary consists of encoded instructions, this design substantially decreases memory usage and enhances cache efficiency when executing Wasm through Wasmi. 8
Benchmarks
The newest versions of all Wasm runtimes have been used at the time of writing this article. 5
Fibonacci (Iterative) - Compute Intense
Fibonacci (Recursive) - Call Intense
Wasmi v0.32 has not significantly improved over v0.31 in this test case. This is partly because Wasmi v0.31 was already comparatively fast here and partly because the new register-based bytecode favors compute-intense workloads over call-intense ones. This is usually a good tradeoff since most Wasm producers (such as LLVM) emit compute-intense code due to aggressive inlining.
Primes - Balanced
Matrix Multiplication - Memory Intense
Interestingly, Wasmer (Singlepass) seems to have some trouble on Apple silicon, being even slower than some of the interpreters.
Argon2 - Compute Hash
Note: Stitch and Winch could not execute the `argon2.wasm` test case.
Coremark
The following table shows Coremark scores for the Wasm interpreters by CPU. 9
| Runtime | AMD Epyc 7763 | AMD Threadripper 3990x | Apple M2 Pro | Intel i7 14700K |
|---|---|---|---|---|
| Wasmi v0.31 | 657 | 944 | 884 | 1759 |
| Wasmi v0.32 | 1457 | 1779 | 1577 | 2979 |
| Tinywasm | 235 | 339 | 592 | 772 |
| Wasm3 | 1309 | 1999 | 2931 | 3831 |
| Stitch | 1390 | 2187 | 3056 | 4892 |
Execution Benchmarks: Conclusion
Wasmi is especially strong on AMD server chips and lags behind on Apple silicon. An explanation for this could be the difference in the instruction dispatch technique being used. 10
The Stitch interpreter performs really well. The likely reason is that Stitch encourages the LLVM optimizer to produce tail calls for its instruction dispatch, despite Rust not supporting them. Due to various downsides, this design decision was discussed and dismissed during the development of Wasmi v0.32. 11 12 Given Stitch’s impressive execution performance, especially on Apple silicon and Windows platforms, those decisions should be reevaluated.
Confusingly, the great results for Wasmi on the test cases for the Intel i7 14700K are not reflected in its Coremark score. This is probably because every test case, including Coremark, is biased towards some kinds of workloads to some degree.
Benchmark Suite
The benchmarks and plots above have been gathered and generated using the `wasmi-benchmarks` repository.
The reader is encouraged to run the benchmarks and plot the results on their own computer to confirm or disprove the claims. Usage instructions can be found in the `wasmi-benchmarks` repository’s `README.md`.
Contributions adding more Wasm runtimes, improving the plots, or adding new test cases are welcome!
Summary & Outlook
This article presented the highlights of the new Wasmi version 0.32 and demonstrated the significant improvements in both startup and execution performance through various test cases.
With this new major update Wasmi now has a solid foundation for future development.
Many WebAssembly proposals, such as `multi-memory`, `simd`, and `gc`, that were put on hold during the development of Wasmi v0.32 are now awaiting their implementation.
The promising results, especially on the AMD server chips, are a decent indicator that Wasmi has great potential. The performance of Wasmi on Apple silicon will be improved in future releases.
Plans are underway to implement the Wasm C-API, enabling various ecosystems that can interface with C to use Wasmi as a library.
Wasmi will continue to solidify its position as an efficient and versatile Wasm interpreter with fantastic startup performance and low memory consumption, especially suited for embedded environments.
Special Thanks
- First and foremost, I want to thank Parity Technologies for financing and supporting the development of Wasmi for such a long time and for allowing Wasmi to become an independent project.
- I want to commend the members of the Bytecode Alliance for their outstanding efforts in shaping the WebAssembly specification and ecosystem. Their contributions, among others, include runtimes such as Wasmtime and WAMR as well as advanced WebAssembly tooling.
- Additionally, I want to extend my gratitude to OLUWAMUYIWA, who dedicated their time and effort to implement WASI preview1 support for Wasmi — which was absolutely amazing!
- Furthermore, I want to thank yamt, who inspired me with their Wasm runtime benchmarking platform. I highly recommend checking out their toywasm Wasm interpreter!
- Finally, I would like to acknowledge Neopallium for the thought-provoking discussions and experiments we shared about efficient interpreter dispatching techniques in Rust. I highly recommend checking out one of his Wasm experiments, s1vm.
There are basically two kinds of Wasm workloads:
- Compute-Intense: The time to execute the Wasm binary exceeds the time to translate it.
- Translation-Intense: The time to translate the Wasm binary exceeds the time to execute it.
If it is unclear whether a workload is translation-intensive or compute-intensive, it may be beneficial to use a Wasm runtime that balances both types of workloads. Examples include Wasmtime’s Winch or Wasmer’s Singlepass JIT, which are designed to handle a mix of translation and execution demands effectively. ↩︎
For more information see this GitHub issue. ↩︎
Another downside is that some Wasm runtime limitations, for example the maximum number of bytes per Wasm function, may not be checked when using lazy function translation. ↩︎
An example code snippet for how to use the new `LinkerBuilder` is the following:

```rust
use wasmi::{Caller, Linker};

fn test() {
    let mut builder = <Linker<()>>::build();
    // Populate the linker with the desired host functionality:
    builder
        .func_wrap("env", "foo", |_caller: Caller<()>| println!("called foo"))
        .unwrap();
    builder
        .func_wrap("env", "bar", |_caller: Caller<()>| println!("called bar"))
        .unwrap();
    let builder = builder.finish();
}
```

Now `builder` can be used to quickly spawn new `Linker`s with the predefined set of host functions:

```rust
let engine = Engine::default();
let linker = builder.create(&engine); // FAST!
```

↩︎
The following versions have been used for the tested Wasm runtimes:

| Runtime | Version |
|---|---|
| Wasmi v0.31 | v0.31.2 |
| Wasmi v0.32 | v0.32.0-beta.18 |
| Tinywasm | v0.7.0 |
| Wasm3 | v0.5.0 |
| Stitch | v0.1.0 |
| Wasmtime / Winch | v20.0 |
| Wasmer | v4.3 |

↩︎ ↩︎
A simple example for this is the translation of the following Wasm bytecode:
```wasm
local.get 0
local.get 1
i32.add
local.set 0
```
This adds the locals at indices 0 and 1 and stores the result back into the local at index 0. Wasmi v0.32 translates this bytecode to a single Wasmi IR instruction:
```
0 <- i32.add 0 1
```
Thus, the number of instructions that need to be executed is reduced from 4 down to 1. ↩︎
The new translation from stack-based to register-based bytecode is a complex and interesting topic that might warrant its own article if there is enough interest in it. ↩︎
A benchmark for startup and memory consumption, albeit somewhat outdated, can be found in the toywasm benchmarks. Significant improvements have been made to Wasmi since those benchmarks were conducted, so one should take those numbers with a grain of salt. ↩︎
For Wasmi and Wasm3, both `lazy` and `eager` modes resulted in nearly the same scores. This is because Coremark is a long-running task where the impact of lazy translation is greatly reduced. The table displays the higher score among the different modes. ↩︎
An explanation for Wasmi’s inferior performance on Apple silicon is the loop-switch dispatch, which is a black box concerning the generated machine code and heavily depends on heuristics in the optimizer. Recently, Apple announced enhancements in the branch prediction of their latest M4 chips, which could significantly affect Wasmi’s performance since efficient interpreters heavily rely on well-tuned processor branch prediction. 13 ↩︎
The preferred solution is to finally implement explicit tail calls in Rust. This has been an ongoing topic of discussion for years and has been proposed multiple times. It is more than evident that explicit tail calls, while niche, enable specific performance-critical program designs. ↩︎
The downsides of Stitch’s approach are well documented in its own README. The main issue is its reliance on LLVM’s optimizer to produce the correct code on all platforms, which is likely but not guaranteed. If LLVM does not produce the correct code, Wasmi will merely be slow, but Stitch will not even work. ↩︎
The Structure and Performance of Efficient Interpreters by Ertl et al. ↩︎