After many months of research, development and QA, Wasmi’s most significant update ever is finally ready for production use.

Wasmi is an efficient and versatile WebAssembly (Wasm) interpreter with a focus on embedded environments. It is an excellent choice for plugin systems, cloud hosts, and as a smart contract execution engine.

Wasmi intentionally mirrors the Wasmtime API on a best-effort basis, making it an ideal drop-in replacement or prototyping runtime.

Install Wasmi’s CLI tool via cargo install wasmi_cli or use it as a library via the wasmi crate.

Wasmi v0.32 comes with a new execution engine that utilizes register-based bytecode, enhancing its execution performance by a factor of up to 5. Additionally, its startup performance has been improved by several orders of magnitude thanks to lazy compilation and other new techniques.

The changelog for v0.32 is huge, so the following sections present only the most significant changes.

Startup Performance

Wasmi is a rewriting interpreter, meaning that it rewrites the incoming WebAssembly bytecode into Wasmi’s own internal bytecode that is geared towards efficient execution performance.

This rewriting is what we call compilation or translation in the context of an interpreter. This is not to be confused with compiling the Wasmi interpreter itself.

Why is translation speed important for Wasmi?

Fast translation enables a fast startup time which is the time spent until the first instruction is executed.

As an interpreter, Wasmi is naturally optimized for fast startup times, making it well-suited for translation-intensive workloads where the time required to translate a Wasm binary exceeds the time needed to execute it.
Conversely, compute-intensive workloads, where execution time surpasses translation time, are better handled by JIT-based Wasm runtimes such as Wasmtime, WAMR, or Wasmer. 1

Lazy Translation

Translation can be costly, especially with the new register-based bytecode. To address this, lazy translation has been implemented, translating only the parts of the Wasm binary necessary for execution.

Wasmi supports three different translation modes:

  • Eager: Code is eagerly validated and eagerly translated ahead of time.
    • Note: This is the default mode for Wasmi v0.32.
  • Lazy: Code is lazily translated and lazily validated.
    • Note: One downside is that this allows for partially validated Wasm modules, a practice that is controversial within the wider Wasm community. 2 3
  • LazyTranslation: Code is lazily translated but eagerly validated.
    • Note: While slower than Lazy, this avoids the problem of partially validated Wasm modules.

Usage as Library

let mut config = wasmi::Config::default();
config.compilation_mode(wasmi::CompilationMode::Lazy);
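
For a slightly fuller picture, here is a minimal sketch of the usual workflow, assuming Module::new accepts the raw Wasm bytes (see the non-streaming translation section below); with CompilationMode::Lazy, function bodies are only validated and translated when they are first executed:

use wasmi::{CompilationMode, Config, Engine, Module};

fn load_lazy(wasm: &[u8]) -> Result<Module, wasmi::Error> {
    let mut config = Config::default();
    config.compilation_mode(CompilationMode::Lazy);
    let engine = Engine::new(&config);
    // With `CompilationMode::Lazy`, function bodies are validated and
    // translated on their first execution instead of ahead of time.
    Module::new(&engine, wasm)
}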

Usage in Wasmi’s CLI

Wasmi CLI now supports the command-line option --compilation-mode=<mode> where <mode> is one of eager, lazy, or lazy-translation.

Unchecked Translation

Wasmi validates the Wasm binary, which accounts for roughly 20-40% of the total time spent during the startup phase. However, some users might want to skip Wasm validation altogether because they know ahead of time that the Wasm binaries they use are already validated. This is now possible via the unsafe fn Module::new_unchecked API.
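
A minimal sketch of how this could be used, assuming Module::new_unchecked mirrors the signature of Module::new:

use wasmi::{Engine, Module};

fn load_prevalidated(engine: &Engine, wasm: &[u8]) -> Result<Module, wasmi::Error> {
    // SAFETY: `wasm` is known to be valid WebAssembly, e.g. because it was
    // validated once when it was first deployed and has not changed since.
    unsafe { Module::new_unchecked(engine, wasm) }
}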

Non-streaming Translation

Wasmi v0.31 and earlier always used streaming translation to process its Wasm input. In practice, however, most users never made use of this, so Wasmi v0.32 uses non-streaming translation by default, which yields yet another nice performance win.

Users who actually need streaming translation can simply use the new Module::new_streaming API.
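
For example, streaming a large module from disk could look roughly like this, assuming Module::new_streaming accepts any std::io::Read source:

use std::{error::Error, fs::File};
use wasmi::{Engine, Module};

fn load_streaming(engine: &Engine, path: &str) -> Result<Module, Box<dyn Error>> {
    // The Wasm binary is consumed incrementally instead of being buffered up front.
    let file = File::open(path)?;
    let module = Module::new_streaming(engine, file)?;
    Ok(module)
}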

Linker Caching

The Wasmi Linker is used to define the set of host functions that a Wasm binary can use to communicate with the host. Oftentimes, dozens of host functions are defined, which can quickly become costly.

To address this, Wasmi now offers a LinkerBuilder, which allows new Linkers to be instantiated efficiently after a one-time setup. 4

Benchmarks with 50 defined host functions have demonstrated a 120x speedup using this approach.

Benchmarks

By combining all of the techniques above, it is possible to speed up Wasmi’s startup time by several orders of magnitude compared to the previous Wasmi v0.31.

The newest versions of all Wasm runtimes available at the time of writing this article have been used. 5

Currently, Winch only supports x86_64 platforms and therefore was only tested on those systems.

ERC-20 - 7KB

Argon2 - 61KB

BZ - 147KB

Pulldown-Cmark - 1.6MB

Spidermonkey - 4.2MB

FFMPEG - 19.3MB

Note: Wasmtime (Cranelift) timed out and Stitch failed to compile ffmpeg.wasm.

Translation Benchmarks: Conclusion

Wasmi and Wasm3 perform best by far thanks to their lazy compilation capabilities. As expected, optimizing JIT-based Wasm runtimes like Wasmtime and Wasmer perform worse in this context. Single-pass JITs designed for fast startup, such as Winch and Wasmer Singlepass, are also significantly slower. Despite also using lazy translation, Stitch’s translation performance is not ideal. However, it is important to note that both Winch and Stitch are still in an experimental phase of development and improvements are to be expected.

Execution Speed

For an execution engine, the speed of computation is naturally of paramount importance. Unfortunately, the old Wasmi v0.31 left much to be desired in this regard.

Register-Based Bytecode

The old Wasmi v0.31 internally used a stack-based intermediate representation (IR) to drive execution. This IR is similar to WebAssembly bytecode and thus allows for fast translation times.

Stack-based IRs generally use more instructions than register-based IRs to represent the same program. 6 However, the performance of interpreters is mostly dictated by instruction dispatch. Hence, it is usually a good tradeoff to execute fewer instructions even if every executed instruction is more complex. 7
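
To make the dispatch cost concrete, here is a minimal, hypothetical loop-switch interpreter loop (not Wasmi’s actual code): every executed instruction pays for one trip through the central match, so executing fewer, more powerful instructions reduces the total dispatch overhead even if each handler does more work.

#[derive(Copy, Clone)]
enum Op {
    // One register-style instruction doing the work of several stack-based ones.
    AddAssign { result: usize, lhs: usize, rhs: usize },
    Halt,
}

fn run(ops: &[Op], regs: &mut [i32]) {
    let mut pc = 0;
    loop {
        // Central dispatch point: one hard-to-predict branch per executed instruction.
        match ops[pc] {
            Op::AddAssign { result, lhs, rhs } => {
                regs[result] = regs[lhs].wrapping_add(regs[rhs]);
                pc += 1;
            }
            Op::Halt => break,
        }
    }
}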

This is why, starting with version 0.32, Wasmi uses a register-based IR to drive its execution.

Memory Consumption

The new register-based IR was carefully designed to enhance execution performance and to minimize memory usage. Since the vast majority of a Wasm binary consists of encoded instructions, the compact IR encoding substantially decreases memory usage and enhances cache efficiency when executing Wasm through Wasmi. 8

Benchmarks

The newest versions of all Wasm runtimes available at the time of writing this article have been used. 5

Fibonacci (Iterative) - Compute Intense

Fibonacci (Recursive) - Call Intense

Wasmi v0.32 has not significantly improved over v0.31 in this test case. This is partly because Wasmi v0.31 was already comparatively fast and partly because the new register-based bytecode favors compute-intense workloads over call-intense ones. This is usually a good tradeoff since most Wasm producers (such as LLVM) generate compute-intense workloads due to aggressive inlining.

Primes - Balanced

Matrix Multiplication - Memory Intense

Interestingly, Wasmer (Singlepass) seems to have some trouble on Apple silicon, being even slower than some of the interpreters.

Argon2 - Compute Hash

Note: Stitch and Winch could not execute the argon2.wasm test case.

Coremark

The following table shows Coremark scores for the Wasm interpreters by CPU. 9

Runtime     | AMD Epyc 7763 | AMD Threadripper 3990x | Apple M2 Pro | Intel i7 14700K
Wasmi v0.31 | 657           | 944                    | 884          | 1759
Wasmi v0.32 | 1457          | 1779                   | 1577         | 2979
Tinywasm    | 235           | 339                    | 592          | 772
Wasm3       | 1309          | 1999                   | 2931         | 3831
Stitch      | 1390          | 2187                   | 3056         | 4892

Execution Benchmarks: Conclusion

Wasmi is especially strong on AMD server chips and lags behind on Apple silicon. An explanation for this could be the difference in the instruction dispatch technique being used. 10

The Stitch interpreter performs really well. The likely reason is that Stitch encourages the LLVM optimizer to produce tail calls for its instruction dispatch, despite Rust not supporting explicit tail calls. Due to various downsides, this design decision was discussed and dismissed during the development of Wasmi v0.32. 11 12 Given Stitch’s impressive execution performance, especially on Apple silicon and Windows platforms, those decisions should be reevaluated.
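
For illustration, here is a hypothetical sketch of tail-call style dispatch (not Stitch’s or Wasmi’s actual code): every opcode handler ends by invoking the next handler in tail position, in the hope that LLVM turns those calls into plain jumps so that each opcode gets its own, more predictable dispatch branch.

// Each handler interprets one opcode and then dispatches the next one in tail position.
type Handler = fn(ops: &[u8], pc: usize, acc: i64) -> i64;

static HANDLERS: [Handler; 2] = [op_inc, op_halt];

fn dispatch(ops: &[u8], pc: usize, acc: i64) -> i64 {
    // If LLVM emits this call as a tail call it becomes a jump; if it does not,
    // the interpreter still works but grows the native call stack instead.
    HANDLERS[ops[pc] as usize](ops, pc, acc)
}

fn op_inc(ops: &[u8], pc: usize, acc: i64) -> i64 {
    dispatch(ops, pc + 1, acc + 1)
}

fn op_halt(_ops: &[u8], _pc: usize, acc: i64) -> i64 {
    acc
}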

Confusingly, Wasmi’s great results on the Intel i7 14700K test cases are not reflected in its Coremark score. This is probably because every test case, including Coremark, is biased towards certain kinds of workloads to some degree.

Benchmark Suite

The benchmarks and plots above have been gathered and generated using the wasmi-benchmarks repository. Readers are encouraged to run the benchmarks and plot the results on their own machines to confirm or disprove the claims. Usage instructions can be found in the wasmi-benchmarks README.md.

Contributions that add more Wasm runtimes, improve the plots, or add new test cases are welcome!

Summary & Outlook

This article presented the highlights of the new Wasmi version 0.32 and demonstrated the significant improvements in both startup and execution performance across various test cases.

With this new major update, Wasmi now has a solid foundation for future development. Many WebAssembly proposals, such as multi-memory, simd, and gc, that were put on hold during the development of Wasmi v0.32 are now awaiting implementation.

The promising results, especially on the AMD server chips, are a decent indicator that Wasmi has great potential. The performance of Wasmi on Apple silicon will be improved in future releases.

Plans are underway to implement the Wasm C-API, enabling various ecosystems that can interface with C to use Wasmi as a library.

Wasmi will continue to solidify its position as an efficient and versatile Wasm interpreter with fantastic startup performance and low memory consumption, especially suited for embedded environments.

Special Thanks

  • First and foremost, I want to thank Parity Technologies for financing and supporting the development of Wasmi for such a long time and for allowing Wasmi to become an independent project.
  • Additionally, I want to extend my gratitude to OLUWAMUYIWA, who dedicated their time and effort to implement WASI preview1 support for Wasmi — which was absolutely amazing!
  • Furthermore, I want to thank yamt, who inspired me with their Wasm runtime benchmarking platform. I highly recommend checking out their toywasm Wasm interpreter!
  • Finally, I would like to acknowledge Neopallium for the thought-provoking discussions and experiments we shared about efficient interpreter dispatching techniques in Rust. I highly recommend checking out one of his Wasm experiments, s1vm.

  1. There are basically two kinds of Wasm workloads:

    1. Compute-Intense: The time to execute the Wasm binary exceeds the time to translate it.
      • This use case is best covered by a JIT based Wasm runtime such as Wasmtime, WAMR or Wasmer.
    2. Translation-Intense: The time to translate the Wasm binary exceeds the time to execute it.
      • This use case is best covered by a Wasm runtime that optimizes for fast startup times such as Wasmi, Wasm3 or Wizard.

    If it is unclear whether a workload is translation-intensive or compute-intensive, it may be beneficial to use a Wasm runtime that balances both types of workloads. Examples include Wasmtime’s Winch or Wasmer’s Singlepass JIT, which are designed to handle a mix of translation and execution demands effectively.

     ↩︎
  2. For more information, see this GitHub issue. ↩︎

  3. Another downside is that some Wasm runtime limitations, for example the maximum number of bytes per Wasm function, may not be checked when using lazy function translation. ↩︎

  4. The following code snippet shows how to use the new LinkerBuilder:

    fn test() {
        let mut builder = <Linker<()>>::build();
        // Populate the linker with the desired host functionality:
        builder
            .func_wrap("env", "foo", |_caller: Caller<()>| println!("called foo"))
            .unwrap();
        builder
            .func_wrap("env", "bar", |_caller: Caller<()>| println!("called bar"))
            .unwrap();
        let builder = builder.finish();
    }
    

    Now builder can be used to quickly spawn new Linkers with the predefined set of host functions.

    let engine = Engine::default();
    let linker = builder.create(&engine); // FAST!
    
     ↩︎
  5. The following versions have been used for the tested Wasm runtimes:

    Runtime          | Version
    Wasmi v0.31      | v0.31.2
    Wasmi v0.32      | v0.32.0-beta.18
    Tinywasm         | v0.7.0
    Wasm3            | v0.5.0
    Stitch           | v0.1.0
    Wasmtime / Winch | v20.0
    Wasmer           | v4.3
     ↩︎ ↩︎
  6. A simple example for this is the translation of the following Wasm bytecode:

    local.get 0
    local.get 1
    i32.add
    local.set 0
    

    This adds the locals at indices 0 and 1 and stores the result back into the local at index 0. Wasmi v0.32 translates this bytecode into a single Wasmi IR instruction:

    0 <- i32.add 0 1
    

    This reduces the number of instructions that need to be executed from 4 down to 1. ↩︎

  7. The new translation from stack-based to register-based bytecode is a complex and interesting topic that might warrant its own article if there is enough interest in it. ↩︎

  8. A benchmark for startup and memory consumption, albeit somewhat outdated, can be found in the toywasm benchmarks. Significant improvements have been made to Wasmi since those benchmarks were conducted, so one should take those numbers with a grain of salt.

     ↩︎
  9. For Wasmi and Wasm3, both lazy and eager modes resulted in nearly the same scores. This is because Coremark is a long-running task where the impact of lazy translation is greatly reduced. The table displays the higher score among the different modes. ↩︎

  10. An explanation for Wasmi’s inferior performance on Apple silicon is its loop-switch dispatch technique, which is a black box with respect to the generated machine code and heavily depends on heuristics in the optimizer. Recently, Apple announced enhancements to the branch prediction of their latest M4 chips, which could significantly affect Wasmi’s performance, since efficient interpreters heavily rely on well-tuned processor branch prediction. 13 ↩︎

  11. The preferred solution is to finally implement explicit tail calls in Rust. This has been an ongoing topic of discussion for years and was proposed multiple times. It is more than evident that explicit tail calls, while niche, enable certain performance-critical program designs. ↩︎

  12. The downsides of Stitch’s approach are well documented in its own README. The main issue is its reliance on LLVM’s optimizer to produce the expected code on all platforms, which is likely but not guaranteed. If LLVM does not produce the expected code, Wasmi will merely be slow, but Stitch will not even work. ↩︎

  13. The Structure and Performance of Efficient Interpreters by Ertl et al. ↩︎