Understanding Emscripten

Started by
17 comments, last by snake5 8 years, 8 months ago

I'm getting into Emscripten, and it sounds pretty interesting. When I first started to learn about the concept of Emscripten, it sounded like I would compile my C/C++ through a compiler (gcc/g++ or clang/clang++), and the object files would be translated into compiled JavaScript. By compiled JavaScript, I thought they meant byte-codes that'd normally be generated by the JS compiler, thus reducing the amount of JS code sent to the browser (for browser implementations) and saving compile time. This compiled JS would also have many given optimizations, such as strongly-typed variables, and other optimizations that gcc/clang would generally do.

After learning more about Emscripten, it sounds like it uses LLVM to generate byte-code that gets run through a JavaScript-based interpreter, where that JS interpreter is executed through the JS engine (Node, V8, SpiderMonkey, etc). Is this correct?

Also, LLVM isn't a compiler, but rather a later phase in the compilation process, right? It sounds like Emscripten works with LLVM, and its own Fastcomp compiler core to create the byte-code.

I've read that when configured correctly, Emscripten can output code that executes faster than JavaScript code written by hand. It may also have comparable speed to C/C++ code (up to 50% C/C++ speed).

I know vanishingly little about Emscripten, but I'm pretty familiar with LLVM.

LLVM is a toolkit with a couple of highly relevant components for this sort of work. First and more general-purpose is LLVM Bitcode, a sort of portable assembly language using Static Single Assignment rules. Secondly it is a collection of compiler back-ends which convert Bitcode into machine code for various machines.

I can't say for sure what Emscripten actually uses LLVM for.

Wielder of the Sacred Wands
[Work - ArenaNet] [Epoch Language] [Scribblings]

Emscripten is a toolchain (compiler + runtime libraries), the compiler being based on LLVM, with Clang as the frontend (converts code to LLVM IR - intermediate representation) and a custom JS code generator as the backend (converts LLVM IR to output code). LLVM is used for the compilation process (communication between parts, code optimization).

"Fastcomp" is simply the name of the JS code generator (backend).

Object files already contain machine code for the specific platform, so they cannot be used for compilation to other targets.

Emscripten can output code that executes faster than JavaScript code written by hand.

Assuming "by hand" means "written by the average JS developer". And usually only in benchmarks. Emscripten generates huge JS files, and a big part of that code is there only to simulate the C/C++ environment. That is not efficient.

It may also have comparable speed to C/C++ code (up to 50% C/C++ speed).

Only in benchmarks. I've yet to see an actual project that did not use 10x more memory, load many times slower and execute enough code to stress the CPU well enough for measurements.

After learning more about Emscripten, it sounds like it uses LLVM to generate byte-code that gets run through a JavaScript-based interpreter, where that JS interpreter is executed through the JS engine (Node, V8, SpiderMonkey, etc). Is this correct?

There is no "JS-based interpreter". The pipeline is C/C++/.. code -> LLVM IR -> JS code (that does not contain any interpreters), and the compiler itself is written in C++. Also, Node is not a JS engine (it uses V8).

reducing the amount of JS code sent to the browser (for browser implementations) saving compile time

Amount of code can be reduced by a) "minifying" - removing/reducing parts that are not important for execution (spaces, local variable names); b) generating less code. Emscripten supports both, but enabling that is up to the user.

Compilers generally work in a few phases.

First they have to parse the language (which itself may be broken into tokenization and grammar rule matching; C/C++ also have a preprocessor step that has its own tokenization and grammar) followed then by a correctness analysis pass (which may be partially or fully combined with the parsing pass).

After that, compilers will usually have some kind of high-level representation of the source code. This is then converted into an intermediate representation that is easier to perform optimization passes over. Once optimizations are done, the compiler will produce executable code in the target architecture's correct format. Some compilers have extra steps in there, or jumble the steps together a bit, but that's the gist of it.

LLVM handles that second part. It is a library that handles data in a specific intermediate representation (aka IR), implements a number of high-power optimization passes over that representation, and then includes "backends" which generate code for a specific machine.

Clang is a project bundled with LLVM that parses C, C++, Objective-C, and OpenCL kernels and generates LLVM IR. It then lets LLVM do optimizations and produce machine code.

The backend doesn't have to produce actual binary code for a machine. It can be binary code for a virtual machine (such as JVM bytecode or CLR/.NET bytecode). It can also be a text format, such as assembly language or the text encoding of LLVM IR itself, or even an entirely different language. Essentially, the backend converts the IR into some other format.

Emscripten is a project that mostly revolves around adding a LLVM backend that generates JavaScript (specifically the asm.js subset of it) from the LLVM IR. This means that any code that can be represented in LLVM IR (whether it comes from Clang or somewhere else) can be converted into asm.js using Emscripten's backend.
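
To make the "backend converts the IR into some other format" idea concrete, here is a minimal sketch of a toy backend that prints JavaScript from a tiny three-address IR. The instruction format, the `emitJs` function and the `BINARY_OPS` table are all invented for illustration. This is not LLVM's IR and not how Fastcomp is implemented, only the same idea at a very small scale.

```javascript
// Hypothetical IR for: int a = func(b * 2); int c = a - 1;
// (made-up instruction objects, not real LLVM IR)
const ir = [
  { dst: "t0", op: "mul", args: ["b", 2] },
  { dst: "a", op: "call", callee: "func", args: ["t0"] },
  { dst: "c", op: "sub", args: ["a", 1] },
];

const BINARY_OPS = { add: "+", sub: "-", mul: "*" };

// The toy "backend": emit one JavaScript statement per IR instruction.
// Every result is coerced with |0 so it stays a 32-bit integer.
function emitJs(instructions) {
  return instructions.map((ins) => {
    if (ins.op === "call") {
      return `var ${ins.dst} = ${ins.callee}(${ins.args.join(", ")})|0;`;
    }
    const [lhs, rhs] = ins.args;
    return `var ${ins.dst} = (${lhs} ${BINARY_OPS[ins.op]} ${rhs})|0;`;
  }).join("\n");
}

const body = emitJs(ir);
console.log(body);

// The emitted text is ordinary JavaScript, so it can be wrapped in a
// function and executed by any engine.
const compiled = new Function("func", "b", body + "\nreturn c;");
const result = compiled((x) => x + 10, 4); // func(8) is 18, then 18 - 1
console.log(result); // 17
```

The point of the sketch is that the output is plain source text in another language. Nothing in it interprets the IR at run time.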

Converting one language into another language is often called "transpiling" these days. One does not _need_ something like LLVM to do this, but given the vast differences in the computing model of C and JavaScript, it really helps to have a good optimizer simplify the input C++ code to simpler machine-like instructions that can be translated to the target language.

Emscripten also provides implementations of various C libraries that wrap the JavaScript environment. That is a slightly more complex topic that requires you to understand linkers and how the full C/C++ compilation model works, and I'm not sure that's the case.

The result is that C code like
int a = func(b * 2); int c = a - 1;
gets translated into LLVM IR like
%0 = %b * 2; %a = call func %0; %c = %a - 1;
and then the Emscripten backend translates that into JS like
var _0 = (b * 2)|0; var _1 = func(_0)|0; var c = (_1 - 1)|0;
The final bit is the form described by asm.js, which is legal JavaScript but with some extra hints to help specialized JIT engines generate faster machine code.
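
As a rough illustration of what those hints look like in a complete unit, here is a small hand-written module in the asm.js style. The names `MiniModule`, `compute` and `func` are made up for this sketch, and real Emscripten output is far larger. The `"use asm"` directive and the `|0` annotations are the parts taken from the asm.js conventions.

```javascript
// Hand-written sketch in the asm.js style (not actual Emscripten output).
function MiniModule(stdlib, foreign) {
  "use asm";
  var func = foreign.func; // imported function, similar to an extern in C

  function compute(b) {
    b = b | 0;                  // parameter annotation: b is a 32-bit int
    var a = 0, c = 0;           // locals get a typed initializer
    a = func((b << 1) | 0) | 0; // int a = func(b * 2);
    c = (a - 1) | 0;            // int c = a - 1;
    return c | 0;               // return type annotation: int
  }

  return { compute: compute };
}

// Link the module against an ordinary JavaScript function and call it.
var mod = MiniModule(globalThis, { func: function (x) { return x + 10; } });
console.log(mod.compute(4)); // 17
```

An engine that knows nothing about asm.js treats `"use asm"` as a harmless string statement and runs the function like any other JavaScript, which is why the same file works everywhere.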

The JS is just JS. You load it into a browser just like any other JS. It can be interpreted by any JavaScript engine, though it'll probably be very slow on any browser that doesn't explicitly optimize for asm.js formatted code. Emscripten is just converting C++ source into JavaScript source.

Note the "oddness" of the generated JS. Emscripten does not convert C++ classes into JavaScript objects, among many other things. This is because JavaScript objects work nothing at all like C++ classes, and also because JS objects can be horribly inefficient (they're by-reference heap-allocated bags of arbitrary key-value pairs). There are features in JS like typed memory arrays that can simulate the unofficial memory model of C++, and C++ pointers can then just be treated like integers into an array of bytes. There are similar tricks for dealing with integers, which don't exist in JavaScript.
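
A minimal sketch of the typed-array trick, assuming a hypothetical `struct Point { int x; int y; double weight; }` laid out at offsets 0, 4 and 8. The view names follow the `HEAP32`/`HEAPU8`/`HEAPF64` convention, but the buffer size, the struct and the address 32 are arbitrary choices for the example.

```javascript
// One ArrayBuffer plays the role of the C address space.
const buffer = new ArrayBuffer(64 * 1024);
const HEAPU8 = new Uint8Array(buffer);    // byte view, like unsigned char*
const HEAP32 = new Int32Array(buffer);    // 32-bit int view over the same bytes
const HEAPF64 = new Float64Array(buffer); // double view over the same bytes

// A "pointer" is just a byte offset into the buffer.
const p = 32;

HEAP32[(p + 0) >> 2] = 7;     // p->x = 7       (int index = byte address / 4)
HEAP32[(p + 4) >> 2] = -3;    // p->y = -3
HEAPF64[(p + 8) >> 3] = 0.5;  // p->weight = 0.5 (double index = byte address / 8)

// Pointer arithmetic is plain integer arithmetic: &p->y is p + 4.
const py = p + 4;
const y = HEAP32[py >> 2];
console.log(y); // -3

// The byte view aliases the same memory. Which byte of x lands at address p
// depends on the platform's byte order, so no value is asserted here.
console.log(HEAPU8[p]);
```

No JavaScript object is created for the struct at all. The "object" is 16 bytes in the buffer, which is why C++ classes never show up as JS objects in the output.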

The asm.js is so named because it only really supports simple operations on integer-like constructions and typed arrays, much like assembly language only really supports simple operations on registers and memory. It's not a special byte code, just a valid subset of JavaScript that is capable of emulating the C++ environment and which is easy for specialized JavaScript JIT engines to optimize into actual machine code.

Projects like WebAssembly seek to provide an actual binary "machine code" for the Internet to get around the size and speed limitations of using JavaScript for distributing and executing compiled languages like C++ or C#.

Short version: Clang parses C++ source and generates LLVM IR. LLVM+Fastcomp converts LLVM IR into JavaScript source. Finally, Emscripten glue emulates common C libraries in the JavaScript+browser environment.

Sean Middleditch – Game Systems Engineer – Join my team!


There is no "JS-based interpreter". C/C++/.. code -> LLVM IR -> JS code (that does not contain any interpreters).

That is not correct. Emscripten's compiler produces bytecode compatible with asm.js, which is an array-based virtual machine implemented in JavaScript.

It also packages up the asm.js script, and all the other components necessary to run the program, which is why one might have assumed that no interpreter was involved.

Tristam MacDonald. Ex-BigTech Software Engineer. Future farmer. [https://trist.am]


LLVM is a toolkit with a couple of highly relevant components for this sort of work. First and more general-purpose is LLVM Bitcode, a sort of portable assembly language using Static Single Assignment rules. Secondly it is a collection of compiler back-ends which convert Bitcode into machine code for various machines.

I can't say for sure what Emscripten actually uses LLVM for.

AND


snake5, on 23 Jul 2015 - 9:56 PM, said:

There is no "JS-based interpreter". C/C++/.. code -> LLVM IR -> JS code (that does not contain any interpreters).
That is not correct. Emscripten's compiler produces bytecode compatible with asm.js, which is an array-based virtual machine implemented in JavaScript.

I'm still pretty confused, but from what I gathered before starting this thread, this is along the lines of what I thought. I read this presentation on asm.js back in February, and it sounds like it ports compiled C/C++ code over to optimized JS (strong typing, etc), like what SeanMiddleditch mentioned:


Emscripten also provides implementations of various C libraries that wrap the JavaScript environment. That is a slightly more complex topic that requires you to understand linkers and how the full C/C++ compilation model works, and I'm not sure that's the case.

The result is that C code like
int a = func(b * 2); int c = a - 1;
gets translated into LLVM IR like
%0 = %b * 2; %a = call func %0; %c = %a - 1;
and then the Emscripten backend translates that into JS like
var _0 = (b * 2)|0; var _1 = func(_0)|0; var c = (_1 - 1)|0;
The final bit is the form described by asm.js, which is legal JavaScript but with some extra hints to help specialized JIT engines generate faster machine code.

The JS is just JS. You load it into a browser just like any other JS. It can be interpreted by any JavaScript engine, though it'll probably be very slow on any browser that doesn't explicitly optimize for asm.js formatted code. Emscripten is just converting C++ source into JavaScript source.

And, backing up a statement from snake5 here, I think I can see how LLVM fits into the process: as optimization during the conversion from C/C++ to JS:


One does not _need_ something like LLVM to do this, but given the vast differences in the computing model of C and JavaScript, it really helps to have a good optimizer simplify the input C++ code to simpler machine-like instructions that can be translated to the target language.

I still can't wrap my head around how Fastcomp fits into the picture if asm.js also generates JS. Is Fastcomp used by asm.js to actually generate the JS? It sounds like Fastcomp was a replacement for the compiler core by Emscripten.


The asm.js is so named because it only really supports simple operations on integer-like constructions and typed arrays, much like assembly language only really supports simple operations on registers and memory. It's not a special byte code, just a valid subset of JavaScript that is capable of emulating the C++ environment and which is easy for specialized JavaScript JIT engines to optimize into actual machine code.

It also sounds like asm.js could just be doing JS generation of simple C/C++ operations. Things like class scope, and possibly C function scoping and structs, are left to LLVM and Fastcomp.


Projects like WebAssembly seek to provide an actual binary "machine code" for the Internet to get around the size and speed limitations of using JavaScript for distributing and executing compiled languages like C++ or C#.

This interests me a lot. JS opens many doors, but it can't do everything. The fact that we currently have to convert all dependency libraries into JS source gives me a nasty feeling. It basically open-sources your applications too. It makes me wonder how heavier applications, like Google's HTML5 player for YouTube, work. It makes sense to send the compressed data over a stream (i.e. a WebSocket), then uncompress the data on the client side. When users upload videos in a supported format, the server probably converts them to a common video format that's stored on their cloud. This is probably why videos take so long to upload. Then, a WebGL fragment shader could convert the video's color space (most likely YUV) to RGB that can be output to the canvas when streaming. Of course, this means that the HTML5 Player's JS is sent over the pipe to the client, and everyone has access to it. I haven't been able to find the source in my dev tools to verify this, however.

Hopefully an open-standard machine language spec is developed so that browsers can write a more efficient VM for code. It'd be really interesting to be able to write C, C++ or C# applications as libraries and such, and be able to compile them into a common web-standard binary. We could go back to compiling our dependency libraries and the executable as separate, compiled sources that both have the advantage of being smaller in download size, less bloated in memory consumption, and faster in execution. In a way, I think I'm describing an open standard for a web-based OS packaged into a browser.

...

That is not correct. Emscripten's compiler produces bytecode compatible with asm.js, which is an array-based virtual machine implemented in JavaScript.

It also packages up the asm.js script, and all the other components necessary to run the program, which is why one might have assumed that no interpreter was involved.

asm.js is not a virtual machine, where did you get that idea? (also, "array-based", wat?) It's a specification, supported by some browsers, that leverages otherwise rather useless operations (such as "x|0") to specify the type of a variable (integer/float). That lets the engine generate simplified (and thus faster) code, provided that some usage rules are not broken.
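
For reference, a short sketch of what those "rather useless" operations do in plain JavaScript. The helper names `toInt`, `toUint`, `toDouble` and `toFloat` are made up here. The operators and `Math` functions are standard.

```javascript
// Coercions that double as type annotations in asm.js-style code.
function toInt(x) { return x | 0; }              // signed 32-bit int
function toUint(x) { return x >>> 0; }           // unsigned 32-bit int
function toDouble(x) { return +x; }              // double
function toFloat(x) { return Math.fround(x); }   // nearest 32-bit float

console.log(toInt(3.7));            // 3  (fraction dropped)
console.log(toInt(-3.7));           // -3 (truncates toward zero)
console.log(toInt(2147483647 + 1)); // -2147483648 (wraps like a 32-bit int)
console.log(toUint(-1));            // 4294967295
console.log(toDouble("2.5"));       // 2.5
console.log(toFloat(0.1) === 0.1);  // false (0.1 is not exactly representable as a float)

// 32-bit integer multiply with C-like overflow behaviour.
console.log(Math.imul(65536, 65536)); // 0
```

In ordinary code these are no-ops on values that are already integers, which is what makes them usable as hints: an engine that recognises the pattern can keep the value in an integer register, and one that does not still computes the same result.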

Also, what "all the other components" are you talking about? There's just the runtime library to include, and that doesn't include an interpreter, just the C/C++ runtime library implementation, including special ArrayBuffer-based heap emulation code. I wonder what you're confusing for an interpreter here.


including special ArrayBuffer-based heap emulation code

Hence 'array-based virtual machine'.

You can't implement most of the semantics of a C-like language without support for a flat, random-access memory model, and javascript lacks such a thing. Thus we have to treat each segment as a giant array, and generate code that interacts almost entirely with those arrays.

You imply it's just the C/C++ runtime library that is affected, but it's much lower level than that - you can't even implement something as simple as the assignment operator without the heap emulation.
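
A minimal sketch of what assignment through a pointer turns into when memory is an array. The `stackAlloc` helper and the starting offset 16 are hypothetical stand-ins for a runtime's stack handling, and the sketch only covers variables whose address is taken.

```javascript
const HEAP32 = new Int32Array(new ArrayBuffer(1024));
let stackTop = 16; // hypothetical bump "stack pointer", in bytes

// Reserve `bytes` of emulated stack space and return its address.
function stackAlloc(bytes) {
  const ptr = stackTop;
  stackTop = (stackTop + bytes + 3) & ~3; // keep 4-byte alignment
  return ptr;
}

// void swap(int *a, int *b) { int t = *a; *a = *b; *b = t; }
function swap(a, b) {
  a = a | 0;
  b = b | 0;
  const t = HEAP32[a >> 2] | 0;     // t = *a
  HEAP32[a >> 2] = HEAP32[b >> 2];  // *a = *b
  HEAP32[b >> 2] = t;               // *b = t
}

// int x = 1, y = 2; swap(&x, &y);
const x = stackAlloc(4); // x and y hold addresses, not values
const y = stackAlloc(4);
HEAP32[x >> 2] = 1;
HEAP32[y >> 2] = 2;
swap(x, y);
console.log(HEAP32[x >> 2], HEAP32[y >> 2]); // 2 1
```

Every load and store here is still a direct typed-array access emitted at compile time. There is no dispatch loop reading instructions at run time.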

Tristam MacDonald. Ex-BigTech Software Engineer. Future farmer. [https://trist.am]


I still can't wrap my head around how Fastcomp fits into the picture if asm.js also generates JS. Is Fastcomp used by asm.js to actually generate the JS? It sounds like Fastcomp was a replacement for the compiler core by Emscripten.

Historically, the final Emscripten JavaScript code generation pass was written as a JavaScript command line application (executed in the Node.js shell). After a while, this became a performance bottleneck (mostly due to having to parse LLVM IR bytecode files in JS), so we needed to optimize the JavaScript code generation pass. This was done by implementing the final JS code generation in C/C++, directly as a backend for LLVM (note that this is not currently a tablegenning backend, but similar to the LLVM CppBackend pass). The project lived in the GitHub branch "fastcomp", and the name stuck around as the colloquial term for the portion of the Emscripten compiler that is implemented as part of the LLVM toolchain, as opposed to the Emscripten frontend, which is a bunch of other Python and JS scripts that comprise the whole Emscripten toolchain.

Currently "fastcomp" is the only supported code generation mode, but for a while we supported both the old and new code generation modes in parallel. Also, fastcomp is only able to target the asm.js style of JavaScript. The old pre-fastcomp JS compiler backend was able to target both asm.js and non-asm.js styles of JavaScript, in order to give projects a managed migration path from the very old non-asm.js codegen to the more performant asm.js codegen. Today there is no reason to use the old non-fastcomp compiler backend anymore, and it has long since been deprecated and removed from the Emscripten code tree, as the fastcomp backend has matured enough to be the only codegen path.


Quote
It may also have comparable speed to C/C++ code (up to 50% C/C++ speed).
Only in benchmarks. I've yet to see an actual project that did not use 10x more memory, load many times slower and execute enough code to stress the CPU well enough for measurements.

There are quite a few projects out there by now that utilize Emscripten to deliver full applications on the web. Most of these certainly do not use ten times the memory compared to native (or handwritten JavaScript), nor do they load many (how many?) times slower. Here is a list of some off the top of my head that have shipped:

- The Humble Mozilla Bundle: https://marketplace.firefox.com/app/humble-asmjs-store

- Contains asm.js versions of the following native games that can be run directly in a web browser: AAAaaa for Awesome!, Democracy 3, Dustforce DX, FTL: Faster than Light, Jack Lumber, Osmos, Super Hexagon, Voxatron, Zen Bound 2

- The Internet Archive MS-DOS Games: https://archive.org/details/softwarelibrary_msdos_games

- The Internet Archive MESS and MAME Arcade Games: https://archive.org/details/messmame

- Unity3D Dead Trigger 2 demo (very old now): http://beta.unity3d.com/jonas/DT2/

- Unity3D AngryBots demo (very old now): http://beta.unity3d.com/jonas/AngryBots/

- Unity3D WebGL Benchmark (somewhat old now): http://beta.unity3d.com/jonas/WebGLBenchmark/

- Since Unity 5.0, Unity supports asm.js+WebGL deployment: http://docs.unity3d.com/Manual/webgl-building.html

- Since Unreal Engine 4.5, it has been possible to deploy to asm.js+WebGL. UE 4.7 makes the UI workflow much easier without having to "hack" the engine internals.

- Unfortunately I was not able to find good demos of Unreal Engine offerings hosted on the web. It is possible to locally build the UE4 demos if one downloads the engine though. There is the Tappy Chicken demo at least: https://www.unrealengine.com/html5/

- Dune 2, Transport Tycoon Deluxe, X-COM: Enemy Unknown, Caesar 3: http://epicport.com/en

- My HTML5 ScummVM demo (very old now, predates asm.js even, so performance is not representative): http://clb.demon.fi/html5scummvm/

- If you are on a Firefox OS mobile device, it is possible to download the Disney Where's My Water? and Where's My Perry? games from the Marketplace. Those ports use Emscripten to deploy the games on the phone.

- Autodesk FormIt 360: http://formit360.autodesk.com/app/ . This is a good example of a non-game application using asm.js.

The number of projects deploying to the web has been steadily rising for the past year. Looking ahead, we are working on the WebAssembly specification, which aims to optimize the asm.js experience even further on the web, and in addition to that, on adding support for WebGL 2, SIMD and multithreading in order to bring feature and performance parity even closer to native.

