Performance of Low-Level String Decoding in JavaScript
--
tl;dr: TextDecoder is really fast for bigger strings. Very small strings (roughly under 18 characters) can actually benefit from a simple custom decoder that builds the string from an array.
JavaScript is a very high-level programming language. Most languages have characters and strings and let you manipulate them in fine-grained ways. In JavaScript, that's simply not the case: you have a very limited set of operations. That's the trade-off of having very simple syntax. You don't get as much control.
Nevertheless, some lower-level APIs are exposed. We have TextEncoder and TextDecoder, as well as access to Uint8Arrays, which are typed views over linear memory that you can index much like ordinary arrays.
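To make "linear memory" concrete, here's a minimal sketch (the size and values are just illustrative) of allocating a Uint8Array and reading and writing individual bytes:

```js
// Allocate 16 bytes of zero-initialized linear memory.
const bytes = new Uint8Array(16);

// Write individual bytes by index, just like an array.
bytes[0] = 72;  // 'H'
bytes[1] = 105; // 'i'

// Reads are plain numeric values from 0 to 255.
console.log(bytes[0], bytes[1]); // 72 105

// Values are clamped to a single byte: 256 wraps around to 0.
bytes[2] = 256;
console.log(bytes[2]); // 0
```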
As part of a project I’m currently working on, there is a requirement to take an ~8 MB book, store it in a JavaScript-compatible way, and decode it so that it can be used to render pages.
I’ve been thinking a lot recently about the most efficient way to store a book. In the end I decided to store the book in binary rather than as a JSON string. So this article is going to be about some performance benchmarks of different ways to retrieve strings from binary data.
JavaScript does use UTF-16 under the hood, but the built-in TextDecoder API can parse UTF-8 (and defaults to it). So we’ll basically be working exclusively in UTF-8.
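A quick round trip shows the basic API (the example string is arbitrary):

```js
const encoder = new TextEncoder();        // always encodes to UTF-8
const decoder = new TextDecoder('utf-8'); // 'utf-8' is also the default

// Encode a JavaScript string (UTF-16 internally) into UTF-8 bytes.
const bytes = encoder.encode('héllo');
console.log(bytes); // Uint8Array(6) [104, 195, 169, 108, 108, 111]

// Decode the UTF-8 bytes back into a JavaScript string.
console.log(decoder.decode(bytes)); // 'héllo'
```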
A Micro-Course on UTF-8
UTF-8 is represented as a byte array (as all strings are under the hood, regardless of encoding). Every byte stores a number from 0 to 255. Obviously there are far more than 256 characters across all languages, yet UTF-8 supports all 1,112,064 valid code points of Unicode.
How is this possible? UTF-8 can stretch and use extra bytes for more complex characters. The most common English characters live in the range 0–127: alphanumeric characters, some symbols, and some control characters, like line breaks. When a leading byte is 128 or above, the decoder knows the character isn't finished and that one or more of the following bytes belong to it, so it becomes a two-, three-, or four-byte character. Each additional byte multiplies the number of values that can be represented, which is how UTF-8 covers all of Unicode while using at most 4 bytes per character.
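To see the variable width in action, here's a small sketch (the characters are chosen just as examples) that prints how many UTF-8 bytes each one takes:

```js
const encoder = new TextEncoder();

// One code point each, but increasingly many UTF-8 bytes.
for (const char of ['a', 'é', '€', '😀']) {
  const bytes = encoder.encode(char);
  console.log(char, bytes.length, [...bytes]);
}
// a  1 [97]
// é  2 [195, 169]
// €  3 [226, 130, 172]
// 😀 4 [240, 159, 152, 128]
```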