Bringing Treesitter to the Internet
Shiki, a rather nice syntax highlighter that you can use to highlight code on your blog, was recently rewritten from scratch. It’s built upon the same system VS Code uses: TextMate grammars, which highlight syntax using glorified regexes. A side effect of this setup is its reliance on the Oniguruma regex library, which is not so nice for a number of reasons.
A crazy thought struck me at some point. What if I used Treesitter to do all the heavy lifting of working out how to color the code snippets? Treesitter is an incremental parsing library that editors like Neovim, Helix, and Zed use for syntax highlighting. Being a Neovim user myself, I was quite enthusiastic about using the same tool on my website.
This website is generated statically by Astro, which in turn runs on Node.js. At least that’s the setup for the time being. If I wanted to use Treesitter, I would have to somehow plug it into the Node.js process…
My language of choice for doing this is Rust, because it’s a systems language, it has some of the best bindings to Treesitter, and it has pretty good interop with the JavaScript ecosystem.
WASM is too hard
My first intuition was to compile Treesitter into a WASM module. However, this proved to be much harder than I had anticipated.
The main problem is compiling the tree-sitter crate to wasm32-unknown-unknown. This is simply impossible to do without resorting to hacks, because there’s a C header in the source code which cannot be compiled for that target.
Another problem is that there is currently an ABI mismatch between C and Rust when it comes to the wasm32-unknown-unknown target. The wasm32-wasi target is not affected by this issue.
The native approach
I’ve found that there are pretty good Rust bindings to Node-API (the native addon API for Node.js), so I decided to try loading Treesitter compiled as a dynamic library.
A lot of the setup and build work can be automated using a CLI tool, which I definitely recommend checking out.
For this type of project we have to set the crate type to cdylib, that is, a dynamic library respecting the C ABI (crate-type = ["cdylib"] in the [lib] section of Cargo.toml).
Next we need to declare all the external libraries required to compile for Node.
In the dependencies section of Cargo.toml we add the napi crate. Its feature flag (napi4) indicates compatibility with Node-API version 4; the Node.js documentation has a version matrix showing which Node.js releases support which Node-API version.
The build dependencies section lists dependencies that are only needed during the build process, such as compiler plugins or code generation tools.
We can add Treesitter dependencies like this.
In Rust we need to create a function which will be callable from the JavaScript side. It needs to be marked with the #[napi] procedural macro, which makes the function visible to Node.js. By the way, only some types can be used in the signature of such a function. Check the documentation for the available types.
In the entry function I load the configuration for the requested language. If there’s no such language, we return early. Next, we create a highlighter and pass it the config, the source code we want to highlight, and a callback used for retrieving additional language configs. The callback is needed for handling injections, where one language is embedded inside another. Finally, we convert the resulting events into a type which can be converted into a JavaScript object.
The events we get from Treesitter need to be converted into something serializable, e.g. a HashMap.
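The conversion described above can be sketched roughly as follows. This is a minimal illustration, not the code from this site: HighlightEvent is declared locally as a stand-in that mirrors the shape of the enum in the tree-sitter-highlight crate (so the sketch compiles with the standard library alone), and the "kind", "text" and "name" keys are hypothetical names picked for the example.

```rust
use std::collections::HashMap;

/// Local stand-in for `tree_sitter_highlight::HighlightEvent`, so this sketch
/// needs no external crates. The real enum wraps the index in a `Highlight`
/// newtype instead of a bare `usize`.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum HighlightEvent {
    /// A slice of the source, given as byte offsets.
    Source { start: usize, end: usize },
    /// A highlight begins; the value indexes into the recognized capture names.
    HighlightStart(usize),
    /// The most recently started highlight ends.
    HighlightEnd,
}

/// Turns one event into a flat string map. The key names are assumptions
/// made for this example.
pub fn event_to_map(
    event: HighlightEvent,
    source: &str,
    names: &[&str],
) -> HashMap<String, String> {
    let mut map = HashMap::new();
    match event {
        HighlightEvent::Source { start, end } => {
            map.insert("kind".to_string(), "source".to_string());
            // `get` avoids a panic on out-of-range or non-boundary offsets.
            let text = source.get(start..end).unwrap_or("");
            map.insert("text".to_string(), text.to_string());
        }
        HighlightEvent::HighlightStart(index) => {
            map.insert("kind".to_string(), "start".to_string());
            // Fall back to an empty name if the index is unknown.
            let name = names.get(index).copied().unwrap_or("");
            map.insert("name".to_string(), name.to_string());
        }
        HighlightEvent::HighlightEnd => {
            map.insert("kind".to_string(), "end".to_string());
        }
    }
    map
}

/// Converts a whole event stream, preserving order.
pub fn events_to_maps(
    events: &[HighlightEvent],
    source: &str,
    names: &[&str],
) -> Vec<HashMap<String, String>> {
    events
        .iter()
        .map(|&event| event_to_map(event, source, names))
        .collect()
}
```

A list of string maps like this is easy to hand over to JavaScript as an array of plain objects. The order of the events matters, because a start event applies to every source slice that follows it until the matching end event.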
On the JavaScript side we need to load the Rust library like this, assuming we are writing ES modules, that is. A require function is required (heh) to be able to import the library, but ES modules don’t define one, so we have to create it ourselves (for example with createRequire from node:module).
Once we have this library we can load it inside Node (almost) like any other module.
As you can see by the [native code] marker when calling toString() on the hl function, this function is written using a compiled language.
We can use this function for example inside a remark plugin, to transform the tree of elements we get from parsing a markdown file.
Below is an example which I used to highlight syntax in all code blocks.
Extensions
Using Treesitter means that we can easily add new language parsers, and write custom queries for highlights and injections.
For example, if we want to have syntax highlighting for Astro, we have to install the parsers, which need to be included in the Cargo.toml file. Then we can set them up like this.
In the previous snippet I’ve used some custom-made things, as well as once_cell, just to load all the configurations once and keep them in a static.
I’ve used here a function config_for, which is a simple wrapper around HighlightConfiguration::new(...), and a query! macro.
The macro just loads a string from a file and embeds it as a static string.
This is useful for when you would like to write a custom query, because the one included with the parser is not good enough, or if there is none at all.
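To illustrate the idea of a static table of configurations, here is a rough sketch that uses only the standard library. It swaps once_cell for std::sync::OnceLock, and LangConfig is a hypothetical stand-in for HighlightConfiguration that only stores query text. The config_for here is a simplified imitation of the helper mentioned above, and the query strings are placeholders rather than real queries.

```rust
use std::collections::HashMap;
use std::sync::OnceLock;

/// Hypothetical stand-in for `tree_sitter_highlight::HighlightConfiguration`;
/// it only stores the query sources.
#[derive(Debug)]
pub struct LangConfig {
    pub highlights: &'static str,
    pub injections: &'static str,
}

/// Simplified imitation of a `config_for` helper. A real one would build a
/// `HighlightConfiguration` from a grammar and its queries.
fn config_for(highlights: &'static str, injections: &'static str) -> LangConfig {
    LangConfig { highlights, injections }
}

/// Builds the table on first use and hands out the same instance afterwards.
fn configs() -> &'static HashMap<&'static str, LangConfig> {
    static CONFIGS: OnceLock<HashMap<&'static str, LangConfig>> = OnceLock::new();
    CONFIGS.get_or_init(|| {
        let mut map = HashMap::new();
        // Placeholder query text; a real setup would embed `.scm` files,
        // for example with `include_str!`.
        map.insert(
            "astro",
            config_for("; astro highlights go here", "; astro injections go here"),
        );
        map.insert("typescript", config_for("; typescript highlights go here", ""));
        map
    })
}

/// Looks up a language by name. `None` lets the entry function return early,
/// and the same lookup can serve as the injection callback.
pub fn lookup(lang: &str) -> Option<&'static LangConfig> {
    configs().get(lang)
}
```

With this in place, lookup("astro") returns a reference to the Astro entry, while an unknown name such as lookup("cobol") returns None. Because the table lives in a static, every call sees the same configurations, and they are built only once.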
Below is an example highlights.scm query for Astro, which adds syntax highlighting captures for all .astro files. I’ve taken it from the nvim-treesitter repo.
Be careful, however: not every predicate or directive available in Neovim is available for general use. For example, #lua-match? is Neovim-only and won’t work here.
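When borrowing queries from nvim-treesitter, a quick check for Neovim-only predicates can save some debugging. The sketch below is a hypothetical helper, not part of Treesitter or of this site’s code. Its list contains only #lua-match?, the one predicate named above, and is meant to be extended by hand. It also naively treats every ; as the start of a comment, so a semicolon inside a string literal would cut the line short.

```rust
/// Predicates assumed to be Neovim-only. `#lua-match?` is the one named in
/// the text; extend the list as you run into others.
const NEOVIM_ONLY: &[&str] = &["#lua-match?"];

/// Returns (1-based line number, predicate) for every hit outside comments.
pub fn find_unsupported(query: &str) -> Vec<(usize, &'static str)> {
    let mut hits = Vec::new();
    for (index, line) in query.lines().enumerate() {
        // Drop everything after `;`, which starts a comment in query files.
        // Limitation: a `;` inside a string literal is treated the same way.
        let code = line.split(';').next().unwrap_or("");
        for &predicate in NEOVIM_ONLY {
            if code.contains(predicate) {
                hits.push((index + 1, predicate));
            }
        }
    }
    hits
}
```

Running find_unsupported over the contents of a borrowed highlights.scm yields the lines that need attention. A hit can often be rewritten with #match?, which takes a regular expression rather than a Lua pattern.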
As for the injections, we can add an injections.scm file. This will allow us to highlight additional languages embedded inside Astro, like TypeScript or HTML.
Last but not least, you have to configure the styles for the classes generated by Treesitter. Some would argue this is in fact the hardest part, which is why I’ve borrowed the color scheme from the Kanagawa theme for Neovim :)
The result
If everything worked correctly you should be able to see a nicely highlighted snippet below :)
All the code is available on GitHub. I do use it in production, so it might change in the future.