Tree Sitter
Syntax highlighting in editors has been done with regular expressions for a long time, but with the introduction of tree sitter there is a better way. Rather than directly targeting the source code we can target an abstract syntax tree (AST). This results in more consistent syntax highlighting, more consistent indentation, and better performance.
Background
I will briefly introduce what tree sitter is, but the best introduction to it is this video.
With tree sitter you write JavaScript grammars that can be used to generate native parsers. These parsers have some desirable properties for editors:
- Because they are native and GLR-based, they're fast enough for editors
- The parsers are incremental so they can update just the necessary parts of the AST as you're editing
- The parsers handle syntax errors gracefully so a single error won't break all the syntax highlighting
Once someone writes a tree sitter grammar, it can be used by any editor that has tree sitter support, such as Emacs and Neovim.
Example
The best way to understand the difference is to see it. Take this snippet of mermaid code:
flowchart TD
  Start --> Stop
In Emacs, if you are in a tree sitter enabled mode, you can call treesit-explore-mode to see the AST from tree sitter. The AST updates in real time as you edit the code. The AST for the above snippet looks like this:
(diagram_flow flowchart
 (flowchart_direction_tb td)
 (flow_stmt_vertice
  (flow_node
   (flow_vertex (flow_vertex_id)))
  (flow_link_simplelink (flow_link_arrow))
  (flow_node
   (flow_vertex (flow_vertex_id)))))
When you write a major mode, the syntax highlighting and indentation rules target nodes in that tree. This gives you a degree of accuracy that cannot be matched by regular expressions.
Say you want to highlight Start and Stop. All you have to do is target flow_vertex_id. You don't have to care about what the vertices are called in the source or about the context around them.
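To make "targeting a node" concrete, here is a rough sketch that asks tree sitter for every flow_vertex_id in a piece of source and returns the matching text. It assumes Emacs 29 or later and a grammar registered under the name mermaid that produces the node names shown in the tree above. The helper name my/mermaid-vertex-ids is made up for this illustration.

```elisp
(require 'treesit)

(defun my/mermaid-vertex-ids (source)
  "Return the text of every flow_vertex_id node found in SOURCE.
Hypothetical helper; assumes a grammar registered as `mermaid'
that produces the node names shown in the tree above."
  (with-temp-buffer
    (insert source)
    (let* ((parser (treesit-parser-create 'mermaid))
           (root (treesit-parser-root-node parser))
           ;; Each capture is a cons cell (CAPTURE-NAME . NODE).
           (captures (treesit-query-capture root '((flow_vertex_id) @id))))
      (mapcar (lambda (capture)
                ;; The second argument strips text properties.
                (treesit-node-text (cdr capture) t))
              captures))))

;; Only run the query when tree sitter and the grammar are present.
(when (and (treesit-available-p)
           (treesit-language-available-p 'mermaid))
  (message "%S" (my/mermaid-vertex-ids "flowchart TD\n  Start --> Stop\n")))
```

The query names only the node type. Nothing in it depends on what the vertices are called or where they sit on the line, which is the point being made above.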
Major Modes
It's now easier to write robust major modes for Emacs, even when the language has tricky syntax. We'll start with a skeleton of a tree sitter based major mode:
(require 'treesit)
(eval-when-compile (require 'rx))
(declare-function treesit-parser-create "treesit.c")
(declare-function treesit-query-capture "treesit.c")
(declare-function treesit-induce-sparse-tree "treesit.c")
(declare-function treesit-node-child "treesit.c")
(declare-function treesit-node-start "treesit.c")
(declare-function treesit-node-type "treesit.c")
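(defgroup {language} nil
  "Support for {Language} via tree sitter."
  :group 'languages)

(defcustom {language}-ts-mode-hook nil
  "Hook run after entering `{language}-ts-mode'."
  :type 'hook
  :group '{language})

(defcustom {language}-ts-indent-level 2
  "Number of spaces for each indentation step in `{language}-ts-mode'."
  :type 'integer
  :group '{language})

(defvar {language}-ts--syntax-table
  (let ((table (make-syntax-table)))
    table)
  "Syntax table for `{language}-ts-mode'.")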
(defvar {language}-ts--treesit-font-lock-rules
  (treesit-font-lock-rules
   ;; TODO
   ))

(defvar {language}-ts--indent-rules
  `(({language}
     ;; TODO
     )))
;;;###autoload
(define-derived-mode {language}-ts-mode prog-mode "{Language}"
  :group '{language}
  :syntax-table {language}-ts--syntax-table

  (unless (treesit-ready-p '{language})
    (error "Tree-sitter for {Language} isn't available"))

  (treesit-parser-create '{language})

  ;; TODO
  (setq-local comment-start "")

  (setq-local treesit-simple-indent-rules {language}-ts--indent-rules)

  ;; TODO
  (setq-local treesit-font-lock-feature-list '())
  (setq-local treesit-font-lock-settings {language}-ts--treesit-font-lock-rules)

  (treesit-major-mode-setup))
(provide '{language}-ts-mode)
First take care of a couple of things that need to be done before getting started:
- Install the tree sitter grammar for the language you are targeting
- Replace {language} with the name of the language in the above snippet
- Run treesit-explore-mode on some code to see the AST
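Before going further it is worth confirming that Emacs can actually see the grammar. The sketch below is one way to check, assuming Emacs 29 or later and a grammar registered under the name mermaid. The helper name my/mermaid-grammar-status is made up for this post.

```elisp
(require 'treesit)

(defun my/mermaid-grammar-status ()
  "Return a short string describing whether the mermaid grammar is usable.
`my/mermaid-grammar-status' is a hypothetical helper, not part of Emacs."
  (cond
   ;; Emacs itself has to be built with tree sitter support.
   ((not (treesit-available-p))
    "this Emacs was built without tree sitter support")
   ;; The grammar's shared library has to be somewhere Emacs can load it.
   ((treesit-language-available-p 'mermaid)
    "mermaid grammar is installed")
   (t
    "mermaid grammar is missing; try M-x treesit-install-language-grammar")))

(message "%s" (my/mermaid-grammar-status))
```

If the grammar is missing, treesit-install-language-grammar can build it for you once you tell it where the grammar's source lives.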
Now you can start developing. We'll start by adding some syntax highlighting. To stick with our example above, we will add a rule that highlights Start and Stop to the treesit-font-lock-rules call:
:language 'mermaid
:feature 'nodes
'((flow_vertex_id) @font-lock-type-face)
We then need to add the feature to treesit-font-lock-feature-list, which is a list of lists with one sublist per decoration level:
(setq-local treesit-font-lock-feature-list '((nodes)))
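As features accumulate, the rules and the feature list grow side by side. Here is a sketch of what two features might look like, assuming Emacs 29 or later. The names prefixed with my/ are made up, and highlighting flow_link_arrow with font-lock-operator-face is my own choice for illustration, not something the grammar dictates.

```elisp
(require 'treesit)

;; `treesit-font-lock-feature-list' is a list of lists: each sublist is
;; one decoration level, and `treesit-font-lock-level' decides how many
;; levels are switched on.
(defvar my/mermaid-font-lock-feature-list
  '((nodes)    ; level 1
    (links))   ; level 2
  "Hypothetical feature list with one feature per decoration level.")

(defun my/mermaid-font-lock-settings ()
  "Return font-lock settings for two features of a mermaid grammar.
Hypothetical helper; assumes the node names shown earlier in this post."
  (treesit-font-lock-rules
   :language 'mermaid
   :feature 'nodes
   '((flow_vertex_id) @font-lock-type-face)

   :language 'mermaid
   :feature 'links
   '((flow_link_arrow) @font-lock-operator-face)))
```

In a major mode these would be assigned to treesit-font-lock-feature-list and treesit-font-lock-settings with setq-local before calling treesit-major-mode-setup.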
And that is it. You didn't have to craft a single regular expression. Indentation is similarly straightforward:
`((mermaid
   ;; Setting a default level
   ((node-is "diagram_flow") column-0 0)
   ;; Indenting one level
   ((parent-is "diagram_flow") parent-bol mermaid-ts-indent-level)))
Rules are added to {language}-ts--indent-rules. You can base your rules on what the current node is, or what the parent node is. This may seem limited, but since you're working on an AST it ends up being powerful.
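Each rule is a three-part list: a matcher that decides whether the rule applies, an anchor that picks a reference position, and an offset added to that position. The sketch below lays the rules out in columns to make that shape visible and adds a fallback rule of my own. It assumes Emacs 29 or later, the my/ names are hypothetical, and whether the fallback suits a given grammar is something you would have to check against real code.

```elisp
(require 'treesit)

(defvar my/mermaid-indent-level 2
  "Hypothetical indentation step, standing in for a mode's own option.")

(defvar my/mermaid-indent-rules
  '((mermaid
     ;; MATCHER                   ANCHOR      OFFSET
     ((node-is "diagram_flow")    column-0    0)
     ((parent-is "diagram_flow")  parent-bol  my/mermaid-indent-level)
     ;; Fallback for any node the rules above did not match:
     ;; line it up with the start of its parent's line.
     (catch-all                   parent-bol  0)))
  "Sketch of rules in the shape `treesit-simple-indent-rules' expects.")
```

The offset can be a number or a variable holding one, which is why the indent level option can be named directly in the rule.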
Conclusion
Tree sitter, along with language servers, will bring in a new era of editor support for languages.