When it comes to use cases like quick code formatting and syntax highlighting across many languages, tree-sitter is an excellent tool. But it does so much more than that. At Bearer, we use it as the base for our static code analysis feature.
In this article we’ll look at tree sitter, how to use it, and how to avoid some of mistakes we made when implementing it. This should help you in making the decision if tree sitter is a good choice for your use case.
What is tree sitter?
Tree sitter is a parser generator tool and an incremental parsing library. It enables you to generate concrete syntax tree from any language it supports. The official site has a full list of the supported languages and parsers.
That means that it can take in this javascript snippet:
Simple tree sitter queries
A tree on its own can be useful, but where it really shines is the ability to query nodes within the tree. Tree sitter has its own limited query language. It is somewhat of a cross between SQL and JSON querying. Tree sitter is optimized to query the tree on each file save, so you can expect blazing fast performance.
You can see the full spec for tree sitter’s query on their official page, but if like me you are looking to jump right into it we can walk trough some examples.
A basic example would be to look for each time the object property is accessed. You can do this by querying member_expression:
(member_expression) @param_expression
You assign a node name to the @param_expression variable, then from the consuming program you can access the @param_expression value. As a result of that query you get the highlighted nodes:
Query limitations
Those simple queries are quite easy to construct. Things get a bit more complicated as you try to build more complex queries.If you try to get the parent and child of each object, the naive way of building the query might be:
(member_expression object: (identifier) @param_object property: (property_identifier) @param_property )
But you will soon find out that a query like that doesn’t work for nested objects. For example, if you change the earlier code to find the address instead of the name, you get a different structure.
Here’s the javascript with the change:
let customers = [ { id: 0, name: 'paul', location: { address: 'oxford street' } }, { id: 1, name: 'jeff' }, { id: 2, name: 'mary' } ]; -- let customer = customers.find(cust => cust.name === 'jeff'); ++ let customer = customers.find(cust => cust.location.address === 'oxford street'); console.log(customer);
And here’s the new result:
Query performance tuning
Tree sitter’s query language is simple by design, and you should treat it as such. While it is optimized for performance, if you try to do too much, you could set yourself up for some problems.
Let’s take a look at an example of an html structure:
<html> <body> <div class="parent"> <div class="child1"></div> <div class="child2"></div> <div class="child3"></div> <div class="child4"></div> </div> <div class="parent"> <div class="child1"></div> <div class="child2"></div> </div> </body> </html>
If you want to query adjacent elements, you might end up with this query:
Compare to other tool
If your only need is markup languages such as XML, JSON, YAML you might get away without using tree sitter at all. In that case, the JSON parser of your programming language may be enough.
If you do need more control and support, using tree sitter for static code analysis is magnitudes better than using grep and regular expressions. You can learn more on how our tool differs from that approach in our CTO’s article on how we perform static code analysis.
The market for AST tooling isn’t exactly crowded. There are multiple language server protocols (LSP) available for each language, and some are even based on tree sitter. The main difference between tree sitter and LSP is that while tree-sitter is more about understanding a file, LSP is more about understanding the project. So while it isn’t uncommon for LSP to take a couple of seconds to perform its duties, tree-sitter responds in milliseconds.
Wrapping up
Hopefully, you now have a better understanding of tree sitter and how to approach using it. It can seem like overkill for some projects, and not quite powerful enough for more advanced use cases, but it succeeds as a foundation for which to build multi-language tooling.
Are you using tree-sitter or a similar tool to parse languages for an interesting use case? We’d love to hear about it! Let us know on @cycode.