Graph databases use graph structures (nodes, edges, and properties) to model, persist, and semantically query data. They excel at querying data that is connected through long chains of relationships. In a traditional relational SQL database, such data would typically be modelled as a large number of tables joined together. Here at Cycode, we were searching for a graph database to organize our data and to enable complex detections and insights. We needed these capabilities for our own internal use and also as a way to give our customers a flexible means of querying their data.
In the process, we experimented with several graph databases until we found the one that fit us best. This blog post compares the four graph databases that we tried (AWS Neptune, Neo4J, ArangoDB and RedisGraph) and explains why we eventually chose ArangoDB for our needs.
But first, let’s understand what a Graph Database is.
What is a Graph Database?
A graph database (sometimes referred to as GDB or GraphDB) is a database that uses graph structures to represent and store data, enabling semantic queries over the data points. The graph links data items together through “relationships”: data is stored as nodes, edges, and properties, and is retrieved by following those relationships.
Graph databases are not a new concept. They were introduced about 20 years ago, in the early 2000s. However, they have recently gained traction because they provide a good solution for use cases such as social networking, recommendation systems and fraud detection. Another strong graph use case is the knowledge graph, which stores data in a graph model and uses graph queries to intuitively explore highly connected datasets. For Cycode, a relevant use case is the need to visualize and understand the relations between an organization’s source code management systems, build systems and the large number of cloud resources that make up its development lifecycle.
Graph databases are designed to effectively query many connected data points, unlike the “classic” relational SQL database where joining multiple tables can incur a significant performance penalty. This makes them more flexible and effective to use in such cases.
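To make the model concrete, here is a minimal in-memory sketch of nodes, edges, and properties, with a breadth-first traversal that follows a chain of connections. It is plain Python rather than any of the databases discussed in this post, and every node id, label, and relationship name in it is a made-up example.

```python
from collections import deque

# Nodes: id -> properties (including a label). All values are illustrative.
nodes = {
    "repo-1": {"label": "Repository", "name": "payments-service"},
    "build-1": {"label": "Build", "number": 42},
    "artifact-1": {"label": "Artifact", "name": "payments-service.tar.gz"},
    "bucket-1": {"label": "CloudResource", "name": "release-bucket"},
}

# Edges: (source id, relationship type, target id).
edges = [
    ("repo-1", "TRIGGERS", "build-1"),
    ("build-1", "PRODUCES", "artifact-1"),
    ("artifact-1", "STORED_IN", "bucket-1"),
]


def build_adjacency(edge_list):
    """Index outgoing edges by source node so each hop is a direct lookup."""
    adjacency = {}
    for source, _relationship, target in edge_list:
        adjacency.setdefault(source, []).append(target)
    return adjacency


def reachable(start, adjacency):
    """Return every node id reachable from start, in breadth-first order."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbor in adjacency.get(current, []):
            if neighbor not in seen:
                seen.add(neighbor)
                order.append(neighbor)
                queue.append(neighbor)
    return order


adjacency = build_adjacency(edges)
print(reachable("repo-1", adjacency))  # ['build-1', 'artifact-1', 'bucket-1']
```

Each hop here is a dictionary lookup on the current node, which is roughly the role that a join between two tables plays in a relational model.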
Why We Chose to Use a Graph Database
When looking at the challenges of securing the development lifecycle, it was clear to us that there was no “one-size-fits-all” solution. Organizations use different tools and processes that define their development lifecycle. Therefore, we needed a solution that would allow us to model all of the lifecycle’s parts (such as SCMs, build systems and cloud environments), define the initial building blocks, and then allow our customers to easily fine-tune it to exactly fit their needs.
Our requirements for such a solution included:
1. Flexible Detection Composition
We needed a flexible way to define and compose advanced detections and insights from all the new data sources we were adding. This included finding a way to answer unique customer requirements, for example discovering a build product from a private repository that was uploaded to a public artifactory.
2. Enhancing Existing Data
We also wanted a solution that could enhance the existing data in our system to create more complex detections. For example, finding which members pushed sensitive data to a repository without a pull request review and then uploaded the built artifact into an artifactory.
3. Simplicity
Due to the variance in our customers’ needs, adjusting our system logic to match each use case is an endless process. We needed a solution that would enable both us and our customers to simply adjust queries when new data is added and get the information in a friendly, intuitive way.
With these requirements in mind, we set out to find a graph database that would fit our needs and combine it with existing datastores.
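As an illustration of the first requirement, the sketch below expresses the "build product from a private repository uploaded to a public artifactory" detection as a two-hop check over a tiny in-memory graph. It is plain Python, not a query in any of the databases compared later, and the node ids, the `visibility` property, and the `BUILT_FROM` and `UPLOADED_TO` edge types are assumptions made up for this example rather than Cycode's actual schema.

```python
# Hypothetical graph: ids, property names and edge types are illustrative only.
nodes = {
    "repo/internal-api": {"type": "repository", "visibility": "private"},
    "repo/docs-site": {"type": "repository", "visibility": "public"},
    "artifact/internal-api-1.0.jar": {"type": "artifact"},
    "artifact/docs-site.zip": {"type": "artifact"},
    "registry/public-mirror": {"type": "artifactory", "visibility": "public"},
}

edges = [
    ("artifact/internal-api-1.0.jar", "BUILT_FROM", "repo/internal-api"),
    ("artifact/internal-api-1.0.jar", "UPLOADED_TO", "registry/public-mirror"),
    ("artifact/docs-site.zip", "BUILT_FROM", "repo/docs-site"),
    ("artifact/docs-site.zip", "UPLOADED_TO", "registry/public-mirror"),
]


def neighbors(source, edge_type):
    """Targets reached from source over edges of the given type."""
    return [t for s, e, t in edges if s == source and e == edge_type]


def find_leaked_artifacts():
    """Artifacts built from a private repository and uploaded to a public registry."""
    findings = []
    for node_id, properties in nodes.items():
        if properties["type"] != "artifact":
            continue
        built_from_private = any(
            nodes[repo]["visibility"] == "private"
            for repo in neighbors(node_id, "BUILT_FROM")
        )
        if not built_from_private:
            continue
        for registry in neighbors(node_id, "UPLOADED_TO"):
            if nodes[registry]["visibility"] == "public":
                findings.append((node_id, registry))
    return findings


print(find_leaked_artifacts())
```

Adding a new data source in this model means adding nodes and edges; the detection itself only changes if the question changes.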
Comparing 4 Different Graph Databases
We experimented with four graph databases: AWS Neptune, Neo4J, ArangoDB and RedisGraph.
AWS Neptune
Amazon Neptune is a managed graph database product offered by Amazon Web Services. It is based on the open-source BlazeGraph DB, which was acquired by Amazon in 2015. In terms of query language, it supports Apache TinkerPop Gremlin and W3C’s SPARQL. It was also the first database we used when we started our POCs.
When it comes to data modeling, AWS Neptune supports a single label per node, which makes it less flexible to later query the graph by node labels. Node properties cannot contain complex types, so such use cases need to be solved with another solution, such as an Elasticsearch integration.
As with any cloud service, you’re charged by the hour based on the database instance specs, as well as on the size of the backup storage and the volume of data transferred in and out of the instance.
AWS Neptune Advantages:
- Fully managed by AWS. It is by nature highly scalable, and ships with all the bells and whistles, such as read replicas, high availability, out-of-the-box encryption at rest, continuous backups and replication across availability zones.
- Supports multi-tenancy, which is critical for any SaaS application, by using a partitioned traversal strategy that effectively encapsulates the fact it serves different tenants, and does it efficiently.
- High performance. It uses various optimization methods, including automatic property indexing, to improve query speed.
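The general idea behind partition-based multi-tenancy can be sketched without Gremlin: every node and edge carries a partition key, and each tenant reads the shared graph through a view that filters on that key. The code below is a plain Python illustration of that idea under our own assumptions, not the TinkerPop or Neptune API; the class names and the `_partition` property are hypothetical.

```python
class PartitionedGraph:
    """A shared graph in which every node and edge carries a tenant partition key."""

    def __init__(self):
        self.nodes = {}  # node id -> properties, including "_partition"
        self.edges = []  # (partition, source id, target id)

    def add_node(self, partition, node_id, **properties):
        self.nodes[node_id] = {"_partition": partition, **properties}

    def add_edge(self, partition, source, target):
        self.edges.append((partition, source, target))

    def view(self, partition):
        return TenantView(self, partition)


class TenantView:
    """Read-only view that filters every lookup by a single partition."""

    def __init__(self, graph, partition):
        self._graph = graph
        self._partition = partition

    def node_ids(self):
        return sorted(
            node_id
            for node_id, properties in self._graph.nodes.items()
            if properties["_partition"] == self._partition
        )

    def neighbors(self, node_id):
        return sorted(
            target
            for partition, source, target in self._graph.edges
            if partition == self._partition and source == node_id
        )


graph = PartitionedGraph()
graph.add_node("tenant-a", "a:repo", kind="repository")
graph.add_node("tenant-a", "a:build", kind="build")
graph.add_node("tenant-b", "b:repo", kind="repository")
graph.add_edge("tenant-a", "a:repo", "a:build")

tenant_a = graph.view("tenant-a")
tenant_b = graph.view("tenant-b")
print(tenant_a.node_ids())           # ['a:build', 'a:repo']
print(tenant_b.neighbors("a:repo"))  # []
```

Because the filter lives in the view rather than in each query, application code written against a view does not need to know that other tenants share the same graph.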
AWS Neptune Disadvantages:
- No self-hosted deployment option. It is offered only as a managed cloud service, so you cannot deploy it to your own environment.
- It is inherently vendor-locked to AWS.
- Limited local development environment. Because there is no self-hosted option, an alternative is needed for local development. We used Gremlin Server for this. Unfortunately, we quite quickly encountered differences in behavior between it and Neptune, which required us to spend additional time verifying that operations behave the same on both.
Gremlin query in AWS Neptune visualized by Gremlin UI
Neo4J
Neo4J is probably the best-known and most widely used graph database, with the largest user community. For querying the database, it uses the Cypher Query Language (CQL), which was developed by Neo4j in conjunction with the community and released as an open source standard under the name OpenCypher.
Initially released in 2010, Neo4j is now offered as a GPLv3-licensed open-source “community edition”, as a commercially licensed product, or as a hosted service with features such as online backup and high availability.
For data modeling, it supports multiple labels per node, which allows more flexibility when querying the graph by node labels. As with Neptune, node properties cannot contain complex types.
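The practical effect of multiple labels per node can be shown with a small label index. This is a plain Python sketch, not Cypher and not Neo4j's internal storage, and the node names and labels are invented for the example.

```python
from collections import defaultdict


def build_label_index(labels_by_node):
    """Map each label to the set of node ids that carry it."""
    index = defaultdict(set)
    for node_id, labels in labels_by_node.items():
        for label in labels:
            index[label].add(node_id)
    return index


# Each node may carry several labels at once (illustrative data).
labels_by_node = {
    "alice": {"User", "Admin"},
    "bob": {"User"},
    "ci-bot": {"User", "ServiceAccount"},
}

index = build_label_index(labels_by_node)

# Query by any single label, or intersect labels for a narrower match.
print(sorted(index["User"]))                            # ['alice', 'bob', 'ci-bot']
print(sorted(index["User"] & index["ServiceAccount"]))  # ['ci-bot']
```

With a single label per node, the second kind of question has to be modeled differently, for example by storing the extra classification as a property and filtering on it.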
Neo4J Advantages:
- Established community and extensive documentation. Neo4j’s large community provides extensive support, and the product itself is well documented with a lot of information, guides and manuals available online.
- First-class support for GraphQL with the Neo4j GraphQL Library. This integration makes it easy to connect a Neo4J database to a GraphQL server, map Cypher queries to a GraphQL schema and even generate a GraphQL schema from an existing database.
- Simple and frictionless setup. With the open-source community version available both as an executable and a docker image, it’s easy to set up a local development environment and get up and running.
Neo4J Disadvantages:
- Tricky multi-tenancy. Starting with v4.0, Neo4j supports multi-tenancy by allowing more than one active database at the same time. In the documentation, we found that the default configuration allows up to 100 databases. The community version, however, is limited to a single database per installation.
- In some scenarios, not as performant as some of its competitors. Neo4j has been benchmarked against several other graph databases in published comparisons, and showed slower performance in the test cases presented.
- Different licenses and editions require extra attention for self-hosted deployments. There are different offerings for the Cloud, Enterprise and Community editions, each with a different license. If you have on-premises customers, each customer might need their own type of license, depending on usage.
Neo4J query visualized in their built-in graph viewer
ArangoDB
ArangoDB is a free and open source database that uses AQL (ArangoDB Query Language), its own query language.
Initially released in 2011, ArangoDB is offered as a free to use product as well as a hosted cloud service with features such as enhanced security, elastic scale and expert support.
Unlike most GraphDBs, ArangoDB is described as a multi-model database. Nodes are essentially documents stored as-is in collections, as in any NoSQL document database, and special edge collections model the relationships between documents across the various collections.
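A rough picture of this layout, using plain Python dictionaries instead of a running ArangoDB instance: documents live in named collections, and each edge is itself a document whose `_from` and `_to` attributes hold `collection/key` identifiers, which is how ArangoDB edge documents reference their endpoints. The collection names, keys, and nested `settings` fields below are invented for illustration.

```python
# Document collections: collection name -> {key: document}. Illustrative data.
collections = {
    "members": {
        "dana": {"name": "Dana", "role": "developer"},
    },
    "repositories": {
        "internal-api": {
            "name": "internal-api",
            "settings": {"visibility": "private", "review": {"required_approvals": 0}},
        },
        "docs-site": {
            "name": "docs-site",
            "settings": {"visibility": "public", "review": {"required_approvals": 2}},
        },
    },
}

# Edge collection: every edge is a document with _from and _to identifiers.
pushed_to = [
    {"_from": "members/dana", "_to": "repositories/internal-api", "commits": 3},
    {"_from": "members/dana", "_to": "repositories/docs-site", "commits": 1},
]


def get_document(document_id):
    """Resolve a "collection/key" identifier to the stored document."""
    collection_name, key = document_id.split("/", 1)
    return collections[collection_name][key]


def get_nested(document, dotted_path):
    """Read a nested property such as "settings.review.required_approvals"."""
    value = document
    for part in dotted_path.split("."):
        value = value[part]
    return value


def unreviewed_repositories(member_id):
    """Repositories the member pushed to that require no review approvals."""
    result = []
    for edge in pushed_to:
        if edge["_from"] != member_id:
            continue
        repository = get_document(edge["_to"])
        if get_nested(repository, "settings.review.required_approvals") == 0:
            result.append(repository["name"])
    return result


print(unreviewed_repositories("members/dana"))  # ['internal-api']
```

The traversal and the filter on a nested property happen against the same documents, so no second data store is involved in answering the question.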
ArangoDB Advantages:
- Easy to get up and running with various deployment options. Both a local development environment and a self-hosted deployment are easy to set up, and both give you the product’s full functionality without any limitations.
- Allows complex types as node properties. Derived from its multi-model approach, storing nodes and edges as documents in collections allows a more streamlined process of modeling data. It allows more flexibility and removes the need for additional integrations to store complex data in a separate store.
- High Performance. Quick query resolution, according to their benchmarks.
- Powerful query language. While this is subjective, we found AQL easy to grasp, with structure that reminded us of a programming language and enabled us to focus on writing queries rather than understanding syntax.
ArangoDB Disadvantages:
- Smaller community. While the product’s documentation is quite thorough, there are some scenarios that are not covered. We did find answered StackOverflow questions in some cases.
- Ecosystem still evolving. GraphQL integration is limited, and the schema needs to be generated manually. Some language drivers aren’t maintained. While Spring Data integration exists, it does not offer full graph functionality and there is no other official ORM-like or OGM support.
ArangoDB query visualized in the built-in graph viewer
RedisGraph
RedisGraph is an open source pluggable graph database module for Redis, developed by RedisLabs. RedisGraph uses OpenCypher as its query language. While it supports the majority of OpenCypher operations, there are still some missing functionalities.
When it comes to data modeling, RedisGraph also supports only a single label per node, and node properties cannot contain complex types.
RedisGraph is offered as an open-source product, with a flexible hosted cloud service option that also offers features such as high availability, backups and customized hardware to get the most efficient performance from its underlying Redis cluster.
RedisGraph Advantages:
- Simple and easy deployment and infrastructure operations. Because it is part of the Redis ecosystem, getting up and running should be simple for anyone who has worked with Redis before.
- Claims high performance thanks to using adjacency matrices. Unlike other GraphDBs, RedisGraph’s implementation is based on using adjacency matrices to model relationships. It claims that this approach leads to high performance compared to its competitors in the field.
- Natively supports multi-tenancy with a graph-per-tenant approach. According to publications and our own research, this approach should scale well even for a large number of tenants.
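To show what modeling relationships as an adjacency matrix means, here is a small sketch in which one traversal hop is a boolean vector-matrix product. It uses dense Python lists for readability; this is our own simplified illustration, not RedisGraph's implementation, and the node names are arbitrary.

```python
def adjacency_matrix(node_ids, edges):
    """Build a matrix where matrix[i][j] == 1 means an edge from node i to node j."""
    position = {node_id: i for i, node_id in enumerate(node_ids)}
    size = len(node_ids)
    matrix = [[0] * size for _ in range(size)]
    for source, target in edges:
        matrix[position[source]][position[target]] = 1
    return matrix


def step(frontier, matrix):
    """One hop: boolean product of the frontier vector with the adjacency matrix."""
    size = len(matrix)
    return [
        int(any(frontier[i] and matrix[i][j] for i in range(size)))
        for j in range(size)
    ]


def k_hop(start, hops, node_ids, matrix):
    """Node ids reachable from start in exactly the given number of hops."""
    frontier = [int(node_id == start) for node_id in node_ids]
    for _ in range(hops):
        frontier = step(frontier, matrix)
    return [node_id for node_id, flag in zip(node_ids, frontier) if flag]


node_ids = ["a", "b", "c", "d"]
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
matrix = adjacency_matrix(node_ids, edges)

print(k_hop("a", 1, node_ids, matrix))  # ['b', 'c']
print(k_hop("a", 2, node_ids, matrix))  # ['d']
```

Expressing a hop as a matrix operation means a multi-hop traversal becomes repeated multiplication, which is the kind of workload that optimized linear algebra routines are built for.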
RedisGraph Disadvantages:
- Less battle tested than its competitors. Released in late 2018 and being a work in progress, it’s still missing some features of the Cypher query language it uses.
- Small ecosystem. Most language drivers offer just a simple wrapper around generic Redis clients. There are no OGM/ORM-like integrations.
RedisGraph query
AWS Neptune vs. Neo4J vs. ArangoDB vs. RedisGraph Comparison Table
Feature | AWS Neptune | Neo4J | ArangoDB | RedisGraph |
Query Language | Gremlin, SPARQL | Cypher | AQL | Subset of OpenCypher |
Deployment Options | Cloud only. A self-hosted replacement is possible with Gremlin Server (not 1:1 compliant with Neptune) | Cloud & Self-Hosted (Enterprise Edition requires commercial license) | Cloud & Self-Hosted | Cloud & Self-Hosted |
Open Source | – | GPLv3 for Community Edition, commercial license needed for Enterprise Edition | Apache License 2.0 | Redis Source Available License |
Multi-Tenancy | Using PartitionStrategy | Database per tenant (Community Edition limited to a single database) | Graph per tenant | Graph per tenant |
Nested types on properties | – | – | Yes (nodes/edges are documents) | – |
Labels per node | 1 | Unlimited | – (collections are queried) | 1 |
Access Control | Standard AWS IAM for management, only full access for data operations | On Enterprise Edition: users and roles granularity. On Cloud: users only | Collection granularity per user | Admin, Read-Write or Read-only roles |
Edge Properties | Yes | Yes | Yes | Yes |
Directed Edges | Yes | Yes | Yes | Yes |
Wrapping it up and choosing a graph database
As we experimented with each product, we found each had its own unique advantages and disadvantages.
When Should You Choose Each One?
- If you need a self-hosted version of your product that can be deployed in your customers’ environments, ArangoDB and RedisGraph would probably be the most convenient options.
- If all you need is a frictionless, fully-managed solution with a strong “mothership” then AWS Neptune could be the way to go.
- If you’re after a battle-tested, well-documented, widely used solution with out-of-the-box GraphQL support, Neo4j could tick all of those boxes.
As mentioned before, we chose ArangoDB as our graph database, for several reasons:
- First, its multi-model approach fitted our needs. Storing nodes as documents in collections allowed us to quickly and easily use the graph to gain powerful insights from different sources by running complex queries on inner properties, without the need for additional data stores.
- Second, its scalable multi-tenancy support assured us that we could keep using it as the number of our customers grows.
- Finally, its simple deployment options allowed us to provide an easy self-hosted solution for our customers that need our product in their environments.
Conclusion
The process of testing the different graph databases was very insightful. The graph database landscape is broader than we initially thought. We learned that there are many aspects to take into consideration when choosing a graph database, and that they might not be the ones you initially considered. To anyone going on this journey, we recommend first pinpointing your technical needs and limitations. Then, create some seed test data that reflects how you expect to use it in a graph. Experiment with each option, and find the one that gives the best combination of fitting your technical needs and answering other considerations like pricing, licensing and deployment options. This will help you choose the best solution for you.
Good luck!