Dockerized ConfiguredGraphFactory with JanusGraph, Cassandra, and Elasticsearch

As part of my work as a data scientist, I have increasingly been using graph databases to represent person-centric and event-centric data. One of the more interesting graph databases out there is JanusGraph. As the spiritual successor to Titan, it has a lot of bells and whistles that make it well suited to enterprise-scale graphs.

At a basic level, the data is stored in two back-end data stores. One stores the graph itself in HBase (or a Cassandra/CQL-type database), and the other stores the ‘index’ of the data in Solr or Elasticsearch. This is a clever way of getting the database to scale, as the scaling task is off-loaded to the back-end data stores. It is also a neat way of allowing JanusGraph to work in a variety of environments.

While it is a powerful tool, JanusGraph's documentation strategy often amounts to “look at the source code” to understand what is really going on. In this post, I will outline how to set up a dockerized JanusGraph database backed by Elasticsearch and Cassandra, all using a ConfiguredGraphFactory so that it can scale horizontally with user demand.

What is a ConfiguredGraphFactory? In JanusGraph there are multiple ways to specify a graph, which simply means answering “what Cassandra keyspace should I look at and what Elasticsearch index should I query?” Most beginner tutorials specify these settings in a local YAML file. The drawback is that your set of graphs is fixed when the YAML file is created; every time a new graph is added, the YAML file needs to be updated and JanusGraph restarted. The ConfiguredGraphFactory gets around this by placing the configuration in a specific table in Cassandra that is then referenced by all connected JanusGraph nodes, as outlined here. In my opinion, this is the best way to set up JanusGraph if you anticipate having multiple nodes in your cluster.

Docker Setup

I have made a GitHub repo with a docker-compose file and modified JanusGraph configuration files, which you can find here: https://github.com/levilentz/janusgraph-es-cql-docker.

The most important piece is the docker-compose file:

version: '3'

services:
  elasticsearch:
    image: elasticsearch:6.6.0
    expose:
      - 9200
    networks:
      - janusgraph
    volumes:
      - elasticsearch:/usr/share/elasticsearch/data
    restart: always
    container_name: janusgraph_elasticsearch

  cassandra:
    image: cassandra:3
    expose:
      - 7000
      - 7001
      - 7199
      - 9042
      - 9160
      - 9404
    networks:
      - janusgraph
    healthcheck:
      test: ["CMD-SHELL", "[ $$(nodetool statusgossip) = running ]"]
      interval: 30s
      timeout: 10s
      retries: 5
    volumes:
      - cassandra:/var/lib/cassandra
    restart: always
    container_name: janusgraph_cassandra

  janusgraph:
    build:
      context: .
      dockerfile: Dockerfile
    image: janusgraphbuild
    depends_on:
      - elasticsearch
      - cassandra
    ports:
      - "8182:8182"
    networks:
      - janusgraph
    restart: always
    container_name: janusgraph

networks:
  janusgraph:

volumes:
  cassandra:
    external: true
  elasticsearch:
    external: true

To use this, you will need to create external volumes called ‘cassandra’ and ‘elasticsearch’ for persistent storage of the Cassandra and Elasticsearch data. The Elasticsearch and Cassandra versions were chosen to match the versions that JanusGraph 0.4.0 lists as supported, per https://github.com/JanusGraph/janusgraph/releases/tag/v0.4.0. No modification of Cassandra or Elasticsearch is needed.
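Because the compose file marks both volumes as external, docker-compose will not create them for you. A minimal sketch of a helper that creates a named volume only when it is missing is shown below; the function name ensure_volume and the DOCKER override variable are assumptions of this sketch, not part of the repo.

```shell
#!/bin/sh
# Sketch: create an external Docker volume only if it does not exist yet.
# DOCKER is a hypothetical override so another compatible CLI can be swapped in.
DOCKER="${DOCKER:-docker}"

ensure_volume() {
  # 'docker volume inspect' exits non-zero when the named volume is missing.
  if ! "$DOCKER" volume inspect "$1" >/dev/null 2>&1; then
    "$DOCKER" volume create "$1"
  fi
}
```

Calling ensure_volume cassandra and ensure_volume elasticsearch once before the first docker-compose up -d should be enough; running it again later is harmless because existing volumes are left alone.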

The JanusGraph docker image is not the JanusGraph image from Docker Hub, but rather a custom-built JVM container that contains JanusGraph 0.4.0, which you will have to download from the JanusGraph GitHub. The reason is that I had problems customizing the Docker Hub image to work with the optimization recommendations found here: https://www.experoinc.com/post/janusgraph-nuts-and-bolts-part-1-write-performance. I highly recommend reading that article for tips and tricks related to JanusGraph ingest speed.

Once you have JanusGraph 0.4.0 downloaded, place all the files from my GitHub repo into its file structure. These files make JanusGraph work with a ConfiguredGraphFactory and apply some minor optimizations by using docker_entrypoint.sh to convert the values in “environment.env” to JanusGraph settings.

Most of the configuration necessary for the ConfiguredGraphFactory lives in janusgraph-0.4.0-hadoop2/conf/gremlin-server, specifically in gremlin-server.yaml:

# This enables JanusGraphManager, which is compatible with ConfiguredGraphFactory
graphManager: org.janusgraph.graphdb.management.JanusGraphManager
# Only one graph is specified, the ConfigurationManagementGraph
graphs: {
  ConfigurationManagementGraph: conf/gremlin-server/janusgraph-cql-es-server.properties
}

and in janusgraph-cql-es-server.custom.template:

gremlin.graph=org.janusgraph.core.ConfiguredGraphFactory

Once these are set, JanusGraph will use the ConfiguredGraphFactory, letting you take the instance up and down as needed or scale it horizontally. Each new instance will simply read the ConfiguredGraphFactory table within Cassandra as specified by CGFGRAPHNAME. I recommend looking at docker_entrypoint.sh to understand which variables do what in our JanusGraph.
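To make the general idea of turning environment values into a properties file concrete, here is a minimal sketch of a template-rendering step. It is not the repo's docker_entrypoint.sh, which remains the authority: the @NAME@ placeholder syntax, the function name render_template, and any variable other than CGFGRAPHNAME are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch: print a template with each @NAME@ token replaced by the value of
# the shell/environment variable NAME. Placeholder syntax is hypothetical.
# Limitation: values containing '|', '&' or '\' would confuse sed.
render_template() {
  template=$1
  shift
  script=""
  for name in "$@"; do
    # Indirect lookup: copy the value of the variable called $name.
    eval "value=\${$name}"
    script="$script s|@$name@|$value|g;"
  done
  sed "$script" "$template"
}
```

An entrypoint could then write the result to the properties file that gremlin-server.yaml points at, for example render_template some.template CGFGRAPHNAME > some.properties, before starting the server (both file names here are placeholders).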

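One last thing to note about the docker-compose file is the 'restart: always' setting on the JanusGraph service. The depends_on entry only controls start order, not readiness, so JanusGraph will generally start faster than Cassandra and will invariably crash because Cassandra is not yet accepting connections. The perpetual restart guarantees that JanusGraph will eventually start once Cassandra is ready.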
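If you would rather wait for Cassandra explicitly than rely on crash-and-restart, a small retry helper is one option. The sketch below is not part of the repo; the function name retry and its argument order are assumptions, and the probe command you pass in is up to you.

```shell
#!/bin/sh
# Sketch: retry MAX DELAY CMD... runs CMD until it succeeds, at most MAX
# times, sleeping DELAY seconds between attempts.
# Returns 0 on the first success, 1 if every attempt failed.
retry() {
  max=$1
  delay=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$max" ]; do
    if "$@"; then
      return 0
    fi
    echo "attempt $attempt/$max failed: $*" >&2
    attempt=$((attempt + 1))
    # No need to sleep after the final failed attempt.
    if [ "$attempt" -le "$max" ]; then
      sleep "$delay"
    fi
  done
  return 1
}
```

For example, an entrypoint could run retry 30 5 nc -z cassandra 9042 before launching Gremlin Server, assuming nc is installed in the image; 9042 is the CQL port exposed in the compose file above.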

Now that you understand the configuration, bring the cluster up with:

docker-compose up -d 

ConfiguredGraphFactory Elasticsearch Setup

Once JanusGraph is running, we need to specify the Cassandra and Elasticsearch settings for the ConfiguredGraphFactory. To do this, run the following:

docker exec -it janusgraph /app/bin/gremlin.sh

This will bring you to the Gremlin shell within your JanusGraph docker container. Once there, enter the commands shown at the gremlin> prompts below:

Feb 01, 2020 10:20:40 PM java.util.prefs.FileSystemPreferences$1 run
INFO: Created user preferences directory.

         \,,,/
         (o o)
-----oOOo-(3)-oOOo-----
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/app/lib/slf4j-log4j12-1.7.12.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/app/lib/logback-classic-1.1.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
plugin activated: tinkerpop.server
plugin activated: tinkerpop.tinkergraph
22:20:44 WARN  org.apache.hadoop.util.NativeCodeLoader  - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
plugin activated: tinkerpop.hadoop
plugin activated: tinkerpop.spark
plugin activated: tinkerpop.utilities
plugin activated: janusgraph.imports
gremlin> :remote connect tinkerpop.server conf/remote.yaml session
==>Configured localhost/127.0.0.1:8182-[743a5fba-8f39-45ad-b4a2-2c33605a2ea7]
gremlin> :remote console
==>All scripts will now be sent to Gremlin Server - [localhost/127.0.0.1:8182]-[743a5fba-8f39-45ad-b4a2-2c33605a2ea7] - type ':remote console' to return to local mode
gremlin> map = new HashMap<>()
gremlin> map.put('storage.backend', 'cql')
==>null
gremlin> map.put('storage.hostname', 'cassandra')
==>null
gremlin> map.put('storage.username', 'cassandra')
==>null
gremlin> map.put('storage.password', 'cassandra')
==>null
gremlin> map.put('storage.cql.keyspace', 'janusgraph-configuredgraphmanangement')
==>null
gremlin> map.put('index.search.backend','elasticsearch')
==>null
gremlin> map.put('index.search.hostname','elasticsearch')
==>null
gremlin> map.put('index.search.elasticsearch.http.auth.basic.password', 'elasticsearch')
==>null
gremlin> map.put('index.search.elasticsearch.http.auth.basic.username', 'elasticsearch')
==>null
gremlin> map.put('schema.default', 'none')
==>null
gremlin> map.put('storage.batch-loading',true)
==>null
gremlin> map.put('storage.buffer-size',10000)
==>null
gremlin> conf = new MapConfiguration(map)
==>org.apache.commons.configuration.MapConfiguration@58f9756
gremlin> ConfiguredGraphFactory.createTemplateConfiguration(conf)
==>null

Most of those settings are as recommended by the JanusGraph optimization article, but you can learn more by looking here: https://docs.janusgraph.org/basics/configuration-reference/.

Your JanusGraph is now set up to use the ConfiguredGraphFactory (CGF) with the Elasticsearch and Cassandra docker containers.

Graph Creation

I find that the easiest way to create a graph within JanusGraph while using a CGF is to copy the template configuration and then customize it for that specific graph’s needs. To create a vanilla graph, perform the following:

gremlin> map = ConfiguredGraphFactory.getTemplateConfiguration()
==>storage.cql.keyspace=janusgraph-configuredgraphmanangement
==>index.search.hostname=elasticsearch
==>index.search.elasticsearch.http.auth.basic.password=elasticsearch
==>index.search.elasticsearch.http.auth.basic.username=elasticsearch
==>Template_Configuration=true
==>storage.username=cassandra
==>storage.backend=cql
==>storage.hostname=cassandra
==>storage.password=cassandra
==>schema.default=none
==>storage.batch-loading=true
==>index.search.backend=elasticsearch
==>storage.buffer-size=10000
gremlin> map.put('storage.cql.keyspace', 'dev_janusgraph_testgraph')
==>janusgraph-configuredgraphmanangement
gremlin> map.put('index.search.index-name', 'janusgraph_testgraph')
==>null
gremlin> map.put('graph.graphname', 'testgraph')
==>null
gremlin> conf = new MapConfiguration(map)
==>org.apache.commons.configuration.MapConfiguration@5e65620a
gremlin> ConfiguredGraphFactory.createConfiguration(conf)
==>null
gremlin> ConfiguredGraphFactory.getGraphNames()
==>testgraph

As outlined on JanusGraph’s website (https://docs.janusgraph.org/basics/configured-graph-factory/#graph-and-traversal-bindings), you can now access traversal objects with GRAPHNAME_traversal or with ConfiguredGraphFactory.open('GRAPHNAME').traversal(), as the following shows:

gremlin> g = ConfiguredGraphFactory.open('testgraph').traversal()
==>graphtraversalsource[standardjanusgraph[cql:[cassandra]], standard]
gremlin> g.V().count()
==>0
gremlin> g = testgraph_traversal
==>graphtraversalsource[standardjanusgraph[cql:[cassandra]], standard]
gremlin> g.V().count()
==>0

Conclusion

This guide should help you set up a dockerized JanusGraph server with Elasticsearch and Cassandra running in the background. I encourage you to use this as a starting point to understand JanusGraph better. In future posts, I will outline how to start tweaking how JanusGraph interacts with back-end data stores such as Elasticsearch.
