Custom Elasticsearch Mapping in JanusGraph

This is a follow-up to this post. I recommend reading that one before continuing here.

One of the interesting operational quirks of JanusGraph is that it has both a storage layer and an indexing layer. This allows graph traversals to run efficiently on the storage layer, while free-text or complex search operations are handled by the indexing back-end. This may seem like redundant overhead, but for very large graphs it can be the difference between a usable product and a neat mathematical theory.

Generally, when indexing the graph or creating indexes, you will do so via the Gremlin Console, effectively using JanusGraph to create your HBase/Cassandra tables and your Elasticsearch/Solr indexes. With this flexibility come some limitations in what you can do directly within the Gremlin Console. One such example is creating custom mappings within Elasticsearch.

Custom mappings and tokenization are what unleash the power of the Elastic Stack onto your data. Fine-tuning these mappings, tokenization, and other Elasticsearch settings can drastically improve on the defaults that JanusGraph creates. However, as documented in this bug (which I have not gotten a chance to work on), creating custom settings from within the Gremlin Console is not trivial. In this blog post I will walk you through a quick example of how to add custom settings to an Elasticsearch index, working primarily within the Gremlin Console.

Elasticsearch Settings

That bug/feature request alludes to the need to create an Elasticsearch template that serves as the seed for the index created by JanusGraph. In essence, if you want JanusGraph to use custom normalization in Elasticsearch, you first need to create a template that will, in practice, always be applied.

To make this concrete: keyword fields are very powerful for aggregating within Elasticsearch. However, accessing the fieldname.keyword subfield is not supported in JanusGraph, so large aggregations within JanusGraph could easily run your Elasticsearch cluster out of memory. To get keyword functionality in addition to our standard stemmers/tokenizers, we need the following on our ES index:

PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "folding": {
          "tokenizer": "standard",
          "filter":  [ "lowercase", "asciifolding" ]
        }
      },
      "normalizer": {
        "lowerasciinormalizer": {
          "type": "custom",
          "filter":  [ "lowercase", "asciifolding" ]
        }
      }
    }
  },
  "mappings": {
    "_default_": {
      "dynamic_templates": [
        {
          "string_as_keyword": {
            "match_mapping_type": "string",
            "match":   "*_k",
            "mapping": {
              "type": "keyword",
              "normalizer": "lowerasciinormalizer"
            }
          }
        }
      ]
    }
  }
}

(from https://discuss.elastic.co/t/wildcard-case-insensitive-query-string/75050/5)

If you are not familiar with ES syntax, the above says: create a new index called “test”, define a custom analyzer (“folding”) and a custom normalizer (“lowerasciinormalizer”), then add a dynamic template (“string_as_keyword”) that maps any string field whose name ends in “_k” to the keyword type using that normalizer.

As we cannot create settings on the fly with JanusGraph, we need to convert the above to an ES template:

{
  "index_patterns": ["*janusgraph*"],
  "settings": {
    "number_of_shards": 3,
    "analysis": {
      "analyzer": {
        "folding": {
          "tokenizer": "standard",
          "filter":  [ "lowercase", "asciifolding" ]
        }
      },
      "normalizer": {
        "lowerasciinormalizer": {
          "type": "custom",
          "filter":  [ "lowercase", "asciifolding" ]
        }
      }
    }
  }
}

Installing this with the PUT template API instructs ES to apply the template every time an index whose name contains “janusgraph” is created. This means that, to pick it up, all our search indexes in JanusGraph need to contain the string “janusgraph”. To use something else, simply swap out the pattern above.
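As a rough sketch of that step, the Python below builds (but does not send) the PUT request for the `_template` endpoint and checks which index names the pattern would catch. The template name `janusgraph_custom`, the host `localhost:9200`, the example index names, and the helper function names are all assumptions for illustration. `fnmatchcase` only approximates how Elasticsearch evaluates `index_patterns` wildcards.

```python
import json
import urllib.request
from fnmatch import fnmatchcase

# Template body from the post: analysis settings only, no mappings.
TEMPLATE = {
    "index_patterns": ["*janusgraph*"],
    "settings": {
        "number_of_shards": 3,
        "analysis": {
            "analyzer": {
                "folding": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", "asciifolding"],
                }
            },
            "normalizer": {
                "lowerasciinormalizer": {
                    "type": "custom",
                    "filter": ["lowercase", "asciifolding"],
                }
            },
        },
    },
}


def build_template_request(host="http://localhost:9200", name="janusgraph_custom"):
    """Build the PUT _template request. Host and template name are assumptions."""
    return urllib.request.Request(
        url=f"{host}/_template/{name}",
        data=json.dumps(TEMPLATE).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def template_applies(index_name):
    """Approximate the index_patterns wildcard check with shell-style matching."""
    return any(fnmatchcase(index_name, p) for p in TEMPLATE["index_patterns"])


if __name__ == "__main__":
    req = build_template_request()
    print(req.get_method(), req.full_url)
    # To actually install it against a running cluster (not done here):
    # urllib.request.urlopen(req)
    # Hypothetical index names, one containing "janusgraph" and one without it.
    for name in ["janusgraph_testgraph_name_keyword", "testgraph_name_keyword"]:
        print(name, template_applies(name))
```

Keep in mind that a template is only consulted when an index is created, so it has to be in place before JanusGraph builds its first mixed index.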

JanusGraph Usage

Now, within JanusGraph, we need to use what we created above. Following the steps in the GitHub issue, this becomes relatively straightforward.

gremlin> :remote connect tinkerpop.server conf/remote.yaml session
==>Configured localhost/127.0.0.1:8182-[e539ce32-5f22-44bf-a6ed-6015dd755e4d]
gremlin> :remote console
==>All scripts will now be sent to Gremlin Server - [localhost/127.0.0.1:8182]-[e539ce32-5f22-44bf-a6ed-6015dd755e4d] - type ':remote console' to return to local mode
gremlin> map = ConfiguredGraphFactory.getTemplateConfiguration()
gremlin> map.put('storage.cql.keyspace', 'dev_janusgraph_testgraph')
==>dev-janusgraph-configuredgraphmanangement
gremlin> map.put('index.search.index-name', 'janusgraph_testgraph')
==>null
gremlin> map.put('graph.graphname', 'testgraph')
==>null
gremlin> conf = new MapConfiguration(map)
==>org.apache.commons.configuration.MapConfiguration@5ca42c31
gremlin> ConfiguredGraphFactory.createConfiguration(conf)
==>null
gremlin> ConfiguredGraphFactory.getGraphNames()
==>testgraph

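The above copies the template configuration held by the ConfiguredGraphFactory, then sets the CQL keyspace, the search index name, and the graph name for our toy testgraph. Note that the index name contains “janusgraph”, so it matches the ES template pattern. In the next step we will actually use the ES template to get custom mappings: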

gremlin> ConfiguredGraphFactory.getGraphNames()
==>testgraph
gremlin> mgmt = ConfiguredGraphFactory.open('testgraph').openManagement()
==>org.janusgraph.graphdb.database.management.ManagementSystem@37362e19
gremlin> mgmt.printSchema()
==>------------------------------------------------------------------------------------------------
Vertex Label Name              | Partitioned | Static                                             |
---------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------
Edge Label Name                | Directed    | Unidirected | Multiplicity                         |
---------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------
Property Key Name              | Cardinality | Data Type                                          |
---------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------
Vertex Index Name              | Type        | Unique    | Backing        | Key:           Status |
---------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------
Edge Index (VCI) Name          | Type        | Unique    | Backing        | Key:           Status |
---------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------
Relation Index                 | Type        | Direction | Sort Key       | Order    |     Status |
---------------------------------------------------------------------------------------------------

gremlin> mgmt.makeVertexLabel('name_text').make()
==>name_text
gremlin> mgmt.makeVertexLabel('Person').make()
==>Person
gremlin> mgmt.makePropertyKey('name_text').dataType(String.class).make()
==>name_text
gremlin> mgmt.makePropertyKey('name_keyword').dataType(String.class).make()
==>name_keyword
gremlin> name_text = mgmt.getPropertyKey('name_text')
==>name_text
gremlin> name_keyword = mgmt.getPropertyKey('name_keyword')
==>name_keyword
gremlin> mgmt.buildIndex('name_keyword', Vertex.class).addKey(name_keyword, Mapping.STRING.asParameter(), Parameter.of(org.janusgraph.graphdb.types.ParameterType.customParameterName("normalizer"), "lowerasciinormalizer")).buildMixedIndex("search")
==>name_keyword
gremlin> mgmt.buildIndex('name_text', Vertex.class).addKey(name_text).buildMixedIndex("search")
==>name_text
gremlin> mgmt.printSchema()
==>------------------------------------------------------------------------------------------------
Vertex Label Name              | Partitioned | Static                                             |
---------------------------------------------------------------------------------------------------
name_text                      | false       | false                                              |
Person                         | false       | false                                              |
---------------------------------------------------------------------------------------------------
Edge Label Name                | Directed    | Unidirected | Multiplicity                         |
---------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------
Property Key Name              | Cardinality | Data Type                                          |
---------------------------------------------------------------------------------------------------
name_text                      | SINGLE      | class java.lang.String                             |
name_keyword                   | SINGLE      | class java.lang.String                             |
---------------------------------------------------------------------------------------------------
Vertex Index Name              | Type        | Unique    | Backing        | Key:           Status |
---------------------------------------------------------------------------------------------------
name_keyword                   | Mixed       | false     | search         | name_keyword:    ENABLED |
name_text                      | Mixed       | false     | search         | name_text:    ENABLED |
---------------------------------------------------------------------------------------------------
Edge Index (VCI) Name          | Type        | Unique    | Backing        | Key:           Status |
---------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------
Relation Index                 | Type        | Direction | Sort Key       | Order    |     Status |
---------------------------------------------------------------------------------------------------

gremlin> mgmt.commit()
==>null
gremlin>

The above does several things. First it creates a couple of new vertex labels, then it creates the property keys to be used on those vertices, and finally it builds a mixed index for each key. The part that actually enables our lowerasciinormalizer normalizer is this addKey call:

.addKey(name_keyword, Mapping.STRING.asParameter(), Parameter.of(org.janusgraph.graphdb.types.ParameterType.customParameterName("normalizer"), "lowerasciinormalizer")).buildMixedIndex('search')

As the index.search.index-name contains the term ‘janusgraph’, the ES template is applied when JG creates the index in ES, so the ‘lowerasciinormalizer’ normalizer is available through the JG API.

This allows us to do things such as:

gremlin> g = ConfiguredGraphFactory.open('testgraph').traversal()
==>graphtraversalsource[standardjanusgraph[cql:[cassandra]], standard]
gremlin> g.addV("Person").property('name_keyword', 'Levi Lentz').next()
==>v[4112]
gremlin> g.addV('Person').property('name_keyword', 'lEvI LeNtz').next()
==>v[4184]
gremlin> g.tx().commit()
==>null
gremlin> g.V().count()
==>2

And then search for them with:

gremlin> g.V().has('name_keyword', textRegex('.*levi.*')).count().next()
==>2

This returns both hits, even though the JG documentation indicates that it should not, given the case sensitivity of the textRegex search. The normalizer lowercases the stored values at index time, so the lowercase pattern matches both.
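To see why both vertices match, here is a minimal Python sketch that imitates the effect of the normalizer on the stored keyword terms and then applies the same pattern. The function names are hypothetical, the Unicode decomposition is only a rough stand-in for the real lowercase and asciifolding filters, and Python's `re` is not the regex engine Elasticsearch uses, so treat this as an illustration rather than a reproduction of ES behaviour.

```python
import re
import unicodedata


def lower_ascii_normalize(value):
    """Rough stand-in for the lowercase + asciifolding filters (an approximation)."""
    decomposed = unicodedata.normalize("NFKD", value)
    # Drop combining marks so accented letters fold to their base letter.
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_marks.lower()


def regex_hits(stored_values, pattern):
    """Return the values whose normalized term matches the whole pattern."""
    compiled = re.compile(pattern)
    # fullmatch mirrors the fact that the regex must match the entire term.
    return [v for v in stored_values if compiled.fullmatch(lower_ascii_normalize(v))]


if __name__ == "__main__":
    names = ["Levi Lentz", "lEvI LeNtz"]
    print([lower_ascii_normalize(n) for n in names])
    print(len(regex_hits(names, ".*levi.*")))
```

Both names collapse to the same lowercase term, which is why a lowercase pattern finds them. Matching the raw, un-normalized strings against the same pattern would find neither.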

Conclusion

While there are a lot of other bells and whistles within JanusGraph and Elasticsearch, I hope this is a good introduction to some of the more advanced mappings you can achieve within pure JanusGraph. Happy graphing.
