Abstract
So far, we have discussed data modeling in Cassandra and building a large data analytics platform using Hadoop and related technologies. When business requirements demand interconnectedness among various objects, one answer is to run reference (join) queries across multiple objects. Everything looks good until we have to deal with many such relations or the data volume becomes really huge.
LinkedIn, Twitter, and Facebook are popular examples where relationships among registered users can be described as a social graph. Another example could be building a credit history and doing risk analysis before granting loans to a group of customers. Certainly we could implement a solution with multiple tables and joins, but it would not be scalable or performant. That's where the idea of using a graph database comes in handy.
In this chapter we will discuss:
- An introduction to graph concepts
- Graph frameworks and databases
- Titan Graph setup and installation
- Graph use cases
Introduction to Graphs
Graph theory can be traced back to 1736. A graph is a data structure that consists of vertices/nodes and edges. A graph can have zero or multiple edges. A graph without edges is also referred to as an empty graph.
A vertex is a graph node that is connected to other graph nodes/vertices via edges. Each edge is an arc or line which connects multiple vertices.
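To make these terms concrete, here is a minimal sketch of an undirected graph kept as an adjacency list in plain Java. This is our own illustration (the class and method names are not from any graph library): each vertex maps to the set of vertices it shares an edge with.

```java
import java.util.*;

public class SimpleGraph {
    private final Map<String, Set<String>> adj = new HashMap<>();

    // addEdge connects two vertices; for an undirected graph we record both directions
    public void addEdge(String from, String to) {
        adj.computeIfAbsent(from, k -> new LinkedHashSet<>()).add(to);
        adj.computeIfAbsent(to, k -> new LinkedHashSet<>()).add(from);
    }

    // neighbors returns every vertex connected to the given vertex by an edge
    public Set<String> neighbors(String vertex) {
        return adj.getOrDefault(vertex, Collections.emptySet());
    }
}
```

A graph with no entries in the map is the empty graph mentioned above.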
In computer science, a graph can be categorized in many ways. A few of the popular ones include:
- Simple and nonsimple graphs
- Directed and undirected graphs
- Cyclic and acyclic graphs
Simple and Nonsimple Graphs
A simple graph, as the name suggests, is a basic graph having at most one edge between any two vertices and no self loops (see Figure 7-1a). On the other hand, graphs with self loops or multiple edges are nonsimple graphs (see Figure 7-1b). Here, a self loop is an edge that connects a vertex to itself. A graph having multiple edges between the same nodes is also called a multigraph or parallel graph. The graph shown in Figure 7-1c is both a multigraph and a self-loop graph: it contains a self loop on vertex A and multiple edges between vertices A and B.
Directed and Undirected Graphs
Graphs with multiple nodes and directed edges are directed graphs (see Figure 7-2a). Undirected graphs are the graphs having edges without direction (see Figure 7-2b).
Cyclic and Acyclic Graphs
A graph is said to have a cycle if it contains a path with at least one edge that starts and ends at the same vertex, whereas a non-cyclic or acyclic graph doesn’t contain any cycle. Such graphs can be directed or undirected. Figure 7-3 shows graphical representations of cyclic and acyclic graphs.
Figure 7-3a represents a cyclic graph with edges represented by dotted lines connecting the B, C, and D vertices.
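A cycle like the one in Figure 7-3a can also be detected programmatically. The following is a small illustrative sketch (the names are ours) of DFS-based cycle detection for a directed graph: a cycle exists exactly when the search meets a vertex that is still on the current recursion stack.

```java
import java.util.*;

public class CycleDetector {
    public static boolean hasCycle(Map<String, List<String>> adj) {
        Set<String> visited = new HashSet<>();
        Set<String> onStack = new HashSet<>();
        for (String v : adj.keySet()) {
            if (dfs(v, adj, visited, onStack)) return true;
        }
        return false;
    }

    private static boolean dfs(String v, Map<String, List<String>> adj,
                               Set<String> visited, Set<String> onStack) {
        if (onStack.contains(v)) return true;   // back edge => cycle
        if (visited.contains(v)) return false;  // already fully explored, no cycle here
        visited.add(v);
        onStack.add(v);
        for (String w : adj.getOrDefault(v, List.of())) {
            if (dfs(w, adj, visited, onStack)) return true;
        }
        onStack.remove(v);                      // done with this path
        return false;
    }
}
```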
Those are a few basic types of graphs. The next question that may come to mind is whether there are any open source tools, frameworks, or databases specifically built to handle graph-related problems.
Open Source Software for Graphs
Open source software is free, makes it easy to submit bugs and request feature modifications according to our needs, and most importantly is cost effective. In recent years, using open source software in the IT industry has become popular, and more organizations prefer open source solutions. A few considerations before adopting an open source solution are:
- It should be mature and stable
- It should be in active development and must have community support
- There must be systems in production to validate industry usage
We asked the question of whether there are any tools, frameworks, or databases for solving graph-related problems. Well, let’s explore and find out!
Graph Frameworks: TinkerPop
Graph frameworks are used for graph data modeling and visualization. In this section we will discuss TinkerPop ( www.tinkerpop.com/ ) and its feature set. TinkerPop Blueprints is used as a specification by many NoSQL databases (including Titan). Blueprints provides a set of interfaces and implementations for graph data modeling and will be discussed later in this section.
TinkerPop is an open source graph computing framework with multiple components, frameworks, and command-line tools for handling graph data modeling and visualization. In this section we will discuss them individually.
Pipes
Pipes is a dataflow framework that enables the splitting, merging, filtering, and transformation of data from input to output. Computations are evaluated in a memory-efficient, lazy fashion.
Think of a pipeline as vertices connected by edges, with functions for extraction, transformation, and general data computation.
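As an analogy (not the actual Pipes API), the same lazy, stage-by-stage dataflow can be sketched with Java streams, where nothing is computed until a terminal operation pulls data through the pipeline:

```java
import java.util.List;
import java.util.stream.Collectors;

public class PipeAnalogy {
    // Each stage filters or transforms; evaluation is lazy until collect() runs.
    public static List<String> firstNames(List<String> fullNames) {
        return fullNames.stream()
                .filter(n -> n.contains(" "))   // filter stage: keep "first last" entries
                .map(n -> n.split(" ")[0])      // transform stage: extract first name
                .map(String::toLowerCase)       // another transform stage
                .collect(Collectors.toList());  // terminal step triggers the dataflow
    }
}
```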
Gremlin
Gremlin is a graph traversal language that is used for graph query, analysis, and manipulation. The Gremlin distribution comes with built-in API support for Java and Groovy. We will discuss Gremlin at length in relation to Titan in the “Command-line Tools and Clients” section later in this chapter.
Frames
Frames exposes the elements of a Blueprints graph as Java objects. Instead of writing software in terms of vertices and edges, with Frames, software is written in terms of domain objects and their relationships to each other.
Rexster
Rexster is a multi-faceted graph server that exposes any Blueprints graph through several mechanisms with a general focus on REST. It exposes a graph server via the REST API and RexPro protocol. RexPro is a binary protocol for Rexster that can be used to send Gremlin scripts to a remote Rexster instance. The script is processed on the server and the results serialized and returned to the calling client. It also provides tools for a browser-based interface known as the Dog House and the Rexster console (which will be discussed with the Titan ecosystem).
Furnace
Furnace is a property graph algorithms package. It provides implementations for standard graph analysis algorithms that can be applied to property graphs in a meaningful way. Furnace provides different graph algorithm implementations that are optimized for different graph computing scenarios, such as single-machine graphs and distributed graphs.
Note
Single machine graphs involve graph data over a single node, whereas distributed graphs have data distributed across multiple nodes.
Blueprints
Blueprints, as the name suggests, is a property graph model interface with provided implementations. Databases that implement the Blueprints interface automatically support Blueprints-enabled applications.
Blueprints can be thought of as JDBC (Java DataBase Connectivity) or JPA (Java Persistence API) APIs for graph databases. Most graph databases implement Blueprints. Figure 7-4 shows a representation of how Blueprints can be visualized with the previously mentioned TinkerPop components.
So with this we have covered the TinkerPop framework, its components, and other graph-related concepts. The next question is whether there are any graph-based solutions that can be thought of as graph databases. The next section will answer this question.
Graph as a Database
In comparison to traditional RDBMS, NoSQL databases are less about schema and more about denormalized forms of data. But graph databases offer the flexibility to define relationships between nodes via edges, and that's why it is easy to understand them in terms of RDBMS concepts. Building graph-like queries with an RDBMS is certainly possible but as discussed previously it will be of very limited use. With non-graph databases, the ability to run graph-based queries for traversal or building graph structures is not supported and could be cumbersome to build. Because of inherent graph data structure support, graph databases will have an edge over traditional RDBMS.
A few differences between RDBMS and graph databases are
- There is no need for an index lookup with graph databases, as nodes/vertices are directly aware of the edges they are connected with, whereas with an RDBMS we need to rely on an indexing mechanism.
- Two vertices interconnected via edges can differ in properties and may evolve dynamically, whereas an RDBMS imposes a fixed schema.
- With graph databases, the relationship between two vertices is stored at the record level, whereas with an RDBMS it is defined at the table level.
One thing we need to keep in mind is that the current era of application development is one of using specific technologies for specific needs, which is a good fit for building polyglot or hybrid solutions. This means that when your needs are best served by running graph-like queries and your requirements lend themselves to a faster graph-based model, the answer is simple: use a graph database. A graph database uses nodes and edges and their properties to build and store data in a graph structure. Every element in a graph database has a reference to its adjoining nodes, which means no index lookup is required. Figure 7-5 shows an example of a graph database storing Twitter users as nodes and their follower relationships as edges. Each node contains fname, id, lname, and role as properties, whereas each edge has a property to denote the date when a user became a follower of the adjoining node (i.e., user).
Figure 7-5 shows a Twitter connection and follower graph for users mevivs, chris_n, apress_team, and melissa_m. Here the vertex apress_team is being followed by the mevivs and melissa_m vertices. There is also a transitive relation/traversal: chris_n follows mevivs, mevivs follows apress_team, and apress_team follows chris_n. In the “Gremlin Shell” and “Use Cases” sections, we will refer to the same Twitter example to explore command-line tools and Java APIs in sample exercises. When considering such transitive graph queries, one thing worth discussing is that the way graph databases handle them differs from SQL. Handling transitive queries with SQL is not straightforward and requires complex joins, unions, or intersections. With a graph database it is much easier: we simply follow the edges for incoming and outgoing data using the edges’ properties. The “Use Cases” section will discuss these graph traversals.
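To see why following edges is enough, here is a plain-Java sketch of the transitive “follows” traversal described above. This is our own illustration (a breadth-first walk over an adjacency map), not a graph database: follows maps each user to the users they follow.

```java
import java.util.*;

public class FollowGraph {
    // Returns every user reachable from start by repeatedly following edges.
    public static Set<String> transitiveFollows(Map<String, List<String>> follows, String start) {
        Set<String> reached = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(start);
        while (!queue.isEmpty()) {
            String user = queue.poll();
            for (String next : follows.getOrDefault(user, List.of())) {
                if (reached.add(next)) queue.add(next);  // first visit: remember and expand
            }
        }
        return reached;
    }
}
```

With the triangle from the figure (chris_n → mevivs → apress_team → chris_n), the traversal eventually arrives back at the starting user, with no joins involved.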
Let’s talk about available graph databases. A few of the popular graph databases are
- Neo4J
- OrientDB
- Objectivity’s InfiniteGraph
- Titan
Because the intent of this chapter is to discuss Cassandra and Titan graph databases at length, we will discuss only important features provided by the other databases mentioned here.
Neo4J
Neo4J ( www.neo4j.org/ ) is an open source database licensed under the GPL (General Public License). It was developed by Neo Technology, Inc. It stores data in the form of nodes that are connected by directed edges. A few important features provided include:
- Scalable and highly available data
- Full ACID transaction support
- REST-based access
- Support for the Cypher and Gremlin graph query languages
OrientDB
OrientDB ( www.orientechnologies.com/orientdb/ ) is an Apache 2 licensed NoSQL graph database. It is managed and developed by Luca Garulli of Orient Technologies. A few important features provided by OrientDB are:
- REST-based access
- Full ACID transaction support
- A SQL-like interface for query support
InfiniteGraph
InfiniteGraph ( www.infinitegraph.com/ ) is an enterprise distributed graph database built by Objectivity Inc. A few important features supported by InfiniteGraph are
- Support for concurrency and consistency
- Full ACID transaction support
- A visualization tool
Titan
Titan (thinkaurelius.github.io/titan/) is an Apache licensed scalable graph database built to store a large amount of data in the form of nodes and edges. It supports Cassandra as backend storage. A few of its important features are
- Full-text search and geospatial query support via Lucene/Elasticsearch
- ACID support
- Eventual and immediate consistency
- Support for multiple backend databases, which can be good for polyglot graph-based applications
- Support for Gremlin and Cypher
Titan and its various components will be discussed in the coming sections.
Titan Graph Databases
Titan is a transactional graph database that allows thousands of concurrent users to execute complex graph traversal queries in real time. It provides support for graph data analytics, reporting, and ETL via Hadoop integration; comes with built-in support for Elasticsearch and Lucene for geospatial queries and full-text search; and natively supports the Blueprints TinkerPop graph stack.
Figure 7-6 shows a graphical representation of the Titan ecosystem.
Basic Concepts
In this section, we’ll introduce some basic concepts that are important for understanding Titan graph databases.
Vertex-Centric Indices
A vertex-centric index is specific to a vertex. Most traversal among referencing or non-referencing vertices would be via edges or their properties. Indexing such properties or edge labels can avoid performance overhead. Such indices can also be referred to as local indices. The purpose of vertex-centric indices is to sort and index adjoining edges of a vertex based on an edge’s properties and labels.
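Conceptually (this is an illustration, not Titan’s internal implementation), a vertex-centric index is like keeping each vertex’s incident edges in a map sorted by an edge property, so that range lookups touch only the matching edges instead of scanning all of them:

```java
import java.util.*;

public class VertexCentricIndex {
    // key: edge property value (e.g., a "since" date), value: target vertex ids
    private final NavigableMap<String, List<String>> edgesByProperty = new TreeMap<>();

    public void addEdge(String propertyValue, String targetVertex) {
        edgesByProperty.computeIfAbsent(propertyValue, k -> new ArrayList<>()).add(targetVertex);
    }

    // All edges whose property value falls in [from, to], without a full edge scan.
    public List<String> range(String from, String to) {
        List<String> out = new ArrayList<>();
        edgesByProperty.subMap(from, true, to, true).values().forEach(out::addAll);
        return out;
    }
}
```

In Titan such an index is kept per vertex, which is exactly what makes traversals over high-degree vertices cheap.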
Titan also provides support for Elasticsearch, which can run in standalone or embedded mode with Titan. With Elasticsearch it is also possible to perform full-text searches, geospatial queries, and numeric range queries. Elasticsearch allows us to query over nonindexed properties as well.
Edge Compression
With edge compression, Titan can store compressed metadata and keep memory usage low. It also can store all edge labels and properties within the same data block for faster retrieval.
Graph Partitioning
This is where the underlying database matters the most. In Cassandra, we know that data for one particular row would be stored on one Cassandra node. Titan understands that with keys sorted by vertex ID, it can effectively partition and distribute graph data across Cassandra nodes. The vertex ID is a 64-bit unique value.
By default Titan adopts a random partitioning strategy to randomly assign vertices to nodes. Random partitions are efficient and keep the cluster balanced, but in the case of multiple backend storage options adopted for storing graphs, it would lead to performance issues and require cross-database communication across the instances. With Titan we can also configure explicit partitioning. With explicit partitioning we can control and store traversed subgraphs on same node.
We can enable an explicit partition in Titan as follows:
cluster.partition = true
cluster.max-partitions = 16
ids.flush = false
Here cluster.max-partitions is the maximum number of partitions per cluster. We also need to disable ID flushing (ids.flush = false).
When using Cassandra as the storage backend option, it must be configured with the ByteOrderedPartitioner.
Titan stores vertices with their adjoining edges and properties as a collection. This mechanism is called the adjacency list format. With this mechanism, the edges connecting to a vertex and its properties are stored together in a collocated way. Each row contains a vertex ID as the key and the adjoining edges and properties as cells. Data representation in this format is common across all supported databases.
Figures 7-7 and 7-8 show the Titan data layout and edge storage mechanisms. (The images are from the Titan wiki page at https://github.com/thinkaurelius/titan/wiki/Titan-Data-Model and reused under the Apache License, Version 2.0, www.apache.org/licenses/LICENSE-2.0 .)
The underlying datastore stores each vertex along with its adjoining edges and properties as a single row. These rows are sorted by vertex ID, and cells are sorted by property and edge key. Dynamic cells can be added at run time. Data collocation is very important; that’s why storing the vertex and its adjoining edges as a single row helps achieve high availability.
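The layout described above can be sketched roughly as follows. The key prefixes and types here are our own assumptions for illustration, not Titan’s actual encoding: one row per vertex, keyed by vertex ID, with properties and adjoining edges stored as sorted cells in that row.

```java
import java.util.*;

public class VertexRow {
    final long vertexId;
    // cells sorted by key, mimicking "cells sorted by property and edge key"
    final SortedMap<String, String> cells = new TreeMap<>();

    VertexRow(long vertexId) { this.vertexId = vertexId; }

    void putProperty(String name, String value)  { cells.put("prop:" + name, value); }
    void putEdge(String label, long otherVertex) { cells.put("edge:" + label, Long.toString(otherVertex)); }
}
```

Because everything about a vertex lives in one sorted row, reading a vertex together with its edges is a single contiguous read in the underlying store.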
Backend Stores
Titan’s storage architecture is totally decoupled and supports multiple NoSQL databases, which are
- Cassandra
- HBase
- BerkeleyDB
- Persistit
Support for multiple NoSQL data stores allows adopting the right one based on application requirements. In other words, you can select specific technology for specific needs. Based on the CAP theorem we may opt for any one of the supported databases.
With Titan we can configure backend storage on the fly using the storage.backend option. Examples in this chapter will cover how to use this option with Cassandra.
Transaction Handling
Titan is a transactional graph database; hence every read/write operation happens within a transaction boundary.
The following code snippet shows how Titan wraps the vertex mevivs in a transaction boundary and commits it:
TitanGraph g = TitanFactory.open("/home/vivek/Titan");
Vertex mevivs = g.addVertex(null); //Implicitly wraps within transaction
mevivs.setProperty("fname", "vivek");
g.commit(); //Commits transaction
With very large data volumes and a polyglot architecture, permanent or temporary failures may happen. Temporary failures are situations such as a network failure, nodes not responding, and similar scenarios. For those we can configure Titan's retry delay property like this:
Configuration conf = new BaseConfiguration();
conf.setProperty("storage.attempt-wait", 250); // time in milliseconds
Such temporary failures can be handled with retries, but a permanent failure, like a hardware failure, requires the user to explicitly handle TitanException:
try
{
TitanGraph g = TitanFactory.open("/home/vivek/Titan");
Vertex mevivs = g.addVertex(null); //Implicitly wraps within transaction
mevivs.setProperty("fname", "vivek");
g.commit(); //Commits transaction
} catch (TitanException e) {
//configure explicit retry or fast-fail scenarios.
}
Those are the basic concepts and architecture of Titan. Next, we will cover setup and installation of the Titan graph database.
Setup and Installation
We can download the latest Titan distribution from https://github.com/thinkaurelius/titan/wiki/Downloads . The latest version at the time of writing is 0.5.0. After downloading the distribution, extract it to a local folder. We will be referring to the TITAN_HOME variable at many places in this chapter. We can set it as follows:
export TITAN_HOME=/home/vivek/software/titan
Command-line Tools and Clients
With setup and installation in place, the first question that comes to mind is whether there are any command-line clients. Like the CQL shell for Cassandra, is there any option available with Titan for server-side scripting and quick analysis? The Gremlin shell and Rexster are the two command-line options we will explore in this section.
Gremlin Shell
Titan provides support for the Gremlin shell for graph traversal and mutation using the Gremlin query language. Gremlin is a functional language: each step outputs an object, and with “.” (dot) we can access its associated functions. For example,
gremlin> conf = new BaseConfiguration() // step 1
==>org.apache.commons.configuration.BaseConfiguration@2d3c117a
gremlin> conf.setProperty("storage.backend", "cassandrathrift") // step 2
Here conf is a Configuration object created in step 1 whose function is invoked in step 2.
Let’s discuss Gremlin with an exercise. In this recipe we will use the same Twitter example, where users and their tweets are the graph’s vertices, and the relationships between a user and its tweets and followers are edges. In this example we will use Cassandra as the backend storage option. You can opt for running a standalone Cassandra server; otherwise, by default, Titan will start and connect to an embedded one.
1. First we need to connect with Gremlin as in Figure 7-9.
2. Next, we need to initialize a configuration object and set a few Cassandra-specific properties:
gremlin> conf = new BaseConfiguration()
==>org.apache.commons.configuration.BaseConfiguration@2d3c117a
gremlin> conf.setProperty("storage.backend", "cassandrathrift")
==>null
gremlin> conf.setProperty("storage.hostname", "localhost")
==>null
gremlin> conf.setProperty("storage.port", "9160")
==>null
gremlin> conf.setProperty("storage.keyspace", "twitter")
==>null
3. Next, get a Titan graph object:
gremlin> graph = TitanFactory.open(conf)
==>titangraph[cassandrathrift:localhost]
4. Let’s make a few vertex keys and edge labels:
gremlin> graph.makeKey("fname").
dataType(String.class).indexed(Vertex.class).make()
==>fname
gremlin> graph.makeKey("lname").
dataType(String.class).indexed(Vertex.class).make()
==>lname
gremlin> graph.makeKey("twitter_tag").
dataType(String.class).indexed(Vertex.class).make()
==>twitter_tag
gremlin> graph.makeKey("tweeted_at").
dataType(String.class).indexed(Vertex.class).make()
==>tweeted_at
gremlin> graph.makeLabel("has_tweeted").make()
==>has_tweeted
Here fname, lname, twitter_tag, and tweeted_at are vertex keys, and the label has_tweeted will be used for edges in the next step.
5. Let’s create vertices for the user and tweet:
gremlin> vivs = graph.addVertex(null)
==>v[4]
gremlin> vivs.setProperty("fname", "vivek")
==>null
gremlin> vivs.setProperty("lname", "mishra")
==>null
gremlin> vivs.setProperty("twitter_tag", "mevivs")
==>null
gremlin> graph.V("fname","vivek")
==>v[4]
gremlin> tweet = graph.addVertex(null)
==>v[8]
gremlin> tweet.setProperty("body", "Working on Cassandra book for apress")
==>null
gremlin> tweet.setProperty("tweeted_at", "2014-09-21")
==>null
6. Next, add an edge between these two vertices:
gremlin> graph.addEdge(null, vivs, tweet, "has_tweeted")
==>e[2V-4-1E][4-has_tweeted->8]
7. Let’s add apress_team as a user and define a “vivek follows apress_team” relationship edge:
gremlin> apress = graph.addVertex(null)
==>v[12]
gremlin> apress.setProperty("fname", "apress")
==>null
gremlin> apress.setProperty("twitter_tag", "apress_team")
==>null
gremlin> graph.addEdge(null, vivs, apress, "following")
==>e[3r-4-22][4-following->12]
8. We can find a vertex by its key as follows:
gremlin> vivek = graph.V('fname','vivek').next()
==>v[4]
gremlin> vivek.map()
==>twitter_tag=mevivs
==>lname=mishra
==>fname=vivek
9. We can also fetch all outgoing edges from vertex vivek having the relationship has_tweeted:
gremlin> outVertex = vivek.out('has_tweeted').next()
==>v[8]
gremlin> outVertex.map()
==>body=Working on Cassandra book for apress
==>tweeted_at=2014-09-21
The preceding recipe demonstrates a way to populate and traverse through a Twitter graph application using Gremlin query language.
Let’s discuss the Rexster REST API, the Dog House, and Titan Server.
Rexster: Server, Rest API, and the Dog House
As discussed in the TinkerPop section, using the REST API and web console, we can visualize and manage any Titan graph. In a previous recipe we discussed downloading and setting up the Titan distribution on a local box. To start Titan Server, embedded Elasticsearch, and Cassandra, we need to run
TITAN_HOME/bin/titan.sh
Next, to connect with the REST API and the Dog House we need to execute
TITAN_HOME/bin/rexster-console.sh
This will start Elasticsearch, connect to the Elasticsearch transport on port 9300, and get Rexster running on port 8184. The REST API and the Dog House console start on localhost port 8182 (see Figure 7-10).
Rexster Dog House
Figure 7-11 shows the Dog House console with tabs for the Dashboard, the option to browse edges and vertices, and the built-in Gremlin command-line shell.
The built-in Gremlin command-line client allows you to run graph mutation and traversal queries (discussed previously in this chapter).
Let’s explore the graph stored in the Gremlin recipe using Gremlin query language. We can analyze vertices using the Browse Vertices option as shown in Figure 7-12.
We can further drill down to the properties of a specific vertex as well (see Figure 7-13). The figure shows the properties and the incoming and outgoing edges of vertex “vivek”. It has two outgoing edges, “has_tweeted” and “following”, to the tweet and apress_team vertices.
In the same way we can also explore edges and their properties (see Figures 7-14 and 7-15).
Figure 7-15 shows the properties and the connected incoming and outgoing vertices of the edge “has_tweeted”. The vertices mevivs and tweet are connected via this edge, which in lay terms means “Vivek has tweeted a tweet!”
We can also visualize a graph by clicking the search icon which would render a visualization as shown in Figure 7-16.
The figure depicts a graphical representation of three vertices and two edges.
Rexster REST API
We can also query a Titan graph database using the REST API! For example, to get a list of all vertices we simply need to hit a request like this:
localhost:8182/graphs/$graph_name/vertices
For example, we can get a list of all vertices of the “twitter” graph as shown in Figure 7-17.
We can also query for a specific vertex by its ID (see Figure 7-18).
A complete list of all supported REST methods can be found at https://github.com/tinkerpop/rexster/wiki/Basic-REST-API .
Titan with Cassandra
In this section we will discuss the Titan implementation using Cassandra as a storage option. This section will discuss how to use the Titan Java API with Cassandra and perform use cases such as reading and writing to graphs.
Titan Java API
Titan is an implementation of the Blueprints graph interface and provides Java- and Groovy-based APIs for accessing it.
The Titan Java API setup is fairly easy. For backend storage it relies on other databases, so to start using Titan we just need to add
<dependency>
<groupId>com.thinkaurelius.titan</groupId>
<artifactId>titan-all</artifactId>
<version>0.4.4</version>
</dependency>
The latest Titan release at the time of writing is version 0.5.0. For Cassandra, Titan relies on Netflix’s Astyanax Thrift client. The latest version of the TitanGraph API supports Astyanax version 1.56.37. Please note that you may run into dependency issues if a different version of Astyanax is used in the project for other Cassandra-related implementations. CQL support is also very limited with the Astyanax Thrift client; features specific to CQL3 (e.g., collections) may not work properly with this version of Astyanax.
With Cassandra running on remote machines over multiple nodes, we can configure those remote nodes with Titan with a comma-separated list of IP addresses.
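For example, with three hypothetical node addresses (the IPs below are placeholders, not from the text), the configuration might look like this:

```java
// Hypothetical addresses; storage.hostname accepts a comma-separated list of nodes.
Configuration conf = new BaseConfiguration();
conf.setProperty("storage.backend", "cassandrathrift");
conf.setProperty("storage.hostname", "192.168.1.11,192.168.1.12,192.168.1.13");
```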
Cassandra for Backend Storage
As discussed above, Cassandra can be used as a storage backend with Titan. In this section, we will configure Titan storage options, including using Cassandra, and open a graph instance. In the following “Use Cases” section, we will demonstrate how to use Titan with Cassandra, such as with writing and reading from the graph, via some simple exercises.
1. First we need to configure Titan with some storage options:
import org.apache.commons.configuration.BaseConfiguration;
import org.apache.commons.configuration.Configuration;
import com.thinkaurelius.titan.core.TitanFactory;
import com.thinkaurelius.titan.core.TitanGraph;
import com.thinkaurelius.titan.core.TitanKey;

Configuration conf = new BaseConfiguration();
conf.setProperty("storage.backend", "cassandrathrift");
conf.setProperty("storage.hostname", "localhost");
conf.setProperty("storage.port", "9160");
conf.setProperty("storage.keyspace", "twitter");
Table 7-1 outlines and describes the configuration properties we used in the preceding step.
2. Next we need to open a graph instance using TitanFactory:
TitanGraph graph = TitanFactory.open(conf);
Let’s further explore the Titan Java API with Cassandra via a few use cases, such as reading, writing, and batch processing data.
Use Cases
In this section, we will discuss graph traversal, reading, writing, and batch processing with graph data. Let’s first discuss a scenario in which vertices and connecting edges are large in number, which is quite common with big data.
Writing Data to a Graph
After instantiating an instance of a graph, let’s explore writing vertices and incident edges into a graph. We will be discussing the same Twitter example and will build a graph-based implementation for the user, its tweets, and followers.
Figure 7-19 shows a representation of a problem we will be implementing using the TitanGraph API.
1. Let's add a vertex to the graph:
Vertex vivs = graph.addVertex(null);
vivs.setProperty("fname", "vivek");
vivs.setProperty("lname", "mishra");
vivs.setProperty("twitter_tag", "mevivs");
Here, Vertex is an API referred from Blueprints. The following are import statements for the preceding code snippet:
import com.tinkerpop.blueprints.Direction;
import com.tinkerpop.blueprints.Vertex;
You can think of a vertex as a Java POJO and its properties as field variables.
2. Let’s create another vertex for the tweet and define an edge between the user and tweet vertices:
Vertex tweet = graph.addVertex(null);
tweet.setProperty("body", "Working on Cassandra book for apress");
tweet.setProperty("tweeted_at", "2014-09-21");
graph.addEdge(null, vivs, tweet, "has_tweeted"); // User vivs has tweets
3. We can also add another vertex and establish a “following” relationship:
Vertex apress = graph.addVertex(null);
apress.setProperty("fname", "apress");
apress.setProperty("twitter_tag", "apress_team");
graph.addEdge(null, vivs, apress, "following"); // Vivs is following apress team
4. And then finally commit the transaction:
graph.commit();
Reading from the Graph
Let’s explore a bit around reading vertices and properties from a Titan graph:
1. Reading from a graph is also fairly easy; we can retrieve all vertices of a particular graph or even a specific vertex:
Iterable<Vertex> vertices = graph.getVertices();
Iterator<Vertex> iter = vertices.iterator();
2. We can iterate over each vertex and retrieve incident edges like this:
while(iter.hasNext())
{
Vertex v = iter.next();
Iterable<Edge> keys = v.getEdges(Direction.BOTH);
...
}
3. Each edge has IN and OUT vertices, and we can retrieve those vertices via the edge:
for(Edge key : keys)
{
    System.out.print(key.getVertex(Direction.OUT).toString()); // will print vivs on console
    System.out.print("=>");
    System.out.print(key.getLabel());
    System.out.print("=>");
    System.out.println(key.getVertex(Direction.IN).toString()); // will print tweet or apress
}
The preceding reading and writing to a Titan graph provides a simple recipe for how to use Titan with Cassandra.
Cassandra is all about large data processing and analytics. It is no different when working with a graph-based model using Cassandra. So what about batch processing of data with a Titan graph database? Titan does provide support for batch data processing, and in the next example we will explore how to perform batch loading using the Titan Java API.
Batch Loading
Titan provides support for batch loading using the BatchGraph API, which can be thought of as a wrapper around TitanGraph with configurable parameters to define batch size and the type of vertex ID. We can create a BatchGraph instance as follows:
BatchGraph bGraph = new BatchGraph(<TITAN GRAPH INSTANCE>, <VERTEX ID TYPE>, <BATCH SIZE>);
Using the bulk loading API, we can push a batch of records with a single database call, which makes graph data loading much faster.
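The batching idea itself can be sketched generically in plain Java. This is our illustration of the pattern, not the BatchGraph implementation: buffer writes and flush them in one call once the configured batch size is reached.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class BatchBuffer<T> {
    private final int batchSize;
    private final Consumer<List<T>> flusher;   // e.g., one database call per batch
    private final List<T> pending = new ArrayList<>();
    int flushes = 0;

    public BatchBuffer(int batchSize, Consumer<List<T>> flusher) {
        this.batchSize = batchSize;
        this.flusher = flusher;
    }

    public void add(T record) {
        pending.add(record);
        if (pending.size() >= batchSize) flush();  // full batch: push in one call
    }

    public void flush() {
        if (pending.isEmpty()) return;
        flusher.accept(new ArrayList<>(pending));
        pending.clear();
        flushes++;
    }
}
```

Loading n records costs roughly n / batchSize calls instead of n, which is where the speedup comes from.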
Let’s explore more about bulk loading in Titan with a sample Java recipe. In this example, we will read data from a .csv file.
Figure 7-20 shows data in the format of User A following User B. Each user has fname, lname, and twitter_tag as properties of the vertex, where an edge label is following and contains a property value as Cassandra. Please note that you can find the sample .csv file with source code for this book under the src/main/resources folder.
Follow these steps to complete the recipe:
1.
First, the common step is to configure a graph for Cassandra:
import org.apache.commons.configuration.BaseConfiguration;
import org.apache.commons.configuration.Configuration;
import com.thinkaurelius.titan.core.TitanFactory;
import com.thinkaurelius.titan.core.TitanGraph;
import com.thinkaurelius.titan.core.TitanKey;

Configuration conf = new BaseConfiguration();
conf.setProperty("storage.backend", "cassandrathrift");
conf.setProperty("storage.hostname", "localhost");
conf.setProperty("storage.port", "9160");
conf.setProperty("storage.keyspace", "batchprocess");
conf.setProperty("storage.batch-loading", "true");
2.
Let’s load the sample .csv file using FileReader:
File file = new File("src/main/resources/bulk_load.csv");
BufferedReader reader = new BufferedReader(new FileReader(file));
3.
Next, create an instance of a graph and wrap it with BatchGraph:
TitanGraph graph = TitanFactory.open(conf);
BatchGraph bgraph = new BatchGraph(graph, VertexIDType.STRING, 1000);
Here 1000 is the batch size and the vertex ID is of string type.
4.
Now let’s define each vertex property as a vertex key and each edge’s property as a label key:
// prepare vertex key for each property
KeyMaker maker = graph.makeKey("twitter_tag");
maker.dataType(String.class);
maker.make();
graph.makeKey("fname").dataType(String.class).make();
graph.makeKey("lname").dataType(String.class).make();
// prepare edge properties as label
LabelMaker labelMaker = graph.makeLabel("contentType");
labelMaker.make();
graph.makeLabel("following").make();
Here KeyMaker and LabelMaker are builder classes provided by the Titan Java API, which are used to prepare vertex keys and edge labels.
5.
Now let’s read line by line from the file and extract vertex and edge properties:
while (reader.ready())
{
String line = reader.readLine();
StringTokenizer tokenizer = new StringTokenizer(line, ",");
while (tokenizer.hasMoreTokens())
{
// System.out.println(tokenizer.nextToken());
// twitter_tag,fname,lname,twitter_tag,fname,lname,edgeName,edgeProperty
final String in_twitter_tag = tokenizer.nextToken();
final String in_fname = tokenizer.nextToken();
final String in_lname = tokenizer.nextToken();
final String out_twitter_tag = tokenizer.nextToken();
final String out_fname = tokenizer.nextToken();
final String out_lname = tokenizer.nextToken();
final String edgeName = tokenizer.nextToken();
final String edgeProperty = tokenizer.nextToken();
...
}
}
6.
Now, inside the inner while loop from step 5, create the in and out vertices and assign an edge as follows:
//in vertex
Vertex in = bgraph.addVertex(Math.random() + "");
in.setProperty("twitter_tag", in_twitter_tag);
in.setProperty("fname", in_fname);
in.setProperty("lname", in_lname);
//out vertex
Vertex out = bgraph.addVertex(Math.random() + "");
out.setProperty("twitter_tag", out_twitter_tag);
out.setProperty("fname", out_fname);
out.setProperty("lname", out_lname);
//assign edge
bgraph.addEdge(null, in, out, edgeName);
7.
Finally, we can call commit after successfully reading all records from the .csv file and populating the BatchGraph:
bgraph.commit();
Here the batch size is the number of vertices and edges to be loaded before commit is invoked on the graph. Take care to set a moderate batch size, to avoid heap problems while processing a big graph having millions or billions of edges.
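BatchGraph flushes each time the configured batch size is reached, trading fewer database round trips against more uncommitted state held in memory. The sketch below is only an illustration of that flush-every-N pattern behind batch loading, using a hypothetical `BufferedLoader` class in plain Java, not the Titan API:

```java
import java.util.*;

// Hypothetical illustration of the flush-every-N pattern behind batch loading.
public class BufferedLoader {
    private final int batchSize;
    private final List<String> buffer = new ArrayList<>();
    int flushes = 0;  // number of "commits" issued

    BufferedLoader(int batchSize) { this.batchSize = batchSize; }

    void add(String record) {
        buffer.add(record);
        if (buffer.size() >= batchSize) flush();
    }

    void flush() {  // stands in for invoking commit on the graph
        if (buffer.isEmpty()) return;
        buffer.clear();
        flushes++;
    }

    public static void main(String[] args) {
        BufferedLoader loader = new BufferedLoader(1000);
        for (int i = 0; i < 2500; i++) loader.add("edge-" + i);
        loader.flush();  // final commit for the remainder
        // 2500 records with batch size 1000 => 3 commits instead of 2500 calls
        System.out.println(loader.flushes);
    }
}
```

A larger batch size reduces the number of commits further, but every buffered record occupies heap until its batch is flushed, which is why a moderate value is safer for very large graphs.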
The Supernode Problem
In the real world, big data-based graphs can be very large, and there can be a group of vertices having a very high number of incident edges. In graph theory, such vertices are called supernodes. With so many complex paths, a random traversal in a graph can lead us to such supernodes, which would badly affect the system’s performance.
Figure 7-21 shows my LinkedIn social graph, where the marked vertices can be termed supernodes.
For example, if I need to traverse through all connections for a particular group (say graphDB), such traversal without indices would lead to performance issues. With incident edges indexed by label, such lookups will be much quicker.
For example, if we want to find all friends of Vivek who have joined the “graphDB” group, a random traversal would require searching every connection of Vivek’s and then scanning the groups joined by each of his friends. Imagine that Vivek has a large number of connections on LinkedIn; a random traversal in that case would be a nightmare. Using a label-indexed query, it will be much quicker:
g.query().has("friends of", EQUAL, "vivek").has("group", EQUAL, "graphDB").vertices();
This query is a label-indexed query, which searches all of Vivek’s friends using “friends of” and then searches the remaining subset for graphDB using the “group” indexed edge.
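The difference between a random traversal and a label-indexed lookup can be pictured with plain Java collections: a full scan touches every incident edge of the supernode, whereas an index keyed by edge label jumps straight to the matching subset. This is only a model of the idea, not Titan's index implementation:

```java
import java.util.*;

public class IndexDemo {
    public static void main(String[] args) {
        // Incident edges of one heavily connected vertex, as (label, target) pairs.
        List<String[]> edges = List.of(
            new String[]{"friends of", "vivek"},
            new String[]{"group", "graphDB"},
            new String[]{"group", "nosql"},
            new String[]{"friends of", "john"});

        // Random traversal: scan every incident edge looking for a match.
        List<String> scanHits = new ArrayList<>();
        for (String[] e : edges)
            if (e[0].equals("group") && e[1].equals("graphDB")) scanHits.add(e[1]);

        // Label index: bucket edges by label once, then look up directly.
        Map<String, List<String>> byLabel = new HashMap<>();
        for (String[] e : edges)
            byLabel.computeIfAbsent(e[0], k -> new ArrayList<>()).add(e[1]);
        List<String> indexed = byLabel.get("group");  // only the "group" edges

        System.out.println(scanHits + " " + indexed);
    }
}
```

With the index in place, only the edges under the "group" label need to be examined, no matter how many other edges the supernode carries.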
We will explore this further in the next section of this chapter.
Faster Deep Traversal
Deep traversal means going to the nth level in the hierarchy of a graph. Let’s take the example from the preceding section, my LinkedIn social graph, where I can query my connections for common group interests. Depending on the data volume and the number of incident edges, iterating over each vertex via a vertex query is probably not a good idea. Titan provides support for multivertex queries, where several vertex queries are combined and sent to the graph database as a single request. Retrieval of data is much faster that way, because we hit the server only once.
Let’s explore how to achieve faster deep traversal using a multivertex query in Titan with the Hindu mythological epic Ramayana. In this example, we will try to find “son of” relationships down to the leaf level. Figure 7-22 shows the family tree of Rama and his ancestors.
One way to find son of relationships at each level is to iterate through each level like this:
private static void iterateToLeaf(Vertex dasratha)
{
System.out.println("Finding sons for::" + dasratha.getProperty("fname"));
Iterable<Vertex> immediateSons = dasratha.getVertices(Direction.IN, "son of");
Iterator<Vertex> iter = immediateSons.iterator();
// one way is
while (iter.hasNext())
{
Vertex v = iter.next();
// recursive call
iterateToLeaf(v);
}
}
In the preceding code snippet, we need to invoke the method with the root vertex object (i.e., dasaratha), and the recursive call will then iterate through each vertex on each level. For smaller graphs this may work, but for big data graphs it is not a feasible solution. This is where the multivertex query comes in very handy and performant.
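The multivertex idea can be modeled without Titan: instead of recursing one vertex at a time, collect the whole frontier of the current level and resolve all of its children in one batched lookup per level. The sketch below uses an in-memory adjacency map (a simplified stand-in for the graph backend, with the same family tree), so a tree of depth d costs d batched lookups rather than one call per vertex:

```java
import java.util.*;

public class LevelTraversal {
    public static void main(String[] args) {
        // "son of" edges reversed into a father -> sons adjacency map.
        Map<String, List<String>> sons = Map.of(
            "dasaratha", List.of("ram", "laxman", "bharat", "shatrugna"),
            "ram", List.of("luv", "kush"),
            "bharat", List.of("Daksha", "Pushkala"),
            "laxman", List.of("Angada", "Chanderkedu"),
            "shatrugna", List.of("Shatrugadi", "Subahu"));

        List<String> frontier = List.of("dasaratha");
        int batchedCalls = 0;
        while (!frontier.isEmpty()) {
            // One "multivertex query" resolves the entire level at once.
            List<String> next = new ArrayList<>();
            for (String father : frontier)
                next.addAll(sons.getOrDefault(father, List.of()));
            batchedCalls++;
            frontier = next;
        }
        // 13 vertices, but only 3 level-by-level lookups (the depth of the tree).
        System.out.println(batchedCalls);
    }
}
```

Per-vertex recursion would issue one lookup for each of the 13 vertices; batching by level cuts that to one round trip per generation, which is the saving the multivertex query delivers against a remote graph database.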
Let’s walk through a few code snippets to execute a multivertex query using Titan. You can find a complete GraphTraversalRunner.java example with the attached source code.
1.
First let’s add dasaratha as root vertex:
Vertex das = add("fname", "dasaratha", graph);
private static Vertex add(final String propertyName, final String value, TitanGraph graph)
{
Vertex vertex = graph.addVertex(null);
vertex.setProperty(propertyName, value);
return vertex;
}
2.
Next, add dasaratha’s son:
// add dasratha's son.
Vertex ram = addSon("fname", "ram", graph, das);
Vertex laxman = addSon("fname", "laxman", graph, das);
Vertex bharat = addSon("fname", "bharat", graph, das);
Vertex shatrugna = addSon("fname", "shatrugna", graph, das);
private static Vertex addSon(String propertyName, String value, TitanGraph graph, Vertex
father)
{
Vertex son = add(propertyName, value, graph);
graph.addEdge(null, son, father, "son of");
return son;
}
3.
Repeat step 2 for ram, laxman, bharat, and shatrugna:
// ram's son
addSon("fname", "luv", graph, ram);
addSon("fname", "kush", graph, ram);
// bharat's son
addSon("fname", "Daksha", graph, bharat);
addSon("fname", "Pushkala", graph, bharat);
// laxman's son
addSon("fname", "Angada", graph, laxman);
addSon("fname", "Chanderkedu", graph, laxman);
// Shatrugna’s son
addSon("fname", "Shatrugadi", graph, shatrugna);
addSon("fname", "Subahu", graph, shatrugna);
4.
Finally store the complete hierarchy:
graph.commit();
5.
Now, to fetch all vertices with Direction.IN (that is, incoming edges) and having the label son of using multivertex support, we need to execute the following query:
//prepare multi vertex query
TitanMultiVertexQuery mq = graph.multiQuery();
mq.direction(Direction.IN).labels("son of");
mq.addVertex((TitanVertex) das);// add root
mq.addVertex((TitanVertex) ram);
mq.addVertex((TitanVertex) bharat);
mq.addVertex((TitanVertex) laxman);
mq.addVertex((TitanVertex) shatrugna);
//execute multi vertex query
Map<TitanVertex, Iterable<TitanVertex>> dfsResult = mq.vertices();
//iterate through result and print
for (TitanVertex key : dfsResult.keySet())
{
System.out.println("Finding sons of " + key.getProperty("fname"));
Iterable<TitanVertex> sons = dfsResult.get(key);
Iterator<TitanVertex> sonIter = sons.iterator();
while (sonIter.hasNext())
{
System.out.println(sonIter.next().getProperty("fname"));
}
}
This way, we can perform faster deep traversal with Titan.
We have covered most of the important features supported by the Titan graph database. The sample code snippets shared in this chapter should enable readers to use Titan with Cassandra for tasks such as building social graphs and network graphs and running graph queries. As far as the future is concerned, active development continues in the graph database world.
With this, we conclude our discussion around graph databases and how Titan can be integrated with Cassandra. For more details and supported feature sets you may refer to https://github.com/thinkaurelius/titan/wiki .
Summary
To summarize, topics covered in this chapter include:
- Introduction to graphs
- Understanding TinkerPop and Blueprints
- Titan database ecosystem
- Titan with Cassandra
The next chapter will walk you through the performance tuning and compaction techniques available with Cassandra.
© 2014 Vivek Mishra
Mishra, V. (2014). Titan Graph Databases with Cassandra. In: Beginning Apache Cassandra Development. Apress, Berkeley, CA. https://doi.org/10.1007/978-1-4842-0142-8_7
Print ISBN: 978-1-4842-0143-5
Online ISBN: 978-1-4842-0142-8