graphio documentation

Graphio is a Python library for bulk loading data to Neo4j. Graphio collects multiple sets of nodes and relationships and loads them to Neo4j. A common example is parsing a set of Excel files to create a Neo4j prototype. Graphio only loads data; it is not meant for querying Neo4j and returning data.

Graphio can serialize data to JSON and CSV files. This is useful for debugging and for storing graph ready data sets.

The primary interface consists of the NodeSet and RelationshipSet classes, which are groups of nodes and relationships with similar properties. Graphio can load these data sets to Neo4j using CREATE or MERGE operations.

Graphio uses the official Neo4j Python driver to connect to Neo4j.

Warning

Graphio was initially built on top of py2neo which is not actively maintained anymore. The most recent version of py2neo still works with graphio but this is not supported anymore. Please switch to the official Neo4j Python driver.

Install

Use pip to install:

pip install -U graphio

Example

Iterate over a file that contains people and the movies they like and extract nodes and relationships. Contents of example file ‘people.tsv’:

Alice; Matrix,Titanic
Peter; Matrix,Forrest Gump
John; Forrest Gump,Titanic

The goal is to create the following data in Neo4j:

  • (Person) nodes
  • (Movie) nodes
  • (Person)-[:LIKES]->(Movie) relationships

# the official Neo4j driver is used to connect to Neo4j
# you always need a Driver instance
from neo4j import GraphDatabase

driver = GraphDatabase.driver('neo4j://localhost:7687', auth=('neo4j', 'password'))

from graphio import NodeSet, RelationshipSet

# define data sets
people = NodeSet(['Person'], merge_keys=['name'])
movies = NodeSet(['Movie'], merge_keys=['title'])
person_likes_movie = RelationshipSet('LIKES', ['Person'], ['Movie'], ['name'], ['title'])

with open('people.tsv') as my_file:
   for line in my_file:
      # prepare data from the line
      name, titles = line.split(';')
      # split up the movies
      titles = titles.strip().split(',')

      # add one (Person) node per line
      people.add_node({'name': name})

      # add (Movie) nodes and :LIKES relationships
      for title in titles:
         movies.add_node({'title': title})
         person_likes_movie.add_relationship({'name': name}, {'title': title}, {'source': 'my_file'})


# create the nodes and relationships in Neo4j, needs a neo4j.Driver instance
people.create(driver)
movies.create(driver)
person_likes_movie.create(driver)

The code in the example should be easy to understand:

  1. Define the data sets you want to add.
  2. Iterate over a data source, transform the data and add to the data sets.
  3. Store data in Neo4j.

Note

The example creates multiple nodes with the same properties. You have to take care of uniqueness yourself.
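
One way to take care of uniqueness in the example is to remember which movie titles have already been seen and to add a (Movie) node only the first time. The sketch below uses the standard library only and plain lists in place of NodeSet and RelationshipSet, so it illustrates the bookkeeping rather than graphio's API. The names `parse_people`, `people`, `movies` and `likes` are made up for this illustration.

```python
import io

# Same content as the example file 'people.tsv', kept inline so the sketch is self-contained.
PEOPLE_FILE = io.StringIO(
    "Alice; Matrix,Titanic\n"
    "Peter; Matrix,Forrest Gump\n"
    "John; Forrest Gump,Titanic\n"
)


def parse_people(lines):
    """Return (people, movies, likes) with each movie title added only once."""
    people, movies, likes = [], [], []
    seen_titles = set()  # titles that already have a (Movie) node

    for line in lines:
        name, titles = line.split(';')
        name = name.strip()
        people.append({'name': name})

        for title in titles.strip().split(','):
            title = title.strip()
            if title not in seen_titles:
                seen_titles.add(title)
                movies.append({'title': title})
            # every line/title pair still yields one relationship
            likes.append(({'name': name}, {'title': title}))

    return people, movies, likes


people, movies, likes = parse_people(PEOPLE_FILE)
print(len(people), len(movies), len(likes))
```

With the three example lines this yields three people, three distinct movies and six relationships. In real code the `append` calls would be replaced by `add_node()` and `add_relationship()`.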

Continue with the Basic Workflow section.

Contents

Basic Workflow

NodeSets

With graphio you predefine the NodeSet and add nodes:

from graphio import NodeSet

people = NodeSet(['Person'], merge_keys=['name'])

people.add_node({'name': 'Peter', 'city': 'Munich'})

The first argument for the NodeSet is a list of labels used for all nodes in this NodeSet. The second, optional argument is merge_keys, a list of property keys that define the uniqueness of the nodes in this NodeSet. All operations based on MERGE queries need unique properties to identify nodes.

When you add a node to the NodeSet you can add arbitrary properties to the node.

Uniqueness of nodes

The uniqueness of the nodes is not checked when adding to the NodeSet. Thus, you can create multiple nodes with the same ‘name’ property.

Use NodeSet.add_unique() to add a node only if no node with the same merge_keys exists yet:

people = NodeSet(['Person'], merge_keys=['name'])

# first time
people.add_unique({'name': 'Jack', 'city': 'London'})
len(people.nodes)  # -> 1

# second time
people.add_unique({'name': 'Jack', 'city': 'London'})
len(people.nodes)  # -> 1

Warning

This function iterates over all nodes each time a new one is added and does not scale well. Use it only for small NodeSets.
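
For larger data sets, a common alternative is to keep a set of merge-key values next to the NodeSet and consult it before adding a node. The following is a minimal sketch in plain Python. `UniqueNodes` is a hypothetical helper written for this illustration, not part of graphio, and it stores nodes in a list where real code would call NodeSet.add_node().

```python
class UniqueNodes:
    """Collect node property dicts, skipping repeats of the same merge-key values."""

    def __init__(self, merge_keys):
        self.merge_keys = list(merge_keys)
        self.nodes = []      # stand-in for a NodeSet
        self._seen = set()   # tuples of merge-key values already added

    def add_unique(self, properties):
        """Add the node unless its merge-key values were seen before; return True if added."""
        key = tuple(properties[k] for k in self.merge_keys)
        if key in self._seen:
            return False
        self._seen.add(key)
        self.nodes.append(properties)  # real code: nodeset.add_node(properties)
        return True


people = UniqueNodes(merge_keys=['name'])
people.add_unique({'name': 'Jack', 'city': 'London'})
people.add_unique({'name': 'Jack', 'city': 'London'})
print(len(people.nodes))
```

The set lookup takes roughly constant time per node, so the cost no longer grows with the size of the node list. The sketch assumes that the merge-key values are hashable (strings, numbers, tuples).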

Default properties

You can set default properties on the NodeSet that are added to all nodes when loading data:

people_in_europe = NodeSet(['Person'], merge_keys=['name'],
                           default_props={'continent': 'Europe'})

RelationshipSets

In a similar manner, a RelationshipSet is predefined and you add relationships to it:

from graphio import RelationshipSet

person_likes_food = RelationshipSet('LIKES', ['Person'], ['Food'], ['name'], ['type'])

person_likes_food.add_relationship(
   {'name': 'Peter'}, {'type': 'Pizza'}, {'reason': 'cheese'}
)

The arguments for the RelationshipSet are:

  • relationship type
  • labels of start node
  • labels of end node
  • property keys to match start node
  • property keys to match end node

When you add a relationship to RelationshipSet all you have to do is to define the matching properties for the start node and end node. You can also add relationship properties.

Default properties

You can set default properties on the RelationshipSet that are added to all relationships when loading data:

person_likes_food = RelationshipSet('LIKES', ['Person'], ['Food'], ['name'], ['type'],
                                    default_props={'source': 'survey'})

Create Indexes

Both NodeSet and RelationshipSet allow you to create indexes to speed up data loading. NodeSet.create_index() creates indexes for all individual merge_keys properties as well as a compound index if there is more than one merge key. RelationshipSet.create_index() creates the indexes required for matching the start node and end node:

from graphio import RelationshipSet
from neo4j import GraphDatabase

driver = GraphDatabase.driver('neo4j://localhost:7687', auth=('neo4j', 'password'))

person_likes_food = RelationshipSet('LIKES', ['Person'], ['Food'], ['name'], ['type'])

person_likes_food.create_index(driver)

This will create single-property indexes for :Person(name) and :Food(type).
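
The rule described above (one index per label/property pair, plus a composite index when there are several keys) can be sketched as a small planning function. This is one reading of the description, not code taken from graphio, and `planned_indexes` is a hypothetical name. The function only lists what would be indexed and does not talk to Neo4j.

```python
from itertools import product


def planned_indexes(labels, keys):
    """List (label, property keys) pairs: one per label/key pair, plus a composite per label."""
    plan = [(label, (key,)) for label, key in product(labels, keys)]
    if len(keys) > 1:
        # a composite index only makes sense with more than one property key
        plan += [(label, tuple(keys)) for label in labels]
    return plan


# start node and end node definitions of the RelationshipSet from the example
print(planned_indexes(['Person'], ['name']) + planned_indexes(['Food'], ['type']))

# a node definition with two merge keys also gets a composite entry
print(planned_indexes(['Person'], ['first_name', 'last_name']))
```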

Load Data

After building NodeSet and RelationshipSet you can create or merge everything in Neo4j.

You need a neo4j.Driver instance to create data. See: https://neo4j.com/docs/api/python-driver/current/api.html#api-documentation

from neo4j import GraphDatabase

driver = GraphDatabase.driver('neo4j://localhost:7687', auth=('neo4j', 'password'))

people.create(driver)
person_likes_food.create(driver)

Warning

Graphio does not check if the nodes referenced in the RelationshipSet actually exist. It is meant to quickly build data sets and load them into Neo4j, not to maintain consistency.
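
If you need such a check, you can run it on your own data before loading. The sketch below works on plain lists of property dictionaries and (start, end, properties) tuples rather than on graphio objects, and `missing_endpoints` is a hypothetical helper. It only knows about nodes collected in memory, so nodes that already exist in Neo4j would be reported as missing.

```python
def missing_endpoints(relationships, start_nodes, end_nodes, start_keys, end_keys):
    """Return the relationships whose start or end node is absent from the node lists."""
    known_starts = {tuple(node[k] for k in start_keys) for node in start_nodes}
    known_ends = {tuple(node[k] for k in end_keys) for node in end_nodes}

    missing = []
    for start, end, props in relationships:
        start_ok = tuple(start[k] for k in start_keys) in known_starts
        end_ok = tuple(end[k] for k in end_keys) in known_ends
        if not (start_ok and end_ok):
            missing.append((start, end, props))
    return missing


people = [{'name': 'Peter'}, {'name': 'Lisa'}]
food = [{'type': 'Pizza'}]
likes = [
    ({'name': 'Peter'}, {'type': 'Pizza'}, {'reason': 'cheese'}),
    ({'name': 'Lisa'}, {'type': 'Sushi'}, {}),  # no Food node with type 'Sushi' was collected
]

print(missing_endpoints(likes, people, food, ['name'], ['type']))
```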

Create

create() will, as the name suggests, create all data. This will create duplicate nodes even if merge_keys are set on a NodeSet.

Merge

merge() will merge on the merge_keys defined on the NodeSet.

The merge operation for NodeSet offers more control.

You can pass a list of properties that should not be overwritten on existing nodes:

NodeSet.merge(driver, preserve=['name', 'currency'])

This is equivalent to:

ON CREATE SET ..all properties..
ON MATCH SET ..all properties except 'name' and 'currency'..

Graphio can also append properties to arrays:

NodeSet.merge(driver, append_props=['source'])

This will create a list for the node property source and append values ON MATCH.

Both can also be set on the NodeSet:

nodeset = NodeSet(['Person'], ['name'], preserve=['country'], append_props=['source'])
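
To make the effect of preserve and append_props concrete, the following sketch simulates the described ON CREATE / ON MATCH behaviour on a plain dictionary instead of a database. It follows the description above and is not graphio's implementation. `merge_node` and `store` are hypothetical names, and the sketch assumes that a preserved property missing on an existing node stays unset.

```python
def merge_node(store, properties, merge_keys, preserve=(), append_props=()):
    """Merge one node into `store`, a dict keyed by the node's merge-key values."""
    key = tuple(properties[k] for k in merge_keys)
    existing = store.get(key)

    if existing is None:
        # ON CREATE: set all properties; appended properties start as one-element lists
        store[key] = {
            k: [v] if k in append_props else v for k, v in properties.items()
        }
        return store[key]

    # ON MATCH: overwrite everything except preserved properties, append where requested
    for k, v in properties.items():
        if k in append_props:
            existing.setdefault(k, []).append(v)
        elif k not in preserve:
            existing[k] = v
    return existing


store = {}
merge_node(store, {'name': 'Lisa', 'country': 'DE', 'source': 'survey'},
           ['name'], preserve=['country'], append_props=['source'])
merge_node(store, {'name': 'Lisa', 'country': 'FR', 'age': 42, 'source': 'crm'},
           ['name'], preserve=['country'], append_props=['source'])
print(store[('Lisa',)])
```

After the second call the country is still 'DE' because it is preserved, the new age property has been added, and source holds both values as a list.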

Group Data Sets in a Container

A Container can be used to group NodeSet and RelationshipSet:

from graphio import Container

my_data = Container()

my_data.add(people)
my_data.add(person_likes_food)

Note

This is particularly useful if you build many NodeSet and RelationshipSet and want to group data sets (e.g. because of dependencies).

You can iterate the NodeSet and RelationshipSet in the Container:

for nodeset in my_data.nodesets:
    nodeset.create(driver)
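
Because relationships are attached to nodes found by their match properties, it is usually safer to load all NodeSets before any RelationshipSet. The sketch below shows that ordering. `load_container` is a hypothetical helper that relies only on the nodesets and relationshipsets attributes and the create() method described in this documentation. The `FakeSet` and `FakeContainer` classes are stand-ins so that the example runs without graphio or a database.

```python
def load_container(container, driver):
    """Create every NodeSet first, then every RelationshipSet."""
    for nodeset in container.nodesets:
        nodeset.create(driver)
    for relationshipset in container.relationshipsets:
        relationshipset.create(driver)


# Stand-ins so the sketch runs without graphio or a database. Real code would
# pass a graphio Container and a neo4j Driver instead.
class FakeSet:
    def __init__(self, name, log):
        self.name = name
        self.log = log

    def create(self, driver):
        self.log.append(self.name)  # record the call order instead of loading data


class FakeContainer:
    def __init__(self, nodesets, relationshipsets):
        self.nodesets = nodesets
        self.relationshipsets = relationshipsets


log = []
container = FakeContainer(
    nodesets=[FakeSet('people', log), FakeSet('food', log)],
    relationshipsets=[FakeSet('person_likes_food', log)],
)
load_container(container, driver=None)
print(log)
```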

Serialization

Graphio can serialize NodeSet and RelationshipSet objects to different formats. This can be used to store processed, graph-ready data in a file.

Graphio supports the following formats for both NodeSet and RelationshipSet objects:

  • combined CSV and JSON files (CSV file with all data and JSON file with metadata), can be deserialized again
  • CSV files with all data (useful for quick tests, cannot be fully deserialized again)
  • JSON files with all data (useful for quick tests with small datasets, contains redundant data)

Combined CSV and JSON files

The most useful serialization format stores the data in a CSV file and the metadata in a JSON file. This avoids redundancy and allows the data to be deserialized again.

Data Format

Nodes

The JSON file with metadata contains at least the following information:

  • the labels (labels)
  • property keys used for MERGE operations (merge_keys)

The CSV file contains the properties of one node per row; the header contains the property keys.

Example:

nodeset.json:

{
    "labels": [
        "Person"
    ],
    "merge_keys": [
        "name"
    ]
}

nodeset.csv:

name,age
Lisa,42
Bob,23

Relationships

The JSON file with metadata contains at least the following information:

  • start node labels
  • end node labels
  • start node property keys to MATCH the start node
  • end node property keys to MATCH the end node
  • relationship type

The CSV file contains one relationship per row. The start node, end node, and relationship properties are indicated by header prefixes (start_, end_, rel_).

Example:

relset.json:

{
  "start_node_labels": ["Person"],
  "end_node_labels": ["Person"],
  "start_node_properties": ["name"],
  "end_node_properties": ["name"],
  "rel_type": "KNOWS"
}

relset.csv:

start_name,end_name,rel_since
Lisa,Bob,2018
Bob,Lisa,2018

Serialize to CSV and JSON

To serialize a NodeSet or RelationshipSet object use to_csv_json_set():

people = NodeSet(['Person'], merge_keys=['name'])

people.add_node({'name': 'Lisa'})
people.add_node({'name': 'Bob'})

# arguments: path to the CSV file, then path to the JSON file
people.to_csv_json_set('people.csv', 'people.json')

knows = RelationshipSet('KNOWS', ['Person'], ['Person'], ['name'], ['name'])
knows.add_relationship({'name': 'Lisa'}, {'name': 'Bob'}, {'since': '2018'})

knows.to_csv_json_set('knows.csv', 'knows.json')
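
To show what the combined format looks like on disk, the sketch below writes and re-reads the relationship example with the standard csv and json modules only. It reproduces the documented layout (metadata in JSON, one row per relationship with start_, end_ and rel_ prefixes) but is not graphio's serializer. `flatten` and `unflatten` are hypothetical helpers, and in-memory buffers stand in for the two files.

```python
import csv
import io
import json

metadata = {
    "start_node_labels": ["Person"],
    "end_node_labels": ["Person"],
    "start_node_properties": ["name"],
    "end_node_properties": ["name"],
    "rel_type": "KNOWS",
}
relationships = [
    ({"name": "Lisa"}, {"name": "Bob"}, {"since": "2018"}),
    ({"name": "Bob"}, {"name": "Lisa"}, {"since": "2018"}),
]


def flatten(start, end, props):
    """Turn one relationship into a CSV row using the start_/end_/rel_ header prefixes."""
    row = {f"start_{k}": v for k, v in start.items()}
    row.update({f"end_{k}": v for k, v in end.items()})
    row.update({f"rel_{k}": v for k, v in props.items()})
    return row


def unflatten(row):
    """Inverse of flatten(): split a CSV row into start, end and relationship dicts."""
    start = {k[len("start_"):]: v for k, v in row.items() if k.startswith("start_")}
    end = {k[len("end_"):]: v for k, v in row.items() if k.startswith("end_")}
    props = {k[len("rel_"):]: v for k, v in row.items() if k.startswith("rel_")}
    return start, end, props


rows = [flatten(*rel) for rel in relationships]

csv_buffer = io.StringIO()  # in-memory stand-in for 'knows.csv'
writer = csv.DictWriter(csv_buffer, fieldnames=list(rows[0]), lineterminator="\n")
writer.writeheader()
writer.writerows(rows)

json_text = json.dumps(metadata, indent=2)  # stand-in for 'knows.json'

# read the CSV back and rebuild the relationship tuples
csv_buffer.seek(0)
restored = [unflatten(row) for row in csv.DictReader(csv_buffer)]
print(csv_buffer.getvalue())
```

Note that CSV stores every value as text. The round trip is exact here only because 'since' was a string to begin with; numeric properties would come back as strings unless they are converted after reading.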

CSV files

Graphio can serialize NodeSet and RelationshipSet objects to CSV files in the same format as the CSV files in the combined CSV/JSON format. This can be useful for quick tests with small datasets.

See NodeSet.to_csv() and RelationshipSet.to_csv() for details:

people = NodeSet(['Person'], merge_keys=['name'])

people.add_node({'name': 'Lisa'})
people.add_node({'name': 'Bob'})

people.to_csv('nodeset.csv')

knows = RelationshipSet('KNOWS', ['Person'], ['Person'], ['name'], ['name'])
knows.add_relationship({'name': 'Lisa'}, {'name': 'Bob'}, {'since': '2018'})

knows.to_csv('relset.csv')

Graphio can generate matching Cypher queries to load these CSV files to Neo4j:

# NodeSet CREATE query
people.create_csv_query('nodeset.csv')

# NodeSet MERGE query
people.merge_csv_query('nodeset.csv')

# RelationshipSet CREATE query
knows.create_csv_query('relset.csv')
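
As an illustration of what such a query can look like for a relationship CSV file, the sketch below assembles a LOAD CSV statement from the header prefixes described earlier. The exact text returned by graphio may differ. `relationship_csv_query` is a hypothetical function, and the file URL is only an example value.

```python
def relationship_csv_query(csv_url, rel_type, start_labels, end_labels,
                           start_keys, end_keys, rel_keys):
    """Build a LOAD CSV query that matches both nodes and creates the relationship."""
    start = ':'.join(start_labels)
    end = ':'.join(end_labels)

    # one condition per match property, using the start_/end_ column prefixes
    conditions = [f"a.{k} = line.start_{k}" for k in start_keys]
    conditions += [f"b.{k} = line.end_{k}" for k in end_keys]

    query = (
        f"LOAD CSV WITH HEADERS FROM '{csv_url}' AS line\n"
        f"MATCH (a:{start}), (b:{end})\n"
        f"WHERE {' AND '.join(conditions)}\n"
        f"CREATE (a)-[r:{rel_type}]->(b)"
    )
    if rel_keys:
        # relationship properties come from the rel_ columns
        assignments = ', '.join(f"r.{k} = line.rel_{k}" for k in rel_keys)
        query += f"\nSET {assignments}"
    return query


print(relationship_csv_query(
    'file:///relset.csv', 'KNOWS', ['Person'], ['Person'], ['name'], ['name'], ['since']
))
```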

JSON files

Note

Deserialization of simple JSON representations is currently not supported. Use the combined JSON/CSV format instead. The JSON serialization can still be useful to test small datasets.

NodeSet and RelationshipSet objects can be serialized to JSON:

people = NodeSet(['Person'], merge_keys=['name'])

people.add_node({'name': 'Lisa'})

people.to_json('nodeset.json')

This will create a JSON file with full node descriptions:

nodeset.json:

{
  "labels": [
      "Person"
  ],
  "merge_keys": [
      "name"
  ],
  "nodes": [
      {
          "name": "Lisa"
      }
  ]
}

The same works with RelationshipSet objects:

person_like_food = RelationshipSet('LIKES', ['Person'], ['Food'], ['name'], ['type'])

person_like_food.add_relationship({'name': 'Lisa'}, {'type': 'Sushi'}, {'since': 'always'})

person_like_food.to_json('relset.json')

Main Classes

NodeSet

class graphio.NodeSet(labels=None, merge_keys=None, batch_size=None, default_props=None, preserve=None, append_props=None, indexed=False, additional_labels: List[str] = None, source: bool = False)

Container for a set of Nodes with the same labels and the same properties that define uniqueness.

add_node(properties)

Create a node in this NodeSet.

Parameters: properties (dict) – Node properties.

add_unique(properties)

Add a node to this NodeSet only if a node with the same merge_keys does not exist yet.

Note: Right now this function iterates all nodes in the NodeSet. This is of course slow for large numbers of nodes. A better solution would be to create an ‘index’ as is done for RelationshipSet.

Parameters: properties (dict) – Node properties.

all_property_keys() → Set[str]

Return a set of all property keys in this NodeSet

Returns: A set of unique property keys of a NodeSet

create(graph, database: str = None, batch_size=None)

Create all nodes from NodeSet.

create_csv_query(filename: str = None, periodic_commit=1000)

Create a Cypher query to load a CSV file created with NodeSet.to_csv() into Neo4j (CREATE statement).

Parameters:
  • filename – Optional filename. A filename will be autocreated if not passed.
  • periodic_commit – Number of rows to commit in one transaction.
Returns:

Cypher query.

create_index(graph, database=None)

Create indices for all label/merge key combinations as well as a composite index if multiple merge keys exist.

classmethod from_csv_json_set(csv_file_path, json_file_path, load_items: bool = False, labels_key: str = None, mergekey_key: str = None)

Read the default CSV/JSON file combination. Needs paths to CSV and JSON file.

JSON keys can be overwritten by passing the respective parameters.

Parameters:
  • csv_file_path – Path to the CSV file.
  • json_file_path – Path to the JSON file.
  • load_items – Yield items from file (False, default) or load them to memory (True).
Returns:

The NodeSet.

merge(graph, merge_properties=None, batch_size=None, preserve=None, append_props=None, database=None)

Merge nodes from NodeSet on merge properties.

Parameters: merge_properties – The merge properties.

merge_csv_query(filename: str = None, periodic_commit=1000)

Create a Cypher query to load a CSV file created with NodeSet.to_csv() into Neo4j (MERGE statement).

Parameters:
  • filename – Optional filename. A filename will be autocreated if not passed.
  • periodic_commit – Number of rows to commit in one transaction.
Returns:

Cypher query.

node_properties()

Yield properties of the nodes in this set. Used for create function.

object_file_name(suffix: str = None) → str

Create a unique name for this NodeSet that indicates content. Pass an optional suffix. NOTE: suffix has to include the ‘.’ for a filename!

nodeset_Label_merge-key_uuid

With suffix:

nodeset_Label_merge-key_uuid.json

to_csv(filepath: str, quoting: int = None) → str

Create a CSV file for this nodeset. Header row is created with all properties. Each row contains the properties of a node.

Example:

>>> nodeset = NodeSet(labels=["Person"], merge_keys=["name"])
>>> nodeset.add_node({"name": "Alice", "age": 33})
>>> nodeset.add_node({"name": "Bob", "age": 44})
>>> nodeset.to_csv("/tmp/Person_name.csv")
'/tmp/Person_name.csv'

name,age
Alice,33
Bob,44

Parameters:
  • filepath – Full path to the CSV file.
  • quoting – Optional quoting setting for csv writer (any of csv.QUOTE_MINIMAL, csv.QUOTE_NONE, csv.QUOTE_ALL etc).

to_csv_json_set(csv_file_path, json_file_path, type_conversion: dict = None)

Write the default CSV/JSON file combination.

Needs paths to CSV and JSON file.

Parameters:
  • csv_file_path – Path to the CSV file.
  • json_file_path – Path to the JSON file.
  • type_conversion – Optional dictionary to convert types of properties.

to_definition()

Create a NodeSetDefinition from this NodeSet. Later, NodeSetDefinition can become parent class of NodeSet.

to_dict()

Create dictionary defining the nodeset.

to_json(target_dir: str, filename: str = None)

Serialize NodeSet to a JSON file in a target directory.

This function is meant for dumping/reloading and not to create a general transport format. The function will likely be optimized for disk space or compressed in future.

update_node(properties: dict)

Update an existing node by overwriting all properties.

Note that this requires NodeSet(…, indexed=True) which is not the default!

Parameters:properties – Node property dictionary.

RelationshipSet

class graphio.RelationshipSet(rel_type, start_node_labels, end_node_labels, start_node_properties, end_node_properties, batch_size=None, default_props=None, source=False)

Container for a set of Relationships with the same type of start and end nodes.

Parameters:
  • rel_type – Relationship type.
  • start_node_labels – Labels of the start node.
  • end_node_labels – Labels of the end node.
  • start_node_properties – Property keys to identify the start node.
  • end_node_properties – Property keys to identify the end node.
  • batch_size – Batch size for Neo4j operations.

add_relationship(start_node_properties: dict, end_node_properties: dict, properties: dict = None)

Add a relationship to this RelationshipSet.

Parameters: properties – Relationship properties.

all_property_keys() → Set[str]

Return a set of all property keys in this RelationshipSet

Returns: A set of unique property keys of the RelationshipSet

create(graph, database=None, batch_size=None)

Create relationships in this RelationshipSet

create_csv_query(query_type: str, filename: str = None, periodic_commit=1000) → str

Generate the CREATE CSV query for this RelationshipSet. The function tries to take care of type conversions.

Note: You can’t use arrays as properties for nodes/relationships when creating CSV files.

LOAD CSV WITH HEADERS FROM xyz AS line
MATCH (a:Gene), (b:Protein)
WHERE a.sid = line.a_sid AND b.sid = line.b_sid AND b.taxid = line.b_taxid
CREATE (a)-[r:MAPS]->(b)
SET r.key1 = line.rel_key1, r.key2 = line.rel_key2

create_index(graph, database=None)

Create indices for start node and end node definition of this relationshipset. If more than one start or end node property is defined, all single property indices as well as the composite index are created.

classmethod from_csv_json_set(csv_file_path, json_file_path, load_items: bool = False, reltype_key=None, startnodeproperties_key=None, endnodeproperties_key=None, startnodelables_key=None, endnodelables_key=None)

Read the default CSV/JSON file combination. Needs paths to CSV and JSON file.

JSON keys can be overwritten by passing the respective parameters.

Parameters:
  • csv_file_path – Path to the CSV file.
  • json_file_path – Path to the JSON file.
  • load_items – Yield items from file (False, default) or load them to memory (True).
Returns:

The RelationshipSet.

merge(graph, database=None, batch_size=None)

Merge relationships in this RelationshipSet

object_file_name(suffix: str = None) → str

Create a unique name for this RelationshipSet that indicates content. Pass an optional suffix. NOTE: suffix has to include the ‘.’ for a filename!

relationshipset_StartLabel_TYPE_EndLabel_uuid

With suffix:

relationshipset_StartLabel_TYPE_EndLabel_uuid.json

to_csv(filepath: str, quoting: int = None) → str

Write the RelationshipSet to a CSV file. The CSV file will be written to the given filepath.

Note: You can’t use arrays as properties for nodes/relationships when creating CSV files.

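# CSV file header
start_sid, end_sid, end_taxid, rel_key1, rel_key2

Parameters:
  • filepath – Full path to the CSV file.
  • quoting – Optional quoting setting for csv writer (any of csv.QUOTE_MINIMAL, csv.QUOTE_NONE, csv.QUOTE_ALL etc).

to_csv_json_set(csv_file_path, json_file_path, write_mode: str = 'w')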

Write the default CSV/JSON file combination.

Needs paths to CSV and JSON file.

Parameters:
  • csv_file_path – Path to the CSV file.
  • json_file_path – Path to the JSON file.
  • write_mode – Write mode for the CSV file.

to_json(target_dir, filename: str = None)

Serialize RelationshipSet to a JSON file in a target directory.

This function is meant for dumping/reloading and not to create a general transport format. The function will likely be optimized for disk space or compressed in future.

Container

class graphio.Container(objects=None)

A container for a collection of Nodes, Relationships, NodeSets and RelationshipSets.

A typical parser function to e.g. read an Excel file produces a mixed output which then has to be processed accordingly.

Also, sanity checks and data statistics are useful.

merge_nodesets()

Merge all node sets if merge_key is defined.

nodesets

Get the NodeSets in the Container.

relationshipsets

Get the RelationshipSets in the Container.

Model Objects

Warning

This is the first iteration of the interface for model objects. The function/class signatures might change in the next releases.

Basic Usage

Defining NodeSet and RelationshipSet objects with plain strings is error-prone.

Graphio offers a simple object graph model system:

from graphio import ModelNode, ModelRelationship, MergeKey

class Person(ModelNode):
    name = MergeKey()

class Food(ModelNode):
    type = MergeKey()

class PersonLikes(ModelRelationship):
    source = Person
    target = Food
    type = 'LIKES'

You can use these classes to create NodeSet and RelationshipSet:

person_nodeset = Person.dataset()
food_nodeset = Food.dataset()

person_likes_food = PersonLikes.dataset()

When adding data to the RelationshipSet you can use the MergeKey properties of the ModelNode classes to avoid typing the properties as strings:

for name, food in [('Susan', 'Pizza'), ('Ann', 'Sushi')]:
    person_likes_food.add_relationship(
        {Person.name: name}, {Food.type: food}
    )

You can set one or multiple Label and MergeKey properties on the ModelNode:

class Person(ModelNode):
    first_name = MergeKey()
    last_name = MergeKey()

    Person = Label()
    Human = Label()

You can override the actual values of the Label and MergeKey:

class Person(ModelNode):
    first_name = MergeKey('first_name')
    last_name = MergeKey('surname')

    Person = Label('Individual')
    Human = Label('HomoSapiens')
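
One way class attributes like these can stand in for property-key strings is the descriptor protocol. The toy below is not graphio's implementation. `ToyMergeKey` is a made-up class that only shows how `Person.name` could evaluate to the string 'name', or to an overridden key, so that it can be used as a dictionary key.

```python
class ToyMergeKey:
    """Class attribute that evaluates to a property-key string."""

    def __init__(self, key=None):
        self.key = key  # explicit key, or None to fall back to the attribute name

    def __set_name__(self, owner, name):
        # called once when the owning class is created
        if self.key is None:
            self.key = name

    def __get__(self, instance, owner):
        # accessing the attribute returns the key string itself
        return self.key


class Person:
    name = ToyMergeKey()
    last_name = ToyMergeKey('surname')


print({Person.name: 'Susan', Person.last_name: 'Miller'})
```

Here `Person.name` yields 'name' and `Person.last_name` yields the overridden key 'surname', so a typo in an attribute name raises an AttributeError instead of silently creating a wrong property key.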

Add data with model instances

You can create instances of the model objects to create individual nodes and relationships:

from graphio import ModelNode, ModelRelationship, MergeKey
from py2neo import Graph

graph = Graph()

class Person(ModelNode):
    name = MergeKey()

class Food(ModelNode):
    type = MergeKey()

class PersonLikes(ModelRelationship):
    source = Person
    target = Food
    type = 'LIKES'

alice = Person(name='Alice')
sushi = Food(type='Sushi')

alice.merge(graph)
sushi.merge(graph)

alice_likes_sushi = PersonLikes(alice, sushi)
alice_likes_sushi.merge(graph)

You can also link nodes without creating ModelRelationship instances:

alice.link(graph, PersonLikes, sushi, since='always')

ModelNode

class graphio.ModelNode(*args, **kwargs)

Base class for model objects.

additional_props

Return all properties except the merge properties.

Returns: Dictionary with all properties except the merge properties.
Return type: dict

classmethod dataset() → graphio.objects.nodeset.NodeSet

Returns: A NodeSet instance for this ModelNode.

classmethod factory(labels: List[str], merge_keys: List[str] = None, name: str = None) → type

Create a class with given labels and merge_keys. The merge_keys are optional but some functions do not work without them.

Parameters:
  • labels – Labels for this ModelNode class.
  • merge_keys – MergeKeys for this ModelNode class.
Returns:

The ModelNode class.

merge_props

Return the merge properties for this node.

Returns: Dictionary with the merge properties for this node.
Return type: dict

ModelRelationship

class graphio.ModelRelationship(source: graphio.model.ModelNode, target: graphio.model.ModelNode, **kwargs)

Base class for model relationships.

Knows about the class of source node and target node (instances of ModelNode) and the relationship type:

class Person(ModelNode):
    name = MergeKey()

class Food(ModelNode):
    name = MergeKey()

class PersonLikesToEat(ModelRelationship):
    source = Person
    target = Food
    type = 'LIKES'

classmethod dataset() → graphio.objects.relationshipset.RelationshipSet

Returns: A RelationshipSet instance for this ModelRelationship.

Helper Classes

class graphio.NodeDescriptor(labels: List[str], properties: dict, merge_keys: List[str] = None)

Unified interface to describe nodes with labels and properties.

NodeDescriptor instances are passed into functions when no ModelNode classes or instances are available.

Setting merge_keys is optional. If they are not set all property keys will be used as merge_keys.

Parameters:
  • labels – Labels for this node.
  • properties – Properties for this node.
  • merge_keys – Optional.
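
The fallback rule for merge_keys can be sketched with a small dataclass. `NodeDescriptorSketch` is a stand-in written for this illustration and is not the graphio class. It mirrors only the three documented parameters and the rule that all property keys are used when merge_keys is not set.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NodeDescriptorSketch:
    labels: List[str]
    properties: dict
    merge_keys: Optional[List[str]] = None

    def __post_init__(self):
        # no merge_keys given: fall back to all property keys
        if self.merge_keys is None:
            self.merge_keys = list(self.properties)


explicit = NodeDescriptorSketch(['Person'], {'name': 'Lisa', 'age': 42}, merge_keys=['name'])
implicit = NodeDescriptorSketch(['Person'], {'name': 'Lisa', 'age': 42})
print(explicit.merge_keys, implicit.merge_keys)
```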
