graphio documentation
Graphio is a Python library for bulk loading data into Neo4j. It collects multiple sets of nodes and relationships and loads them into Neo4j. A common example is parsing a set of Excel files to create a Neo4j prototype. Graphio only loads data; it is not meant for querying Neo4j and returning data.
Graphio can serialize data to JSON and CSV files. This is useful for debugging and for storing graph-ready data sets.
The primary interface consists of the NodeSet and RelationshipSet classes, which are groups of nodes and relationships with similar properties. Graphio can load these data sets to Neo4j using CREATE or MERGE operations.
Graphio uses the official Neo4j Python driver to connect to Neo4j.
Warning
Graphio was initially built on top of py2neo which is not actively maintained anymore. The most recent version of py2neo still works with graphio but this is not supported anymore. Please switch to the official Neo4j Python driver.
Example
Iterate over a file that contains people and the movies they like and extract nodes and relationships. Contents of example file ‘people.tsv’:
Alice; Matrix,Titanic
Peter; Matrix,Forrest Gump
John; Forrest Gump,Titanic
The goal is to create the following data in Neo4j:
- (Person) nodes
- (Movie) nodes
- (Person)-[:LIKES]->(Movie) relationships
# the official Neo4j driver is used to connect to Neo4j
# you always need a Driver instance
from neo4j import GraphDatabase
driver = GraphDatabase.driver('neo4j://localhost:7687', auth=('neo4j', 'password'))
from graphio import NodeSet, RelationshipSet
# define data sets
people = NodeSet(['Person'], merge_keys=['name'])
movies = NodeSet(['Movie'], merge_keys=['title'])
person_likes_movie = RelationshipSet('LIKES', ['Person'], ['Movie'], ['name'], ['title'])
with open('people.tsv') as my_file:
    for line in my_file:
        # prepare data from the line
        name, titles = line.split(';')
        # split up the movies
        titles = titles.strip().split(',')
        # add one (Person) node per line
        people.add_node({'name': name})
        # add (Movie) nodes and :LIKES relationships
        for title in titles:
            movies.add_node({'title': title})
            person_likes_movie.add_relationship({'name': name}, {'title': title}, {'source': 'my_file'})
# create the nodes and relationships in Neo4j, needs a neo4j.Driver instance
people.create(driver)
movies.create(driver)
person_likes_movie.create(driver)
The code in the example should be easy to understand:
- Define the data sets you want to add.
- Iterate over a data source, transform the data and add to the data sets.
- Store data in Neo4j.
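The transformation step can be tried without a running Neo4j instance. The following sketch parses the same three lines into plain dictionaries and tuples. It does not use graphio, and `parse_people` is a hypothetical helper introduced only for illustration.

```python
# Standalone sketch: no graphio or Neo4j required.
# parse_people is a hypothetical helper, not part of graphio.

SAMPLE = """Alice; Matrix,Titanic
Peter; Matrix,Forrest Gump
John; Forrest Gump,Titanic
"""


def parse_people(text):
    """Return (people, movies, likes) built from semicolon-separated lines."""
    people, movies, likes = [], [], []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        name, titles = line.split(';')
        name = name.strip()
        people.append({'name': name})
        for title in titles.strip().split(','):
            # one movie entry per (person, title) pair, so titles repeat
            movies.append({'title': title})
            likes.append(({'name': name}, {'title': title}, {'source': 'my_file'}))
    return people, movies, likes


people, movies, likes = parse_people(SAMPLE)
distinct_titles = {movie['title'] for movie in movies}
print(len(people), len(movies), len(distinct_titles))
```

The movie list holds six entries for only three distinct titles, which is why uniqueness needs attention before loading.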
Note
The example creates multiple nodes with the same properties (each movie is added once per person who likes it). You have to take care of uniqueness yourself.
Continue with the Basic Workflow section.
Basic Workflow
NodeSets
With graphio you predefine the NodeSet and add nodes:
from graphio import NodeSet
people = NodeSet(['Person'], merge_keys=['name'])
people.add_node({'name': 'Peter', 'city': 'Munich'})
The first argument for the NodeSet is a list of labels used for all nodes in this NodeSet.
The second, optional argument is merge_keys, a list of properties that confer uniqueness of the nodes in this NodeSet. All operations based on MERGE queries need unique properties to identify nodes.
When you add a node to the NodeSet you can add arbitrary properties to the node.
Uniqueness of nodes
The uniqueness of the nodes is not checked when adding to the NodeSet. Thus, you can create multiple nodes with the same ‘name’ property.
Use NodeSet.add_unique() to check if a node with the same properties already exists:
people = NodeSet(['Person'], merge_keys=['name'])
# first time
people.add_unique({'name': 'Jack', 'city': 'London'})
len(people.nodes)  # -> 1
# second time
people.add_unique({'name': 'Jack', 'city': 'London'})
len(people.nodes)  # -> 1
Warning
This function iterates all nodes when adding a new one and does not scale well. Use only for small nodesets.
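For larger data sets, uniqueness can be tracked outside of graphio before nodes are added. The sketch below keeps a set of merge-key tuples so each check is a hash lookup instead of a scan. `UniqueNodes` is a hypothetical helper, not graphio's implementation, and it assumes the merge-key values are hashable.

```python
# Sketch of a set-based uniqueness check (an assumption, not graphio code).

class UniqueNodes:
    """Hypothetical helper that keeps nodes unique by their merge keys."""

    def __init__(self, merge_keys):
        self.merge_keys = merge_keys
        self.nodes = []
        self._seen = set()  # tuples of merge-key values already added

    def add_unique(self, properties):
        """Add the node unless its merge-key values were seen before."""
        key = tuple(properties.get(k) for k in self.merge_keys)
        if key in self._seen:
            return False  # duplicate, skipped
        self._seen.add(key)
        self.nodes.append(properties)
        return True


unique_people = UniqueNodes(['name'])
first = unique_people.add_unique({'name': 'Jack', 'city': 'London'})
second = unique_people.add_unique({'name': 'Jack', 'city': 'London'})
print(first, second, len(unique_people.nodes))
```

The collected dictionaries can then be passed to `add_node()` one by one.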
RelationshipSets
In a similar manner, a RelationshipSet is predefined and you add relationships:
from graphio import RelationshipSet
person_likes_food = RelationshipSet('LIKES', ['Person'], ['Food'], ['name'], ['type'])
person_likes_food.add_relationship(
    {'name': 'Peter'}, {'type': 'Pizza'}, {'reason': 'cheese'}
)
The arguments for the RelationshipSet are, in order:
- relationship type
- labels of start node
- labels of end node
- property keys to match start node
- property keys to match end node
When you add a relationship to a RelationshipSet, all you have to do is define the matching properties for the start node and end node. You can also add relationship properties.
Default properties
You can set default properties on the RelationshipSet that are added to all relationships when loading data:
person_likes_food = RelationshipSet('LIKES', ['Person'], ['Food'], ['name'], ['type'],
                                    default_props={'source': 'survey'})
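The effect of default properties can be pictured as a dictionary merge. This is a plain-Python sketch of the idea, not graphio's code; in particular, letting an explicitly set property override the default is an assumption made here for illustration.

```python
# Sketch only: how defaults might combine with per-relationship properties.
# Assumption: an explicit property takes precedence over the default.

default_props = {'source': 'survey'}
relationship_props = [
    {'reason': 'cheese'},
    {'reason': 'taste', 'source': 'interview'},
]

# later keys win in a dict merge, so explicit values override defaults
loaded = [{**default_props, **props} for props in relationship_props]
print(loaded)
```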
Create Indexes
Both NodeSet and RelationshipSet allow you to create indexes to speed up data loading.
NodeSet.create_index() creates indexes for all individual merge_keys properties as well as a compound index if more than one merge key is defined.
RelationshipSet.create_index() creates the indexes required for matching the start node and end node:
from graphio import RelationshipSet
from neo4j import GraphDatabase
driver = GraphDatabase.driver('neo4j://localhost:7687', auth=('neo4j', 'password'))
person_likes_food = RelationshipSet('LIKES', ['Person'], ['Food'], ['name'], ['type'])
person_likes_food.create_index(driver)
This will create single-property indexes for :Person(name) and :Food(type).
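To make the resulting indexes concrete, the sketch below builds Cypher index statements for a label and its keys: one per property, plus a composite index when there are several keys. `index_statements` is a hypothetical helper, and the statements graphio actually sends may differ; the `CREATE INDEX IF NOT EXISTS` form shown is the syntax of recent Neo4j versions.

```python
# Sketch: index statements for single properties and a composite index.
# index_statements is hypothetical; graphio may emit different Cypher.

def index_statements(labels, keys):
    statements = []
    for label in labels:
        for key in keys:
            statements.append(
                f"CREATE INDEX IF NOT EXISTS FOR (n:{label}) ON (n.{key})"
            )
        if len(keys) > 1:
            # composite index over all keys together
            props = ", ".join(f"n.{key}" for key in keys)
            statements.append(
                f"CREATE INDEX IF NOT EXISTS FOR (n:{label}) ON ({props})"
            )
    return statements


single = index_statements(['Person'], ['name'])
composite = index_statements(['Person'], ['first_name', 'last_name'])
for statement in single + composite:
    print(statement)
```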
Load Data
After building NodeSet and RelationshipSet objects, you can create or merge everything in Neo4j.
You need a neo4j.Driver instance to create data. See: https://neo4j.com/docs/api/python-driver/current/api.html#api-documentation
from neo4j import GraphDatabase
driver = GraphDatabase.driver('neo4j://localhost:7687', auth=('neo4j', 'password'))
people.create(driver)
person_likes_food.create(driver)
Warning
Graphio does not check if the nodes referenced in the RelationshipSet actually exist. It is meant to quickly build data sets and load them into Neo4j, not to maintain consistency.
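If consistency matters, a check can be run on the raw data before loading. The following sketch lists relationships whose start or end node is missing from the node data. It works on plain dictionaries, and `dangling_relationships` is a hypothetical helper that is not part of graphio.

```python
# Sketch: find relationships that point to nodes missing from the node data.
# dangling_relationships is a hypothetical helper, not a graphio function.

def dangling_relationships(start_nodes, end_nodes, relationships,
                           start_keys, end_keys):
    def key_of(properties, keys):
        return tuple(properties.get(k) for k in keys)

    known_starts = {key_of(node, start_keys) for node in start_nodes}
    known_ends = {key_of(node, end_keys) for node in end_nodes}
    return [
        (start, end)
        for start, end in relationships
        if key_of(start, start_keys) not in known_starts
        or key_of(end, end_keys) not in known_ends
    ]


person_nodes = [{'name': 'Peter', 'city': 'Munich'}]
food_nodes = [{'type': 'Pizza'}]
likes = [
    ({'name': 'Peter'}, {'type': 'Pizza'}),
    ({'name': 'Peter'}, {'type': 'Sushi'}),  # no Sushi node exists
]
missing = dangling_relationships(person_nodes, food_nodes, likes, ['name'], ['type'])
print(missing)
```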
Create
create() will, as the name suggests, create all data. This will create duplicate nodes even if merge_keys are set on a NodeSet.
Merge
merge() will merge on the merge_keys defined on the NodeSet.
The merge operation for NodeSet offers more control.
You can pass a list of properties that should not be overwritten on existing nodes:
NodeSet.merge(driver, preserve=['name', 'currency'])
This is equivalent to:
ON CREATE SET ..all properties..
ON MATCH SET ..all properties except 'name' and 'currency'..
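The ON CREATE / ON MATCH behaviour can be simulated with a dictionary that stands in for the database. This is only a sketch of the semantics described above, not graphio's implementation; `merge_node` and the in-memory `store` are hypothetical.

```python
# Sketch of the preserve semantics with a dict standing in for Neo4j.
# merge_node is a hypothetical helper, not part of graphio.

def merge_node(store, props, merge_keys, preserve=()):
    key = tuple(props[k] for k in merge_keys)
    if key not in store:
        # ON CREATE SET: all properties
        store[key] = dict(props)
    else:
        # ON MATCH SET: all properties except the preserved ones
        for name, value in props.items():
            if name not in preserve:
                store[key][name] = value
    return store[key]


store = {}
merge_node(store, {'name': 'Peter', 'currency': 'EUR', 'city': 'Munich'},
           ['name'], preserve=['name', 'currency'])
merge_node(store, {'name': 'Peter', 'currency': 'USD', 'city': 'Berlin'},
           ['name'], preserve=['name', 'currency'])
peter = store[('Peter',)]
print(peter)
```

After the second merge the city is updated, while the preserved currency keeps its original value.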
Graphio can also append properties to arrays:
NodeSet.merge(driver, append_props=['source'])
This will create a list for the node property source and append values ON MATCH.
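The append behaviour can be sketched in the same way with an in-memory store. The details are assumptions: here the property becomes a one-element list on creation and later values are appended on match. `merge_with_append` is hypothetical and not graphio's code.

```python
# Sketch of append semantics with a dict standing in for Neo4j.
# Assumption: the value is wrapped in a list on create, appended on match.

def merge_with_append(store, props, merge_keys, append_props=()):
    key = tuple(props[k] for k in merge_keys)
    if key not in store:
        node = dict(props)
        for name in append_props:
            if name in node:
                node[name] = [node[name]]  # start the list on create
        store[key] = node
    else:
        node = store[key]
        for name, value in props.items():
            if name in append_props:
                node.setdefault(name, []).append(value)  # append on match
            else:
                node[name] = value
    return store[key]


nodes = {}
merge_with_append(nodes, {'name': 'Peter', 'source': 'survey'}, ['name'], ['source'])
merge_with_append(nodes, {'name': 'Peter', 'source': 'crm'}, ['name'], ['source'])
merged = nodes[('Peter',)]
print(merged)
```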
Both can also be set on the NodeSet:
nodeset = NodeSet(['Person'], ['name'], preserve=['country'], append_props=['source'])
Group Data Sets in a Container
A Container can be used to group NodeSet and RelationshipSet objects:
from graphio import Container
my_data = Container()
my_data.add(people)
my_data.add(person_likes_food)
Note
This is particularly useful if you build many NodeSet and RelationshipSet objects and want to group data sets (e.g. because of dependencies).
You can iterate the NodeSet and RelationshipSet objects in the Container:
for nodeset in my_data.nodesets:
    nodeset.create(driver)
Serialization
Graphio can serialize NodeSet and RelationshipSet objects to different formats.
This can be used to store processed, graph-ready data in a file.
Graphio supports the following formats for both NodeSet and RelationshipSet objects:
- combined CSV and JSON files (CSV file with all data and JSON file with metadata), can be deserialized again
- CSV files with all data (useful for quick tests, cannot be fully deserialized again)
- JSON files with all data (useful for quick tests with small datasets, contains redundant data)
Combined CSV and JSON files
The most useful serialization format stores the data in a CSV file and the metadata in a JSON file. This avoids redundancy and allows the data to be deserialized again.
Data Format
Nodes
The JSON file with metadata contains at least the following information:
- the labels (labels)
- property keys used for MERGE operations (merge_keys)
The CSV file contains the properties of one node per row; the header contains the property keys.
Example:
nodeset.json:
{
    "labels": [
        "Person"
    ],
    "merge_keys": [
        "name"
    ]
}
nodeset.csv:
name,age
Lisa,42
Bob,23
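The file pair is easy to inspect with the standard library alone. The sketch below reads the example metadata and rows from in-memory strings; it does not use graphio's own loader. Note that plain CSV parsing returns every value as a string.

```python
# Sketch: read the node metadata and rows with the standard library only.
import csv
import io
import json

NODESET_JSON = '{"labels": ["Person"], "merge_keys": ["name"]}'
NODESET_CSV = "name,age\nLisa,42\nBob,23\n"

metadata = json.loads(NODESET_JSON)
# each row becomes a dict keyed by the header; values stay strings
node_rows = [dict(row) for row in csv.DictReader(io.StringIO(NODESET_CSV))]

print(metadata['labels'], metadata['merge_keys'])
print(node_rows)
```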
Relationships
The JSON file with metadata contains at least the following information:
- start node labels
- end node labels
- start node property keys to MATCH the start node
- end node property keys to MATCH the end node
- relationship type
The CSV file contains one relationship per row; the start node, end node, and relationship properties are indicated by header prefixes (start_, end_, rel_).
Example:
relset.json:
{
    "start_node_labels": ["Person"],
    "end_node_labels": ["Person"],
    "start_node_properties": ["name"],
    "end_node_properties": ["name"],
    "rel_type": "KNOWS"
}
relset.csv:
start_name,end_name,rel_since
Lisa,Bob,2018
Bob,Lisa,2018
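The header prefixes make it straightforward to split a row back into its three parts. This standard-library sketch shows one way to do that for the example file; `split_row` is a hypothetical helper, and it assumes every column uses one of the three prefixes.

```python
# Sketch: split prefixed CSV columns into start, end and relationship dicts.
import csv
import io

RELSET_CSV = "start_name,end_name,rel_since\nLisa,Bob,2018\nBob,Lisa,2018\n"


def split_row(row):
    """Group the columns of one row by their start_/end_/rel_ prefix."""
    parts = {'start': {}, 'end': {}, 'rel': {}}
    for column, value in row.items():
        prefix, _, key = column.partition('_')  # split at the first underscore
        parts[prefix][key] = value
    return parts['start'], parts['end'], parts['rel']


relationships = [split_row(row) for row in csv.DictReader(io.StringIO(RELSET_CSV))]
print(relationships[0])
```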
Serialize to CSV and JSON
To serialize a NodeSet or RelationshipSet object use to_csv_json_set():
people = NodeSet(['Person'], merge_keys=['name'])
people.add_node({'name': 'Lisa'})
people.add_node({'name': 'Bob'})
# the CSV file path comes first, then the JSON file path
people.to_csv_json_set('people.csv', 'people.json')
knows = RelationshipSet('KNOWS', ['Person'], ['Person'], ['name'], ['name'])
knows.add_relationship({'name': 'Lisa'}, {'name': 'Bob'}, {'since': '2018'})
knows.to_csv_json_set('knows.csv', 'knows.json')
CSV files
Graphio can serialize NodeSet and RelationshipSet objects to CSV files in the same format as the CSV files in the combined CSV/JSON format. This can be useful for quick tests with small datasets.
See NodeSet.to_csv() and RelationshipSet.to_csv() for details:
people = NodeSet(['Person'], merge_keys=['name'])
people.add_node({'name': 'Lisa'})
people.add_node({'name': 'Bob'})
people.to_csv('nodeset.csv')
knows = RelationshipSet('KNOWS', ['Person'], ['Person'], ['name'], ['name'])
knows.add_relationship({'name': 'Lisa'}, {'name': 'Bob'}, {'since': '2018'})
knows.to_csv('relset.csv')
Graphio can generate matching Cypher queries to load these CSV files to Neo4j:
# NodeSet CREATE query
people.create_csv_query('nodeset.csv')
# NodeSet MERGE query
people.merge_csv_query('nodeset.csv')
# RelationshipSet CREATE query
knows.create_csv_query('relset.csv')
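As a rough picture of what such a query can look like, the sketch below assembles a LOAD CSV statement for a node file. The exact text graphio generates may differ; `node_create_query` and the `file:///` location are assumptions made for illustration.

```python
# Sketch: build a LOAD CSV ... CREATE query for a node CSV file.
# node_create_query is hypothetical; graphio's generated query may differ.

def node_create_query(filename, labels, property_keys):
    label_string = ":".join(labels)
    assignments = ", ".join(f"n.{key} = line.{key}" for key in property_keys)
    return (
        f"LOAD CSV WITH HEADERS FROM 'file:///{filename}' AS line\n"
        f"CREATE (n:{label_string})\n"
        f"SET {assignments}"
    )


query = node_create_query('nodeset.csv', ['Person'], ['name', 'age'])
print(query)
```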
JSON files
Note
Deserialization of simple JSON representations is currently not supported. Use the combined JSON/CSV format instead. The JSON serialization can still be useful to test small datasets.
NodeSet and RelationshipSet objects can be serialized to JSON:
people = NodeSet(['Person'], merge_keys=['name'])
people.add_node({'name': 'Lisa'})
people.to_json('nodeset.json')
This will create a JSON file with full node descriptions:
nodeset.json:
{
    "labels": [
        "Person"
    ],
    "merge_keys": [
        "name"
    ],
    "nodes": [
        {
            "name": "Lisa"
        }
    ]
}
The same works with RelationshipSet objects:
person_like_food = RelationshipSet('LIKES', ['Person'], ['Food'], ['name'], ['type'])
person_like_food.add_relationship({'name': 'Lisa'}, {'type': 'Sushi'}, {'since': 'always'})
person_like_food.to_json('relset.json')
Main Classes
NodeSet
class graphio.NodeSet(labels=None, merge_keys=None, batch_size=None, default_props=None, preserve=None, append_props=None, indexed=False, additional_labels: List[str] = None, source: bool = False)
Container for a set of Nodes with the same labels and the same properties that define uniqueness.
- add_node(properties)
  Create a node in this NodeSet.
  Parameters: properties (dict) – Node properties.
- add_unique(properties)
  Add a node to this NodeSet only if a node with the same merge_keys does not exist yet.
  Note: Right now this function iterates all nodes in the NodeSet. This is slow for large numbers of nodes. A better solution would be to create an ‘index’ as is done for RelationshipSet.
  Parameters: properties (dict) – Node properties.
- all_property_keys() → Set[str]
  Return a set of all property keys in this NodeSet.
  Returns: A set of unique property keys of a NodeSet.
- create(graph, database: str = None, batch_size=None)
  Create all nodes from NodeSet.
- create_csv_query(filename: str = None, periodic_commit=1000)
  Create a Cypher query to load a CSV file created with NodeSet.to_csv() into Neo4j (CREATE statement).
  Parameters:
  - filename – Optional filename. A filename will be autocreated if not passed.
  - periodic_commit – Number of rows to commit in one transaction.
  Returns: Cypher query.
- create_index(graph, database=None)
  Create indices for all label/merge key combinations as well as a composite index if multiple merge keys exist.
- classmethod from_csv_json_set(csv_file_path, json_file_path, load_items: bool = False, labels_key: str = None, mergekey_key: str = None)
  Read the default CSV/JSON file combination. Needs paths to CSV and JSON file.
  JSON keys can be overwritten by passing the respective parameters.
  Parameters:
  - csv_file_path – Path to the CSV file.
  - json_file_path – Path to the JSON file.
  - load_items – Yield items from file (False, default) or load them to memory (True).
  Returns: The NodeSet.
- merge(graph, merge_properties=None, batch_size=None, preserve=None, append_props=None, database=None)
  Merge nodes from NodeSet on merge properties.
  Parameters: merge_properties – The merge properties.
- merge_csv_query(filename: str = None, periodic_commit=1000)
  Create a Cypher query to load a CSV file created with NodeSet.to_csv() into Neo4j (MERGE statement).
  Parameters:
  - filename – Optional filename. A filename will be autocreated if not passed.
  - periodic_commit – Number of rows to commit in one transaction.
  Returns: Cypher query.
- node_properties()
  Yield properties of the nodes in this set. Used for create function.
- object_file_name(suffix: str = None) → str
  Create a unique name for this NodeSet that indicates content. Pass an optional suffix. NOTE: suffix has to include the ‘.’ for a filename!
  nodeset_Label_merge-key_uuid
  With suffix:
  nodeset_Label_merge-key_uuid.json
- to_csv(filepath: str, quoting: int = None) → str
  Create a CSV file for this nodeset. Header row is created with all properties. Each row contains the properties of a node.
  Example:
  >>> nodeset = NodeSet(labels=["Person"], merge_keys=["name"])
  >>> nodeset.add_node({"name": "Alice", "age": 33})
  >>> nodeset.add_node({"name": "Bob", "age": 44})
  >>> nodeset.to_csv("/tmp/Person_name.csv")
  '/tmp/Person_name.csv'
  The file then contains:
  name,age
  Alice,33
  Bob,44
  Parameters:
  - filepath – Full path to the CSV file.
  - quoting – Optional quoting setting for csv writer (any of csv.QUOTE_MINIMAL, csv.QUOTE_NONE, csv.QUOTE_ALL etc).
- to_csv_json_set(csv_file_path, json_file_path, type_conversion: dict = None)
  Write the default CSV/JSON file combination.
  Needs paths to CSV and JSON file.
  Parameters:
  - csv_file_path – Path to the CSV file.
  - json_file_path – Path to the JSON file.
  - type_conversion – Optional dictionary to convert types of properties.
- to_definition()
  Create a NodeSetDefinition from this NodeSet. Later, NodeSetDefinition can become parent class of NodeSet.
- to_dict()
  Create dictionary defining the nodeset.
- to_json(target_dir: str, filename: str = None)
  Serialize NodeSet to a JSON file in a target directory.
  This function is meant for dumping/reloading and not to create a general transport format. The function will likely be optimized for disk space or compressed in future.
- update_node(properties: dict)
  Update an existing node by overwriting all properties.
  Note that this requires NodeSet(…, indexed=True) which is not the default!
  Parameters: properties – Node property dictionary.
RelationshipSet
class graphio.RelationshipSet(rel_type, start_node_labels, end_node_labels, start_node_properties, end_node_properties, batch_size=None, default_props=None, source=False)
Container for a set of Relationships with the same type of start and end nodes.
Parameters:
- rel_type – Relationship type.
- start_node_labels – Labels of the start node.
- end_node_labels – Labels of the end node.
- start_node_properties – Property keys to identify the start node.
- end_node_properties – Property keys to identify the end node.
- batch_size – Batch size for Neo4j operations.
- add_relationship(start_node_properties: dict, end_node_properties: dict, properties: dict = None)
  Add a relationship to this RelationshipSet.
  Parameters: properties – Relationship properties.
- all_property_keys() → Set[str]
  Return a set of all property keys in this RelationshipSet.
  Returns: A set of unique property keys of a RelationshipSet.
- create(graph, database=None, batch_size=None)
  Create relationships in this RelationshipSet.
- create_csv_query(query_type: str, filename: str = None, periodic_commit=1000) → str
  Generate the CREATE CSV query for this RelationshipSet. The function tries to take care of type conversions.
  Note: You can’t use arrays as properties for nodes/relationships when creating CSV files.
  LOAD CSV WITH HEADERS FROM xyz AS line
  MATCH (a:Gene), (b:Protein)
  WHERE a.sid = line.a_sid AND b.sid = line.b_sid AND b.taxid = line.b_taxid
  CREATE (a)-[r:MAPS]->(b)
  SET r.key1 = line.rel_key1, r.key2 = line.rel_key2
- create_index(graph, database=None)
  Create indices for start node and end node definition of this RelationshipSet. If more than one start or end node property is defined, all single property indices as well as the composite index are created.
- classmethod from_csv_json_set(csv_file_path, json_file_path, load_items: bool = False, reltype_key=None, startnodeproperties_key=None, endnodeproperties_key=None, startnodelables_key=None, endnodelables_key=None)
  Read the default CSV/JSON file combination. Needs paths to CSV and JSON file.
  JSON keys can be overwritten by passing the respective parameters.
  Parameters:
  - csv_file_path – Path to the CSV file.
  - json_file_path – Path to the JSON file.
  - load_items – Yield items from file (False, default) or load them to memory (True).
  Returns: The RelationshipSet.
- merge(graph, database=None, batch_size=None)
  Merge relationships in this RelationshipSet.
- object_file_name(suffix: str = None) → str
  Create a unique name for this RelationshipSet that indicates content. Pass an optional suffix. NOTE: suffix has to include the ‘.’ for a filename!
  relationshipset_StartLabel_TYPE_EndLabel_uuid
  With suffix:
  relationshipset_StartLabel_TYPE_EndLabel_uuid.json
- to_csv(filepath: str, quoting: int = None) → str
  Write the RelationshipSet to a CSV file. The CSV file will be written to the given filepath.
  Note: You can’t use arrays as properties for nodes/relationships when creating CSV files.
  # CSV file header
  start_sid, end_sid, end_taxid, rel_key1, rel_key2
  Parameters:
  - filepath – Path to the CSV file.
  - relset (graphio.RelationshipSet) – The RelationshipSet
- to_csv_json_set(csv_file_path, json_file_path, write_mode: str = 'w')
  Write the default CSV/JSON file combination.
  Needs paths to CSV and JSON file.
  Parameters:
  - csv_file_path – Path to the CSV file.
  - json_file_path – Path to the JSON file.
  - write_mode – Write mode for the CSV file.
- to_json(target_dir, filename: str = None)
  Serialize RelationshipSet to a JSON file in a target directory.
  This function is meant for dumping/reloading and not to create a general transport format. The function will likely be optimized for disk space or compressed in future.
Container
class graphio.Container(objects=None)
A container for a collection of Nodes, Relationships, NodeSets and RelationshipSets.
A typical parser function, e.g. one that reads an Excel file, produces mixed output which then has to be processed accordingly.
Also, sanity checks and data statistics are useful.
- merge_nodesets()
  Merge all node sets if merge_key is defined.
- nodesets
  Get the NodeSets in the Container.
- relationshipsets
  Get the RelationshipSets in the Container.
Model Objects
Warning
This is the first iteration of the interface for model objects. The function/class signatures might change in the next releases.
Basic Usage
Creating NodeSet and RelationshipSet objects with strings is error-prone.
Graphio offers a simple object graph model system:
# MergeKey and Label are assumed to be importable from the top-level graphio package
from graphio import ModelNode, ModelRelationship, MergeKey, Label
class Person(ModelNode):
    name = MergeKey()
class Food(ModelNode):
    type = MergeKey()
class PersonLikes(ModelRelationship):
    source = Person
    target = Food
    type = 'LIKES'
You can use these classes to create NodeSet and RelationshipSet objects:
person_nodeset = Person.dataset()
food_nodeset = Food.dataset()
person_likes_food = PersonLikes.dataset()
When adding data to the RelationshipSet you can use the MergeKey properties of the ModelNode classes to avoid typing the properties as strings:
for name, food in [('Susan', 'Pizza'), ('Ann', 'Sushi')]:
    person_likes_food.add_relationship(
        {Person.name: name}, {Food.type: food}
    )
You can set one or multiple Label and MergeKey properties on the ModelNode:
class Person(ModelNode):
    first_name = MergeKey()
    last_name = MergeKey()
    Person = Label()
    Human = Label()
You can override the actual values of the Label and MergeKey:
class Person(ModelNode):
    first_name = MergeKey('first_name')
    last_name = MergeKey('surname')
    Person = Label('Individual')
    Human = Label('HomoSapiens')
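How a class attribute can double as a property name is easiest to see with a small descriptor. The sketch below is a stand-in written for illustration, not graphio's MergeKey implementation; `MergeKeySketch` and `PersonSketch` are hypothetical names.

```python
# Sketch of a descriptor that yields its property name on class access.
# MergeKeySketch is hypothetical and not graphio's MergeKey.

class MergeKeySketch:
    def __init__(self, name=None):
        self.name = name  # explicit override, if given

    def __set_name__(self, owner, attribute_name):
        # default to the attribute name when no override was passed
        if self.name is None:
            self.name = attribute_name

    def __get__(self, instance, owner):
        if instance is None:
            return self.name  # class-level access returns the key name
        return instance.__dict__.get(self.name)


class PersonSketch:
    first_name = MergeKeySketch()
    last_name = MergeKeySketch('surname')


start_node = {PersonSketch.first_name: 'Susan', PersonSketch.last_name: 'Miller'}
print(start_node)
```

Accessing the attribute on the class gives the string used as dictionary key, so a typo in the attribute name fails immediately with an AttributeError instead of silently producing a wrong property.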
Add data with model instances
You can create instances of the model objects to create individual nodes and relationships:
# MergeKey is assumed to be importable from the top-level graphio package
from graphio import ModelNode, ModelRelationship, MergeKey
from py2neo import Graph
graph = Graph()
class Person(ModelNode):
    name = MergeKey()
class Food(ModelNode):
    type = MergeKey()
class PersonLikes(ModelRelationship):
    source = Person
    target = Food
    type = 'LIKES'
alice = Person(name='Alice')
sushi = Food(type='Sushi')
alice.merge(graph)
sushi.merge(graph)
alice_likes_sushi = PersonLikes(alice, sushi)
alice_likes_sushi.merge(graph)
You can also link nodes without creating ModelRelationship instances:
alice.link(graph, PersonLikes, sushi, since='always')
ModelNode
class graphio.ModelNode(*args, **kwargs)
Base class for model objects.
- additional_props
  Return all properties except the merge properties.
  Returns: Dictionary with all properties except the merge properties.
  Return type: dict
- classmethod dataset() → graphio.objects.nodeset.NodeSet
  Returns: A NodeSet instance for this ModelNode.
- classmethod factory(labels: List[str], merge_keys: List[str] = None, name: str = None) → type
  Create a class with given labels and merge_keys. The merge_keys are optional but some functions do not work without them.
  Parameters:
  - labels – Labels for this ModelNode class.
  - merge_keys – MergeKeys for this ModelNode class.
  Returns: The ModelNode class.
- merge_props
  Return the merge properties for this node.
  Returns: Dictionary with the merge properties for this node.
  Return type: dict
ModelRelationship
class graphio.ModelRelationship(source: graphio.model.ModelNode, target: graphio.model.ModelNode, **kwargs)
Base class for model relationships.
Knows about the class of source node and target node (instances of ModelNode) and the relationship type:
class Person(ModelNode):
    name = MergeKey()
class Food(ModelNode):
    name = MergeKey()
class PersonLikesToEat(ModelRelationship):
    source = Person
    target = Food
    type = 'LIKES'
- classmethod dataset() → graphio.objects.relationshipset.RelationshipSet
  Returns: A RelationshipSet instance for this ModelRelationship.
Helper Classes
class graphio.NodeDescriptor(labels: List[str], properties: dict, merge_keys: List[str] = None)
Unified interface to describe nodes with labels and properties.
NodeDescriptor instances are passed into functions when no ModelNode classes or instances are available.
Setting merge_keys is optional. If they are not set, all property keys will be used as merge_keys.
Parameters:
- labels – Labels for this node.
- properties – Properties for this node.
- merge_keys – Optional.
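The fallback rule for merge_keys can be expressed in a few lines. This dataclass is a stand-in for illustration only; `NodeDescriptorSketch` is hypothetical and does not reproduce graphio's class.

```python
# Sketch of the merge_keys fallback: use all property keys when none are given.
# NodeDescriptorSketch is hypothetical, not graphio's NodeDescriptor.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NodeDescriptorSketch:
    labels: List[str]
    properties: dict
    merge_keys: Optional[List[str]] = None

    def __post_init__(self):
        if self.merge_keys is None:
            # no merge keys given: every property key identifies the node
            self.merge_keys = list(self.properties.keys())


implicit = NodeDescriptorSketch(['Person'], {'name': 'Lisa', 'age': 42})
explicit = NodeDescriptorSketch(['Person'], {'name': 'Lisa', 'age': 42}, merge_keys=['name'])
print(implicit.merge_keys, explicit.merge_keys)
```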