Getting Started: Bulk Loading Track

Best for: ETL processes, large datasets, data migration, and high-performance data ingestion.

Prerequisites

Neo4j Database: Running locally or remotely

```
# Using Docker (recommended for testing)
docker run -p 7474:7474 -p 7687:7687 -e NEO4J_AUTH=neo4j/password neo4j:latest
```

Install Graphio:
```
pip install graphio
```

Step 1: Set Up Connection

```
from graphio import NodeSet, RelationshipSet
from neo4j import GraphDatabase

# Connect to Neo4j
driver = GraphDatabase.driver('neo4j://localhost:7687', auth=('neo4j', 'password'))

# Optional: For Enterprise Edition, you can specify a database
# database = 'mydb'  # All operations can target this specific database
```
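
Hard-coded credentials are fine for a local test container, but elsewhere the connection settings usually come from the environment. A minimal sketch, assuming the environment variable names `NEO4J_URI`, `NEO4J_USER`, and `NEO4J_PASSWORD` (illustrative names chosen for this example, not something Graphio or the driver reads on its own), with the values above as defaults:

```python
import os


def neo4j_settings(env=None):
    """Collect connection settings, falling back to the local test defaults.

    The variable names are assumptions made for this example.
    """
    env = os.environ if env is None else env
    return {
        "uri": env.get("NEO4J_URI", "neo4j://localhost:7687"),
        "auth": (
            env.get("NEO4J_USER", "neo4j"),
            env.get("NEO4J_PASSWORD", "password"),
        ),
    }


settings = neo4j_settings({})  # empty mapping, so only the defaults apply
print(settings["uri"])
```

The result can then be passed on as `GraphDatabase.driver(settings['uri'], auth=settings['auth'])`.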

Step 2: Define Data Containers

```
# Define node containers
people = NodeSet(['Person'], merge_keys=['email'])
companies = NodeSet(['Company'], merge_keys=['name'], deduplicate=True)  # Prevent duplicate companies

# Define relationship container
employments = RelationshipSet(
    'WORKS_AT',           # Relationship type
    ['Person'],           # Start node labels
    ['Company'],          # End node labels
    ['email'],            # Start node match keys
    ['name']              # End node match keys
)
```
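
Merge keys are the properties that identify a node, so two entries with the same merge key values describe the same node. The sketch below illustrates that idea in plain Python. `dedupe_by_keys` is a hypothetical helper written for this example; it is not how Graphio implements `deduplicate=True` internally.

```python
def dedupe_by_keys(rows, merge_keys):
    """Keep the first row seen for each combination of merge key values."""
    seen = set()
    unique = []
    for row in rows:
        identity = tuple(row[key] for key in merge_keys)
        if identity not in seen:
            seen.add(identity)
            unique.append(row)
    return unique


rows = [
    {"name": "ACME Corp", "industry": "Technology"},
    {"name": "ACME Corp", "industry": "Tech"},
    {"name": "Globex", "industry": "Energy"},
]
unique_companies = dedupe_by_keys(rows, ["name"])
print([row["name"] for row in unique_companies])
```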

Step 3: Add Data in Batches

```
# Add nodes (can handle thousands efficiently)
people.add({'name': 'Alice Smith', 'email': 'alice@example.com', 'age': 30})
people.add({'name': 'Bob Johnson', 'email': 'bob@example.com', 'age': 25})

# You can also add OGM instances if you are using the hybrid approach
# from your_models import Person
# people.add(Person(name='Alice', email='alice@example.com', age=30))

companies.add({'name': 'ACME Corp', 'industry': 'Technology'})

# Add relationships
employments.add(
    {'email': 'alice@example.com'},  # Start node
    {'name': 'ACME Corp'},           # End node
    {'position': 'Developer'}        # Relationship properties
)
```
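
In an ETL job the `add` calls normally sit inside a loop over source records. A minimal sketch, assuming a CSV export with the hypothetical columns `name`, `email`, `age`, and `company`, that turns each row into the property dictionaries used above:

```python
import csv
import io

# Stand-in for a real file; the column names are assumptions for this example.
SOURCE = io.StringIO(
    "name,email,age,company\n"
    "Alice Smith,alice@example.com,30,ACME Corp\n"
    "Bob Johnson,bob@example.com,25,ACME Corp\n"
)

person_rows = []      # one property dict per Person node
employment_rows = []  # (start node keys, end node keys, relationship properties)

for record in csv.DictReader(SOURCE):
    person_rows.append(
        {"name": record["name"], "email": record["email"], "age": int(record["age"])}
    )
    employment_rows.append(
        ({"email": record["email"]}, {"name": record["company"]}, {})
    )

print(len(person_rows), len(employment_rows))
```

Each entry of `person_rows` can then be handed to `people.add(...)`, and each tuple in `employment_rows` unpacked into `employments.add(...)`.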

Step 4: Create Indexes (Performance)

```
# Create indexes before bulk loading
people.create_index(driver)
companies.create_index(driver)
```
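
Indexes on the merge keys matter because relationships are attached by looking up their start and end nodes through those properties. For reference, a comparable index can also be written by hand in Cypher. The helper below is a hypothetical illustration that only builds the statement text, using the `CREATE INDEX ... FOR ... ON` syntax of recent Neo4j versions; it does not claim to reproduce the statements `create_index` issues.

```python
def index_statement(label, properties):
    """Build a Cypher CREATE INDEX statement for one label and its key properties."""
    property_list = ", ".join(f"n.{name}" for name in properties)
    return f"CREATE INDEX IF NOT EXISTS FOR (n:{label}) ON ({property_list})"


print(index_statement("Person", ["email"]))
```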

Step 5: Bulk Load to Neo4j

```
# Load data efficiently
companies.create(driver)    # Load companies first
people.create(driver)       # Then people
employments.create(driver)  # Finally relationships

# For Enterprise Edition, specify target database:
# companies.create(driver, database='production')
# people.create(driver, database='production')
# employments.create(driver, database='production')

print(f"Loaded {len(people.nodes)} people and {len(companies.nodes)} companies")
```
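
Nodes are loaded before relationships because a relationship can only be attached to nodes that already exist. Before a large load it can be worth checking that every relationship endpoint matches a node you are about to create. A minimal sketch in plain Python; `missing_endpoints` is a hypothetical helper for this example, and it works on plain dictionaries rather than on Graphio objects:

```python
def missing_endpoints(node_rows, endpoint_rows, keys):
    """Return endpoint dicts whose key values match none of the node rows."""
    known = {tuple(row[key] for key in keys) for row in node_rows}
    return [
        endpoint
        for endpoint in endpoint_rows
        if tuple(endpoint[key] for key in keys) not in known
    ]


company_rows = [{"name": "ACME Corp", "industry": "Technology"}]
end_nodes = [{"name": "ACME Corp"}, {"name": "Globex"}]

print(missing_endpoints(company_rows, end_nodes, ["name"]))
```

An empty result means every end node referenced by a relationship has a matching company row.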

What You've Learned

✅ How to create NodeSet and RelationshipSet containers
✅ How to batch data for efficient loading
✅ How to create indexes for performance
✅ Proper loading order (nodes before relationships)
✅ How to prevent duplicates with built-in deduplication

Next Steps

Need data validation? → OGM Track
Want to combine both? → Hybrid Approach
Deep dive into bulk loading → Bulk Loading Guide