Neo4j Import Webinar
Mark Needham (@markhneedham)
30th July 2015
Neo Technology, Inc Confidential#neo4j
Chicago Crime dataset
The goal
Chicago Crime CSV file -> imported into -> Neo4j
Exploring the data
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv" AS row
RETURN row
LIMIT 1
Sketch a rough initial model
Import a sample: Crimes
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv" AS row
WITH row LIMIT 100
MERGE (crime:Crime {
  id: row.ID,
  description: row.Description,
  caseNumber: row.`Case Number`,
  arrest: row.Arrest,
  domestic: row.Domestic
});
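Import a sample: Crimes
This can be done better by splitting up the attributes: MERGE on the unique id only and set the remaining properties separately (for example with ON CREATE SET).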
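The note above can be illustrated without a database. What follows is a minimal sketch, assuming that MERGE behaves as a get-or-create over every property written in its pattern. The `merge_node` helper and the two sample rows are hypothetical and made up for illustration; they are not part of Neo4j or of the talk.

```python
def merge_node(store, label, match_props, on_create=None):
    """Toy get-or-create: a node matches only if every property in match_props is equal."""
    for node in store:
        if node["label"] == label and all(
            node["props"].get(key) == value for key, value in match_props.items()
        ):
            return node
    # Nothing matched, so create the node and apply the creation-only properties.
    node = {"label": label, "props": dict(match_props)}
    node["props"].update(on_create or {})
    store.append(node)
    return node


# Made-up rows: the same crime id appears twice with a different description.
rows = [
    {"ID": "1", "Description": "SIMPLE"},
    {"ID": "1", "Description": "AGGRAVATED"},
]

# Merging on every attribute: the changed description no longer matches, so a duplicate appears.
all_props_store = []
for row in rows:
    merge_node(all_props_store, "Crime",
               {"id": row["ID"], "description": row["Description"]})

# Merging on the key only and setting the rest on creation (the ON CREATE SET idea).
key_only_store = []
for row in rows:
    merge_node(key_only_store, "Crime", {"id": row["ID"]},
               on_create={"description": row["Description"]})

print(len(all_props_store), len(key_only_store))
```

In this toy model the first approach ends up with two Crime nodes for one id, while the key-only approach keeps a single node.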
Import a sample: Crime Types
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv" AS row
WITH row LIMIT 100
MERGE (:CrimeType { name: row.`Primary Type`});
Import a sample: Crimes -> Crime Types
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv" AS row
WITH row LIMIT 100
MATCH (crime:Crime { id: row.ID, description: row.Description})
MATCH (crimeType:CrimeType { name: row.`Primary Type`})
MERGE (crime)-[:TYPE]->(crimeType);
Add indexes
CREATE INDEX ON :Label(property)
CREATE INDEX ON :Crime(id);
CREATE INDEX ON :Location(name);
CREATE INDEX ON :CrimeType(name);
...
Periodic Commit
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv" AS row
MERGE (crime:Crime { id: row.ID, description: row.Description})
Periodic Commit
• Neo4j keeps all transaction state in memory, which becomes problematic for large CSV files
• USING PERIODIC COMMIT flushes the transaction after a certain number of rows
• Default is 1000 rows but it’s configurable
• Currently only works with LOAD CSV
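The batching idea behind the bullets above can be sketched in plain Python. This is only an analogy for what periodic commit does, not Neo4j's implementation: the hypothetical `batches` helper groups rows so that each group would correspond to one committed transaction, with the default size of 1000 taken from the slide.

```python
from itertools import islice


def batches(rows, size=1000):
    """Yield lists of at most `size` rows; each list stands for one committed transaction."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# 2,500 stand-in rows are split into three commits instead of one large transaction.
commit_sizes = [len(batch) for batch in batches(range(2500))]
print(commit_sizes)
```

Only one batch is held in memory at a time, which is why the approach copes better with large files.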
Avoiding the Eager
• Cypher has an Eager operator which will bring forward parts of a query to ensure safety
• We don’t want to see this operator when we’re importing data; it will slow things down a lot
• Put a diagram of eager => slow (maybe a query plan?)
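One way to act on the bullets above is to look at a query's plan before running the import and check whether an operator named Eager appears in it. The sketch below assumes the plan has already been obtained and is available as a nested dictionary with `operatorType` and `children` keys. That shape, and both sample plans, are hand-written assumptions for illustration rather than output captured from Neo4j.

```python
def operators(plan):
    """Yield every operator name in a nested plan dictionary, depth-first."""
    yield plan["operatorType"]
    for child in plan.get("children", []):
        yield from operators(child)


def has_eager(plan):
    """Return True if any operator in the plan is named exactly 'Eager'."""
    return any(name == "Eager" for name in operators(plan))


# Hypothetical plan trees, written by hand for this example.
eager_plan = {"operatorType": "EmptyResult", "children": [
    {"operatorType": "Merge", "children": [
        {"operatorType": "Eager", "children": [
            {"operatorType": "LoadCSV", "children": []}]}]}]}

streaming_plan = {"operatorType": "EmptyResult", "children": [
    {"operatorType": "Merge", "children": [
        {"operatorType": "LoadCSV", "children": []}]}]}

print(has_eager(eager_plan), has_eager(streaming_plan))
```

The check matches the name exactly so that other operators whose names merely start with "Eager" are not flagged.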
LOAD CSV in summary
• ETL power tool
• Built into Neo4j since version 2.1
• Can load data from any URL
• Good for medium size data (up to 10M rows)
Bulk loading an initial data set
• Introducing the Neo4j Import Tool
• Find it in the bin folder of your Neo4j download
• Used to load large initial data sets
• Skips the transactional layer of Neo4j and writes store files directly
Expects files in a certain format
Nodes
:ID(Crime)  :LABEL  description
:ID(Beat)  :LABEL
Relationships
:START_ID(Crime)  :END_ID(Beat)  :TYPE
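To make the header layout above concrete, here is a small sketch that renders one node file and one relationship file as comma-separated text using the headers from the slide. The `write_csv` helper, the sample values, and the `ON_BEAT` relationship type are assumptions made up for illustration.

```python
import csv
import io


def write_csv(header, rows):
    """Render a header row plus data rows as CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return out.getvalue()


# Node file: an id in the Crime id space, a label, and one ordinary property.
crimes_csv = write_csv(
    [":ID(Crime)", ":LABEL", "description"],
    [["1", "Crime", "SIMPLE"]],
)

# Relationship file: start and end ids refer to the id spaces declared in the node files.
crimes_beats_csv = write_csv(
    [":START_ID(Crime)", ":END_ID(Beat)", ":TYPE"],
    [["1", "0624", "ON_BEAT"]],  # ON_BEAT is a placeholder type name
)

print(crimes_csv)
print(crimes_beats_csv)
```

The id spaces in parentheses (Crime, Beat) are what let the relationship file point at the right node file even if two files reuse the same raw id values.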
What we have…
Translation Phase required
Chicago Crime CSV file -> Translation Phase -> Neo4j ready CSV files
Spark all the things
Chicago Crime CSV file -> processed by -> Spark Job -> spits out -> Neo4j ready CSV files -> imported into -> Neo4j
The Spark Job
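The code of the Spark job itself is not preserved in this transcript. As a rough stand-in, the translation it performs can be sketched in plain, single-machine Python: split each raw row into a crime node, a de-duplicated beat node, and a relationship between them. This is not the author's Spark job. The `Beat` column name, the sample rows, and the `ON_BEAT` type are assumptions for illustration.

```python
import csv
import io

# Two made-up rows in the shape of the crime CSV; "Beat" is an assumed column name.
raw = io.StringIO(
    "ID,Description,Beat\n"
    "1,SIMPLE,0624\n"
    "2,AGGRAVATED,0624\n"
)

crimes, beats, crimes_beats = [], [], []
seen_beats = set()

for row in csv.DictReader(raw):
    # Every input row becomes exactly one Crime node row.
    crimes.append([row["ID"], "Crime", row["Description"]])
    # Each beat becomes one node, however many crimes mention it.
    if row["Beat"] not in seen_beats:
        seen_beats.add(row["Beat"])
        beats.append([row["Beat"], "Beat"])
    # One relationship row per crime; the type name is a placeholder.
    crimes_beats.append([row["ID"], row["Beat"], "ON_BEAT"])

print(crimes)
print(beats)
print(crimes_beats)
```

A distributed job would do the same splitting and de-duplication, only spread across partitions of the input file.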
Submitting the Spark Job
./spark-1.3.0-bin-hadoop1/bin/spark-submit \
  --driver-memory 5g \
  --class GenerateCSVFiles \
  --master local[8] \
  target/scala-2.10/playground_2.10-1.0.jar
real 1m25.506s
user 8m2.183s
sys  0m24.267s
The generated files
$ ls -1 tmp/*.csv
tmp/beats.csv
tmp/crimeDates.csv
tmp/crimes.csv
tmp/crimesBeats.csv
tmp/crimesDates.csv
tmp/crimesLocations.csv
tmp/crimesPrimaryTypes.csv
tmp/dates.csv
tmp/locations.csv
tmp/primaryTypes.csv
Importing into Neo4j
DATA=tmp
NEO=./neo4j-enterprise-2.2.3
$NEO/bin/neo4j-import \
  --into $DATA/crimes.db \
  --nodes $DATA/crimes.csv \
  --nodes $DATA/beats.csv \
  --nodes $DATA/primaryTypes.csv \
  --nodes $DATA/locations.csv \
  --relationships $DATA/crimesBeats.csv \
  --relationships $DATA/crimesPrimaryTypes.csv \
  --relationships $DATA/crimesLocations.csv \
  --stacktrace
IMPORT DONE in 36s 208ms
Enriching the crime graph
2 options
1. JSON -> jq -> CSV -> LOAD CSV
2. JSON -> Language Driver -> HTTP API
Using py2neo to load JSON into Neo4j
import json
from py2neo import Graph, authenticate

# Register credentials, then connect to the default local server
authenticate("localhost:7474", "neo4j", "foobar")
graph = Graph()

# Load the document under a name that does not shadow the json module
with open('categories.json') as data_file:
    categories = json.load(data_file)

query = """
WITH {json} AS document
UNWIND document.categories AS category
UNWIND category.sub_categories AS subCategory
MERGE (c:CrimeCategory {name: category.name})
MERGE (sc:SubCategory {code: subCategory.code})
ON CREATE SET sc.description = subCategory.description
MERGE (c)-[:CHILD]->(sc)
"""

# Pass the whole document as the {json} query parameter
print(graph.cypher.execute(query, json=categories))
Enriching the crime graph
Translate from JSON to CSV
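The talk names jq for this translation step. As an alternative illustration, the same flattening can be sketched in Python: one CSV row per sub-category, carrying its parent category's name. The document shape (`categories`, `sub_categories`, `name`, `code`, `description`) follows the py2neo query earlier in the talk, while the sample values and the CSV column names are made up.

```python
import csv
import io
import json

# Made-up document in the shape the earlier Cypher query expects.
document = json.loads("""
{"categories": [
  {"name": "Category A", "sub_categories": [
    {"code": "A1", "description": "First sub-category"},
    {"code": "A2", "description": "Second sub-category"}]}]}
""")


def to_csv(doc):
    """Flatten nested categories into one CSV row per sub-category."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    # Column names are an assumption; choose whatever the later LOAD CSV will reference.
    writer.writerow(["category", "sub_category_code", "sub_category_description"])
    for category in doc["categories"]:
        for sub in category["sub_categories"]:
            writer.writerow([category["name"], sub["code"], sub["description"]])
    return out.getvalue()


csv_text = to_csv(document)
print(csv_text)
```

Repeating the category name on every row keeps the file flat, so a single LOAD CSV pass can MERGE both the category and the sub-category from each line.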
Enriching the crime graph
Import using LOAD CSV
Updating the graph
• As new crimes come in we want to update the graph to take them into account
Updating the graph
• Import these new crimes using the REST Transactional API
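A request to the transactional endpoint carries a JSON body with a list of statements, each with its own parameters. The sketch below only builds that body and does not send it. In Neo4j 2.x a body like this would typically be POSTed to `/db/data/transaction/commit`, but the server address, the Cypher statement, and the sample crime are assumptions made up for illustration.

```python
import json


def build_payload(crimes):
    """Build the JSON-serialisable body for one request with a single parameterised statement."""
    statement = (
        "UNWIND {crimes} AS crime "
        "MERGE (c:Crime {id: crime.id}) "
        "ON CREATE SET c.description = crime.description"
    )
    return {"statements": [{"statement": statement,
                            "parameters": {"crimes": crimes}}]}


# One made-up new crime; a real update would pass the newly arrived rows.
payload = build_payload([{"id": "1", "description": "SIMPLE"}])
body = json.dumps(payload)
print(body)
```

Sending the crimes as a parameter rather than concatenating them into the query text keeps the statement constant across requests, so only the data changes from one update to the next.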
This talk brought to you by…
And that’s it…