neo4j import webinar

Post on 08-Jan-2017


Neo4j Import Webinar
Mark Needham (@markhneedham)
30th July 2015

Neo Technology, Inc Confidential#neo4j

Chicago Crime dataset


The goal: the Chicago Crime CSV file imported into Neo4j

Exploring the data


LOAD CSV WITH HEADERS FROM
"file:///Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv" AS row
RETURN row
LIMIT 1
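The same first look at the data can be had outside Neo4j. Below is a minimal Python sketch that mirrors the query: read the header row, then return the first row as a map. It assumes only that the file has a header line. The helper name `peek_rows` and the sample values are made up for illustration; the column names are the ones used later in the import queries.

```python
import csv
import io
from itertools import islice


def peek_rows(csv_file, limit=1):
    """Return the first `limit` rows of a CSV file as dicts keyed by header."""
    reader = csv.DictReader(csv_file)
    return [dict(row) for row in islice(reader, limit)]


# Illustrative sample only; the real file has many more columns and rows.
sample = io.StringIO(
    "ID,Case Number,Primary Type\n"
    "1,HX100001,THEFT\n"
    "2,HX100002,BATTERY\n"
)
first = peek_rows(sample)
print(first)
```

To try it on the real file, pass `open(path, newline="")` instead of the in-memory sample.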


Sketch a rough initial model

Import a sample: Crimes

LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv" AS row
WITH row LIMIT 100
MERGE (crime:Crime {
  id: row.ID,
  description: row.Description,
  caseNumber: row.`Case Number`,
  arrest: row.Arrest,
  domestic: row.Domestic
});

Note: this can be done better by splitting up the attributes, for example by merging on the id alone and setting the remaining properties separately.

Import a sample: Crime Types

LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv" AS row
WITH row LIMIT 100
MERGE (:CrimeType { name: row.`Primary Type`});

Import a sample: Crimes -> Crime Types

LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv" AS row
WITH row LIMIT 100
MATCH (crime:Crime { id: row.ID, description: row.Description})
MATCH (crimeType:CrimeType { name: row.`Primary Type`})
MERGE (crime)-[:TYPE]->(crimeType);

Add indexes

CREATE INDEX ON :Label(property)

CREATE INDEX ON :Crime(id);
CREATE INDEX ON :Location(name);
CREATE INDEX ON :CrimeType(name);
...

Periodic Commit

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv" AS row
MERGE (crime:Crime { id: row.ID, description: row.Description})

• Neo4j keeps all transaction state in memory, which becomes problematic for large CSV files
• USING PERIODIC COMMIT flushes the transaction after a certain number of rows
• Default is 1000 rows, but it’s configurable
• Currently only works with LOAD CSV
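The idea behind periodic commit can be sketched in a few lines of Python: instead of holding every row in one transaction, work through the rows in fixed-size batches and commit after each one. This only illustrates the batching idea and is not how Neo4j implements it. The `write_batch` callback is a hypothetical stand-in for "run the statement for these rows and commit".

```python
from itertools import islice


def batches(rows, batch_size=1000):
    """Yield successive lists of at most `batch_size` rows."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def import_rows(rows, write_batch, batch_size=1000):
    """Pass rows to `write_batch` one batch at a time; return the commit count.

    `write_batch` is a hypothetical callback standing in for
    "run the statement for these rows and commit the transaction".
    """
    commits = 0
    for batch in batches(rows, batch_size):
        write_batch(batch)
        commits += 1
    return commits


committed = []
count = import_rows(range(2500), committed.append, batch_size=1000)
print(count, [len(b) for b in committed])
```

With 2500 rows and a batch size of 1000, memory only ever holds one batch, and the final batch is the 500-row remainder.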

Avoiding the Eager

• Cypher has an Eager operator, which brings forward parts of a query (running them to completion before the rest continues) to ensure safety
• We don’t want to see this operator when we’re importing data: it will slow things down a lot
• Put a diagram of eager => slow (maybe a query plan?)

LOAD CSV in summary

• ETL power tool
• Built into Neo4j since version 2.1
• Can load data from any URL
• Good for medium-sized data (up to 10M rows)

Bulk loading an initial data set

• Introducing the Neo4j Import Tool
• Find it in the bin folder of your Neo4j download
• Used to load large initial data sets
• Skips the transactional layer of Neo4j and writes store files directly

Expects files in a certain format

Nodes
:ID(Crime), :LABEL, description
:ID(Beat), :LABEL

Relationships
:START_ID(Crime), :END_ID(Beat), :TYPE
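To make the header layout concrete, here is a small Python sketch that writes one node file per label and one relationship file using exactly those headers. The talk itself generates these files with a Spark job, so this is a stand-in for illustration, not the author's code. The sample records and the relationship type `ON_BEAT` are made up.

```python
import csv
import io


def write_csv(out, header, rows):
    """Write a header line followed by data rows to a file-like object."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)


# Made-up sample records: (crime id, description, beat id)
crimes = [("1", "OVER $500", "0624"), ("2", "SIMPLE", "1121")]

# Node file for crimes: id, label, and one property column.
crime_nodes = io.StringIO()
write_csv(crime_nodes, [":ID(Crime)", ":LABEL", "description"],
          [(cid, "Crime", desc) for cid, desc, _ in crimes])

# Node file for beats: each distinct beat appears once.
beat_nodes = io.StringIO()
beat_ids = sorted({beat for _, _, beat in crimes})
write_csv(beat_nodes, [":ID(Beat)", ":LABEL"],
          [(beat, "Beat") for beat in beat_ids])

# Relationship file: ON_BEAT is a hypothetical relationship type.
crime_beats = io.StringIO()
write_csv(crime_beats, [":START_ID(Crime)", ":END_ID(Beat)", ":TYPE"],
          [(cid, beat, "ON_BEAT") for cid, _, beat in crimes])

print(crime_nodes.getvalue())
```

The names in parentheses, such as `(Crime)` and `(Beat)`, are ID spaces: they tell the import tool which node file a relationship's start and end IDs refer to.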


What we have…

Translation phase required

Chicago Crime CSV file -> Translation Phase -> Neo4j ready CSV files

Spark all the things

Chicago Crime CSV file -> processed by -> Spark Job -> spits out -> Neo4j ready CSV files -> imported into -> Neo4j

The Spark Job


Submitting the Spark Job

./spark-1.3.0-bin-hadoop1/bin/spark-submit \
  --driver-memory 5g \
  --class GenerateCSVFiles \
  --master local[8] \
  target/scala-2.10/playground_2.10-1.0.jar

real 1m25.506s
user 8m2.183s
sys 0m24.267s

The generated files

$ ls -1 tmp/*.csv
tmp/beats.csv
tmp/crimeDates.csv
tmp/crimes.csv
tmp/crimesBeats.csv
tmp/crimesDates.csv
tmp/crimesLocations.csv
tmp/crimesPrimaryTypes.csv
tmp/dates.csv
tmp/locations.csv
tmp/primaryTypes.csv

Importing into Neo4j

DATA=tmp
NEO=./neo4j-enterprise-2.2.3
$NEO/bin/neo4j-import \
  --into $DATA/crimes.db \
  --nodes $DATA/crimes.csv \
  --nodes $DATA/beats.csv \
  --nodes $DATA/primaryTypes.csv \
  --nodes $DATA/locations.csv \
  --relationships $DATA/crimesBeats.csv \
  --relationships $DATA/crimesPrimaryTypes.csv \
  --relationships $DATA/crimesLocations.csv \
  --stacktrace

IMPORT DONE in 36s 208ms

Enriching the crime graph


2 options

• JSON -> jq -> CSV -> LOAD CSV
• JSON -> Language Driver -> HTTP API

Using py2neo to load JSON into Neo4j

import json
from py2neo import Graph, authenticate

# py2neo 2.0-style API, as used on the slide
authenticate("localhost:7474", "neo4j", "foobar")
graph = Graph()

# Load the categories document; a separate name avoids shadowing the json module
with open('categories.json') as data_file:
    document = json.load(data_file)

query = """
WITH {json} AS document
UNWIND document.categories AS category
UNWIND category.sub_categories AS subCategory
MERGE (c:CrimeCategory {name: category.name})
MERGE (sc:SubCategory {code: subCategory.code})
ON CREATE SET sc.description = subCategory.description
MERGE (c)-[:CHILD]->(sc)
"""

print(graph.cypher.execute(query, json=document))

Enriching the crime graph

Translate from JSON to CSV
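The talk uses jq for this translation step. As a rough equivalent, the Python sketch below flattens the nested categories document into one CSV row per category and sub-category pair. The field names (`categories`, `name`, `sub_categories`, `code`, `description`) are the ones used in the py2neo query; the sample values and the CSV column names are made up.

```python
import csv
import io
import json


def categories_to_rows(document):
    """Flatten categories and their sub-categories into one row per pair."""
    for category in document["categories"]:
        for sub in category["sub_categories"]:
            yield [category["name"], sub["code"], sub["description"]]


# Made-up sample using the field names from the py2neo query.
document = json.loads("""
{"categories": [
  {"name": "Theft",
   "sub_categories": [
     {"code": "0810", "description": "Over $500"},
     {"code": "0820", "description": "$500 and under"}
   ]}
]}
""")

out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
# Hypothetical column names; pick whatever the LOAD CSV query will reference.
writer.writerow(["category", "sub_category_code", "sub_category_description"])
writer.writerows(categories_to_rows(document))
print(out.getvalue())
```

Repeating the category name on every row keeps the file flat, so a single LOAD CSV pass can MERGE both node types and the relationship between them.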


Import using LOAD CSV

Updating the graph

• As new crimes come in we want to update the graph to take them into account
• Import this using the REST Transactional API
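A minimal sketch of what such an update could look like from Python, using only the standard library. It builds a request for the transactional endpoint of Neo4j 2.x (`/db/data/transaction/commit`) but does not send it. The host, port, and Cypher statement are assumptions for illustration, and the slides do not show the actual update query.

```python
import json
from urllib import request

# Transactional endpoint in Neo4j 2.x; host and port are assumptions.
COMMIT_URL = "http://localhost:7474/db/data/transaction/commit"

# Hypothetical statement for one new crime; property names follow the earlier slides.
STATEMENT = (
    "MERGE (crime:Crime {id: {id}}) "
    "ON CREATE SET crime.description = {description}"
)


def build_request(crimes, url=COMMIT_URL):
    """Build (but do not send) a POST request with one statement per crime."""
    payload = {
        "statements": [
            {"statement": STATEMENT, "parameters": crime} for crime in crimes
        ]
    }
    return request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json; charset=UTF-8",
        },
        method="POST",
    )


req = build_request([{"id": "3", "description": "SIMPLE"}])
print(req.get_method(), req.full_url)
# Sending would be request.urlopen(req), with authentication added as needed.
```

All statements in one request run in a single transaction, so a batch of new crimes either commits together or not at all.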


This talk brought to you by…


And that’s it…
