Should I use Neo4j's Import Tool or Load Command to Insert Several Million Rows?
I have several CSV files that range from 25 to 100 MB in size. I have created constraints, created indices, am using periodic commit, and have increased the allocated memory in neo4j-wrapper.conf and neo4j.properties.
neo4j.properties:
    neostore.nodestore.db.mapped_memory=50M
    neostore.relationshipstore.db.mapped_memory=500M
    neostore.propertystore.db.mapped_memory=100M
    neostore.propertystore.db.strings.mapped_memory=100M
    neostore.propertystore.db.arrays.mapped_memory=0M

neo4j-wrapper.conf changes:

    wrapper.java.initmemory=5000
    wrapper.java.maxmemory=5000

However, my load is still taking a very long time, and I am considering using the recently released import tool (http://neo4j.com/docs/milestone/import-tool.html). Before I switch to it, I was wondering whether I could be doing anything else to improve the speed of my imports.
I begin by creating several constraints to make sure that the IDs I'm using are unique:

    CREATE CONSTRAINT ON (c:country) ASSERT c.name IS UNIQUE;
    // and constraints for other name identifiers as well...

I then use periodic commit...
    USING PERIODIC COMMIT 10000

I then load in the CSV, ignoring lines in which several fields are missing:

    LOAD CSV WITH HEADERS FROM "file:/path/to/file/myfile.csv" AS line
    WITH line
    WHERE line.countryname IS NOT NULL
      AND line.cityname IS NOT NULL
      AND line.neighborhoodname IS NOT NULL

I then create the necessary nodes from my data:

    WITH line
    MERGE (country:country {name: line.countryname})
    MERGE (city:city {name: line.cityname})
    MERGE (neighborhood:neighborhood {
        name: line.neighborhoodname,
        size: toInt(line.neighborhoodsize),
        nickname: coalesce(line.neighborhoodnn, ""),
        ... 50 other features
    })
    MERGE (city)-[:in]->(country)
    CREATE (neighborhood)-[:in]->(city)
    // Note that each neighborhood only appears once

Does it make sense to use CREATE UNIQUE rather than applying MERGE to any country reference? Would this speed it up?
A ~250,000-line CSV file took over 12 hours to complete, which seemed excessively slow. What else can I be doing to speed this up? Or does it just make sense to use the annoying-looking import tool?
A couple of things. Firstly, I would suggest reading Mark Needham's "Avoiding the Eager" blog post:
http://www.markhneedham.com/blog/2014/10/23/neo4j-cypher-avoiding-the-eager/
Basically, what it says is that you should add PROFILE to the start of each of your queries to see if any of them use the Eager operator. If they do, this can really cost you performance-wise, and you should probably split your queries up into separate MERGEs.
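As a rough sketch of what splitting into separate MERGEs could look like, the import might be run as several passes over the same file, each touching only one label or one relationship. This is an illustration under assumptions, not a tested recipe: it reuses the file path, labels, and column names from the question, and the exact breakdown into passes is my own choice. Whether any given pass still plans an Eager operator depends on your Neo4j version, so it is worth checking each one with PROFILE (for instance on a small sample file).

```cypher
// Pass 1: create country nodes only.
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "file:/path/to/file/myfile.csv" AS line
WITH line
WHERE line.countryname IS NOT NULL
MERGE (:country {name: line.countryname});

// Pass 2: create city nodes only.
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "file:/path/to/file/myfile.csv" AS line
WITH line
WHERE line.cityname IS NOT NULL
MERGE (:city {name: line.cityname});

// Pass 3: connect cities to countries; both ends already exist,
// so they are looked up with MATCH rather than MERGE.
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "file:/path/to/file/myfile.csv" AS line
WITH line
WHERE line.countryname IS NOT NULL AND line.cityname IS NOT NULL
MATCH (country:country {name: line.countryname})
MATCH (city:city {name: line.cityname})
MERGE (city)-[:in]->(country);
```

The neighborhood nodes and their relationships to cities would follow the same pattern in further passes. Reading the file several times costs some I/O, but each statement stays simple enough for periodic commit to do its job.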
Secondly, your neighborhood MERGE contains a lot of properties, so each time it's trying to match on every single one of those properties before deciding whether it should create the node or not. I'd suggest something like:
    MERGE (neighborhood:neighborhood {name: line.neighborhoodname})
    ON CREATE SET
        neighborhood.size = toInt(line.neighborhoodsize),
        neighborhood.nickname = coalesce(line.neighborhoodnn, ""),
        ... 50 other features