Working with Hierarchical Trees in Neo4j
Introduction
My name is Tomaz Bratanic. I want to demonstrate how you should approach hierarchical location trees in Neo4j. From what I have learned during importing/querying with them, I came up with a few ground rules one should follow to in order to get the correct query results.
Rules of location tree:
-
All relationships are directed from children to parents, going up the hiearchy.
-
We have a single type for all relationships. (PARENT;FROM;IS_IN)
-
Every node has a single outgoing relationship to its parent.
-
Every node can have one or multiple incoming relationships from its children.
Contact:
-
twitter: @tb_tomaz
-
github: https://github.com/tomasonjo
Import
Let’s load some data into our graph to explore.
Add constraints and indexes
First, we need to add indexes and constraints, as they will optimize our queries. The first array in the procedure below sets the indexes, and the second array contains the unique constraints. You will need to have the APOC library installed.
CREATE INDEX ON :County(name);
CREATE INDEX ON :City(name);
CREATE INDEX ON :ZipCode(name);
CREATE INDEX ON :Address(name);
CREATE CONSTRAINT ON (h:Hospital) ASSERT h.id IS UNIQUE;
CREATE CONSTRAINT ON (s:State) ASSERT s.name IS UNIQUE;
Location hierarchical tree import
You can notice that we do not take the standard approach, where we merge each node separately, but we merge them in pattern with their parent in a hierarchical tree because some counties/cities/addresses share the same name.
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/tomasonjo/hospitals-neo4j/master/Hospital%20General%20Information.csv" as row
WITH row
WHERE row.State = 'New York'
// state name is unique
MERGE (state:State{name:row.State})
// merge by pattern with their parents
MERGE (state)<-[:IS_IN]-(county:County{name:row.`County Name`})
MERGE (county)<-[:IS_IN]-(city:City{name:row.City})
MERGE (city)<-[:IS_IN]-(zip:ZipCode{name:row.`ZIP Code`})
MERGE (zip)<-[:IS_IN]-(address:Address{name:row.Address})
// for entities, it is best to have an id system
MERGE (h:Hospital{id:row.`Provider ID`})
MERGE (h)-[:IS_IN]->(address)
Additional hospital information
We will also import some additional information about the hospitals such as their ratings, ownership, and more.
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/tomasonjo/hospitals-neo4j/master/Hospital%20General%20Information.csv" as row
WITH row
WHERE row.State = 'New York'
MATCH (h:Hospital{id:row.`Provider ID`})
SET h.phone=row.`Phone Number`,
h.emergency_services = row.`Emergency Services`,
h.name= row.`Hospital Name`,
h.mortality = row.`Mortality national comparison`,
h.safety = row.`Safety of care national comparison`,
h.timeliness = row.`Timeliness of care national comparison`,
h.experience = row.`Patient experience national comparison`,
h.effectiveness = row.`Effectiveness of care national comparison`
MERGE (type:HospitalType{name:row.`Hospital Type`})
MERGE (h)-[:HAS_TYPE]->(type)
MERGE (ownership:Ownership{name: row.`Hospital Ownership`})
MERGE (h)-[:HAS_OWNERSHIP]->(ownership)
MERGE (rating:Rating{name:row.`Hospital overall rating`})
MERGE (h)-[:HAS_RATING]->(rating)
Geospatial import
The last thing to import is the geospatial information of hospitals.
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/tomasonjo/hospitals-neo4j/master/gpsinfo.csv" as row
WITH row
WHERE row.State = 'New York'
MATCH (hospital:Hospital {id:row.id})
SET hospital.latitude = toFloat(row.latitude),
hospital.longitude = toFloat(row.longitude)
Spatial query example
Let’s say you get lost on Liberty Island
and want to find the nearest 10
hospitals. Distance is in meters. Note: does not work in Neo4j Sandbox.
WITH "Liberty Island, Manhattan" as myLocation
CALL apoc.spatial.geocodeOnce(myLocation) YIELD location
WITH point({longitude: location.longitude, latitude: location.latitude}) as myPosition,100 as distanceInKm
MATCH (h:Hospital)-->(rating:Rating)
WHERE exists(h.latitude) and
distance(myPosition, point({longitude:h.longitude,latitude:h.latitude})) < (distanceInKm * 100)
RETURN h.name as hospital,rating.name as rating,distance(myPosition,
point({longitude:h.longitude,latitude:h.latitude})) as distance
ORDER BY distance LIMIT 10
Data Validation
Validation #1
We can check if any :Address
has more than one relationship going up the hierarchy. Every node has a single outgoing relationship to its parent rule.
MATCH (a:Address)
WHERE size((a)-[:IS_IN]->()) > 1
RETURN a
Validation #2
We can also check the length of all the paths in location tree. Because of our rules we placed, every hospital must have exactly one location path because every hospital have exactly one address.
MATCH path=(h:Hospital)-[:IS_IN*..10]->(location)
WHERE NOT (location)-[:IS_IN]->()
RETURN distinct(length(path)) as length,
count(*) as numberOfPaths,
count(distinct(h)) as numberOfHospitals
Queries
Lets run a few queries and learn about our data.
Average rating by ownership
MATCH (r)<-[:HAS_RATING]-(h:Hospital)-[:HAS_OWNERSHIP]->(o)
RETURN o.name as ownership,avg(toInteger(r.name)) as averageRating
ORDER BY averageRating DESC LIMIT 15
Number of hospitals per city
MATCH (h:Hospital)-[:IS_IN*3..3]->(city)
RETURN city.name as city,count(h) as NumberOfHospitals
ORDER BY NumberOfHospitals DESC LIMIT 15
Top 10 states by rating
MATCH (r)<-[:HAS_RATING]-(h:Hospital)-[:IS_IN*5..5]->(state)
WHERE NOT r.name="Not Available"
RETURN state.name as state,avg(toInteger(r.name)) as averageRating,count(h) as numberOfHospitals
ORDER BY averageRating DESC LIMIT 15
Is this page helpful?