Schema optimization
All the examples in this page assume that the SparkSession has been initialized with the appropriate authentication options.
Although Neo4j does not enforce the use of a schema, adding indexes and constraints before writing data makes the writing process more efficient. When updating nodes or relationships, having constraints in place is also the best way to avoid duplicates.
The connector options for schema optimization are summarized below.
The schema optimization options described here cannot be used with the query option.
Option | Description | Value | Default
---|---|---|---
schema.optimization.type | Creates indexes and property uniqueness constraints on nodes using the properties defined by the node.keys option. (Deprecated in favor of schema.optimization.node.keys and schema.optimization.relationship.keys.) | One of INDEX, CONSTRAINT, NONE | NONE
schema.optimization.node.keys | Creates property uniqueness and key constraints on nodes using the properties defined by the node.keys option. | One of UNIQUE, KEY, NONE | NONE
schema.optimization.relationship.keys | Creates property uniqueness and key constraints on relationships using the properties defined by the relationship.keys option. | One of UNIQUE, KEY, NONE | NONE
schema.optimization | Creates property type and property existence constraints on both nodes and relationships, enforcing the type and non-nullability from the DataFrame schema. | Comma-separated list of TYPE, EXISTS | NONE
Indexes on node properties
Indexes in Neo4j are often used to increase search performance.
You can create an index by setting the schema.optimization.type option to INDEX.
val df = List(
"Product 1",
"Product 2",
).toDF("name")
df.write
.format("org.neo4j.spark.DataSource")
.mode(SaveMode.Overwrite)
.option("labels", ":Product")
.option("node.keys", "name")
.option("schema.optimization.type", "INDEX")
.save()
Schema query
Before the writing process starts, the connector runs the following schema query:
CREATE INDEX spark_INDEX_Product_name FOR (n:Product) ON (n.name)
The format of the index name is spark_INDEX_<LABEL>_<NODE_KEYS>, where <LABEL> is the first label from the labels option and <NODE_KEYS> is a dash-separated sequence of one or more properties as specified in the node.keys option.
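As an illustration of this naming scheme, the index name can be derived from the two option values. The helper below is a hypothetical sketch, not part of the connector API:

```python
def spark_index_name(labels: str, node_keys: str) -> str:
    """Hypothetical helper mirroring the connector's naming scheme:
    spark_INDEX_<LABEL>_<NODE_KEYS>."""
    # <LABEL> is the first label from the "labels" option (leading ':' stripped).
    first_label = labels.lstrip(":").split(":")[0]
    # <NODE_KEYS> joins the "node.keys" properties with dashes.
    keys = "-".join(k.strip() for k in node_keys.split(","))
    return f"spark_INDEX_{first_label}_{keys}"

print(spark_index_name(":Product", "name"))  # spark_INDEX_Product_name
```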
Notes:
- The index is not recreated if already present.
- With multiple labels, only the first label is used to create the index.
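To check that the index exists after the write, you can run a Cypher query such as the following (SHOW INDEXES is available in Neo4j 4.3 and later):

```cypher
SHOW INDEXES YIELD name, labelsOrTypes, properties
WHERE name STARTS WITH 'spark_INDEX_'
RETURN name, labelsOrTypes, properties
```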
Node property uniqueness constraints
Node property uniqueness constraints ensure that property values are unique for all nodes with a specific label. For property uniqueness constraints on multiple properties, the combination of the property values is unique.
You can create a constraint by setting the schema.optimization.node.keys option to UNIQUE.
val df = List(
"Product 1",
"Product 2",
).toDF("name")
df.write
.format("org.neo4j.spark.DataSource")
.mode(SaveMode.Overwrite)
.option("labels", ":Product")
.option("node.keys", "name")
.option("schema.optimization.node.keys", "UNIQUE")
.save()
Schema query
Before the writing process starts, the connector runs the following schema query:
CREATE CONSTRAINT `spark_NODE_UNIQUE-CONSTRAINT_Product_name` IF NOT EXISTS FOR (e:Product) REQUIRE (e.name) IS UNIQUE
Notes:
- The constraint is not recreated if already present.
- With multiple labels, only the first label is used to create the constraint.
- You cannot create a uniqueness constraint on a node property if a key constraint already exists on the same property.
- This schema optimization only works with the Overwrite save mode.
Before version 5.3.0, node property uniqueness constraints could be added with the schema.optimization.type option set to CONSTRAINT.
Node key constraints
Node key constraints ensure that, for a given node label and set of properties:
-
All the properties exist on all the nodes with that label.
-
The combination of the property values is unique.
You can create a constraint by setting the schema.optimization.node.keys option to KEY.
val df = List(
"Product 1",
"Product 2",
).toDF("name")
df.write
.format("org.neo4j.spark.DataSource")
.mode(SaveMode.Overwrite)
.option("labels", ":Product")
.option("node.keys", "name")
.option("schema.optimization.node.keys", "KEY")
.save()
Schema query
Before the writing process starts, the connector runs the following schema query:
CREATE CONSTRAINT `spark_NODE_KEY-CONSTRAINT_Product_name` IF NOT EXISTS FOR (e:Product) REQUIRE (e.name) IS NODE KEY
Notes:
- The constraint is not recreated if already present.
- You cannot create a key constraint on a node property if a uniqueness constraint already exists on the same property.
- This schema optimization only works with the Overwrite save mode.
Relationship property uniqueness constraints
Relationship property uniqueness constraints ensure that property values are unique for all relationships with a specific type. For property uniqueness constraints on multiple properties, the combination of the property values is unique.
You can create a constraint by setting the schema.optimization.relationship.keys option to UNIQUE.
val df = Seq(
("John", "Doe", 1, "Product 1", 200, "ABC100"),
("Jane", "Doe", 2, "Product 2", 100, "ABC200")
).toDF("name", "surname", "customerID", "product", "quantity", "order")
df.write
.mode(SaveMode.Overwrite)
.format("org.neo4j.spark.DataSource")
.option("relationship", "BOUGHT")
.option("relationship.save.strategy", "keys")
.option("relationship.source.save.mode", "Overwrite")
.option("relationship.source.labels", ":Customer")
.option("relationship.source.node.properties", "name,surname,customerID:id")
.option("relationship.source.node.keys", "customerID:id")
.option("relationship.target.save.mode", "Overwrite")
.option("relationship.target.labels", ":Product")
.option("relationship.target.node.properties", "product:name")
.option("relationship.target.node.keys", "product:name")
.option("relationship.properties", "quantity,order")
.option("schema.optimization.relationship.keys", "UNIQUE")
.option("relationship.keys", "order")
.save()
Schema query
Before the writing process starts, the connector runs the following schema query:
CREATE CONSTRAINT `spark_RELATIONSHIP_UNIQUE-CONSTRAINT_BOUGHT_order` IF NOT EXISTS FOR ()-[e:BOUGHT]->() REQUIRE (e.order) IS UNIQUE
Notes:
- The constraint is not recreated if already present.
- You cannot create a uniqueness constraint on a relationship property if a key constraint already exists on the same property.
- This schema optimization only works with the Overwrite save mode.
Relationship key constraints
Relationship key constraints ensure that, for a given relationship type and set of properties:
-
All the properties exist on all the relationships with that type.
-
The combination of the property values is unique.
You can create a constraint by setting the schema.optimization.relationship.keys option to KEY.
val df = Seq(
("John", "Doe", 1, "Product 1", 200, "ABC100"),
("Jane", "Doe", 2, "Product 2", 100, "ABC200")
).toDF("name", "surname", "customerID", "product", "quantity", "order")
df.write
.mode(SaveMode.Overwrite)
.format("org.neo4j.spark.DataSource")
.option("relationship", "BOUGHT")
.option("relationship.save.strategy", "keys")
.option("relationship.source.save.mode", "Overwrite")
.option("relationship.source.labels", ":Customer")
.option("relationship.source.node.properties", "name,surname,customerID:id")
.option("relationship.source.node.keys", "customerID:id")
.option("relationship.target.save.mode", "Overwrite")
.option("relationship.target.labels", ":Product")
.option("relationship.target.node.properties", "product:name")
.option("relationship.target.node.keys", "product:name")
.option("relationship.properties", "quantity,order")
.option("schema.optimization.relationship.keys", "KEY")
.option("relationship.keys", "order")
.save()
Schema query
Before the writing process starts, the connector runs the following schema query:
CREATE CONSTRAINT `spark_RELATIONSHIP_KEY-CONSTRAINT_BOUGHT_order` IF NOT EXISTS FOR ()-[e:BOUGHT]->() REQUIRE (e.order) IS RELATIONSHIP KEY
Notes:
- The constraint is not recreated if already present.
- You cannot create a key constraint on a relationship property if a uniqueness constraint already exists on the same property.
- This schema optimization only works with the Overwrite save mode.
Property type and property existence constraints
Property type constraints ensure that a property has the required property type for all nodes with a specific label (node property type constraints) or for all relationships with a specific type (relationship property type constraints).
Property existence constraints ensure that a property exists (IS NOT NULL
) for all nodes with a specific label (node property existence constraints) or for all relationships with a specific type (relationship property existence constraints).
The connector uses the DataFrame schema to enforce types (with the mapping described in the Data type mapping) and the nullable
flags of each column to determine whether to enforce existence.
You can create:
- Property type constraints for both nodes and relationships by setting the schema.optimization option to TYPE.
- Property existence constraints for both nodes and relationships by setting the schema.optimization option to EXISTS.
- Both at the same time by setting the schema.optimization option to TYPE,EXISTS.
Notes:
- The constraints are not recreated if already present.
On nodes
// Example DataFrame; the column names and types match the schema queries below
val df = Seq(
("John", "Doe", 36),
("Jane", "Doe", 32)
).toDF("name", "surname", "age")
df.write
.format("org.neo4j.spark.DataSource")
.mode(SaveMode.Overwrite)
.option("labels", ":Person")
.option("node.keys", "surname")
.option("schema.optimization", "TYPE,EXISTS")
.save()
Schema queries
Before the writing process starts, the connector runs the following schema queries (one query for each DataFrame column):
CREATE CONSTRAINT `spark_NODE-TYPE-CONSTRAINT-Person-name` IF NOT EXISTS FOR (e:Person) REQUIRE e.name IS :: STRING
CREATE CONSTRAINT `spark_NODE-TYPE-CONSTRAINT-Person-surname` IF NOT EXISTS FOR (e:Person) REQUIRE e.surname IS :: STRING
CREATE CONSTRAINT `spark_NODE-TYPE-CONSTRAINT-Person-age` IF NOT EXISTS FOR (e:Person) REQUIRE e.age IS :: INTEGER
If a DataFrame column is not nullable, the connector runs additional schema queries.
For example, if the age column is not nullable, the connector runs the following schema query:
CREATE CONSTRAINT `spark_NODE-NOT_NULL-CONSTRAINT-Person-age` IF NOT EXISTS FOR (e:Person) REQUIRE e.age IS NOT NULL
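The per-column behavior above can be sketched as follows. This is a hypothetical illustration of the queries shown in this section, not the connector's actual implementation; the function name and signature are assumptions:

```python
def node_schema_queries(label: str, column: str, cypher_type: str, nullable: bool) -> list[str]:
    """Sketch of the schema queries issued for one DataFrame column
    when schema.optimization is set to TYPE,EXISTS."""
    queries = [
        f"CREATE CONSTRAINT `spark_NODE-TYPE-CONSTRAINT-{label}-{column}` "
        f"IF NOT EXISTS FOR (e:{label}) REQUIRE e.{column} IS :: {cypher_type}"
    ]
    if not nullable:
        # Non-nullable columns additionally get an existence constraint.
        queries.append(
            f"CREATE CONSTRAINT `spark_NODE-NOT_NULL-CONSTRAINT-{label}-{column}` "
            f"IF NOT EXISTS FOR (e:{label}) REQUIRE e.{column} IS NOT NULL"
        )
    return queries

for q in node_schema_queries("Person", "age", "INTEGER", nullable=False):
    print(q)
```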
On relationships
val df = Seq(
("John", "Doe", 1, "Product 1", 200, "ABC100"),
("Jane", "Doe", 2, "Product 2", 100, "ABC200")
).toDF("name", "surname", "customerID", "product", "quantity", "order")
df.write
.mode(SaveMode.Overwrite)
.format("org.neo4j.spark.DataSource")
.option("relationship", "BOUGHT")
.option("relationship.save.strategy", "keys")
.option("relationship.source.save.mode", "Overwrite")
.option("relationship.source.labels", ":Customer")
.option("relationship.source.node.properties", "name,surname,customerID:id")
.option("relationship.source.node.keys", "customerID:id")
.option("relationship.target.save.mode", "Overwrite")
.option("relationship.target.labels", ":Product")
.option("relationship.target.node.properties", "product:name")
.option("relationship.target.node.keys", "product:name")
.option("relationship.properties", "quantity,order")
.option("schema.optimization", "TYPE,EXISTS")
.save()
Schema queries
Before the writing process starts, the connector runs the following schema queries (property type constraint queries for source and target node properties, then one property type constraint query for each DataFrame column representing a relationship property):
CREATE CONSTRAINT `spark_NODE-TYPE-CONSTRAINT-Customer-name` IF NOT EXISTS FOR (e:Customer) REQUIRE e.name IS :: STRING
CREATE CONSTRAINT `spark_NODE-TYPE-CONSTRAINT-Customer-surname` IF NOT EXISTS FOR (e:Customer) REQUIRE e.surname IS :: STRING
CREATE CONSTRAINT `spark_NODE-TYPE-CONSTRAINT-Customer-id` IF NOT EXISTS FOR (e:Customer) REQUIRE e.id IS :: INTEGER
CREATE CONSTRAINT `spark_NODE-TYPE-CONSTRAINT-Product-name` IF NOT EXISTS FOR (e:Product) REQUIRE e.name IS :: STRING
CREATE CONSTRAINT `spark_RELATIONSHIP-TYPE-CONSTRAINT-BOUGHT-quantity` IF NOT EXISTS FOR ()-[e:BOUGHT]->() REQUIRE e.quantity IS :: INTEGER
CREATE CONSTRAINT `spark_RELATIONSHIP-TYPE-CONSTRAINT-BOUGHT-order` IF NOT EXISTS FOR ()-[e:BOUGHT]->() REQUIRE e.order IS :: STRING
If a DataFrame column is not nullable, the connector runs additional schema queries.
For example, if the id and quantity columns are not nullable, the connector runs the following schema queries:
CREATE CONSTRAINT `spark_NODE-NOT_NULL-CONSTRAINT-Customer-id` IF NOT EXISTS FOR (e:Customer) REQUIRE e.id IS NOT NULL
CREATE CONSTRAINT `spark_RELATIONSHIP-NOT_NULL-CONSTRAINT-BOUGHT-quantity` IF NOT EXISTS FOR ()-[e:BOUGHT]->() REQUIRE e.quantity IS NOT NULL
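To inspect all the constraints created by the connector, you can run a Cypher query such as the following (SHOW CONSTRAINTS is available in Neo4j 4.3 and later):

```cypher
SHOW CONSTRAINTS YIELD name, type, entityType, properties
WHERE name STARTS WITH 'spark_'
RETURN name, type, entityType, properties
```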