Data type mapping
Neo4j and Cypher® provide a type system that describes how values are stored in the database, but these types do not always exactly match what Spark provides.
In some cases, Neo4j provides data types for which Spark has no equivalent, and vice versa.
Data type mappings
Neo4j type | Spark type | Notes |
---|---|---|
Boolean | DataTypes.BooleanType | |
String | DataTypes.StringType | |
Integer | DataTypes.LongType | |
Float | DataTypes.DoubleType | |
Date | DataTypes.DateType | |
Time | Struct (see Complex data types) | |
LocalTime | Struct (see Complex data types) | |
DateTime | DataTypes.TimestampType | |
LocalDateTime | DataTypes.TimestampType | |
Duration | Struct (see Complex data types) | |
Point | Struct (see Complex data types) | For more information on spatial types in Neo4j, see Spatial values. |
Node | Struct | Nodes in Neo4j are represented as property containers; that is, they appear as structs with fields corresponding to the node's properties. For ease of use, it is usually better to return individual properties than a whole node from a query. |
Relationship | Map | Relationships are returned as maps identifying the source and target of the relationship and its type, along with the relationship's properties (if any). For ease of use, it is usually better to return individual properties than a whole relationship from a query. |
Array | DataTypes.ArrayType | In Neo4j, arrays must be consistently typed (for example, an array must contain only strings, or only integers). |
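As the notes above suggest, it is usually easier to work with individual properties than with whole node or relationship values. A minimal read sketch using the connector's query option (the connection URL and the Person label are hypothetical):

// assumes an existing SparkSession named `spark`
val people = spark.read
  .format("org.neo4j.spark.DataSource")
  .option("url", "neo4j://localhost:7687") // hypothetical connection URL
  .option("query", "MATCH (p:Person) RETURN p.name AS name, p.age AS age")
  .load()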
Complex data types
Spark does not natively support all Neo4j data types (for example Point, Time, Duration). Such types are transformed into Struct types containing all the useful data.
Neo4j type | Spark Struct |
---|---|
Duration | Struct(Array( ("type", DataTypes.StringType, false), ("months", DataTypes.LongType, false), ("days", DataTypes.LongType, false), ("seconds", DataTypes.LongType, false), ("nanoseconds", DataTypes.IntegerType, false), ("value", DataTypes.StringType, false) )) |
Point | Struct(Array( ("type", DataTypes.StringType, false), ("srid", DataTypes.IntegerType, false), ("x", DataTypes.DoubleType, false), ("y", DataTypes.DoubleType, false), ("z", DataTypes.DoubleType, true) )) |
Time | Struct(Array( ("type", DataTypes.StringType, false), ("value", DataTypes.StringType, false) )) |
Map type
When a column is a map, the connector tries to flatten it. For example, consider the following dataset:
id | name | lives_in |
---|---|---|
1 | Andrea Santurbano | {address: 'Times Square, 1', city: 'NY', state: 'NY'} |
2 | Davide Fantuzzi | {address: 'Statue of Liberty, 10', city: 'NY', state: 'NY'} |
The connector flattens the lives_in column into three columns lives_in.address, lives_in.city, and lives_in.state:
id | name | lives_in.address | lives_in.city | lives_in.state |
---|---|---|---|---|
1 | Andrea Santurbano | Times Square, 1 | NY | NY |
2 | Davide Fantuzzi | Statue of Liberty, 10 | NY | NY |
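Note that the flattened column names contain literal dots, so they must be backtick-quoted when referenced on the Spark side; otherwise Spark parses the dot as struct-field access. A minimal sketch, assuming the dataset above was loaded into a DataFrame named df:

// backticks select the literal column name `lives_in.address`;
// without them Spark would look for a struct column named `lives_in`
df.select("id", "name", "`lives_in.address`", "`lives_in.city`").show()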
The same flattening happens on write: when a DataFrame column is a map, the connector flattens it internally, since Neo4j does not support maps as properties of graph entities. So, for a Spark job like this:
// assumes an existing SparkSession named `spark` and import spark.implicits._ for .toDF
import org.apache.spark.sql.SaveMode
import org.neo4j.spark.DataSource

val data = Seq(
  ("Foo", 1, Map("inner" -> Map("key" -> "innerValue"))),
  ("Bar", 2, Map("inner" -> Map("key" -> "innerValue1")))
).toDF("id", "time", "table")

data.write
  .mode(SaveMode.Append)
  .format(classOf[DataSource].getName)
  .option("url", SparkConnectorScalaSuiteIT.server.getBoltUrl) // Bolt URL from the connector's test suite
  .option("labels", ":MyNodeWithFlattenedMap")
  .save()
In Neo4j, the nodes with label MyNodeWithFlattenedMap store the following information:
MyNodeWithFlattenedMap { id: 'Foo', time: 1, `table.inner.key`: 'innerValue' }
MyNodeWithFlattenedMap { id: 'Bar', time: 2, `table.inner.key`: 'innerValue1' }
You can, however, run into problematic situations like the following one:
val data = Seq(
("Foo", 1, Map("key.inner" -> Map("key" -> "innerValue"), "key" -> Map("inner.key" -> "value"))),
("Bar", 1, Map("key.inner" -> Map("key" -> "innerValue1"), "key" -> Map("inner.key" -> "value1"))),
).toDF("id", "time", "table")
data.write
.mode(SaveMode.Append)
.format(classOf[DataSource].getName)
.option("url", SparkConnectorScalaSuiteIT.server.getBoltUrl)
.option("labels", ":MyNodeWithFlattenedMap")
.save()
Since the resulting flattened keys are duplicated, the connector picks one of the associated values in a non-deterministic way. The information stored in Neo4j will be the following (which value is kept is not guaranteed):
MyNodeWithFlattenedMap { id: 'Foo', time: 1, `table.key.inner.key`: 'innerValue' // but it could be 'value' as the order is not guaranteed }
MyNodeWithFlattenedMap { id: 'Bar', time: 1, `table.key.inner.key`: 'innerValue1' // but it could be 'value1' as the order is not guaranteed }
Group duplicated keys into an array of values
You can use the schema.map.group.duplicate.keys option to avoid this problem: the connector groups all values sharing the same flattened key into an array. The default value of the option is false.
In a scenario like this:
val data = Seq(
("Foo", 1, Map("key.inner" -> Map("key" -> "innerValue"), "key" -> Map("inner.key" -> "value"))),
("Bar", 1, Map("key.inner" -> Map("key" -> "innerValue1"), "key" -> Map("inner.key" -> "value1"))),
).toDF("id", "time", "table")
data.write
.mode(SaveMode.Append)
.format(classOf[DataSource].getName)
.option("url", SparkConnectorScalaSuiteIT.server.getBoltUrl)
.option("labels", ":MyNodeWithFlattenedMap")
.option("schema.map.group.duplicate.keys", true)
.save()
the output would be:
MyNodeWithFlattenedMap { id: 'Foo', time: 1, `table.key.inner.key`: ['innerValue', 'value'] // the order is not guaranteed }
MyNodeWithFlattenedMap { id: 'Bar', time: 1, `table.key.inner.key`: ['innerValue1', 'value1'] // the order is not guaranteed }
Constraint type mapping
Spark type | Neo4j type |
---|---|
BooleanType | BOOLEAN |
StringType | STRING |
IntegerType | INTEGER |
LongType | INTEGER |
FloatType | FLOAT |
DoubleType | FLOAT |
DateType | DATE |
TimestampType | LOCAL DATETIME |
Custom pointType | POINT |
Custom durationType | DURATION |
DataTypes.createArrayType(BooleanType, false) | LIST<BOOLEAN NOT NULL> |
DataTypes.createArrayType(StringType, false) | LIST<STRING NOT NULL> |
DataTypes.createArrayType(IntegerType, false) | LIST<INTEGER NOT NULL> |
DataTypes.createArrayType(LongType, false) | LIST<INTEGER NOT NULL> |
DataTypes.createArrayType(FloatType, false) | LIST<FLOAT NOT NULL> |
DataTypes.createArrayType(DoubleType, false) | LIST<FLOAT NOT NULL> |
DataTypes.createArrayType(DateType, false) | LIST<DATE NOT NULL> |
DataTypes.createArrayType(TimestampType, false) | LIST<LOCAL DATETIME NOT NULL> |
DataTypes.createArrayType(pointType, false) | LIST<POINT NOT NULL> |
DataTypes.createArrayType(durationType, false) | LIST<DURATION NOT NULL> |
For arrays in particular, the connector uses the variant without null elements, since Neo4j does not allow null values inside arrays.
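As an illustration of the array rows above, this is a minimal sketch of how a non-nullable array type is declared on the Spark side (the tags field name is hypothetical):

import org.apache.spark.sql.types.{DataTypes, StructField}

// containsNull = false maps to LIST<STRING NOT NULL> in Neo4j;
// the default containsNull = true has no Neo4j equivalent, since
// Neo4j does not allow null elements inside arrays
val tags = StructField("tags", DataTypes.createArrayType(DataTypes.StringType, false))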