Data type mapping
Neo4j and Cypher® provide a type system that describes how values are stored in the database, but these types do not always exactly match what Spark provides.
In some cases, Neo4j provides data types for which Spark has no equivalent, and vice versa.
Data type mappings
Neo4j type | Spark type | Notes |
---|---|---|
Boolean | DataTypes.BooleanType | |
String | DataTypes.StringType | |
Integer | DataTypes.LongType | |
Float | DataTypes.DoubleType | |
Date | DataTypes.DateType | |
Time | Struct (see Complex data types) | |
LocalTime | Struct (see Complex data types) | |
DateTime | DataTypes.TimestampType | |
LocalDateTime | DataTypes.TimestampType | |
Duration | Struct (see Complex data types) | |
Point | Struct (see Complex data types) | For more information on spatial types in Neo4j, see Spatial values. |
Node | Struct | Nodes in Neo4j are represented as property containers; that is, they appear as structs with fields corresponding to the node's properties. For ease of use, it is usually better to return individual properties than a whole node from a query. |
Relationship | Map | Relationships are returned as maps identifying the source and target of the relationship and its type, along with the relationship's properties (if any). For ease of use, it is usually better to return individual properties than a whole relationship from a query. |
Array | DataTypes.ArrayType | In Neo4j, arrays must be consistently typed (for example, an array must contain only strings, or only integers). |
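As the notes above suggest, it is usually easier to work with individual properties than with whole node or relationship values. A minimal read sketch using the connector's query option (the connection URL and the Person label are hypothetical):

// assumes an existing SparkSession named `spark`
val people = spark.read
  .format("org.neo4j.spark.DataSource")
  .option("url", "neo4j://localhost:7687") // hypothetical connection URL
  .option("query", "MATCH (p:Person) RETURN p.name AS name, p.age AS age")
  .load()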
Complex data types
Spark does not natively support all Neo4j data types (for example Point, Time, Duration). Such types are transformed into Struct types containing all the useful data.
Neo4j type | Spark Struct |
---|---|
Duration | Struct(Array( ("type", DataTypes.StringType, false), ("months", DataTypes.LongType, false), ("days", DataTypes.LongType, false), ("seconds", DataTypes.LongType, false), ("nanoseconds", DataTypes.IntegerType, false), ("value", DataTypes.StringType, false) )) |
Point | Struct(Array( ("type", DataTypes.StringType, false), ("srid", DataTypes.IntegerType, false), ("x", DataTypes.DoubleType, false), ("y", DataTypes.DoubleType, false), ("z", DataTypes.DoubleType, true) )) |
Time | Struct(Array( ("type", DataTypes.StringType, false), ("value", DataTypes.StringType, false) )) |
Map type
When a column is a map, the connector tries to flatten it. For example, consider the following dataset:
id | name | lives_in |
---|---|---|
1 | Andrea Santurbano | {address: 'Times Square, 1', city: 'NY', state: 'NY'} |
2 | Davide Fantuzzi | {address: 'Statue of Liberty, 10', city: 'NY', state: 'NY'} |
The connector flattens the lives_in column into three columns lives_in.address, lives_in.city, and lives_in.state:
id | name | lives_in.address | lives_in.city | lives_in.state |
---|---|---|---|---|
1 | Andrea Santurbano | Times Square, 1 | NY | NY |
2 | Davide Fantuzzi | Statue of Liberty, 10 | NY | NY |
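Note that the flattened column names contain literal dots, so they must be backtick-quoted when referenced on the Spark side; otherwise Spark parses the dot as struct-field access. A minimal sketch, assuming the dataset above was loaded into a DataFrame named df:

// backticks select the literal column name `lives_in.address`;
// without them Spark would look for a struct column named `lives_in`
df.select("id", "name", "`lives_in.address`", "`lives_in.city`").show()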
The same flattening happens on write: when a DataFrame column is a map, the connector flattens it internally, since Neo4j does not support maps as properties of graph entities. So, for a Spark job like this:
// assumes an existing SparkSession named `spark` and import spark.implicits._ for .toDF
import org.apache.spark.sql.SaveMode
import org.neo4j.spark.DataSource

val data = Seq(
  ("Foo", 1, Map("inner" -> Map("key" -> "innerValue"))),
  ("Bar", 2, Map("inner" -> Map("key" -> "innerValue1")))
).toDF("id", "time", "table")

data.write
  .mode(SaveMode.Append)
  .format(classOf[DataSource].getName)
  .option("url", SparkConnectorScalaSuiteIT.server.getBoltUrl) // Bolt URL from the connector's test suite
  .option("labels", ":MyNodeWithFlattenedMap")
  .save()
In Neo4j, the nodes with label MyNodeWithFlattenedMap store the following information:
MyNodeWithFlattenedMap { id: 'Foo', time: 1, `table.inner.key`: 'innerValue' }
MyNodeWithFlattenedMap { id: 'Bar', time: 2, `table.inner.key`: 'innerValue1' }
You can, however, run into problematic situations like the following one:
val data = Seq(
("Foo", 1, Map("key.inner" -> Map("key" -> "innerValue"), "key" -> Map("inner.key" -> "value"))),
("Bar", 1, Map("key.inner" -> Map("key" -> "innerValue1"), "key" -> Map("inner.key" -> "value1"))),
).toDF("id", "time", "table")
data.write
.mode(SaveMode.Append)
.format(classOf[DataSource].getName)
.option("url", SparkConnectorScalaSuiteIT.server.getBoltUrl)
.option("labels", ":MyNodeWithFlattenedMap")
.save()
Since the resulting flattened keys are duplicated, the connector picks one of the associated values in a non-deterministic way. The information stored in Neo4j will be the following (which value is kept is not guaranteed):
MyNodeWithFlattenedMap { id: 'Foo', time: 1, `table.key.inner.key`: 'innerValue' // but it could be 'value' as the order is not guaranteed }
MyNodeWithFlattenedMap { id: 'Bar', time: 1, `table.key.inner.key`: 'innerValue1' // but it could be 'value1' as the order is not guaranteed }
Group duplicated keys into an array of values
You can use the schema.map.group.duplicate.keys option to avoid this problem: the connector groups all values sharing the same flattened key into an array. The default value of the option is false.
In a scenario like this:
val data = Seq(
("Foo", 1, Map("key.inner" -> Map("key" -> "innerValue"), "key" -> Map("inner.key" -> "value"))),
("Bar", 1, Map("key.inner" -> Map("key" -> "innerValue1"), "key" -> Map("inner.key" -> "value1"))),
).toDF("id", "time", "table")
data.write
.mode(SaveMode.Append)
.format(classOf[DataSource].getName)
.option("url", SparkConnectorScalaSuiteIT.server.getBoltUrl)
.option("labels", ":MyNodeWithFlattenedMap")
.option("schema.map.group.duplicate.keys", true)
.save()
the output would be:
MyNodeWithFlattenedMap { id: 'Foo', time: 1, `table.key.inner.key`: ['innerValue', 'value'] // the order is not guaranteed }
MyNodeWithFlattenedMap { id: 'Bar', time: 1, `table.key.inner.key`: ['innerValue1', 'value1'] // the order is not guaranteed }
Constraint type mapping
Spark type | Neo4j type |
---|---|
BooleanType | BOOLEAN |
StringType | STRING |
IntegerType | INTEGER |
LongType | INTEGER |
FloatType | FLOAT |
DoubleType | FLOAT |
DateType | DATE |
TimestampType | LOCAL DATETIME |
Custom pointType | POINT |
Custom durationType | DURATION |
DataTypes.createArrayType(BooleanType, false) | LIST<BOOLEAN NOT NULL> |
DataTypes.createArrayType(StringType, false) | LIST<STRING NOT NULL> |
DataTypes.createArrayType(IntegerType, false) | LIST<INTEGER NOT NULL> |
DataTypes.createArrayType(LongType, false) | LIST<INTEGER NOT NULL> |
DataTypes.createArrayType(FloatType, false) | LIST<FLOAT NOT NULL> |
DataTypes.createArrayType(DoubleType, false) | LIST<FLOAT NOT NULL> |
DataTypes.createArrayType(DateType, false) | LIST<DATE NOT NULL> |
DataTypes.createArrayType(TimestampType, false) | LIST<LOCAL DATETIME NOT NULL> |
DataTypes.createArrayType(pointType, false) | LIST<POINT NOT NULL> |
DataTypes.createArrayType(durationType, false) | LIST<DURATION NOT NULL> |
For arrays in particular, the connector uses the variant without null elements, since Neo4j does not allow null values inside arrays.
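As an illustration of the array rows above, this is a minimal sketch of how a non-nullable array type is declared on the Spark side (the tags field name is hypothetical):

import org.apache.spark.sql.types.{DataTypes, StructField}

// containsNull = false maps to LIST<STRING NOT NULL> in Neo4j;
// the default containsNull = true has no Neo4j equivalent, since
// Neo4j does not allow null elements inside arrays
val tags = StructField("tags", DataTypes.createArrayType(DataTypes.StringType, false))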