Data type mapping

Neo4j and Cypher® provide a type system that describes how values are stored in the database, but these types do not always exactly match what Spark provides.

In some cases, there are data types that Neo4j provides that Spark does not have an equivalent for, and vice versa.

Data type mappings

Table 1. Neo4j to Spark type mapping reference

| Neo4j type | Spark type | Notes |
| --- | --- | --- |
| String | string | Example: "Hello" |
| Integer | long | Example: 12345 |
| Float | double | Example: 3.141592 |
| Boolean | boolean | Example: true |
| Point | struct { type: string, srid: integer, x: double, y: double, z: double } | For more information on spatial types in Neo4j, see Spatial values. |
| Date | date | Example: 2020-09-11 |
| Time | struct { type: string, value: string } | Example: [offset-time, 12:14:08.209Z] |
| LocalTime | struct { type: string, value: string } | Example: [local-time, 12:18:11.628] |
| DateTime | timestamp | Example: 2020-09-11 12:17:39.192. This mapping only applies when reading from Neo4j; a Spark timestamp is written to Neo4j as a LocalDateTime. If you need to write a DateTime, use a Cypher write query (see the sketch after this table). |
| LocalDateTime | timestamp | Example: 2020-09-11 12:14:49.081 |
| Duration | struct { type: string, months: long, days: long, seconds: long, nanoseconds: integer, value: string } | See Temporal functions: duration. |
| Node | struct { <id>: long, <labels>: array[string], (PROPERTIES) } | Nodes are property containers: they appear as structs whose fields correspond to the node's properties. For ease of use, it is usually better to return individual properties than a whole node from a query. |
| Relationship | struct { <rel.id>: long, <rel.type>: string, <source.id>: long, <target.id>: long, (PROPERTIES) } | Relationships are returned as structs identifying the source and target of the relationship and its type, along with the relationship's properties (if any). For ease of use, it is usually better to return individual properties than a whole relationship from a query. |
| Path | string | Example: path[(322)←[20280:AIRLINE]-(33510)]. For ease of use, use path functions to return individual properties and aspects of a path from a query. |
| [Array of same type] | array[element] | Neo4j arrays must be consistently typed (for example, an array may contain only Float values). The inner Spark type follows the mappings above. |
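
As noted in the DateTime row, a Spark timestamp written through the node-based write lands in Neo4j as a LocalDateTime. Below is a minimal sketch of writing a true DateTime through a Cypher write query instead; the URL, the Event label, and the column names are illustrative. With the query option, each DataFrame row is exposed to the query as event:

import org.apache.spark.sql.{SaveMode, SparkSession}
import org.neo4j.spark.DataSource

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

// ISO-8601 strings with an offset let Cypher's datetime() build a zoned DateTime.
val events = Seq(("e1", "2020-09-11T12:17:39.192Z")).toDF("id", "at")

events.write
  .mode(SaveMode.Append)
  .format(classOf[DataSource].getName)
  .option("url", "bolt://localhost:7687") // illustrative URL
  .option("query", "CREATE (e:Event {id: event.id, at: datetime(event.at)})")
  .save()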

Complex data types

Spark does not natively support all Neo4j data types (for example, Point, Time, and Duration). Such types are transformed into Struct types containing all the useful data; the sketch after the conversion table shows how to access their fields.

Table 2. Complex data type conversion
Neo4j type Spark Struct

Duration

Struct(Array(
    ("type", DataTypes.StringType, false),
    ("months", DataTypes.LongType, false),
    ("days", DataTypes.LongType, false),
    ("seconds", DataTypes.LongType, false),
    ("nanoseconds", DataTypes.IntegerType, false),
    ("value", DataTypes.StringType, false)
  ))

Point

Struct(Array(
    ("type", DataTypes.StringType, false),
    ("srid", DataTypes.IntegerType, false),
    ("x", DataTypes.DoubleType, false),
    ("y", DataTypes.DoubleType, false),
    ("z", DataTypes.DoubleType, true)
  ))

Time

Struct(Array(
    ("type", DataTypes.StringType, false),
    ("value", DataTypes.StringType, false)
  ))
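
Since these conversions produce ordinary Spark structs, their fields can be addressed with dot notation after a read. A minimal sketch, reusing the SparkSession and imports from the earlier sketch, and assuming nodes carrying a point property named location (the property name, label, and URL are illustrative):

val places = spark.read
  .format(classOf[DataSource].getName)
  .option("url", "bolt://localhost:7687") // illustrative URL
  .option("labels", ":Place")             // illustrative label
  .load()

// The Point struct fields become nested columns: srid, x, y (and z when present).
places.select("location.srid", "location.x", "location.y").show()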

Map type

When a column is a map, the connector tries to flatten it. For example, consider the following dataset:

| id | name | lives_in |
| --- | --- | --- |
| 1 | Andrea Santurbano | {address: 'Times Square, 1', city: 'NY', state: 'NY'} |
| 2 | Davide Fantuzzi | {address: 'Statue of Liberty, 10', city: 'NY', state: 'NY'} |

The connector flattens the lives_in column into the three columns lives_in.address, lives_in.city, and lives_in.state (a write sketch follows the result table below):

| id | name | lives_in.address | lives_in.city | lives_in.state |
| --- | --- | --- | --- | --- |
| 1 | Andrea Santurbano | Times Square, 1 | NY | NY |
| 2 | Davide Fantuzzi | Statue of Liberty, 10 | NY | NY |
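
As a sketch, the dataset above could be written like this, reusing the SparkSession and imports from the earlier sketches; each node then gets the three flattened properties (the URL and the Person label are illustrative):

val people = Seq(
  (1, "Andrea Santurbano", Map("address" -> "Times Square, 1", "city" -> "NY", "state" -> "NY")),
  (2, "Davide Fantuzzi", Map("address" -> "Statue of Liberty, 10", "city" -> "NY", "state" -> "NY"))
).toDF("id", "name", "lives_in")

people.write
  .mode(SaveMode.Append)
  .format(classOf[DataSource].getName)
  .option("url", "bolt://localhost:7687") // illustrative URL
  .option("labels", ":Person")            // illustrative label
  .save()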

When a DataFrame column is a map, the connector flattens the map internally, since Neo4j does not support maps as graph entity properties. So, for a Spark job like this:

import org.apache.spark.sql.SaveMode
import org.neo4j.spark.DataSource

// `toDF` requires the Spark session implicits: import spark.implicits._
val data = Seq(
  ("Foo", 1, Map("inner" -> Map("key" -> "innerValue"))),
  ("Bar", 2, Map("inner" -> Map("key" -> "innerValue1")))
).toDF("id", "time", "table")

data.write
  .mode(SaveMode.Append)
  .format(classOf[DataSource].getName)
  .option("url", SparkConnectorScalaSuiteIT.server.getBoltUrl)
  .option("labels", ":MyNodeWithFlattenedMap")
  .save()

In Neo4j, the nodes with the label MyNodeWithFlattenedMap store the following data:

MyNodeWithFlattenedMap {
    id: 'Foo',
    time: 1,
    `table.inner.key`: 'innerValue'
}
MyNodeWithFlattenedMap {
    id: 'Bar',
    time: 2,
    `table.inner.key`: 'innerValue1'
}
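
Note that the flattened property names contain dots, so they must be escaped with backticks in Cypher. A sketch of reading one of them back through a query-based read, reusing the SparkSession and imports from the sketches above (the URL is illustrative):

val innerKeys = spark.read
  .format(classOf[DataSource].getName)
  .option("url", "bolt://localhost:7687") // illustrative URL
  .option("query", "MATCH (n:MyNodeWithFlattenedMap) RETURN n.id AS id, n.`table.inner.key` AS innerKey")
  .load()

innerKeys.show()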

However, this flattening can lead to problematic situations like the following one:

val data = Seq(
  ("Foo", 1, Map("key.inner" -> Map("key" -> "innerValue"), "key" -> Map("inner.key" -> "value"))),
  ("Bar", 1, Map("key.inner" -> Map("key" -> "innerValue1"), "key" -> Map("inner.key" -> "value1"))),
).toDF("id", "time", "table")
data.write
  .mode(SaveMode.Append)
  .format(classOf[DataSource].getName)
  .option("url", SparkConnectorScalaSuiteIT.server.getBoltUrl)
  .option("labels", ":MyNodeWithFlattenedMap")
  .save()

Since the resulting flattened keys are duplicated (both "key.inner" → "key" and "key" → "inner.key" flatten to the same property name table.key.inner.key), the Neo4j Spark Connector picks one of the associated values in a non-deterministic way.

The information stored into Neo4j looks like this (which value gets picked is not guaranteed):

MyNodeWithFlattenedMap {
    id: 'Foo',
    time: 1,
    `table.key.inner.key`: 'innerValue' // but it could be `value` as the order is not guaranteed
}
MyNodeWithFlattenedMap {
    id: 'Bar',
    time: 1,
    `table.key.inner.key`: 'innerValue1' // but it could be `value1` as the order is not guaranteed
}

Group duplicated keys into an array of values

You can use the schema.map.group.duplicate.keys option to avoid this problem: the connector groups all the values having the same flattened key into an array. The default value of the option is false. In a scenario like this:

val data = Seq(
  ("Foo", 1, Map("key.inner" -> Map("key" -> "innerValue"), "key" -> Map("inner.key" -> "value"))),
  ("Bar", 1, Map("key.inner" -> Map("key" -> "innerValue1"), "key" -> Map("inner.key" -> "value1"))),
).toDF("id", "time", "table")
data.write
  .mode(SaveMode.Append)
  .format(classOf[DataSource].getName)
  .option("url", SparkConnectorScalaSuiteIT.server.getBoltUrl)
  .option("labels", ":MyNodeWithFlattenedMap")
  .option("schema.map.group.duplicate.keys", true)
  .save()

the output would be:

MyNodeWithFlattenedMap {
    id: 'Foo',
    time: 1,
    `table.key.inner.key`: ['innerValue', 'value'] // the order is not guaranteed
}
MyNodeWithFlattenedMap {
    id: 'Bar',
    time: 1,
    `table.key.inner.key`: ['innerValue1', 'value1'] // the order is not guaranteed
}

Constraint type mapping

Table 3. Spark to Cypher constraint type mapping

| Spark type | Neo4j type |
| --- | --- |
| BooleanType | BOOLEAN |
| StringType | STRING |
| IntegerType | INTEGER |
| LongType | INTEGER |
| FloatType | FLOAT |
| DoubleType | FLOAT |
| DateType | DATE |
| TimestampType | LOCAL DATETIME |
| Custom pointType as Struct { type: string, srid: integer, x: double, y: double, z: double } | POINT |
| Custom durationType as Struct { type: string, months: long, days: long, seconds: long, nanoseconds: integer, value: string } | DURATION |
| DataTypes.createArrayType(BooleanType, false) | LIST<BOOLEAN NOT NULL> |
| DataTypes.createArrayType(StringType, false) | LIST<STRING NOT NULL> |
| DataTypes.createArrayType(IntegerType, false) | LIST<INTEGER NOT NULL> |
| DataTypes.createArrayType(LongType, false) | LIST<INTEGER NOT NULL> |
| DataTypes.createArrayType(FloatType, false) | LIST<FLOAT NOT NULL> |
| DataTypes.createArrayType(DoubleType, false) | LIST<FLOAT NOT NULL> |
| DataTypes.createArrayType(DateType, false) | LIST<DATE NOT NULL> |
| DataTypes.createArrayType(TimestampType, false) | LIST<LOCAL DATETIME NOT NULL> |
| DataTypes.createArrayType(pointType, false) | LIST<POINT NOT NULL> |
| DataTypes.createArrayType(durationType, false) | LIST<DURATION NOT NULL> |

For arrays in particular, the connector uses the variant without null elements, since Neo4j does not allow null values inside arrays.
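
As a sketch, this is how the non-nullable array variants in the table are declared on the Spark side, using DataTypes.createArrayType with containsNull = false (the column names are illustrative):

import org.apache.spark.sql.types._

// A user-defined schema whose `tags` column is an array that cannot contain
// null elements: per the table above, it maps to LIST<STRING NOT NULL>.
val schema = StructType(Seq(
  StructField("id", LongType, nullable = false),                    // INTEGER
  StructField("tags", DataTypes.createArrayType(StringType, false)) // LIST<STRING NOT NULL>
))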