Neo4j Streams - Procedures
The Kafka Connect Neo4j Connector is the recommended method to integrate Kafka with Neo4j, as Neo4j Streams is no longer under active development and will not be supported after version 4.4 of Neo4j. The most recent version of the Kafka Connect Neo4j Connector can be found here.
The Streams project ships with a set of procedures.
Configuration
You can enable/disable the procedures by changing this variable inside the neo4j.conf
streams.procedures.enabled=<true/false, default=true>
Please note that Neo4j can restrict which procedures are allowed to run through its procedure whitelist.
If you try to CALL one of the Streams procedures without declaring them in the whitelist, you will receive an error like the following:
Figure 1. Neo4j Streams procedure not found
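To avoid this, a minimal neo4j.conf sketch (assuming your deployment restricts procedures through dbms.security.procedures.whitelist; the exact property name may differ between Neo4j versions) could look like this:
# enable the Streams procedures
streams.procedures.enabled=true
# allow the Streams procedures to be executed
dbms.security.procedures.whitelist=streams.*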
Multi Database Support
Neo4j 4.0 Enterprise has multi-tenancy support.
In order to support this feature, for each database instance you can append a configuration suffix with the following pattern
<DB_NAME>
to the properties in your neo4j.conf file.
So, to enable the Streams procedures, the following property should be added:
streams.procedures.enabled.<DB_NAME>=<true/false, default=true>
So if you have an instance named foo
you can specify the configuration in this way:
streams.procedures.enabled.foo=<true/false, default=true>
The old property:
streams.procedures.enabled=<true/false, default=true>
is still valid and refers to Neo4j's default db instance.
In particular, the following property will be used as the default value for non-default db instances, in case the database-specific configuration parameter is not provided:
streams.procedures.enabled=<true/false, default=true>
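For instance, a sketch of a neo4j.conf fragment (the database name analytics is hypothetical) where the procedures are disabled by default but re-enabled for a single database:
# disable the procedures for the default db and, as a fallback, for every other db
streams.procedures.enabled=false
# re-enable them only for the database named analytics
streams.procedures.enabled.analytics=true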
streams.publish
This procedure allows custom message streaming from Neo4j to the configured environment by using the underlying configured Producer.
Usage:
CALL streams.publish('my-topic', 'Hello World from Neo4j!')
The message retrieved from the Consumer is the following:
{"payload":"Hello world from Neo4j!"}
If you use a local docker (compose) setup, you can check for these messages with:
docker exec -it kafka kafka-console-consumer --topic my-topic --bootstrap-server kafka:9092
Input Parameters:
Variable Name | Type | Description
---|---|---
topic | String | The topic where you want to publish the data
payload | Object | The data that you want to stream
Configuration parameters:
Name | Type | Description
---|---|---
key | Object | The key of the message that you want to stream. Please note that if the key is not provided, the message gets a random UUID as its key
partition | Int | The partition of the message that you want to stream
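For example, a sketch (assuming the optional configuration map is passed as the third argument, as described above, and that my-topic has at least one partition) that publishes a message with an explicit key and partition:
CALL streams.publish('my-topic', 'Hello World from Neo4j!', {key: 'my-key', partition: 0})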
You can send any kind of data in the payload: nodes, relationships, paths, lists, maps, scalar values and nested versions thereof.
In the case of nodes or relationships, if the topic is defined in the patterns provided by the configuration, their properties will be filtered according to that configuration.
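For instance, a sketch (assuming a Person node with name Andrea exists in the graph) that publishes a node as the payload:
// publish the matched node to the topic, then return it
MATCH (p:Person {name: 'Andrea'})
CALL streams.publish('my-topic', p)
RETURN p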
streams.publish.sync
Similar to the streams.publish
procedure, but it works synchronously and yields the resulting record metadata.
Usage:
CALL streams.publish.sync('my-topic', 'my-payload', {<config>}) YIELD value RETURN value
This procedure returns a RecordMetadata
value like this: {"timestamp": 1, "offset": 2, "partition": 3, "keySize": 4, "valueSize": 5}
Variable Name | Description
---|---
timestamp | The timestamp of the record in the topic/partition
offset | The offset of the record in the topic/partition
partition | The partition the record was sent to
keySize | The size of the serialized, uncompressed key in bytes
valueSize | The size of the serialized, uncompressed value in bytes
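As an illustration, a sketch that publishes a map payload and returns selected fields from the yielded metadata:
CALL streams.publish.sync('my-topic', {name: 'Andrea', surname: 'Santurbano'})
YIELD value
RETURN value.partition AS partition, value.offset AS offset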
streams.consume
This procedure allows you to consume messages from a given topic.
Usage:
CALL streams.consume('my-topic', {<config>}) YIELD event RETURN event
Example:
Imagine you have a producer that publishes events like {"name": "Andrea", "surname": "Santurbano"};
you can create user nodes in this way:
CALL streams.consume('my-topic') YIELD event
CREATE (p:Person{firstName: event.data.name, lastName: event.data.surname})
In case you want to read a specific offset of a topic partition you can do it by executing the following query:
CALL streams.consume('my-topic', {timeout: 5000, partitions: [{partition: 0, offset: 30}]}) YIELD event
CREATE (p:Person{firstName: event.data.name, lastName: event.data.surname})
Input Parameters:
Variable Name | Type | Description
---|---|---
topic | String | The topic from which you want to consume the data
config | Map<K,V> | The configuration parameters
Available configuration parameters
Variable Name | Type | Description
---|---|---
timeout | Number (default: 1000) | Defines the amount of time (ms) that the procedure listens to the topic
from | String | It's the Kafka configuration parameter auto.offset.reset. If not specified it inherits the corresponding value from the plugin configuration
groupId | String | It's the Kafka configuration parameter group.id. If not specified it inherits the corresponding value from the plugin configuration
autoCommit | Boolean (default: true) | It's the Kafka configuration parameter enable.auto.commit
commit | Boolean (default: true) | In case of autoCommit: false, it defines whether the consumed offsets should be committed
broker | String | The comma-separated string of Kafka node URLs. If not specified it inherits the bootstrap servers from the plugin configuration
partitions | List<Map<K,V>> | Each map contains the information about partition and offset in order to start reading from a specific position
keyDeserializer | String | The supported deserializer for the Kafka Record Key. If not specified it inherits the corresponding value from the plugin configuration
valueDeserializer | String | The supported deserializer for the Kafka Record Value. If not specified it inherits the corresponding value from the plugin configuration
schemaRegistryUrl | String | The schema registry url, required in case you are dealing with AVRO messages
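For example, a sketch (relying on the parameter names listed above, in particular groupId and autoCommit) that consumes with an explicit consumer group and without auto-committing offsets:
CALL streams.consume('my-topic', {timeout: 10000, groupId: 'my-group', autoCommit: false}) YIELD event
RETURN event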
Streams Sink Lifecycle procedure
We provide a set of procedures in order to manage the Sink lifecycle.
Proc. Name | Description
---|---
streams.sink.stop() | Stops the Sink and returns the status, with the error if one occurred during the process
streams.sink.start() | Starts the Sink and returns the status, with the error if one occurred during the process
streams.sink.restart() | Restarts the Sink and returns the status, with the error if one occurred during the process
streams.sink.config() | Returns the Sink config; please check the "Streams Config" table
streams.sink.status() | Returns the status
Please consider that in order to use these procedures you must enable the Streams procedures and, in a cluster environment, they can be run only on the leader.
Streams Config

Config Name | Description
---|---
invalid_topics | Returns a list of invalid topics
streams.sink.topic.pattern.relationship | Returns a Map<K,V> where K is the topic name and V is the provided pattern
streams.sink.topic.cud | Returns a list of topics defined for the CUD format
streams.sink.topic.cdc.sourceId | Returns a list of topics defined for the CDC SourceId strategy
streams.sink.topic.cypher | Returns a Map<K,V> where K is the topic name and V is the provided Cypher query
streams.sink.topic.cdc.schema | Returns a list of topics defined for the CDC Schema strategy
streams.sink.topic.pattern.node | Returns a Map<K,V> where K is the topic name and V is the provided pattern
streams.sink.errors | Returns a Map<K,V> where K is the sub-property name and V is its value
streams.sink.source.id.strategy.config | Returns the config for the SourceId CDC strategy
Example
Executing: CALL streams.sink.config()
+----------------------------------------------------------------------------------------------------------------------------------------------+
| name | value |
+----------------------------------------------------------------------------------------------------------------------------------------------+
| "streams.sink.errors" | {} |
| "streams.sink.source.id.strategy.config" | {idName -> "sourceId", labelName -> "SourceEvent"} |
| "streams.sink.topic.cypher" | {shouldWriteCypherQuery -> "MERGE (n:Label {id: event.id}) ON CREATE SET n += event.properties"} |
| "streams.sink.topic.cud" | [] |
| "streams.sink.topic.cdc.schema" | [] |
| "streams.sink.topic.cdc.sourceId" | [] |
| "streams.sink.topic.pattern.node" | {} |
| "streams.sink.topic.pattern.relationship" | {} |
| "invalid_topics" | [] |
+----------------------------------------------------------------------------------------------------------------------------------------------+
9 rows
Executing: CALL streams.sink.stop()
+----------------------+
| name | value |
+----------------------+
| "status" | "STOPPED" |
+----------------------+
1 row
Executing: CALL streams.sink.status()
+----------------------+
| name | value |
+----------------------+
| "status" | "STOPPED" |
+----------------------+
1 row
Executing: CALL streams.sink.start()
+----------------------+
| name | value |
+----------------------+
| "status" | "RUNNING" |
+----------------------+
1 row
Executing: CALL streams.sink.status()
+----------------------+
| name | value |
+----------------------+
| "status" | "RUNNING" |
+----------------------+
1 row
Executing: CALL streams.sink.restart()
+----------------------+
| name | value |
+----------------------+
| "status" | "RUNNING" |
+----------------------+
1 row
Given a cluster environment, executing on a NON-LEADER instance: CALL streams.sink.status()
+--------------------------------------------------------------------------------------------------+
| name | value |
+--------------------------------------------------------------------------------------------------+
| "error" | "You can use this procedure only in the LEADER or in a single instance configuration." |
+--------------------------------------------------------------------------------------------------+
1 row