Export using Apache Arrow
The graphs in the Neo4j Graph Data Science Library support properties for nodes and relationships. One way to export those properties is using Cypher procedures, as documented in Streaming nodes and Streaming relationships. Similar to the procedures, GDS also supports exporting properties via Arrow Flight.
In this chapter, we assume that a Flight server has been set up and configured. To learn more about the installation, please refer to the installation chapter.
Arrow export features are versioned to allow for future changes. Please refer to the corresponding section in the Configure Apache Arrow server documentation for more details on versioned commands.
Arrow Ticket format
Flight streams to read properties from an in-memory graph are initiated by the Arrow client by calling the GET function and providing a Flight ticket.
The general idea is to mirror the behaviour of the procedures for streaming properties from the in-memory graph.
To identify the graph and the procedure that we want to mirror, the ticket must contain the following keys:
Name | Type | Description |
---|---|---|
graph_name | String | The name of the graph in the graph catalog. |
database_name | String | The database the graph is associated with. |
procedure_name | String | The mirrored property stream procedure. |
configuration | Map | The procedure specific configuration. |
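As a minimal sketch, a ticket with these keys could be assembled and serialized in Python as shown below. The helper name build_get_ticket is hypothetical and not part of GDS, and the sketch assumes that the ticket is transmitted as UTF-8 encoded JSON. An Arrow Flight client would then wrap the resulting bytes in a Flight ticket before calling the GET function.

```python
import json


def build_get_ticket(graph_name, database_name, procedure_name, configuration):
    """Serialize a v1 GET_COMMAND ticket (hypothetical helper, not part of GDS)."""
    ticket = {
        "name": "GET_COMMAND",
        "version": "v1",
        "body": {
            "graph_name": graph_name,
            "database_name": database_name,
            "procedure_name": procedure_name,
            "configuration": configuration,
        },
    }
    # Assumption: the server expects the ticket as UTF-8 encoded JSON.
    return json.dumps(ticket).encode("utf-8")


# Ticket for streaming node labels, using the placeholder names from this chapter.
ticket_bytes = build_get_ticket(
    "my_graph",
    "database_name",
    "gds.graph.nodeLabels.stream",
    {"consecutive_ids": False},
)
```

Keeping the envelope (name, version, body) in one helper means that only the procedure name and its configuration change between the stream types described in the following sections.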
The following image shows the client-server interaction for exporting data using node property streaming as an example.
Stream all node labels
To stream node labels for each node in a graph, the client needs to provide the following ticket:
{ name: "GET_COMMAND", version: "v1", body: { graph_name: "my_graph", database_name: "database_name", procedure_name: "gds.graph.nodeLabels.stream", configuration: { consecutive_ids: false } } }
The specific command configuration supports the following keys:
Name | Type | Description |
---|---|---|
consecutive_ids | Boolean | Returns node ids mapped to a consecutive id space. |
The schema of the result records is as follows:
Name | Type | Description | Optional |
---|---|---|---|
nodeId | Integer | The id of the node. | false |
labels | List of Strings | The labels of the node. | false |
Stream a single node property
To stream a single node property, the client needs to encode that information in the ticket as follows:
{ name: "GET_COMMAND", version: "v1", body: { graph_name: "my_graph", database_name: "database_name", procedure_name: "gds.graph.nodeProperty.stream", configuration: { node_labels: ["*"], node_property: "foo", list_node_labels: true, consecutive_ids: false } } }
The procedure_name indicates that we mirror the behaviour of the existing procedure.
The specific configuration needs to include the following keys:
Name | Type | Description |
---|---|---|
node_labels | String or List of Strings | Stream only properties for nodes with the given labels. |
node_property | String | The node property in the graph to stream. |
list_node_labels | Boolean | Whether to include node labels of the respective nodes in the result. |
consecutive_ids | Boolean | Returns node ids mapped to a consecutive id space. |
The schema of the result records is identical to the corresponding procedure:
Name | Type | Description | Optional |
---|---|---|---|
nodeId | Integer | The id of the node. | false |
propertyValue |  | The stored property value. | false |
labels | List of Strings | The labels of the node. Only present if list_node_labels is true. | true |
Stream multiple node properties
To stream multiple node properties, the client needs to encode that information in the ticket as follows:
{ name: "GET_COMMAND", version: "v1", body: { graph_name: "my_graph", database_name: "database_name", procedure_name: "gds.graph.nodeProperties.stream", configuration: { node_labels: ["*"], node_properties: ["foo", "bar", "baz"], list_node_labels: true, consecutive_ids: false } } }
The procedure_name indicates that we mirror the behaviour of the existing procedure.
The specific configuration needs to include the following keys:
Name | Type | Description |
---|---|---|
node_labels | String or List of Strings | Stream only properties for nodes with the given labels. |
node_properties | String or List of Strings | The node properties in the graph to stream. |
list_node_labels | Boolean | Whether to include node labels of the respective nodes in the result. |
consecutive_ids | Boolean | Returns node ids mapped to a consecutive id space. |
Note that the schema of the result records is not identical to the corresponding procedure. Instead of a separate column containing the property key, every property is returned in its own column. As a result, there is only one row per node which includes all its property values.
For example, given the node (a { foo: 42, bar: 1337, baz: [1,3,3,7] }) and assuming node id 0 for a, the resulting record schema is as follows:
nodeId | foo | bar | baz |
---|---|---|---|
0 | 42 | 1337 | [1,3,3,7] |
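The difference between the two result shapes can be sketched in plain Python. The rows below are illustrative stand-ins for what the procedure would return for the example node, with one row per node and property key. The column order (nodeId, nodeProperty, propertyValue) is an assumption, and pivot_rows is a hypothetical helper that reshapes those rows into the one-row-per-node layout of the Arrow stream.

```python
def pivot_rows(rows):
    """Reshape (nodeId, nodeProperty, propertyValue) rows into one record per node."""
    records = {}
    for node_id, key, value in rows:
        # Create the record on first sight of a node, then add one column per key.
        record = records.setdefault(node_id, {"nodeId": node_id})
        record[key] = value
    return list(records.values())


# Procedure-style rows for the example node (illustrative values from the text).
procedure_rows = [
    (0, "foo", 42),
    (0, "bar", 1337),
    (0, "baz", [1, 3, 3, 7]),
]

arrow_records = pivot_rows(procedure_rows)
print(arrow_records)
# [{'nodeId': 0, 'foo': 42, 'bar': 1337, 'baz': [1, 3, 3, 7]}]
```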
Stream a single relationship property
To stream a single relationship property, the client needs to encode that information in the ticket as follows:
{ name: "GET_COMMAND", version: "v1", body: { graph_name: "my_graph", database_name: "database_name", procedure_name: "gds.graph.relationshipProperty.stream", configuration: { relationship_types: "REL", relationship_property: "foo", consecutive_ids: false } } }
The procedure_name indicates that we mirror the behaviour of the existing procedure.
The specific configuration needs to include the following keys:
Name | Type | Description |
---|---|---|
relationship_types | String or List of Strings | Stream only properties for relationships with the given type. |
relationship_property | String | The relationship property in the graph to stream. |
consecutive_ids | Boolean | Returns node ids mapped to a consecutive id space. |
The schema of the result records is identical to the corresponding procedure:
Name | Type | Description |
---|---|---|
sourceNodeId | Integer | The source node id of the relationship. |
targetNodeId | Integer | The target node id of the relationship. |
relationshipType | Integer | Dictionary-encoded relationship type. |
propertyValue | Float | The stored property value. |
Note that the relationship type column stores the relationship type encoded as an integer. The corresponding string value needs to be retrieved from the corresponding dictionary value vector. That vector can be loaded from the dictionary provider using the encoding id of the type field.
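To illustrate the dictionary encoding, the following sketch decodes integer type ids by indexing into a dictionary vector. Plain Python lists stand in for the Arrow vectors, and the values, including the second type KNOWS, are made up for illustration. An Arrow client library would normally expose the dictionary alongside the encoded column.

```python
# Stand-in for the dictionary value vector (illustrative values only).
type_dictionary = ["REL", "KNOWS"]

# Stand-in for the encoded relationshipType column of a record batch.
encoded_types = [0, 0, 1]


def decode_types(encoded, dictionary):
    """Map each dictionary index to its string value."""
    return [dictionary[index] for index in encoded]


decoded_types = decode_types(encoded_types, type_dictionary)
print(decoded_types)  # ['REL', 'REL', 'KNOWS']
```

Each distinct type string is stored once in the dictionary, so the column itself only carries small integers.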
Stream multiple relationship properties
To stream multiple relationship properties, the client needs to encode that information in the ticket as follows:
{ name: "GET_COMMAND", version: "v1", body: { graph_name: "my_graph", database_name: "database_name", procedure_name: "gds.graph.relationshipProperties.stream", configuration: { relationship_types: "REL", relationship_properties: ["foo", "bar"], consecutive_ids: false } } }
The procedure_name indicates that we mirror the behaviour of the existing procedure.
The specific configuration needs to include the following keys:
Name | Type | Description |
---|---|---|
relationship_types | String or List of Strings | Stream only properties for relationships with the given type. |
relationship_properties | String or List of Strings | The relationship properties in the graph to stream. |
consecutive_ids | Boolean | Returns node ids mapped to a consecutive id space. |
Note that the schema of the result records is not identical to the corresponding procedure. Instead of a separate column containing the property key, every property is returned in its own column. As a result, there is only one row per relationship which includes all its property values.
For example, given the relationship [:REL { foo: 42.0, bar: 13.37 }] that connects a source node with id 0 with a target node with id 1, the resulting record schema is as follows:
sourceNodeId | targetNodeId | relationshipType | foo | bar |
---|---|---|---|---|
0 | 1 | 0 | 42.0 | 13.37 |
Note that the relationship type column stores the relationship type encoded as an integer. The corresponding string value needs to be retrieved from the corresponding dictionary value vector. That vector can be loaded from the dictionary provider using the encoding id of the type field.
Stream relationship topology
To stream the topology of one or more relationship types, the client needs to encode that information in the ticket as follows:
{ name: "GET_COMMAND", version: "v1", body: { graph_name: "my_graph", database_name: "database_name", procedure_name: "gds.graph.relationships.stream", configuration: { relationship_types: "REL", consecutive_ids: false } } }
The procedure_name indicates that we mirror the behaviour of the existing procedure.
The specific configuration needs to include the following keys:
Name | Type | Description |
---|---|---|
relationship_types | String or List of Strings | Stream only relationships with the given type. |
consecutive_ids | Boolean | Returns node ids mapped to a consecutive id space. |
The schema of the result records is identical to the corresponding procedure:
sourceNodeId | targetNodeId | relationshipType |
---|---|---|
0 | 1 | 0 |
Note that the relationship type column stores the relationship type encoded as an integer. The corresponding string value needs to be retrieved from the corresponding dictionary value vector. That vector can be loaded from the dictionary provider using the encoding id of the type field.
Partitioning the data streams
Some use-cases require the data streams to be partitioned.
For example, if the data streams are consumed by a distributed system, the data streams need to be evenly distributed to the members of the distributed system.
To support this use-case, the client can request the data streams to be partitioned by sending the stream request to the FlightInfo endpoint of the GDS Flight Server.
The server will then return a number of endpoints, where each endpoint and its accompanying ticket can be used to stream a partition of the data.
The concurrency setting of the ticket can be used to control the number of partitions.
For example, to stream the topology of one or more relationship types in partitions, the client needs to encode that information in the ticket as follows:
{ name: "GET_COMMAND", version: "v1", body: { graph_name: "my_graph", database_name: "database_name", procedure_name: "gds.graph.relationships.stream", concurrency: 2, configuration: { relationship_types: "REL" } } }
This will create at most 2 partitions of the data streams. The server will answer with 2 tickets:
[ { graph_name: "my_graph", database_name: "database_name", procedure_name: "gds.graph.relationships.stream", concurrency: 4, partition_offset: 0, partition_size: 100, configuration: { relationship_types: "REL" } }, { graph_name: "my_graph", database_name: "database_name", procedure_name: "gds.graph.relationships.stream", partition_offset: 100, partition_size: 100, concurrency: 4, configuration: { relationship_types: "REL" } } ]
Each of the tickets can now be used to request a partition of the data via the GET endpoint of the GDS Flight Server.
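The following sketch shows how the returned tickets might be consumed in parallel, with one worker per partition. The fetch_partition function is a hypothetical stand-in: a real client would send the ticket to the GET endpoint and read the resulting stream. Here it only reports the range described by the ticket, so that the example runs without a server. Reading partition_offset and partition_size as a contiguous range is an assumption made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Partition tickets as returned by the server in the example above (abridged).
partition_tickets = [
    {"procedure_name": "gds.graph.relationships.stream", "partition_offset": 0, "partition_size": 100},
    {"procedure_name": "gds.graph.relationships.stream", "partition_offset": 100, "partition_size": 100},
]


def fetch_partition(ticket):
    """Hypothetical stand-in for a GET call; returns the range the ticket covers."""
    start = ticket["partition_offset"]
    return range(start, start + ticket["partition_size"])


# One worker per ticket; pool.map returns results in the order of the tickets.
with ThreadPoolExecutor(max_workers=len(partition_tickets)) as pool:
    partitions = list(pool.map(fetch_partition, partition_tickets))

total_rows = sum(len(partition) for partition in partitions)
print(total_rows)  # 200
```

In a distributed system, each ticket would instead be handed to a different member, which then issues its own GET request.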