Split Relationships

This feature is in the alpha tier. For more information on feature tiers, see API Tiers.

Introduction

The Split Relationships algorithm is a utility algorithm used to pre-process a graph for model training. It splits the relationships into a holdout set and a remaining set. The holdout set is divided into two classes: positive, i.e., existing relationships, and negative, i.e., non-existing relationships. The class is indicated by a label property on the relationships. This enables the holdout set to be used for training or testing a machine learning model. Both the holdout and the remaining relationships are added to the projected graph.

If the configuration option relationshipWeightProperty is specified, then the corresponding relationship property is preserved on the remaining set of relationships. Note however that the holdout set only has the label property; it is not possible to induce relationship weights on the holdout set as it also contains negative samples.
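As a minimal sketch of this behaviour, assume a hypothetical in-memory graph named 'myGraph' that was projected as undirected (as in the example later on this page) with a relationship property called weight. Both names, and the two output relationship types, are placeholders chosen for illustration. The call asks for weight to be carried over to the remaining set; per the description above, the holdout relationships would carry only label.

```cypher
// Sketch only: 'myGraph', 'weight', 'MY_HOLDOUT' and 'MY_REMAINING' are assumed names.
CALL gds.alpha.ml.splitRelationships.mutate('myGraph', {
    holdoutRelationshipType: 'MY_HOLDOUT',
    remainingRelationshipType: 'MY_REMAINING',
    holdoutFraction: 0.2,
    negativeSamplingRatio: 1.0,
    // Preserved on MY_REMAINING only; MY_HOLDOUT relationships get just 'label'.
    relationshipWeightProperty: 'weight'
}) YIELD relationshipsWritten
```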

Syntax

This section covers the syntax used to execute the Split Relationships algorithm in each of its execution modes. We describe the named graph variant of the syntax. To learn more about general syntax variants, see Syntax overview.

Split Relationships syntax per mode
Run Split Relationships in mutate mode on a named graph.
CALL gds.alpha.ml.splitRelationships.mutate(
  graphName: String,
  configuration: Map
)
YIELD
  preProcessingMillis: Integer,
  computeMillis: Integer,
  mutateMillis: Integer,
  relationshipsWritten: Integer,
  configuration: Map
Table 1. Parameters
| Name | Type | Default | Optional | Description |
|---|---|---|---|---|
| graphName | String | n/a | no | The name of a graph stored in the catalog. |
| configuration | Map | {} | yes | Configuration for algorithm-specifics and/or graph filtering. |

Table 2. Configuration
| Name | Type | Default | Optional | Description |
|---|---|---|---|---|
| sourceNodeLabels | List of String | ['*'] | yes | Filter the relationships where the sourceNode has at least one of the sourceNodeLabels. |
| targetNodeLabels | List of String | ['*'] | yes | Filter the relationships where the targetNode has at least one of the targetNodeLabels. |
| relationshipTypes | List of String | ['*'] | yes | Filter the named graph using the given relationship types. |
| concurrency | Integer | 4 | yes | The number of concurrent threads used for running the algorithm. |
| jobId | String | Generated internally | yes | An ID that can be provided to more easily track the algorithm’s progress. |
| holdoutFraction | Float | n/a | no | The fraction of valid relationships used as the holdout set. The remaining 1 - holdoutFraction of the valid relationships are added to the remaining set. |
| negativeSamplingRatio | Float | n/a | no | The desired ratio of negative to positive samples in the holdout set. |
| holdoutRelationshipType | String | n/a | no | Relationship type used for the holdout set. Each relationship has a property label indicating whether it is a positive or negative sample. |
| remainingRelationshipType | String | n/a | no | Relationship type used for the remaining set. Relationships where one node has none of the source or target labels are omitted. All invalid relationships are added to the remaining set. |
| nonNegativeRelationshipTypes | List of String | n/a | yes | Additional relationship types that are not used for negative sampling. |
| relationshipWeightProperty | String | null | yes | Name of the relationship property that is inherited by the remainingRelationshipType. |
| randomSeed | Integer | n/a | yes | An optional seed value for the random selection of relationships. |

Table 3. Results
| Name | Type | Description |
|---|---|---|
| preProcessingMillis | Integer | Milliseconds for preprocessing the data. |
| computeMillis | Integer | Milliseconds for running the algorithm. |
| mutateMillis | Integer | Milliseconds for adding properties to the projected graph. |
| relationshipsWritten | Integer | The number of relationships created by the algorithm. |
| configuration | Map | The configuration used for running the algorithm. |

Examples

All the examples below should be run in an empty database.

The examples use Cypher projections as the norm. Native projections will be deprecated in a future release.

In this section we show examples of running the Split Relationships algorithm on a concrete graph. The intention is to illustrate what the results look like and to provide a guide on how to use the algorithm in a real setting. We do this on a small graph of a handful of nodes connected in a particular pattern. The example graph looks like this:

Visualization of the example graph

Consider the graph created by the following Cypher statement:

CREATE
    (n0:Label),
    (n1:Label),
    (n2:Label),
    (n3:Label),
    (n4:Label),
    (n5:Label),

    (n0)-[:TYPE { prop: 0} ]->(n1),
    (n1)-[:TYPE { prop: 1} ]->(n2),
    (n2)-[:TYPE { prop: 4} ]->(n3),
    (n3)-[:TYPE { prop: 9} ]->(n4),
    (n4)-[:TYPE { prop: 16} ]->(n5)

Given the above graph, we want to use 20% of the relationships as the holdout set. The holdout set will be split into two equally sized classes: positive and negative. Positive relationships are randomly selected from the existing relationships and marked with a property label: 1. Negative relationships are randomly generated, i.e., they do not exist in the input graph, and are marked with a property label: 0.

MATCH (source:Label)-[r:TYPE]->(target:Label)
RETURN gds.graph.project(
  'graph',
  source,
  target,
  {
    sourceNodeLabels: ['Label'],
    targetNodeLabels: ['Label'],
    relationshipType: 'TYPE'
  },
  { undirectedRelationshipTypes: ['TYPE'] }
)

Now we can run the algorithm by specifying the appropriate ratio and the output relationship types. We use a random seed value in order to produce deterministic results.

CALL gds.alpha.ml.splitRelationships.mutate('graph', {
    holdoutRelationshipType: 'TYPE_HOLDOUT',
    remainingRelationshipType: 'TYPE_REMAINING',
    holdoutFraction: 0.2,
    negativeSamplingRatio: 1.0,
    randomSeed: 1337
}) YIELD relationshipsWritten
Table 4. Results
| relationshipsWritten |
|---|
| 10 |

The input graph consists of 5 relationships. We use 20% (1 relationship) of them to create the TYPE_HOLDOUT relationship type (the holdout set). This creates 1 relationship with a positive label. Because negativeSamplingRatio is 1.0, one relationship with a negative label is also created. Finally, the TYPE_REMAINING relationship type is formed from the remaining 80% (4 relationships). These are written with orientation UNDIRECTED, which counts as writing 8 relationships, giving 1 + 1 + 8 = 10 relationships in total.
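The count of 10 can be re-derived with plain Cypher arithmetic; no GDS procedure is involved. The variable names are illustrative, and the use of round() is an assumption made only so that the sketch yields whole numbers. This page does not state how the library rounds fractional counts, so treat this as a check of the example rather than a description of the implementation.

```cypher
// Re-derive relationshipsWritten for the example (illustrative names only).
WITH 5 AS inputRelationships, 0.2 AS holdoutFraction, 1.0 AS negativeSamplingRatio
// Positive holdout samples: 20% of 5 = 1 (round() is an assumption of this sketch).
WITH inputRelationships, negativeSamplingRatio,
     toInteger(round(inputRelationships * holdoutFraction)) AS positiveHoldout
// Negative samples follow the ratio; the rest form the remaining set.
WITH positiveHoldout,
     toInteger(round(positiveHoldout * negativeSamplingRatio)) AS negativeHoldout,
     inputRelationships - positiveHoldout AS remaining
// Each UNDIRECTED remaining relationship is counted in both directions.
RETURN positiveHoldout,
       negativeHoldout,
       remaining * 2 AS remainingWritten,
       positiveHoldout + negativeHoldout + remaining * 2 AS relationshipsWritten
```

For the values used in the example this returns 1, 1, 8 and 10 respectively.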

The mutated graph will look like the following graph when filtered by the TYPE_HOLDOUT and TYPE_REMAINING relationship types:
CREATE
    (n0:Label),
    (n1:Label),
    (n2:Label),
    (n3:Label),
    (n4:Label),
    (n5:Label),

    (n2)-[:TYPE_HOLDOUT { label: 0 } ]->(n5), // negative, non-existing
    (n3)-[:TYPE_HOLDOUT { label: 1 } ]->(n2), // positive, existing

    (n0)<-[:TYPE_REMAINING { prop: 0} ]-(n1),
    (n1)<-[:TYPE_REMAINING { prop: 1} ]-(n2),
    (n3)<-[:TYPE_REMAINING { prop: 9} ]-(n4),
    (n4)<-[:TYPE_REMAINING { prop: 16} ]-(n5),
    (n0)-[:TYPE_REMAINING { prop: 0} ]->(n1),
    (n1)-[:TYPE_REMAINING { prop: 1} ]->(n2),
    (n3)-[:TYPE_REMAINING { prop: 9} ]->(n4),
    (n4)-[:TYPE_REMAINING { prop: 16} ]->(n5)
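
To check the labels on the holdout set from Cypher, one option is to stream the label property from the in-memory graph. The sketch below assumes a GDS version that provides gds.graph.relationshipProperty.stream; older releases may expose the same operation under a different name. The node ids returned are internal ids and depend on the database.

```cypher
// Stream the 'label' property of the holdout relationships in 'graph'.
CALL gds.graph.relationshipProperty.stream('graph', 'label', ['TYPE_HOLDOUT'])
YIELD sourceNodeId, targetNodeId, relationshipType, propertyValue
RETURN sourceNodeId, targetNodeId, relationshipType, propertyValue AS label
ORDER BY label
```

Each returned row corresponds to one TYPE_HOLDOUT relationship, with label distinguishing negative (0) from positive (1) samples as described above.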