Applying a trained model for prediction
This feature is in the beta tier. For more information on feature tiers, see API Tiers.
In the previous sections we have seen how to build up a Link Prediction training pipeline and train it to produce a predictive model.
After training, the runnable model is of type LinkPrediction
and resides in the model catalog.
The trained model can then be applied to a graph in the graph catalog to create a new relationship type containing the predicted links. The relationships also have a property which stores the predicted probability of the link, which can be seen as a relative measure of the model’s prediction confidence.
Since the model has been trained on features which are created using the feature pipeline, the same feature pipeline is stored within the model and executed at prediction time. As during training, intermediate node properties created by the node property steps in the feature pipeline are transient and not visible after execution.
When using the model for prediction, relationships in the input graph are separated according to the configuration. By default, the configuration will be the same as the configuration used for training the pipeline. Relationships marked as context relationships during training are again used for computing features in node property steps. The target relationship type is used to prevent predicting already existing relationships. This configuration may be overridden to specify a different context, or different set of relationships to exclude from prediction.
It is necessary that the predict graph contains the properties that the pipeline requires and that the used array properties have the same dimensions as in the train graph. If the predict and train graphs are distinct, it is also beneficial that they have similar origins and semantics, so that the model is able to generalize well.
Search strategies
To find the best possible new links, GDS offers two different search strategies.
Exhaustive Search
The exhaustive search will simply run through all possible new links, that is, check all node pairs that are not already connected by a relationship. For each such node pair the trained model is used to predict whether they should be connected by a link or not. The exhaustive search will find all the best links, but has a potentially long runtime.
Approximate Search
To avoid possibly having to run for a very long time considering all possible new links (due to the inherent quadratic complexity over node count), GDS offers an approximate search strategy.
The approximate search strategy lets us leverage the K-Nearest Neighbors algorithm with our model’s prediction function as its similarity measure to trade off lower runtime for accuracy. Accuracy in this context refers to how close the result is to the very best new possible links according to our models predictions, i.e. the best predictions that would be made by exhaustive search.
The initial set of considered links for each node is picked at random and then refined in multiple iterations based of previously predicted links. See the K-Nearest Neighbors documentation for more details on how the search works.
Syntax
CALL gds.beta.pipeline.linkPrediction.predict.mutate(
graphName: String,
configuration: Map
)
YIELD
preProcessingMillis: Integer,
computeMillis: Integer,
postProcessingMillis: Integer,
mutateMillis: Integer,
relationshipsWritten: Integer,
probabilityDistribution: Integer,
samplingStats: Map,
configuration: Map
Name | Type | Default | Optional | Description |
---|---|---|---|---|
graphName |
String |
|
no |
The name of a graph stored in the catalog. |
configuration |
Map |
|
yes |
Configuration for algorithm-specifics and/or graph filtering. |
Name | Type | Default | Optional | Description |
---|---|---|---|---|
modelName |
String |
|
no |
The name of a Link Prediction model in the model catalog. |
sourceNodeLabel |
String |
|
yes |
The name of the node label predicted links should start from. |
targetNodeLabel |
String |
|
yes |
The name of the node label predicted links should end at. |
relationshipTypes |
List of String |
|
yes |
The names of the existing relationships. As a default we use the |
Integer |
|
yes |
The number of concurrent threads used for running the algorithm. |
|
mutateRelationshipType |
String |
|
no |
The relationship type used for the new relationships written to the projected graph. |
mutateProperty |
String |
|
yes |
The relationship property in the GDS graph to which the result is written. |
sampleRate |
Float |
|
no |
Sample rate to determine how many links are considered for each node. If set to 1, all possible links are considered, i.e., exhaustive-search. Otherwise, an approximate search strategy will be used. Value must be between 0 (exclusive) and 1 (inclusive). |
topN [1] |
Integer |
|
no |
Limit on predicted relationships to output. |
threshold [1] |
Float |
|
yes |
Minimum predicted probability on relationships to output. |
topK [2] |
Integer |
|
yes |
Limit on number of predicted relationships to output for each node. This value cannot be lower than 1. |
deltaThreshold [2] |
Float |
|
yes |
Value as a percentage to determine when to stop early. If fewer updates than the configured value happen, the algorithm stops. Value must be between 0 (exclusive) and 1 (inclusive). |
Integer |
|
yes |
Hard limit to stop the algorithm after that many iterations. |
|
randomJoins [2] |
Integer |
|
yes |
Between every iteration, how many attempts are being made to connect new node neighbors based on random selection. |
String |
|
yes |
The method used to sample the first |
|
randomSeed [2] |
Integer |
|
yes |
The seed value to control the randomness of the algorithm. Note that |
1. Only applicable in the exhaustive-search. 2. Only applicable in the approximate search strategy. For more details look at the syntax section of kNN |
Name | Type | Description |
---|---|---|
preProcessingMillis |
Integer |
Milliseconds for preprocessing the graph. |
computeMillis |
Integer |
Milliseconds for running the algorithm. |
postProcessingMillis |
Integer |
Milliseconds for computing the global metrics. |
mutateMillis |
Integer |
Milliseconds for adding properties to the projected graph. |
relationshipsWritten |
Integer |
Number of relationships created. |
probabilityDistribution |
Map |
Description of distribution of predicted probabilities. |
samplingStats |
Map |
Description of how predictions were sampled. |
configuration |
Map |
Configuration used for running the algorithm. |
CALL gds.beta.pipeline.linkPrediction.predict.stream(
graphName: String,
configuration: Map
)
YIELD
node1: Integer,
node2: Integer,
probability: Float
Name | Type | Default | Optional | Description |
---|---|---|---|---|
graphName |
String |
|
no |
The name of a graph stored in the catalog. |
configuration |
Map |
|
yes |
Configuration for algorithm-specifics and/or graph filtering. |
Name | Type | Default | Optional | Description |
---|---|---|---|---|
modelName |
String |
|
no |
The name of a LinkPrediction model in the model catalog. |
List of String |
|
yes |
Filter the named graph using the given node labels. Nodes with any of the given labels will be included. |
|
List of String |
|
yes |
Filter the named graph using the given relationship types. Relationships with any of the given types will be included. |
|
Integer |
|
yes |
The number of concurrent threads used for running the algorithm. |
|
String |
|
yes |
An ID that can be provided to more easily track the algorithm’s progress. |
|
Boolean |
|
yes |
If disabled the progress percentage will not be logged. |
|
sampleRate |
Float |
|
no |
Sample rate to determine how many links are considered for each node. If set to 1, all possible links are considered, i.e., exhaustive-search. Otherwise, an approximate search strategy will be used. Value must be between 0 (exclusive) and 1 (inclusive). |
topN [3] |
Integer |
|
no |
Limit on predicted relationships to output. |
threshold [3] |
Float |
|
yes |
Minimum predicted probability on relationships to output. |
topK [4] |
Integer |
|
yes |
Limit on number of predicted relationships to output for each node. This value cannot be lower than 1. |
deltaThreshold [4] |
Float |
|
yes |
Value as a percentage to determine when to stop early. If fewer updates than the configured value happen, the algorithm stops. Value must be between 0 (exclusive) and 1 (inclusive). |
Integer |
|
yes |
Hard limit to stop the algorithm after that many iterations. |
|
randomJoins [4] |
Integer |
|
yes |
Between every iteration, how many attempts are being made to connect new node neighbors based on random selection. |
String |
|
yes |
The method used to sample the first |
|
randomSeed [4] |
Integer |
|
yes |
The seed value to control the randomness of the algorithm. Note that |
3. Only applicable in the exhaustive-search. 4. Only applicable in the approximate search strategy. For more details look at the syntax section of kNN |
Name | Type | Description |
---|---|---|
node1 |
Integer |
Node ID of the first node. |
node2 |
Integer |
Node ID of the second node. |
probability |
Float |
Predicted probability of a link between the nodes. |
Example
In this example we will show how to use a trained model to predict new relationships in your projected graph.
In order to do this, we must first have an already trained model registered in the Model Catalog.
We will use the model which we trained in the train example which we gave the name lp-pipeline-model
.
The algorithm excludes predictions for existing relationships in the graph as well as self-loops.
There are two different strategies for choosing which node pairs to consider when predicting new links, exhaustive search and approximate search. Whereas the former considers all possible new links, the latter will use a randomized strategy that only considers a subset of them in order to run faster. We will explain each individually with examples in the mutate examples below.
The relationships that are produced by the write and mutate procedures are undirected, just like the input.
However, no parallel relationships are produced.
So for example if when doing approximate search, |
Memory Estimation
First off, we will estimate the cost of running the algorithm using the estimate
procedure.
This can be done with any execution mode.
We will use the stream
mode in this example.
Estimating the algorithm is useful to understand the memory impact that running the algorithm on your graph will have.
When you later actually run the algorithm in one of the execution modes the system will perform an estimation.
If the estimation shows that there is a very high probability of the execution going over its memory limitations, the execution is prohibited.
To read more about this, see Automatic estimation and execution blocking.
For more details on estimate
in general, see Memory Estimation.
CALL gds.beta.pipeline.linkPrediction.predict.stream.estimate('myGraph', {
modelName: 'lp-pipeline-model',
topN: 5,
threshold: 0.5
})
YIELD requiredMemory
requiredMemory |
---|
"24 KiB" |
Stream
In the stream
execution mode, the algorithm returns the probability of a link for each node pair.
This allows us to inspect the results directly or post-process them in Cypher without any side effects.
For more details on the stream
mode in general, see Stream.
CALL gds.beta.pipeline.linkPrediction.predict.stream('myGraph', {
modelName: 'lp-pipeline-model',
topN: 5,
threshold: 0.5
})
YIELD node1, node2, probability
RETURN gds.util.asNode(node1).name AS person1, gds.util.asNode(node2).name AS person2, probability
ORDER BY probability DESC, person1
We specify threshold
to include only predictions with probability greater than 50%, and topN
to further limit output to the top 5 relationships.
As the default samplingRate is 1
, we use the exhaustive-search.
person1 | person2 | probability |
---|---|---|
"Alice" |
"Chris" |
0.6207332784 |
"Mark" |
"Chris" |
0.5986795167 |
"Mark" |
"Alice" |
0.5662968371 |
"Alice" |
"Karin" |
0.5607462866 |
"Alice" |
"Greg" |
0.5579582123 |
We can see that the model thinks that the above pairs should be connected. Other node pairs have a lower probability and are filtered out of the result.
Mutate
In this example we will show how to write the predictions to your projected graph.
We will use the model lp-pipeline-model
, that we trained in the train example.
CALL gds.beta.pipeline.linkPrediction.predict.mutate('myGraph', {
modelName: 'lp-pipeline-model',
relationshipTypes: ['KNOWS'],
mutateRelationshipType: 'KNOWS_EXHAUSTIVE_PREDICTED',
topN: 5,
threshold: 0.5
}) YIELD relationshipsWritten, samplingStats
We specify threshold
to include only predictions with probability greater than 50%, and topN
to further limit output to the top 5 relationships.
As the default samplingRate is 1
, we use the exhaustive-search.
Because we are using the UNDIRECTED
orientation, we will write twice as many relationships to the in-memory graph.
relationshipsWritten | samplingStats |
---|---|
10 |
{linksConsidered=16, strategy="exhaustive"} |
As we can see in the samplingStats
, we used the exhaustive search strategy and checked 16 possible links during the prediction.
Indeed, since there are a total of 8 * (8 - 1) / 2 = 28
possible links in the graph and we already have 12, that means we check all possible new links.
Although 16 links were considered, we only mutate the best five (since topN = 5
) that are above our threshold, and in fact only one link did pass the threshold (see Stream).
If our graph is very large there may be a lot of possible new links. As such it may take a very long time to run the predictions. It may therefore be a more viable option to use a search strategy that only looks at a subset of all possible new links.
Approximate search
To avoid possibly having to run for a very long time considering all possible new links, we can use the approximate search strategy.
CALL gds.beta.pipeline.linkPrediction.predict.mutate('myGraph', {
modelName: 'lp-pipeline-model',
relationshipTypes: ['KNOWS'],
mutateRelationshipType: 'KNOWS_APPROX_PREDICTED',
sampleRate: 0.5,
topK: 1,
randomJoins: 2,
maxIterations: 3,
// necessary for deterministic results
concurrency: 1,
randomSeed: 42
})
YIELD relationshipsWritten, samplingStats
In order to use the approximate strategy we make sure to set the sampleRate
explicitly to a value < 1.0
.
For this small example, we limit the search by setting the maxIterations
to 3
and randomJoins
to 2
.
Also, we set topK = 1
to get one predicted link for each node.
Because we are using the UNDIRECTED
orientation, we will write twice as many relationships to the in-memory graph.
relationshipsWritten | samplingStats |
---|---|
16 |
{didConverge=true, linksConsidered=56, ranIterations=3, strategy="approximate"} |
As we can see in the samplingStats
, we use the approximate search strategy and check 56 possible links during the prediction.
Though in this small example we actually consider more links that in the exhaustive case, this will typically not be the case for larger graphs.
Since the relationships we write are undirected, reported relationshipsWritten
is 16 when we search for the best (topK = 1
) prediction for each node.
Predict with context filtering
In Training with context filters, we trained another model lp-pipeline-model-filtered
on fullGraph
which uses context City
nodes and context LIVES
and BORN
relationships.
We can leverage this model in prediction, optionally overriding node label or relationship type filter configuration in prediction.
In this case we do not, and instead inherit the filtering configuration from the train configuration of the lp-pipeline-model-filtering
model.
In other words, we predict Person-KNOWS-Person relationships, additionally using City
nodes and LIVES
and BORN
relationships for the node property steps.
CALL gds.beta.pipeline.linkPrediction.predict.stream('fullGraph', {
modelName: 'lp-pipeline-model-filtered',
topN: 5,
threshold: 0.5
})
YIELD node1, node2, probability
RETURN gds.util.asNode(node1).name AS person1, gds.util.asNode(node2).name AS person2, probability
ORDER BY probability DESC, person1
We specify threshold
to include only predictions with probability greater than 50%, and topN
to further limit output to the top 5 relationships.
As the default samplingRate is 1
, we use the exhaustive-search.
person1 | person2 | probability |
---|---|---|
"Alice" |
"Chris" |
0.7174928596 |
"Mark" |
"Chris" |
0.6573294858 |
"Mark" |
"Alice" |
0.5617915136 |
"Alice" |
"Karin" |
0.54489345 |
"Alice" |
"Greg" |
0.5364068805 |
We can see that our model predicts the same top 5 links as it did with the unfiltered model lp-pipeline-model
.
However, the probabilities vary slightly, due to the additional context information used in training and prediction.