Product recommendations with kNN based on FastRP embeddings

Open In Colab

This Jupyter notebook is hosted here in the Neo4j Graph Data Science Client Github repository.

The notebook exemplifies how to use the graphdatascience Python library to operate Neo4j GDS. It shows an adapted version of the FastRP and kNN end-to-end example from the GDS Manual, found here.

We consider a graph of products and customers, and we want to find new products to recommend for each customer. We want to use the K-Nearest Neighbors algorithm (kNN) to identify similar customers and base our product recommendations on that. In order to be able to leverage topological information about the graph in kNN, we will first create node embeddings using FastRP. These embeddings will be the input to the kNN algorithm.

We will then use a Cypher query to generate recommendations for each pair of similar customers, where products that have been purchased by one of the customers will be recommended to the other.

1. Prerequisites

Running this notebook requires a Neo4j server with a recent version (2.0+) of GDS installed. We recommend using Neo4j Desktop with GDS, or AuraDS.

The graphdatascience Python library needs to be installed as well. See the examples in the Setup section below and in the client installation instructions.

2. Setup

We start by installing and importing our dependencies, and setting up our GDS client connection to the database.

# Install necessary dependencies
%pip install graphdatascience
import os

from graphdatascience import GraphDataScience

# Get Neo4j DB URI and credentials from environment if applicable
NEO4J_URI = os.environ.get("NEO4J_URI", "bolt://localhost:7687")
NEO4J_AUTH = None
if os.environ.get("NEO4J_USER") and os.environ.get("NEO4J_PASSWORD"):
    NEO4J_AUTH = (
        os.environ.get("NEO4J_USER"),
        os.environ.get("NEO4J_PASSWORD"),
    )

gds = GraphDataScience(NEO4J_URI, auth=NEO4J_AUTH)
from graphdatascience import ServerVersion

assert gds.server_version() >= ServerVersion(1, 8, 0)

3. Example graph creation

We now create a graph of products and customers in the database. The amount relationship property represents the average weekly amount of money spent by a customer on a given product.

# The `run_cypher` method can be used to run arbitrary Cypher queries on the database.
_ = gds.run_cypher(
    """
        CREATE
         (dan:Person {name: 'Dan'}),
         (annie:Person {name: 'Annie'}),
         (matt:Person {name: 'Matt'}),
         (jeff:Person {name: 'Jeff'}),
         (brie:Person {name: 'Brie'}),
         (elsa:Person {name: 'Elsa'}),

         (cookies:Product {name: 'Cookies'}),
         (tomatoes:Product {name: 'Tomatoes'}),
         (cucumber:Product {name: 'Cucumber'}),
         (celery:Product {name: 'Celery'}),
         (kale:Product {name: 'Kale'}),
         (milk:Product {name: 'Milk'}),
         (chocolate:Product {name: 'Chocolate'}),

         (dan)-[:BUYS {amount: 1.2}]->(cookies),
         (dan)-[:BUYS {amount: 3.2}]->(milk),
         (dan)-[:BUYS {amount: 2.2}]->(chocolate),

         (annie)-[:BUYS {amount: 1.2}]->(cucumber),
         (annie)-[:BUYS {amount: 3.2}]->(milk),
         (annie)-[:BUYS {amount: 3.2}]->(tomatoes),

         (matt)-[:BUYS {amount: 3}]->(tomatoes),
         (matt)-[:BUYS {amount: 2}]->(kale),
         (matt)-[:BUYS {amount: 1}]->(cucumber),

         (jeff)-[:BUYS {amount: 3}]->(cookies),
         (jeff)-[:BUYS {amount: 2}]->(milk),

         (brie)-[:BUYS {amount: 1}]->(tomatoes),
         (brie)-[:BUYS {amount: 2}]->(milk),
         (brie)-[:BUYS {amount: 2}]->(kale),
         (brie)-[:BUYS {amount: 3}]->(cucumber),
         (brie)-[:BUYS {amount: 0.3}]->(celery),

         (elsa)-[:BUYS {amount: 3}]->(chocolate),
         (elsa)-[:BUYS {amount: 3}]->(milk)
    """
)

4. Projecting into GDS

In order to be able to analyze the data in our database, we proceed to projecting it into memory where GDS can operate on it.

# We define how we want to project our database into GDS
node_projection = ["Person", "Product"]
relationship_projection = {"BUYS": {"orientation": "UNDIRECTED", "properties": "amount"}}

# Before actually going through with the projection, let's check how much memory is required
result = gds.graph.project.estimate(node_projection, relationship_projection)

print(f"Required memory for native loading: {result['requiredMemory']}")
# For this small graph memory requirement is low. Let us go through with the projection
G, result = gds.graph.project("purchases", node_projection, relationship_projection)

print(f"The projection took {result['projectMillis']} ms")

# We can use convenience methods on `G` to check if the projection looks correct
print(f"Graph '{G.name()}' node count: {G.node_count()}")
print(f"Graph '{G.name()}' node labels: {G.node_labels()}")

5. Creating FastRP node embeddings

Next we use the FastRP algorithm to generate node embeddings that capture topological information from the graph. We choose to work with embeddingDimension set to 4 which is sufficient since our example graph is very small. The iterationWeights are chosen empirically to yield sensible results. Please see the syntax section of the FastRP documentation for more information on these parameters.

Since we want to use the embeddings as input when we run kNN later we use FastRP’s mutate mode.

# We can also estimate memory of running algorithms like FastRP, so let's do that first
result = gds.fastRP.mutate.estimate(
    G,
    mutateProperty="embedding",
    randomSeed=42,
    embeddingDimension=4,
    relationshipWeightProperty="amount",
    iterationWeights=[0.8, 1, 1, 1],
)

print(f"Required memory for running FastRP: {result['requiredMemory']}")
# Now let's run FastRP and mutate our projected graph 'purchases' with the results
result = gds.fastRP.mutate(
    G,
    mutateProperty="embedding",
    randomSeed=42,
    embeddingDimension=4,
    relationshipWeightProperty="amount",
    iterationWeights=[0.8, 1, 1, 1],
)

# Let's make sure we got an embedding for each node
print(f"Number of embedding vectors produced: {result['nodePropertiesWritten']}")

6. Similarities with kNN

Now we can run kNN to identify similar nodes by using the node embeddings that we generated with FastRP as nodeProperties. Since we are working with a small graph, we can set sampleRate to 1 and deltaThreshold to 0 without having to worry about long computation times. The concurrency parameter is set to 1 (along with the fixed randomSeed) in order to get a deterministic result. Please see the syntax section of the kNN documentation for more information on these parameters.

Note that we will use the algorithm’s write mode to write the properties and relationships back to our database, so that we can analyze them later using Cypher.

# Run kNN and write back to db (we skip memory estimation this time...)
result = gds.knn.write(
    G,
    topK=2,
    nodeProperties=["embedding"],
    randomSeed=42,
    concurrency=1,
    sampleRate=1.0,
    deltaThreshold=0.0,
    writeRelationshipType="SIMILAR",
    writeProperty="score",
)

print(f"Relationships produced: {result['relationshipsWritten']}")
print(f"Nodes compared: {result['nodesCompared']}")
print(f"Mean similarity: {result['similarityDistribution']['mean']}")

As we can see the mean similarity between nodes is quite high. This is due to the fact that we have a small example where there are no long paths between nodes leading to many similar FastRP node embeddings.

7. Exploring the results

Let us now inspect the results of our kNN call by using Cypher. We can use the SIMILARITY relationship type to filter out the relationships we are interested in. And since we just care about similarities between people for our product recommendation engine, we make sure to only match nodes with the Person label.

Please see the Cypher manual for documentation on how to use Cypher.

gds.run_cypher(
    """
        MATCH (p1:Person)-[r:SIMILAR]->(p2:Person)
        RETURN p1.name AS person1, p2.name AS person2, r.score AS similarity
        ORDER BY similarity DESCENDING, person1, person2
    """
)

Our kNN results indicate among other things that the Person nodes named “Annie” and “Matt” are very similar. Looking at the BUYS relationships for these two nodes we can see that such a conclusion makes sense. They both buy three products, two of which are the same (Product nodes named “Cucumber” and “Tomatoes”) for both people and with similar amounts. We can therefore have high confidence in our approach.

8. Making recommendations

Using the information we derived that the Person nodes named “Annie” and “Matt” are similar, we can make product recommendations for each of them. Since they are similar, we can assume that products purchased by only one of the people may be of interest to buy also for the other person not already buying the product. By this principle we can derive product recommendations for the Person named “Matt” using a simple Cypher query.

gds.run_cypher(
    """
        MATCH (:Person {name: "Annie"})-[:BUYS]->(p1:Product)
        WITH collect(p1) as products
        MATCH (:Person {name: "Matt"})-[:BUYS]->(p2:Product)
        WHERE not p2 in products
        RETURN p2.name as recommendation
    """
)

Indeed, “Kale” is the one product that the Person named “Annie” buys that is also not purchased by the Person named “Matt”.

9. Cleaning up

Before finishing we can clean up the example data from both the GDS in-memory state and the database.

# Remove our projection from the GDS graph catalog
G.drop()

# Remove all the example data from the database
_ = gds.run_cypher("MATCH (n) DETACH DELETE n")

10. Conclusion

Using two GDS algorithms and some basic Cypher we were easily able to derive some sensible product recommendations for a customer in our small example.

To make sure to get similarities to other customers for every customer in our graph with kNN, we could play around with increasing the topK parameter.