Prerequisites
This page gives you an overview of all the steps you need to follow before you can run a Dataflow job to import data into Neo4j.
Neo4j instance
You need a running Neo4j instance into which the data can flow.
If you don’t have an instance yet, you have two options:
- sign up for a free AuraDB instance
- install and self-host Neo4j in a location that is publicly accessible (see Neo4j → Installation), with port 7687 open (Bolt protocol)
The template uses constraints, some of which are only available in Neo4j/Aura Enterprise Edition installations. Although the Dataflow jobs can run against Neo4j Community Edition instances, most constraints will not be created. You must therefore ensure that the source data and job specification are prepared accordingly.
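If you are unsure which edition your target instance runs, you can check it before preparing the job. The following is a minimal sketch using the official Neo4j Python driver; the connection URI and credentials are placeholders to replace with your own.
from neo4j import GraphDatabase

# Placeholder connection details; replace with your own instance's values.
URI = "neo4j+s://xxxx.databases.neo4j.io"
AUTH = ("neo4j", "<password>")

with GraphDatabase.driver(URI, auth=AUTH) as driver:
    # dbms.components() reports the server edition (enterprise or community).
    records, _, _ = driver.execute_query(
        "CALL dbms.components() YIELD name, versions, edition "
        "RETURN name, versions, edition",
        database_="neo4j",
    )
    for record in records:
        print(record["name"], record["versions"], record["edition"])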
Google Cloud Storage bucket
You need a Google Cloud Storage bucket. This is the only location from which the Dataflow job can source files (both the configuration files and the source CSVs, if any).
Upload connection information
Regardless of how your Neo4j instance is deployed, you need to create a file containing your database connection information in JSON format.
We will refer to this file as neo4j-connection-info.json.
Dataflow will use the information contained in this file to connect to the Neo4j instance.
The basic authentication scheme relies on a traditional username and password. This scheme can also be used to authenticate against an LDAP server.
{
"server_url": "neo4j+s://xxxx.databases.neo4j.io",
"database": "neo4j",
"username": "<username>",
"pwd": "<password>"
}
If authentication is disabled on the server, credentials can be omitted.
{
"server_url": "neo4j+s://xxxx.databases.neo4j.io",
"database": "neo4j",
"auth_type": "none"
}
The Kerberos authentication scheme requires a base64-encoded ticket. It can only be used if the server has the Kerberos Add-on installed.
{
"server_url": "neo4j+s://xxxx.databases.neo4j.io",
"database": "neo4j",
"auth_type": "kerberos",
"ticket": "<base 64 encoded Kerberos ticket>"
}
The bearer authentication scheme requires a base64-encoded token provided by an Identity Provider through Neo4j’s Single Sign-On feature.
{
"server_url": "neo4j+s://xxxx.databases.neo4j.io",
"database": "neo4j",
"auth_type": "bearer",
"token": "<bearer token>"
}
To log into a server with a custom authentication scheme, provide all the relevant details:
{
"server_url": "neo4j+s://xxxx.databases.neo4j.io",
"database": "neo4j",
"auth_type": "custom",
"principal": "<principal>",
"credentials": "<credentials>",
"realm": "<realm>",
"scheme": "<scheme>",
"parameters": {"<key>": "<value>"}
}
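Before uploading the file, you may want to check that it actually allows a connection. The following is a minimal sketch using the official Neo4j Python driver; it covers only the basic authentication scheme shown above, since other auth_type values require different auth tokens.
import json

from neo4j import GraphDatabase

# Minimal sketch for the basic authentication scheme only.
with open("neo4j-connection-info.json") as f:
    conn = json.load(f)

with GraphDatabase.driver(
    conn["server_url"], auth=(conn["username"], conn["pwd"])
) as driver:
    # Raises an exception if the server is unreachable or the credentials are wrong.
    driver.verify_connectivity()
    print("Connection OK")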
The connection file can be uploaded either as a secret to Google Cloud Secret Manager or directly into your Google Cloud Storage bucket (a scripted sketch follows this list):
- Google Secret Manager — Create a new secret and upload the neo4j-connection-info.json file as its value.
- Google Cloud Storage — Upload the neo4j-connection-info.json file to your Cloud Storage bucket.
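If you prefer to script the upload, here is a minimal sketch using the google-cloud-secret-manager and google-cloud-storage Python clients; the project ID, secret ID, and bucket name are placeholders you need to replace.
from google.cloud import secretmanager, storage

PROJECT_ID = "my-gcp-project"       # placeholder
BUCKET_NAME = "my-dataflow-bucket"  # placeholder

# Option 1: store the connection file as a secret in Google Secret Manager.
sm_client = secretmanager.SecretManagerServiceClient()
secret = sm_client.create_secret(
    request={
        "parent": f"projects/{PROJECT_ID}",
        "secret_id": "neo4j-connection-info",
        "secret": {"replication": {"automatic": {}}},
    }
)
with open("neo4j-connection-info.json", "rb") as f:
    sm_client.add_secret_version(
        request={"parent": secret.name, "payload": {"data": f.read()}}
    )

# Option 2: upload the connection file to your Cloud Storage bucket.
gcs_client = storage.Client(project=PROJECT_ID)
blob = gcs_client.bucket(BUCKET_NAME).blob("neo4j-connection-info.json")
blob.upload_from_filename("neo4j-connection-info.json")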
Dataset to import
You need a dataset that you want to import into Neo4j. This should consist of a number of CSV files located in your Google Cloud Storage bucket. This guide provides you with a set of CSV files to get started with.
Source CSV files must fulfill some constraints (see the example file after this list):
- They must not contain empty rows.
- They must not contain a header row. Specify the column names in the source object definition and leave only data rows in the files; a CSV with a header row will result in an extra imported entity, with the column names as data values.
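For illustration, a hypothetical persons.csv that satisfies both constraints contains only data rows; column names such as id, name, and born are declared in the job specification, not in the file:
1,Keanu Reeves,1964
2,Carrie-Anne Moss,1967
3,Laurence Fishburne,1961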
Since you are moving data from a relational database into a graph database, the data model is likely to change. Check out the Graph data modeling guidelines to learn how to model for graph databases.
Google Dataflow job
The Google Dataflow job glues all the pieces together and performs the data import. You need to craft a job specification file to provide Dataflow with all the information it needs to load the data into Neo4j.
All Google-related resources (Cloud project, Cloud Storage buckets, Dataflow job) should either belong to the same account, or to one that the Dataflow job has permission to access.
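As a reference point, the job can also be launched programmatically rather than from the Dataflow console. The sketch below uses the Dataflow REST API through google-api-python-client; the template path and the jobSpecUri/neo4jConnectionUri parameter names are assumptions about the Google Cloud to Neo4j Flex Template, so verify them against the template documentation before relying on them.
from googleapiclient.discovery import build

PROJECT_ID = "my-gcp-project"   # placeholder
REGION = "us-central1"          # placeholder

# Uses Application Default Credentials for authentication.
dataflow = build("dataflow", "v1b3")
response = (
    dataflow.projects()
    .locations()
    .flexTemplates()
    .launch(
        projectId=PROJECT_ID,
        location=REGION,
        body={
            "launchParameter": {
                "jobName": "gcs-to-neo4j-import",
                # Assumed path of the Google-provided Flex Template.
                "containerSpecGcsPath": f"gs://dataflow-templates-{REGION}/latest/flex/Google_Cloud_to_Neo4j",
                # Assumed parameter names; check the template documentation.
                "parameters": {
                    "jobSpecUri": "gs://my-dataflow-bucket/job-spec.json",
                    "neo4jConnectionUri": "gs://my-dataflow-bucket/neo4j-connection-info.json",
                },
            }
        },
    )
    .execute()
)
print(response["job"]["id"])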