Backing up Neo4j Containers
This approach supports Google Cloud, AWS, and Azure storage. It assumes you have credentials for one of those providers and want to store your backups there. If that is not the case, you will need to adapt the backup script for your desired storage method, but the overall approach works for any backup location.
This approach works only for Neo4j 4.0+. The backup tool and the DBMS itself changed substantially between 3.5 and 4.0, and the approach here will likely not work for older databases without significant modification.
If you are upgrading to helm chart 4.1.3-1 or later from an earlier version, double check this documentation; the syntax for using the backup chart has changed slightly to accommodate multiple clouds. This documentation applies only to 4.1.3-1 and forward.
Background & Important Information
Required Neo4j Config
This is provided for you out of the box by the helm chart, but if you customize your configuration you should bear these requirements in mind:
- dbms.backup.enabled=true
- dbms.backup.listen_address=0.0.0.0:6362
The default for Neo4j is to listen only on 127.0.0.1, which will not work here, because other containers would not be able to reach the backup port.
Backup Storage
Backups are stored in a temporary local volume before they are uploaded to cloud storage. By default an ephemeral Kubernetes emptyDir volume is used. If the database being backed up is large, there might not be sufficient space in local storage. To use alternative storage, set tempVolume in values.yaml to a different Kubernetes Volume object.
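As an illustrative sketch (the claim name here is an assumption, not something the chart provides), tempVolume can point at a PersistentVolumeClaim instead of the default emptyDir:

```yaml
# values.yaml sketch: use a pre-provisioned PVC (name is hypothetical)
# as backup scratch space instead of the default ephemeral emptyDir.
tempVolume:
  persistentVolumeClaim:
    claimName: backup-scratch-pvc
```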
Backup Pointers
All backups are written as .tar.gz files with a date string of when they were taken, such as neo4j-2020-06-16-12:32:57.tar.gz. They are named after the database they are a backup of. When you take a backup, you get both the dated version and a "latest" copy; e.g. the file above is also copied to neo4j/neo4j-latest.tar.gz in the same bucket.
Reminder: each time you take a backup, the latest file will be overwritten.
The purpose of doing this is to have a stable name in storage where the latest backup can always be found, without losing any of the previous backups.
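The naming scheme described above can be sketched in shell (the timestamp here is the example from the text; real backups use the time the backup was taken):

```shell
# Illustrative sketch of the backup naming convention, not the chart's actual script.
DB="neo4j"
STAMP="2020-06-16-12:32:57"              # in practice: the time the backup was taken
DATED_FILE="${DB}-${STAMP}.tar.gz"       # dated copy, never overwritten
LATEST_FILE="${DB}/${DB}-latest.tar.gz"  # stable name, overwritten on each backup
echo "$DATED_FILE"
echo "$LATEST_FILE"
```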
Steps to Take a Backup
Use a Service Account to access cloud storage (Google Cloud only)
GCP
Workload Identity is the recommended way to access Google Cloud services from applications running within GKE due to its improved security properties and manageability.
Follow the GCP instructions to:
- Enable Workload Identity on your GKE cluster
- Create a Google Cloud IAMServiceAccount that has read and write permissions for your backup location
- Bind the IAMServiceAccount to the Neo4j deployment's Kubernetes ServiceAccount*
[*] You can configure the name of the Kubernetes ServiceAccount that a Neo4j deployment uses by setting serviceAccountName in values.yaml. To check which Kubernetes ServiceAccount a Neo4j deployment is using, run kubectl get pods <your neo4j pod name> -o=jsonpath='{.spec.serviceAccountName}{"\n"}'
If you are unable to use Workload Identity with GKE then you can create a service key secret instead as described in the next section.
Create a service key secret to access cloud storage
First, create a Kubernetes secret that contains the content of your account service key. This key must have permissions to access the bucket where your backups will be stored.
Azure
- Create a credentials file (e.g. azure-credentials.sh) that looks like this:
export ACCOUNT_NAME=<NAME_STORAGE_ACCOUNT>
export ACCOUNT_KEY=<STORAGE_ACCOUNT_KEY>
If you are unsure what the account key is, you can recover it with the following command:
ACCOUNT_KEY=$(az storage account keys list --resource-group "$AKS_RESOURCE_GROUP" --account-name "$STORAGE_ACCOUNT" --query [0].value -o tsv)
- Create a secret from this file:
kubectl create secret generic neo4j-azure-credentials \
    --from-file=credentials=azure-credentials.sh
AWS
- Create a credentials file (e.g. aws-credentials) in the standard AWS credentials format:
[default]
region=
aws_access_key_id=
aws_secret_access_key=
- Create a secret from this file:
kubectl create secret generic neo4j-aws-credentials \
    --from-file=credentials=aws-credentials
GCP
You do NOT need to follow the steps in this section if you are using Workload Identity for GCP.
Download a JSON-formatted service key from Google Cloud; it looks like this:
{
"type": "",
"project_id": "",
"private_key_id": "",
"private_key": "",
"client_email": "",
"client_id": "",
"auth_uri": "",
"token_uri": "",
"auth_provider_x509_cert_url": "",
"client_x509_cert_url": ""
}
- Create a secret from this file:
kubectl create secret generic neo4j-gcp-credentials \
    --from-file=credentials=gcp-credentials.json
'--from-file=credentials=<your-config-path>' here is important; the key under the secret must be named credentials.
Running a Backup
The backup method is itself a mini helm chart, so to run a backup you just run one of the following minimal examples.
Note: these commands must be run from within 'https://github.com/neo4j-contrib/neo4j-helm/tree/master/tools/backup' (the chart path is the '.' argument).
AWS
helm install my-neo4j-backup . \
--set neo4jaddr=my-neo4j.default.svc.cluster.local:6362 \
--set bucket=s3://my-bucket \
--set database="neo4j\,system" \
--set cloudProvider=aws \
--set secretName=neo4j-aws-credentials \
--set jobSchedule="0 */12 * * *"
GCP with Workload Identity
helm install my-neo4j-backup . \
--set neo4jaddr=my-neo4j.default.svc.cluster.local:6362 \
--set bucket=gs://my-bucket \
--set database="neo4j\,system" \
--set cloudProvider=gcp \
--set secretName=NULL \
--set serviceAccountName=my-neo4j-backup-sa \
--set jobSchedule="0 */12 * * *"
GCP with service key secret
helm install my-neo4j-backup . \
--set neo4jaddr=my-neo4j.default.svc.cluster.local:6362 \
--set bucket=gs://my-bucket \
--set database="neo4j\,system" \
--set cloudProvider=gcp \
--set secretName=neo4j-gcp-credentials \
--set jobSchedule="0 */12 * * *"
Azure
helm install my-neo4j-backup . \
--set neo4jaddr=my-neo4j.default.svc.cluster.local:6362 \
--set bucket=my-blob-container-name \
--set database="neo4j\,system" \
--set cloudProvider=azure \
--set secretName=neo4j-azure-credentials \
--set jobSchedule="0 */12 * * *"
Special notes for Azure storage: the chart requires a "bucket", but for Azure the naming is slightly different; the bucket specified is the blob container name where the files will be placed. Relative paths are respected: if you set bucket to container/path/to/directory, you will find your backup files stored in container at the path /path/to/directory/db/db-latest.tar.gz, where "db" is the name of the database being backed up (i.e. neo4j and system).
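The Azure path layout described above can be sketched in shell (the values here are illustrative):

```shell
# Given bucket=container/path/to/directory, split the blob container name
# from the relative path, then build the destination of the "latest" file
# for one database.
BUCKET="container/path/to/directory"
DB="neo4j"
CONTAINER="${BUCKET%%/*}"   # everything before the first slash: the container
RELPATH="${BUCKET#*/}"      # everything after it: the relative path
DEST="/${RELPATH}/${DB}/${DB}-latest.tar.gz"
echo "$CONTAINER"
echo "$DEST"
```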
If all goes well, after the Kubernetes Job completes you will see the backup files appear in the designated bucket, under directories named after the databases you backed up.
If your backup does not appear, consult the Job's pod container logs to find out why.
If you want to take a hot backup ahead of the schedule, you can use this command:
kubectl create job --from=cronjob/my-neo4j-backup-job neo4j-hot-backup
Required parameters
- neo4jaddr: the address where your Neo4j instance or cluster is running, ideally the discovery address.
- bucket: where you want the backup copied to. It should be gs://bucketname or s3://bucketname (for Azure, the blob container name; see above).
- database: a comma-separated list of databases to back up. The default is neo4j,system. If your DBMS has many individual databases, you should change this.
- cloudProvider: which cloud service to keep backups on (gcp, aws, or azure).
- jobSchedule: the interval at which to take backups, in cron format, e.g. "0 */12 * * *". You can set your own schedule (see https://crontab.guru/).
At least one of secretName and serviceAccountName must be set.
- secretName: the name of the secret you created (set to NULL if using Workload Identity on GKE)
- serviceAccountName: the name of the Kubernetes ServiceAccount to use for the backup Job (required if using Workload Identity on GKE)
Optional environment variables
All of the following variables mimic the command-line options for neo4j-admin backup documented here:
- pageCache
- heapSize
- fallbackToFull (true/false), default=true
- checkConsistency (true/false), default=true
- checkIndexes (true/false), default=true
- checkGraph (true/false), default=true
- checkLabelScanStore (true/false), default=true
- checkPropertyOwners (true/false), default=false
Exit Conditions
If the backup of any of the individual databases listed in the database parameter fails, the entire container will exit with a non-zero exit code and the Job will fail.
Note: it is possible for a Neo4j backup to succeed but with failed consistency checks. This will be noted in the logs, but operationally it behaves as a successful backup.
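The exit behaviour can be sketched roughly as follows (backup_one is a hypothetical stand-in, not the chart's actual script):

```shell
# Back up each database in turn; if any one fails, the container exits non-zero
# so the Kubernetes Job is marked as failed.
backup_one() {
  # Hypothetical stand-in for "neo4j-admin backup + upload" for one database.
  echo "backing up $1"
}

EXIT_CODE=0
for DB in neo4j system; do
  backup_one "$DB" || EXIT_CODE=1
done
# The real container would end with: exit $EXIT_CODE
echo "final exit code: $EXIT_CODE"
```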