Backing up Neo4j Containers
This approach supports Google Cloud, AWS, and Azure storage. It assumes you have credentials for one of those providers and want to store your backups there. If that is not the case, you will need to adapt the backup script for your desired storage method, but the overall approach works for any backup location.
This approach works only for Neo4j 4.0+. The backup tool and the DBMS itself changed substantially between 3.5 and 4.0, and the approach here will likely not work for older databases without significant modification.
If you are upgrading to helm chart 4.1.3-1 or later from an earlier version, double check this documentation; the syntax for using the backup chart has changed slightly to accommodate multiple clouds. This documentation applies only to 4.1.3-1 and forward.
Background & Important Information
Required Neo4j Config
This is provided for you out of the box by the helm chart, but if you customize your configuration you should bear these requirements in mind:
- dbms.backup.enabled=true
- dbms.backup.listen_address=0.0.0.0:6362
The default for Neo4j is to listen only on 127.0.0.1, which will not work here, because other containers would not be able to reach the backup port.
Backup Storage
Backups are stored in a temporary local volume before they are uploaded to cloud storage. By default an ephemeral Kubernetes emptyDir volume is used. If the database being backed up is large, there might not be sufficient space in local storage. To use alternative storage, set tempVolume in values.yaml to a different Kubernetes Volume object.
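As an illustrative sketch (the claim name here is an assumption, not something the chart provides), tempVolume can point at a PersistentVolumeClaim instead of the default emptyDir:

```yaml
# values.yaml sketch: use a pre-provisioned PVC (name is hypothetical)
# as backup scratch space instead of the default ephemeral emptyDir.
tempVolume:
  persistentVolumeClaim:
    claimName: backup-scratch-pvc
```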
Backup Pointers
All backups are written as .tar.gz files with a date string of when they were taken, such as neo4j-2020-06-16-12:32:57.tar.gz. They are named after the database they are a backup of. When you take a backup, you get both the dated version and a "latest" copy; e.g. the file above is also copied to neo4j/neo4j-latest.tar.gz in the same bucket.
Reminder: each time you take a backup, the latest file will be overwritten.
The purpose of doing this is to have a stable name in storage where the latest backup can always be found, without losing any of the previous backups.
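The naming scheme described above can be sketched in shell (the timestamp here is the example from the text; real backups use the time the backup was taken):

```shell
# Illustrative sketch of the backup naming convention, not the chart's actual script.
DB="neo4j"
STAMP="2020-06-16-12:32:57"              # in practice: the time the backup was taken
DATED_FILE="${DB}-${STAMP}.tar.gz"       # dated copy, never overwritten
LATEST_FILE="${DB}/${DB}-latest.tar.gz"  # stable name, overwritten on each backup
echo "$DATED_FILE"
echo "$LATEST_FILE"
```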
Steps to Take a Backup
Use a Service Account to access cloud storage (Google Cloud only)
GCP
Workload Identity is the recommended way to access Google Cloud services from applications running within GKE due to its improved security properties and manageability.
Follow the GCP instructions to:
- Enable Workload Identity on your GKE cluster
- Create a Google Cloud IAMServiceAccount that has read and write permissions for your backup location
- Bind the IAMServiceAccount to the Neo4j deployment's Kubernetes ServiceAccount*
[*] You can configure the name of the Kubernetes ServiceAccount that a Neo4j deployment uses by setting serviceAccountName in values.yaml. To check which Kubernetes ServiceAccount a Neo4j deployment is using, run kubectl get pods <your neo4j pod name> -o=jsonpath='{.spec.serviceAccountName}{"\n"}'
If you are unable to use Workload Identity with GKE then you can create a service key secret instead as described in the next section.
Create a service key secret to access cloud storage
First, create a Kubernetes secret that contains the content of your account service key. This key must have permissions to access the bucket where your backups will be stored.
Azure
- Create a credentials file (e.g. azure-credentials.sh) that looks like this:
export ACCOUNT_NAME=<NAME_STORAGE_ACCOUNT>
export ACCOUNT_KEY=<STORAGE_ACCOUNT_KEY>
If you are unsure what the account key is, you can recover it with the following command:
ACCOUNT_KEY=$(az storage account keys list --resource-group "$AKS_RESOURCE_GROUP" --account-name "$STORAGE_ACCOUNT" --query [0].value -o tsv)
- Create a secret from this file:
kubectl create secret generic neo4j-azure-credentials \
    --from-file=credentials=azure-credentials.sh
AWS
- Create a credentials file (e.g. aws-credentials) in the standard AWS credentials format:
[default]
region=
aws_access_key_id=
aws_secret_access_key=
- Create a secret from this file:
kubectl create secret generic neo4j-aws-credentials \
    --from-file=credentials=aws-credentials
GCP
You do NOT need to follow the steps in this section if you are using Workload Identity for GCP.
Download a JSON-formatted service key from Google Cloud; it looks like this:
{
"type": "",
"project_id": "",
"private_key_id": "",
"private_key": "",
"client_email": "",
"client_id": "",
"auth_uri": "",
"token_uri": "",
"auth_provider_x509_cert_url": "",
"client_x509_cert_url": ""
}
- Create a secret from this file:
kubectl create secret generic neo4j-gcp-credentials \
    --from-file=credentials=gcp-credentials.json
'--from-file=credentials=<your-config-path>' here is important; the key under the secret must be named credentials.
Running a Backup
The backup method is itself a mini helm chart, so to run a backup you just run one of the following minimal examples.
Note: these commands must be run from within 'https://github.com/neo4j-contrib/neo4j-helm/tree/master/tools/backup' (the chart path is the '.' argument).
AWS
helm install my-neo4j-backup . \
--set neo4jaddr=my-neo4j.default.svc.cluster.local:6362 \
--set bucket=s3://my-bucket \
--set database="neo4j\,system" \
--set cloudProvider=aws \
--set secretName=neo4j-aws-credentials \
--set jobSchedule="0 */12 * * *"
GCP with Workload Identity
helm install my-neo4j-backup . \
--set neo4jaddr=my-neo4j.default.svc.cluster.local:6362 \
--set bucket=gs://my-bucket \
--set database="neo4j\,system" \
--set cloudProvider=gcp \
--set secretName=NULL \
--set serviceAccountName=my-neo4j-backup-sa \
--set jobSchedule="0 */12 * * *"
GCP with service key secret
helm install my-neo4j-backup . \
--set neo4jaddr=my-neo4j.default.svc.cluster.local:6362 \
--set bucket=gs://my-bucket \
--set database="neo4j\,system" \
--set cloudProvider=gcp \
--set secretName=neo4j-gcp-credentials \
--set jobSchedule="0 */12 * * *"
Azure
helm install my-neo4j-backup . \
--set neo4jaddr=my-neo4j.default.svc.cluster.local:6362 \
--set bucket=my-blob-container-name \
--set database="neo4j\,system" \
--set cloudProvider=azure \
--set secretName=neo4j-azure-credentials \
--set jobSchedule="0 */12 * * *"
Special notes for Azure storage: the chart requires a "bucket", but for Azure the naming is slightly different; the bucket specified is the blob container name where the files will be placed. Relative paths are respected: if you set bucket to container/path/to/directory, you will find your backup files stored in container at the path /path/to/directory/db/db-latest.tar.gz, where "db" is the name of the database being backed up (i.e. neo4j and system).
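The Azure path layout described above can be sketched in shell (the values here are illustrative):

```shell
# Given bucket=container/path/to/directory, split the blob container name
# from the relative path, then build the destination of the "latest" file
# for one database.
BUCKET="container/path/to/directory"
DB="neo4j"
CONTAINER="${BUCKET%%/*}"   # everything before the first slash: the container
RELPATH="${BUCKET#*/}"      # everything after it: the relative path
DEST="/${RELPATH}/${DB}/${DB}-latest.tar.gz"
echo "$CONTAINER"
echo "$DEST"
```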
If all goes well, after the Kubernetes Job completes you will see the backup files appear in the designated bucket, under directories named after the databases you backed up.
If your backup does not appear, consult the Job's pod container logs to find out why.
If you want to take a hot backup ahead of the schedule, you can use this command:
kubectl create job --from=cronjob/my-neo4j-backup-job neo4j-hot-backup
Required parameters
- neo4jaddr: the address where your Neo4j instance or cluster is running, ideally the discovery address.
- bucket: where you want the backup copied to. It should be gs://bucketname or s3://bucketname (for Azure, the blob container name; see above).
- database: a comma-separated list of databases to back up. The default is neo4j,system. If your DBMS has many individual databases, you should change this.
- cloudProvider: which cloud service to keep backups on (gcp, aws, or azure).
- jobSchedule: the interval at which to take backups, in cron format, e.g. "0 */12 * * *". You can set your own schedule (see https://crontab.guru/).
At least one of secretName and serviceAccountName must be set.
- secretName: the name of the secret you created (set to NULL if using Workload Identity on GKE)
- serviceAccountName: the name of the Kubernetes ServiceAccount to use for the backup Job (required if using Workload Identity on GKE)
Optional environment variables
All of the following variables mimic the command-line options for neo4j-admin backup documented here:
- pageCache
- heapSize
- fallbackToFull (true/false), default=true
- checkConsistency (true/false), default=true
- checkIndexes (true/false), default=true
- checkGraph (true/false), default=true
- checkLabelScanStore (true/false), default=true
- checkPropertyOwners (true/false), default=false
Exit Conditions
If the backup of any of the individual databases listed in the database parameter fails, the entire container will exit with a non-zero exit code and the Job will fail.
Note: it is possible for a Neo4j backup to succeed but with failed consistency checks. This will be noted in the logs, but operationally it behaves as a successful backup.
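The exit behaviour can be sketched roughly as follows (backup_one is a hypothetical stand-in, not the chart's actual script):

```shell
# Back up each database in turn; if any one fails, the container exits non-zero
# so the Kubernetes Job is marked as failed.
backup_one() {
  # Hypothetical stand-in for "neo4j-admin backup + upload" for one database.
  echo "backing up $1"
}

EXIT_CODE=0
for DB in neo4j system; do
  backup_one "$DB" || EXIT_CODE=1
done
# The real container would end with: exit $EXIT_CODE
echo "final exit code: $EXIT_CODE"
```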