Disaster recovery
A database can become unavailable due to issues on different system levels. For example, a data center failover may lead to the loss of multiple servers, which may cause a set of databases to become unavailable. It is also possible for databases to become quarantined due to a critical failure in the system, which may lead to unavailability even without the loss of servers.
This section contains a step-by-step guide on how to recover unavailable databases, that is, databases that cannot serve writes but may still be able to serve reads. If a database is not performing as expected for other reasons, this section does not apply. By following the steps outlined here, you can recover the unavailable databases and make them fully operational with minimal impact on the other databases in the cluster.
If all servers in a Neo4j cluster are lost in a data center failover, it is not possible to recover the current cluster. You have to create a new cluster and restore the databases. See Deploy a basic cluster and Seed a database for more information.
Faults in clusters
Databases in clusters follow an allocation strategy. This means that they are allocated differently within the cluster and may also have different numbers of primaries and secondaries. As a result, servers differ in which databases they host. Losing a server in a cluster may cause some databases to lose a member while others are unaffected. Consequently, in a disaster where multiple servers go down, some databases may keep running with little to no impact, while others may lose all their allocated resources.
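The effect described above can be illustrated with a small, self-contained sketch. It is a toy model, not a Neo4j API: the allocation map, the server names, and the database names are all hypothetical, and the sketch only looks at primaries, using the majority rule mentioned later in this guide.

```python
def majority(total_primaries: int) -> int:
    """Smallest number of primaries that forms a majority."""
    return total_primaries // 2 + 1


def assess_loss(allocations, lost_servers):
    """Classify each database after the given servers are lost.

    allocations maps a database name to the set of servers hosting its
    primaries (hypothetical input, not read from a real cluster).
    """
    lost = set(lost_servers)
    report = {}
    for database, primaries in allocations.items():
        remaining = set(primaries) - lost
        if len(remaining) == len(primaries):
            report[database] = "unaffected"
        elif len(remaining) >= majority(len(primaries)):
            report[database] = "lost members, majority intact"
        else:
            report[database] = "majority lost"
    return report


# Hypothetical five-server cluster; server and database names are made up.
allocations = {
    "system": {"s1", "s2", "s3"},
    "orders": {"s1", "s2", "s3"},
    "audit": {"s3", "s4", "s5"},
    "cache": {"s4"},
}

for database, status in sorted(assess_loss(allocations, {"s4", "s5"}).items()):
    print(f"{database}: {status}")
```

In this made-up layout, losing s4 and s5 leaves system and orders untouched while audit and cache lose the majority of their primaries, which is the uneven impact the paragraph above describes.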
Guide to disaster recovery
There are three main steps to recovering a cluster from a disaster. Completing each step, regardless of the disaster scenario, is recommended to ensure the cluster is fully operational.
1. Ensure the system database is available in the cluster. The system database defines the configuration for the other databases; therefore, it is vital to ensure it is available before doing anything else.
2. After the system database’s availability is verified, whether recovered or unaffected by the disaster, recover the lost servers to ensure the cluster’s topology meets the requirements.
3. After the system database is available and the cluster’s topology is satisfied, you can manage the databases.
The steps are described in detail in the following sections.
In this section, an offline server is a server that is not running but may be restartable. A lost server, however, is a server that is currently not running and cannot be restarted.
Disasters may sometimes affect the routing capabilities of the driver and may prevent the use of the
Restore the system database
The first step of recovery is to ensure that the system database is available.
The system database is required for clusters to function properly.
1. Start all servers that are offline. (If a server is unable to start, inspect the logs and contact support personnel. The server may have to be considered indefinitely lost.)
2. Validate the system database’s availability.
   a. Run SHOW DATABASE system. If the response does not contain a writer, the system database is unavailable and needs to be recovered; continue to step 3.
   b. Optionally, you can create a temporary user to validate the system database’s writability by running CREATE USER 'temporaryUser' SET PASSWORD 'temporaryPassword'.
   c. Confirm that the temporary user is created as expected by running SHOW USERS, then continue to Recover servers. If not, continue to step 3.
3. Restore the system database.
   Only do the steps below if the system database’s availability cannot be validated by the first two steps in this section.
   Recall that the cluster remains fault tolerant, and thus able to serve both reads and writes, as long as a majority of the primaries are available. In case of a disaster affecting one or more servers, but where the majority of servers are still available, it is possible to add a new server to the cluster and recover the system database (and any other affected user databases) on it by copying the system database (and affected user databases) from one of the available servers. This method prevents downtime for the other databases in the cluster. If this is the case, i.e. if a majority of servers are still available, follow the instructions in Recover servers.
   The following steps create a new system database from a backup of the current system database. This is required since the current system database has lost too many members in the server failover.
   a. Shut down the Neo4j process on all servers. Note that this causes downtime for all databases in the cluster.
   b. On each server, run the neo4j-admin command bin/neo4j-admin dbms unbind-system-db to reset the system database state on the servers. See neo4j-admin commands for more information.
   c. On each server, run the neo4j-admin command bin/neo4j-admin database info system to find out which server is most up-to-date, i.e. has the highest last-committed transaction id.
   d. On the most up-to-date server, take a dump of the current system database by running bin/neo4j-admin database dump system --to-path=[path-to-dump] and store the dump in an accessible location. See neo4j-admin commands for more information.
   e. Ensure there are enough system database primaries to create the new system database with fault tolerance. Either:
      - Add completely new servers (see Add a server to the cluster), or
      - Change the system database mode (server.cluster.system_database_mode) on the current system database’s secondary servers to allow them to be primaries for the new system database.
   f. On each server, run bin/neo4j-admin database load system --from-path=[path-to-dump] --overwrite-destination=true to load the current system database dump.
   g. Ensure that dbms.cluster.discovery.endpoints are set correctly on all servers. See Cluster server discovery for more information.
   h. Return to step 1.
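Two of the decisions in the procedure above can be sketched in a few lines: whether the SHOW DATABASE system response contains a writer, and which server holds the highest last-committed transaction id. This is only an illustration under stated assumptions. The rows are hand-written dictionaries standing in for query results, the writer key is assumed to be the name of the boolean column in that output, and the transaction ids are assumed to have been copied manually from the output of bin/neo4j-admin database info system on each server. All server names and values are made up.

```python
def has_writer(rows):
    """Return True if any row of the SHOW DATABASE system output is a writer.

    Each row is a dict; the "writer" key is an assumed column name.
    """
    return any(row.get("writer") is True for row in rows)


def most_up_to_date(last_committed_tx_ids):
    """Return the server with the highest last-committed transaction id.

    last_committed_tx_ids maps a server name to the id noted down by hand
    from that server's `neo4j-admin database info system` output.
    """
    if not last_committed_tx_ids:
        raise ValueError("no transaction ids supplied")
    return max(last_committed_tx_ids, key=last_committed_tx_ids.get)


# Hypothetical response rows: no member reports itself as the writer.
rows = [
    {"address": "server-1:7687", "role": "primary", "writer": False},
    {"address": "server-2:7687", "role": "primary", "writer": False},
]

# Hypothetical ids collected from each server in step c.
tx_ids = {"server-1": 1042, "server-2": 1057, "server-3": 998}

if not has_writer(rows):
    print("system database has no writer; dump it from", most_up_to_date(tx_ids))
```

If two servers report the same highest id, max returns whichever appears first in the mapping; either is an equally valid source for the dump in that case.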
Recover servers
Once the system database is available, the cluster can be managed.
Following the loss of one or more servers, the cluster’s view of servers must be updated, i.e. the lost servers must be replaced by new servers.
The steps here identify the lost servers and safely detach them from the cluster.
1. Run SHOW SERVERS. If all servers show health AVAILABLE and status ENABLED, continue to Recover databases.
2. For each UNAVAILABLE server, run CALL dbms.cluster.cordonServer("unavailable-server-id") on one of the available servers.
3. For each CORDONED server, run DEALLOCATE DATABASES FROM SERVER cordoned-server-id on one of the available servers.
4. For each server that failed to deallocate with one of the following messages, apply the corresponding remedy:
   - Could not deallocate server(s) 'serverId'. Unable to reallocate 'DatabaseId.*'. Required topology for 'DatabaseId.*' is 3 primaries and 0 secondaries. Consider running SHOW SERVERS to determine what action is suitable to resolve this issue.
     or
     Could not deallocate server(s) 'serverId'. Database [database] has lost quorum of servers, only found [existing number of primaries] of [expected number of primaries]. Cannot be safely reallocated.
     First ensure that there is a backup for the database in question (see Online backup), and then drop the database by running DROP DATABASE database-name. Return to step 3.
   - Could not deallocate server [server]. Cannot change allocations for database [stopped-db] because it is offline.
     Try to start the offline database by running START DATABASE stopped-db WAIT. If it starts successfully, return to step 3. Otherwise, ensure that there is a backup for the database before dropping it with DROP DATABASE stopped-db. Return to step 3.
     A database can be set to read-only mode before it is started, to avoid updates on a database that is meant to be stopped: ALTER DATABASE database-name SET ACCESS READ ONLY.
   - Could not deallocate server [server]. Reallocation of [database] not possible, no new target found. All existing servers: [existing-servers]. Actual allocated server with mode [mode] is [current-hostings].
     Add new servers and enable them, then return to step 3. See Add a server to the cluster for more information.
5. Run SHOW SERVERS YIELD * to check that all enabled servers host the requested databases (the hosting field contains exactly the databases in the requestedHosting field), then proceed to the next step. Note that this may take a few minutes.
6. For each deallocated server, run DROP SERVER deallocated-server-id.
7. Return to step 1.
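The triage in the steps above can be summarised as a small decision function. The sketch below is a rough illustration, not a client for a real cluster: it works on hand-written dictionaries standing in for SHOW SERVERS YIELD * rows, and the key names name and state are assumptions about the column names (hosting and requestedHosting are the fields named in the text). It only models the health and state values mentioned in this guide.

```python
def plan_server_actions(rows):
    """Sort servers into the ones to cordon and the ones to deallocate.

    Mirrors the steps above: CORDONED servers are ready for
    DEALLOCATE DATABASES FROM SERVER, while other UNAVAILABLE servers
    still need dbms.cluster.cordonServer first.
    """
    plan = {"cordon": [], "deallocate": []}
    for row in rows:
        state = row["state"].upper()
        health = row["health"].upper()
        if state == "CORDONED":
            plan["deallocate"].append(row["name"])
        elif health == "UNAVAILABLE":
            plan["cordon"].append(row["name"])
    return plan


def allocations_settled(rows):
    """True when every enabled server hosts exactly its requested databases."""
    return all(
        sorted(row["hosting"]) == sorted(row["requestedHosting"])
        for row in rows
        if row["state"].upper() == "ENABLED"
    )


# Hypothetical rows; server and database names are made up.
rows = [
    {"name": "server-1", "state": "Enabled", "health": "Available",
     "hosting": ["system", "orders"], "requestedHosting": ["system", "orders"]},
    {"name": "server-2", "state": "Enabled", "health": "Unavailable",
     "hosting": ["system"], "requestedHosting": ["system", "orders"]},
    {"name": "server-3", "state": "Cordoned", "health": "Unavailable",
     "hosting": ["system", "orders"], "requestedHosting": ["system", "orders"]},
]

print(plan_server_actions(rows))
print("allocations settled:", allocations_settled(rows))
```

The comparison is case-insensitive because the guide writes these values in upper case, and the sketch does not assume how a given Neo4j version prints them.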
Recover databases
Once the system database is verified as available and all servers are online, the databases can be managed.
The steps here aim to make the unavailable databases available.
1. If you have previously dropped databases as part of this guide, re-create each one from a backup. See the Create databases section for more information on how to create a database.
2. Run SHOW DATABASES. If all databases are in desired states on all servers (requestedStatus = currentStatus), disaster recovery is complete.
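The final check can be expressed as a filter over the SHOW DATABASES rows. This is a minimal sketch using hand-written dictionaries rather than a live query. The requestedStatus and currentStatus keys come from the text above, while name and address are assumed column names, and all values are made up.

```python
def unsettled_databases(rows):
    """Return (database, server address) pairs not yet in the desired state."""
    return [
        (row["name"], row["address"])
        for row in rows
        if row["requestedStatus"] != row["currentStatus"]
    ]


# Hypothetical SHOW DATABASES rows, one per database per server.
rows = [
    {"name": "system", "address": "server-1:7687",
     "requestedStatus": "online", "currentStatus": "online"},
    {"name": "orders", "address": "server-1:7687",
     "requestedStatus": "online", "currentStatus": "online"},
    {"name": "orders", "address": "server-4:7687",
     "requestedStatus": "online", "currentStatus": "offline"},
]

pending = unsettled_databases(rows)
if pending:
    print("still recovering:", pending)
else:
    print("disaster recovery is complete")
```

An empty result corresponds to the completion condition above; any remaining pair points at the database and server that still need attention.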