Neo4j-admin import
This tutorial provides detailed examples to illustrate the capabilities of importing data from CSV files with the command neo4j-admin database import
.
The import command is used for loading large amounts of data from CSV files and supports full and incremental import into a running or stopped Neo4j DBMS.
The |
Relationships are created by connecting node IDs, each node should have a unique ID to be able to be referenced when creating relationships between nodes.
In the following examples, the node IDs are stored as properties on the nodes.
If you do not want the IDs to persist as properties after the import completes, then do not specify a property name in the :ID
field.
The examples show how to import data in a standalone Neo4j DBMS. They use:
-
The Neo4j tarball (Unix console application).
-
$NEO4J_HOME
as the current working directory. -
The default database
neo4j
. -
The import directory of the Neo4j installation to store all the CSV files. However, the CSV files can be located in any directory of your file system.
-
UNIX-styled paths.
-
The
neo4j-admin database import
command.
Handy tips:
|
Import a small data set
In this example, you will import a small data set containing nodes and relationships. The data set is split into three CSV files, where each file has a header row describing the data.
The data
The data set contains information about movies, actors, and roles. Data for movies and actors are stored as nodes and the roles are stored as relationships.
The files you want to import data from are:
-
movies.csv
-
actors.csv
-
roles.csv
Each movie in movies.csv
has an ID
, a title
, and a year
, stored as properties in the node.
All the nodes in movies.csv
also have the label Movie
.
A node can have several labels, as you can see in movies.csv
there are nodes that also have the label Sequel
.
The node labels are optional, they are very useful for grouping nodes into sets where all nodes that have a certain label belong to the same set.
movieId:ID,title,year:int,:LABEL
tt0133093,"The Matrix",1999,Movie
tt0234215,"The Matrix Reloaded",2003,Movie;Sequel
tt0242653,"The Matrix Revolutions",2003,Movie;Sequel
The actors' data in actors.csv
consist of an ID
and a name
, stored as properties in the node.
The ID in this case is a shorthand for the actor’s name.
All the nodes in actors.csv
have the label Actor
.
personId:ID,name,:LABEL
keanu,"Keanu Reeves",Actor
laurence,"Laurence Fishburne",Actor
carrieanne,"Carrie-Anne Moss",Actor
The roles data in roles.csv
have only one property, role
.
Roles are represented by relationship data that connects actor nodes with movie nodes.
There are three mandatory fields for relationship data:
-
:START_ID
— ID referring to a node. -
:END_ID
— ID referring to a node. -
:TYPE
— The relationship type.
To create a relationship between two nodes, the IDs defined in actors.csv
and movies.csv
are used for the :START_ID
and :END_ID
fields.
You also need to provide a relationship type (in this case ACTED_IN
) for the :TYPE
field.
:START_ID,role,:END_ID,:TYPE
keanu,"Neo",tt0133093,ACTED_IN
keanu,"Neo",tt0234215,ACTED_IN
keanu,"Neo",tt0242653,ACTED_IN
laurence,"Morpheus",tt0133093,ACTED_IN
laurence,"Morpheus",tt0234215,ACTED_IN
laurence,"Morpheus",tt0242653,ACTED_IN
carrieanne,"Trinity",tt0133093,ACTED_IN
carrieanne,"Trinity",tt0234215,ACTED_IN
carrieanne,"Trinity",tt0242653,ACTED_IN
Importing the data
-
Paths to node data are defined with the
--nodes
option. -
Paths to relationship data is defined with the
--relationships
option.
The call to neo4j-admin database import
would look like this:
bin/neo4j-admin database import full neo4j --nodes=import/movies.csv --nodes=import/actors.csv --relationships=import/roles.csv
Query the data
To query the data. Start Neo4j.
The default username and password are |
bin/neo4j start
To query the imported data in the graph, try a simple Cypher query.
bin/cypher-shell --database=neo4j "MATCH (n) RETURN count(n) as nodes"
Stop Neo4j.
bin/neo4j stop
CSV file delimiters
You can customize the configuration options that the import tool uses (see Options) if your data does not fit the default format.
The details of a CSV file header format can be found at CSV header format.
The data
The following CSV files have:
-
--delimiter=";"
-
--array-delimiter="U+007C"
(U+007C
is the Unicode code point for the character|
) -
--quote="'"
movieId:ID;title;year:int;:LABEL
tt0133093;'The Matrix';1999;Movie
tt0234215;'The Matrix Reloaded';2003;Movie|Sequel
tt0242653;'The Matrix Revolutions';2003;Movie|Sequel
personId:ID;name;:LABEL
keanu;'Keanu Reeves';Actor
laurence;'Laurence Fishburne';Actor
carrieanne;'Carrie-Anne Moss';Actor
:START_ID;role;:END_ID;:TYPE
keanu;'Neo';tt0133093;ACTED_IN
keanu;'Neo';tt0234215;ACTED_IN
keanu;'Neo';tt0242653;ACTED_IN
laurence;'Morpheus';tt0133093;ACTED_IN
laurence;'Morpheus';tt0234215;ACTED_IN
laurence;'Morpheus';tt0242653;ACTED_IN
carrieanne;'Trinity';tt0133093;ACTED_IN
carrieanne;'Trinity';tt0234215;ACTED_IN
carrieanne;'Trinity';tt0242653;ACTED_IN
Importing the data
The call to neo4j-admin database import
would look like this:
bin/neo4j-admin database import full neo4j --delimiter=";" --array-delimiter="U+007C" --quote="'" --nodes=import/movies2.csv --nodes=import/actors2.csv --relationships=import/roles2.csv
Using separate header files
When dealing with very large CSV files, it is more convenient to have the header in a separate file. This makes it easier to edit the header as you avoid having to open a huge data file just to change it. The header file must be specified before the rest of the files in each file group.
The import tool can also process single-file compressed archives, for example:
-
--nodes=import/nodes.csv.gz
-
--relationships=import/relationships.zip
The data
You will use the same data set as in the previous example but with the headers in separate files.
movieId:ID,title,year:int,:LABEL
tt0133093,"The Matrix",1999,Movie
tt0234215,"The Matrix Reloaded",2003,Movie;Sequel
tt0242653,"The Matrix Revolutions",2003,Movie;Sequel
personId:ID,name,:LABEL
keanu,"Keanu Reeves",Actor
laurence,"Laurence Fishburne",Actor
carrieanne,"Carrie-Anne Moss",Actor
:START_ID,role,:END_ID,:TYPE
keanu,"Neo",tt0133093,ACTED_IN
keanu,"Neo",tt0234215,ACTED_IN
keanu,"Neo",tt0242653,ACTED_IN
laurence,"Morpheus",tt0133093,ACTED_IN
laurence,"Morpheus",tt0234215,ACTED_IN
laurence,"Morpheus",tt0242653,ACTED_IN
carrieanne,"Trinity",tt0133093,ACTED_IN
carrieanne,"Trinity",tt0234215,ACTED_IN
carrieanne,"Trinity",tt0242653,ACTED_IN
Importing the data
The call to neo4j-admin database import
would look as follows:
The header line for a file group, whether it is the first line of a file in the group or a dedicated header file, must be the first line in the file group. |
bin/neo4j-admin database import full neo4j --nodes=import/movies3-header.csv,import/movies3.csv --nodes=import/actors3-header.csv,import/actors3.csv --relationships=import/roles3-header.csv,import/roles3.csv
Multiple input files
In addition to using a separate header file, you can also provide multiple node or relationship files.
Files within such an input group can be specified with multiple match strings, delimited by ,
, where each matched string can be either the exact file name or a regular expression matching one or more files.
Multiple matching files will be sorted according to their characters and their natural number sort order for file names containing numbers.
The data
movieId:ID,title,year:int,:LABEL
tt0133093,"The Matrix",1999,Movie
tt0234215,"The Matrix Reloaded",2003,Movie;Sequel
tt0242653,"The Matrix Revolutions",2003,Movie;Sequel
personId:ID,name,:LABEL
keanu,"Keanu Reeves",Actor
laurence,"Laurence Fishburne",Actor
carrieanne,"Carrie-Anne Moss",Actor
:START_ID,role,:END_ID,:TYPE
keanu,"Neo",tt0133093,ACTED_IN
keanu,"Neo",tt0234215,ACTED_IN
keanu,"Neo",tt0242653,ACTED_IN
laurence,"Morpheus",tt0133093,ACTED_IN
laurence,"Morpheus",tt0234215,ACTED_IN
laurence,"Morpheus",tt0242653,ACTED_IN
carrieanne,"Trinity",tt0133093,ACTED_IN
carrieanne,"Trinity",tt0234215,ACTED_IN
carrieanne,"Trinity",tt0242653,ACTED_IN
Importing the data
The call to neo4j-admin database import
would look like this:
bin/neo4j-admin database import full neo4j --nodes=import/movies4-header.csv,import/movies4-part1.csv,import/movies4-part2.csv --nodes=import/actors4-header.csv,import/actors4-part1.csv,import/actors4-part2.csv --relationships=import/roles4-header.csv,import/roles4-part1.csv,import/roles4-part2.csv
Regular expressions
File names can be specified using regular expressions when there are many data source files. Each file name that matches the regular expression will be included.
If using separate header files, for the import to work correctly, the header file must be the first in the file group. When using regular expressions to specify the input files, the list of files will be sorted according to the names of the files that match the expression. The matching is aware of the numbers inside the file names and will sort them accordingly, without the need for padding with zeros.
For example, let’s assume that you have the following files:
-
movies4-header.csv
-
movies4-data1.csv
-
movies4-data2.csv
-
movies4-data12.csv
If you use the regular expression movies4.*
, the sorting will place the header file last and the import will fail.
A better alternative would be to name the header file explicitly and use a regular expression that only matches the names of the data files.
For example: --nodes "import/movies4-header.csv,movies-data.*"
will accomplish this.
Importing the data using regular expressions, the call to neo4j-admin database import
can be simplified to:
bin/neo4j-admin database import full neo4j --nodes="import/movies4-header.csv,import/movies4-part.*" --nodes="import/actors4-header.csv,import/actors4-part.*" --relationships="import/roles4-header.csv,import/roles4-part.*"
The use of regular expressions should not be confused with file globbing. The expression Another important difference to pay attention to is the sorting order.
The result of a regular expression matching will place the file |
Using the same label for every node
If you want to use the same node label(s) for every node in your nodes file you can do this by specifying the appropriate value as an option to neo4j-admin database import
.
There is then no need to specify the :LABEL
column in the header file and each row (node) will apply the specified labels from the command line option.
--nodes=LabelOne:LabelTwo=import/example-header.csv,import/example-data1.csv
It is possible to apply both the label provided in the file and the one provided on the command line to the node. |
The data
In this example, you want to have the label Movie
on every node specified in movies5a.csv
, and you put the labels Movie
and Sequel
on the nodes specified in sequels5a.csv
.
movieId:ID,title,year:int
tt0133093,"The Matrix",1999
movieId:ID,title,year:int
tt0234215,"The Matrix Reloaded",2003
tt0242653,"The Matrix Revolutions",2003
personId:ID,name
keanu,"Keanu Reeves"
laurence,"Laurence Fishburne"
carrieanne,"Carrie-Anne Moss"
:START_ID,role,:END_ID,:TYPE
keanu,"Neo",tt0133093,ACTED_IN
keanu,"Neo",tt0234215,ACTED_IN
keanu,"Neo",tt0242653,ACTED_IN
laurence,"Morpheus",tt0133093,ACTED_IN
laurence,"Morpheus",tt0234215,ACTED_IN
laurence,"Morpheus",tt0242653,ACTED_IN
carrieanne,"Trinity",tt0133093,ACTED_IN
carrieanne,"Trinity",tt0234215,ACTED_IN
carrieanne,"Trinity",tt0242653,ACTED_IN
Importing the data
The call to neo4j-admin database import
would look like this:
bin/neo4j-admin database import full neo4j --nodes=Movie=import/movies5a.csv --nodes=Movie:Sequel=import/sequels5a.csv --nodes=Actor=import/actors5a.csv --relationships=import/roles5a.csv
Using the same relationship type for every relationship
If you want to use the same relationship type for every relationship in your relationships file this can be done by specifying the appropriate value as an option to neo4j-admin database import
.
--relationships=TYPE=import/example-header.csv,import/example-data1.csv
If you provide a relationship type both on the command line and in the relationships file, the one in the file will be applied. |
The data
In this example, you want the relationship type ACTED_IN
to be applied on every relationship specified in roles5b.csv
.
movieId:ID,title,year:int,:LABEL
tt0133093,"The Matrix",1999,Movie
tt0234215,"The Matrix Reloaded",2003,Movie;Sequel
tt0242653,"The Matrix Revolutions",2003,Movie;Sequel
personId:ID,name,:LABEL
keanu,"Keanu Reeves",Actor
laurence,"Laurence Fishburne",Actor
carrieanne,"Carrie-Anne Moss",Actor
:START_ID,role,:END_ID
keanu,"Neo",tt0133093
keanu,"Neo",tt0234215
keanu,"Neo",tt0242653
laurence,"Morpheus",tt0133093
laurence,"Morpheus",tt0234215
laurence,"Morpheus",tt0242653
carrieanne,"Trinity",tt0133093
carrieanne,"Trinity",tt0234215
carrieanne,"Trinity",tt0242653
Importing the data
The call to neo4j-admin database import
would look like this:
bin/neo4j-admin database import full neo4j --nodes=import/movies5b.csv --nodes=import/actors5b.csv --relationships=ACTED_IN=import/roles5b.csv
Properties
Nodes and relationships can have properties. The property type is specified in the CSV header row, see CSV header format.
The data
The following example creates a small graph containing one actor and one movie connected by one relationship.
There is a roles
property on the relationship which contains an array of the characters played by the actor in a movie:
movieId:ID,title,year:int,:LABEL
tt0099892,"Joe Versus the Volcano",1990,Movie
personId:ID,name,:LABEL
meg,"Meg Ryan",Actor
:START_ID,roles:string[],:END_ID,:TYPE
meg,"DeDe;Angelica Graynamore;Patricia Graynamore",tt0099892,ACTED_IN
Importing the data
The call to neo4j-admin database import
would look like this:
bin/neo4j-admin database import full neo4j --nodes=import/movies6.csv --nodes=import/actors6.csv --relationships=import/roles6.csv
ID space
The import tool assumes that identifiers are unique across node files. This may not be the case for data sets that use sequential, auto-incremented, or otherwise colliding identifiers. Those data sets can define ID spaces where identifiers are unique within their respective ID space.
In cases where the node ID is unique only within files, using ID spaces is a way to ensure uniqueness across all node files. See Using ID spaces.
Each node processed by neo4j-admin database import
must provide an ID if it is to be connected in any relationships.
The node ID is used to find the start node and end node when creating a relationship.
From Neo4j 5.3, a node header can also contain multiple |
To define an ID space Movie-ID
for movieId:ID
, use the syntax movieId:ID(Movie-ID)
.
The data
For example, if movies and people both use sequential identifiers, then you would define Movie
and Actor
ID spaces.
movieId:ID(Movie-ID),title,year:int,:LABEL
1,"The Matrix",1999,Movie
2,"The Matrix Reloaded",2003,Movie;Sequel
3,"The Matrix Revolutions",2003,Movie;Sequel
personId:ID(Actor-ID),name,:LABEL
1,"Keanu Reeves",Actor
2,"Laurence Fishburne",Actor
3,"Carrie-Anne Moss",Actor
You also need to reference the appropriate ID space in your relationships file so it knows which nodes to connect.
:START_ID(Actor-ID),role,:END_ID(Movie-ID)
1,"Neo",1
1,"Neo",2
1,"Neo",3
2,"Morpheus",1
2,"Morpheus",2
2,"Morpheus",3
3,"Trinity",1
3,"Trinity",2
3,"Trinity",3
Importing the data
The call to neo4j-admin database import
would look like this:
bin/neo4j-admin database import full neo4j --nodes=import/movies7.csv --nodes=import/actors7.csv --relationships=ACTED_IN=import/roles7.csv
Skip relationships referring to missing nodes
The import tool has no tolerance for bad entities (relationships or nodes) and will fail the import on the first bad entity. You can specify explicitly that you want it to ignore rows that contain bad entities.
There are two different types of bad input:
-
Bad relationships.
-
Bad nodes.
Relationships that refer to missing node IDs, either for :START_ID
or :END_ID
are considered bad relationships.
Whether or not such relationships are skipped is controlled with the --skip-bad-relationships
flag, which can have the values true
or false
or no value, which means true
.
The default is false
, which means that any bad relationship is considered an error and will fail the import.
For more information, see the --skip-bad-relationships
option.
The data
In the following example, there is a missing emil
node referenced in the roles file.
movieId:ID,title,year:int,:LABEL
tt0133093,"The Matrix",1999,Movie
tt0234215,"The Matrix Reloaded",2003,Movie;Sequel
tt0242653,"The Matrix Revolutions",2003,Movie;Sequel
personId:ID,name,:LABEL
keanu,"Keanu Reeves",Actor
laurence,"Laurence Fishburne",Actor
carrieanne,"Carrie-Anne Moss",Actor
:START_ID,role,:END_ID,:TYPE
keanu,"Neo",tt0133093,ACTED_IN
keanu,"Neo",tt0234215,ACTED_IN
keanu,"Neo",tt0242653,ACTED_IN
laurence,"Morpheus",tt0133093,ACTED_IN
laurence,"Morpheus",tt0234215,ACTED_IN
laurence,"Morpheus",tt0242653,ACTED_IN
carrieanne,"Trinity",tt0133093,ACTED_IN
carrieanne,"Trinity",tt0234215,ACTED_IN
carrieanne,"Trinity",tt0242653,ACTED_IN
emil,"Emil",tt0133093,ACTED_IN
Importing the data
The call to neo4j-admin database import
would look like this:
bin/neo4j-admin database import full neo4j --nodes=import/movies8a.csv --nodes=import/actors8a.csv --relationships=import/roles8a.csv
Because there is a bad relationship in the input data, the import process will fail.
Let’s see what happens if you append the --skip-bad-relationships
flag:
bin/neo4j-admin database import full neo4j --skip-bad-relationships --nodes=import/movies8a.csv --nodes=import/actors8a.csv --relationships=import/roles8a.csv
The data files are successfully imported and the bad relationship is ignored.
An entry is written to the import.report
file.
InputRelationship:
source: roles8a.csv:11
properties: [role, Emil]
startNode: emil (global id space)
endNode: tt0133093 (global id space)
type: ACTED_IN
referring to missing node emil
Skip nodes with the same ID
Nodes that specify :ID
, which has already been specified within the ID space are considered bad nodes.
Whether or not such nodes are skipped is controlled with --skip-duplicate-nodes
flag which can have the values true
or false
or no value, which means true
.
The default is false
, which means that any duplicate node is considered an error and will fail the import.
For more information, see the --skip-duplicate-nodes
option.
The data
In the following example there is a node ID, laurence
, that is specified twice within the same ID space.
personId:ID,name,:LABEL
keanu,"Keanu Reeves",Actor
laurence,"Laurence Fishburne",Actor
carrieanne,"Carrie-Anne Moss",Actor
laurence,"Laurence Harvey",Actor
Importing the data
The call to neo4j-admin database import
would look like this:
bin/neo4j-admin database import full neo4j --database=neo4j --nodes=import/actors8b.csv
Because there is a bad node in the input data, the import process will fail.
Let’s see what happens if you append the --skip-duplicate-nodes
flag:
bin/neo4j-admin database import full neo4j --skip-duplicate-nodes --nodes=import/actors8b.csv
The data files are successfully imported and the bad node is ignored.
An entry is written to the import.report
file.
ID 'laurence' is defined more than once in global ID space, at least at actors8b.csv:3 and actors8b.csv:5