Neo4j-admin import

This tutorial provides detailed examples to illustrate the capabilities of importing data from CSV files with the command neo4j-admin database import. The import command is used for loading large amounts of data from CSV files and supports full and incremental import into a running or stopped Neo4j DBMS.

The neo4j-admin database import command does not create a database, the command only imports data and makes it available for the database. The database must not exist before the neo4j-admin database import command has been executed, and the database should be created afterward. The command will exit with an error message if the database already exists.

Relationships are created by connecting node IDs, each node should have a unique ID to be able to be referenced when creating relationships between nodes. In the following examples, the node IDs are stored as properties on the nodes. If you do not want the IDs to persist as properties after the import completes, then do not specify a property name in the :ID field.

The examples show how to import data in a standalone Neo4j DBMS. They use:

The Neo4j tarball (Unix console application).
$NEO4J_HOME as the current working directory.
The default database neo4j.
The import directory of the Neo4j installation to store all the CSV files. However, the CSV files can be located in any directory of your file system.
UNIX-styled paths.
The Neo4j Admin and Neo4j CLI[neo4j-admin database import] command.

Handy tips:

The details of a CSV file header format can be found at CSV header format.
To show available databases, use the Cypher query SHOW DATABASES against the system database.
To remove a database, use the Cypher query DROP DATABASE database_name against the system database.
To create a database, use the Cypher query CREATE DATABASE database_name against the system database.

Import a small data set

In this example, you will import a small data set containing nodes and relationships. The data set is split into three CSV files, where each file has a header row describing the data.

The data

The data set contains information about movies, actors, and roles. Data for movies and actors are stored as nodes and the roles are stored as relationships.

The files you want to import data from are:

movies.csv
actors.csv
roles.csv

Each movie in movies.csv has an ID, a title, and a year, stored as properties in the node. All the nodes in movies.csv also have the label Movie. A node can have several labels, as you can see in movies.csv there are nodes that also have the label Sequel. The node labels are optional, they are very useful for grouping nodes into sets where all nodes that have a certain label belong to the same set.

movies.csv

movieId:ID,title,year:int,:LABEL
tt0133093,"The Matrix",1999,Movie
tt0234215,"The Matrix Reloaded",2003,Movie;Sequel
tt0242653,"The Matrix Revolutions",2003,Movie;Sequel

The actors' data in actors.csv consist of an ID and a name, stored as properties in the node. The ID in this case is a shorthand for the actor’s name. All the nodes in actors.csv have the label Actor.

actors.csv

personId:ID,name,:LABEL
keanu,"Keanu Reeves",Actor
laurence,"Laurence Fishburne",Actor
carrieanne,"Carrie-Anne Moss",Actor

The roles data in roles.csv have only one property, role. Roles are represented by relationship data that connects actor nodes with movie nodes.

There are three mandatory fields for relationship data:

:START_ID — ID referring to a node.
:END_ID — ID referring to a node.
:TYPE — The relationship type.

To create a relationship between two nodes, the IDs defined in actors.csv and movies.csv are used for the :START_ID and :END_ID fields. You also need to provide a relationship type (in this case ACTED_IN) for the :TYPE field.

roles.csv

:START_ID,role,:END_ID,:TYPE
keanu,"Neo",tt0133093,ACTED_IN
keanu,"Neo",tt0234215,ACTED_IN
keanu,"Neo",tt0242653,ACTED_IN
laurence,"Morpheus",tt0133093,ACTED_IN
laurence,"Morpheus",tt0234215,ACTED_IN
laurence,"Morpheus",tt0242653,ACTED_IN
carrieanne,"Trinity",tt0133093,ACTED_IN
carrieanne,"Trinity",tt0234215,ACTED_IN
carrieanne,"Trinity",tt0242653,ACTED_IN

Importing the data

Paths to node data are defined with the --nodes option.
Paths to relationship data is defined with the --relationships option.

The call to neo4j-admin database import would look like this:

shell

bin/neo4j-admin database import full neo4j --nodes=import/movies.csv --nodes=import/actors.csv --relationships=import/roles.csv

Query the data

To query the data. Start Neo4j.

The default username and password are neo4j and neo4j.

shell

bin/neo4j start

To query the imported data in the graph, try a simple Cypher query.

shell

bin/cypher-shell --database=neo4j "MATCH (n) RETURN count(n) as nodes"

Stop Neo4j.

shell

bin/neo4j stop

CSV file delimiters

You can customize the configuration options that the import tool uses (see Options) if your data does not fit the default format.

The details of a CSV file header format can be found at CSV header format.

The data

The following CSV files have:

--delimiter=";"
--array-delimiter="U+007C" (U+007C is the Unicode code point for the character |)
--quote="'"

movies2.csv

movieId:ID;title;year:int;:LABEL
tt0133093;'The Matrix';1999;Movie
tt0234215;'The Matrix Reloaded';2003;Movie|Sequel
tt0242653;'The Matrix Revolutions';2003;Movie|Sequel

actors2.csv

personId:ID;name;:LABEL
keanu;'Keanu Reeves';Actor
laurence;'Laurence Fishburne';Actor
carrieanne;'Carrie-Anne Moss';Actor

roles2.csv

:START_ID;role;:END_ID;:TYPE
keanu;'Neo';tt0133093;ACTED_IN
keanu;'Neo';tt0234215;ACTED_IN
keanu;'Neo';tt0242653;ACTED_IN
laurence;'Morpheus';tt0133093;ACTED_IN
laurence;'Morpheus';tt0234215;ACTED_IN
laurence;'Morpheus';tt0242653;ACTED_IN
carrieanne;'Trinity';tt0133093;ACTED_IN
carrieanne;'Trinity';tt0234215;ACTED_IN
carrieanne;'Trinity';tt0242653;ACTED_IN

Importing the data

The call to neo4j-admin database import would look like this:

shell

bin/neo4j-admin database import full neo4j --delimiter=";" --array-delimiter="U+007C" --quote="'" --nodes=import/movies2.csv --nodes=import/actors2.csv --relationships=import/roles2.csv

Using separate header files

When dealing with very large CSV files, it is more convenient to have the header in a separate file. This makes it easier to edit the header as you avoid having to open a huge data file just to change it. The header file must be specified before the rest of the files in each file group.

The import tool can also process single-file compressed archives, for example:

--nodes=import/nodes.csv.gz
--relationships=import/relationships.zip

The data

You will use the same data set as in the previous example but with the headers in separate files.

movies3-header.csv

movieId:ID,title,year:int,:LABEL

movies3.csv

tt0133093,"The Matrix",1999,Movie
tt0234215,"The Matrix Reloaded",2003,Movie;Sequel
tt0242653,"The Matrix Revolutions",2003,Movie;Sequel

actors3-header.csv

personId:ID,name,:LABEL

actors3.csv

keanu,"Keanu Reeves",Actor
laurence,"Laurence Fishburne",Actor
carrieanne,"Carrie-Anne Moss",Actor

roles3-header.csv

:START_ID,role,:END_ID,:TYPE

roles3.csv

keanu,"Neo",tt0133093,ACTED_IN
keanu,"Neo",tt0234215,ACTED_IN
keanu,"Neo",tt0242653,ACTED_IN
laurence,"Morpheus",tt0133093,ACTED_IN
laurence,"Morpheus",tt0234215,ACTED_IN
laurence,"Morpheus",tt0242653,ACTED_IN
carrieanne,"Trinity",tt0133093,ACTED_IN
carrieanne,"Trinity",tt0234215,ACTED_IN
carrieanne,"Trinity",tt0242653,ACTED_IN

Importing the data

The call to neo4j-admin database import would look as follows:

The header line for a file group, whether it is the first line of a file in the group or a dedicated header file, must be the first line in the file group.

shell

bin/neo4j-admin database import full neo4j --nodes=import/movies3-header.csv,import/movies3.csv --nodes=import/actors3-header.csv,import/actors3.csv --relationships=import/roles3-header.csv,import/roles3.csv

Multiple input files

In addition to using a separate header file, you can also provide multiple node or relationship files. Files within such an input group can be specified with multiple match strings, delimited by ,, where each matched string can be either the exact file name or a regular expression matching one or more files. Multiple matching files will be sorted according to their characters and their natural number sort order for file names containing numbers.

The data

movies4-header.csv

movieId:ID,title,year:int,:LABEL

movies4-part1.csv

tt0133093,"The Matrix",1999,Movie
tt0234215,"The Matrix Reloaded",2003,Movie;Sequel

movies4-part2.csv

tt0242653,"The Matrix Revolutions",2003,Movie;Sequel

actors4-header.csv

personId:ID,name,:LABEL

actors4-part1.csv

keanu,"Keanu Reeves",Actor
laurence,"Laurence Fishburne",Actor

actors4-part2.csv

carrieanne,"Carrie-Anne Moss",Actor

roles4-header.csv

:START_ID,role,:END_ID,:TYPE

roles4-part1.csv

keanu,"Neo",tt0133093,ACTED_IN
keanu,"Neo",tt0234215,ACTED_IN
keanu,"Neo",tt0242653,ACTED_IN
laurence,"Morpheus",tt0133093,ACTED_IN
laurence,"Morpheus",tt0234215,ACTED_IN

roles4-part2.csv

laurence,"Morpheus",tt0242653,ACTED_IN
carrieanne,"Trinity",tt0133093,ACTED_IN
carrieanne,"Trinity",tt0234215,ACTED_IN
carrieanne,"Trinity",tt0242653,ACTED_IN

Importing the data

The call to neo4j-admin database import would look like this:

shell

bin/neo4j-admin database import full neo4j --nodes=import/movies4-header.csv,import/movies4-part1.csv,import/movies4-part2.csv --nodes=import/actors4-header.csv,import/actors4-part1.csv,import/actors4-part2.csv --relationships=import/roles4-header.csv,import/roles4-part1.csv,import/roles4-part2.csv

Regular expressions

File names can be specified using regular expressions when there are many data source files. Each file name that matches the regular expression will be included.

If using separate header files, for the import to work correctly, the header file must be the first in the file group. When using regular expressions to specify the input files, the list of files will be sorted according to the names of the files that match the expression. The matching is aware of the numbers inside the file names and will sort them accordingly, without the need for padding with zeros.

Example 1. Match order

For example, let’s assume that you have the following files:

movies4-header.csv
movies4-data1.csv
movies4-data2.csv
movies4-data12.csv

If you use the regular expression movies4.*, the sorting will place the header file last and the import will fail. A better alternative would be to name the header file explicitly and use a regular expression that only matches the names of the data files. For example: --nodes "import/movies4-header.csv,movies-data.*" will accomplish this.

Importing the data using regular expressions, the call to neo4j-admin database import can be simplified to:

shell

bin/neo4j-admin database import full neo4j --nodes="import/movies4-header.csv,import/movies4-part.*" --nodes="import/actors4-header.csv,import/actors4-part.*" --relationships="import/roles4-header.csv,import/roles4-part.*"

The use of regular expressions should not be confused with file globbing.

The expression .* means: "zero or more occurrences of any character except line break". Therefore, the regular expression movies4.* will list all files starting with movies4. Conversely, with file globbing, ls movies4.* will list all files starting with movies4..

Another important difference to pay attention to is the sorting order. The result of a regular expression matching will place the file movies4-part2.csv before the file movies4-part12.csv. If doing ls movies4-part* in a directory containing the above-listed files, the file movies4-part12.csv will be listed before the file movies4-part2.csv.

Using the same label for every node

If you want to use the same node label(s) for every node in your nodes file you can do this by specifying the appropriate value as an option to neo4j-admin database import. There is then no need to specify the :LABEL column in the header file and each row (node) will apply the specified labels from the command line option.

Example 2. Specify node labels option

--nodes=LabelOne:LabelTwo=import/example-header.csv,import/example-data1.csv

It is possible to apply both the label provided in the file and the one provided on the command line to the node.

The data

In this example, you want to have the label Movie on every node specified in movies5a.csv, and you put the labels Movie and Sequel on the nodes specified in sequels5a.csv.

movies5a.csv

movieId:ID,title,year:int
tt0133093,"The Matrix",1999

sequels5a.csv

movieId:ID,title,year:int
tt0234215,"The Matrix Reloaded",2003
tt0242653,"The Matrix Revolutions",2003

actors5a.csv

personId:ID,name
keanu,"Keanu Reeves"
laurence,"Laurence Fishburne"
carrieanne,"Carrie-Anne Moss"

roles5a.csv

:START_ID,role,:END_ID,:TYPE
keanu,"Neo",tt0133093,ACTED_IN
keanu,"Neo",tt0234215,ACTED_IN
keanu,"Neo",tt0242653,ACTED_IN
laurence,"Morpheus",tt0133093,ACTED_IN
laurence,"Morpheus",tt0234215,ACTED_IN
laurence,"Morpheus",tt0242653,ACTED_IN
carrieanne,"Trinity",tt0133093,ACTED_IN
carrieanne,"Trinity",tt0234215,ACTED_IN
carrieanne,"Trinity",tt0242653,ACTED_IN

Importing the data

The call to neo4j-admin database import would look like this:

shell

bin/neo4j-admin database import full neo4j --nodes=Movie=import/movies5a.csv --nodes=Movie:Sequel=import/sequels5a.csv --nodes=Actor=import/actors5a.csv --relationships=import/roles5a.csv

Using the same relationship type for every relationship

If you want to use the same relationship type for every relationship in your relationships file this can be done by specifying the appropriate value as an option to neo4j-admin database import.

Example 3. Specify relationship type option

--relationships=TYPE=import/example-header.csv,import/example-data1.csv

If you provide a relationship type both on the command line and in the relationships file, the one in the file will be applied.

The data

In this example, you want the relationship type ACTED_IN to be applied on every relationship specified in roles5b.csv.

movies5b.csv

movieId:ID,title,year:int,:LABEL
tt0133093,"The Matrix",1999,Movie
tt0234215,"The Matrix Reloaded",2003,Movie;Sequel
tt0242653,"The Matrix Revolutions",2003,Movie;Sequel

actors5b.csv

personId:ID,name,:LABEL
keanu,"Keanu Reeves",Actor
laurence,"Laurence Fishburne",Actor
carrieanne,"Carrie-Anne Moss",Actor

roles5b.csv

:START_ID,role,:END_ID
keanu,"Neo",tt0133093
keanu,"Neo",tt0234215
keanu,"Neo",tt0242653
laurence,"Morpheus",tt0133093
laurence,"Morpheus",tt0234215
laurence,"Morpheus",tt0242653
carrieanne,"Trinity",tt0133093
carrieanne,"Trinity",tt0234215
carrieanne,"Trinity",tt0242653

Importing the data

The call to neo4j-admin database import would look like this:

shell

bin/neo4j-admin database import full neo4j --nodes=import/movies5b.csv --nodes=import/actors5b.csv --relationships=ACTED_IN=import/roles5b.csv

Properties

Nodes and relationships can have properties. The property type is specified in the CSV header row, see CSV header format.

The data

The following example creates a small graph containing one actor and one movie connected by one relationship.

There is a roles property on the relationship which contains an array of the characters played by the actor in a movie:

movies6.csv

movieId:ID,title,year:int,:LABEL
tt0099892,"Joe Versus the Volcano",1990,Movie

actors6.csv

personId:ID,name,:LABEL
meg,"Meg Ryan",Actor

roles6.csv

:START_ID,roles:string[],:END_ID,:TYPE
meg,"DeDe;Angelica Graynamore;Patricia Graynamore",tt0099892,ACTED_IN

Importing the data

The call to neo4j-admin database import would look like this:

shell

bin/neo4j-admin database import full neo4j --nodes=import/movies6.csv --nodes=import/actors6.csv --relationships=import/roles6.csv

ID space

The import tool assumes that identifiers are unique across node files. This may not be the case for data sets that use sequential, auto-incremented, or otherwise colliding identifiers. Those data sets can define ID spaces where identifiers are unique within their respective ID space.

In cases where the node ID is unique only within files, using ID spaces is a way to ensure uniqueness across all node files. See Using ID spaces.

Each node processed by neo4j-admin database import must provide an ID if it is to be connected in any relationships. The node ID is used to find the start node and end node when creating a relationship.

A node header can also contain multiple ID columns, where the relationship data references the composite value of all those columns. This also implies using string as id-type. For each ID column, you can specify to store its values as different node properties. However, the composite value cannot be stored as a node property.

Example 4. ID space

To define an ID space Movie-ID for movieId:ID, use the syntax movieId:ID(Movie-ID).

The data

For example, if movies and people both use sequential identifiers, then you would define Movie and Actor ID spaces.

movies7.csv

movieId:ID(Movie-ID),title,year:int,:LABEL
1,"The Matrix",1999,Movie
2,"The Matrix Reloaded",2003,Movie;Sequel
3,"The Matrix Revolutions",2003,Movie;Sequel

actors7.csv

personId:ID(Actor-ID),name,:LABEL
1,"Keanu Reeves",Actor
2,"Laurence Fishburne",Actor
3,"Carrie-Anne Moss",Actor

You also need to reference the appropriate ID space in your relationships file so it knows which nodes to connect.

roles7.csv

:START_ID(Actor-ID),role,:END_ID(Movie-ID)
1,"Neo",1
1,"Neo",2
1,"Neo",3
2,"Morpheus",1
2,"Morpheus",2
2,"Morpheus",3
3,"Trinity",1
3,"Trinity",2
3,"Trinity",3

Importing the data

The call to neo4j-admin database import would look like this:

shell

bin/neo4j-admin database import full neo4j --nodes=import/movies7.csv --nodes=import/actors7.csv --relationships=ACTED_IN=import/roles7.csv

Skip relationships referring to missing nodes

The import tool has no tolerance for bad entities (relationships or nodes) and will fail the import on the first bad entity. You can specify explicitly that you want it to ignore rows that contain bad entities.

There are two different types of bad input:

Bad relationships.
Bad nodes.

Relationships that refer to missing node IDs, either for :START_ID or :END_ID are considered bad relationships. Whether or not such relationships are skipped is controlled with the --skip-bad-relationships flag, which can have the values true or false or no value, which means true. The default is false, which means that any bad relationship is considered an error and will fail the import. For more information, see the --skip-bad-relationships option.

The data

In the following example, there is a missing emil node referenced in the roles file.

movies8a.csv

movieId:ID,title,year:int,:LABEL
tt0133093,"The Matrix",1999,Movie
tt0234215,"The Matrix Reloaded",2003,Movie;Sequel
tt0242653,"The Matrix Revolutions",2003,Movie;Sequel

actors8a.csv

personId:ID,name,:LABEL
keanu,"Keanu Reeves",Actor
laurence,"Laurence Fishburne",Actor
carrieanne,"Carrie-Anne Moss",Actor

roles8a.csv

:START_ID,role,:END_ID,:TYPE
keanu,"Neo",tt0133093,ACTED_IN
keanu,"Neo",tt0234215,ACTED_IN
keanu,"Neo",tt0242653,ACTED_IN
laurence,"Morpheus",tt0133093,ACTED_IN
laurence,"Morpheus",tt0234215,ACTED_IN
laurence,"Morpheus",tt0242653,ACTED_IN
carrieanne,"Trinity",tt0133093,ACTED_IN
carrieanne,"Trinity",tt0234215,ACTED_IN
carrieanne,"Trinity",tt0242653,ACTED_IN
emil,"Emil",tt0133093,ACTED_IN

Importing the data

The call to neo4j-admin database import would look like this:

shell

bin/neo4j-admin database import full neo4j --nodes=import/movies8a.csv --nodes=import/actors8a.csv --relationships=import/roles8a.csv

Because there is a bad relationship in the input data, the import process will fail.

Let’s see what happens if you append the --skip-bad-relationships flag:

shell

bin/neo4j-admin database import full neo4j --skip-bad-relationships --nodes=import/movies8a.csv --nodes=import/actors8a.csv --relationships=import/roles8a.csv

The data files are successfully imported and the bad relationship is ignored. An entry is written to the import.report file.

ignore bad relationships

InputRelationship:
   source: roles8a.csv:11
   properties: [role, Emil]
   startNode: emil (global id space)
   endNode: tt0133093 (global id space)
   type: ACTED_IN
 referring to missing node emil

Skip nodes with the same ID

Nodes that specify :ID, which has already been specified within the ID space are considered bad nodes. Whether or not such nodes are skipped is controlled with --skip-duplicate-nodes flag which can have the values true or false or no value, which means true. The default is false, which means that any duplicate node is considered an error and will fail the import. For more information, see the --skip-duplicate-nodes option.

The data

In the following example there is a node ID, laurence, that is specified twice within the same ID space.

actors8b.csv

personId:ID,name,:LABEL
keanu,"Keanu Reeves",Actor
laurence,"Laurence Fishburne",Actor
carrieanne,"Carrie-Anne Moss",Actor
laurence,"Laurence Harvey",Actor

Importing the data

The call to neo4j-admin database import would look like this:

shell

bin/neo4j-admin database import full neo4j --database=neo4j --nodes=import/actors8b.csv

Because there is a bad node in the input data, the import process will fail.

Let’s see what happens if you append the --skip-duplicate-nodes flag:

shell

bin/neo4j-admin database import full neo4j --skip-duplicate-nodes --nodes=import/actors8b.csv

The data files are successfully imported and the bad node is ignored. An entry is written to the import.report file.

ignore bad nodes

ID 'laurence' is defined more than once in global ID space, at least at actors8b.csv:3 and actors8b.csv:5