Loading Data from Web-APIs

Supported protocols are file, http, https, s3, gs, hdfs with redirect allowed.

If no procedure is provided, this procedure will try to check whether the URL is actually a file.

As apoc.import.file.use_neo4j_config is enabled, the procedures check whether file system access is allowed and possibly constrained to a specific directory by reading the two configuration parameters dbms.security.allow_csv_import_from_file_urls and dbms.directories.import respectively. If you want to remove these constraints please set apoc.import.file.use_neo4j_config=false

CALL apoc.load.json('http://example.com/map.json', [path], [config]) YIELD value as person CREATE (p:Person) SET p = person

load from JSON URL (e.g. web-api) to import JSON as stream of values if the JSON was an array or a single value if it was a map

CALL apoc.load.xml('http://example.com/test.xml', ['xPath'], [config]) YIELD value as doc CREATE (p:Person) SET p.name = doc.name

load from XML URL (e.g. web-api) to import XML as single nested map with attributes and _type, _text and _children fields.

CALL apoc.load.csv('url',{sep:";"}) YIELD lineNo, list, strings, map, stringMap

load CSV fom URL as stream of values
config contains any of: {skip:1,limit:5,header:false,sep:'TAB',ignore:['aColumn'],arraySep:';',results:['map','list','strings','stringMap'], nullValues:[''],mapping:{years:{type:'int',arraySep:'-',array:false,name:'age',ignore:false,nullValues:['n.A.']}}

CALL apoc.load.xls('url','Sheet'/'Sheet!A2:B5',{config}) YIELD lineNo, list, map

load XLS fom URL as stream of values
config contains any of: {skip:1,limit:5,header:false,ignore:['aColumn'],arraySep:';'+ nullValues:[''],mapping:{years:{type:'int',arraySep:'-',array:false,name:'age',ignore:false,nullValues:['n.A.']}}

Load Single File From Compressed File (zip/tar/tar.gz/tgz)

When loading data from compressed files, we need to put the ! character before the file name or path in the compressed file. For example:

Loading a compressed CSV file

CALL apoc.load.csv("pathToCompressedFile/file.zip!pathToCsvFileInZip/fileName.csv")

Loading a compressed JSON file

CALL apoc.load.json("https://github.com/neo4j-contrib/neo4j-apoc-procedures/blob/4.3/core/src/test/resources/testload.tgz?raw=true!person.json");

Using S3 protocol

When using the S3 protocol we need to download and copy the following jars into the plugins directory:

aws-java-sdk-core-1.12.136.jar (https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-core/1.12.136)
aws-java-sdk-s3-1.12.136.jar (https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-s3/1.12.136)
httpclient-4.5.13.jar (https://mvnrepository.com/artifact/org.apache.httpcomponents/httpclient/4.5.13)
httpcore-4.4.15.jar (https://mvnrepository.com/artifact/org.apache.httpcomponents/httpcore/4.4.15)
joda-time-2.10.13.jar (https://mvnrepository.com/artifact/joda-time/joda-time/2.10.13)

Once those files have been copied we’ll need to restart the database.

The S3 URL must be in the following format:

s3://accessKey:secretKey[:sessionToken]@endpoint:port/bucket/key (where the sessionToken is optional) or
s3://endpoint:port/bucket/key?accessKey=accessKey&secretKey=secretKey[&sessionToken=sessionToken] (where the sessionToken is optional) or
s3://endpoint:port/bucket/key if the accessKey, secretKey, and the optional sessionToken are provided in the environment variables

Using hdfs protocol

To use the hdfs protocol we need to download and copy the additional jars not included in the APOC Library. We can download it from this link or locally downloading the apoc repository:

git clone http://github.com/neo4j-contrib/neo4j-apoc-procedures
cd neo4j-apoc-procedures/extra-dependencies
./gradlew shadow

and a jar named apoc-hadoop-dependencies-4.3.0.12.jar will be created into the neo4j-apoc-procedures/extra-dependencies/hadoop/build/lib folder.

Once that file is downloaded/created, it should be placed in the plugins directory and the Neo4j Server restarted.

Using Google Cloud Storage

In order to use Google Cloud Storage, you need to add the following Google Cloud dependencies in the plugins directory:

api-common-1.8.1.jar
failureaccess-1.0.1.jar
gax-1.48.1.jar
gax-httpjson-0.65.1.jar
google-api-client-1.30.2.jar
google-api-services-storage-v1-rev20190624-1.30.1.jar
google-auth-library-credentials-0.17.1.jar
google-auth-library-oauth2-http-0.17.1.jar
google-cloud-core-1.90.0.jar
google-cloud-core-http-1.90.0.jar
google-cloud-storage-1.90.0.jar
google-http-client-1.31.0.jar
google-http-client-appengine-1.31.0.jar
google-http-client-jackson2-1.31.0.jar
google-oauth-client-1.30.1.jar
grpc-context-1.19.0.jar
guava-28.0-android.jar
opencensus-api-0.21.0.jar
opencensus-contrib-http-util-0.21.0.jar
proto-google-common-protos-1.16.0.jar
proto-google-iam-v1-0.12.0.jar
protobuf-java-3.9.1.jar
protobuf-java-util-3.9.1.jar
threetenbp-1.3.3.jar

We’ve prepared an uber-jar that contains the above dependencies in a single file in order simplify the process. You can download it from here and copy it to your plugins directory.

You can use Google Cloud storage via the following url format:

gs://<bucket_name>/<file_path>

Moreover, you can also specify the authorization type via an additional authenticationType query parameter:

NONE: for public buckets (this is the default behavior if the parameter is not specified)
GCP_ENVIRONMENT: for passive authentication as a service account when Neo4j is running in the Google Cloud
PRIVATE_KEY: for using private keys generated for service accounts (requires setting GOOGLE_APPLICATION_CREDENTIALS environment variable pointing to a private key json file as described here: https://cloud.google.com/docs/authentication#strategies)

Example:

gs://andrea-bucket-1/test-privato.csv?authenticationType=GCP_ENVIRONMENT

failOnError

Adding the config parameter failOnError:false (by default true), will mean that in the case of an error the procedure will not fail, but just return zero rows.