apoc.load.html
Procedure Apoc Extended
apoc.load.html('url',{name: jquery, name2: jquery}, config) YIELD value - Loads an HTML page and returns the result as a map
Signature
apoc.load.html(url :: STRING?, query = {} :: MAP?, config = {} :: MAP?) :: (value :: MAP?)
Config parameters
The procedure supports the following config parameters:
| name | type | default | description |
|---|---|---|---|
| browser | | NONE | If set to "CHROME" or "FIREFOX", Selenium Web Driver is used to read content generated dynamically by JavaScript. With "NONE" (the default), dynamic content cannot be read. To use the Chrome or Firefox driver, the browser must be installed on your machine and additional jars must be downloaded into the plugin folder. See below. |
| wait | | | If greater than 0, the procedure waits until it finds at least one element for each selector in the query parameter, up to the given number of seconds, and then continues execution. Useful for elements rendered after the page is loaded (e.g. by slow asynchronous calls). |
| | | | The character set of the page being scraped, if |
| | | | Valid with |
| | | | If true, allows reading HTML from sites with insecure certificates. |
| | | | Base URI used to resolve relative paths. |
| failSilently | | | If the parse fails with one or more elements, using |
| htmlString | | | Use a string instead of a URL as the first parameter. |
Usage Examples
We can extract the metadata and h2 headings from the Wikipedia home page by running the following query:
CALL apoc.load.html("https://en.wikipedia.org/",{metadata:"meta", h2:"h2"});
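To work with individual matches, the result map can be unpacked with YIELD and UNWIND. The following is a minimal sketch. It assumes that each matched element is returned as a map with tagName, text and attributes keys, which is not stated on this page; the aliases heading, text and attributes are chosen for illustration only.

```cypher
// Sketch: return one row per h2 heading found on the page.
// Assumption: each list entry is a map with "tagName", "text" and "attributes" keys.
CALL apoc.load.html("https://en.wikipedia.org/", {h2: "h2"})
YIELD value
UNWIND value.h2 AS heading
RETURN heading.text AS text, heading.attributes AS attributes;
```

Each key of the query map (here h2) becomes a key of value, so several selectors can be unpacked in the same way.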
Let’s suppose we have a test.html file like this:
<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<h6 i d="error">test</h6>
<h6 id="correct">test</h6>
</html>
We can handle the parse error caused by i d through the failSilently configuration.
So, we can execute:
CALL apoc.load.html("test.html",{h6:"h6"});
This call fails with the default configuration:
Failed to invoke procedure apoc.load.html : Caused by: java.lang.RuntimeException: Error during parsing element: <h6 i d="error">test</h6>
We can instead set failSilently to WITH_LIST:
CALL apoc.load.html("test.html",{h6:"h6"}, {failSilently: 'WITH_LIST'});
or set failSilently to WITH_LOG (note that a log.warn("Error during parsing element: <h6 i d="error">test</h6>") entry will be created):
CALL apoc.load.html("test.html",{h6:"h6"}, {failSilently: 'WITH_LOG'});
Load from runtime generated file
If we have a test.html file with a jQuery script like:
<!DOCTYPE html>
<html>
<head>
<script src="https://code.jquery.com/jquery-1.9.1.min.js"></script>
<script type="text/javascript">
$(() => {
var newP = document.createElement("strong");
var textNode = document.createTextNode("This is a new text node");
newP.appendChild(textNode);
document.getElementById("appendStuff").appendChild(newP);
});
</script>
</head>
<body>
<div id="appendStuff"></div>
</body>
</html>
we can read the content generated by the script through the browser config.
Install dependencies
Note that to use the browser config (with any value other than "NONE"), you have to install additional dependencies, which can be downloaded from this link.
For example, with the above file we can execute:
CALL apoc.load.html("test.html",{strong: "strong"}, {browser: "FIREFOX"});
If we have to parse a tag rendered by a slow asynchronous call, we can use the wait config to wait up to 10 seconds (in this example):
CALL apoc.load.html("test.html",{asyncTag: "#asyncTag"}, {browser: "FIREFOX", wait: 10});
We can also pass an HTML string as the first parameter by setting the config parameter htmlString: true, for example:
CALL apoc.load.html("<!DOCTYPE html> <html> <body> <p class='firstClass'>My first paragraph.</p> </body> </html>",{body:"body"}, {htmlString: true})
YIELD value
RETURN value["body"] as body
CSS / jQuery selectors
The jsoup class org.jsoup.nodes.Element provides a set of functions that can be used.
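However, we can emulate all of them with the appropriate CSS/jQuery selectors, as shown below.
Where a selector starts with *, we can replace the * with a tag name to search only within that tag instead of everywhere (this does not apply to the last row, which is the bare * selector). Omitting the leading * returns the same result.
jsoup function | css/jQuery selector | description |
---|---|---|
getElementById(id) | *#id | Find an element by ID, including or under this element. |
getElementsByTag(tagName) | tagName | Finds elements, including and recursively under this element, with the specified tag name. |
getElementsByClass(className) | *.className | Find elements that have this class, including or under this element. |
getElementsByAttribute(key) | *[key] | Find elements that have a named attribute set. |
getElementsByAttributeStarting(keyPrefix) | *[^keyPrefix] | Find elements that have an attribute name starting with the supplied prefix. Use data- to find elements that have HTML5 datasets. |
getElementsByAttributeValue(key, value) | *[key=value] | Find elements that have an attribute with the specific value. |
getElementsByAttributeValueContaining(key, match) | *[key*=match] | Find elements that have attributes whose value contains the match string. |
getElementsByAttributeValueEnding(key, valueSuffix) | *[key$=valueSuffix] | Find elements that have attributes that end with the value suffix. |
getElementsByAttributeValueMatching(key, regex) | *[key~=regex] | Find elements that have attributes whose values match the supplied regular expression. |
getElementsByAttributeValueNot(key, value) | *:not([key=value]) | Find elements that either do not have this attribute, or have it with a different value. |
getElementsByAttributeValueStarting(key, valuePrefix) | *[key^=valuePrefix] | Find elements that have attributes that start with the value prefix. |
getElementsByIndexEquals(index) | *:eq(index) | Find elements whose sibling index is equal to the supplied index. |
getElementsByIndexGreaterThan(index) | *:gt(index) | Find elements whose sibling index is greater than the supplied index. |
getElementsByIndexLessThan(index) | *:lt(index) | Find elements whose sibling index is less than the supplied index. |
getElementsContainingOwnText(searchText) | *:containsOwn(searchText) | Find elements that directly contain the specified string. |
getElementsContainingText(searchText) | *:contains(searchText) | Find elements that contain the specified string. |
getElementsMatchingOwnText(regex) | *:matchesOwn(regex) | Find elements whose own text matches the supplied regular expression. |
getElementsMatchingText(regex) | *:matches(regex) | Find elements whose text matches the supplied regular expression. |
getAllElements() | * | Find all elements under this element (including self, and children of children). |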
For example, we can execute:
CALL apoc.load.html($url, {nameKey: '#idName'})
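Several selectors can be combined in one call, one per key of the query map. The sketch below uses htmlString: true so that it does not depend on an external page. The HTML snippet and the key names menu, items, links and first are made up for illustration, and the selectors follow standard jsoup selector syntax.

```cypher
// Sketch: run four different selectors against an inline HTML string.
WITH "<html><body><ul id='menu'><li class='item'><a href='/home'>Home</a></li><li class='item'><a href='/docs'>Docs</a></li></ul></body></html>" AS html
CALL apoc.load.html(html, {
  menu: "#menu",      // by ID
  items: "li.item",   // by tag name and class
  links: "a[href]",   // by attribute presence
  first: "li:eq(0)"   // by sibling index
}, {htmlString: true})
YIELD value
RETURN size(value.items) AS itemCount, value.links AS links;
```

Restricting a selector to a tag, as in li.item rather than .item, narrows the search to that tag instead of every element.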
HTML plain text representation
If, instead of a map of JSON list results, you want a map of plain text representations, you can use the apoc.load.htmlPlainText procedure, which uses the same syntax, logic and config parameters as apoc.load.html.
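As a minimal sketch, the htmlString example above can be repeated with apoc.load.htmlPlainText, relying only on the statement that it accepts the same arguments and config as apoc.load.html. The exact formatting of the returned text is not described on this page, so no expected output is shown.

```cypher
// Sketch: same arguments as apoc.load.html, but each key holds plain text.
CALL apoc.load.htmlPlainText("<!DOCTYPE html> <html> <body> <p class='firstClass'>My first paragraph.</p> </body> </html>", {body: "body"}, {htmlString: true})
YIELD value
RETURN value["body"] AS body;
```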