Load HTML
Scraping Data from HTML Pages
Load an HTML page and return the result as a map.
This procedure provides a convenient API for querying a page using DOM, CSS and jQuery-like selectors. It relies on the jsoup library.
CALL apoc.load.html(url, {name: <css/dom query>, name2: <css/dom query>}, {config}) YIELD value
The result is a map of the DOM elements matched by each query, i.e.
{name: <list of elements>, name2: <list of elements>}
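Because every key in the result maps to a list, it is often convenient to flatten that list with `UNWIND`. The query below is a minimal sketch, not a definitive recipe: it assumes each matched element is itself a map that exposes `tagName` and `text` keys, which may vary between APOC versions.

```cypher
// Select all <h2> elements under the (arbitrary) key "h2"
CALL apoc.load.html("https://en.wikipedia.org/", {h2: "h2"}) YIELD value
// value.h2 is a list of matched elements; emit one row per element
UNWIND value.h2 AS heading
// Assumption: each element map has `tagName` and `text` keys
RETURN heading.tagName AS tag, heading.text AS text
```

The key name on the left-hand side of the query map (`h2` here) is only a label for the result; the selector on the right-hand side decides which elements are matched.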
Config
The config parameter is optional; its default value is an empty map.
| charset | Default: UTF-8 |
| | Default: "". Used to resolve relative paths. |
| | Default: false. Set to true to pass an HTML string instead of a URL as the first parameter. |
Example with real data
The examples below use the Wikipedia home page.
CALL apoc.load.html("https://en.wikipedia.org/",{metadata:"meta", h2:"h2"})
You will get this result:
[Image: apoc.load.htmlall]
CALL apoc.load.html("https://en.wikipedia.org/",{links:"link"})
You will get this result:
[Image: apoc.load.htmllinks]
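For elements such as `link`, the useful data usually sits in the tag attributes rather than in the text. As a hedged sketch, assuming each element map carries an `attributes` map keyed by attribute name, the attributes can be projected like this:

```cypher
// Select all <link> elements under the key "links"
CALL apoc.load.html("https://en.wikipedia.org/", {links: "link"}) YIELD value
UNWIND value.links AS link
// Assumption: attributes are exposed as a nested map on each element;
// a missing attribute yields null rather than an error
RETURN link.attributes.rel AS rel, link.attributes.href AS href
```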
CALL apoc.load.html("https://en.wikipedia.org/",{metadata:"meta", h2:"h2"}, {charset: "UTF-8"})
You will get this result:
[Image: apoc.load.htmlconfig]