Example

Assume there is a simple web page that you want to crawl and extract data from:

<!DOCTYPE html>
<html>
<head>
    <title>Title</title>
</head>
<body>
    <div class="name">Ferrari 458 Italia</div>
    <div class="numberOfHP">500</div>
    <a href="https://uk.wikipedia.org/wiki/Ferrari_458_Italia#/media/File:Ferrari_458_Italia_--_05-18-2011.jpg">Link</a>
</body>
</html>

The data on this page should be extracted into the following object:

public class CarTestEntity {
    private String name;
    private String numberOfHP;
    private String linkToPicture;

    // .. getters & setters
}
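
For reference, the elided accessors could be written out as below. This is a minimal sketch assuming conventional JavaBean getter and setter names; the source does not say which naming convention, if any, the crawler relies on.

```java
public class CarTestEntity {
    private String name;
    private String numberOfHP;
    private String linkToPicture;

    // Accessor names are an assumption (plain JavaBean style).
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }

    public String getNumberOfHP() { return numberOfHP; }
    public void setNumberOfHP(String numberOfHP) { this.numberOfHP = numberOfHP; }

    public String getLinkToPicture() { return linkToPicture; }
    public void setLinkToPicture(String linkToPicture) { this.linkToPicture = linkToPicture; }
}
```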

To do so, create a simple XML file that contains the rules for how that HTML should be parsed:

<?xml version="1.0" encoding="UTF-8"?>
<page xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:noNamespaceSchemaLocation="https://docs.crawler.kurpiak.net.ua/ns/crawler-1.1.xsd">

    <fields>
        <field-description name="name">       <!-- CarTestEntity.name field  -->
            <selectors>
                <!-- CSS selector that should be used to extract value. In this case by CSS class -->
                <alternative selector=".name"/>
            </selectors>
        </field-description>
        <field-description name="numberOfHP">     <!-- CarTestEntity.numberOfHP field  -->
            <selectors>
                <!-- Again extract value by CSS class -->
                <alternative selector=".numberOfHP"/>
            </selectors>
        </field-description>
        <field-description name="linkToPicture">    <!-- CarTestEntity.linkToPicture field  -->
            <selectors>
                <!-- Select the a element on the page and take the value of its href attribute -->
                <alternative selector="a" source-type="attribute" source="href"/>
            </selectors>
        </field-description>
    </fields>
</page>
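
To make the shape of the rules file concrete, the sketch below reads it with the JDK's built-in DOM parser and lists the first selector declared for each field. It only illustrates the XML layout shown above; it is not how the library processes rules, and the class name `RulesInspector` and method `fieldSelectors` are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of the crawler library.
public class RulesInspector {

    /** Returns field name -> selector of its first <alternative>, in document order. */
    public static Map<String, String> fieldSelectors(String rulesXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(rulesXml.getBytes(StandardCharsets.UTF_8)));

        Map<String, String> result = new LinkedHashMap<>();
        NodeList fields = doc.getElementsByTagName("field-description");
        for (int i = 0; i < fields.getLength(); i++) {
            Element field = (Element) fields.item(i);
            NodeList alternatives = field.getElementsByTagName("alternative");
            if (alternatives.getLength() > 0) {
                Element first = (Element) alternatives.item(0);
                result.put(field.getAttribute("name"), first.getAttribute("selector"));
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // A shortened copy of the rules above, inlined to keep the sketch self-contained.
        String rulesXml =
              "<page><fields>"
            + "<field-description name=\"name\"><selectors>"
            + "<alternative selector=\".name\"/></selectors></field-description>"
            + "<field-description name=\"linkToPicture\"><selectors>"
            + "<alternative selector=\"a\" source-type=\"attribute\" source=\"href\"/>"
            + "</selectors></field-description>"
            + "</fields></page>";
        System.out.println(fieldSelectors(rulesXml)); // prints {name=.name, linkToPicture=a}
    }
}
```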

To get an actual CarTestEntity object from the HTML above using these rules, run the following:

import com.github.borsch.crawler.domain.PageDescription;
import com.github.borsch.crawler.xml.LocalXmlProcessor;

IXmlProcessor xmlProcessor = new LocalXmlProcessor();
PageDescription description = xmlProcessor.parse("/xml/car.xml");   // location of the XML rules on the local machine
PageCrawler<CarTestEntity> carCrawler = new PageCrawler<>(description, CarTestEntity::new);
String html = "...";   // placeholder: the actual HTML to be parsed

CarTestEntity carTestEntity = carCrawler.crawlHtml(html);  // parsing result
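
The source does not show how `PageCrawler` fills the entity it creates from `CarTestEntity::new`. Purely as a mental model, and not as the library's implementation, a factory plus a map of extracted values could populate an object through reflection. Everything in this sketch (`EntityPopulator`, `populate`, the stand-in `Car` class) is hypothetical.

```java
import java.lang.reflect.Field;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical illustration only; not taken from the crawler library.
public class EntityPopulator {

    /** Creates an entity via the factory and copies each value into the same-named field. */
    public static <T> T populate(Supplier<T> factory, Map<String, String> values)
            throws ReflectiveOperationException {
        T entity = factory.get();
        for (Map.Entry<String, String> entry : values.entrySet()) {
            Field field = entity.getClass().getDeclaredField(entry.getKey());
            field.setAccessible(true);            // the target fields are private
            field.set(entity, entry.getValue());
        }
        return entity;
    }

    // Stand-in for CarTestEntity, reduced to two fields.
    static class Car {
        private String name;
        private String numberOfHP;

        public String getName() { return name; }
        public String getNumberOfHP() { return numberOfHP; }
    }

    public static void main(String[] args) throws ReflectiveOperationException {
        Map<String, String> extracted = new LinkedHashMap<>();
        extracted.put("name", "Ferrari 458 Italia");
        extracted.put("numberOfHP", "500");

        Car car = populate(Car::new, extracted);
        System.out.println(car.getName() + " / " + car.getNumberOfHP()); // prints "Ferrari 458 Italia / 500"
    }
}
```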