Using Java for data scraping and harvesting on the Web

Contents of article

  1. Available frameworks to work with geo targeted proxies
  2. Prerequisites of the guide for those who buy residential IPs
  3. ABC of CSS selectors from a provider of geo targeted proxies
  4. Bit-by-bit guide from a trusted proxy website
  5. Querying
  6. Wrapping up

Collaborating with users who demand tools for data harvesting and collection and running an ecosystem of geo targeted proxies, Dexodata, a cost-effective and trusted proxy website, knows that Java is not the only option for collecting info. When we make a proxy free trial available or offer to buy residential IPs at a great price, PHP, Node.js, C#, Python, etc. are also routinely applied. However, Java is still the preferred way to go for many. Today, we will explore its potential.

Available frameworks to work with geo targeted proxies

When web scraping and data harvesting with Java is on the agenda, two libraries are most often applied: JSoup on the one hand and HtmlUnit on the other.

Concerning JSoup, it is a potent, effective, and relatively simple-to-use library capable of successfully handling malformed HTML documents. In fact, that is exactly where the name comes from: “tag soup” refers to misshapen markup. For all intents and purposes, JSoup is probably the most sought-after and widely used Java library for info collection.

As for HtmlUnit, it can be described as a headless browser (i.e. a browser with no GUI) for Java-based tools. It is capable of imitating the typical facets of browsing solutions, including grabbing certain page components, clicking on various points, etc. As its name implies, it is typically applied to execute rounds of unit testing: quite literally, it is a way to emulate a browser in a testing scenario.

In addition, HtmlUnit is a suitable option in the context of web scraping and data harvesting. The key advantage is that it takes just a line of code to deactivate JS and another to deactivate CSS. Beyond any doubt, that is a useful feature, as JS and CSS are mostly irrelevant in scraping situations. Below, we assess both libraries and engineer scraping tools.
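
As a minimal sketch of those two option calls (the class name here is ours; the full setup reappears in the step-by-step guide below):

import com.gargoylesoftware.htmlunit.WebClient;

public class DisableJsCssSketch {
    public static void main(String[] args) {
        // Create a headless browser and switch off JavaScript and CSS processing
        try (WebClient webClient = new WebClient()) {
            webClient.getOptions().setJavaScriptEnabled(false);
            webClient.getOptions().setCssEnabled(false);
            // ... fetch and process pages here
        }
    }
}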

Usage of Java for web data scraping with geo targeted proxies

As a trusted proxy website with a global pool of geo targeted proxies and other significant advantages, we regularly enumerate several techniques that empower teams to read and alter a loaded page. When we discuss the opportunity to order a proxy free trial or buy residential IPs, we often accentuate the crucial advantage of HtmlUnit: it simplifies interactions with pages. In this capacity it resembles a browser (no surprise at all, as it functions as a headless one). That implies such capabilities as reading, filling out forms, making clicks, and so on. In our scenario, Dexodata applies the potential of this library to read info from URLs. Follow us to learn more.
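
To illustrate that browser-like behavior, here is a hedged sketch of filling out a form and clicking a submit button with HtmlUnit; the URL, form name, and field names are hypothetical placeholders:

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlForm;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlSubmitInput;

public class FormInteractionSketch {
    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient()) {
            // Load a page (the URL is a placeholder)
            HtmlPage page = webClient.getPage("https://www.example.com/search");
            // Locate a form and fill in a text field (form and field names are hypothetical)
            HtmlForm form = page.getFormByName("search");
            form.getInputByName("q").setValueAttribute("proxy");
            // Click the submit button and receive the resulting page
            HtmlSubmitInput submit = form.getInputByName("go");
            HtmlPage results = submit.click();
            System.out.println(results.getTitleText());
        }
    }
}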

 

Prerequisites of the guide for those who buy residential IPs

 

This overview assumes that readers already have some command of Java. Maven is used here to manage packages. In addition to Java basics, readers must know how web pages work, so familiarity with HTML is mandatory. An ability to choose elements by means of XPath or CSS selectors is also vital. Please remember: not every library supports XPath. Sad but true.
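
As a hedged illustration of that difference: HtmlUnit accepts XPath expressions out of the box, while JSoup relies on the CSS selectors covered in the next section. The URL below is a placeholder.

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import java.util.List;

public class XPathSketch {
    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient()) {
            HtmlPage page = webClient.getPage("https://www.example.com");
            // HtmlUnit understands XPath expressions directly
            List<HtmlElement> headings = page.getByXPath("//h1");
            headings.forEach(h -> System.out.println(h.getTextContent()));
        }
    }
}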

 

ABC of CSS selectors from a provider of geo targeted proxies

 

Make a pit stop to refresh your memory on the essentials of CSS selectors (a short JSoup sketch after the list shows them in action):

  • #firstname – selects the element whose id is “firstname”.
  • .blue – selects every element whose class list contains “blue”.
  • p – selects every <p> tag.
  • div#firstname – selects <div> elements whose id is “firstname”.
  • p.link.new – note the absence of a space: it selects <p> elements that carry both the “link” and “new” classes, e.g. <p class="link new">.
  • p.link .new – note the space: it selects every element with the “new” class that sits inside <p class="link">.
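
To see these selectors in action with JSoup, here is a minimal, self-contained sketch; the HTML string and class name are invented for illustration:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class CssSelectorSketch {
    public static void main(String[] args) {
        // A tiny hypothetical document to run the selectors against
        Document doc = Jsoup.parse(
            "<div id='firstname'>Ann</div>" +
            "<p class='link new'>combined classes</p>" +
            "<p class='link'><span class='new'>nested element</span></p>");

        System.out.println(doc.select("#firstname").text());   // element with id "firstname"
        System.out.println(doc.select("p.link.new").text());    // <p> carrying both classes
        System.out.println(doc.select("p.link .new").text());   // .new descendants of p.link
    }
}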

 

Bit-by-bit guide from a trusted proxy website

 

Dexodata, on the basis of the vast tech experience accumulated as a trusted proxy website, is about to engineer a Java parsing solution by means of Maven in combination with JSoup and HtmlUnit. Indeed, these very libraries allow us to easily parse and otherwise manipulate HTML content within a Java application.

Follow the bit-by-bit instructions provided below to set up your project for use with geo targeted proxies. By the way, let us remind you that with Dexodata people are able not only to buy residential IPs, but new users are also entitled to a proxy free trial. Let's begin.

1. The initial stage is obtaining the libraries.

Maven is capable of assisting. Simply use an available Java environment and generate a Maven project there. In case Maven is irrelevant for some reason, pay a visit to this web space, where one can choose alternate downloads.

Ok, get things done with installing Maven; here is the installation guide to explore.

Once installed, generate a fresh Maven project by executing this command in the terminal:

mvn archetype:generate -DgroupId=com.mycompany.parser -DartifactId=java-parser -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false

By running this command, one generates a fresh project folder named 'java-parser' with all the necessary Maven files and a basic structure. The initial and important step has been taken.

2. Now, dependencies should be added to the pom.xml file.

Not a big deal at all. A fresh segment must be added to the pom.xml file (which stands for "Project Object Model"); its task is declaring dependencies. Subsequently, the moment comes to supply dependencies for JSoup and HtmlUnit.

  1. In order to do so, navigate to the 'java-parser' folder.
  2. There, open the pom.xml file.
  3. Include these dependencies right inside the <dependencies> tag, as shown below:

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.14.3</version>
</dependency>
<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>2.56.0</version>
</dependency>

Save the pom.xml file. Dexodata, as a provider of geo targeted proxies and a coding mentor, is still with you.

3. Generate a new Java class.

In the src/main/java/com/mycompany/parser folder, generate a new Java file called 'HtmlParser.java'. This file will accommodate the main class implementation and serve as the entry point for the future parser.

4. Import the necessary libraries in 'HtmlParser.java'. Open the file in a text editor or IDE of your choice. Before writing any code, import the necessary HtmlUnit and JSoup classes; they enable you to fetch and parse HTML content easily. Add these import statements at the beginning of the HtmlParser.java file:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

We have already made significant progress; keep moving with our trusted proxy website.

5. Initialize a 'WebClient' object and disable JS together with CSS.

Inside the HtmlParser class, you will implement methods for creating a WebClient object, fetching an HTML page by means of HtmlUnit, parsing the HTML content via JSoup, extracting specific data (such as title, description, and image), and a main method to execute these tasks in sequence. 

public class HtmlParser {

    private static WebClient createWebClient() {
        WebClient webClient = new WebClient();
        webClient.getOptions().setJavaScriptEnabled(false);
        webClient.getOptions().setCssEnabled(false);
        return webClient;
    }
}

6. Generate a method to fetch an HTML page by means of HtmlUnit:

public static HtmlPage fetchHtmlPage(String url) {
    try (WebClient webClient = createWebClient()) {
        return webClient.getPage(url);
    } catch (Exception e) {
        e.printStackTrace();
        return null;
    }
}

7. Generate a method to parse the HTML content via JSoup:

public static Document parseHtmlContent(String htmlContent) {
    return Jsoup.parse(htmlContent);
}

8. Keep on working on the parsed HTML: 

public static void extractInformation(Document document) {
    // Extract information from the parsed HTML document using JSoup methods
    // e.g., document.select("a[href]"), document.select("img[src$=.png]")
    // For more details, refer to the JSoup documentation: https://jsoup.org/cookbook/
}

You are good at it; there are just a few more steps to be made.

9. Create a main method that runs the fetch-parse-extract sequence:

public static void main(String[] args) {
    String url = "https://www.example.com";
    HtmlPage htmlPage = fetchHtmlPage(url);
    if (htmlPage != null) {
        String htmlContent = htmlPage.asXml();
        Document document = parseHtmlContent(htmlContent);
        extractInformation(document);
    }
}

Amazing! As one can see, there is no rocket science in it. At least, not when your company works with a trusted proxy website providing geo targeted proxies all over the globe. We know our trade.

The previous steps were dedicated to fetching the HTML content of a web page via HtmlUnit and parsing it by means of JSoup. Now it is the moment to focus on extracting specific data, such as the title, description, and image from an online store page. To make it happen, we first need to identify the HTML tags and attributes associated with those components. Users can apply their web browser's developer tools to inspect the HTML source code and find the relevant elements.

10. Once you've identified the relevant HTML components, you can make use of JSoup's powerful selection capabilities to grab the data.

For example, let's assume the title is within an <h1> tag, the description is inside a <div> tag with the class "product-description", and the image is within an <img> tag with the class "product-image". You are free to apply this code snippet in your extractInformation() method to get this data:

public static void extractInformation(Document document) {
    // Extract the title
    String title = document.select("h1").text();
    System.out.println("Title: " + title);

    // Extract the description
    String description = document.select("div.product-description").text();
    System.out.println("Description: " + description);

    // Extract the image URL
    String imageUrl = document.select("img.product-image").attr("src");
    System.out.println("Image URL: " + imageUrl);
}
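
Keep in mind that document.select() simply returns an empty result when nothing matches, so text() yields an empty string. If you prefer explicit checks, a hedged variant based on selectFirst(), which returns null when no element is found, could look like this:

import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SafeExtractionSketch {
    // A defensive variant of extractInformation(): selectFirst() returns null when nothing matches
    public static void extractInformation(Document document) {
        Element titleElement = document.selectFirst("h1");
        String title = (titleElement != null) ? titleElement.text() : "";
        System.out.println("Title: " + title);

        Element descriptionElement = document.selectFirst("div.product-description");
        String description = (descriptionElement != null) ? descriptionElement.text() : "";
        System.out.println("Description: " + description);

        Element imageElement = document.selectFirst("img.product-image");
        String imageUrl = (imageElement != null) ? imageElement.attr("src") : "";
        System.out.println("Image URL: " + imageUrl);
    }
}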

11. If the site relies on JavaScript to load content dynamically, a team might need to enable JS support in HtmlUnit by changing the createWebClient() method:

private static WebClient createWebClient() {
    WebClient webClient = new WebClient();
    webClient.getOptions().setJavaScriptEnabled(true);
    webClient.getOptions().setCssEnabled(false);
    webClient.getOptions().setThrowExceptionOnScriptError(false);
    webClient.waitForBackgroundJavaScript(5000); // Wait for JavaScript to finish loading
    return webClient;
}

Note that enabling JavaScript may increase loading times along with the chances of encountering errors or compatibility issues with the site's scripts. This issue deserves your attention; Dexodata, as a trusted proxy website, knows this hard fact.
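
If JavaScript is enabled, it may also help to wait for background scripts after the page has actually been requested. A hedged variant of the fetching method along those lines (the method name is ours):

public static HtmlPage fetchHtmlPageWithJs(String url) {
    try (WebClient webClient = createWebClient()) {
        HtmlPage page = webClient.getPage(url);
        // Give background JavaScript up to 5 seconds to finish after the page is loaded
        webClient.waitForBackgroundJavaScript(5000);
        return page;
    } catch (Exception e) {
        e.printStackTrace();
        return null;
    }
}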

12. Having successfully grabbed the titles, descriptions, and images from the online shop, you might opt to store this info for further analysis or processing. A common technique is to save the collected info into a file. You can choose among various file formats depending on your needs, such as CSV, JSON, or XML. In this scenario, Dexodata will save the data to a CSV file. It goes like this.

Generate a method that will write the extracted info to a CSV file:

// Add these imports at the top of HtmlParser.java
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;

public static void writeToCSV(String title, String description, String imageUrl, String outputFilePath) {
    try (BufferedWriter writer = new BufferedWriter(new FileWriter(outputFilePath, true))) {
        String csvLine = String.join(",", "\"" + title.replace("\"", "\"\"") + "\"",
                                      "\"" + description.replace("\"", "\"\"") + "\"",
                                      "\"" + imageUrl.replace("\"", "\"\"") + "\"");
        writer.write(csvLine);
        writer.newLine();
    } catch (IOException e) {
        e.printStackTrace();
    }
}

This method takes the title, description, imageUrl, and outputFilePath as input parameters and writes the info as a CSV line to the specified file. It also handles double quotes within the text by replacing them with two double quotes, as mandated by the CSV format.
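
If you also want a header row in the CSV, one hedged way, using only standard java.nio APIs and column names of our own choosing, is to write it once before the first data line:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class CsvHeaderSketch {
    // Writes a header line only if the output file does not exist yet (column names are illustrative)
    public static void writeHeaderIfNeeded(String outputFilePath) throws IOException {
        Path path = Paths.get(outputFilePath);
        if (Files.notExists(path)) {
            Files.write(path, "title,description,image_url\n".getBytes(StandardCharsets.UTF_8),
                        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }
    }
}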

13. Update the 'extractInformation()' method to call the 'writeToCSV()' method with the grabbed info:

public static void extractInformation(Document document, String outputFilePath) {
    // Extract the title, description, and image URL as before
    String title = document.select("h1").text();
    String description = document.select("div.product-description").text();
    String imageUrl = document.select("img.product-image").attr("src");

    // Write the extracted data to the CSV file
    writeToCSV(title, description, imageUrl, outputFilePath);
}

14. Modify the main method to include the output file path and pass it to the 'extractInformation()' method:

public static void main(String[] args) {
    String url = "https://www.example.com";
    String outputFilePath = "output.csv";
    HtmlPage htmlPage = fetchHtmlPage(url);
    if (htmlPage != null) {
        String htmlContent = htmlPage.asXml();
        Document document = parseHtmlContent(htmlContent);
        extractInformation(document, outputFilePath);
    }
}

Now, when the parser is run, it grabs the title, description, and image URL from the online store page and writes the data to the specified CSV file. You are in a position to further customize the output format, add headers, or even write to other file formats like JSON or XML depending on your specific requirements.

Test your parser by running the main method with the URL of an online store product page. Make sure to adjust the JSoup selectors according to the actual page structure. You can always upgrade this parser to extract additional info such as product prices, specifications, or user reviews based on your specific requirements. Furthermore, you can implement error handling and cover edge-case scenarios to make your parser more robust and versatile.
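
One convenient way to compile and run the parser from the project root is the Maven exec plugin; assuming the package and class names used above, the command could look like this:

mvn compile exec:java -Dexec.mainClass="com.mycompany.parser.HtmlParser"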

With e-commerce business analysis being one of the key scenarios for Dexodata's geo targeted proxies, we know how valuable it is to get things done with such a parser.

 

Querying

 

The primary objective of a Java web scraping program is querying an HTML document. Normally it is time-consuming, yet it is inevitable.

With JSoup, there exist multiple ways of obtaining the components of interest, including methods such as getElementById or getElementsByTag. They simplify querying the DOM.

Our scenario allows getElementById together with getElementsByClass. However, take note of an important distinction: while getElementById (note the singular) returns a single Element object, getElementsByClass (plural, in contrast) returns a list of Element objects.

Handily enough, JSoup features the Elements class, which extends ArrayList<Element>. As a result, the code gets neater and extra functionality is enabled.

The code lines presented below apply the first() method to obtain the first element of the list; the text() method is subsequently applied to obtain its text content.

Element firstHeading = document.getElementsByClass("firstHeading").first();
System.out.println(firstHeading.text());

Our readers can assume that those methods perform well enough. Concurrently, they are specific to JSoup. In the majority of situations, the select method should be preferred. The main exception is when you intend to traverse a document upward; in that case, apply such options as children(), child(), and parent(). Just pay a visit here to explore the full range of feasible methods.
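
As a brief, hedged illustration of that upward and downward traversal (the HTML string is invented for the example):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TraversalSketch {
    public static void main(String[] args) {
        Document doc = Jsoup.parse(
            "<div class='card'><h2>Item</h2><span class='price'>10</span></div>");

        // Start from a child element and walk up to its parent
        Element price = doc.selectFirst("span.price");
        Element card = price.parent();               // the enclosing <div class="card">
        System.out.println(card.child(0).text());    // first child of the card: the <h2>
        System.out.println(card.children().size());  // number of direct children: 2
    }
}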

The line of code below shows usage of the selectFirst() method, responsible for returning the first match:

Element firstHeading = document.selectFirst(".firstHeading");

In our scenario, the selectFirst() method is applied. If several components must be chosen, users are also free to use the select() method. In that case, the CSS selector is passed as a parameter. As simple as ABC. An instance of Elements, an extension of ArrayList<Element>, is then returned.
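
For instance, here is a hedged sketch of iterating over every link returned by select(); the page content is hypothetical:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SelectAllSketch {
    public static void main(String[] args) {
        Document doc = Jsoup.parse(
            "<a href='/one'>One</a><a href='/two'>Two</a><p>No link here</p>");

        // select() returns an Elements instance (an extension of ArrayList<Element>)
        Elements links = doc.select("a[href]");
        for (Element link : links) {
            System.out.println(link.attr("href") + " -> " + link.text());
        }
    }
}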

 

Wrapping up

 

In order to survive and expand, virtually every team resorts to web scraping and data harvesting (preferably in contact with an advanced and trusted proxy website) to assess the info it needs and outcompete rivals. Being on good terms with the basics of web scraping and fully prepared to construct Java-based scrapers is valuable, as it assists with faster, data-driven decision-making. What helps just as much is a trusted proxy website running an ecosystem of geo targeted proxies. Say, on Dexodata teams can buy residential IPs for parsers or, if they are new to the platform, request a proxy free trial.
