ChatGPT and data collection: Implications for web data harvesting software

Contents of article:

  1. Is data collection possible with just ChatGPT and a trusted proxy website?
  2. Sample scenario with ChatGPT, proxy websites, and iterations
  3. Current ChatGPT status in internet data harvesting

The rapid development of large language models and the use of ChatGPT for web data extraction raise a natural question: will AI-driven tools replace web scrapers as we know them? To understand the practical implications of this technological breakthrough, we started with the most obvious issue and decided to check whether ChatGPT can scrape websites or not.

Is data collection possible with just ChatGPT and a trusted proxy website?

A team member at Dexodata, a trusted proxy website with geo targeted proxies and flexible settings, asked ChatGPT directly whether it can scrape websites. The model answered the following:

Image: ChatGPT’s answer to the question “Can ChatGPT do web scraping with geo targeted proxies?”

Then our team asked the AI to write some code for software that acquires online insights. We are interested in data collection at scale for business purposes, since our infrastructure is commonly used as a source of intermediate IPs for online info gathering. The typical targets are e-commerce sites such as Amazon, AliExpress, and the like; “eBay proxy” search queries are another popular entry point, as are proxies for social networks.

Our command was: “Generate code for a web scraper to scrape eBay.” As a result, we obtained the Python code below, which relies on BeautifulSoup and illustrates several common patterns in web data harvesting:

import requests
from bs4 import BeautifulSoup

# URL of the eBay page you want to scrape
url = "https://www.ebay.com/"

# Send a GET request to the URL
response = requests.get(url)

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.content, "html.parser")

# Find and print the titles of the items on the page
items = soup.find_all("h3", class_="s-item__title")
for item in items:
    print(item.text)
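
The snippet above connects to eBay directly. Since the subject of this article is proxy-enabled data gathering, here is a minimal sketch of how the same request could be routed through a geo targeted proxy with the requests library; the proxy endpoint and credentials below are placeholders, not real Dexodata values.

import requests
from bs4 import BeautifulSoup

# Placeholder endpoint: substitute the host, port, and credentials
# issued by your geo targeted proxy provider
proxy_url = "http://username:password@proxy.example.com:8000"
proxies = {"http": proxy_url, "https": proxy_url}

# Route the GET request through the proxy instead of connecting directly
response = requests.get("https://www.ebay.com/", proxies=proxies, timeout=30)

# Parsing stays exactly the same as in the generated code
soup = BeautifulSoup(response.content, "html.parser")
for item in soup.find_all("h3", class_="s-item__title"):
    print(item.text)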

We investigated the outcome and drew some conclusions about the phenomenon of ChatGPT scraping.

First, it won’t get everything done outright: entering something like “ChatGPT, scrape this for me” won’t suffice. This means it cannot replace existing tools, whether built in-house or external. It can, however, help engineer one, although some manual work will still be required.

We would describe the essence of this AI-based algorithm as an “assisting web scraping chatbot”. It is based on a language model that cannot perform the whole info collection procedure autonomously: it won’t scan and retrieve details listed on eBay through an eBay proxy by Dexodata on its own. But it is capable of giving valuable advice and guiding users through the process. ChatGPT does rather well as an advanced digital assistant for a trusted proxy website’s users.

Here is what a typical flow of a ChatGPT-assisted, proxy-enabled project for acquiring online insights looks like:

  1. One asks the AI to generate code with the URL of interest as the target.
  2. The conversational algorithm returns lines of code; all it takes now is to copy and paste them.
  3. On top of that, the intelligent chat assistant can identify the best-matching libraries for obtaining the crucial info.
  4. The more precisely users specify their tasks, the better the code the model produces, so be ready to iterate.

In this capacity, the new intelligent tool saves resources in two dimensions:

  • Less time needed to write code manually.
  • No need to do extensive research yourself or google through the available guides.

The key takeaway is that ChatGPT is in no position to replace web info retrieving software. It cannot yet understand every nuance or offer a one-size-fits-all solution. On the other hand, it is capable of helping one choose and set up software and harvest web data without coding skills. Let’s run through a couple of iterations ourselves.


Sample scenario with ChatGPT, proxy websites, and iterations


As a trusted proxy website, Dexodata focuses on proxies for social networks and e-commerce sites, intended to enhance the level of online analytics. At any level, corporate or individual, the first thing to do before running an automated insight collection script is to identify what we intend to acquire.

Step 1. Being, among other things, a source of eBay proxies, Dexodata invites you to work with this e-commerce giant. Creating a scraper with ChatGPT calls for extra details, so gather this info first and feed it into the language model.

Image: how ChatGPT can do web scraping with geo targeted proxies

Right-click the piece of data you need, choose “Inspect”, and explore the HTML code, paying particular attention to the class attribute.
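
For illustration, assume the Inspect panel shows markup along the lines of the simplified fragment below (this is not eBay’s actual HTML, and the item names are placeholders). The class value you see there is exactly the string the scraper will target:

from bs4 import BeautifulSoup

# Simplified, illustrative markup -- not copied from the real eBay page
html = """
<h3 class="b-info__title">Black GPS running watch, model A</h3>
<h3 class="b-info__title">Black GPS running watch, model B</h3>
"""

# The class attribute spotted in DevTools becomes the value we filter on
soup = BeautifulSoup(html, "html.parser")
for element in soup.find_all(class_="b-info__title"):
    print(element.text)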

Step 2. The time has come for the NLP model to act. Give it the command below; as you can see, we specified the tools we intend to use:

“Please, scrape this website via Python and Scrapy. Locate all the elements with "class" attribute name”.

ChatGPT answers:

import scrapy

class EbaySpider(scrapy.Spider):
    name = "ebay"
    start_urls = ['https://www.ebay.com/b/Black-GPS-Running-Watches/75230/bn_115254101']

    def parse(self, response):
        elements = response.css('[class]')
        for element in elements:
            class_attribute = element.attrib.get('class')
            yield {
                'class_attribute': class_attribute
            }
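
To try this spider out without setting up a full Scrapy project, one option is to run it programmatically with Scrapy’s CrawlerProcess. This is a sketch of our own rather than part of ChatGPT’s answer, and the output file name is an arbitrary choice:

from scrapy.crawler import CrawlerProcess

# Assumes the EbaySpider class above is defined in the same file;
# scraped items are written to a JSON file via the FEEDS setting
process = CrawlerProcess(settings={
    "FEEDS": {"class_attributes.json": {"format": "json"}},
})
process.crawl(EbaySpider)
process.start()  # blocks until the crawl finishes

Saving the spider as ebay_spider.py and running scrapy runspider ebay_spider.py -o class_attributes.json from a terminal achieves the same result.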

Step 3. The Dexodata ecosystem of geo targeted proxies suggests narrowing down the scope of the mission. We impose limits on the class attribute, like this:

“Please, scrape this website via Python and Scrapy. Locate all the elements with "class="b-info__title" attribute name”.

ChatGPT is quick to react:

import scrapy

class EbaySpider(scrapy.Spider):
    name = "ebay"
    start_urls = ['https://www.ebay.com/b/Black-GPS-Running-Watches/75230/bn_115254101']

    def parse(self, response):
        elements = response.xpath('//*[@class="b-info__title"]')
        for element in elements:
            title = element.xpath('text()').get()
            yield {
                'title': title.strip()
            }

The whole procedure took about five minutes. Now one is free to test this code and adjust it further when needed.
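
One such adjustment, in keeping with the topic of this article, is pointing the spider at a geo targeted proxy. Scrapy’s built-in HttpProxyMiddleware reads a proxy from each request’s meta, so overriding start_requests is enough. The sketch below is ours, not ChatGPT’s output, and the endpoint is a placeholder rather than an actual Dexodata address.

import scrapy

class EbaySpider(scrapy.Spider):
    name = "ebay"
    start_urls = ['https://www.ebay.com/b/Black-GPS-Running-Watches/75230/bn_115254101']

    # Placeholder endpoint: substitute the host, port, and credentials
    # issued by your geo targeted proxy provider
    PROXY = "http://username:password@proxy.example.com:8000"

    def start_requests(self):
        # HttpProxyMiddleware, enabled by default, picks up meta['proxy']
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, meta={"proxy": self.PROXY})

    def parse(self, response):
        for element in response.xpath('//*[@class="b-info__title"]'):
            title = element.xpath('text()').get()
            yield {
                'title': title.strip()
            }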


Current ChatGPT status in internet data harvesting


Our verdict is that ChatGPT will transform web information retrieval and browser automation in the foreseeable future, but it will not replace automated harvesting solutions entirely. So programmers working in various languages, e.g. PHP, Python, JS, and others, can relax: the new tool facilitates their jobs rather than taking them for good.

We appreciate the contribution automated conversation systems make to the industry of extracting online insights, and we encourage readers to try ChatGPT’s web scraping possibilities in action. It is also a chance to get the free trial of our geo targeted IPs that is available to new users.

Dexodata’s geo targeted proxies are located in 100+ countries. These are residential, mobile, and datacenter IP addresses available at reasonable prices. Sign up on the official website to get access to powerful tools that improve advanced data analytics, alongside ChatGPT and other AI-enhanced software solutions.

