PHP for data harvesting in collaboration with a trusted proxy website
Contents of article:
- Web scraping with PHP: basics and prerequisites
- PHP web data harvesting: Sample scenario
- Evolution. Libraries and tools
- Crafting a scraper: a sample scenario
- Final remarks
The piece you are reading finalizes our long journey across the challenging domain of web scraping. The Dexodata trusted proxy website with advanced and flexible settings has already explored the potential of Python, JavaScript, and Java. We are now inviting readers to focus on another well-liked way to harvest info: good old PHP.
Why should users give it a try if they intend to grab the content of a web presence and are going to resort to our geo targeted proxies for this purpose? Here is an answer. Basically, PHP can be described as a general-purpose server-side scripting language born as early as 1995. As such, it is, by nature, intended for web development (the latest PHP version is 8.4). The share PHP has won and retained is still enormous.
- Any person knowing at least anything about the tech aspects of the internet knows what WordPress is. Based on PHP, that CMS fuels and maintains around 40% of all existing websites. Not all of them are that popular, and 40% of sites do not necessarily equal 40% of traffic or 40% of data. But this fact speaks for itself.
- Besides all the content management-related stuff, roughly 80% of websites rely on PHP (among sites whose server-side language can be identified).
- Finally, truly advanced products in great demand can be engineered with PHP. For instance, those who request our paid proxy free trial as new users to apply addresses as proxies for social media should know that Slack builds upon PHP code. Yes, it is not a social network in its purest form, but still.
As a language “predisposed” for the Web, PHP is a decent option for low-to-mid complexity scenarios. With it, coders can:
- Initiate requests to be transferred to a given page;
- Fetch info;
- Save info in a structured form, e.g. in formats such as CSV, JSON, or XML.
To sum up: PHP is certainly a resource to apply to scraping tasks. It is capable of interacting with sites, understanding and handling HTML, and grabbing info from web presences via a range of fitted functions and libraries. In this capacity, PHP is not the most advanced or flexible option and should not be chosen for dynamic content; even so, it is a viable and workable (to some extent) choice. We say so as a trusted proxy website with addresses for web scraping that knows this trade of data gathering very well.
Web scraping with PHP: basics and prerequisites
Before moving further, make sure you are familiar with the basics below.
- HTTP Requests. Here we mean the protocol applied by clients and servers to send and obtain info. To obtain some data from a presence, a user is to direct an HTTP query to the server hosting the page. PHP features built-in options such as cURL, file_get_contents(), and fopen() that enable one to direct HTTP queries and get info from pages.
- Parsing of HTML. As a language envisioned to “construct” web pages, PHP offers built-in classes such as DOMDocument and SimpleXMLElement that enable users to parse HTML and grab info.
- Regular Expressions. They serve as a workable and potent tool to match patterns and manipulate text. Again, PHP offers built-in functions, e.g. preg_match() and preg_replace(), that enable one to apply regular expressions to grab info from sites (see the short sketch right after this list).
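To illustrate the last point, here is a minimal sketch (assuming a hypothetical page at http://instance.com/): the raw HTML is fetched with file_get_contents(), and preg_match() then pulls the page title out of it.
<?php
// fetch raw HTML from a hypothetical placeholder page
$html = file_get_contents('http://instance.com/');
// pull the <title> text out of the markup with a regular expression
if ($html !== false && preg_match('/<title>(.*?)<\/title>/si', $html, $matches)) {
    echo 'Page title: ' . trim($matches[1]) . "\n";
} else {
    echo "Could not fetch the page or find a title.\n";
}
?>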
Even these basic notions would suffice in case users intend to craft a simple PHP program capable of transferring HTTP queries to a presence, “salvaging” the HTML, and eventually parsing the code and grabbing the info sought after with Dexodata’s ecosystem of ethical geo targeted proxies (with paid proxy free trial, including proxies for social media).
PHP web data harvesting: Sample scenario
Assume it is necessary to scrape a presence containing a list of items for sale (what we will show is a very basic case; there will be more advanced ones in subsequent sections). It is a must to grab the name, cost, and image of every item and save that info as a CSV file. Here is the simplest way to go, i.e. when there are no complex scenarios or impediments. Just follow the steps of the Dexodata ecosystem of whitelisted geo targeted proxies.
<?php
$url = 'https://sample.com/items';
$html = file_get_contents($url);
$dom = new DOMDocument();
$dom->loadHTML($html);
$products = $dom->getElementsByTagName('article');
$data = [];
foreach ($products as $item) {
$name = $item->getElementsByTagName('h2')->item(0)->textContent;
$price = $item->getElementsByTagName('span')->item(0)->textContent;
$image = $item->getElementsByTagName('img')->item(0)->getAttribute('src');
$data[] = [$name, $price, $image];
}
$fp = fopen('items.csv', 'w');
foreach ($data as $fields) {
fputcsv($fp, $fields);
}
fclose($fp);
?>
Nothing has ever been easier!
What readers see here is us first transferring an HTTP query to the presence by means of file_get_contents(). Subsequently, we switch to parsing the HTML via the DOMDocument class and grab the relevant info via the getElementsByTagName() technique. Then we are free to save the grabbed info in a multidimensional array, which is subsequently written to a CSV file via fputcsv().
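Since JSON was listed among the target formats earlier, here is a minimal alternative ending for the same script, a sketch that assumes the $data array built above and swaps fputcsv() for json_encode():
// save the same harvested $data as JSON instead of CSV
file_put_contents('items.json', json_encode($data, JSON_PRETTY_PRINT));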
If you feel like everything is clear, let’s move forward. The Dexodata trusted proxy website with paid proxy free trial options will show you the path to data.
Evolution. Libraries and tools
Since its inception, PHP has evolved significantly. Probably, its libraries do not look as “fashionable” and “fancy” (it is not Python); nevertheless, this language boasts a huge community supporting workable libraries. Let’s examine them in the context of web scraping. We, as a trusted proxy website running an ecosystem of geo targeted proxies (including proxies for social media, like Instagram), put forward the following ladder, in our opinion, from simpler options to more advanced ones.
1. Goutte
Goutte serves as a feature-rich scraping and crawling library for PHP, built upon Symfony components. It can be mastered even by newbies to grab data stemming from HTML. The learning curve is not challenging either, thanks to its comprehensible object-oriented design. As for other advantages, they cover a visible community, an impressive volume of helpful documentation, and, finally, speed. Since it relies on HTTP 1.1 persistent connections, the client needs only one connection to the server, and that same connection is reused for all requests after the first one.
There are some shortcomings as well. In certain projects, insurmountable limitations concerning dynamic content might be faced. Let’s try Goutte and PHP in action and make a click on the page of interest (as usual, when scraping anything on the web, do not forget about Dexodata's residential, datacenter, and mobile network geo targeted proxies available at favorable rates).
To begin with, one has to install Composer. The latter is a dependency manager for PHP: it enables one to declare the libraries a project relies on.
All that is needed is to execute this command in the terminal:
composer require fabpot/goutte
Next, generate a PHP script for the initiative and give it any name you want. Open your fresh script and add these lines of code to get Goutte rolling:
require 'vendor/autoload.php';
use Goutte\Client;
$client = new Client();
Great, Goutte is initialized and in full swing! After that, add the couple of lines below to the end of the file to fetch a URL by means of the $client->request() method and simulate a click.
$url = "http://instance.com/";
$crawler = $client->request('GET', $url);
// Click on the "More information..." link
$link = $crawler->selectLink('More information...')->link();
$crawler = $client->click($link);
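To actually pull something off the page we have just landed on, here is a minimal sketch (assuming the target page exposes an h1 heading and some paragraphs); the same $crawler object can be queried with CSS selectors:
// grab the first h1 heading of the page reached after the click
echo $crawler->filter('h1')->text() . "\n";
// collect the text of every paragraph as well
$crawler->filter('p')->each(function ($node) {
    echo $node->text() . "\n";
});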
It has taken us just a couple of lines of code to end up in the right position to navigate to a presence and get the clicking done. Now the time comes for the second round with the Dexodata trusted proxy website.
2. Simple HTML DOM
Being a parser, this program is designed to work with any HTML doc, including those docs that are defined as invalid ones by the corresponding HTML specs. Its great pro is that it is easy to master, and it contains no external JS-related stuff to load separately before one gets to work. Other pros include speed (users face no need to load a whole page into memory, so you can process a bunch of HTML pages concurrently with your performance unaffected) and the fact that it is lightweight (i.e. there is no need to install any additional software or libraries, PHP alone will suffice).
Beyond doubt, there are constraints. Your access to page components will be limited: one can execute manipulations with the structure of a page, but its content will be beyond reach. It is noteworthy that Simple HTML DOM is quite rigid, so unless you are a god of HTML, the potential will be restricted.
That has been enough theory, let’s get to practice.
To begin, obtain the latest version from the Simple HTML DOM documentation space. That will give you all you need.
First, copy the simple_html_dom.php file. Afterwards, simply paste it into your active project.
Generate your PHP file (names do not matter). For example, webscrapingstuff.php.
Now, open the file and “arm” your project with the Simple HTML DOM library by adding it. Use the code proposed by the Dexodata trusted proxy website below:
include('simple_html_dom.php');
Enter this line of code. It is important, as it will fetch the page itself:
$html = file_get_html('http://instance.com/');
The file_get_html() function, being a part of Simple HTML DOM, retrieves the HTML associated with the URL one specifies. It then delivers a DOM object that is stored in our $html variable.
Say, you need the chief heading. To get that first H1 heading, apply this code:
echo $html->find('h1', 0)->plaintext;
If you open the resulting file on your web server, you'll find what you need there.
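In the same spirit, here is a short sketch that reuses the $html object from above; since Simple HTML DOM exposes attributes as plain properties, one can list every link on the page:
// loop over every anchor tag and print its target and text
foreach ($html->find('a') as $link) {
    echo $link->href . ' - ' . $link->plaintext . "\n";
}
// free the memory taken by the DOM object once done
$html->clear();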
3. PHP Scraper
As an ecosystem of whitelisted and ethically sourced IP addresses focused on data harvesting, we cannot ignore PHP Scraper. The latter serves as a tool enabling harvesters to scrape info via PHP scripts. This approach implies no necessity to code a lot on your own. All that is needed is to apply plain helpers that target specific components on any particular presence by their class or id.
Any issues? Of course, there are some. PHP Scraper may not match every possible use case, since it is, naturally, based upon PHP. Thus, it is aimed at servers running Apache or Nginx, typically with mod_rewrite activated. PHP Scraper is not your direction to go if you intend to grab info from an API or parse sophisticated HTML presences.
Anyway, let's run a basic scenario.
Our readers must install Composer, like we did in the Goutte section. After that, supplement your project with the library being discussed, with this line of code via Composer:
composer require spekulatius/phpscraper
Once the installation stage is left behind, Composer's autoloader picks up the package. In case one works with vanilla PHP, they have to include the autoloader in their script, as shown below:
require 'vendor/autoload.php';
As an exercise, generate a script that visits a presence and counts the links posted there. Start the program by creating a scraper instance, as shown here:
$web = new \spekulatius\phpscraper();
Give the program an order to visit the URL:
$web->go('http://instance.com/');
Add this section of code, tasked with taking every link contained by the presence and printing how many of them there are:
// Print the number of links.
echo "This page contains " . count($web->links) . " links.\n\n";
// Loop through the links
foreach ($web->links as $link) {
echo " - " . $link . "\n";
}
/**
* This code will print out:
*
* This page contains 1 link.
*
* - https://www.iana.org/domains/instance
*/
Just remember: we use a placeholder address here.
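While we are at it, here is a brief sketch under the same setup; PHPScraper exposes other page properties as simple attributes as well, for instance the title and the first-level headings:
// a few other properties PHPScraper exposes out of the box
echo "Title: " . $web->title . "\n";
// h1 returns an array with every first-level heading on the page
foreach ($web->h1 as $heading) {
    echo "Heading: " . $heading . "\n";
}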
4. PHP cURL for scraping operations
Having revisited a range of options made available for harvesting activities by PHP, our Dexodata trusted proxy website offers you an extra convenient and widely-used one. We of course mean cURL, a library and command-line instrument (normally available with the language by default) empowering users to send and get files via HTTP and FTP. While working with it, one can apply geo targeted proxies (including those provided with our paid proxy free trial), transmit data via SSL connections, define cookies, and more. Let’s give it a try in practice.
Generate a fresh PHP script and supplement this code:
// Initialize curl
$ch = curl_init();
// URL for Scraping
curl_setopt($ch, CURLOPT_URL,
'https://dexodata.com/en/blog');
// Return Transfer True
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$output = curl_exec($ch);
// Closing cURL
curl_close($ch);
Defining CURLOPT_RETURNTRANSFER as TRUE returns the page as a string instead of outputting it directly. Hence, our code captures the info we require to harvest from the site.
One can take a look at this data by echoing the variable that stores it:
echo $output;
Our next step is placing this info in a DOM doc, which will allow us to get access to it for scraping purposes.
$dom = new DOMDocument;
$dom->loadHTML($output);
Currently, we have our info as an HTML structure inside a variable. Generate some extra code to print out each link made available by the HTML page:
$tags = $dom->getElementsByTagName('a');
for ($i = 0; $i < $tags->length; $i++) {
$link = $tags->item($i);
echo " - " . $link->getAttribute('href') . "\n";
}
By the way, as a source of geo targeted proxies with paid proxy free trial opportunities, we urge users not to forget about protecting their scraping initiatives with the addresses we supply (for instance, our top-notch residential IPs from ISPs or, alternatively, our high-quality mobile proxies). The Dexodata ecosystem will show readers the way to do it via cURL. All they need to do is follow the syntax below.
curl --proxy <proxy-ip>:<proxy-port> <url>
The <proxy-ip> placeholder is where the IP goes, while <proxy-port> is intended, predictably, for the port number. Eventually, it will look like this:
curl --proxy 193.191.56.12:7070 -k https://dexodata.com/en/blog
As long as the address is functional and the query is successful, our command shows the presence’s content and exits afterwards.
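Inside a PHP script, the same idea maps onto cURL options; here is a sketch with the placeholder address from above (substitute the credentials issued for your account):
// route the PHP cURL request through a proxy (placeholder values)
curl_setopt($ch, CURLOPT_PROXY, '193.191.56.12:7070');
// supply credentials if the proxy requires authentication
curl_setopt($ch, CURLOPT_PROXYUSERPWD, 'username:password');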
Crafting a scraper: a sample scenario
Let’s imagine an e-commerce presence with a list of items for sale we want to harvest via PHP. What should we do? What alternatives should we use? Follow our trusted proxy website and examine this sample scenario.
Start with downloading the entire HTML of the e-commerce presence via PHP and good old cURL, like this:
// initialize the cURL request
$curl = curl_init();
// set the URL to reach with a GET HTTP request
curl_setopt($curl, CURLOPT_URL, "https://scrapeme.live/shop/");
// get the data returned by the cURL request as a string
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
// make the cURL request follow eventual redirects,
// and reach the final page of interest
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
// execute the cURL request and
// get the HTML of the page as a string
$html = curl_exec($curl);
// release the cURL resources
curl_close($curl);
The HTML is now at our full disposal in the $html variable, as Dexodata highlights. Your next step is to load $html into an HtmlDomParser instance via the str_get_html() function. See how it can be done below:
require_once __DIR__ . "/../../vendor/autoload.php";
use voku\helper\HtmlDomParser;
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, "https://scrapeme.live/shop/");
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
$html = curl_exec($curl);
curl_close($curl);
// initialize HtmlDomParser
$htmlDomParser = HtmlDomParser::str_get_html($html);
You and our trusted proxy website are now in the right position to apply HtmlDomParser to browse the DOM of the HTML page and effectively kick off the data gathering process.
That is what Dexodata is about to accomplish. To make it happen, “salvage“ the full list of all pagination-related links in our scenario. In our case, this will give us an opportunity to crawl the whole web presence section. Make a right click on the HTML component related to the pagination number and choose "Inspect". After that, DevTools will show us the DOM element.
Let’s move forward with Dexodata’s ecosystem of geo targeted proxies. One can expect that the .page-numbers CSS class will identify the pagination-related HTML components. However, the team of our trusted proxy website would remind us all that a CSS class does not necessarily identify an HTML component uniquely: multiple nodes may share the same class.
Therefore, if your intention is to apply a CSS selector to grab the components in the DOM, you have to combine the CSS class with other selectors. Specifically, users could apply HtmlDomParser with the .page-numbers a CSS selector to choose all the pagination-related link components. Then our ecosystem of geo targeted proxies would suggest you iterate through them to grab every needed URL right from the href attribute.
Check this out:
// retrieve the HTML pagination elements with
// the ".page-numbers a" CSS selector
$paginationElements = $htmlDomParser->find(".page-numbers a");
$paginationLinks = [];
foreach ($paginationElements as $paginationElement) {
// populate the paginationLinks set with the URL
// extracted from the href attribute of the HTML pagination element
$paginationLink = $paginationElement->getAttribute("href");
// avoid duplicates in the list of URLs
if (!in_array($paginationLink, $paginationLinks)) {
$paginationLinks[] = $paginationLink;
}
}
// print the paginationLinks array
print_r($paginationLinks);
Pay attention to the fact that find() enables users to grab DOM components on the basis of a CSS selector. On top of that, in Dexodata’s hypothetical scenario the pagination-related components appear twice on each page of the e-commerce presence, so the in_array() check above is what keeps duplicates out of the $paginationLinks array.
Here is what we will see if we attempt to run our script for the scenario:
Array (
[0] => https://scrapeme.live/shop/page/2/
[1] => https://scrapeme.live/shop/page/3/
[2] => https://scrapeme.live/shop/page/4/
[3] => https://scrapeme.live/shop/page/46/
[4] => https://scrapeme.live/shop/page/47/
[5] => https://scrapeme.live/shop/page/50/
)
Note that all addresses share an identical structure: they differ only in the final number responsible for pagination. So, to iterate over each of them, the only thing readers need is the number of the final page.
It is subject to being retrieved as shown below:
// remove all non-numeric characters in the last element of
// the $paginationLinks array to retrieve the highest pagination number
$highestPaginationNumber = preg_replace("/\D/", "", end($paginationLinks));
Say, our $highestPaginationNumber will be "50".
So, the time has come to grab the info describing a given individual product with the Dexodata team. We need to make a right click on an item and activate the DevTools window, followed by "Inspect". One can expect that an item would be “made of” a li.product HTML element encompassing a URL, a picture, a title, and a price tag. Such info stems from the a, img, h2, and span HTML elements, correspondingly.
The Dexodata trusted proxy website proposes this way to grab this info via HtmlDomParser:
$productDataList = array();
// retrieve the list of products on the page
$productElements = $htmlDomParser->find("li.product");
foreach ($productElements as $productElement) {
// extract the product data
$url = $productElement->findOne("a")->getAttribute("href");
$image = $productElement->findOne("img")->getAttribute("src");
$name = $productElement->findOne("h2")->text;
$price = $productElement->findOne(".price span")->text;
// transform the product data into an associative array
$productData = array(
"url" => $url,
"image" => $image,
"name" => $name,
"price" => $price
);
$productDataList[] = $productData;
}
Our logic now grabs all chunks of item info from a single page and saves them in the $productDataList array. Apply this logic to every page:
// iterate over all "/shop/page/X" pages and retrieve all product data
for ($paginationNumber = 1; $paginationNumber <= $highestPaginationNumber; $paginationNumber++) {
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, "https://scrapeme.live/shop/page/$paginationNumber/");
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
$pageHtml = curl_exec($curl);
curl_close($curl);
$paginationHtmlDomParser = HtmlDomParser::str_get_html($pageHtml);
// scraping logic...
}
Great! We have arrived at the point where we can extract data as planned.
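To close the loop on this scenario, here is a minimal sketch, assuming the $productDataList array assembled above; the harvest can be persisted to a CSV file with fputcsv(), exactly as in the basic example earlier in this guide.
// write the collected product data to a CSV file
$fp = fopen('products.csv', 'w');
// header row first
fputcsv($fp, ['url', 'image', 'name', 'price']);
foreach ($productDataList as $productData) {
    fputcsv($fp, $productData);
}
fclose($fp);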
Final remarks
Running an ecosystem of geo targeted proxies, we would like to finalize this guide with a couple of remarks. Try to avoid being detected and restricted as a scraper. For that, we remind you of the proxies we have already mentioned.
Here is another reminder of how one can add their proxy identifiers:
curl_setopt($curl, CURLOPT_PROXY, "<PROXY_URL>");
curl_setopt($curl, CURLOPT_PROXYPORT, "<PROXY_PORT>");
curl_setopt($curl, CURLOPT_PROXYTYPE, "<PROXY_TYPE>");
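For clarity, here is a hypothetical filled-in version: CURLOPT_PROXYTYPE expects one of cURL's constants (e.g. CURLPROXY_HTTP or CURLPROXY_SOCKS5) rather than a bare string, and credentials, if any, go into CURLOPT_PROXYUSERPWD.
// example with placeholder values: an HTTP proxy on port 7070
curl_setopt($curl, CURLOPT_PROXY, "193.191.56.12");
curl_setopt($curl, CURLOPT_PROXYPORT, "7070");
curl_setopt($curl, CURLOPT_PROXYTYPE, CURLPROXY_HTTP);
// if the proxy requires authentication:
// curl_setopt($curl, CURLOPT_PROXYUSERPWD, "username:password");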
1. An extra layer of protection can come from the user-agent HTTP header. By default, the command-line curl tool identifies itself with a header like curl/XX.YY.ZZ, while a plain PHP cURL request sends no user agent at all; either way, it won't be a big deal to catch such a client. Instead, try to “play” with the header:
curl_setopt($curl, CURLOPT_USERAGENT, "<USER_AGENT_STRING>");
As a result, you will place something like this example found on the internet:
curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6");
2. Another remark from Dexodata is focused on dynamic content. We stressed that a customary cURL GET query won't help you with that; more advanced techniques will be required. Get back to our guide dedicated to JS and learn more about headless browsers, i.e. browsing software packages without a UI. This will help you build a scraper that interacts with a presence like a real user and extracts dynamic content along the way.
3. Our third remark is dedicated to parallel scraping. Multi-threading in PHP is challenging yet sometimes inevitable. Here is how you can tackle this challenge with the Dexodata trusted proxy website. Our concept is to enable your scraping script to be executed in several instances by applying HTTP GET parameters.
Alter the script so that it does not iterate over all pages but focuses on smaller parallel pieces instead. Here is how one can set the limits of a “piece”: simply read a couple of GET parameters, just as highlighted below:
$from = null;
$to = null;
if (isset($_GET["from"]) && is_numeric($_GET["from"])) {
$from = $_GET["from"];
}
if (isset($_GET["to"]) && is_numeric($_GET["to"])) {
$to = $_GET["to"];
}
if (is_null($from) || is_null($to) || $from > $to) {
die("Invalid from and to parameters!");
}
// scrape only the pagination pages whose number goes
// from "$from" to "$to"
for ($paginationNumber = $from; $paginationNumber <= $to; $paginationNumber++) {
// scraping logic...
}
// write the data scraped to a database/file
This will make you ready to initiate multiple instances by opening several links in the browser:
https://instance.com/scripts/scrapeme.live/scrape-products.php?from=1&to=5
and the like, e.g. ?from=6&to=10.
They will be executed concurrently and scrape the presence in parallel. That’s it. Thank you for staying with Dexodata, your ethical ecosystem of mobile, datacenter, and residential proxies with all KYC and AML standards applied.