Introducing four new PHP 5.3 components and Goutte, a simple web scraper

Posted by Ryan Weaver on April 22, 2010


To support Symfony 2’s development, Fabien Potencier, the lead developer of the symfony framework, has released four new PHP 5.3-based components: CssSelector, DomCrawler, Process, and BrowserKit.

Though these components will be used by Symfony 2, they’re built to be standalone components that can be easily used in any PHP 5.3 project. To prove that point, Fabien also released a new web scraper/crawler called Goutte which uses these four components, along with four additional components from Zend Framework. It’s a prime example of the flexibility and power that standalone components, along with a willingness to share, can provide.

CssSelector

The first new component, CssSelector, converts CSS selectors to XPath expressions, so you get the power of XPath with the familiarity of CSS selectors. The component is a PHP port of the CSS selector support in the Python lxml library, with some unit tests added.

Usage is simple, and Fabien covers it in greater detail on his blog. The following code, from Fabien’s blog, loops over the anchor tags matched by a CSS selector and prints each link’s text and href attribute.

  use Symfony\Components\CssSelector\Parser;

  $document = new \DOMDocument();
  $document->loadHTMLFile('http://fabien.potencier.org/articles');

  $xpath = new \DOMXPath($document);
  foreach ($xpath->query(Parser::cssToXpath('div.item > h4 > a')) as $node)
  {
    printf("%s (%s)\n", $node->nodeValue, $node->getAttribute('href'));
  }
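To make the conversion concrete, here is a minimal sketch that uses only PHP's built-in DOM extension and a hand-written XPath expression for the selector 'div.item > h4 > a'. The sample markup is hypothetical, and the expression is one conventional translation; the string actually produced by Parser::cssToXpath() may differ in form.

```php
<?php
// Hypothetical sample markup, standing in for a real page.
$html = <<<HTML
<html><body>
  <div class="item featured"><h4><a href="/one">First</a></h4></div>
  <div class="item"><h4><a href="/two">Second</a></h4></div>
  <div class="other"><h4><a href="/three">Third</a></h4></div>
</body></html>
HTML;

$document = new DOMDocument();
$document->loadHTML($html);

// Hand-written XPath for the CSS selector 'div.item > h4 > a'.
// The concat/normalize-space idiom matches "item" as a whole class name,
// so class="item featured" matches but class="items" would not.
$expression = "//div[contains(concat(' ', normalize-space(@class), ' '), ' item ')]/h4/a";

$xpath = new DOMXPath($document);
$links = array();
foreach ($xpath->query($expression) as $node) {
    // Map link text to its href attribute.
    $links[$node->nodeValue] = $node->getAttribute('href');
}

print_r($links);
```

The class test is the verbose part, and it is a good illustration of why writing the CSS form and letting a library generate the XPath is attractive.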

DomCrawler

After the CssSelector, the obvious next step is to create a component that allows you to take control of any HTML or XML content. The DomCrawler allows you to do just that. Though there’s not yet any real documentation, the unit tests reveal a powerful system for crawling the DOM.

  use Symfony\Components\DomCrawler\Crawler;

  $crawler = new Crawler();
  $crawler->addHtmlContent('<html><div class="foo"></div></html>');

  $crawler->filter('div')->attr('class'); // returns foo

The component has a rich list of methods that can be called to perform tasks on your DOM such as filtering, returning attributes, returning text, calling methods iteratively on nodes, and manipulating link and form elements.
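As a rough illustration of what filtering, reading attributes and text, and iterating over nodes amount to, here is a sketch written against PHP's built-in DOM extension only. It is not DomCrawler's implementation or API; the markup and the $describe closure are made up for the example.

```php
<?php
$document = new DOMDocument();
$document->loadHTML(
    '<html><body><div class="foo">Hello</div><div class="bar">World</div></body></html>'
);
$xpath = new DOMXPath($document);

// "Filtering": select every div (the XPath equivalent of the CSS selector 'div').
$divs = $xpath->query('//div');

// "Returning attributes" and "returning text", taken from the first match.
$firstClass = $divs->item(0)->getAttribute('class');
$firstText  = $divs->item(0)->nodeValue;

// "Calling methods iteratively": apply a closure to each matched node.
$describe = function (DOMElement $node) {
    return $node->getAttribute('class') . '=' . $node->nodeValue;
};
$described = array();
foreach ($divs as $node) {
    $described[] = $describe($node);
}

echo $firstClass, "\n";
echo $firstText, "\n";
echo implode(', ', $described), "\n";
```

A crawler component wraps this kind of bookkeeping behind chainable methods, which is what makes the one-line filter('div')->attr('class') call possible.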

Process

The Process component tackles a different problem entirely: it allows PHP scripts to be run in separate processes. Put simply, “PhpProcess runs a PHP script in a forked process.” This is done via a simple class wrapper around PHP’s proc_* functions.

  use Symfony\Components\Process\PhpProcess;

  $process = new PhpProcess('/path/to/script.php');
  $process->run();

  echo $process->getOutput();
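For a sense of what such a wrapper has to manage, here is a minimal sketch of the underlying technique using proc_open() directly. It is not the component's code. It assumes a PHP command-line binary is available (PHP_BINARY where defined, otherwise a php executable on the PATH) and relies on the CLI reading a script from standard input when no file is given.

```php
<?php
// Assumption: fall back to a "php" binary on the PATH when PHP_BINARY is unavailable.
$php = (defined('PHP_BINARY') && PHP_BINARY) ? PHP_BINARY : 'php';

$descriptors = array(
    0 => array('pipe', 'r'), // the child's stdin, which we write to
    1 => array('pipe', 'w'), // the child's stdout, which we read from
    2 => array('pipe', 'w'), // the child's stderr
);

$process = proc_open(escapeshellarg($php), $descriptors, $pipes);
if (!is_resource($process)) {
    throw new RuntimeException('Unable to start the child process.');
}

// Send the script to the child on stdin, then close the pipe so it sees end-of-file.
fwrite($pipes[0], '<?php echo "Hello from child ", 6 * 7;');
fclose($pipes[0]);

// Collect everything the child wrote before it exited.
$output = stream_get_contents($pipes[1]);
fclose($pipes[1]);
$errors = stream_get_contents($pipes[2]);
fclose($pipes[2]);

// proc_close() waits for the child and returns its exit code.
$exitCode = proc_close($process);

echo $output, "\n";
```

Pipe setup, cleanup, and exit-code handling are exactly the kind of boilerplate that is worth hiding behind a run() and getOutput() pair. Note that reading stdout to the end before touching stderr is only safe for small outputs like this one.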

BrowserKit

Finally, the BrowserKit component brings all of the components together. The BrowserKit makes a request (via a method you define), and then allows you to interact with the page (e.g. click, submit) or retrieve information from the page (via the DomCrawler).
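The "via a method you define" design can be sketched as a base class that owns the navigation logic and leaves the transport to a subclass. Everything below (SketchClient, ArrayClient, clickLink, the canned pages) is hypothetical and only illustrates the pattern; it is not BrowserKit's actual API.

```php
<?php
// Hypothetical base class: navigation lives here, fetching is left to subclasses.
abstract class SketchClient
{
    protected $history = array();

    // Subclasses decide how a URI is fetched (real HTTP, in-process kernel, fixtures...).
    abstract protected function doRequest($method, $uri);

    public function request($method, $uri)
    {
        $this->history[] = $method . ' ' . $uri;

        $document = new DOMDocument();
        $document->loadHTML($this->doRequest($method, $uri));

        return $document;
    }

    // "Click" the first link whose text matches by requesting its href.
    public function clickLink(DOMDocument $document, $text)
    {
        foreach ($document->getElementsByTagName('a') as $anchor) {
            if (trim($anchor->nodeValue) === $text) {
                return $this->request('GET', $anchor->getAttribute('href'));
            }
        }

        throw new InvalidArgumentException(sprintf('No link named "%s".', $text));
    }

    public function getHistory()
    {
        return $this->history;
    }
}

// Hypothetical transport that serves canned pages from an array instead of the network.
class ArrayClient extends SketchClient
{
    private $pages;

    public function __construct(array $pages)
    {
        $this->pages = $pages;
    }

    protected function doRequest($method, $uri)
    {
        return $this->pages[$uri];
    }
}

$client = new ArrayClient(array(
    '/'        => '<html><body><a href="/plugins">Plugins</a></body></html>',
    '/plugins' => '<html><body><h1>Plugin list</h1></body></html>',
));

$home    = $client->request('GET', '/');
$plugins = $client->clickLink($home, 'Plugins');
$title   = $plugins->getElementsByTagName('h1')->item(0)->nodeValue;

echo $title, "\n";
print_r($client->getHistory());
```

Because only doRequest() touches the outside world, the same browsing code can run against a live site or against fixtures in a test suite.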

The best way to understand the BrowserKit is to see it in action with Goutte.

Goutte – a screen scraping and web crawling library

Goutte combines the above four components along with Zend Framework’s Date, Uri, Http, and Validate components to form an easy and powerful way to programmatically crawl and interact with web pages.

  use Goutte\Client;

  $client = new Client();
  $crawler = $client->request('GET', 'http://www.symfony-project.org/');

  // Click on a link
  $link = $crawler->selectLink('Plugins')->link();
  $crawler = $client->click($link);

  // Read through a list of error messages
  $nodes = $crawler->filter('ul.error_list');
  foreach ($nodes as $node)
  {
    // Iterating over a crawler yields plain DOM nodes, so read the text via nodeValue
    echo 'Error: ' . $node->nodeValue . "\n";
  }

Ryan Weaver is lead programmer at iostudio in Nashville TN and an avid supporter of the Symfony framework and open source in general. He's passionate about readable code, unit testing and collaboration (don't reinvent the wheel). If you've written open source code, Ryan probably loves you, but hates your documentation.

 

Responses and Pingbacks

I did know about the components but not about “Goutte” to see them working together, excellent info!

thanks for the article, Ryan

Thanks for the great news and article!

I’d really love to see a tutorial where a website (>20k pages) is crawled with a forked process using Goutte.
… or I’ll get there putting a lot of time and energy into it. 🙂

Thanks again!
Flem

The crawling can be done really well with Query Path (jQuery like in PHP).

Glad to see this out there as well.
