Tuesday, May 15, 2012

How to scrape specific data from a webpage with Simple HTML DOM Parser


I am trying to scrape data from a webpage, but I need to get all the data in this link.




include 'simple_html_dom.php';
$html1 = file_get_html('http://www.aktive-buergerschaft.de/buergerstiftungen/unsere_leistungen/buergerstiftungsfinder');

$info1 = $html1->find('b[class=[what to enter here?]]', 0);



I need to get all the data out of this site.




Bürgerstiftung Lebensraum Aachen
rechtsfähige Stiftung des bürgerlichen Rechts
Ansprechpartner: Hubert Schramm
Alexanderstr. 69/ 71
52062 Aachen
Telefon: 0241 - 4500130
Telefax: 0241 - 4500131
Email: info@buergerstiftung-aachen.de
www.buergerstiftung-aachen.de
>> Weitere Details zu dieser Stiftung

Bürgerstiftung Achim
rechtsfähige Stiftung des bürgerlichen Rechts
Ansprechpartner: Helga Kühn
Rotkehlchenstr. 72
28832 Achim
Telefon: 04202-84981
Telefax: 04202-955210
Email: info@buergerstiftung-achim.de
www.buergerstiftung-achim.de
>> Weitere Details zu dieser Stiftung



I need the data that is "behind" the link. Is there any way to do this with an easy, understandable parser, one that a newbie can understand and write?


Source: Tips4all

4 comments:

  1. Seems to be written in the documentation:

    $html1->find('b[class=info]',0)->innertext;

  2. The links you provided are down.
    I suggest using the native PHP "DOM" extension instead of "simple html parser"; it will be much faster and easier ;)
    I had a look at the page using the Google cache; you can use something like:

    $doc = new DOMDocument;
    @$doc->loadHTMLFile('...URL....'); // Using the @ operator to hide parse errors
    $contents = $doc->getElementById('content')->nodeValue; // Text contents of #content
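
    To reach the data "behind" each ">> Weitere Details zu dieser Stiftung" link, one option is to collect the link targets first and then load each detail page the same way. The following is only a sketch under assumptions: the live markup was not available, so the sample HTML, the href values, and the helper name extractDetailLinks are all hypothetical.

    ```php
    <?php
    // Hypothetical sample markup; the real page may be structured differently.
    $sampleHtml = '<html><body><div id="content">'
        . '<p>Stiftung A <a href="/details?id=1">&gt;&gt; Weitere Details zu dieser Stiftung</a></p>'
        . '<p>Stiftung B <a href="/details?id=2">&gt;&gt; Weitere Details zu dieser Stiftung</a></p>'
        . '</div></body></html>';

    // Hypothetical helper: returns the href of every <a> inside the element with the given id.
    function extractDetailLinks(string $html, string $containerId): array
    {
        $doc = new DOMDocument();
        libxml_use_internal_errors(true); // collect parse warnings instead of printing them
        $doc->loadHTML($html);
        libxml_clear_errors();

        $container = $doc->getElementById($containerId);
        if ($container === null) {
            return []; // container not found, nothing to extract
        }

        $links = [];
        foreach ($container->getElementsByTagName('a') as $anchor) {
            $links[] = $anchor->getAttribute('href');
        }
        return $links;
    }

    $detailLinks = extractDetailLinks($sampleHtml, 'content');
    print_r($detailLinks);
    ```

    Each collected path could then be resolved against the site's base URL and loaded in turn with loadHTMLFile.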

  3. From a quick glance, you need to loop through the <dl> tags in #content, then through the dt and dd elements in each.

    foreach ($html->find('#content dl') as $item) {
        $info = $item->find('dd');
        foreach ($info as $info_item) {..}
    }


    This uses the simple_html_dom library.

  4. XPath makes scraping ridiculously easy and tolerates some changes in the HTML document. For example, to pull out the names, you'd use a query that looks like:

    //div[@id='content']/dl/dt


    A simple Google search will give you plenty of tutorials.
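
    Such a query can be run with PHP's built-in DOMXPath class. Note that XPath addresses attributes with @, so the id test is written @id='content'. This is a sketch on an invented HTML fragment: only the dl/dt/dd layout is taken from the comments above, and the names in the sample markup are placeholders.

    ```php
    <?php
    // Invented sample markup; only the dl/dt/dd layout comes from the discussion.
    $sampleHtml = '<html><body><div id="content">'
        . '<dl><dt>Stiftung A</dt><dd>Telefon: 0241 - 4500130</dd></dl>'
        . '<dl><dt>Stiftung B</dt><dd>Telefon: 04202-84981</dd></dl>'
        . '</div></body></html>';

    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // collect parse warnings instead of printing them
    $doc->loadHTML($sampleHtml);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);

    // Select every <dt> that is a child of a <dl> directly inside div#content.
    $names = [];
    foreach ($xpath->query("//div[@id='content']/dl/dt") as $dt) {
        $names[] = trim($dt->textContent);
    }

    print_r($names);
    ```

    Swapping dt for dd in the query would select the detail lines instead of the names.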
