Friday, May 4, 2012

Scrape web pages in real time with Node.js


What's a good way to scrape website content using Node.js? I'd like to build something very, very fast that can execute searches in the style of kayak.com, where one query is dispatched to several different sites, the results are scraped, and then returned to the client as they become available.



Let's assume that this script should just provide the results in JSON format, and we can process them either directly in the browser or in another web application.
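
To make the idea concrete, here is a minimal sketch of the dispatch step, not a finished scraper. The names `dispatchQuery`, `scrapers`, `siteA`, `siteB` and `siteC` are all hypothetical, and the scrapers are timer-based stand-ins rather than real HTTP requests. It only shows one query going out to several sources at once, with each result handed back as JSON as soon as it is ready.

```javascript
// Hypothetical stand-in scrapers: each takes a query and resolves to an array
// of results. A real one would fetch a page and extract data from its HTML.
const delay = (ms, value) =>
  new Promise((resolve) => setTimeout(() => resolve(value), ms));

const scrapers = {
  siteA: (query) => delay(30, [{ title: `${query} from site A`, price: 120 }]),
  siteB: (query) => delay(10, [{ title: `${query} from site B`, price: 95 }]),
  siteC: () => Promise.reject(new Error('site C timed out')),
};

// Start every scraper at once and call onResult with a JSON string as soon as
// each one settles, so a slow or failing site never blocks the others.
function dispatchQuery(query, sources, onResult) {
  const pending = Object.entries(sources).map(([site, scrape]) =>
    Promise.resolve()
      .then(() => scrape(query))
      .then(
        (results) => onResult(JSON.stringify({ site, ok: true, results })),
        (err) => onResult(JSON.stringify({ site, ok: false, error: err.message }))
      )
  );
  // Resolves once every source has reported, successfully or not.
  return Promise.all(pending);
}

dispatchQuery('flights to Lisbon', scrapers, (json) => console.log(json));
```

In a real application the `onResult` callback would write each JSON string to the client connection instead of logging it.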



A few starting points: "Node.js Fetch URL and display page body" and "Using node.js and jquery to scrape websites".



Anybody have any ideas?


Source: Tips4all

3 comments:

  1. Take a look at blog.dtrejo.com/scraping-made-easy-with-jquery-and-selectorga, which has some good tips on getting your first scraper going, as well as handy online tools for creating regular expressions and for finding the best selectors with SelectorGadget.

    Here's a link to the code to get you going: https://gist.github.com/790580

    As for populating the page with results in realtime, I recommend socket.io or nowjs.com.

    Remember you can always stop into #node.js and ask questions!

  2. Node.io seems to take the cake :-)

  3. You don't always need jQuery. If you work directly with the DOM returned by jsdom, for example, you can easily extract what you need yourself (and you don't have to worry about cross-browser issues). See: https://gist.github.com/1335009. That's not taking anything away from node.io at all, just saying you might be able to do it yourself depending...

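
To illustrate the point in comment 3, here is a rough sketch of pulling data out of a DOM using only standard DOM methods, with no jQuery. The function name `extractResults` and the `.result a` selector are made up for illustration; a real page would need its own selectors.

```javascript
// Collect the text and href of every link inside elements matching `.result`.
// `document` is any standard DOM Document, such as the one jsdom builds from
// an HTML string. The `.result a` selector is a hypothetical example.
function extractResults(document) {
  const links = document.querySelectorAll('.result a');
  return Array.from(links, (a) => ({
    title: a.textContent.trim(),
    url: a.getAttribute('href'),
  }));
}
```

With a recent jsdom installed, `new JSDOM(html).window.document` should be usable as the `document` argument. The returned array can be passed straight to `JSON.stringify`.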