Scraping your way to RSS feeds!


I was looking for a way to get regular updates from a job site about a particular category even though the site doesn’t offer any sort of feed.

Then I stumbled upon a site called Feedyes.com.

With its help I had an RSS feed ready for the job site. It’s pretty elementary, really. You don’t even need to register in order to create an RSS feed for a site.

The only problem was that I didn’t get the RSS feed in XML format; I had to go to the web site to view it. Also, the feed couldn’t really be customized in any way.

There’s another site named Page2rss.com which does pretty much the same. Mind you, none of the above sites is perfect, yet they do a reasonable job of it.

So I googled a bit more and stumbled upon Feed43.com, which actually let me write expressions for creating the feed.

Here’s what I came up with as an RSS feed version of this page. Feed43 lets you define ‘search patterns’ using regular expressions, along with ‘output templates’. It’s a handy site even with all the limitations of the unpaid package, such as polling intervals and a maximum feed limit. Do give it a try.
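
To give a feel for the ‘search pattern’ plus ‘output template’ idea, here is a minimal Python sketch. It is not how Feed43 works internally, just the same concept done by hand. The sample HTML, the example.com URL and the names ITEM_PATTERN, ITEM_TEMPLATE and html_to_rss are all made up for illustration.

```python
import html
import re
from xml.sax.saxutils import escape

# Hypothetical HTML fragment standing in for a job listing page.
SAMPLE_HTML = """
<ul class="jobs">
  <li><a href="/jobs/101">ASP.NET Developer</a> <span>Onsite</span></li>
  <li><a href="/jobs/102">QA Engineer &amp; Tester</a> <span>Remote</span></li>
</ul>
"""

# "Search pattern": matches one repeating item and names the parts we want.
ITEM_PATTERN = re.compile(
    r'<li><a href="(?P<link>[^"]+)">(?P<title>[^<]+)</a>\s*'
    r'<span>(?P<place>[^<]+)</span></li>'
)

# "Output template": how each match is turned into an RSS <item>.
ITEM_TEMPLATE = (
    "<item><title>{title}</title>"
    "<link>{base}{link}</link>"
    "<description>{place}</description></item>"
)


def html_to_rss(page, base_url, feed_title):
    """Build a small RSS 2.0 document from every pattern match in the page."""
    items = []
    for match in ITEM_PATTERN.finditer(page):
        # Decode HTML entities, then re-escape the text so it is valid XML.
        fields = {
            name: escape(html.unescape(value))
            for name, value in match.groupdict().items()
        }
        items.append(ITEM_TEMPLATE.format(base=escape(base_url), **fields))
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<rss version="2.0"><channel>'
        f"<title>{escape(feed_title)}</title>"
        f"<link>{escape(base_url)}</link>"
        f"<description>{escape(feed_title)}</description>"
        + "".join(items)
        + "</channel></rss>"
    )


feed = html_to_rss(SAMPLE_HTML, "http://example.com", "Jobs")
print(feed)
```

A regular expression like this is brittle: it breaks as soon as the site changes its markup, which is the usual trade-off with pattern-based scraping.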

I know there are several good articles, like Creating a generic Site-To-RSS tool, When RSS Fails: Web Scraping with HTTP and How To: Scrape a Web Page to RSS Feed, on doing much the same thing.

What’s more, I don’t know if you know this, but both Yahoo! and MSN provide search results in RSS format.

Here’s the result using the Yahoo! web search service for ASP.NET MVC, and here’s MSN’s version of the same.

But of course, it would help if Google had an XML feed of its normal search engine results pages (SERPs) like Yahoo! and MSN do.

What it does provide though is an RSS feed for searching blogs. Try this.

There’s another gem I found which actually lets you run an XPath query against a web page to scrape it into RSS. It can be used to search an HTML document in a pretty straightforward way.
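
As a rough illustration of the XPath approach (not the tool mentioned above), here is a small Python sketch using only the standard library. The XHTML fragment and the extract_links helper are made up for this example. Note that xml.etree.ElementTree supports only a subset of XPath and needs well-formed markup, so a real page would usually have to go through a more lenient HTML parser first.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed XHTML fragment; real pages are rarely this tidy.
SAMPLE_XHTML = """
<html><body>
  <div class="post"><h2><a href="/p/1">First post</a></h2></div>
  <div class="ad"><h2><a href="/ad/9">Buy now</a></h2></div>
  <div class="post"><h2><a href="/p/2">Second post</a></h2></div>
</body></html>
"""


def extract_links(document, xpath):
    """Return (text, href) pairs for every element the XPath query selects."""
    root = ET.fromstring(document.strip())
    return [(node.text, node.get("href")) for node in root.findall(xpath)]


# Select only the links inside <div class="post">, skipping the ad block.
posts = extract_links(SAMPLE_XHTML, ".//div[@class='post']/h2/a")
print(posts)
```

Each (text, href) pair could then be written out as the title and link of one RSS item.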

Well, this has been a very long ride through scraping your way into another site, but what if you want to stop others from doing the same? :) Enough of RSS Scraping, Scavenging, Stealing, and Content Theft, no? Talk about having a dose of one’s own medicine, right?

Anyway, have a look at What Do You Do When Someone Steals Your Content or, better still, have a read about the antonym of scraping in IT terminology: Information Trapping.

To wrap things up, do remember that there are words like Copyright and Intellectual Property / Intellectual Property Protection in the dictionary :). So use these tools in a positive way and enjoy the Scrapventure!

Update on 9th April, 2009: It was unfair on my part to leave out tools like Yahoo! Pipes and Feedity.com. While Yahoo! Pipes is a less than straightforward means of achieving our objective, it has powerful features, like visual query development, which are missing from the rest. But I think what makes Yahoo! Pipes unique is that you can chain together an arbitrary number of previous queries (pipes) and thus mash them up into one that has all your filters and queries. It also provides input facilities. More on Yahoo! Pipes in a subsequent post, perhaps, where I would guide you through the process.

Feedity.com, on the other hand, is a very straightforward means of achieving what we want. It’s quite efficient and intelligent with parsing too. Give it a try.
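
The chaining idea can be sketched in a few lines of Python. This is only an analogy for how pipes compose, not Yahoo! Pipes itself; the sample feeds and the names make_pipe, keep_if_title_contains and drop_duplicates_by_link are invented for the example.

```python
import itertools
from functools import reduce


def keep_if_title_contains(word):
    """Stage: pass through only items whose title mentions the word."""
    def stage(items):
        return (i for i in items if word.lower() in i["title"].lower())
    return stage


def drop_duplicates_by_link():
    """Stage: pass through only the first item seen for each link."""
    def stage(items):
        seen = set()
        for item in items:
            if item["link"] not in seen:
                seen.add(item["link"])
                yield item
    return stage


def make_pipe(*stages):
    """Chain any number of stages into one pipe, applied left to right."""
    def pipe(items):
        return reduce(lambda acc, stage: stage(acc), stages, items)
    return pipe


# Two hypothetical feeds, already parsed into dictionaries.
feed_a = [
    {"title": "Python developer", "link": "http://example.com/1"},
    {"title": "Designer", "link": "http://example.com/2"},
]
feed_b = [
    {"title": "Senior Python Engineer", "link": "http://example.com/3"},
    {"title": "Python developer", "link": "http://example.com/1"},
]

pipe = make_pipe(keep_if_title_contains("python"), drop_duplicates_by_link())
result = list(pipe(itertools.chain(feed_a, feed_b)))
print([item["link"] for item in result])
```

Because every stage takes items in and hands items out, a finished pipe can itself be used as a stage inside a bigger one, which is roughly what makes the mash-up style so flexible.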

Update on 16th April, 2009: Microsoft Popfly mashup creator is another candidate for honorable mention :)


11 responses to this post.

  1. I have an article explaining how to scrape a website into an RSS feed using Yahoo! Pipes at

    http://www.daybarr.com/blog/2007/12/11/yahoo-pipes-tutorial-an-example-using-the-fetch-page-module-to-make-a-web-scraper


  2. Posted by james ryley on August 24, 2009 at 1:57 pm

    I saw your page at http://innovate.ee.ucla.edu/patents.html and wanted to let you know about two free sites for patent research, http://www.sumobrain.com and http://www.freepatentsonline.com

    These sites offer free patent searching with more data and more features than any other free site, including free PDF downloading, annotating documents, organizing research into folders, sharing documents with other users, and alerts for new documents of interest.

    A link to let your users know about the site would be great!


  3. nice tips sharing.


  4. simplepie is another good choice for web scraping :)


  5. Greetings from Ohio! I’m bored to death at work so I decided to check out your website on my iphone during lunch break. I enjoy the information you provide here and can’t wait to take a look when I get home.
    I’m surprised at how fast your blog loaded on my mobile .. I’m not
    even using WIFI, just 3G .. Anyways, wonderful site!


  6. Whatever really encouraged you to publish “Scraping your way to RSS feeds!
    Technosiastic!”? I reallytruly adored the post!

    Thanks for your effort -Debra


  7. I think the admin of this website is truly working hard in support
    of his web page, for the reason that here
    every material is quality based material. Marcy


  8. I don’t leave many responses, however i did a few searching and wound
    up here Scraping your way to RSS feeds! | Technosiastic!.
    And I do have some questions for you if you don’t mind.

    Could it be simply me or does it seem like a few of these comments look as if they
    are left by brain dead visitors? :-P And, if you are writing
    on additional sites, I would like to keep up with anything fresh you have to post.

    Would you make a list of all of all your social pages
    like your twitter feed, Facebook page or linkedin profile?


  9. I don’t even know how I finished up right here, however I assumed this publish was once great.
    I don’t realize who you’re however certainly you are going to a well-known blogger in the event you aren’t already.
    Cheers!


  10. If you wish for to improve your familiarity simply keep visiting
    this web site and be updated with the most recent information posted here.

