Scraping your way to RSS feeds!

Posted April 8, 2009 by Shahriar Hyder in Intellectual Property, Regular Expression, RSS, Screen Scraping, Web Scraping. Tagged: Blog search, Content Theft, Copyright, Feed, Feed43, Feedity, FeedYes, Information, Page2RSS, Popfly, Protection, RSS Feeds, RSS Tool, Scavenging, Scrape, Scraping, Screen Scraping, search service, SERPS, Stealing, Trapping, Web Scraping, XPath Search, Yahoo! Pipes. 14 Comments

I was looking for a way to get regular updates from a job site about a particular category even though the site doesn’t offer any sort of feed.

Then I stumbled upon a site called Feedyes.com.

With its help I had an RSS feed for the job site ready in no time. It’s really pretty elementary, and you don’t even need to register in order to create an RSS feed for a site.

The only problem was that I didn’t get the RSS feed in XML format; I had to go to the web site to view it. The feed also couldn’t really be customized in any way.

There’s another site named Page2rss.com which does pretty much the same. Mind you, neither of these sites is perfect, yet they do a reasonable job of it.

So I googled a bit more and stumbled upon Feed43.com, which actually let me write expressions for creating the feed.

Here’s what I came up with as an RSS feed version of this page. Feed43 lets you define ‘search patterns’ using regular expressions, along with ‘output templates’. It’s a handy site, even with the limitations of the unpaid package such as polling intervals, maximum feed limits etc. Do give it a try.
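To give a feel for the ‘search pattern’ plus ‘output template’ idea, here is a minimal sketch in Python. It is not Feed43’s own pattern syntax or code: it uses Python’s `re` module, and the sample HTML, the pattern, the template and the `scrape_to_rss` function are all made up for illustration.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical HTML standing in for a job listing page (not a real site).
SAMPLE_HTML = """
<ul class="jobs">
  <li><a href="/jobs/101">Senior C# Developer</a> <span>Dhaka</span></li>
  <li><a href="/jobs/102">QA Engineer</a> <span>Remote</span></li>
</ul>
"""

# "Search pattern": one named group per piece of data we want to pull out.
SEARCH_PATTERN = re.compile(
    r'<li><a href="(?P<link>[^"]+)">(?P<title>[^<]+)</a>\s*'
    r'<span>(?P<place>[^<]+)</span></li>'
)

# "Output template": how each match is turned into an RSS item.
ITEM_TEMPLATE = (
    "<item><title>{title} ({place})</title>"
    "<link>{base}{link}</link></item>"
)


def scrape_to_rss(html, base_url, feed_title):
    """Apply the search pattern to html and render an RSS 2.0 document."""
    items = []
    for match in SEARCH_PATTERN.finditer(html):
        # Escape every captured value so the feed stays well-formed XML.
        fields = {k: escape(v.strip()) for k, v in match.groupdict().items()}
        items.append(ITEM_TEMPLATE.format(base=escape(base_url), **fields))
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<rss version="2.0"><channel>'
        f"<title>{escape(feed_title)}</title>"
        f"<link>{escape(base_url)}</link>"
        f"<description>{escape(feed_title)}</description>"
        + "".join(items)
        + "</channel></rss>"
    )


feed = scrape_to_rss(SAMPLE_HTML, "http://example.com", "Example job listings")
print(feed)
```

A pattern like this is brittle: it breaks as soon as the page markup changes, which is the price you pay with any regex-based scraper.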

I know there are several good articles covering much the same ground, like Creating a generic Site-To-RSS tool, When RSS Fails: Web Scraping with HTTP and How To: Scrape a Web Page to RSS Feed.

What’s more, I don’t know if you know this, but both Yahoo! and MSN provide search results in RSS format.

Here’s the result using the Yahoo! web search service for ASP.NET MVC, and here’s MSN’s version of the same.

But of course, it would help if Google had an XML feed of its normal search engine results pages (SERPs) like Yahoo! & MSN do.

What it does provide though is an RSS feed for searching blogs. Try this.

There’s another gem I found which actually lets you run an XPath query against a web page and scrape the result into RSS. It can be used to search an HTML document in a pretty straightforward way.
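For the curious, the XPath approach can be sketched roughly like this in Python. This is only an illustration, not how that tool works internally: the sample page, the query and the `xpath_to_rss` function are hypothetical, and the standard library’s `xml.etree.ElementTree` supports only a limited subset of XPath and needs well-formed markup, so real-world HTML would usually call for a more forgiving parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed XHTML standing in for a scraped page.
SAMPLE_XHTML = """
<html><body>
  <div class="post"><h2><a href="/a">First post</a></h2></div>
  <div class="post"><h2><a href="/b">Second post</a></h2></div>
  <div class="ad"><h2><a href="/x">Sponsored</a></h2></div>
</body></html>
"""


def xpath_to_rss(xhtml, query, feed_title, base_url):
    """Select anchors with an XPath-style query and emit them as RSS items."""
    page = ET.fromstring(xhtml)

    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = base_url
    ET.SubElement(channel, "description").text = feed_title

    # Each matched <a> element becomes one feed item.
    for anchor in page.findall(query):
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = "".join(anchor.itertext()).strip()
        ET.SubElement(item, "link").text = base_url + anchor.get("href", "")

    return ET.tostring(rss, encoding="unicode")


# Only links inside <div class="post"> are wanted; the "ad" block is skipped.
feed = xpath_to_rss(
    SAMPLE_XHTML,
    ".//div[@class='post']/h2/a",
    "Example blog",
    "http://example.com",
)
print(feed)
```

Compared with a regular expression, the query describes where the data sits in the document tree, so it tends to survive small cosmetic changes to the markup.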

Well, this has been a very long ride on scraping your way into another site, but what if you want to stop others from doing the same? :) Enough of RSS Scraping, Scavenging, Stealing, and Content Theft, no? Talk about having a dose of one’s own medicine, right?

Anyway, have a look at What Do You Do When Someone Steals Your Content or, better still, read about what IT terminology treats as the antonym of scraping: Information Trapping.

To wrap things up, do remember there are words like Copyright and Intellectual Property / Intellectual Property Protection in the dictionary :). So use these tools in a positive way and enjoy the Scrapventure!

Update on 9th April, 2009: It was unfair on my part to leave out tools like Yahoo! Pipes and Feedity.com. While Yahoo! Pipes is a less than straightforward means of achieving our objective, it has powerful features like visual query development which are missing from the rest. But I think what makes Yahoo! Pipes unique is that you can chain together an arbitrary number of previous queries (pipes) and thus mash them up into one that has all your filters/queries. It also provides input facilities. More on Yahoo! Pipes in a subsequent post, perhaps, when I will guide you through the process. Feedity.com, on the other hand, is a very straightforward means of achieving what we want. It’s quite efficient and intelligent with parsing too. Give it a try.
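The chaining idea is easier to see in code than in words. The sketch below is plain Python and has nothing to do with the actual Yahoo! Pipes interface; the names `keyword_filter`, `limit` and `chain_pipes`, as well as the sample items, are invented for illustration. Each ‘pipe’ takes a stream of feed items and returns another stream, so an already built pipe can be reused as one stage of a bigger one.

```python
from functools import reduce
from itertools import islice


def keyword_filter(keyword):
    """Build a pipe that keeps items whose title contains the keyword."""
    def stage(items):
        return (i for i in items if keyword.lower() in i["title"].lower())
    return stage


def limit(count):
    """Build a pipe that passes through at most count items."""
    def stage(items):
        return islice(items, count)
    return stage


def chain_pipes(*stages):
    """Combine any number of pipes into one; the result is itself a pipe."""
    def pipeline(items):
        return reduce(lambda stream, stage: stage(stream), stages, items)
    return pipeline


# Hypothetical feed items standing in for scraped job postings.
ITEMS = [
    {"title": "ASP.NET MVC developer", "link": "http://example.com/1"},
    {"title": "Java developer", "link": "http://example.com/2"},
    {"title": "Senior ASP.NET MVC architect", "link": "http://example.com/3"},
    {"title": "ASP.NET MVC intern", "link": "http://example.com/4"},
]

# A first pipe, then a second pipe built on top of the first one.
dotnet_jobs = chain_pipes(keyword_filter("asp.net mvc"))
top_two_dotnet = chain_pipes(dotnet_jobs, limit(2))

result = list(top_two_dotnet(ITEMS))
for item in result:
    print(item["title"], item["link"])
```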

Update on 16th April, 2009: Microsoft Popfly mashup creator is another candidate for honorable mention 🙂

14 responses to this post.

Posted by Day Barr on April 9, 2009 at 4:34 pm

I have an article explaining how to scrape a website into an RSS feed using Yahoo! Pipes at

http://www.daybarr.com/blog/2007/12/11/yahoo-pipes-tutorial-an-example-using-the-fetch-page-module-to-make-a-web-scraper

Posted by james ryley on August 24, 2009 at 1:57 pm

I saw your page at http://innovate.ee.ucla.edu/patents.html and wanted to let you know about two free sites for patent research, http://www.sumobrain.com and http://www.freepatentsonline.com

These sites offer free patent searching with more data and more features than any other free site, including free PDF downloading, annotating documents, organizing research into folders, sharing documents with other users, and alerts for new documents of interest.

A link to let your users know about the site would be great!

Posted by bino on March 27, 2010 at 11:57 pm

nice tips sharing.

Posted by tiong on November 13, 2010 at 6:18 pm

simplepie is another good choice for web scraping 🙂

Posted by nirajmchauhan on November 27, 2011 at 7:11 pm

Just made a function for scraping videos from youtube.

http://webstutorial.com/youtube-video-scraping-fetch-youtube-video-through-rss/programming/php

Posted by how to extract data from a website on December 19, 2012 at 10:38 am

Greetings from Ohio! I’m bored to death at work so I decided to check out your website on my iphone during lunch break. I enjoy the information you provide here and can’t wait to take a look when I get home.
I’m surprised at how fast your blog loaded on my mobile .. I’m not
even using WIFI, just 3G .. Anyways, wonderful site!

Posted by http://tinyurl.com/lazeeliza36955 on January 14, 2013 at 6:53 am

Whatever really encouraged you to publish “Scraping your way to RSS feeds! Technosiastic!”? I really truly adored the post!

Thanks for your effort -Debra

Posted by Marcy on March 5, 2013 at 9:13 pm

I think the admin of this website is truly working hard in support
of his web page, for the reason that here
every material is quality based material. Marcy

Posted by รับทํา seo on January 10, 2014 at 12:02 pm

I don’t leave many responses, however i did a few searching and wound
up here Scraping your way to RSS feeds! | Technosiastic!.
And I do have some questions for you if you don’t mind.

Could it be simply me or does it seem like a few of these comments look as if they
are left by brain dead visitors? 😛 And, if you are writing
on additional sites, I would like to keep up with anything fresh you have to post.

Would you make a list of all of all your social pages
like your twitter feed, Facebook page or linkedin profile?

Posted by body cleanse on January 18, 2014 at 9:32 pm

I don’t even know how I finished up right here, however I assumed this publish was once great.
I don’t realize who you’re however certainly you are going to a well-known blogger in the event you aren’t already.
Cheers!

Posted by http://trafficbackdoor.Wordpress.com on February 26, 2014 at 2:26 pm

Great website with a great deal of important material!
I adore it!

Posted by cloud based software on April 8, 2014 at 1:31 pm

Hi there everyone, it’s my first visit at this web page,
and article is in fact fruitful in favor of
me, keep up posting such articles.

Posted by Frank Kern Tweeted This on August 4, 2014 at 6:21 pm

If you wish for to improve your familiarity simply keep visiting
this web site and be updated with the most recent information posted here.

Posted by video marketing on September 4, 2014 at 8:46 am

My brother recommended I might like this blog.
He was entirely right. This post actually made my
day. You can not imagine simply how much time I had
spent for this information! Thanks!
