Regular Expression | Technosiastic!

Archive for the ‘Regular Expression’ Category

8 Apr

Scraping your way to RSS feeds!

Posted by Shahriar Hyder in Intellectual Property, Regular Expression, RSS, Screen Scraping, Web Scraping. Tagged: Blog search, Content Theft, Copyright, Feed, Feed43, Feedity, FeedYes, Information, Page2RSS, Popfly, Protection, RSS Feeds, RSS Tool, Scavenging, Scrape, Scraping, Screen Scraping, search service, SERPS, Stealing, Trapping, Web Scraping, XPath Search, Yahoo! Pipes. 14 comments

I was looking for a way to get regular updates from a job site about a particular category even though the site doesn’t offer any sort of feed.

Then I stumbled upon a site called Feedyes.com.

With it, I had an RSS feed for the job site ready in no time. It’s pretty elementary with the help of Feedyes, really: you don’t even need to register in order to create an RSS feed for a site.

The only problem was that I couldn’t get the RSS feed in XML format; I had to go to the Feedyes web site to view it. The feed also couldn’t really be customized in any way.

There’s another site named Page2rss.com which does pretty much the same. Mind you, neither of these sites is perfect, yet they do a reasonable job of it.

So I googled a bit more and stumbled upon Feed43.com, which actually let me write expressions for creating the feed.

Here’s what I came up with as an RSS feed version of this page. Feed43 lets you define ‘search patterns’ using regular expressions and ‘output templates’ for the feed items. It’s a handy site even with the limitations of its unpaid package, such as polling intervals and a maximum feed limit. Do give it a try.
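To give a feel for the ‘search pattern’ plus ‘output template’ idea, here is a minimal sketch in Python. It is not Feed43’s own syntax or API; it uses Python’s standard `re` module as a stand-in, and the HTML fragment, the URL and the function names (`scrape_items`, `build_rss`) are all made up for illustration.

```python
import re
from html import unescape
from xml.sax.saxutils import escape

# Hypothetical HTML fragment standing in for a job listing page.
SAMPLE_HTML = """
<ul>
  <li><a href="/jobs/1">Python Developer</a> <span class="date">2009-04-07</span></li>
  <li><a href="/jobs/2">QA Engineer &amp; Tester</a> <span class="date">2009-04-08</span></li>
</ul>
"""

# "Search pattern": one named group per piece of the item we want to keep.
ITEM_PATTERN = re.compile(
    r'<li><a href="(?P<link>[^"]+)">(?P<title>[^<]+)</a>\s*'
    r'<span class="date">(?P<date>[^<]+)</span></li>'
)

# "Output template": how each match is rendered as an RSS item.
ITEM_TEMPLATE = (
    "<item><title>{title}</title>"
    "<link>{link}</link>"
    "<description>Posted on {date}</description></item>"
)


def scrape_items(html, base_url):
    """Apply the search pattern and return one dict per matched listing."""
    items = []
    for match in ITEM_PATTERN.finditer(html):
        items.append({
            # Decode HTML entities so the text can be re-escaped for XML later.
            "title": unescape(match.group("title")),
            "link": base_url + match.group("link"),
            "date": match.group("date"),
        })
    return items


def build_rss(items, channel_title, channel_link):
    """Render the items through the output template into an RSS 2.0 string."""
    rendered = "".join(
        ITEM_TEMPLATE.format(
            title=escape(item["title"]),
            link=escape(item["link"]),
            date=escape(item["date"]),
        )
        for item in items
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<rss version="2.0"><channel>'
        f"<title>{escape(channel_title)}</title>"
        f"<link>{escape(channel_link)}</link>"
        "<description>Scraped feed</description>"
        f"{rendered}"
        "</channel></rss>"
    )


if __name__ == "__main__":
    found = scrape_items(SAMPLE_HTML, "http://example.com")
    print(build_rss(found, "Job listings", "http://example.com/jobs"))
```

A pattern like this is tied to the exact markup of the page, so it breaks as soon as the site changes its HTML; that is the usual price of regex-based scraping.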

I know there are several good articles, like Creating a generic Site-To-RSS tool, When RSS Fails: Web Scraping with HTTP and How To: Scrape a Web Page to RSS Feed, on doing much the same.

What’s more, I don’t know if you know this, but both Yahoo! and MSN provide search results in RSS format.

Here’s the result using the Yahoo! web search service for ASP.NET MVC, and here’s MSN’s version of the same.

But of course, it would help if Google had an XML feed of its normal search engine results pages (SERPs) like Yahoo! and MSN do.

What it does provide though is an RSS feed for searching blogs. Try this.

There’s another gem I found which actually lets you run an XPath query to scrape a web page into RSS. It can be used to search an HTML document in a pretty straightforward way.
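As a rough illustration of what an XPath query over a page looks like, here is a small Python sketch. It is not the tool mentioned above; it uses the standard library’s `xml.etree.ElementTree`, which only supports a limited subset of XPath and needs well-formed markup, so real-world HTML would usually call for a more lenient parser. The sample page and the `extract_links` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed XHTML standing in for a scraped page.
SAMPLE_XHTML = """
<html>
  <body>
    <div class="listing">
      <a href="/jobs/1">Python Developer</a>
    </div>
    <div class="listing">
      <a href="/jobs/2">QA Engineer</a>
    </div>
    <div class="sidebar">
      <a href="/about">About</a>
    </div>
  </body>
</html>
"""


def extract_links(xhtml, xpath):
    """Return (text, href) pairs for every element matched by the query."""
    root = ET.fromstring(xhtml)
    return [
        ((node.text or "").strip(), node.get("href"))
        for node in root.findall(xpath)
    ]


if __name__ == "__main__":
    # Select only the links inside <div class="listing">, skipping the sidebar.
    for text, href in extract_links(SAMPLE_XHTML, ".//div[@class='listing']/a"):
        print(text, href)
```

Compared with a regular expression, the query describes where the data sits in the document tree rather than what the surrounding characters look like, which tends to survive small markup changes better.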

Well, this has been a very long ride on scraping your way into another site, but what if you want to stop others from doing the same to yours? :) Enough of RSS Scraping, Scavenging, Stealing, and Content Theft, no? Talk about having a dose of one’s own medicine, right?

Anyway, have a look at What Do You Do When Someone Steals Your Content or, better still, have a read about the antonym of Scraping in IT terminology: Information Trapping.

To wrap things up, do remember there are words like Copyright and Intellectual Property / Intellectual Property Protection in the dictionary :). So use these tools in a positive way and enjoy the Scrapventure!

Update on 9th April, 2009: It was unfair on my part to leave out tools like Yahoo! Pipes and Feedity.com. While Yahoo! Pipes is a less than straightforward means of achieving our objective, it has powerful features like visual query development which are missing from the rest. But I think what makes Yahoo! Pipes unique is that you can chain together an arbitrary number of previous queries (pipes) and thus mash them up into one that has all your filters/queries. It also provides input facilities. More on Yahoo! Pipes in a subsequent post, perhaps, where I would guide you through the process.

Feedity.com, on the other hand, is a very straightforward means of achieving what we want. It’s quite efficient and intelligent with parsing too. Give it a try.
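The chaining idea behind Yahoo! Pipes can be mimicked in plain Python with generators. This is only an analogy under my own assumptions, not the Yahoo! Pipes API: the stage names (`keyword_filter`, `dedupe_by_link`, `limit`), the `chain_pipes` helper and the sample items are all invented for the sketch.

```python
from functools import reduce
from itertools import islice

# Hypothetical feed items, as they might come out of a scraper.
SAMPLE_ITEMS = [
    {"title": "Python Developer", "link": "/jobs/1"},
    {"title": "Senior Python Engineer", "link": "/jobs/2"},
    {"title": "Python Developer", "link": "/jobs/1"},  # duplicate entry
    {"title": "QA Engineer", "link": "/jobs/3"},
    {"title": "Python Intern", "link": "/jobs/4"},
]


def keyword_filter(keyword):
    """Stage that keeps items whose title contains the keyword."""
    def stage(items):
        return (i for i in items if keyword.lower() in i["title"].lower())
    return stage


def dedupe_by_link():
    """Stage that drops items whose link has already been seen."""
    def stage(items):
        seen = set()
        for item in items:
            if item["link"] not in seen:
                seen.add(item["link"])
                yield item
    return stage


def limit(count):
    """Stage that passes through at most `count` items."""
    def stage(items):
        return islice(items, count)
    return stage


def chain_pipes(*stages):
    """Combine any number of stages into a single pipe, applied left to right."""
    def pipeline(items):
        return reduce(lambda acc, stage: stage(acc), stages, items)
    return pipeline


if __name__ == "__main__":
    python_jobs = chain_pipes(keyword_filter("python"), dedupe_by_link(), limit(2))
    for item in python_jobs(SAMPLE_ITEMS):
        print(item["title"], item["link"])
```

Because every stage takes items and returns items, a finished pipe can itself be used as a stage in a bigger one, which is roughly what mashing several pipes into one amounts to.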

Update on 16th April, 2009: The Microsoft Popfly mashup creator is another candidate for an honorable mention 🙂
