pmuellr is Patrick Mueller, Senior Node Engineer at NodeSource.

Tuesday, February 14, 2006

web scraping fix up

I've been doing a bit of web scraping over the years. My pride and joy is a Slashdot scraper, which has been generating iSilo-friendly HTML for me for years and which, in recent times, I've also used to generate RSS. I finally ditched the RSS generator, since I found one that basically works and won't get me banned every so often. The code for my scraper is here. My latest effort at not getting banned from /. is to run off of the Coral shadow instead of hitting /. directly. I'm not sure, now, why I wasn't doing this before; I had all kinds of elaborate checks to make sure I wasn't hitting /. too hard, but inevitably I'd screw up and get banned for a few days, a couple of times a year.
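For anyone curious what "running off of the Coral shadow" amounts to, here is a minimal sketch, not the code from my scraper. It assumes the Coral convention of the time, appending `.nyud.net:8090` to the host name; the function name `coralize` and the constant `CORAL_SUFFIX` are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: the suffix Coral used at the time of writing.
CORAL_SUFFIX = ".nyud.net:8090"


def coralize(url):
    """Rewrite a URL so the request goes through the Coral cache
    instead of hitting the origin site directly."""
    parts = urlsplit(url)
    # Leave relative URLs and already-rewritten URLs alone.
    if parts.hostname is None or parts.netloc.endswith(CORAL_SUFFIX):
        return url
    # Any explicit port on the original URL is discarded in this sketch.
    netloc = parts.hostname + CORAL_SUFFIX
    return urlunsplit(
        (parts.scheme, netloc, parts.path, parts.query, parts.fragment)
    )


print(coralize("http://slashdot.org/index.pl?a=1"))
```

The scraper then fetches the rewritten URL exactly as it would the original, so the rest of the parsing code doesn't need to change.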

But that /. scraper has been running great for years. I run it at about 4:00am and 11:00am, and then run iSiloC 30 or so minutes after that. So I have fresh, hot /. articles, with full comments, on my Palm in the morning while I'm waiting for the kids' buses to come, and in the afternoon when I go for a walk. About 2 MB worth (that's compressed HTML). Now, if only there were some interesting articles!

I just had to fix up another one of my scrapers, for Harmony-Central, that generates RSS that includes the actual article text (and images) instead of the empty item bodies the site provides. This is the one you want, since Bloglines has a couple listed: Compare it to the one that HC provides itself:

What I had to fix was a typo they injected in an article link. My Python script was throwing an exception and dying before writing out the RSS. The quick fix was to wrap a try/except around it, providing an error message in lieu of the content it was supposed to be getting. In a little while, Bloglines picked up the new copy, and I got a few days of H-C news to catch up on.
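The shape of that fix looks roughly like the sketch below. This is an illustration rather than the actual script: `build_item`, `build_items`, and the `fetch_article` callable are hypothetical names, and the dict stands in for whatever the real code uses to represent an RSS item.

```python
from html import escape


def build_item(link, fetch_article):
    """Build one feed item; a bad link yields an error message
    instead of killing the whole run."""
    try:
        body = fetch_article(link)
    except Exception as exc:
        # Put the error where the article text would have gone,
        # so the feed still gets written.
        body = "<p>Could not fetch article: %s</p>" % escape(str(exc))
    return {"link": link, "description": body}


def build_items(links, fetch_article):
    """Build every item, even if some of the links are broken."""
    return [build_item(link, fetch_article) for link in links]
```

The point is that the try/except sits around a single article rather than the whole run, so one mangled link costs one item and not the entire feed.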

That's the web scraper's life: constantly having to add little checks for things that go wrong, or change. In the end, it's worth it to me, though.

1 comment:

Jeff Winkler said...

Give BeautifulSoup a try. Not fast, but very robust and glorious syntax.