Craigslist is one of the greatest sites in the world, and the entire Bay Area seems to revolve around it. Sadly, Craigslist’s search facility is extremely bad, seemingly only capable of searching within a price range and neighborhood. Craigslist supplies RSS feeds, but this still means I have to sift through a lot of information in order to find what I’m looking for.
Yahoo Pipes provides a way to filter and manipulate RSS feeds. It’s very visual, and relatively easy to use. This would be an excellent tool to prune down my Craigslist RSS feeds.
Unfortunately, Craigslist recently began blocking Yahoo Pipes. Perhaps someone wrote an overly popular pipe which put a tremendous load on Craigslist’s servers, or perhaps Craigslist thinks they’ll somehow lose income by allowing Pipes. Either way, it sucks.
The work-around which I’ve employed is to mirror the base Craigslist search on my own server, then feed the Yahoo Pipe from that.
This requires you to have a server which:
Is HTTP accessible.
Provides cron, or some other method of running a script at regular intervals.
Has curl, wget, or another HTTP-content-fetching utility.
Mirroring the RSS Feed
First, create an appropriate directory structure. For example:
mkdir ~/public_html/feeds
Next, test curl or a similar content-fetching application on a Craigslist RSS feed URL. Don’t forget to put quotes around the URL: search URLs usually contain characters such as & and ?, which the shell would otherwise interpret itself:
curl "http://feedUrl" --output ~/public_html/feeds/yourFile.xml
Examine the content of the file and make sure that it’s the expected XML. If the file is very small and contains text to the effect of “this URL has moved”, you may have forgotten to surround the URL with double quotes.
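The manual check above can also be scripted. The following is only a rough sketch: the function name looks_like_feed is made up for illustration, and it assumes the feed's root element (rss, rdf:RDF, or feed) appears within the first 20 lines of the file. It does not validate the XML; it only catches the obvious failure cases such as an empty file or a "moved" notice.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if the file is non-empty and an RSS,
# RDF, or Atom root element appears within its first 20 lines.
looks_like_feed() {
    local file="$1"
    [ -s "$file" ] || return 1
    head -n 20 "$file" | grep -Eq '<(rss|rdf:RDF|feed)([ >]|$)'
}

# Usage: report on the mirrored file from this article, if it exists.
feed_file="$HOME/public_html/feeds/yourFile.xml"
if [ -e "$feed_file" ]; then
    if looks_like_feed "$feed_file"; then
        echo "OK: $feed_file looks like a feed"
    else
        echo "Problem: $feed_file does not look like a feed" >&2
    fi
fi
```

If the check reports a problem, open the file and read it; the error text the server returned is usually the quickest clue.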
Creating the Yahoo Pipe
To fetch this mirrored RSS feed, use the “Fetch Data” source and provide it the URL to your freshly-fetched file.
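If the pipe can’t read the feed, verify the permissions on the containing folder hierarchy on your server. On *nix boxes, make sure the execute bit is set on each directory in the path (for example, chmod a+x ~/public_html/feeds) and that the feed file itself is world-readable.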
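To find which directory in the hierarchy is the culprit, something like the following sketch may help. The function name dirs_missing_world_x is hypothetical, and the approach assumes the web server reads your files as an "other" user and that ls -ld prints the usual ten-character mode string (for example drwxr-xr-x).

```shell
#!/bin/bash
# Hypothetical helper: list every directory from the given one up to /
# that is missing the execute (search) bit for "other" users.
dirs_missing_world_x() {
    local dir="$1" mode
    while [ "$dir" != "/" ] && [ "$dir" != "." ]; do
        # Character 10 of the ls mode string is the "other" execute bit.
        mode=$(ls -ld "$dir" | cut -c10)
        case "$mode" in
            x|t) ;;                # searchable by everyone
            *)   echo "$dir" ;;    # candidate for: chmod a+x "$dir"
        esac
        dir=$(dirname "$dir")
    done
}

# Usage: check the feed directory from this article, if it exists.
if [ -d "$HOME/public_html/feeds" ]; then
    dirs_missing_world_x "$HOME/public_html/feeds"
fi
```

Any directory it prints is one the web server probably cannot traverse. No output means every directory in the path already has the bit set.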
Automating Updates
Create a script file which will retrieve any and all feeds you wish to mirror, and make it executable (chmod +x) so that cron can run it. I place my scripts in ~/bin, so I placed the following into ~/bin/fetch-feeds:
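#!/bin/bash
# Remove the old mirror first so a failed fetch leaves no stale file behind.
rm -f ~/public_html/feeds/yourFile.xml
# Fetch a fresh copy of the feed.
curl "http://feedUrl" --output ~/public_html/feeds/yourFile.xml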
Note that I delete the existing feed mirror before fetching the new one so that any retrieval error will be obvious.
Now, call this script from inside your crontab (Scheduled Tasks on Windows servers):
crontab -e
I update my mirror at 7am and 2pm with the following:
# Fetch Craigslist feeds at 7am and 2pm:
0 7,14 * * * ~/bin/fetch-feeds
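If you mirror more than one feed, the fetch script can be generalized. What follows is a sketch under stated assumptions rather than the script I actually run: the function names, the FEED_DIR and FEED_LIST variables, and the ~/.feed-list file (one "URL filename" pair per line) are all invented for illustration. It keeps the same delete-then-fetch approach, and adds curl's --fail option so that an HTTP error page is not saved as if it were a feed.

```shell
#!/bin/bash
# Sketch: mirror several feeds using the same delete-then-fetch approach.
# FEED_DIR and FEED_LIST are hypothetical names; adjust to taste.
FEED_DIR="${FEED_DIR:-$HOME/public_html/feeds}"
FEED_LIST="${FEED_LIST:-$HOME/.feed-list}"

# Remove any stale copy, then fetch the URL into FEED_DIR under the given name.
mirror_feed() {
    local url="$1" name="$2"
    rm -f "$FEED_DIR/$name"
    curl --silent --show-error --fail "$url" --output "$FEED_DIR/$name"
}

# Read "URL filename" pairs from standard input and mirror each one.
mirror_all() {
    local url name
    while read -r url name || [ -n "$url" ]; do
        [ -n "$url" ] || continue          # skip blank lines
        mirror_feed "$url" "$name" || echo "Failed to fetch $url" >&2
    done
}

# Only do any work if the list file exists.
if [ -r "$FEED_LIST" ]; then
    mirror_all < "$FEED_LIST"
fi
```

Each line of the list file would look like: http://feedUrl yourFile.xml. A failed fetch is reported on standard error, which cron normally mails to the crontab's owner.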