Hot New List of Places To Scrape
You’re not still scraping RSS feeds, are you? *Shakes his head in shame* I am so disappointed in you. How about we review a few places where only the Trekkies are brave enough to scrape? It’s a huge, content-filled Internet world out there. I’m sure we can do a little better than crappy RSS feeds.
Encarta Encyclopedia- WTF?! Yeah, you heard right. The articles and content are held in huge data files on the CDs. Go fuckin’ grab ‘em already! No one else is.
YouTube- This one is too easy. They even have a feed you can use to grab the videos, descriptions, and titles.
IMDB- Same as YouTube. There is even an example of how to grab and parse IMDB content in the LWP module example code.
Newsgroups- A classic and too easy to pass up.
Drudge Report- Nothing is more beautiful than snagging big news that has popularity but isn’t already stolen by CNN and MSN. Also consider who your competitor in the SERPS is. Drudge Report may have a ton of links but the site itself is SEO’d to shit. My little sister could kick his ass in the SERPS.
Craigslist- Same as Drudge Report but I’m going to stay out of this one because I have a ton of respect for Craig Newmark. It is also a bit harder to beat him in the SERPS, but the vast volume of new content being added every day more than makes up for it.
IRC- I’ve beaten this technique to death so I’m not even going to bother talking about it.
Froogle- I couldn’t help but mention this one. However, please respect it when I say: stay off my turf! Seriously…
Forums- One of the easiest ways to build millions of pages of content quickly. The quality tends to suck, but you hit such a wide range of topics in such a small amount of data that it really helps bring in traffic from those odd phrases.
Looksmart and Article Finder- Their templates make it way too easy to scrape the content. The articles are also long which makes it nice.
User Contributed (Check Comments)
Google News- Uhhhg yes.
Public Libraries- Simply fuckin brilliant!
eBay- That one is news to me. I’m all over that one. Ever thought of scraping eBay and then feeding it into Froogle? eBay does the same damn thing, but why not go through your aff links? It’s worth a shot, and there has got to be a good way to make some cash off it.
University Data- Such an asshole thing. I love it!
This is great. Keep em comin!
May I add Flickr, in case you’re interested in having dynamic picture content…
I don’t know if the value is good for SEO, but for users it is always nice to have some changing pictures on your website…
You can even display pictures from their static urls (urls of pictures are in their rss)… Which means no bandwidth from your website
Another favorite of mine: Google News. You can get an RSS feed for one specific news search, which is a nice way to have dynamic content on one specific subject.
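For anyone wondering what "urls of pictures are in their rss" looks like in practice, here is a rough Python sketch. It assumes an ordinary RSS 2.0 feed that carries each picture URL in an `<enclosure>` tag. The sample feed, the example.com URLs, and the function names are all made up for illustration; a real feed (Flickr's included) may put the image URL in a different element, so look at the actual XML first.

```python
import xml.etree.ElementTree as ET
from html import escape

# Made-up sample feed for illustration; a real feed would be downloaded first.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Sample photo feed</title>
    <item>
      <title>Sunset</title>
      <link>http://example.com/photos/1</link>
      <enclosure url="http://example.com/static/1.jpg" type="image/jpeg" length="0"/>
    </item>
    <item>
      <title>Harbor</title>
      <link>http://example.com/photos/2</link>
      <enclosure url="http://example.com/static/2.jpg" type="image/jpeg" length="0"/>
    </item>
  </channel>
</rss>"""


def picture_items(feed_xml):
    """Return (title, image_url) pairs for every item that has an enclosure."""
    root = ET.fromstring(feed_xml)
    pairs = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        enclosure = item.find("enclosure")
        if enclosure is not None and enclosure.get("url"):
            pairs.append((title, enclosure.get("url")))
    return pairs


def as_img_tags(pairs):
    """Build <img> tags that point at the remote static URLs."""
    return [
        '<img src="%s" alt="%s">' % (escape(url, quote=True), escape(title, quote=True))
        for title, url in pairs
    ]


if __name__ == "__main__":
    for tag in as_img_tags(picture_items(SAMPLE_FEED)):
        print(tag)
```

Because the `<img>` tags point at the remote static URLs, the pictures are served from the other site, which is the "no bandwidth" part.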
I wonder if you could explain how scraping works for a newbie. I did a google search but I don’t think I understand the process.
Thanks and great blog.
RSS feed? LOL
I am scraping the public library and MAKING RSS feeds. How about you?
Vito
Vito- Got your email (haven’t had a chance to respond yet). Awesome idea on scraping the public library database. Absolutely brilliant.
Will- I’ll see what I can do. Thanks for the compliment.
Aur- You are right, text is definitely not the only content available. It’s always a good idea to combine the two, like I mentioned in the YouTube idea.
I was checking out some black-hatter’s spammy site (he or she had linked to a page of mine that was #1 in the SERPs), and do you know what this spammer scraped a lot? eBay! Lots of different categories, but tons of ebook auctions. They’ve got a gazillion categories. Lots of text of varying lengths. Sprinkle in a couple of extra keywords, run it through a Markov chain, BAM! Very unique content!
Great post. I would add that you can scrape university data, like course curricula, study guides, and other online resources, and tie that into related aff. programs and offers to kill.
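In case "markov" is unfamiliar: the usual meaning is a word-level Markov chain, where each next word is picked at random from the words that followed the current word in the source text. A minimal Python sketch under that assumption follows. The function names and the sample sentence are invented for illustration, and this is not necessarily what that site actually ran.

```python
import random
from collections import defaultdict


def build_chain(text):
    """Map each word to the list of words that follow it in the text."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain


def generate(chain, start, max_words, rng):
    """Walk the chain from `start`, picking a random follower at each step."""
    output = [start]
    while len(output) < max_words:
        followers = chain.get(output[-1])
        if not followers:  # dead end: this word never had a follower
            break
        output.append(rng.choice(followers))
    return " ".join(output)


if __name__ == "__main__":
    # Made-up sample text; any longer body of text works the same way.
    sample = "the cat sat on the mat and the dog sat on the rug"
    print(generate(build_chain(sample), "the", 8, random.Random()))
```

Every adjacent word pair in the output also appears somewhere in the source, so the result looks locally plausible while the overall sentence is new. With a one-word state like this the text goes nonsensical quickly; keying the chain on two or three words is the common fix.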
so what do you use to scrape when you can’t use RSS?
You scrape with fairly simple .php scripts, if that is what you are asking. They are often found in larger page generator packages. There are a few free ones out there. Some of the scripts are well “commented” also, and are great aids in learning php.
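To give a feel for what the scraping part of such a script does, here is a small sketch in Python rather than PHP, using only the standard library. It pulls the title and paragraph text out of a page. The sample page, the `ArticleScraper` class, and the `scrape` function are made up for illustration and are not taken from any of the packages mentioned in this thread; a real script would download the HTML first and would need rules tuned to the target site's template.

```python
from html.parser import HTMLParser

# Made-up page for illustration; a real script would download the HTML first.
SAMPLE_PAGE = """<html><head><title>Sample article</title></head>
<body>
<div id="nav">Home | About</div>
<p>First paragraph of the article.</p>
<p>Second paragraph, with a <a href="/x">link</a> inside.</p>
</body></html>"""


class ArticleScraper(HTMLParser):
    """Collect the <title> text and the text of every <p> element."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._in_title = False
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "p":
            self._in_p = True
            self.paragraphs.append("")  # start collecting a new paragraph

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_p:
            self.paragraphs[-1] += data  # text inside nested tags is kept too


def scrape(html_text):
    """Return (title, list_of_paragraph_texts) for one HTML page."""
    parser = ArticleScraper()
    parser.feed(html_text)
    parser.close()
    return parser.title.strip(), [p.strip() for p in parser.paragraphs]


if __name__ == "__main__":
    title, paragraphs = scrape(SAMPLE_PAGE)
    print(title)
    for p in paragraphs:
        print("-", p)
```

The navigation `<div>` is skipped because only text inside `<title>` and `<p>` is collected, which is the basic trick: keep the parts of the template that hold content and ignore the rest.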
bradlee, can you give an example of how to?
What’s your favorite scraping software? I tried WebSpinner but I haven’t really evaluated any other software packages.
Roy, I can’t give a how-to here, but you should search for RSSGM and MYGEN. Both are free content generators, and both scrape for content. A little investigating in the code and you can find the scraping part. MYGEN is actually built off of RSSGM, so maybe MYGEN would be better to start with.
Fairly soon I will be releasing a beginner-advanced guide to scraping. So hopefully that will clear up some pending questions on the topic.
That’s great, Eli. Looking forward to your guide to scraping.
cheers.
Hi Eli,
I was wondering, how would you advise organizing scraped content?
- either having one domain, with one blog containing 10000 scraped articles,
- or one domain with 100 subdomains, and one blog per subdomain, each containing about 100 scraped articles?
What is the point of all this?
So what is the best and easiest to use scraping software??