Complete Guide To Scraping Pt. 1
In the spirit of releasing part four of my Wikipedia Links series, we’re going to spend a couple of posts delving into good ol’ black hat, starting with, of course, scraping. I’ve been getting a few questions lately about scraping and how to do it, so I might as well get it all out of the way, explain the whole damn thing, and maybe someone will hear something they can use. Let’s start at the beginning.
What exactly is scraping?
Scraping is one of those necessary evils that is used simply because writing 20,000+ pages of quality content is a real bitch. So when you’re in need of tons of content really fast, what better way of getting it than copying it from someone else? Teachers in school never imagined you’d be making a living copying other people’s work, did they?
The basic idea behind scraping is to grab content from other sources and store it in a database for later use. Those uses include, but are not limited to, putting up huge websites very quickly, updating old websites with new information, creating blogs, filling your spam sites with content, and filling multimedia pages with actual text. Text isn’t the only thing that can be scraped. Anything can be scraped: documents, images, videos, and anything else you could want for your website. Also, just about any source can be scraped. If you can view it or download it, chances are you can figure out a way to copy it.
That, my friend, is what scraping is all about. It’s easy, it’s fast, and it works very, very well. The potential is also limitless. For now, let’s begin with the basics and work our way into the advanced sector and eventually into actual usable code examples.
The goals behind scraping?
The ultimate goals behind scraping are the same as those behind actually writing content.
1) Cleanliness- Filter out as much garbage and as many useless tags as possible. The must-have goal behind a good scrape is to get the content clean, without any chunks of the source’s templates or ads remaining in it.
2) Unique Content- The biggest money lies in finding and scraping content that doesn’t exist in the search engines yet. Another alternative lies in finding content produced by small timers who aren’t even in the search engines and aren’t popular enough for anyone to know the difference.
3) Quantity- The more the better! This also means finding tons of sources for your content instead of just taking content from one single place. The key here is to integrate many different content sources together seamlessly.
4) Authoritative Content- Try to find content that has already proven itself to be not only search engine friendly but also actually useful to the visitors. Forget everything you’ve ever heard about black hat SEO. It’s not about providing a poor user experience; in fact, it’s exactly the opposite. Good content and user experience are what black hat strives for. It’s the ultimate goal. The rest is just sloppiness.
Where do I scrape?
There are basically four general sources that all scraping falls into.
1) Feeds- Really Simple Syndication (RSS) feeds are one of the easiest forms of content to scrape. In fact, that is what RSS was designed for. Remember, not all scraping is stealing; it has its very legitimate uses. RSS feeds give you a quick and easy way to separate out the real content from the templates and other junk that may stand in your way. They also provide useful information about the content such as the date, direct link, author and category. This helps in filtering out content you don’t want.
2) Page Scrapes- Page scrapes involve grabbing an entire page of a website. Then, through a careful process that I’ll go into in further detail later, you filter out the template and all the extra crap, grab just the content, and store it in your database.
3) Gophers- Other portions of the Internet that aren’t websites. This includes many places like IRC, newsgroups... aw hell, here’s a list -> Hot New List of Places To Scrape
4) Offline- Sources and databases that aren’t online. As mentioned in the other post: encyclopedias, dictionary files, and, let us not forget, user manuals.
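To make the feeds point a bit more concrete, here is a minimal sketch in Python of pulling the useful fields (title, link, date, author, category) out of an RSS 2.0 feed and filtering on one of them. The sample feed, the function names and the "news" category are all made up for illustration, and the feed is inlined as a string so the sketch runs without touching the network.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, inlined so the sketch needs no network access.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <item>
      <title>First post</title>
      <link>http://www.example.com/first</link>
      <pubDate>Mon, 01 Jan 2007 10:00:00 GMT</pubDate>
      <author>alice@example.com</author>
      <category>news</category>
      <description>Body of the first post.</description>
    </item>
    <item>
      <title>Second post</title>
      <link>http://www.example.com/second</link>
      <pubDate>Tue, 02 Jan 2007 10:00:00 GMT</pubDate>
      <author>bob@example.com</author>
      <category>reviews</category>
      <description>Body of the second post.</description>
    </item>
  </channel>
</rss>"""

FIELDS = ("title", "link", "pubDate", "author", "category", "description")


def parse_rss_items(xml_text):
    """Return one dict per <item>, with an empty string for any missing field."""
    root = ET.fromstring(xml_text)
    return [
        {field: item.findtext(field, default="") for field in FIELDS}
        for item in root.iter("item")
    ]


def filter_by_category(items, wanted):
    """Keep only the items whose category matches exactly (hypothetical filter)."""
    return [item for item in items if item["category"] == wanted]


if __name__ == "__main__":
    for item in filter_by_category(parse_rss_items(SAMPLE_RSS), "news"):
        print(item["pubDate"], item["title"], item["link"])
```

Real feeds vary (Atom uses different element names, and some feeds only carry a summary), so treat the field list as an assumption to adjust per source.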
How Is Scraping Performed?
Scraping is done through a set methodology.
1) Pulling- First you grab the other site and download all its content and text. In the future I will refer to this as an LWP call, because LWP (libwww-perl) is the Perl module that is used to perform the pull action.
2) Parsing- Parsing is nothing short of an art. It involves taking the pulled source (a page, for example) and removing everything that isn’t the actual content (the template and ads, for instance).
3) Cleaning- Reformatting the content in preparation for your use. Make the content as clean as possible without any signs of the true source.
4) Storage- Any form of database will work. I prefer MySQL or even flat files (text files).
5) Rewrite- This is the optional step. Sometimes, if you’re scraping non-original content, it helps to make some small changes so it appears original. You’ll learn soon enough that I don’t waste my time scraping content that isn’t original (that is, content already in the engines) and focus most of my efforts on grabbing content that isn’t used on any pages that already exist in the search engines.
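The first four steps above can be sketched roughly like this. The post talks about Perl’s LWP for the pull; this sketch swaps in Python’s standard library instead (urllib for the pull, html.parser for the parse, sqlite3 for storage), so it is not the code the rest of the series will use. It also assumes the real content lives in an element with id="content", which is a made-up convention; every site needs its own rule. The sample page is inlined so the parse, clean and store steps can run without a network call.

```python
import sqlite3
from html.parser import HTMLParser
from urllib.request import urlopen


def pull(url):
    """Step 1: download the raw page (not called in the demo below)."""
    with urlopen(url, timeout=10) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


class ContentExtractor(HTMLParser):
    """Step 2: keep only text inside the element whose id is content_id."""

    def __init__(self, content_id):
        super().__init__()
        self.content_id = content_id
        self.tag = None    # tag name of the container element, once found
        self.depth = 0     # open container-named tags we are currently inside
        self.skip = 0      # greater than 0 while inside <script> or <style>
        self.done = False  # True once the container has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth == 0:
            if dict(attrs).get("id") == self.content_id:
                self.tag, self.depth = tag, 1
            return
        if tag == self.tag:
            self.depth += 1
        elif tag in ("script", "style"):
            self.skip += 1

    def handle_endtag(self, tag):
        if self.done or self.depth == 0:
            return
        if tag in ("script", "style") and self.skip:
            self.skip -= 1
        elif tag == self.tag:
            self.depth -= 1
            self.done = self.depth == 0

    def handle_data(self, data):
        if self.depth and not self.skip:
            self.chunks.append(data)


def parse(html, content_id="content"):
    extractor = ContentExtractor(content_id)
    extractor.feed(html)
    extractor.close()
    return "".join(extractor.chunks)


def clean(text):
    """Step 3: collapse runs of whitespace into single spaces."""
    return " ".join(text.split())


def store(conn, source, text):
    """Step 4: save the cleaned text (table and column names are hypothetical)."""
    conn.execute("CREATE TABLE IF NOT EXISTS scraped (source TEXT, body TEXT)")
    conn.execute("INSERT INTO scraped VALUES (?, ?)", (source, text))
    conn.commit()


# Hypothetical page standing in for the result of pull().
SAMPLE_PAGE = """<html><body>
<div id="nav">Home | About</div>
<div id="content"><h1>Title</h1>
<p>First   paragraph.</p><div class="ad"><script>var x = 1;</script></div>
<p>Second paragraph.</p></div>
<div id="footer">Copyright</div>
</body></html>"""

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    store(conn, "http://www.example.com/page", clean(parse(SAMPLE_PAGE)))
    print(conn.execute("SELECT body FROM scraped").fetchone()[0])
```

Matching on a single id is about the simplest parsing rule there is; it only throws away scripts and styles inside the container, so visible ad text nested in the content would still come through and would need a rule of its own.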
In the next couple of posts in this series I’ll start delving into each of the scrape types and sources. I’ll even see about giving out some code and useful resources to help you along the way. How many posts are going to be in this series? I really have no idea; it’s one of those poorly planned out posts that I enjoy doing. So I guess as many as are necessary. Likewise, they’ll follow suit with the rest of my series and get increasingly better as the understanding and knowledge of the processes progresses. Expect this series to get very advanced. I may even give out a few secrets I never planned on sharing, should I get a hair up my ass to do so.
Yep, this pretty much summarizes what a content scrape is. It is very detailed and a good review, although it is aimed at people who have heard of scraping but don’t know what it is, or people just lurking around trying to advance their knowledge of SEO.
Charles
There is a nice website/tool/webservice to scrape from other sites. It is not perfect and not as flexible as using regexps, but if you want a job done really fast, it is very nice:
http://www.dappit.com
What I’m wondering is if you could couple this with some kind of semantic translation to create totally unique content that is still readable English. I haven’t found any programs or code that do this, but I wonder if it is possible…
This should be a fun series. Anyone got a request for a site to use as an example of a page scrape and a crawl?
Hey, actually I got inspired by your whole blog (Thanks Eli!) to start playing with scraping.
I am currently writing a lexical tool that takes one web page to recreate different sentences and words, with the same meaning, and of course still readable! My goal is to be able to pass the plagiarism test on a tool such as iThenticate.
But it is some work. It is just starting, but the first test results are not too bad!
By the way, I said in another post I was working in my company on a powerful automation tool that could help Blue/Black Hatters. Unfortunately, this project is on hold for some time (Christmas time brought other priorities to the company!), so I think I’m gonna concentrate on scraping and my lexical tool.
I’ll keep you more informed about it when I have made it work well enough (and hopefully earned some money with it). I can even release it to the public if I feel it is good enough.
One problem with scraping is that you get links from the original source. When somebody clicks on such a link, it usually goes to the original scraped website. The original website owner may then discover in his stats logs that a lot of visitors are coming from your website. He goes there and sees that you just scraped his content, so he may complain to your hosting company, Google, or whoever could get your site shut down. (It seems that this is one of the main reasons why scraped sites often get banned from Google after some weeks.)
So what is the solution?
Usually, remove all the links from your scraped content, or change them to point to your own website.
Easy, but not very good for the user. He clicks thinking he’s gonna go to another article and he may go nowhere…
So I had this idea: Set up a website you could call “www.my-search-engine.com”
with only one main php file that you would invoke this way:
http://www.my-search-engine.com/index.php?url=www.scraped-website.com/whatever-article.html
This main file takes just one parameter, a URL, and should redirect you to that URL.
Then you only need to modify all the links in your scraped content, adding your “search engine invocation” to the links…
e.g.
blablabla<a href="http://www.example.com/article.html">click here</a> blabla
would be transformed into:
blablabla<a href="http://www.my-search-engine.com/index.php?url=www.example.com/article.html">click here</a> blabla
Then when a user clicks on the link, he goes to your “search engine” website and is redirected to the page he wanted to read.
The scraped website owner will see in his logs only some people coming from your fake search engine, so he will even be happy, thinking “cool, I’m indexed in a new search engine”.
Now you’re thinking… eh, how do you do such a redirect page in php?
Hopefully most of you already know, but here is the code anyway, just copy the following into a file called “index.php”:
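<?php
// $HTTP_GET_VARS was removed in PHP 5.4; $_GET is the replacement.
$url = isset($_GET['url']) ? $_GET['url'] : '';
header("Location: http://" . $url);
exit; // stop executing after sending the redirect
?>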
That’s it!!!
So, B-hatters, I’d be interested in your opinion on this technique…
And once again, great work Eli!!!
Another solution to the linking problem mentioned above is to set up a specific page for redirection and pass the actual URL to it like this: href=”/redirect.php?to=redirecturl”, then add a meta refresh tag to redirect.php.
This will forward the user to the URL you specify. Some browsers don’t pass the referrer header when you use a meta refresh, so the original website owner is less likely to find out where the user is coming from.
I will also add a robots noindex/nofollow meta tag to redirect.php and disallow redirect.php in my robots.txt.
Oops!! The code was removed from the above post. Add the redirect and robots noindex/nofollow meta tags to redirect.php.
Thanks Eli. I appreciate the work you are putting into this.
Aur,
Thanks man, that is probably the best comment this blog has ever gotten. You obviously know your stuff. I look forward to your results.
Great post. Cannot wait to read the follow-ups.
Can’t wait to hear from Aur as well. Good stuff.
I would like to know more about Gophers.
Can you guys show me some examples or something?
Thanks for another great post Eli!
One thing that makes me giggle every time I do some scraping is to set my agent to the agent string for GoogleBot *grin*. That way as I’m ripping their content the webmaster likely feels all warm and fuzzy because it looks like GoogleBot is making sweet sweet love to his site
(I use PHP and Snoopy for scraping… that makes setting things like agent and referrer really easy.)
Eli
A couple of questions:
Would you link directly to a WH money site? Should an in-between site be used to protect it from a bad neighbourhood?
Do you filter titles for adult content etc. (unless you want that vertical)?
Eli,
Do you scrape the names of the blogs or the titles of the new blog posts? If you are scraping weblogs then it seems you are just taking the blog name right?
I have a little experience in the field of scraping; it was in fact one of my first attempts at doing anything related to web programming. Really though, with a little coding knowledge and a good eye for patterns, it’s very easy to make a site-wide scraper.
Once you have one or two scrapers in your back pocket, you will find it very quick and easy to convert an existing scraper into one that is useful for your next big project.
Thank you for the introduction to scraping. Has anyone earned a good income from a scraped site? If I am to scrape, I am not going to scrape one site but a few, to make the content unique.
Sometimes it isn’t useful to get that information from a given resource. I would use Wikipedia for writing a lot of that information, wouldn’t you?