- Blue Hat SEO-Advanced SEO Tactics - https://www.bluehatseo.com -

Complete Guide To Scraping Pt. 1

Posted By Eli On November 17, 2006 @ 6:46 am In Guides | 77 Comments

In the spirit of releasing part four of my Wikipedia Links series, we’re going to spend a couple of posts delving into good ol’ black hat, starting of course with scraping. I’ve been getting a few questions lately about scraping and how to do it, so I might as well get it all out of the way, explain the whole damn thing, and maybe someone will hear something they can use. Let’s start at the beginning.

What exactly is scraping?
Scraping is one of those necessary evils that is used simply because writing 20,000+ pages of quality content is a real bitch. So when you’re in need of tons of content really fast, what better way of getting it than copying it from someone else? Teachers in school never imagined you’d be making a living copying other people’s work, did they? The basic idea behind scraping is to grab content from other sources and store it in a database for later use. Those uses include, but are not limited to, putting up huge websites very quickly, updating old websites with new information, creating blogs, filling your spam sites with content, and filling multimedia pages with actual text. Text isn’t the only thing that can be scraped. Anything can be scraped: documents, images, videos, and anything else you could want for your website. Also, just about any source can be scraped. If you can view it or download it, chances are you can figure out a way to copy it. That, my friend, is what scraping is all about. It’s easy, it’s fast and it works very, very well. The potential is also limitless. For now let’s begin with the basics and work our way into the advanced sector and eventually into actual usable code examples.

The goals behind scraping?
The ultimate goals behind scraping are the same as those of actually writing content.
1) Cleanliness- Filter out as much garbage and as many useless tags as possible. The must-have goal behind a good scrape is to get the content clean, without any chunks of the source’s templates or ads remaining in it.

2) Unique Content- The biggest money lies in finding and scraping content that doesn’t exist yet. Another alternative lies in finding content produced by small timers that aren’t even in the search engines and aren’t popular enough for anyone to even know the difference.

3) Quantity- The more the better! This also means finding tons of sources for your content instead of just taking content from one single place. The key here is to integrate many different content sources together seamlessly.

4) Authoritative Content- Try to find content that has already proven itself to be not only search engine friendly but also actually useful to the visitors. Forget everything you’ve ever heard about black hat SEO. It’s not about providing a poor user experience; in fact it’s exactly the opposite. Good content and user experience is what black hat strives for. It’s the ultimate goal. The rest is just sloppiness.

Where do I scrape?
There are basically four general sources that all scraping falls into.
1) Feeds- Really Simple Syndication (RSS) feeds are one of the easiest forms of content to scrape. In fact, that is what RSS was designed for. Remember, not all scraping is stealing; it has its very legitimate uses. RSS feeds give you a quick and easy way to separate out the real content from the templates and other junk that may stand in your way. They also provide useful information about the content such as the date, direct link, author and category. This helps in filtering out content you don’t want.

2) Page Scrapes- Page scrapes involve grabbing an entire page of a website. Then, through a careful process that I’ll go into in further detail later, you filter out the template and all the extra crap, grab just the content, and store it in your database.

3) Gophers- Other portions of the Internet that aren’t websites. This includes places like IRC and newsgroups… aw hell, here’s a list -> [1] Hot New List of Places To Scrape

4) Offline- Sources and databases that aren’t online. As mentioned in the other post: encyclopedias, dictionary files, and let us not forget [2] user manuals.
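
To make the feed option a bit more concrete, here is a rough Python sketch of pulling the content and its metadata out of an RSS 2.0 feed and then filtering on category. Everything in it is an assumption made for illustration: the sample feed, the example.com links, and the function names parse_feed and filter_by_category are all made up, and a real feed may use different or missing elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 document; in practice this text would come from a feed URL.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <item>
      <title>First post</title>
      <link>http://www.example.com/first</link>
      <pubDate>Fri, 17 Nov 2006 06:46:00 GMT</pubDate>
      <category>Guides</category>
      <description>Body of the first post.</description>
    </item>
    <item>
      <title>Second post</title>
      <link>http://www.example.com/second</link>
      <pubDate>Sat, 18 Nov 2006 01:55:00 GMT</pubDate>
      <category>News</category>
      <description>Body of the second post.</description>
    </item>
  </channel>
</rss>"""


def parse_feed(xml_text):
    """Return one dict per <item>, keeping the content and its metadata."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "date": item.findtext("pubDate", default=""),
            "author": item.findtext("author", default=""),
            "category": item.findtext("category", default=""),
            "content": item.findtext("description", default=""),
        })
    return items


def filter_by_category(items, wanted):
    """Keep only the items filed under the wanted category."""
    return [entry for entry in items if entry["category"] == wanted]


guides = filter_by_category(parse_feed(SAMPLE_FEED), "Guides")
print(guides[0]["title"], guides[0]["link"])
```

Missing elements (the sample has no author) simply come back as empty strings, so the same filter idea can be applied to date or author without special cases.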

How Is Scraping Performed?

Scraping is done through a set methodology.
1) Pulling- First you grab the other site and download all its content and text. In the future I will refer to this as an LWP call, because LWP (libwww-perl) is the Perl module that is used to perform the pull action.

2) Parsing- Parsing is nothing short of an art. It involves grabbing the page’s information (as an example) and removing everything that isn’t the actual content (the template and ads for instance).

3) Cleaning- Reformatting the content in preparation for your use. Make the content as clean as possible without any signs of the true source.

4) Storage- Any form of database will work. I prefer MySQL or even flat files (text files).

5) Rewrite- This is the optional step. Sometimes if you’re scraping nonoriginal content it helps to perform some small necessary changes to make it appear as an original. You’ll learn soon enough that I don’t waste my time scraping content if it isn’t original (already in the engines) and focus most of my efforts on grabbing content that isn’t used on any pages that would already exist on search engines.
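
As a rough illustration of the first four steps, here is a minimal Python sketch. It is not the LWP/Perl approach mentioned above; it swaps in Python’s standard library, and it uses SQLite in place of MySQL or flat files purely so the example is self-contained. The sample page, the assumption that the real content lives in a div with id "content", and the function names pull, parse, clean and store are all hypothetical. The optional rewrite step is left out.

```python
import re
import sqlite3
import urllib.request
from html.parser import HTMLParser


def pull(url):
    """Pulling: download the raw HTML of a page."""
    with urllib.request.urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class ContentExtractor(HTMLParser):
    """Parsing: collect text found inside <div id="content"> only."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # 0 = outside the content div, 1+ = div nesting inside it
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth > 0:
            self.depth += 1
        elif dict(attrs).get("id") == "content":
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def parse(html):
    """Run the extractor over a page and return the raw content text."""
    extractor = ContentExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(extractor.chunks)


def clean(text):
    """Cleaning: collapse runs of whitespace left behind by the markup."""
    return re.sub(r"\s+", " ", text).strip()


def store(connection, url, content):
    """Storage: save the cleaned text keyed by its source URL."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, content TEXT)"
    )
    connection.execute(
        "INSERT OR REPLACE INTO pages (url, content) VALUES (?, ?)", (url, content)
    )
    connection.commit()


# Hypothetical page; pull("http://www.example.com/article.html") would fetch a real one.
SAMPLE_PAGE = """<html><body>
<div id="header">Site name and navigation</div>
<div id="content"><h1>Article title</h1>
<p>First   paragraph.</p><div class="inner">Nested note.</div></div>
<div id="ads">Buy things</div>
</body></html>"""

db = sqlite3.connect(":memory:")
store(db, "http://www.example.com/article.html", clean(parse(SAMPLE_PAGE)))
print(db.execute("SELECT content FROM pages").fetchone()[0])
```

The header and ad blocks never reach the database because the parser only records text while it is inside the content div. Real templates are rarely this tidy, so the id being matched is the part you would expect to change from site to site.
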

In the next couple of posts in this series I’ll start delving into each of the scrape types and sources. I’ll even see about giving out some code and useful resources to help you along the way. How many posts are going to be in this series? I really have no idea; it’s one of those poorly planned out posts that I enjoy doing. So I guess as many as are necessary. Likewise, they’ll follow suit with the rest of my series and get increasingly better as the understanding and knowledge of the processes progresses. Expect this series to get very advanced. I may even give out a few secrets I never planned on sharing should I get a hair up my ass to do so.


77 Comments To "Complete Guide To Scraping Pt. 1"

#1 Comment By Charles Mullen On November 17, 2006 @ 9:44 am

Yep, this pretty much summarizes what a content scrape is. Very detailed and it is a good review although it is aimed towards people who have heard of it and don’t know what it is or people just lurking around trying to advance their knowledge of SEO.

Charles

#2 Comment By NSA On November 17, 2006 @ 10:50 am

There is a nice website/tool/webservice to scrape from other sites. It is not perfect and not as flexible as using regexps, but if you want a job done really fast, it is very nice:
[3] http://www.dappit.com

#3 Comment By doc On November 17, 2006 @ 3:51 pm

What I’m wondering is if you could couple this with some kind of semantic translation to create totally unique content that was still readable English. I haven’t found any programs or code that do this but I wonder if it is possible…

#4 Comment By Eli On November 18, 2006 @ 1:55 am

This should be a fun series. Anyone got a request for a site to use as an example of a page scrape and a crawl?

#5 Comment By Aur On November 18, 2006 @ 7:44 am

Hey, actually I got inspired by your whole blog (Thanks Eli!) to start playing with scraping.

I am currently writing a lexical tool that takes one web page to recreate different sentences and words, with the same meaning, and of course still readable! My goal is to be able to pass the plagiarism test on a tool such as iThenticate.

But it is some work, it is just starting, but the first results I did to test are not too bad!
By the way I said in another post I was working in my company on a powerful automation tool, that could help Blue/Black Hatters, unfortunately, this project is on hold for some time (christmas time brought other priorities to the company!) so I think I’m gonna concentrate on scraping and my lexical tool.
I’ll keep you more informed about it when I have made it work well enough (and hopefully earned some money with it :) ), I can even release it to the public if I feel it is good enough.

One problem with scraping is that you get links from the original source. What is wrong then is that when somebody clicks on the link, usually the link goes to the original scraped website. Then the original website owner may discover in his stats logs that a lot of visitors are coming from your website. He goes there and he sees that you just scraped his content, so he may complain to your hosting company, Google, or whoever could get your site shut down. (It seems that this is one of the main reasons why scraped sites often get banned from Google after some weeks.)

So what is the solution ?
Usually, remove all the links from your scraped content, or change them to point to your own website
Easy but not very good for the user. He clicks and thinks he’s gonna go on another article and he may go to nowhere…

So I had this idea : Set up a website you could call “www.my-search-engine.com”
with only one main php file that you would invoke this way:
[4] http://www.my-search-engine.com/index.php?url=www.scraped-website.com/whatever-article.html
This main file just takes one parameter: a url, and should redirect you to this url.

Then you only need to modify all links in your scraped content, adding your “search engine invocation” to the links…

e.g.
blablabla<a href="http://www.example.com/article.html">click here</a> blabla

would be transformed into :

blablabla<a href="http://www.my-search-engine.com/index.php?url=www.example.com/article.html">click here</a> blabla

Then when a user clicks on the link he goes to your “search engine” website and he is redirected to the page he wanted to read.
The scraped website owner will see on his logs only some people coming from your fake search engine, so he will even be happy thinking “cool, I’m indexed in a new search engine”.

Now you’re thinking… eh, how do you do such a redirect page in php?
Hopefully most of you already know, but here is the code anyway, just copy the following into a file called “index.php”:

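<?php
// Redirect the visitor to the URL passed in the "url" query parameter.
// $_GET replaces $HTTP_GET_VARS, which was removed in later PHP versions.
header("Location: http://" . $_GET['url']);
exit;
?>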

That’s it!!!
So, B-hatters, I’d be interested in your opinion on this technique…

And once again, great work Eli!!!

#6 Comment By Will On November 18, 2006 @ 11:22 am

Thanks Eli. I appreciate the work you are putting into this.

#7 Comment By Eli On November 18, 2006 @ 12:20 pm

Aur,
thanks man that is probably the best comment this blog has ever gotten. You obviously know your stuff. I look forward to your results :)

#8 Comment By Outlaw On November 19, 2006 @ 8:31 pm

Great post. Cannot wait to read the follow-ups.

Cant wait to hear from Aur as well. Good stuff.

#9 Comment By James On November 19, 2006 @ 9:42 pm

I would like to know more about Gophers

Can you guys show me some examples or something?

#10 Comment By Aik On January 2, 2007 @ 11:07 pm

Another solution to the linking problem mentioned above is to set up a specific page for redirection and pass the actual url to it like this href=”/redirect.php?to=redirecturl” and add this meta tag to redirect.php.

This will forward the user to the url you specify. Browsers don’t pass the referrer header if you use meta refresh so the original website owner will never find out where the user is coming from.

I will also add to redirect.php and disallow redirect.php in my robots.txt.

#11 Comment By Aik On January 2, 2007 @ 11:15 pm

oops!! the code was removed from the above post. Add the redirect and robots noindex/nofollow metatags to redirect.php.

#12 Comment By GimP On February 16, 2007 @ 10:00 pm

Thanks for another great post Eli!

One thing that makes me giggle every time I do some scraping is to set my agent to the agent string for GoogleBot *grin*. That way as I’m ripping their content the webmaster likely feels all warm and fuzzy because it looks like GoogleBot is making sweet sweet love to his site :)

(I use PHP and Snoopy for scraping… that makes setting things like agent and referrer really easy.)

#13 Comment By Foz On May 6, 2007 @ 4:29 pm

Eli
Couple of questions

Would you link direct to a WH money site? Should an inbetween site be use to protect it from a bad neighbourhood?

Do you filter titles for adult etc (unless you want that vertical)

#14 Comment By Peter On May 8, 2007 @ 4:03 am

Eli,

Do you scrape the names of the blogs or the titles of the new blog posts? If you are scraping weblogs then it seems you are just taking the blog name right?

#15 Comment By Jim On September 23, 2007 @ 3:45 am

I have a little experience in the field of scraping; it was in fact one of my first attempts at doing anything related to web programming. Really though, with a little coding knowledge and a good eye for patterns it’s very easy to make a site-wide scraper.

Once you have one or two scrapers in your back pocket you will find it very quick and easy to convert any current scrapers into one that is useful for your next big project.

#16 Comment By mystery pua On January 26, 2008 @ 7:30 pm

Thank you for the introduction to scraping. Has anyone earned a good income from a scraped site?? If I am to scrape, I am not going to scrape one but a few to make the content unique.

#17 Comment By MSN hack On February 24, 2008 @ 3:04 pm

Sometimes it is unuseful to get that information from some resource item. I would use WikiPedia for writing alot of that information, wouldn’t you?

#18 Comment By West Coast Vinyl On June 28, 2009 @ 4:15 pm

Now we know where to go for some content spinning. thanks!

#19 Comment By jean On March 23, 2010 @ 10:08 am

I am glad that someone finally figure out how to manage online multiple MySQL servers.

#20 Comment By Calgary Catering On April 6, 2010 @ 7:42 am

Hi, I know it’s been years since you wrote this but I just want to say thanks for writing this and it’s still very relevant information up to now. I’m trying to learn SEO and scraping is one of the black hat techniques that I need to use ASAP.

#21 Comment By kral On May 4, 2010 @ 1:49 pm

thanks man that is probably the best comment this blog has ever gotten.

#22 Comment By Купить отопительную технику On June 29, 2010 @ 4:56 pm

Thank you for the introduction to scraping. Has anyone earned a good income from a scraped site??

#23 Comment By Купить отопительную технику On June 29, 2010 @ 5:01 pm

Thank you for the introduction to scraping.
Sometimes it is unuseful to get that information from some resource item.

#24 Comment By Bell On July 18, 2010 @ 2:07 pm

You just gotta love this idea. I wrote a similar blog and got a unexpected amount of feedback. It is a rare article that is both entertaining and informative.

#25 Comment By men air max shoes On August 15, 2010 @ 5:42 pm

Tucked away in our nike air max 2010 mens running shoes subconscious is an idyllic vision. We see ourselves on a long nike air max 2010 trip that spans the continent. We are traveling by train. Out air max 2009 windows, we drink in the passing scene of cars on nearby highways, of children waving at nike red air max 2009 crossing, of cattle grazing on air max 95 black distant hillside, of smoke pouring from a power plant, of row upon nike air 95 row of corn and wheat, of flatlands and valleys, of mountains and rolling air max 90 hillsides, of city skylines and village halls. But uppermost in our minds is the final destination. On a certain blue air max 90 for women day at a certain hour, we will pull into the station. Bands will be playing and flags nike air max 180 waving. Once we get there, so many wonderful dreams will come true and the pieces of our black nike air max shoes lives will fit together like a completed jigsaw puzzle. How restlessly we pace the aisles, bing the minutes for nike air max light shoes loitering –waiting, waiting, waiting for womens nike air max station.Sooner or later, we must realize there is no station, no one air max ltd classic place to arrive at once and for all. The true joy of air max shoes store life is the trip. The station is only a dream. It constantly outdistances us.So stop pacing the aisles and counting the miles. In stead, climb more nike air max white mountains, eat more ice cream, go barefoot more often, swim more rivers, watch more women air max shoes sunsets, laugh more, cry less. Life must be lived as we go along. The station will come men air max shoes soon enough. [7] http://www.sellnikeairmax.com/ LIJ

#26 Comment By medyum On August 21, 2010 @ 9:22 am

The information you provided was very useful. Because of your help, thank you.
[8] http://www.medyumsitesi.com
medyum

#27 Comment By t a l a l On August 30, 2010 @ 10:14 pm

looking for more info

#28 Comment By t a l a l On August 30, 2010 @ 10:16 pm

thanks for sharing dude

#29 Comment By India Tour Packeges On October 11, 2010 @ 8:48 pm

Yes your logic is correct, try it on not just one social network site, but many!

#30 Comment By Kosmetika On November 6, 2010 @ 5:57 am

Sometimes it is unuseful to get that information from some resource item. I would use WikiPedia for writing alot of that information, wouldn’t you?

#31 Comment By wolanlw On November 7, 2010 @ 10:34 pm

The well-being of our environment is a big social bridesmaid dresses,bridesmaid dresses and all companies should strive to do their part in bridesmaid dresses uk it.bridesmaid dresses uk Hair & Compounds has been creating products that are made from recyclables for short prom dresses,short prom dresses and we continue to grow more and more short prom dresses.
Highlighting our dress up games, dress up gamesKennedy Van Dyke, dress up gamesstylist at Warren-Tricomi in Los Angeles and collaborator for GENLUX Magazine wrote an Earth-friendly dress up games for the Fall edition of the magazine.

#32 Comment By SEO Miami On November 28, 2010 @ 12:39 pm

I don’t think scraping is every going away anytime soon. It’s too easy!

#33 Comment By asics tiger shoes On March 22, 2011 @ 5:43 pm

its so easy you say

#34 Comment By More Control On May 20, 2011 @ 8:52 am

As ever its interesting concept to create additional content and building up the online presence.

#35 Comment By ผ้าม่านราคาถูก On June 20, 2011 @ 2:37 am

right

#36 Comment By kadın On July 29, 2011 @ 3:55 am

I do agree with all of the ideas you have presented in your post. They’re really convincing and will definitely work. Still, the posts are too short for newbies. Could you please extend them a bit from next time? Thanks for the post.

#37 Comment By Caldaie Saunier Duval On September 24, 2011 @ 7:51 am

Could you please extend them a bit from next time? Thanks for the post.

#38 Comment By مدونة On September 26, 2011 @ 8:57 am

keep it up
thanx

#39 Comment By Balenciaga Handbags Shop On October 19, 2011 @ 2:53 am

Balenciaga Handbags Shopfh

#40 Comment By Property Marbella On October 23, 2011 @ 2:06 am

Thank you for the introduction to scraping going try it on not just one social network site,

#41 Comment By شات مصرى On December 19, 2011 @ 6:22 am

nice chat p7bk good website bloog chat egypt girl

#42 Comment By Nitish On January 6, 2012 @ 2:03 pm

Great Post Eli it was worth a read

#43 Comment By Nitish On January 6, 2012 @ 2:04 pm

I can’t wait for the followups

#44 Comment By Nitish On January 6, 2012 @ 2:04 pm

Sorry basically have no clue about what you saying.

#45 Comment By Nitish On January 6, 2012 @ 2:05 pm

We all do my friend, we all do.

#46 Comment By Ismat Zahra On January 8, 2012 @ 10:44 am

yeah true very nice Eli..

#47 Comment By Ismat Zahra On January 8, 2012 @ 10:47 am

yeah we all doo….

#48 Comment By Ismat Zahra On January 8, 2012 @ 10:47 am

ohh my my nice….

#49 Comment By Ismat Zahra On January 8, 2012 @ 10:48 am

yes u are ryt Eli.. :)

#50 Comment By Ismat Zahra On January 8, 2012 @ 10:49 am

what u are trying u say ?? :S

#51 Comment By Ismat Zahra On January 8, 2012 @ 10:50 am

yeah true nitish ;)

#52 Comment By Ismat Zahra On January 8, 2012 @ 10:51 am

can u plz change ur Comment :S

#53 Comment By Ismat Zahra On January 8, 2012 @ 10:51 am

yeah true Keep it up :)

#54 Comment By Property Marbella On March 7, 2012 @ 4:23 am

Thanks so much a very good guide, perfect.

#55 Comment By Crafts Factory On March 11, 2012 @ 3:13 am

I am glad that someone finally figure out how to manage online multiple MySQL servers.

#56 Comment By شات صوتي On March 17, 2012 @ 12:08 pm

okkkkkkkkkkkkkkkkkkkkkkk

#57 Comment By دردشة صوتية On March 17, 2012 @ 12:08 pm

yesssssssssssssss

#58 Comment By شات كام On March 17, 2012 @ 12:08 pm

اووووووووووووووك

#59 Comment By Kartveli2012 On March 23, 2012 @ 1:50 pm

This is a comment of [9] azerbaijan kkk

#60 Comment By Broadband blogger On March 31, 2012 @ 6:37 am

Yes its really true! Some articles posted mostly doing backlink are scrap. I have read a lot of these just to put there keywords, and the content makes no sense at all.

#61 Comment By Life Insurance Over 85 Years Old On April 5, 2012 @ 8:09 pm

Scraping must be very properly done else it may have some side effects.

#62 Comment By Life Insurance Over 85 Years Old On April 5, 2012 @ 8:10 pm

I have always support Eli website.

#63 Comment By Life Insurance Over 85 Years Old On April 5, 2012 @ 8:11 pm

I think with the complete guidance, it is really easy unless you never read the instructions clearly.

#64 Comment By Life Insurance Over 85 Years Old On April 5, 2012 @ 8:11 pm

no problem

#65 Comment By Life Insurance Over 85 Years Old On April 5, 2012 @ 8:12 pm

Ya, you bet!!!

#66 Comment By Property Marbella On April 25, 2012 @ 12:12 am

I really have no idea, its one of those poorly planned out posts that I enjoy doing.

#67 Comment By FixCleaner review On April 29, 2012 @ 9:11 am

I don’t think scraping is a good idea

#68 Comment By Homogenizer On June 28, 2012 @ 9:14 pm

This kind of Homogenizer can be installed in the rack with a lifting function. You can quickly and easily lift up the mixer by turning the handle or by pressing the motor button. Its unload is also easy to operate, which greatly broadened the scope of use of the Stand Mixer, therefore make the operation more safe, more convenient, and faster.

#69 Comment By Mercy Ministries On July 16, 2012 @ 1:47 pm

Thank you for these following tips. It is really good. Good thing I saw your post. This would be really helpful.

#70 Comment By ben 10 On July 20, 2012 @ 8:22 am

It is a rare article that is both entertaining and informative.

#71 Comment By buffalo website design On August 23, 2012 @ 11:22 pm

scrapping guide is a tutorial of sorts…good post

#72 Comment By thong cong On September 2, 2012 @ 11:25 pm

I am glad that someone finally figure out how to manage online multiple MySQL servers.

#73 Comment By chong tham On September 8, 2012 @ 6:16 am

Thanks Eli. I appreciate the work you are putting into this.

#74 Comment By pensiuni arad On September 15, 2012 @ 6:09 pm

wow good for multiple management on my pensiuni in arad website

#75 Comment By Jasmine @ Callme.lk On October 2, 2012 @ 11:40 pm

You participating in those sites, and building up your friendships and get high traffic may be.

#76 Comment By sherman billingsley On October 8, 2012 @ 6:49 am

great guide for scraping

#77 Comment By Jasmine @ Callme.lk On October 8, 2012 @ 11:51 pm

Thank you for sharing this information. The information was very helpful and saved a lot of my time.thanks once again.


URL to article: https://www.bluehatseo.com/complete-guide-to-scraping-pt-1/

URLs in this post:
[1] Hot New List of Places To Scrape: https://www.bluehatseo.com/hot-new-list-of-places-to-scrape/
[2] user manuals: https://www.bluehatseo.com/links-through-document-links/
[3] http://www.dappit.com: http://www.dappit.com
[4] http://www.my-search-engine.com/index.php?url=www.scraped-website.com/whatever-article.html: http://www.my-search-engine.com/index.php?url=www.scraped-website.com/whatever-article.html
[5] click here: http://www.example.com/article.html
[6] http://www.my-search-engine.com/index.php?url=www.example.com/article: http://www.my-search-engine.com/index.php?url=www.example.com/article
[7] http://www.sellnikeairmax.com/: http://www.sellnikeairmax.com/
[8] http://www.medyumsitesi.com: http://www.medyumsitesi.com
[9] azerbaijan : http://mysite.com/ragacaragaca
