Blue Hat SEO-Advanced SEO Tactics

Guides, 17 Nov 2006 06:46 am

In the spirit of releasing part four of my Wikipedia Links series, we’re going to spend a couple of posts delving into good ol’ black hat, starting, of course, with scraping. I’ve been getting a few questions lately about scraping and how to do it, so I might as well get it all out of the way, explain the whole damn thing, and maybe someone will hear something they can use. Let’s start at the beginning.

What exactly is scraping?
Scraping is one of those necessary evils that is used simply because writing 20,000+ pages of quality content is a real bitch. So when you’re in need of tons of content really fast, what better way of getting it than copying it from someone else? Teachers in school never imagined you’d be making a living copying other people’s work, did they? The basic idea behind scraping is to grab content from other sources and store it in a database for later use. Those uses include, but are not limited to, putting up huge websites very quickly, updating old websites with new information, creating blogs, filling your spam sites with content, and filling multimedia pages with actual text. Text isn’t the only thing that can be scraped. Anything can be: documents, images, videos, and whatever else you could want for your website. Just about any source can be scraped, too. If you can view it or download it, chances are you can figure out a way to copy it. That, my friend, is what scraping is all about. It’s easy, it’s fast, and it works very, very well. The potential is also limitless. For now, let’s begin with the basics and work our way into the advanced sector and eventually into actual usable code examples.

The goals behind scraping?
The ultimate goals behind scraping are the same as those behind actually writing content.
1) Cleanliness- Filter out as much garbage and as many useless tags as possible. The must-have goal of a good scrape is to get the content clean, without any chunks of the source’s templates or ads remaining in it.

2) Unique Content- The biggest money lies in finding and scraping content that doesn’t exist in the search engines yet. An alternative lies in finding content produced by small-timers who aren’t in the search engines at all and aren’t popular enough for anyone to know the difference.

3) Quantity- The more the better! This also means finding tons of sources for your content instead of taking it all from one single place. The key here is to integrate many different content sources together seamlessly.

4) Authoritative Content- Try to find content that has already proven itself to be not only search engine friendly but also actually useful to visitors. Forget everything you’ve ever heard about black hat SEO. It’s not about providing a poor user experience; in fact, it’s exactly the opposite. Good content and a good user experience are what black hat strives for. It’s the ultimate goal. The rest is just sloppiness.

Where do I scrape?
There are basically four general sources that all scraping falls into.
1) Feeds- Really Simple Syndication (RSS) feeds are one of the easiest forms of content to scrape. In fact, that is what RSS was designed for. Remember, not all scraping is stealing; it has very legitimate uses. RSS feeds give you a quick and easy way to separate the real content from the templates and other junk that may stand in your way. They also provide useful information about the content, such as the date, direct link, author, and category. This helps in filtering out content you don’t want.

2) Page Scrapes- Page scrapes involve grabbing an entire page of a website. Then, through a careful process that I’ll go into in further detail later, you filter out the template and all the extra crap, grab just the content, and store it in your database.

3) Gophers- Other portions of the Internet that aren’t websites. This includes many places like IRC, newsgroups… aw hell, here’s a list -> Hot New List of Places To Scrape

4) Offline- Sources and databases that aren’t online. As mentioned in the other post: encyclopedias, dictionary files, and, let us not forget, user manuals.
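
To make the point about feeds concrete, here is a minimal sketch of pulling the date, link, author, and category out of an RSS 2.0 feed and filtering on them. It is written in Python rather than anything used on this blog, the feed is a made-up sample inlined as a string, and the helper names `parse_items` and `filter_by_category` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, inlined so the sketch runs without network access.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <item>
      <title>First post</title>
      <link>http://www.example.com/first.html</link>
      <pubDate>Fri, 17 Nov 2006 06:46:00 GMT</pubDate>
      <author>author@example.com</author>
      <category>Guides</category>
      <description>Body of the first post.</description>
    </item>
    <item>
      <title>Second post</title>
      <link>http://www.example.com/second.html</link>
      <pubDate>Sat, 18 Nov 2006 01:55:00 GMT</pubDate>
      <category>News</category>
      <description>Body of the second post.</description>
    </item>
  </channel>
</rss>"""

FIELDS = ("title", "link", "pubDate", "author", "category", "description")


def parse_items(rss_text):
    """Return one dict per <item>; a field missing from the feed becomes None."""
    root = ET.fromstring(rss_text)
    return [
        {name: item.findtext(name) for name in FIELDS}
        for item in root.iter("item")
    ]


def filter_by_category(items, wanted):
    """Keep only the items whose category equals `wanted`."""
    return [item for item in items if item["category"] == wanted]


items = parse_items(SAMPLE_RSS)
guides = filter_by_category(items, "Guides")
print(guides[0]["title"], guides[0]["link"])
```

Because the feed already separates each field, there is no template to strip; the filtering step is just a comparison on the metadata.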

How Is Scraping Performed?
Scraping is done through a set methodology.
1) Pulling- First you grab the other site and download all its content and text. In the future I will refer to this as an LWP call, because LWP (libwww-perl) is the Perl module used to perform the pull action.

2) Parsing- Parsing is nothing short of an art. It involves taking the downloaded material (a page’s HTML, for example) and removing everything that isn’t the actual content (the template and ads, for instance).

3) Cleaning- Reformatting the content in preparation for your use. Make the content as clean as possible, without any signs of the true source.

4) Storage- Any form of database will work. I prefer MySQL or even flat files (text files).

5) Rewrite- This is the optional step. Sometimes, if you’re scraping non-original content, it helps to make some small changes so it appears original. You’ll learn soon enough that I don’t waste my time scraping content that isn’t original (that is, content already in the engines); I focus most of my efforts on grabbing content that isn’t used on any pages the search engines already have.

In the next couple of posts in this series I’ll start delving into each of the scrape types and sources. I’ll even see about giving out some code and useful resources to help you along the way. How many posts are going to be in this series? I really have no idea; it’s one of those poorly planned-out series that I enjoy doing. So I guess as many as are necessary. Likewise, they’ll follow suit with the rest of my series and get increasingly better as the understanding and knowledge of the processes progress. Expect this series to get very advanced. I may even give out a few secrets I never planned on sharing, should I get a hair up my ass to do so.
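
As a rough illustration of the parse, clean, and store steps listed above, here is a minimal sketch in Python (not the Perl/LWP setup the post refers to). Everything in it is an assumption made for the example: the sample page is inlined instead of being pulled (in Python the pull would typically be a `urllib.request.urlopen` call), the content is assumed to sit in an element with `id="content"`, storage is an in-memory SQLite table named `pages`, and the names `ContentExtractor`, `parse`, `clean`, and `store` are hypothetical.

```python
import re
import sqlite3
from html.parser import HTMLParser

# Hypothetical page, inlined so the sketch runs without network access.
SAMPLE_PAGE = """<html><head><title>Demo</title>
<style>body { color: black; }</style></head>
<body><div id="nav">Home | About</div>
<div id="content"><h1>Article title</h1>
<p>First   paragraph of the article.</p>
<p>Second paragraph.</p></div>
<script>trackVisitor();</script></body></html>"""


class ContentExtractor(HTMLParser):
    """Collect text inside the element with a given id, skipping script/style."""

    SKIP = {"script", "style"}
    VOID = {"br", "img", "hr", "input", "meta", "link"}  # tags with no end tag

    def __init__(self, content_id):
        super().__init__()
        self.content_id = content_id
        self.depth = 0      # nesting depth inside the content element
        self.skipping = 0   # nesting depth inside script/style
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.content_id:
            self.depth = 1
        if tag in self.SKIP:
            self.skipping += 1

    def handle_endtag(self, tag):
        if tag in self.VOID:
            return
        if tag in self.SKIP and self.skipping:
            self.skipping -= 1
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and not self.skipping:
            self.chunks.append(data)


def parse(html, content_id="content"):
    """Parsing: keep only the text that sits inside the content element."""
    parser = ContentExtractor(content_id)
    parser.feed(html)
    parser.close()
    return parser.chunks


def clean(chunks):
    """Cleaning: join the pieces and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", " ".join(chunks)).strip()


def store(conn, source, text):
    """Storage: one row per scraped page."""
    conn.execute("CREATE TABLE IF NOT EXISTS pages (source TEXT, body TEXT)")
    conn.execute("INSERT INTO pages (source, body) VALUES (?, ?)", (source, text))
    conn.commit()


conn = sqlite3.connect(":memory:")
text = clean(parse(SAMPLE_PAGE))
store(conn, "sample-page", text)
print(conn.execute("SELECT body FROM pages").fetchone()[0])
```

Keying the parser on a single element id is the simplest possible template filter; real pages rarely mark their content that neatly, which is why the post calls parsing an art.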

88 Comments»

Comment by Charles Mullen

- 2006-11-17 09:44:24

Yep, this pretty much summarizes what a content scrape is. Very detailed and it is a good review although it is aimed towards people who have heard of it and don’t know what it is or people just lurking around trying to advance their knowledge of SEO.

Charles

Comment by Abbigliamento donna

- 2013-03-01 15:55:05

It’s important however to avoid duplicated contents!

Comment by NSA

- 2006-11-17 10:50:24

There is a nice website/tool/webservice to scrape from other sites. It is not perfect and not as flexible as using regexps, but if you want a job done really fast, it is very nice:
http://www.dappit.com

Comment by doc

- 2006-11-17 15:51:49

What I’m wondering is if you could couple this with some kind of semantic translation to create totally unique content that is still readable English. I haven’t found any programs or code that do this, but I wonder if it is possible…

Comment by Eli

- 2006-11-18 01:55:38

This should be a fun series. Anyone got a request for a site to use as an example of a page scrape and a crawl?

Comment by Aur

- 2006-11-18 07:44:54

Hey, actually I got inspired by your whole blog (Thanks Eli!) to start playing with scraping.

I am currently writing a lexical tool that takes one web page to recreate different sentences and words, with the same meaning, and of course still readable! My goal is to be able to pass the plagiarism test on a tool such as iThenticate.

It is some work and it is just getting started, but the first test results are not too bad!
By the way, I said in another post that I was working at my company on a powerful automation tool that could help Blue/Black Hatters. Unfortunately, this project is on hold for some time (Christmas time brought other priorities to the company!), so I think I’m gonna concentrate on scraping and my lexical tool.
I’ll keep you informed about it when I have made it work well enough (and hopefully earned some money with it). I can even release it to the public if I feel it is good enough.

One problem with scraping is that you get links from the original source. What is wrong then is that when somebody clicks on a link, it usually goes to the original scraped website. The original website owner may then discover in his stats logs that a lot of visitors are coming from your website. He goes there and sees that you just scraped his content, so he may complain to your hosting company, Google, or whoever could get your site shut down. (It seems that this is one of the main reasons why scraped sites often get banned from Google after some weeks.)

So what is the solution?
Usually, you remove all the links from your scraped content, or change them to point to your own website.
Easy, but not very good for the user. He clicks thinking he’s gonna go to another article, and he may go nowhere…

So I had this idea: set up a website you could call “www.my-search-engine.com”
with only one main PHP file that you would invoke this way:
http://www.my-search-engine.com/index.php?url=www.scraped-website.com/whatever-article.html
This main file takes just one parameter, a URL, and should redirect you to that URL.

Then you only need to modify all links in your scraped content, adding your “search engine invocation” to the links…

e.g.
blablabla <a href="http://www.example.com/article.html">click here</a> blabla

would be transformed into:

blablabla <a href="http://www.my-search-engine.com/index.php?url=www.example.com/article.html">click here</a> blabla

Then when a user clicks on the link, he goes to your “search engine” website and is redirected to the page he wanted to read.
The scraped website’s owner will see in his logs only some people coming from your fake search engine, so he will even be happy, thinking “cool, I’m indexed in a new search engine”.

Now you’re thinking… eh, how do you do such a redirect page in PHP?
Hopefully most of you already know, but here is the code anyway. Just copy the following into a file called “index.php”:


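<?php
// Redirect to the address passed in the "url" query parameter.
// $_GET replaces $HTTP_GET_VARS, which was removed in PHP 5.4.
header("Location: http://" . $_GET['url']);
exit;
?>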

That’s it!!!
So, B-hatters, I’d be interested in your opinion on this technique…

And once again, great work Eli!!!

Comment by Aik

- 2007-01-02 23:07:34

Another solution to the linking problem mentioned above is to set up a specific page for redirection and pass the actual URL to it like this: href="/redirect.php?to=redirecturl", and add this meta tag to redirect.php.

This will forward the user to the url you specify. Browsers don’t pass the referrer header if you use meta refresh so the original website owner will never find out where the user is coming from.

I will also add to redirect.php and disallow redirect.php in my robots.txt.

Comment by Aik

- 2007-01-02 23:15:57

oops!! the code was removed from the above post. Add the redirect and robots noindex/nofollow metatags to redirect.php.

Comment by Ismat Zahra

- 2012-01-08 10:47:47

ohh my my nice….

Comment by Will

- 2006-11-18 11:22:50

Thanks Eli. I appreciate the work you are putting into this.

Comment by Nitish

- 2012-01-06 14:05:12

We all do my friend, we all do.

Comment by Ismat Zahra

- 2012-01-08 10:47:02

yeah we all doo….

Comment by Eli

- 2006-11-18 12:20:57

Aur,
thanks man that is probably the best comment this blog has ever gotten. You obviously know your stuff. I look forward to your results

Comment by Ismat Zahra

- 2012-01-08 10:48:11

yes u are ryt Eli..

Comment by Outlaw

- 2006-11-19 20:31:28

Great post. Cannot wait to read the follow-ups.

Cant wait to hear from Aur as well. Good stuff.

Comment by Nitish

- 2012-01-06 14:04:20

I can’t wait for the followups

Comment by James

- 2006-11-19 21:42:28

I would like to know more about Gophers

Can you guys show me some examples or something?

Comment by Nitish

- 2012-01-06 14:04:43

Sorry basically have no clue about what you saying.

Comment by Ismat Zahra

- 2012-01-08 10:49:10

what u are trying u say ?? :S

Comment by GimP

- 2007-02-16 22:00:47

Thanks for another great post Eli!

One thing that makes me giggle every time I do some scraping is to set my agent to the agent string for GoogleBot *grin*. That way as I’m ripping their content the webmaster likely feels all warm and fuzzy because it looks like GoogleBot is making sweet sweet love to his site

(I use PHP and Snoopy for scraping… that makes setting things like agent and referrer really easy.)

Comment by Foz

- 2007-05-06 16:29:54

Eli
Couple of questions

Would you link directly to a WH money site? Should an in-between site be used to protect it from a bad neighbourhood?

Do you filter titles for adult etc (unless you want that vertical)

Comment by Peter

- 2007-05-08 04:03:50

Eli,

Do you scrape the names of the blogs or the titles of the new blog posts? If you are scraping weblogs then it seems you are just taking the blog name right?

Comment by Jim

- 2007-09-23 03:45:23

I have a little experience in the field of scraping; it was in fact one of my first attempts at doing anything related to web programming. Really, though, with a little coding knowledge and a good eye for patterns, it’s very easy to make a site-wide scraper.

Once you have one or two scrapers in your back pocket you will find it very quick and easy to convert any current scrapers into one that is useful for your next big project.

Comment by mystery pua

- 2008-01-26 19:30:21

Thank you for the introduction to scraping. Has anyone earned a good income from a scraped site?? If I am to scrape, I am not going to scrape one but a few to make the content unique.

Comment by MSN hack

- 2008-02-24 15:04:39

Sometimes it is unuseful to get that information from some resource item. I would use WikiPedia for writing alot of that information, wouldn’t you?

Comment by West Coast Vinyl

- 2009-06-28 16:15:57

Now we know where to go for some content spinning. thanks!

Comment by jean

- 2010-03-23 10:08:25

I am glad that someone finally figure out how to manage online multiple MySQL servers.

Comment by Calgary Catering

- 2010-04-06 07:42:03

Hi, I know it’s been years since you wrote this but I just want to say thanks for writing this and it’s still very relevant information up to now. I’m trying to learn SEO and scraping is one of the black hat techniques that I need to use ASAP.

Comment by kral

- 2010-05-04 13:49:47

thanks man that is probably the best comment this blog has ever gotten.

Comment by Купить отопительную технику

- 2010-06-29 16:56:05

Thank you for the introduction to scraping. Has anyone earned a good income from a scraped site??

Comment by Купить отопительную технику

- 2010-06-29 17:01:49

Thank you for the introduction to scraping.
Sometimes it is unuseful to get that information from some resource item.

Comment by Bell

- 2010-07-18 14:07:09

You just gotta love this idea. I wrote a similar blog and got a unexpected amount of feedback. It is a rare article that is both entertaining and informative.

Comment by Life Insurance Over 85 Years Old

- 2012-04-05 20:12:21

Ya, you bet!!!

Comment by medyum

- 2010-08-21 09:22:37

The information you provided was very useful. Because of your help, thank you.
http://www.medyumsitesi.com
medyum

Comment by t a l a l

- 2010-08-30 22:14:42

looking for more info

Comment by FixCleaner review

- 2012-04-29 09:11:57

I don’t think scraping is a good idea

Comment by t a l a l

- 2010-08-30 22:16:11

thanks for sharing dude

Comment by Life Insurance Over 85 Years Old

- 2012-04-05 20:11:33

no problem

Comment by India Tour Packeges

- 2010-10-11 20:48:38

Yes your logic is correct, try it on not just one social network site, but many!

Comment by Kosmetika

- 2010-11-06 05:57:54

Sometimes it is unuseful to get that information from some resource item. I would use WikiPedia for writing alot of that information, wouldn’t you?

Comment by SEO Miami

- 2010-11-28 12:39:34

I don’t think scraping is every going away anytime soon. It’s too easy!

Comment by asics tiger shoes

- 2011-03-22 17:43:35

its so easy you say

Comment by Life Insurance Over 85 Years Old

- 2012-04-05 20:11:16

I think with the complete guidance, it is really easy unless you never read the instructions clearly.

Comment by More Control

- 2011-05-20 08:52:31

As ever its interesting concept to create additional content and building up the online presence.

Comment by ผ้าม่านราคาถูก

- 2011-06-20 02:37:23

right

Comment by kadın

- 2011-07-29 03:55:19

I do agree with all of the ideas you have presented in your post. They’re really convincing and will definitely work. Still, the posts are too short for newbies. Could you please extend them a bit from next time? Thanks for the post.

Comment by Caldaie Saunier Duval

- 2011-09-24 07:51:58

Could you please extend them a bit from next time? Thanks for the post.

Comment by مدونة

- 2011-09-26 08:57:45

keep it up
thanx

Comment by Ismat Zahra

- 2012-01-08 10:51:42

yeah true Keep it up

Comment by Balenciaga Handbags Shop

- 2011-10-19 02:53:12

Balenciaga Handbags Shopfh

Comment by Property Marbella

- 2011-10-23 02:06:04

Thank you for the introduction to scraping going try it on not just one social network site,

Comment by شات مصرى

- 2011-12-19 06:22:24

nice chat p7bk good website bloog chat egypt girl

Comment by Ismat Zahra

- 2012-01-08 10:51:10

can u plz change ur Comment :S

Comment by Nitish

- 2012-01-06 14:03:30

Great Post Eli it was worth a read

Comment by Ismat Zahra

- 2012-01-08 10:50:35

yeah true nitish

Comment by Life Insurance Over 85 Years Old

- 2012-04-05 20:10:34

I have always support Eli website.

Comment by Ismat Zahra

- 2012-01-08 10:44:56

yeah true very nice Eli..

Comment by Property Marbella

- 2012-03-07 04:23:33

Thanks so much a very good guide, perfect.

Comment by Crafts Factory

- 2012-03-11 02:13:06

I am glad that someone finally figure out how to manage online multiple MySQL servers.

Comment by شات صوتي

- 2012-03-17 12:08:02

okkkkkkkkkkkkkkkkkkkkkkk

Comment by دردشة صوتية

- 2012-03-17 12:08:24

yesssssssssssssss

Comment by شات كام

- 2012-03-17 12:08:44

اووووووووووووووك

Comment by Kartveli2012

- 2012-03-23 13:50:56

This is a comment of azerbaijan kkk

Comment by Broadband blogger

- 2012-03-31 06:37:22

Yes its really true! Some articles posted mostly doing backlink are scrap. I have read a lot of these just to put there keywords, and the content makes no sense at all.

Comment by Life Insurance Over 85 Years Old

- 2012-04-05 20:09:59

Scraping must be very properly done else it may have some side effects.

Comment by Property Marbella

- 2012-04-25 00:12:32

I really have no idea, its one of those poorly planned out posts that I enjoy doing.

Comment by Mercy Ministries

- 2012-07-16 13:47:30

Thank you for these following tips. It is really good. Good thing I saw your post. This would be really helpful.

Comment by ben 10

- 2012-07-20 08:22:05

It is a rare article that is both entertaining and informative.

Comment by buffalo website design

- 2012-08-23 23:22:27

scrapping guide is a tutorial of sorts…good post

Comment by thong cong

- 2012-09-02 23:25:27

I am glad that someone finally figure out how to manage online multiple MySQL servers.

Comment by chong tham

- 2012-09-08 06:16:03

Thanks Eli. I appreciate the work you are putting into this.

Comment by pensiuni arad

- 2012-09-15 18:09:31

wow good for multiple management on my pensiuni in arad website

Comment by Jasmine @ Callme.lk

- 2012-10-02 23:40:03

You participating in those sites, and building up your friendships and get high traffic may be.

Comment by sherman billingsley

- 2012-10-08 06:49:32

great guide for scraping

Comment by Jasmine @ Callme.lk

- 2012-10-08 23:51:11

Thank you for sharing this information. The information was very helpful and saved a lot of my time.thanks once again.

Comment by عالم صبايا

- 2012-11-11 06:58:04

thanks man
it’s very good article

Comment by شبكات

- 2012-12-03 20:43:39

hmm, it seems the code has been stripped out of the above. its the php require posts dot php that was meant to show between the “”

Comment by رسائل حب

- 2012-12-05 06:05:22

thanks man
for thes post

Comment by Frisør Valby

- 2012-12-10 04:09:53

Happy for the insight!

Comment by شات صوتي

- 2012-12-11 11:39:05

How helpful this article I never see in another website too good,

Comment by visiblexposure.com

- 2013-02-23 09:15:45

So I might as well get it all out of the way, explain the whole damn thing, and maybe someone will hear something they can use. Lets start at the beginning.

Comment by Steve

- 2013-03-05 20:58:05

Outstanding read, I never fully understood scraping before until I saw this. Thanks a ton!

Comment by Vindicate MJ

- 2013-03-19 09:58:56

I do agree with all of the ideas you have presented in your post.
