<?xml version="1.0" encoding="UTF-8"?><!-- generator="wordpress/2.0.7" -->
<rss version="2.0" 
	xmlns:content="http://purl.org/rss/1.0/modules/content/">
<channel>
	<title>Comments on: Complete Guide To Scraping Pt. 1</title>
	<link>http://www.BlueHatSEO.com/complete-guide-to-scraping-pt-1/</link>
	<description>Advanced SEO Tactics and Techniques</description>
	<pubDate>Sun, 20 Jul 2008 14:02:45 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.0.7</generator>

	<item>
		<title>by: MSN hack</title>
		<link>http://www.BlueHatSEO.com/complete-guide-to-scraping-pt-1/#comment-211706</link>
		<pubDate>Sun, 24 Feb 2008 22:04:39 +0000</pubDate>
		<guid>http://www.BlueHatSEO.com/complete-guide-to-scraping-pt-1/#comment-211706</guid>
					<description>Sometimes it isn't useful to get that information from a given resource. I would use Wikipedia for writing a lot of that information, wouldn't you?</description>
		<content:encoded><![CDATA[<p>Sometimes it isn&#8217;t useful to get that information from a given resource. I would use Wikipedia for writing a lot of that information, wouldn&#8217;t you?
</p>
]]></content:encoded>
				</item>
	<item>
		<title>by: mystery pua</title>
		<link>http://www.BlueHatSEO.com/complete-guide-to-scraping-pt-1/#comment-200544</link>
		<pubDate>Sun, 27 Jan 2008 02:30:21 +0000</pubDate>
		<guid>http://www.BlueHatSEO.com/complete-guide-to-scraping-pt-1/#comment-200544</guid>
					<description>Thank you for the introduction to scraping. Has anyone earned a good income from a scraped site? If I scrape, I won't scrape just one source but several, to make the content unique.</description>
		<content:encoded><![CDATA[<p>Thank you for the introduction to scraping. Has anyone earned a good income from a scraped site? If I scrape, I won&#8217;t scrape just one source but several, to make the content unique.
</p>
]]></content:encoded>
				</item>
	<item>
		<title>by: Jim</title>
		<link>http://www.BlueHatSEO.com/complete-guide-to-scraping-pt-1/#comment-161530</link>
		<pubDate>Sun, 23 Sep 2007 10:45:23 +0000</pubDate>
		<guid>http://www.BlueHatSEO.com/complete-guide-to-scraping-pt-1/#comment-161530</guid>
					<description>I have a little experience in the field of scraping; it was in fact one of my first attempts at doing anything related to web programming. Really, though, with a little coding knowledge and a good eye for patterns, it's very easy to make a site-wide scraper.

Once you have one or two scrapers in your back pocket, you will find it quick and easy to convert an existing scraper into one that is useful for your next big project.</description>
		<content:encoded><![CDATA[<p>I have a little experience in the field of scraping; it was in fact one of my first attempts at doing anything related to web programming. Really, though, with a little coding knowledge and a good eye for patterns, it&#8217;s very easy to make a site-wide scraper.</p>
<p>Once you have one or two scrapers in your back pocket, you will find it quick and easy to convert an existing scraper into one that is useful for your next big project.
</p>
]]></content:encoded>
				</item>
	<item>
		<title>by: Peter</title>
		<link>http://www.BlueHatSEO.com/complete-guide-to-scraping-pt-1/#comment-49831</link>
		<pubDate>Tue, 08 May 2007 05:03:50 +0000</pubDate>
		<guid>http://www.BlueHatSEO.com/complete-guide-to-scraping-pt-1/#comment-49831</guid>
					<description>Eli,

Do you scrape the names of the blogs or the titles of the new blog posts? If you are scraping weblogs, then it seems you are just taking the blog name, right?</description>
		<content:encoded><![CDATA[<p>Eli,</p>
<p>Do you scrape the names of the blogs or the titles of the new blog posts? If you are scraping weblogs, then it seems you are just taking the blog name, right?
</p>
]]></content:encoded>
				</item>
	<item>
		<title>by: Foz</title>
		<link>http://www.BlueHatSEO.com/complete-guide-to-scraping-pt-1/#comment-48949</link>
		<pubDate>Sun, 06 May 2007 17:29:54 +0000</pubDate>
		<guid>http://www.BlueHatSEO.com/complete-guide-to-scraping-pt-1/#comment-48949</guid>
					<description>Eli
Couple of questions

Would you link directly to a WH money site? Should an in-between site be used to protect it from a bad neighbourhood?

Do you filter titles for adult content etc. (unless you want that vertical)?</description>
		<content:encoded><![CDATA[<p>Eli<br />
Couple of questions</p>
<p>Would you link directly to a WH money site? Should an in-between site be used to protect it from a bad neighbourhood?</p>
<p>Do you filter titles for adult content etc. (unless you want that vertical)?
</p>
]]></content:encoded>
				</item>
	<item>
		<title>by: GimP</title>
		<link>http://www.BlueHatSEO.com/complete-guide-to-scraping-pt-1/#comment-15832</link>
		<pubDate>Fri, 16 Feb 2007 23:00:47 +0000</pubDate>
		<guid>http://www.BlueHatSEO.com/complete-guide-to-scraping-pt-1/#comment-15832</guid>
					<description>Thanks for another great post Eli!

One thing that makes me giggle every time I do some scraping is setting my user agent to Googlebot's user-agent string *grin*. That way, as I'm ripping their content, the webmaster likely feels all warm and fuzzy because it looks like Googlebot is making sweet sweet love to his site :)

(I use PHP and Snoopy for scraping... that makes setting things like agent and referrer really easy.)</description>
		<content:encoded><![CDATA[<p>Thanks for another great post Eli!</p>
<p>One thing that makes me giggle every time I do some scraping is setting my user agent to Googlebot&#8217;s user-agent string *grin*. That way, as I&#8217;m ripping their content, the webmaster likely feels all warm and fuzzy because it looks like Googlebot is making sweet sweet love to his site <img src='http://www.BlueHatSEO.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>(I use PHP and Snoopy for scraping&#8230; that makes setting things like agent and referrer really easy.)
</p>
]]></content:encoded>
				</item>
	<item>
		<title>by: Aik</title>
		<link>http://www.BlueHatSEO.com/complete-guide-to-scraping-pt-1/#comment-9432</link>
		<pubDate>Wed, 03 Jan 2007 00:15:57 +0000</pubDate>
		<guid>http://www.BlueHatSEO.com/complete-guide-to-scraping-pt-1/#comment-9432</guid>
					<description>Oops! The code was removed from the above post. Add the redirect and robots noindex/nofollow meta tags to redirect.php.</description>
		<content:encoded><![CDATA[<p>Oops! The code was removed from the above post. Add the redirect and robots noindex/nofollow meta tags to redirect.php.
</p>
]]></content:encoded>
				</item>
	<item>
		<title>by: Aik</title>
		<link>http://www.BlueHatSEO.com/complete-guide-to-scraping-pt-1/#comment-9431</link>
		<pubDate>Wed, 03 Jan 2007 00:07:34 +0000</pubDate>
		<guid>http://www.BlueHatSEO.com/complete-guide-to-scraping-pt-1/#comment-9431</guid>
					<description>Another solution to the linking problem mentioned above is to set up a specific page for redirection and pass the actual URL to it like this: href="/redirect.php?to=redirecturl". Then add a meta refresh tag to redirect.php.

This will forward the user to the URL you specify. Some browsers don't pass the referrer header when you use a meta refresh, so the original website owner may never find out where the user is coming from.

I will also add a robots noindex/nofollow meta tag to redirect.php and disallow redirect.php in my robots.txt.</description>
		<content:encoded><![CDATA[<p>Another solution to the linking problem mentioned above is to set up a specific page for redirection and pass the actual URL to it like this: href="/redirect.php?to=redirecturl". Then add a meta refresh tag to redirect.php.</p>
<p>This will forward the user to the URL you specify. Some browsers don&#8217;t pass the referrer header when you use a meta refresh, so the original website owner may never find out where the user is coming from.</p>
<p>I will also add a robots noindex/nofollow meta tag to redirect.php and disallow redirect.php in my robots.txt.
</p>
]]></content:encoded>
				</item>
	<item>
		<title>by: James</title>
		<link>http://www.BlueHatSEO.com/complete-guide-to-scraping-pt-1/#comment-6800</link>
		<pubDate>Sun, 19 Nov 2006 22:42:28 +0000</pubDate>
		<guid>http://www.BlueHatSEO.com/complete-guide-to-scraping-pt-1/#comment-6800</guid>
					<description>I would like to know more about Gophers

Can you guys show me some examples or something?</description>
		<content:encoded><![CDATA[<p>I would like to know more about Gophers</p>
<p>Can you guys show me some examples or something?
</p>
]]></content:encoded>
				</item>
	<item>
		<title>by: Outlaw</title>
		<link>http://www.BlueHatSEO.com/complete-guide-to-scraping-pt-1/#comment-6796</link>
		<pubDate>Sun, 19 Nov 2006 21:31:28 +0000</pubDate>
		<guid>http://www.BlueHatSEO.com/complete-guide-to-scraping-pt-1/#comment-6796</guid>
					<description>Great post. Cannot wait to read the follow-ups. 

Can't wait to hear from Aur as well. Good stuff.</description>
		<content:encoded><![CDATA[<p>Great post. Cannot wait to read the follow-ups. </p>
<p>Can&#8217;t wait to hear from Aur as well. Good stuff.
</p>
]]></content:encoded>
				</item>
	<item>
		<title>by: Eli</title>
		<link>http://www.BlueHatSEO.com/complete-guide-to-scraping-pt-1/#comment-6731</link>
		<pubDate>Sat, 18 Nov 2006 13:20:57 +0000</pubDate>
		<guid>http://www.BlueHatSEO.com/complete-guide-to-scraping-pt-1/#comment-6731</guid>
					<description>Aur,
thanks man, that is probably the best comment this blog has ever gotten. You obviously know your stuff. I look forward to your results :)</description>
		<content:encoded><![CDATA[<p>Aur,<br />
thanks man, that is probably the best comment this blog has ever gotten. You obviously know your stuff. I look forward to your results <img src='http://www.BlueHatSEO.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />
</p>
]]></content:encoded>
				</item>
	<item>
		<title>by: Will</title>
		<link>http://www.BlueHatSEO.com/complete-guide-to-scraping-pt-1/#comment-6725</link>
		<pubDate>Sat, 18 Nov 2006 12:22:50 +0000</pubDate>
		<guid>http://www.BlueHatSEO.com/complete-guide-to-scraping-pt-1/#comment-6725</guid>
					<description>Thanks Eli.  I appreciate the work you are putting into this.</description>
		<content:encoded><![CDATA[<p>Thanks Eli.  I appreciate the work you are putting into this.
</p>
]]></content:encoded>
				</item>
	<item>
		<title>by: Aur</title>
		<link>http://www.BlueHatSEO.com/complete-guide-to-scraping-pt-1/#comment-6705</link>
		<pubDate>Sat, 18 Nov 2006 02:44:54 +0000</pubDate>
		<guid>http://www.BlueHatSEO.com/complete-guide-to-scraping-pt-1/#comment-6705</guid>
					<description>Hey, actually I got inspired by your whole blog (Thanks Eli!) to start playing with scraping.

I am currently writing a lexical tool that takes a web page and rewrites it with different sentences and words that keep the same meaning and, of course, stay readable! My goal is to be able to pass the plagiarism test on a tool such as iThenticate.

It is a fair amount of work and I am just getting started, but my first test results are not too bad!
By the way, I said in another post that I was working at my company on a powerful automation tool that could help Blue/Black Hatters. Unfortunately, that project is on hold for some time (Christmas brought other priorities to the company!), so I think I'm gonna concentrate on scraping and my lexical tool.
I'll keep you posted once I have made it work well enough (and hopefully earned some money with it :)); I can even release it to the public if I feel it is good enough.

One problem with scraping is that you also get the links from the original source. When somebody clicks on one of them, the link usually goes to the original scraped website. The original website owner may then discover in his stats logs that a lot of visitors are coming from your website. He goes there and sees that you just scraped his content, so he may complain to your hosting company, Google, or whoever could get your site shut down. (It seems that this is one of the main reasons why scraped sites often get banned from Google after a few weeks.)

So what is the solution?
Usually, you remove all the links from your scraped content, or change them to point to your own website.
That is easy, but not very good for the user: he clicks thinking he's gonna go to another article and he may end up nowhere...

So I had this idea: set up a website you could call "www.my-search-engine.com"
with only one main PHP file that you would invoke this way:
http://www.my-search-engine.com/index.php?url=www.scraped-website.com/whatever-article.html
This main file takes just one parameter, a URL, and redirects you to that URL.

Then you only need to modify all the links in your scraped content, adding your "search engine invocation" to each link...

e.g.
blablabla&lt;a href="http://www.example.com/article.html"&gt;click here&lt;/a&gt; blabla

would be transformed into:

&lt;code&gt;

blablabla&lt;a href="http://www.my-search-engine.com/index.php?url=www.example.com/article.html"&gt;click here&lt;/a&gt; blabla

&lt;/code&gt;

Then when a user clicks on the link, he goes to your "search engine" website and is redirected to the page he wanted to read.
The scraped website's owner will see in his logs only some people coming from your fake search engine, so he will even be happy, thinking "cool, I'm indexed in a new search engine".

Now you're thinking... eh, how do you write such a redirect page in PHP?
Hopefully most of you already know, but here is the code anyway; just copy the following into a file called "index.php":

&lt;code&gt;

&lt;?php
// Redirect the visitor to the address passed in the "url" query parameter.
header("Location: http://" . $_GET['url']);
exit;
?&gt;

&lt;/code&gt;

That's it!!!
So, B-hatters, I'd be interested in your opinion on this technique...

And once again, great work Eli!!!</description>
		<content:encoded><![CDATA[<p>Hey, actually I got inspired by your whole blog (Thanks Eli!) to start playing with scraping.</p>
<p>I am currently writing a lexical tool that takes a web page and rewrites it with different sentences and words that keep the same meaning and, of course, stay readable! My goal is to be able to pass the plagiarism test on a tool such as iThenticate.</p>
<p>It is a fair amount of work and I am just getting started, but my first test results are not too bad!<br />
By the way, I said in another post that I was working at my company on a powerful automation tool that could help Blue/Black Hatters. Unfortunately, that project is on hold for some time (Christmas brought other priorities to the company!), so I think I&#8217;m gonna concentrate on scraping and my lexical tool.<br />
I&#8217;ll keep you posted once I have made it work well enough (and hopefully earned some money with it <img src='http://www.BlueHatSEO.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> ); I can even release it to the public if I feel it is good enough.</p>
<p>One problem with scraping is that you also get the links from the original source. When somebody clicks on one of them, the link usually goes to the original scraped website. The original website owner may then discover in his stats logs that a lot of visitors are coming from your website. He goes there and sees that you just scraped his content, so he may complain to your hosting company, Google, or whoever could get your site shut down. (It seems that this is one of the main reasons why scraped sites often get banned from Google after a few weeks.)</p>
<p>So what is the solution?<br />
Usually, you remove all the links from your scraped content, or change them to point to your own website.<br />
That is easy, but not very good for the user: he clicks thinking he&#8217;s gonna go to another article and he may end up nowhere&#8230;</p>
<p>So I had this idea: set up a website you could call &#8220;www.my-search-engine.com&#8221;<br />
with only one main PHP file that you would invoke this way:<br />
<a href="http://www.my-search-engine.com/index.php?url=www.scraped-website.com/whatever-article.html" rel="nofollow">http://www.my-search-engine.com/index.php?url=www.scraped-website.com/whatever-article.html</a><br />
This main file takes just one parameter, a URL, and redirects you to that URL.</p>
<p>Then you only need to modify all the links in your scraped content, adding your &#8220;search engine invocation&#8221; to each link&#8230;</p>
<p>e.g.<br />
<code>blablabla&lt;a href="http://www.example.com/article.html"&gt;click here&lt;/a&gt; blabla</code></p>
<p>would be transformed into:</p>
<p><code>blablabla&lt;a href="http://www.my-search-engine.com/index.php?url=www.example.com/article.html"&gt;click here&lt;/a&gt; blabla</code></p>
<p>Then when a user clicks on the link, he goes to your &#8220;search engine&#8221; website and is redirected to the page he wanted to read.<br />
The scraped website&#8217;s owner will see in his logs only some people coming from your fake search engine, so he will even be happy, thinking &#8220;cool, I&#8217;m indexed in a new search engine&#8221;.</p>
<p>Now you&#8217;re thinking&#8230; eh, how do you write such a redirect page in PHP?<br />
Hopefully most of you already know, but here is the code anyway; just copy the following into a file called &#8220;index.php&#8221;:</p>
<p><code>&lt;?php<br />
// Redirect the visitor to the address passed in the "url" query parameter.<br />
header("Location: http://" . $_GET['url']);<br />
exit;<br />
?&gt;</code></p>
<p>That&#8217;s it!!!<br />
So, B-hatters, I&#8217;d be interested in your opinion on this technique&#8230;</p>
<p>And once again, great work Eli!!!
</p>
]]></content:encoded>
				</item>
	<item>
		<title>by: Eli</title>
		<link>http://www.BlueHatSEO.com/complete-guide-to-scraping-pt-1/#comment-6700</link>
		<pubDate>Fri, 17 Nov 2006 20:55:38 +0000</pubDate>
		<guid>http://www.BlueHatSEO.com/complete-guide-to-scraping-pt-1/#comment-6700</guid>
					<description>This should be a fun series. Anyone got a request for a site to use as an example of a page scrape and a crawl?</description>
		<content:encoded><![CDATA[<p>This should be a fun series. Anyone got a request for a site to use as an example of a page scrape and a crawl?
</p>
]]></content:encoded>
				</item>
	<item>
		<title>by: doc</title>
		<link>http://www.BlueHatSEO.com/complete-guide-to-scraping-pt-1/#comment-6696</link>
		<pubDate>Fri, 17 Nov 2006 10:51:49 +0000</pubDate>
		<guid>http://www.BlueHatSEO.com/complete-guide-to-scraping-pt-1/#comment-6696</guid>
					<description>What I'm wondering is whether you could couple this with some kind of semantic translation to create totally unique content that is still readable English. I haven't found any programs or code that do this, but I wonder if it is possible...</description>
		<content:encoded><![CDATA[<p>What I&#8217;m wondering is whether you could couple this with some kind of semantic translation to create totally unique content that is still readable English. I haven&#8217;t found any programs or code that do this, but I wonder if it is possible&#8230;
</p>
]]></content:encoded>
				</item>
</channel>
</rss>
