How To Overthrow A Wikipedia Result

A busy ranking artist runs into this problem quite often. I ran into it again the other day and figured I might as well show my Blue Hat peeps how to overcome it, since it's a fairly common problem to have and there's a simple solution to it.

The Problem

Your site is holding a particular rank and a Wikipedia page is ranked right above it. The specific ranks don't particularly matter, but much like Hillary Clinton in the primaries you can't possibly live with being beaten like that. You have to drop the Wikipage down a notch and you have to continue moving up.

The Simple Solution

The simplicity of this tactic actually depends very heavily on the Wikipedia entry. Either way they're all very beatable, but some are easier than others. In fact, as mentioned, I just ran into this problem recently and I managed to knock the competing Wikipage entirely out of the top 20 in just two days using these steps. First you need to understand why the Wikipage ranks. Most of these pages rank for three reasons.

1) The domain authority of Wikipedia.org.

2) Innerlinking amongst other Wikipedia entries boosting the page's value. <- Particularly the See Also sections.

3) Inbound links, most typically from blogs and forums. <- An observant person would notice not only the high percentage of links from blogs/forums in contrast to other types of links, but also a strong lack of sitewide links from any of those sites.

You obviously can't do anything about the domain authority of Wikipedia.org, but understand that its pages are like a tripod; if you knock out one of the legs the whole thing falls (pun). Well, now that you understand why it's there, right up above you like a towering fugly friend of the girl you're trying to hit on, the solution becomes obvious: knock out reasons two and three.

Steps

1) Using your favorite link analysis tool (I prefer the simplistic Yahoo Site Explorer) find all the pages from the wikipedia.org domain that link to the particular Wikipedia entry. (A rough sketch of pulling that list straight from Wikipedia's own API follows the steps.)

2) Go to each listing and find the reference to the offending Wikipage. You'll find most of them in the See Also section or linked throughout the article. This is where the simplicity I was talking about before comes into play. Listings such as "Flash Games" or "Election News" are easier because they're so irrelevant. When people search Google for terms like these they obviously want to find actual flash games or election news, not some Wikipedia page explaining what they are. The same concept applies to other Wikipages linking to them. Just because the author put the text Cat Food in the article or the See Also doesn't mean it's a relevant reference to the subject matter.

3) SLOWLY remove nearly all those bitches! Be sure to leave a good convincing reason for the removal in the edit summary. Remove as many as possible but strictly limit yourself. I understand Blue Hatters have a tendency to overdo things, but you're just going to fuck yourself if you quickly go through each and every reference and mass delete them. If you don't know how many you should remove, keep it to no more than 1-2 a day. Remove the references with the highest PageRank first if you've got a ranking emergency, and switch IPs between each one. This will either knock out one of its legs or at least cripple the leg a bit. Which leaves you with my match and exceed philosophy.

4) Find all the blogs and forums that link to that Wikipage and go drop a link in as many of them as you can. Match and exceed. I'm not going to dive into the nofollow talk on this one or the benefits of links via blog comments. Just realize your goal in this instance isn't to get more links, it's to get your link on the same pages that link to the Wikipage. As mentioned above you'll be dealing mostly with blogs and forums; you're obviously in the same niche as the topics they're talking about, and you probably won't have any sitewide links to deal with, so you won't have to go through any link-begging pains.

5) Try to drop your link into the article. This is common sense.
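Side note on step 1: since all you need is the list of wikipedia.org pages linking to the entry, you can also skip the third-party tools and ask MediaWiki directly. Here's a rough sketch using the MediaWiki API's backlinks list; the article title is a placeholder and error handling is left out.

<?php
// Pull the wikipedia.org pages that link to a given entry straight from the MediaWiki API.
// 'Flash_game' is a placeholder; swap in the entry that's outranking you.
$title = 'Flash_game';
$api   = 'https://en.wikipedia.org/w/api.php?action=query&list=backlinks'
       . '&bltitle=' . urlencode($title)
       . '&blnamespace=0&bllimit=500&format=json';

$data = json_decode(file_get_contents($api), true);

if (!empty($data['query']['backlinks'])) {
    foreach ($data['query']['backlinks'] as $page) {
        // each of these is a Wikipedia article to check for a See Also or in-text reference
        echo $page['title'] . "\n";
    }
}
?>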

Side Note

Wikipedia's domain authority isn't something you should be entirely worried about. Their site and URL structure actually becomes favorable in helping deaden some of the heightening factors.

OH FYI! There is now a Printer Friendly link on every post on Blue Hat by popular demand

–>

Stumble and Digg Begging

Haven't done a Neat Tricks and Hacks in a while. Here's one to remind Digg and Stumbleupon users to up your shit.

PHP
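Here's a minimal sketch of the idea in PHP: check the referrer and, if the visitor arrived from StumbleUpon or Digg, nag them to thumb/digg the page. The message markup is just a placeholder; drop it wherever it fits your template.

<?php
// Show a begging message only to visitors arriving from StumbleUpon or Digg.
$referrer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';

if (stripos($referrer, 'stumbleupon.com') !== false) {
    echo '<div class="beg">Like this post? Give it a thumbs up on StumbleUpon!</div>';
} elseif (stripos($referrer, 'digg.com') !== false) {
    echo '<div class="beg">Found this on Digg? A digg would be much appreciated.</div>';
}
?>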

PERL

*PERL code is untested, I just translated it off the top of my head. Probably made a mistake or two… I always do. Am I the only CGIer left in this world?

Javascript

Source: Top News Trends

I just started testing this method today, so I couldn’t tell you how well it works yet. I’m going to start with Stumbleupon because I’m willing to wager I’ll have better results with them than the Digg crowd but who knows? Let me know how it works for you in the comments.

–>

New WordPress Plugin – PingCrawl

I've been starting to use a new plugin I helped develop with the coding expertise of Josh Team from Dallas Nightlife Entertainment. It's called PingCrawl. It's a plugin that automatically gets your WordPress blogs deep links on every post.

Plugin Summary

Every time you make a post on your blog it grabs similar posts from other blogs that allow pingbacks, using the post tags. It then links to them at the bottom of the post as similar posts and executes a pingback on all of them. You can specify how many posts to do per tag and that many will be done for each tag you use in your posts. Typically it has about an 80% success rate with each pingback, and since they're legit the ones that fall into moderation tend to get approved. This creates quite a few deep links for each blog post you make and over time really helps with your link building, especially for new blogs.

Theory Of Operation

* The plugin will listen to anytime a post is saved, published, updated, etc.
* The plugin on execution time will find all the tags on the post and perform the following per tag:
  o Use the Google API to check for (35) results with the tag name.
  o With the (35) results it loops through them and performs the following:
    + Does the result have a pingback meta tag?
    + Does the result have "trackback" somewhere in the source?
    + (if yes to both) it stores the pingback xmlrpc location in memory.
    + (if no to either) we skip that record and move to the next.
    + Once there are 5 legit pingable servers we then append their links to the post we just added.
    + We then retrieve the xmlrpc urls from memory and execute a pingback.ping against each one as defined in the pingback 1.0 spec (due to the nature of pingbacks and PHP it is not a 100% guarantee; there are a lot of dependencies on state, server responses, headers, etc.). A rough sketch of those two checks and the ping call follows below.
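To make the two checks and the ping itself concrete, here's a sketch in plain PHP. The function names are mine, not the plugin's, and the XML-RPC call assumes PHP's xmlrpc extension is available.

<?php
// Does the page advertise a pingback endpoint and mention trackbacks anywhere in the source?
function find_pingback_server($url) {
    $html = @file_get_contents($url);
    if ($html === false) return false;
    if (stripos($html, 'trackback') === false) return false;   // second check
    if (!preg_match('/rel=["\']pingback["\'][^>]*href=["\']([^"\']+)["\']/i', $html, $m) &&
        !preg_match('/href=["\']([^"\']+)["\'][^>]*rel=["\']pingback["\']/i', $html, $m)) {
        return false;                                           // first check
    }
    return $m[1];   // the xmlrpc location
}

// Execute pingback.ping against the stored xmlrpc location (pingback 1.0 spec).
function send_pingback($xmlrpc_url, $source_url, $target_url) {
    $body = xmlrpc_encode_request('pingback.ping', array($source_url, $target_url));
    $ctx  = stream_context_create(array('http' => array(
        'method'  => 'POST',
        'header'  => "Content-Type: text/xml\r\n",
        'content' => $body,
        'timeout' => 10,
    )));
    // no guarantees here either: state, server responses, headers, etc.
    return @file_get_contents($xmlrpc_url, false, $ctx);
}
?>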

There are built-in features such as caching Google's recordsets per tag, so you don't have to make requests out to Google for the same query, plus logic to know if you've already "PingCrawled" a post and ignore it on edit, etc., with a built-in polling system.

Installation:

1. Download Plugin
2. Put the file in the wp-content/plugins directory of your WordPress installation
3. Login to your blog dashboard
4. Click on Plugins
5. Click on Activate to the right of PingCrawl in the list
6. Make a Post

*Note: because of the nature of the script, any one tag can make as many as 41 HTTP requests and store source code in memory to run regular expressions against. Because of this I would try to limit my tags to no more than 3 (123 HTTP requests). Use more at your own risk.

Warning: This plugin can really slow down the time it takes to make your posts, so I would recommend not using more than 3 tags per post. Also, we coded in a small link injection which will put a link of mine into the mix about once every 10 posts. They will all be very white hat and clean links, so no worries, and if you leave the code intact I'd consider that a substantial thank you for the plugin.

Download PingCrawl

Screenshot

*The size of the links is entirely customizable. I'd recommend making them very small at the bottom of the post, but in this screenshot I made them big so you can see the format better.

–>

Open Questions #4 – Diminishing Values On Outbound Links

I somehow missed this question from the Open Questions post and I can’t help but answer it.

From Adsenser

I loved your SEO empire post. But I was wondering how much effect does a lot of links from a lot of indexed pages from the same domain have? I always thought that the search engines looked mainly at the number of different domains linking to you. Can you give some more info on this? Or do you use these pages to link to a lot of different domains?

This is a fantastic opener for a conversation on sitewide outbound links' effects on other sites as well as on the site itself. It's been long debated but never cleared up, not because it's too complicated but because there are so many myths it's hard to sort the fact from the fiction. To be clear in my answer, I'm going to refer to the site giving the link as the "host site" and the site receiving the link as the "target site," just so I don't have to play around with too much terminology.

The entire explanation of why sitewide links, main page links, subpage links, and reciprocal links work is based on a simple SEO law called Diminishing Values. It basically states that for every link, whether it be a reciprocal, innerlink, or outbound link, there is some form of consequence. Also, for every inbound link, innerlink accepted, or reciprocal link there is a benefit.

Diminishing Values = sum(benefits) > sum(consequences)

The need for the sum of the benefits to be greater than the sum of the consequences is essential because, as mentioned in my SEO Empire post, there can't be a negative relevancy for a site in relationship to a term. For example let's take the niche of cars. There's a theoretical mass of car blogs. For the sake of the example we'll say there are several thousand blogs on the subject of cars. Something happens in the industry that stirs all the bloggers, such as SEMA putting on a car show. So all these car blogs blog about SEMA's new car show and give it a link. If these outbound links caused a consequence greater than or equal to the benefit given to SEMA, then all these blogs would drop in value for the topic, cars. The mass effect would be that of a negative relevancy, and therefore sites with no relevancy that happen to contain on-topic links would in theory rank higher than the general consensus of on-topic sites.

So the notion of an outbound link diminishing your site's value in equal proportion is just complete bupkis and obviously not the way things actually work. Even if it were true and there were a compensation for on-site SEO, when an event in a niche happens the site hosting the event wouldn't just rise in the rankings, it would propel everyone else downwards, causing more turbulence in the SERPs than what actually happens, which is just their site rising. It's simple SEO Theory 101, but sadly a lot of people believe it. There are also a lot of sites that absolutely won't link to any sites within their topic for fear that their rankings will suddenly plummet the moment they do. They're under the greedy impression that they're somehow hoarding their link value and that it is in some way benefiting them. So with the assumption that an outbound link gives much more value to its target than it diminishes from its host, everything in a sense balances out and outbound links become much less scary. This of course in no way says that the consequence to the host is a diminishment of any sort. The entire consequence could be 0 or, as a lot of other people believe, +X (some people think on-topic outbound links actually add to your site's relevancy). I haven't personally seen one of my sites go up in rank after adding an outbound link, but I'm open to the idea or to the concept becoming reality in the future.

I Practice What I Preach

The Law of Diminishing Values is one of the reasons why BlueHatSEO is one of the only SEO blogs that has all dofollow comments as well as a top commentators plugin on every page. Your comments will not hurt my rankings. I'll say that one more time: your comments will not hurt my rankings. Whewww, I feel better.

Back To The Question

Before we get into the meat of the question we'll take a small scale example that we should all know the answer to.

Q: If a host site writes an automated link exchange script that automatically does thousands of link exchanges and puts those links on a single subpage, and all the target sites also have their link exchange pages set up the same way on a subpage, will the host site gain in value?

A: I'll tell you straight up from personal experience: yes it does. It's simple to test; if you don't believe me, go for it yourself.

Now we'll move up to a much larger scale with a specific on-topic example using sitewide links.

Q: If you own two 100k+ page lyric sites with lots of inbound links and very good indexing ratios, will putting a sitewide link to the other site on both raise both in value or keep them both the same?

A: Also from my personal experience, yes, both will not only rise in value but they will skyrocket in value by upwards of 50%, which can result in much higher rankings. Likewise this example can be done with any niche and any two large sites. Cross promote them with sitewide links between the two and see what happens. The results shouldn't be surprising.

Now, on the large scale, to the meat of the question.

Q: If these two lyric sites cross-compared all their inbound links from other sites and managed to get all the sites that link to lyric site A to also link to lyric site B, to the point at which each increased in links by 100k (the same as the number of links they would have gained from a sitewide link between the two), would both sites increase in value more so than if they did the sitewide link instead?

A: Yes, absolutely. This is a bit harder to test, but if you've been building an SEO Empire and each site's inbound links are from your own sites then it becomes quite a bit easier to test, and I'm certain you'll find the results the same as I did.

Conclusion

On a 1:1 basis, across a generalized population of relevant and non-relevant links, inbound links from separate domains/sites are still more effective than a sitewide link of the same magnitude. However! A sitewide link does benefit both sites to a very high degree, just not to the degree that lots of other sites can accomplish.

Sorry that question took so long to answer. I didn’t just want to give you a blank and blunt answer. I wanted to actually answer it with logic and a reasoning that hopefully leads to an understanding of the ever so important WHY.

–>

Blue Hat Technique #20 – Cyclic Documents

Summer You Never Even Really Gave Yourself enough time.

There was a bit of confusion with my cycle sites technique illustrated in the SEO Empire Part 1 post. I used autoblogs as an easy to understand example. Autoblogs generate links quickly to themselves and can be cycled (redirected) to a source to push those links. Therefore by the definition:

Cycle Site – A site that automatically gains links to itself and then through a redirection passes that link value to another site.

an autoblog is a perfect example of a Cycle Site. However, an Autoblog by itself is not a Cycle Site and a Cycle Site is not just an Autoblog. Any site that quickly gains links to itself and is capable of redirection can be used as a Cycle Site.

In contrast, as we all remember, a Link Laundering Site is a site that has the ability to gain links not just to itself but directly to another site. In the post I used a reciprocal link directory as an example, but really almost any platform can be used to launder links. I haven't actually heard of anyone getting confused about the differences between the two techniques, but I also haven't heard very much discussion about the extremely close relationship they share. These two techniques, more so than some of the other techniques on this site, are very closely related. Inherently a link laundering site takes precedence over a cycle site. Why? Because if a site can constantly feed link value to another site without having to cycle out and close up shop, even if only for a short while, then it's worth more as a link builder.

Therefore a Link Laundering Site should be used over a Cycle Site whenever possible. The Cycle Site simply gives you more structures to gain links with than Link Laundering Sites can provide. Since you're not worried about people liking the site and continuing its success, you are able to build links to it much quicker. There is however a happy medium between the two techniques that can give Cycle Sites link laundering stability and Link Laundering Sites Cycle Site power. This technique is called Cyclic Documents, it's exactly what it sounds like, and it is very powerful.

Cyclic Documents

A Cyclic Document is a document or link given to a user of a Cycle or Link Laundering site that waits out a given set of links or amount of time before it cycles (typically redirects) to its target.

The Premise

The idea is very simple. Instead of pointing the link you hand out at your main page, which you'd have to cycle out and thus lose the link building power of, you point it at a secondary document and/or a redirect to the main page, where it's viewed as either just an obscurity, a method of tracking, or not even noticed at all. To help remove the confusion and to help differentiate the actual example from the technique itself I'm going to do the methodology portion twice. The first will be with the classic Autoblog example given in the SEO Empire post, the next with a random structure.

Methodology 1 – The Autoblog

1) If you're using WordPress or a similar platform for your Autoblog, make a simple modification to the code: a conditional that redirects when it finds a certain string in the post's title that would normally never exist in a post title. (Both pieces are sketched after step 3.)

2) Create a cronjob script that parses through the previous posts on the Autoblog and finds posts beyond a certain age. Use the actual MySQL database, don't just write a scraper! I'm just going to throw my recommendation out there and you can adjust and make your own necessary changes to it based upon your experience and best judgement: 8 days.

3) If it finds a post past the set number of days, have it change the title to whatever unique title string you picked in step 1.
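Here's a bare-bones sketch of both pieces, assuming a stock WordPress install with the default wp_ table prefix; the marker string, the target URL, the database credentials, and the 8-day window are all placeholders.

<?php
// Piece 1 (step 1): the very top of your theme's single.php, before get_header()
// runs, so the redirect header can still be sent. 'CYCLEME-x93k' is a made-up marker.
if (strpos(get_the_title(), 'CYCLEME-x93k') !== false) {
    header('Location: http://www.your-target-site.com/', true, 301);
    exit;
}
?>

<?php
// Piece 2 (steps 2-3): a daily cron job that tags any published post older than 8 days.
$db = mysql_connect('localhost', 'dbuser', 'dbpass');
mysql_select_db('autoblog', $db);
$old = mysql_query("SELECT ID, post_title FROM wp_posts
                    WHERE post_status = 'publish'
                      AND post_title NOT LIKE '%CYCLEME-x93k%'
                      AND post_date < DATE_SUB(NOW(), INTERVAL 8 DAY)");
while ($row = mysql_fetch_assoc($old)) {
    $newtitle = mysql_real_escape_string($row['post_title'] . ' CYCLEME-x93k');
    mysql_query("UPDATE wp_posts SET post_title = '$newtitle' WHERE ID = " . (int) $row['ID']);
}
?>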

What Does This Do?

Your Autoblog will create posts based on RSS feeds (typically). It will then do pingbacks and gather at least one link to each post. I say at least one because there's decent odds this new breed of comment scrapers will pick the post up as well. The author of the original blog post may check for the link, hopefully within your set number of days, see his link, and hopefully leave the trackback alive on his site. After the author no longer cares about the link and has forgotten about it, but before the search engines have had a chance to index the page, it will cycle that single post to your new target site, thus giving it +1 link. This is why I never mentioned robots.txt in the original technique post. I wasn't hiding something fantastic from ya after all.

Methodology 2 – The Image Upload Site

1) On your image upload site, when the user uploads their image and you give them back the link code, instead of linking to www.myimagehoster.com have it link to a sequential numeric subdirectory or subpage, e.g. www.myimagehoster.com/10. (See the sketch after step 2.)

2) Through mod_rewrite have every /# or /[0-9]+ URL pull a script. In this script have it read in a variable saying what number it's currently on and increment it, then make every numeric URL at and below that number redirect to your target site. This sounds more complicated than it really is. Really, all you're doing is recording the number 1 to a file or db or something and every so often changing it to the next number up, which in this case is 2. From that point on every /1 and /2 link automatically redirects to your target site, thus giving it its hopeful link (assuming the person kept the image code intact). Based on the popularity of the site you can increment the number faster or slower and redirect more links at a faster rate.
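A quick sketch of what that handler might look like; the rewrite rule, filenames, and target URL are all placeholders.

<?php
// cycle.php -- assumes an .htaccess rule along the lines of:
//   RewriteRule ^([0-9]+)$ cycle.php?n=$1 [L]
$n       = (int) $_GET['n'];
$counter = (int) @file_get_contents('cycle_counter.txt');   // highest number cycled out so far

if ($n > 0 && $n <= $counter) {
    // this upload has been cycled: pass its link value on to the target
    header('Location: http://www.your-target-site.com/', true, 301);
    exit;
}

// otherwise serve the normal image page for upload #$n
include 'show_image.php';

// To cycle more links out, bump the counter by hand or on a schedule:
// file_put_contents('cycle_counter.txt', $counter + 1);
?>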

What Does This Do?

If your image upload site used to be a Cycle Site, it would work for a while, eventually gather tons of links very quickly, then cycle out and generate no links for a period before you'd bring it back. Now you keep it going forever, and instead of destroying its momentum you use it to gather even more links, faster than you ever could before. After long enough people will forget the link code and not click on it; that's prime time to have your link change out. You can also control your rankings. I.e. if your image upload site ranks for terms that give it a ton of traffic and you know X amount of links are required to maintain those rankings, you can maintain that amount of links, keeping your momentum at its maximum and yet still producing equally high volumes of links to your target. Also, I could very easily have used the link directory and software directory site examples from the link laundering technique with this same methodology.

Now for Jebus sake don’t go creating a shit ton of image upload sites or Autoblogs like what happened when SEO Empire came out. <- Even my other blogs on other subjects were getting hit by hundreds at a time. Use some creativity as I typically encourage you to do. You won’t ever get rich doing the direct examples gurus give you and you won’t with me either. Most of all have fun and learn a lot from it.

–>

Product Review: Auto Stumble

Oh boy, I haven't done an actual product review in a really long time. Well, with my effort to get back into posting regularly it couldn't hurt to do one for the sake of catching up. More the merrier. This one was sent in by the famous Mark from Digerati Marketing. It's called Auto Stumble. Its job is pretty apparent: it helps you exchange stumbles automatically.

I've been doing a lot of stumble work lately due to my recent release of several large community sites. Stumbleupon traffic doesn't convert very well, but it has some very good advantages beyond the fact that it's actual traffic.

1) The few users that convert tend to be very active
2) Stumbleupon users tend to share links a lot with their friends. Great for branding and word of mouth.
3) They bookmark everything and anything, which means lots of social bookmarking links.
4) They're suckers for linkbait such as kittens in diapers and neatly featured sites. They really appreciate a good layout.
5) If the timing is right they do wonders for a site with a Digg button.

Auto Stumble is available at AutoStumble.net for £10 (£20) <- British pounds?… Damn Paypal.. It's OK, Paypal should automatically convert, and it's available for immediate download after the exchange. That's about $19.66 ($39.32) for you normal people.

Auto Stumble automates the process in a pretty ingenious way. It runs as a background app in your system tray, automatically stumbles other users of the program's sites using your account, and has them do the same. It's pretty nice but a tough tool to pull off because you really have to have a lot of people running it 24/7. I for one am a fan of this system because, facts be faced, piqq and stumblexchange have just gone to shit and it's really annoying to log in every day and "earn credits." Once again though, lots of people are required to pull off this tool in an effective way. I've already given the tool to the SQUIRT members as a bonus and Mark has recruited some buyers, so there are enough to at least pull off a featured listing for most topics for at least an hour.

Here’s a video with some more info:

–>

Blue Hat Technique #19 – Keyword Spinning

Holy cripes! It's been a while since I've sat down and written a Blue Hat Technique. It just so happens I need this one for the next SEO Empire post. I'm like blah blah talking about Keyword Spinning and then I realized you guys have no fuckin' clue what I'm yammering about. So I figure now's a good time to fix all that, and luckily this one is really really easy, but like all Blue Hat Techniques it works like a mofo in many situations.

The Problem

Let's say you have a database driven website. A great example would be a Madlib Site or an e-commerce site. In fact this technique works so damn well with Ecom sites it should be illegal alongside public urination, so we'll use that as our primary example. You've got your site set up and each product page/joint page has its keywords, such as "17 Inch Water Pipes For Sale," and the page titles and headers match accordingly. You have several thousand pages/products put together and well SEO'd, but it's impossible to monitor and manually tweak each one, especially since most of the keyword research tools available aren't entirely accurate about keyword order. Like they may say "Myspace Pimps" gets 50 billion searches a day when really "Pimps On Myspace" is getting them. So amongst your thousands of pages you may have one page that could be ranking for a solid phrase and getting an extra 100 visitors/day from people searching for "Water Pipes For Sale 17 Inch," yet you're stuck with virtually no search traffic to that page and never know the difference. It's quite the dilemma and you probably realize it's more than likely already happening to you. Luckily it's easily fixed with a simple tool you can create yourself to fit whatever needs and sites you have.

Methodology

1) Add an extra field to all your database entries. For any row that creates a page of some sort, add an extra field called TrafficCount or something you can remember. (A bare-bones sketch of the whole loop follows the steps.)

2) Add a snippet of code into your template or CMS that counts each pageview coming from a Goohoomsn referrer and increments the appropriate field.

3) Wait a month….*Goes for a bike ride*

4) Call the titles in the database. It can only be assumed, even in a commercial/free CMS that the titles or keywords are held somewhere in the database. Locate them and scan through them one by one.

5) Use the Google/Yahoo/MSN APIs to see if the page ranks for its keywords.

6) If it does rank, then compare the traffic count for the month against some sort of threshold you've preset. I prefer to use a really small number like 5 for the first month or two, then start moving it up as needed. If the traffic is too low, then split the titles/keywords and randomly reorganize them.

*Sometimes you'll end up with some really messed up titles like "Pipes Sale Water For Inch 17," so if it's too un-userfriendly you may want to make a few adjustments, such as never putting a For/The/If/At type word in the front, or never rearranging the front two words so that, say, Water Pipes always stays in the front and only the trailing end gets shuffled. Once again it depends on how your site is already organized.

7) Reset the traffic count.

8) Wait another month and watch your search traffic slowly rise. Every month the site will get more and more efficient and pull more and more deep traffic. The pages that are already good will not change and the poor performing pages will start becoming higher performing pages. As an added bonus it will help improve your site's freshness factors.

9) Take a scan of your average number of keywords or title sizes. Let's say your average page has a very short key phrase such as "Large Beer Mugs." There are only so many combinations those keywords will produce, so if it's just a low traffic keyword there's no point in continually changing the titles every single month forever. So I like to only have the Keyword Spinning script run for a preset number of months on each site. For instance, if my average keyword length is three words then the most combinations I can have is six, so I should logically quit running it after 6-8 months, at which point my site is about as perfect as it can be without human intervention. Lastly, don't forget to make improvements to your CTR.
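For the scripting-inclined, here's a bare-bones sketch of steps 2 and 4-8 against a made-up pages table with id, title, and traffic_count columns; the rank check against the SE APIs (steps 5-6) is left out, and the threshold is the 5 mentioned above.

<?php
// Step 2: in your template, count search engine referrals for the current page.
$ref = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
if (preg_match('/(google|yahoo|msn|live)\./i', $ref)) {
    mysql_query("UPDATE pages SET traffic_count = traffic_count + 1
                 WHERE id = " . (int) $page_id);   // $page_id = however your CMS identifies the page
}
?>

<?php
// Steps 4-8: the monthly spin, run from cron. Add your Google/Yahoo/MSN API rank
// check before shuffling if you want to match the full methodology.
$threshold = 5;
$result = mysql_query("SELECT id, title, traffic_count FROM pages");
while ($page = mysql_fetch_assoc($result)) {
    if ((int) $page['traffic_count'] >= $threshold) continue;   // this page earns its keep
    $words = explode(' ', $page['title']);
    shuffle($words);                                            // randomly reorganize the keywords
    mysql_query("UPDATE pages SET title = '" . mysql_real_escape_string(implode(' ', $words)) .
                "', traffic_count = 0 WHERE id = " . (int) $page['id']);   // step 7: reset the count
}
?>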

Simple huh! Keyword Spinning is a really easy way to get the most out of nearly all your sites. The more you can squeeze out of each site, the fewer sites you have to build to reach your profit goals. With minimal scripting it's all very quick to implement and automate (please don't do it by hand!). That's all there is to it.

Usually with my Blue Hat Techniques I like to drop a little hint somewhere in them that references a really cool trick or spin to the method that'll greatly increase your profits. Since you've all been so damn patient about me being late on the SEO Empire part 2 post, and, for the moment at least, have quit asking me why Blue Hat sucks now, I'll just tell it to ya. My answer to that question BTW is that I'm still working on my projects, which is eating up some time, and I'm not happy with what I've written so far. If I'm not happy, it doesn't get published. Sorry, but the boss has spoken.

The Secret Hint

3) Wait a month….*Goes for a bike ride*

Use this technique on your Cycle Sites that you've chosen not to cycle out. Instead of competing with the original author, who you are probably linking to might I add, you can sometimes grab even better phrases and rank for them, giving you a ton more traffic (I've seen Cycle Sites increase their SE traffic over 50x by doing this). If not, then you'll eventually get their original title again, which at least will put you where you started. It's also the strangest damn thing: you'll get a percentage fewer complaints and pissed off bloggers when you switch the titles around. Maybe they don't care as much when they don't see you ranking for their post titles.

–>

Del.icio.us Captcha Cracked

Here ya go. This is the del.icio.us captcha busted in Python.

#!/usr/bin/python
import Image,time,random,glob,re,os,sys

##$$$$
train = raw_input("train? (y/n)")
if(train == "y"): train = True
else: train = False
##

fileName = ''.join(sys.argv[1:])

def getNeighbourhood(i,width,height,pixels):
    results = []
    try:
        if(pixels[i+1] != 0): results.append(i+1)
        if(pixels[i-1] != 0): results.append(i-1)
        if(pixels[i-width] != 0): results.append(i-width)
        if(pixels[i+width] != 0): results.append(i+width)
        if(pixels[i-width+1] != 0): results.append(i-width+1)
        if(pixels[i+width+1] != 0): results.append(i+width+1)
        if(pixels[i-width-1] != 0): results.append(i-width-1)
        if(pixels[i+width-1] != 0): results.append(i+width-1)
    except:
        pass
    return results

now = time.time()
captcha = Image.open(fileName)
(width,height) = captcha.size
pixels = list(captcha.getdata())

i = 0
for pixel in pixels:
    if (pixel == 2): pixels[i] = 0
    i += 1

toclean = []
for i in xrange(len(pixels)):
    neighbourhood = getNeighbourhood(i,width,height,pixels)
    if (len(neighbourhood)  lowestY): lowestY = y
    if(y  4):
        croppingBox = (firstX,highestY,lastX,lowestY)
        newCaptcha = captcha.crop(croppingBox)
        if(train):
            text = raw_input("char:\n")
            try: os.mkdir("/home/dbyte/deliciousImages/" + text)
            except: pass
            text__ = "/home/dbyte/deliciousImages/" + text + "/" + str(random.randint(1,100000)) + "-.png"
            newCaptcha.resize((20,30)).save(text__)
            text_ = "/home/dbyte/deliciousImages/" + text + "/" + str(random.randint(1,100000)) + "-.png"
            newCaptcha.resize((20,30)).rotate(slant).save(text_)
            text_ = "/home/dbyte/deliciousImages/" + text + "/" + str(random.randint(1,100000)) + "-.png"
            newCaptcha.resize((20,30)).rotate(360 - slant).save(text_)
            captchas.append(Image.open(text__))
        else:
            #text = str(count)
            #text = "tmp-delicious-" + text + ".png"
            #newCaptcha.save(text)
            captchas.append(newCaptcha.resize((20,30)))
        started = False
        lowestY,highestY = 0,10000
        count += 1

if(train == False):
    imageFolders = os.listdir("/home/dbyte/deliciousImages/")
    images = []
    for imageFolder in imageFolders:
        imageFiles = glob.glob("/home/dbyte/deliciousImages/" + imageFolder + "/*.png")
        for imageFile in imageFiles:
            pixels = list(Image.open(imageFile).getdata())
            for i in xrange(len(pixels)):
                if pixels[i] != 0: pixels[i] = 1
            images.append((pixels,imageFolder))

    crackedString = ""
    for captcha in captchas:
        bestSum,bestChar = 0,""
        captchaPixels = list(captcha.getdata())
        for i in xrange(len(captchaPixels)):
            if captchaPixels[i] != 0: captchaPixels[i] = 1
        for imageAll in images:
            thisSum = 0
            pixels = imageAll[0]
            for i in xrange(len(captchaPixels)):
                try:
                    if(captchaPixels[i] == pixels[i]): thisSum += 1
                except: pass
            if(thisSum > bestSum):
                bestSum = thisSum
                bestChar = imageAll[1]
        crackedString += bestChar
    print crackedString
    #print "time taken: " + str(time.time() - now)

–>

Captchas Captchas Captchas

Guess what I'm in the mood to talk about? You guessed it. Captchas! In fact I feel like dedicating a whole week, maybe more depending on whether any downtime occurs, to talking about nothing but captcha breaking. We'll break every captcha in the book, and by the end of this post even the captchas that haven't been created yet. Furthermore, for this week only I am accepting any and all captcha related guest posts. So if you've got a captcha solved or want to discuss techniques for breaking them, feel free to write up a guest post and email it to ELI at BLUEHATSEO.COM in HTML form. You can stay anonymous, and not only will I put it up but I'm also willing to put up any ad you'd like. Pick any text or banner ad you'd like to run with your post and I'll include it. With as many readers as this place has I'm sure it'll get clicked. Also be sure to include your Paypal address. If I really like your guest post I may even send you $100 as a thank you. Also, all you bloggers are welcome to repost any of the captcha related posts on this blog. I now declare any captcha related posts on this blog public domain and republishable under full rights. For some odd reason I feel like blowing the captcha breaking industry the fuck up. Like my favorite saying goes, if you're going to wreck a room you might as well WRECK it. Let's begin by revisiting one of my first captcha related posts: the Army Of Captcha Typers.

The Army of Captcha Typers is a great technique because it doesn't require loads of programming and is 100% adaptable to any captcha. I suggest you go back and reread it, but in the interest of keeping this short here's a quick summary.

You use a service, I used a proxy site as an example, to get users to type in the captchas for you. It records what the user typed in as the solution to the captcha and you use that to solve it. The more pageviews the service provides per user, the more effective it is at breaking captchas. Why pay Indians or tediously code it yourself?

Normally I like to leave most of the code and the creative portion out of the written technique, in the interest of not ruining the technique and to help the methods stay more effective through the use of spins and unique code. I don't write this blog to ruin techniques, and those people who claim I do are just insecure and like to claim they already know everything. As common sense as most of the stuff I post is, I haven't met a person yet who hasn't in some way learned something from this blog. That truth brags a lot louder than most SEO blogs I've seen. But! If we're going to wreck something let's wreck it. In that spirit I see no reason why every newbie on the planet shouldn't be able to easily throw up their own web proxy site that solves captchas for them, so here's the script to do it.

Captcha Solving Web Proxy

This is a modified version of CGIPROXY that I mentioned in the post. Basically you install it following the included instructions (README file). Then you set up your web proxy site. Target a niche such as kids behind a school proxy or something similar. There is an extra file included called captcha.cgi. Upload it to the cgi-bin in the same folder as the nph-proxy.cgi and give it 755 chmod permissions. Make a folder one directory below your cgi-bin called captchas. Give it read/write permissions (777 if all else fails). Then anytime you've got a captcha to solve, upload it to that directory with a unique filename. This can be done automatically with whatever script you're using to spam a captcha protected site. On the very next pageview the webproxy will require the person to type in the captcha, disguised as a human check to prevent abuse. Any captcha works. Once it gets their response it'll delete the captcha from the folder and write out the solution along with the filename to a new file called solved.txt. Format: characters|image.jpg\n. Remember to use some kind of reminder or code for the filename so you know which image is which when you go to use the solutions. Get enough users to your webproxy (which is very easy) and you can solve any captcha in moments.
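On the consuming side, your spam script just has to drop each captcha into that captchas folder under a name it can recognize and then watch solved.txt for the answer. A rough sketch, with made-up paths and a made-up naming scheme:

<?php
// Push the captcha into the proxy's queue under a recognizable filename.
$job = 'signup-' . time() . '-' . rand(1000, 9999) . '.jpg';
copy('/tmp/fetched_captcha.jpg', '/var/www/proxy/captchas/' . $job);

// Poll solved.txt (format: characters|image.jpg) until our filename shows up.
$solution = false;
for ($tries = 0; $tries < 60 && $solution === false; $tries++) {
    sleep(5);
    foreach ((array) @file('/var/www/proxy/solved.txt') as $line) {
        if (strpos($line, '|') === false) continue;
        list($chars, $img) = explode('|', trim($line));
        if ($img === $job) { $solution = $chars; break; }
    }
}
echo $solution ? "Solved: $solution\n" : "No proxy user has typed it yet\n";
?>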

Enjoy!

–>

User Contributed – Captcha Breaking W/ PHPBB2 Example

This is a fantastic guest post by Harry over at DarkSEO Programming. His blog has some AWESOME code examples and tutorials along with an even deeper explanation of this post so definitely check it out and subscribe so he’ll continue blogging.

This post is a practical explanation of how to crack phpBB2 easily. You need to know some basic programming but 90% of the code is written for you in free software.

Programs you Need

C++/Visual C++ express edition – On Linux everything should compile simply. On windows everything should compile simply, but it doesn’t always (normally?). Anyway the best tool I found to compile on windows is Visual C++ express edition. Download

GOCR – this program takes care of the character recognition. It also splits the characters up for us. It's pretty easy to do that manually, but hey. Download

ImageMagick – this comes with Linux. ImageMagick lets us edit images very easily from C++, php etc. Install this with the development headers and libraries. Download from here

A (modified) phpbb2 install – phpBB2 will lock you out after a number of registration attempts so we need to change a line in it for testing purposes. After you have it all working you should have a good success rate and it will be unlikely to lock you out. Find this section of code: (it’s in includes/usercp_register.php)

if ($row = $db->sql_fetchrow($result))
{
	if ($row['attempts'] > 3)
	{
		message_die(GENERAL_MESSAGE, $lang['Too_many_registers']);
	}
}
$db->sql_freeresult($result);

Make it this:

if ($row = $db->sql_fetchrow($result))
{
//	if ($row['attempts'] > 3)
//	{
//		message_die(GENERAL_MESSAGE, $lang['Too_many_registers']);
//	}
}
$db->sql_freeresult($result);

Possibly a version of php and maybe apache web server on your desktop PC. I used php to automate the downloading of the captcha because it’s very good at interpreting strings and downloading static web pages.

Getting C++ Working First

The problem on windows is there is a vast number of C++ compilers, and they all need setting up differently. However I wrote the programs in C++ because it seemed the easiest language to quickly edit images with ImageMagick. I wanted to use ImageMagick because it allows us to apply a lot of effects to the image if we need to remove different types of backgrounds from the captcha.

Once you've installed Visual C++ 2008 express (not C#, I honestly don't know if C# will work) you need to create a Win32 Application. In the project properties set the include path to something like (depending on your ImageMagick installation) C:\Program Files\ImageMagick-6.3.7-Q16\include and the library path to C:\Program Files\ImageMagick-6.3.7-Q16\lib. Then add these to your additional library dependencies: CORE_RL_magick_.lib CORE_RL_Magick++_.lib CORE_RL_wand_.lib. You can now begin typing the programs below.

If that all sounds complicated don’t worry about it. This post covers the theory of cracking phpBB2 as well. I just try to include as much code as possible so that you can see it in action. As long as you understand the theory you can code this in php, perl, C or any other language. I’ve compiled a working program at the bottom of this post so you don’t need to get it all working straight away to play with things.

Getting started

Ok this is a phpBB2 captcha:

It won’t immediately be interpreted by GOCR because GOCR can’t work out where the letters start and end. Here’s the weakness though. The background is lighter than the text so we can exclude it by getting rid of the lighter colors. With ImageMagick we can do this in a few lines of C++. Type the program below and compile/run it and it will remove the background. I’ll explain it below.

#include <Magick++.h>

using namespace Magick;

int main( int /*argc*/, char ** argv)
{
	// Initialize ImageMagick install location for Windows
	InitializeMagick(*argv);

	// load in the unedited image
	Image phpBB("test.png");

	// remove noise
	phpBB.threshold(34000);

	// save image
	phpBB.write("convert.pnm");

	return(1);
}

All this does is load in the image and then call the threshold function attached to the image. Threshold filters out any pixels below a certain darkness. On Linux you have to save the image as a .png; on Windows, however, GOCR will only read .pnm files, so on Linux we put this line instead:

// save image
phpBB.write("convert.png");

The background removed.

Ok, that's one part sorted. Problem 2: we now have another image where GOCR won't be able to tell where the letters start and end. It's too grainy. What we notice though is that each unjoined dot in a letter that has other dots within 3 pixels of it should probably be connected to them. So I add a piece of code onto the above program that looks 3 pixels to the right and 3 pixels below; if it finds any black dots it fills in the gaps. We now have chunky letters, and GOCR can now identify where each letter starts and ends. We're pretty much done.

#include <Magick++.h>

using namespace Magick;

void fill_holes(PixelPacket * pixels, int cur_pixel, int size_x, int size_y)
{
	int max_pixel, found;

	///////////// pixels to right /////////////////////
	found = 0;
	max_pixel = cur_pixel+3; // the furthest we want to search
	// set a limit so that we can't go over the end of the picture and crash
	if(max_pixel>=size_x*size_y)
		max_pixel = size_x*size_y-1;

	// first of all are we a black pixel, no point if we are not
	if(*(pixels+cur_pixel)==Color("black"))
	{
		// start searching from the right backwards
		for(int index=max_pixel; index>cur_pixel; index--)
		{
			// should we be coloring?
			if(found)
				*(pixels+index)=Color("black");

			if(*(pixels+index)==Color("black"))
				found=1;
		}
	}

	///////////// pixels to bottom /////////////////////
	found = 0;
	max_pixel = cur_pixel+(size_x*3);
	if(max_pixel>=size_x*size_y)
		max_pixel = size_x*size_y-1;

	if(*(pixels+cur_pixel)==Color("black"))
	{
		for(int index=max_pixel; index>cur_pixel; index-=size_x)
		{
			// should we be coloring?
			if(found)
				*(pixels+index)=Color("black");

			if(*(pixels+index)==Color("black"))
				found=1;
		}
	}
}

int main( int /*argc*/, char ** argv)
{
	// Initialize ImageMagick install location for Windows
	InitializeMagick(*argv);

	// load in the unedited image
	Image phpBB("test.png");

	// remove noise
	phpBB.threshold(34000);

	/////////////////////////////////////////////////////////////////////
	// Beef up "holey" parts
	/////////////////////////////////////////////////////////////////////
	phpBB.modifyImage();          // Ensure that there is only one reference to
	                              // underlying image; if this is not done, then the
	                              // image pixels *may* remain unmodified. [???]
	Pixels my_pixel_cache(phpBB); // allocate an image pixel cache associated with my_image
	PixelPacket* pixels;          // 'pixels' is a pointer to a PixelPacket array

	// define the view area that will be accessed via the image pixel cache
	// literally below we are selecting the entire picture
	int start_x = 0;
	int start_y = 0;
	int size_x = phpBB.columns();
	int size_y = phpBB.rows();

	// return a pointer to the pixels of the defined pixel cache
	pixels = my_pixel_cache.get(start_x, start_y, size_x, size_y);

	// go through each pixel and if it is black and has black neighbors fill in the gaps
	// this calls the function fill_holes from above
	for(int index=0; index<size_x*size_y; index++)
		fill_holes(pixels, index, size_x, size_y);

	// now that the operations on my_pixel_cache have been finalized
	// ensure that the pixel cache is transferred back to my_image
	my_pixel_cache.sync();

	// save image
	phpBB.write("convert.pnm");

	return(1);
}

I admit this looks complicated on first view. However, you definitely don't have to do this in C++ if you can find an easier way to perform the same task. All it does is remove the background and join close dots together.

I’ve given the C++ source code because that’s what was easier for me, however the syntax can be quite confusing if you’re new to C++. Especially the code that accesses blocks of memory to edit the pixels. This is more a study of how to crack the captcha, but in case you want to code it in another language here’s the general idea of the algorithm that fills in the holes in the letters:

1. Go through each pixel in the picture. Remember where we are in a variable called cur_pixel.
2. Start three pixels to the right of cur_pixel. If it's black, color the pixels between this position and cur_pixel black.
3. Work backwards one by one until we reach cur_pixel again. If any pixels we land on are black then color the space in between them and cur_pixel black.
4. Go back to step 1 until we've been through every pixel in the picture.

NOTE: Just make sure you don’t let any variables go over the edge of the image otherwise you might crash your program.

I used the same algorithm but modified it slightly so that it also looked 3 pixels below, however the steps were exactly the same.

Training GOCR

The font we’re left with is not recognized natively by GOCR so we have to train it. It’s not recognized partly because it’s a bit jagged.

Assuming our cleaned up picture is called convert.pnm and our training data is going to be stored in a directory called data/, we'd type this:

gocr -p ./data/ -m 256 -m 130 convert.pnm

Just make sure the directory data/ exists (and is empty). I should point out that you need to open up a command prompt to do this from. It doesn’t have nice windows. Which is good because it makes it easier to integrate into php at a later date.

Any letters it doesn’t recognize it will ask you what they are. Just make sure you type the right answer. -m 256 means use a user defined database for character recognition. -m 130 means learn new letters.

You can find my data/ directory in the zip at the end of this post. It just saves you the time of going through checking each letter and makes it all work instantly.

Speeding it up

Downloading, converting, and training for each phpbb2 captcha takes a little while. It can be sped up with a simple bit of php code but I don’t want to make this post much longer. You’ll find my script at the end in my code package. The php code runs from the command prompt though by typing “php filename.php”. It’s sort of conceptual in the sense that it works, but it’s not perfect.
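Just to show the shape of that glue, here's a rough sketch in PHP; the captcha URL, the name of the compiled cleanup program (clean_captcha), and the paths are all placeholders, and it assumes the training is already done.

<?php
// Grab a fresh captcha, clean it with the compiled ImageMagick program, read it with GOCR.
$img = file_get_contents('http://www.yourtestboard.com/path/to/confirmation/image');
file_put_contents('test.png', $img);            // the C++ program expects test.png

exec('./clean_captcha');                        // writes the cleaned convert.pnm
$guess = exec('/full/path/gocr -p ./data/ -m 256 -m 2 -a 25 convert.pnm');

echo "GOCR thinks the captcha says: $guess\n";
?>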

Done

Ok once GOCR starts getting 90% of the letters right we can reduce the required accuracy so that it guesses the letters it doesn’t know.

Below I’ve reduced the accuracy requirement to 25% using -a 25. Otherwise GOCR prints the default underscore character even for slightly different looking characters that have already been entered. -m 2 means don’t use the default letter database. I probably could have used this earlier but didn’t. Ah well, it doesn’t do a whole lot.

gocr -p ./data/ -m 256 -m 2 -a 25 convert.pnm

We can get the output of gocr in php using:

echo exec("/full/path/gocr -p ./data/ -m 256 -m 2 -a 25 convert.pnm");

Alternatives

In some instances you may not have access to GOCR or you don’t want to use it. Although it should be usable if you have access to a dedicated server. In this case I would separate the letters out manually and resize them all to the same size. I would then put them through a php neural network which can be downloaded from here FANN download

It would take a bit of work but it should hopefully be as good as using GOCR. I don’t know how well each one reacts to letters which are rotated though. Neural networks simply memorize patterns. I haven’t checked the inner workings of GOCR. It looks complicated.

My code

All the code can be found here to crack phpBB2 captcha.

Zip Download

In conclusion to this tutorial, it's a nightmare trying to port all my code over from Linux to Windows unless it's written in Java. If only Java were small and quick as well.

It’s worth stating that phpbb2 was easy to crack because the letters didn’t touch or overlap. If they had touched or overlapped it would probably have been very hard to crack.

I plan to look at that line and square captcha that comes with phpBB3 over on my site and document how secure it is.

Thanks for the awesome guest post Harry.

–>