Well, I hope everyone had a great Thanksgiving. I love them turkey birds! I love them stuffed. I love them covered in gravy. I love the little gobbling noises they make.
Back to business. By now you should have at least a decent understanding of what scraping is and how to use it. The next obvious step is crawling. A crawler is a script that simply makes a list of all the pages on a site you would like to scrape. Creating a decent and versatile crawler is of the utmost importance. A good crawler will not only be thorough but will also weed out a lot of the bullshit big sites tend to have. There are many different methods of crawling a site; you are really only limited by your imagination. The one I’m going to cover in this post isn’t the most efficient, but it is thorough and very simple to understand.
Since I don’t feel like turning this post into a MySQL tutorial, I whipped up some quick code for a crawler script that will make a list of every page on a domain (it supports subdomains) and put it into a return-delimited text file. Here is an example script that will crawl a website and make an index of all the pages. For you master coders out there: I realize there are more efficient ways to code this (especially the file scanning portion), but I was going for simplicity. So bear with me.
How To Use
Copy and paste the code into Notepad and save it as crawler.cgi. Change the variables at the top. If you would like to exclude all the subdomains on the site, include the www. in front of the domain. If not, just leave it as the bare domain. Be very careful with the crawl dynamic option. On certain sites, turning crawl dynamic on will cause this script to run for a VERY long time. In any crawler you design or use, it is also a very good idea to set a limit on the maximum number of pages you would like to index. Once this is done, upload crawler.cgi into your hosting’s cgi-bin in ASCII mode. Set the chmod permissions to 755. Depending on your current server permissions, you may also have to create a text file in the same directory called pages.txt and set its permissions to 666 or 777.
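To make the subdomain and crawl dynamic settings a bit more concrete, here is a rough sketch of how a crawler might decide whether a link is worth indexing. It is written in Python and is not the code from crawler.cgi. The names `DOMAIN`, `CRAWL_DYNAMIC`, `MAX_PAGES`, and `in_scope` are made up for illustration, and treating "dynamic" as "has a query string" is my assumption.

```python
from urllib.parse import urlparse

# Hypothetical settings, standing in for the variables at the top of a crawler.
DOMAIN = "www.example.com"   # include "www." to exclude the other subdomains
CRAWL_DYNAMIC = False        # follow URLs that carry a query string?
MAX_PAGES = 500              # hard cap on how many pages to index


def in_scope(url, domain=DOMAIN, crawl_dynamic=CRAWL_DYNAMIC):
    """Return True if the URL is on the site and passes the dynamic-page rule."""
    parts = urlparse(url)
    if parts.scheme not in ("http", "https"):
        return False                      # skips mailto:, javascript:, etc.
    host = (parts.hostname or "").lower()
    domain = domain.lower()
    # Accept the configured host itself, or any subdomain of it.
    if host != domain and not host.endswith("." + domain):
        return False
    # Assumption: a URL with a query string counts as a "dynamic" page.
    if parts.query and not crawl_dynamic:
        return False
    return True
```

With `DOMAIN` set to `www.example.com`, a link to `blog.example.com` is rejected. Set it to `example.com` and the same link is accepted. A real crawler would also stop adding pages once its index reaches `MAX_PAGES`.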
Here is how the crawler works, step by step.
Create a database- Any database will work. I prefer SQL, but anything will do. A flat file is great because it can be used later by anything, including Windows apps.
Specify the starting URL you would like to crawl- In this instance the script starts at a domain. It can also index everything under a subpage as long as you don’t include the trailing slash.
Pull the starting page- I used the LWP::Simple module. It’s easy to use and easy to get started with if you have no prior experience.
Parse for all the links on the page- I use the HTML::LinkExtor module, which ships with the HTML::Parser distribution and works well alongside LWP. It takes the content from the LWP call and generates a list of all the links on the page. This includes links placed on images.
Check your database for duplicates- Scan through your new links and make sure none already exist in your database. If any do, remove them from the list of new links.
Add the remaining links to your database- In this example I appended the links to the bottom of the text file.
Rinse and repeat- Move to the next page in your database and do the same thing. In this instance I used a while loop to cycle through the text file until it reaches the end. When it finally reaches the end of the file, the script is done and it can assume every crawlable page on the site has been accounted for.
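The steps above can be sketched in a few dozen lines. This is a Python approximation for illustration, not the Perl script itself: it uses the standard library's `html.parser` and `urllib` in place of HTML::LinkExtor and LWP::Simple, only looks at `<a href>` links, and keeps the page list in memory before writing it out instead of rescanning the text file. The names `LinkParser`, `fetch`, `extract_links`, and `pyramid_crawl` are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect href values from <a> tags (this covers links wrapped around images)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Download a page and return its HTML, or an empty string on failure."""
    try:
        with urlopen(url, timeout=10) as response:
            return response.read().decode("utf-8", errors="replace")
    except (OSError, ValueError):
        return ""


def extract_links(html, base_url):
    """Return absolute links with any #fragment removed."""
    parser = LinkParser()
    parser.feed(html)
    return [urldefrag(urljoin(base_url, href)).url for href in parser.links]


def pyramid_crawl(start_url, index_path="pages.txt", max_pages=500, fetch=fetch):
    """Index every reachable page on start_url's host into a text file."""
    host = urlparse(start_url).hostname
    pages = [start_url]      # the "database": an ordered list of known pages
    seen = {start_url}       # quick duplicate check
    position = 0
    while position < len(pages):              # done when we run off the end
        html = fetch(pages[position])
        for link in extract_links(html, pages[position]):
            if urlparse(link).hostname != host:
                continue                       # stay on the same host
            if link in seen or len(pages) >= max_pages:
                continue                       # drop duplicates, respect the cap
            seen.add(link)
            pages.append(link)                 # new links go on the bottom
        position += 1
    with open(index_path, "w", encoding="utf-8") as f:
        f.write("\n".join(pages) + "\n")
    return pages
```

Calling `pyramid_crawl("http://www.example.com/")` would write one URL per line to pages.txt. The `fetch` parameter is only there so you can swap in a fake downloader while testing.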
This method is called the pyramid crawl. There are many different methods of crawling a website. Here are a few to give you a good idea of your options.
The pyramid crawl assumes the website flows outward in an expanding fashion, like an upside-down pyramid. It starts with the initial page, which has links to pages 2, 3, 4, etc. Each one of those pages links to more pages. They may link back up the pyramid, but they also link further down. From the starting point the pyramid crawl works its way down until no building block on the pyramid contains any unaccounted-for links.
A level-by-level crawl assumes a website flows in levels and dubs them “stages.” It takes the first level (every link on the main page) and creates an index for it. It then takes all the pages on level 1 and uses their links to create level 2. This continues until it has reached a specified number of levels. This is a much less thorough method of crawling, but it accomplishes a very important task. Let’s say you wanted to determine how deep your backlink is buried in a site. You could use this method to say your link is located on level 3 or level 17 or whatever. You could use this information to determine the average link depth of all your site’s inbound links.
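Here is a small Python sketch of that idea. It records the level at which each page is first seen. The function name `level_crawl` and the `get_links` callable (anything that takes a URL and returns the links on that page) are assumptions for the example, so you can plug in whatever fetching and parsing you already have.

```python
def level_crawl(start_url, get_links, max_levels=3):
    """Return {url: level}, where level 0 is the start page.

    get_links is a hypothetical callable: it takes a URL and returns
    the list of links found on that page.
    """
    depth = {start_url: 0}
    current_level = [start_url]
    for level in range(1, max_levels + 1):
        next_level = []
        for page in current_level:
            for link in get_links(page):
                if link not in depth:          # first sighting fixes the level
                    depth[link] = level
                    next_level.append(link)
        if not next_level:
            break                              # nothing new, so stop early
        current_level = next_level
    return depth
```

Once you have the dictionary, `depth.get(your_backlink_url)` tells you which level the page holding your link sits on, or `None` if it was not found within `max_levels`.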
The linear crawl assumes a website flows in a set of linear links. You take the first link on the first page and crawl it. Then you take the first link on that page and crawl it. You repeat this until you reach a stopping point. Then you take the second link on the first page and crawl it. In other words, you work your way linearly through the website. This is also not a very thorough process. It can be made more thorough with a little work, for instance by taking the second link from the last page, instead of from the first, on your second cycle and working your way backwards. However, this kind of crawl also has its purpose. Let’s say you wanted to determine how prominent your backlink is on a site. The sooner your linear crawl finds your link, the more prominently it can be assumed the link is placed on the website.
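A bare-bones Python sketch of the linear approach might look like this. The "stopping point" is not spelled out above, so I am assuming a chain stops when it hits a page with no links, a page that was already visited, or a maximum chain length. `linear_crawl`, `get_links`, and `max_chain` are names invented for the example.

```python
def linear_crawl(start_url, get_links, max_chain=50):
    """Return pages in the order a linear crawl discovers them.

    get_links is a hypothetical callable: it takes a URL and returns
    the list of links found on that page, in page order.
    """
    order = [start_url]
    seen = {start_url}
    for link in get_links(start_url):          # first link, then second, ...
        page = link
        steps = 0
        # Keep following the first link of each page until a stopping point.
        while page is not None and page not in seen and steps < max_chain:
            seen.add(page)
            order.append(page)
            links = get_links(page)
            page = links[0] if links else None
            steps += 1
    return order
```

The position of a page in the returned list is a rough stand-in for prominence: the earlier the page carrying your backlink shows up, the closer it sits to the top of the site's link order.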
The sitemap crawl is exactly what it sounds like. You find the site’s sitemap and crawl it. This is probably the quickest crawl method you can do.
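If the site publishes an XML sitemap in the sitemaps.org format (an assumption; plenty of sites only have an HTML sitemap page, which you would parse for links like any other page), pulling the page list out of it takes very little code. Here is a Python sketch with a made-up function name, `sitemap_urls`. Downloading the XML is left out.

```python
import xml.etree.ElementTree as ET


def sitemap_urls(xml_text):
    """Return every <loc> value from a sitemap or sitemap index document."""
    root = ET.fromstring(xml_text)
    urls = []
    for element in root.iter():
        # Strip the XML namespace, e.g. "{http://...}loc" becomes "loc".
        if element.tag.rsplit("}", 1)[-1] == "loc" and element.text:
            urls.append(element.text.strip())
    return urls
```

If the file turns out to be a sitemap index, the URLs it returns point at further sitemaps, so you would fetch each of those and run them through the same function.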
Search Engine Crawl
Also very easy. You just crawl all the pages the search engine lists for the site under its site: command. This one has its obvious benefits.
Black Hatters: If you’re looking for a sneaky way to get by that pesky little duplicate content filter consider doing both the Pyramid Crawl and the Search Engine Crawl and then compare your results.
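Comparing the two results is just a set comparison. Here is a minimal Python sketch, assuming you have each crawl's output as a list of URLs (for instance, the lines of two text files). The function name `compare_crawls` and the dictionary keys are invented for the example.

```python
def compare_crawls(crawled, listed):
    """Compare two collections of URLs and report where they differ."""
    crawled, listed = set(crawled), set(listed)
    return {
        "only_crawled": sorted(crawled - listed),  # found on the site, not in the search engine list
        "only_listed": sorted(listed - crawled),   # in the search engine list, not found by the crawl
        "both": sorted(crawled & listed),
    }
```

Exact string matching is naive here. In practice you would probably want to normalize things like trailing slashes and www. prefixes before comparing.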
Those of you who are new to crawling probably have a ton of questions about this. Feel free to ask them in the comments below, and the other readers and I will be happy to answer them the best we can.