One thing that can only be learned by running quite a few websites at once is how differently the bots treat different sites. One of the biggest differences is how often they pull your pages and how often they update your site in the index. One day while browsing through my stats, I noticed that certain sites get updated in the indexes daily while others get updated monthly. Some sites with only about 1,000 links get hit by Googlebot 700 times/day, while others with over 20,000 links only get hit about 30 times/day. This inspired me to begin an experiment.
Being one of the few who paid attention in junior high science class, I did this test the right way and put on a white lab coat (just kidding, but wouldn't that be cool? Where do you buy those things?). My constants were simple. Each site was a brand new domain with similar keywords, similar competition, and similar searches/day. Each site had extremely similar content and used the same template. I also pointed exactly 10 links from the same sites to each site. My variables were also simple. Each site was automatically updated with new pages and new content at random times; the only difference was how many times a day each one was updated.
Site 1 - Updated 1 time/day
Site 2 - Updated 3 times/day
Site 3 - Updated 5 times/day
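A schedule like the one above could be automated with something like the following sketch. The `publish_new_page` hook and the site names are hypothetical placeholders, not the actual scripts used in the experiment:

```python
import random

# Hypothetical placeholder for whatever script actually adds a new
# page of content to the given site.
def publish_new_page(site):
    print(f"published a new page on {site}")

# Updates per day for each test site, matching the experiment setup.
SCHEDULE = {"site1.example": 1, "site2.example": 3, "site3.example": 5}

def plan_daily_updates(schedule, seconds_per_day=86400):
    """Pick a random time-of-day (in seconds) for each update."""
    plan = []
    for site, updates in schedule.items():
        for _ in range(updates):
            plan.append((random.randrange(seconds_per_day), site))
    return sorted(plan)

# Build one day's plan; a real runner would sleep until each scheduled
# time and then call publish_new_page(site).
for when, site in plan_daily_updates(SCHEDULE):
    hh, mm = when // 3600, (when % 3600) // 60
    print(f"{hh:02d}:{mm:02d} -> update {site}")
```

Randomizing the time-of-day keeps the updates from landing at a predictable hour, which is the point of the "random times" constant above.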
My hypothesis was that the crawlers would behave differently depending on how often a site is updated, and that the indexes would refresh the site more or less frequently as a result.
I let the sites sit for one month, closely monitoring each site and its progress every day.
Spider Hits After First Month
          Site 1   Site 2   Site 3
MSN:        214      478     1170
Google:     184      523      957
Inktomi:    226      391      514
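For anyone who wants to track spider hits the same way, a rough sketch of tallying crawler requests from a standard access log might look like this. The user-agent substrings are common identifiers and the sample lines are made up; verify the exact strings against your own raw logs:

```python
from collections import Counter

# Substrings that identify the bots tracked in this experiment.
# These are commonly seen user-agent fragments, not guaranteed exact.
BOTS = {"Googlebot": "Google", "msnbot": "MSN", "Slurp": "Inktomi/Yahoo"}

def count_spider_hits(log_lines):
    """Tally hits per engine from combined-format access log lines."""
    hits = Counter()
    for line in log_lines:
        for needle, engine in BOTS.items():
            if needle in line:
                hits[engine] += 1
    return hits

# Example with two fake log lines; a real run would read your access_log.
sample = [
    '66.249.66.1 - - [10/Jun/2005] "GET / HTTP/1.1" 200 1043 "-" "Googlebot/2.1"',
    '207.46.98.2 - - [10/Jun/2005] "GET /p2.html HTTP/1.1" 200 2210 "-" "msnbot/1.0"',
]
print(count_spider_hits(sample))  # prints Counter({'Google': 1, 'MSN': 1})
```

Run daily over the previous day's log, this produces exactly the kind of per-engine hit counts shown in the table above.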
Then I monitored the sites for 6 months.
Cache Update Averages After 6 Months
Site 1 - MSN: 1.52 times/month   Google: 1.4 times/month
Site 2 - MSN: 18.24 times/month  Google: 4.1 times/month
Site 3 - MSN: 21.70 times/month  Google: 13.4 times/month
*Yahoo excluded because it's tougher to tell cache times and date stamps apart from cached pages/title changes.
I also tracked the percentage of actual pages that were indexed across Google, MSN, and Yahoo.
It is understood that spiders will hit your site for four primary reasons: first, validating a link from another site; second, checking for changes to your site; third, reindexing your site; fourth, pulling robots.txt. With the first and fourth factors neutralized, we can assume the update and spider stats are due to the second and third reasons.
What I take from this experiment is that if you keep your updates consistent and at random times, it will force the bots to revisit your site more often. They all start out visiting your site at consistent intervals depending on your number of links. Once they build a rhythm of how often your content changes, they adapt and start visiting more. Once they build that rhythm into their timing, they update your site in the indexes accordingly.
Therefore a theory can be built: crawlers are designed to accommodate your site and the practices of its webmaster. Thus, you can train the crawlers to how your site operates, and this will result in differences in performance in the indexes.
Flaws In The Experiment
Looking over the final results, I wish I had overdone it with a fourth site and had it update 100 or 1,000 times a day, to see if it performed better or worse than Site 3. The second flaw falls into the category of seasonal changes. I ran this experiment between June 2005 and January 2006, and the engines could have been acting differently during that period. I know for a fact that MSN was, because it was so new.