
Wednesday, July 02, 2008

Beware The Duplicate Content Curse

One webmaster found Google unwilling to index pictures located in an images directory, but some extra content stored there apparently ran the site afoul of Google's guidelines.


Editor's Note: Duplicate content presents an issue for webmasters: search engines like Google will punish sites for it. But what if that content is merely a cached copy held for reference? Have you found one of your subdirectories dropped from Google over cached pages on your site?

Here's the short version: don't stick cached content in a directory you want Google to index. Chances are the Googlebot will freak out and run screaming from your server.

Michael VanDeMar wrote at Smackdown about how a simple test of indexing images in a subdirectory ended up with Googly accusations of webmaster malfeasance.

A discussion opened on Google's webmaster help group eventually attracted the attention of a Google staffer, John Mueller, who took a peek into VanDeMar's images subdirectory and found some terrifying creepy-crawlies therein:

In particular regarding your /images/ subdirectory I noticed that there are some things which could be somewhat problematic. These are just two examples:

- You appear to have copies of other people's sites, eg /images/viewgcache-getafreelinkfromwired.htm
- You appear to have copies of search results in an indexable way, eg /images/viewgcache-bortlebotts.htm

I’m not sure why you would have content like that hosted on your site in an indexable way, perhaps it was just accidentally placed there or meant to be blocked from indexing. I trust you wouldn’t do that on purpose, right?

VanDeMar keeps those cached copies to support his discussions, since such pages can and will change regularly, or disappear from their sites altogether. Keeping them somewhere Google could crawl and index evidently put him in a tough spot with the search engine, as Mueller suggested it ran counter to Google's webmaster guidelines.

The difficulty appears to lie in the nature of the cached pages. Mueller considers them duplicate content; VanDeMar, based on his reading of the guidelines, believes they aren't. He further questioned why the entire subdirectory was delisted from Google.

The obvious solution, as one commenter suggested, would be to place the cached pages in a different directory and tell the Googlebot to stay out of it. Whether or not that's the fairest outcome for webmasters won't figure into Google's thinking, as the company has dug in hard on perceived quality issues over the past year.
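As a rough sketch of that approach, assuming the cached pages were moved into a hypothetical /cache/ directory, a couple of lines in the site's robots.txt would keep the crawler away:

# Keep the Googlebot out of the (hypothetical) /cache/ directory of saved copies
User-agent: Googlebot
Disallow: /cache/

That blocks only the cached copies while leaving the rest of the site, including /images/, open for indexing.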

Keeping cached copies of content sounds like a prudent course of action. It keeps site visitors from clicking through to a page that no longer exists, which makes the linking site look bad. If Google consistently dumps subdirectories that mix cached and original content because it thinks duplication is in effect, webmasters will have to alter their linking structure to accommodate the fussy Googlebot.
