Over the past month we’ve been working on proactive measures to protect site integrity, so that we aren’t penalized by the dozens of search engines that index our site daily.
One of them is our new advanced article de-duper, which identifies not only EXACT-MATCH articles but now “LIKE-MATCH” articles as well.
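For anyone curious what a like-match check could look like, here is a minimal sketch. It is not our actual de-duper: it swaps in Python's standard `difflib.SequenceMatcher` to compare two article bodies word by word, and the function name `is_like_match` and the 0.9 threshold are assumptions made up for illustration.

```python
from difflib import SequenceMatcher


def is_like_match(body_a, body_b, threshold=0.9):
    """Return True when two article bodies are nearly the same.

    Assumption: "like match" is approximated by a word-level similarity
    ratio; the 0.9 threshold is illustrative and would need tuning.
    """
    words_a = body_a.lower().split()
    words_b = body_b.lower().split()
    # autojunk=False keeps common words from being ignored in long articles.
    matcher = SequenceMatcher(None, words_a, words_b, autojunk=False)
    return matcher.ratio() >= threshold
```

An exact match is just the special case where the ratio is 1.0, so one check could cover both kinds of dupe.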
Honestly, I was shocked when I printed out the 10-page report of the 3 dozen “like match” articles that we had in the directory. I suppose as a percentage of 33k articles that is not bad, but it’s still not acceptable to have duplicate-title or duplicate-content articles on our site.
Here’s the methodology we used to decide which of the exact-match or like-match articles to remove:
First, if an author had two different author names with the same set of articles, I would merge them into one author name (whichever looked more professional, or whichever matched how the author laid out their name in the resource box of their articles).
Then, I’d look at the number of page views. Whichever of the duplicate articles had fewer page views got dumped.
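The page-view rule is simple enough to sketch in a few lines. This is only an illustration under assumptions: the `Article` record and the `split_keep_and_remove` helper are hypothetical names, and breaking ties by keeping the lowest article id is my own assumption, since a tie never needs a rule until it happens.

```python
from dataclasses import dataclass


@dataclass
class Article:
    # Hypothetical record; the field names are assumptions for illustration.
    article_id: int
    title: str
    page_views: int


def split_keep_and_remove(duplicates):
    """Return (keeper, to_remove) for a group of duplicate articles."""
    # Most page views first; ties go to the lowest article id (assumption).
    ranked = sorted(duplicates, key=lambda a: (-a.page_views, a.article_id))
    return ranked[0], ranked[1:]
```

Because it works on a group rather than a pair, the same helper would cover the case where three or more copies of an article turn up.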
In a few situations, an author clearly had not submitted original work, and all of their articles were removed.
The really tough part of this de-dupe job is that we have 5-7 pages of article titles that are identical but the bodies of the articles are not even remotely identical. With those, we sift through by hand to identify potential dupes.
The one ‘dead giveaway’ that an article is a dupe before I even look at it: its word count is identical to, or within 10 words of, the word count of the other article it was matched with.
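That word-count giveaway can be sketched as a quick pre-filter to run before reading anything by hand. The helper names here are hypothetical, and counting words by splitting on whitespace is an assumption; a real word counter may treat punctuation or markup differently.

```python
def word_count(text):
    """Count words by splitting on whitespace (an assumption)."""
    return len(text.split())


def word_counts_suggest_dupe(body_a, body_b, tolerance=10):
    """True when the two word counts are identical or within `tolerance`."""
    return abs(word_count(body_a) - word_count(body_b)) <= tolerance
```

A True result only flags a pair for a closer look, since two unrelated articles can easily land on similar lengths.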
Today, we have better systems in place to stop authors from sending in articles that match ones they have already submitted…and I suppose this event is just cleanup from the days before we had duplicate article title checking. :-)
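If you want to try duplicate title checking at submission time on your own site, here is a minimal sketch of the idea. It is not our production system: the `TitleRegistry` class, its in-memory storage, and the normalization rule (lowercase, collapsed whitespace) are all assumptions for illustration.

```python
import re


def normalize_title(title):
    """Lowercase and collapse whitespace so trivial variations still match."""
    return re.sub(r"\s+", " ", title).strip().lower()


class TitleRegistry:
    """Hypothetical in-memory record of titles each author has submitted."""

    def __init__(self):
        self._seen = {}  # author name -> set of normalized titles

    def try_submit(self, author, title):
        """Record the title and return True, or return False for a repeat."""
        key = normalize_title(title)
        titles = self._seen.setdefault(author, set())
        if key in titles:
            return False
        titles.add(key)
        return True
```

A real site would back this with its database rather than a dictionary, but the check itself stays the same: normalize, look up, and reject the repeat.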
The whole point of this exercise is to ensure that we don’t have or allow duplicate articles in our directory. You might want to do the same for your website so that you’re never accused of search engine spamming. It’s about site integrity.