Monday, November 23, 2009

Duplicate Detection using MD5 and Jaccard Coefficient in C#

Duplicate documents are significant issue in context of the Web. Detecting near duplicate documents in a large data set, like the web, is a long standing problem. All the major search engines strive to remove duplicate documents from their search results. In this post I will show you how to create a duplicate detection system using n-grams, MD5 algorithm and Jaccard Coefficient using C#.

Method
One of the simplest and efficient duplicate detection techniques uses n-grams. An n-gram (also referred to as shingle) is a sequence of consecutive words of size 'n'. For instance, the sentence 'Are endorsements keeping Slumdog kids away from school' can be represented in 3-gram phrases as follows:

- Are endorsements keeping
- endorsements keeping Slumdog
- keeping Slumdog kids
- Slumdog kids away
- kids away from
- away from school

These shingles can then be converted to a hash number using MD5 algorithm. Once we have encoded n-grams of the entire document we can then compare them with encoded n-grams of another documents using Jaccard Coefficient.

Code
Using Linq you can perform 'union' and 'intersection' functions in order to calculate Jaccard Coefficient. Here is the code for creating n-grams from 2 documents and calculating Jaccard Coefficient:

public double Detect(string page1, string page2)
{
List s1 = CreateShingles(page1);
List s2 = CreateShingles(page2);

var commonelements = s1.Intersect(s2);
var totalElements = s1.Union(s2);

double jaccardCoef = (Convert.ToDouble(commonelements.Count())) / (Convert.ToDouble(totalElements.Count()));

return jaccardCoef;
}

You can implement your own MD5 algorithm but many languages provide built-in functionality to use MD-5. In C# you can use MD5 function using System.Security.Cryptography. Here is the code:

public string CalculateMD5Hash(string input)
{
// step 1, calculate MD5 hash from input
MD5 md5 = System.Security.Cryptography.MD5.Create();
byte[] inputBytes = System.Text.Encoding.ASCII.GetBytes(input);
byte[] hash = md5.ComputeHash(inputBytes);

// step 2, convert byte array to hex string
StringBuilder sb = new StringBuilder();
for (int i = 0; i < hash.Length; i++) { sb.Append(hash[i].ToString("X2")); } return sb.ToString(); }

The 'input' here is the 3-gram phrase. Here is the complete Code.
In case if you just like to see how it works, here is the executable file.

The program just look for words in the web pages being compared. This may lead to slightly erratic results due to the text ads within the main content. Since some text ads change on every refresh the duplication percentage may differ by a few decimal points every time the two pages are compared.

A better way is to extract the main content from the web pages and then extract individual words from it.

6 comments:

Anonymous said...

Hi there...i was checking out this code about duplicate detection Raza,and cause I am a novice in C#,i was wondering if you could tell me how can I compare two text files in C#....not two webpages,but two documents...i'm looking at the code,and I cant figure it out how to adapt it for two txt files..Thanx

Ryan said...

Hi
Thanks for the code. This method returns the similarity of 2 strings. for instance: "this is a book" is very similar to "book a is this", with a 1-gram shingle. However, Simhash and Greedy-String-Tiling which are used for SEO purposes (e.g. Simhash is used and patented by Google) are more practical and real. For instance, they assume that the above to strings are 100% different. Do you have any C# implementation of simhash?
Thanks

Ryan said...

Also, there are two questions:

1-Why do you use MD5. What's wrong with .HashValue property of Object class?

2-I reckon that bigger shingles must have a higher weight. e.g. if a 3-word match was found we must give it a higher weight than a 2-word match. how can we handle this?

Amir said...

Hello, Your links does not work.

Anonymous said...

I rarely leave remarks, but i did some searching and wound up here "Duplicate Detection using MD5 and Jaccard Coefficient in C#".

And I do have a couple of questions for you if you
don't mind. Is it only me or does it look as if like a few of the responses come across like they are written by brain dead folks? :-P And, if you are writing on additional social sites, I'd like to keep up
with everything fresh you have to post. Would
you list of the complete urls of your communal sites like your twitter feed, Facebook
page or linkedin profile?

My weblog: how to get rid of payday loans

yanmaneee said...

curry 4
adidas nmd r1
mbt shoes
fitflops sale clearance
canada goose
air max 270
air force 1
nike air max 270
hermes
nike air max 270