Monday, November 23, 2009

Duplicate Detection using MD5 and Jaccard Coefficient in C#

Duplicate documents are significant issue in context of the Web. Detecting near duplicate documents in a large data set, like the web, is a long standing problem. All the major search engines strive to remove duplicate documents from their search results. In this post I will show you how to create a duplicate detection system using n-grams, MD5 algorithm and Jaccard Coefficient using C#.

Method
One of the simplest and efficient duplicate detection techniques uses n-grams. An n-gram (also referred to as shingle) is a sequence of consecutive words of size 'n'. For instance, the sentence 'Are endorsements keeping Slumdog kids away from school' can be represented in 3-gram phrases as follows:

- Are endorsements keeping
- endorsements keeping Slumdog
- keeping Slumdog kids
- Slumdog kids away
- kids away from
- away from school

These shingles can then be converted to a hash number using MD5 algorithm. Once we have encoded n-grams of the entire document we can then compare them with encoded n-grams of another documents using Jaccard Coefficient.

Code
Using Linq you can perform 'union' and 'intersection' functions in order to calculate Jaccard Coefficient. Here is the code for creating n-grams from 2 documents and calculating Jaccard Coefficient:

public double Detect(string page1, string page2)
{
List s1 = CreateShingles(page1);
List s2 = CreateShingles(page2);

var commonelements = s1.Intersect(s2);
var totalElements = s1.Union(s2);

double jaccardCoef = (Convert.ToDouble(commonelements.Count())) / (Convert.ToDouble(totalElements.Count()));

return jaccardCoef;
}

You can implement your own MD5 algorithm but many languages provide built-in functionality to use MD-5. In C# you can use MD5 function using System.Security.Cryptography. Here is the code:

public string CalculateMD5Hash(string input)
{
// step 1, calculate MD5 hash from input
MD5 md5 = System.Security.Cryptography.MD5.Create();
byte[] inputBytes = System.Text.Encoding.ASCII.GetBytes(input);
byte[] hash = md5.ComputeHash(inputBytes);

// step 2, convert byte array to hex string
StringBuilder sb = new StringBuilder();
for (int i = 0; i < hash.Length; i++) { sb.Append(hash[i].ToString("X2")); } return sb.ToString(); }

The 'input' here is the 3-gram phrase. Here is the complete Code.
In case if you just like to see how it works, here is the executable file.

The program just look for words in the web pages being compared. This may lead to slightly erratic results due to the text ads within the main content. Since some text ads change on every refresh the duplication percentage may differ by a few decimal points every time the two pages are compared.

A better way is to extract the main content from the web pages and then extract individual words from it.

9 comments:

Anonymous said...

Hi there...i was checking out this code about duplicate detection Raza,and cause I am a novice in C#,i was wondering if you could tell me how can I compare two text files in C#....not two webpages,but two documents...i'm looking at the code,and I cant figure it out how to adapt it for two txt files..Thanx

Ryan said...

Hi
Thanks for the code. This method returns the similarity of 2 strings. for instance: "this is a book" is very similar to "book a is this", with a 1-gram shingle. However, Simhash and Greedy-String-Tiling which are used for SEO purposes (e.g. Simhash is used and patented by Google) are more practical and real. For instance, they assume that the above to strings are 100% different. Do you have any C# implementation of simhash?
Thanks

Ryan said...

Also, there are two questions:

1-Why do you use MD5. What's wrong with .HashValue property of Object class?

2-I reckon that bigger shingles must have a higher weight. e.g. if a 3-word match was found we must give it a higher weight than a 2-word match. how can we handle this?

Amir said...

Hello, Your links does not work.

Anonymous said...

I rarely leave remarks, but i did some searching and wound up here "Duplicate Detection using MD5 and Jaccard Coefficient in C#".

And I do have a couple of questions for you if you
don't mind. Is it only me or does it look as if like a few of the responses come across like they are written by brain dead folks? :-P And, if you are writing on additional social sites, I'd like to keep up
with everything fresh you have to post. Would
you list of the complete urls of your communal sites like your twitter feed, Facebook
page or linkedin profile?

My weblog: how to get rid of payday loans

oakleyses said...

jordan shoes, ugg boots, longchamp outlet, uggs on sale, ray ban sunglasses, replica watches, ray ban sunglasses, ray ban sunglasses, christian louboutin shoes, nike free run, louboutin pas cher, michael kors pas cher, louis vuitton outlet, air max, prada outlet, tory burch outlet, christian louboutin, louis vuitton, nike air max, nike roshe, replica watches, tiffany and co, oakley sunglasses wholesale, nike free, longchamp pas cher, christian louboutin uk, louis vuitton, burberry pas cher, chanel handbags, nike outlet, nike air max, polo ralph lauren outlet online, gucci handbags, oakley sunglasses, jordan pas cher, prada handbags, oakley sunglasses, cheap oakley sunglasses, louis vuitton outlet, longchamp outlet, polo outlet, ugg boots, tiffany jewelry, longchamp outlet, christian louboutin outlet, louis vuitton outlet, sac longchamp pas cher, oakley sunglasses, polo ralph lauren

oakleyses said...

true religion jeans, ray ban pas cher, michael kors, oakley pas cher, polo lacoste, michael kors outlet online, ray ban uk, mulberry uk, michael kors outlet online, coach outlet, burberry handbags, nike air force, vans pas cher, true religion outlet, michael kors outlet, abercrombie and fitch uk, nike roshe run uk, replica handbags, burberry outlet, michael kors outlet online, kate spade, hollister pas cher, michael kors outlet online, sac hermes, nike air max uk, true religion outlet, nike air max uk, michael kors outlet, guess pas cher, north face, hollister uk, new balance, nike tn, uggs outlet, uggs outlet, ralph lauren uk, michael kors, true religion outlet, nike air max, timberland pas cher, nike blazer pas cher, sac vanessa bruno, converse pas cher, coach purses, coach outlet store online, north face uk, nike free uk, hogan outlet, michael kors outlet

oakleyses said...

wedding dresses, longchamp uk, new balance shoes, insanity workout, gucci, nike roshe run, nike huaraches, north face outlet, celine handbags, giuseppe zanotti outlet, ghd hair, ralph lauren, nike trainers uk, mac cosmetics, iphone cases, asics running shoes, nfl jerseys, mcm handbags, north face outlet, ray ban, reebok outlet, converse outlet, timberland boots, hermes belt, mont blanc pens, ferragamo shoes, hollister, hollister clothing, nike air max, abercrombie and fitch, vans outlet, vans, converse, instyler, louboutin, lululemon, p90x workout, herve leger, soccer shoes, soccer jerseys, nike air max, jimmy choo outlet, beats by dre, oakley, baseball bats, valentino shoes, babyliss, hollister, chi flat iron, bottega veneta

oakleyses said...

ugg, louis vuitton, hollister, ugg uk, links of london, canada goose outlet, toms shoes, thomas sabo, moncler outlet, doudoune moncler, coach outlet, canada goose, ugg,uggs,uggs canada, ugg,ugg australia,ugg italia, pandora charms, replica watches, louis vuitton, supra shoes, pandora jewelry, juicy couture outlet, canada goose outlet, barbour, swarovski crystal, pandora uk, montre pas cher, juicy couture outlet, lancel, canada goose, moncler, canada goose jackets, louis vuitton, canada goose uk, marc jacobs, moncler, pandora jewelry, canada goose, swarovski, canada goose outlet, moncler uk, moncler, louis vuitton, karen millen uk, ugg pas cher, louis vuitton, barbour uk, wedding dresses, moncler outlet, moncler