Monday, August 17, 2009

Classic Vector Space Model in C#

The Vector Space Model (VSM) is widely used in information retrieval systems. It represents both documents and queries as vectors in a shared term space. VSM treats the query as just another document and measures its similarity to each document by the angle between the two vectors, usually through the cosine of that angle. Documents are then ranked by that value: the smaller the angle (the larger the cosine), the better the match.
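For reference, the term weighting and similarity measure that the program below computes can be written as follows. The symbols are notation introduced here, not taken from the original example: tf_{t,d} is the raw count of term t in document d, df_t is the number of documents (query excluded) that contain t, and D is the number of documents excluding the query.

```latex
w_{t,d} = tf_{t,d} \cdot \ln\frac{D}{df_t}
\qquad
\mathrm{sim}(q, d) = \cos\theta
  = \frac{\sum_t w_{t,q}\, w_{t,d}}
         {\sqrt{\sum_t w_{t,q}^2}\;\sqrt{\sum_t w_{t,d}^2}}
```

The base of the logarithm rescales every weight by the same constant, so it does not change the cosine or the resulting ranking.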

I implemented the model in C#, using the example described in Mi Islita. The program prints the entries of a dictionary that holds each document's name and its similarity value.

The dictionary itself is unordered, so its entries have to be sorted by similarity value to produce a ranking; any sorting method will do. The program also assumes a string array "docs" that holds the query text at index zero, followed by the text of each document.
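As one possible way to do that sorting, here is a minimal sketch that uses LINQ. The class name SortBySimilarity, the method name Rank, and the sample scores are made up for illustration; only the dictionary shape (similarity value as key, document name as value) follows the program.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SortBySimilarity
{
    // Returns the entries ordered from most to least similar.
    // Hypothetical helper, not part of the original program.
    public static List<KeyValuePair<double, string>> Rank(Dictionary<double, string> scores)
    {
        return scores.OrderByDescending(pair => pair.Key).ToList();
    }

    public static void Main()
    {
        // Made-up scores, used only to show the ordering.
        var scores = new Dictionary<double, string>
        {
            { 0.1, "1" },
            { 0.9, "2" },
            { 0.5, "3" }
        };

        foreach (var pair in Rank(scores))
        {
            Console.WriteLine("{0} -> Doc{1}", pair.Key, pair.Value);
        }
    }
}
```

Because the similarity value is the dictionary key, two documents with exactly the same score cannot both be stored; keying by document name instead would avoid that limitation.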

Here is the code:


using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;
using System.Collections;
using System.Linq;
namespace VectorSpaceModel
{
    class Program
    {
        // Maps "Query" or a document number to its tf-idf vector.
        static Hashtable DTVector = new Hashtable();
        // Distinct terms found in the query and the documents.
        static List<string> wordlist = new List<string>();
        // Similarity value -> document number.
        static Dictionary<double, string> sortedList = new Dictionary<double, string>();
        static string[] docs = {
            "gold silver truck",                            // Query
            "shipment of gold damaged in a fire",           // Doc 1
            "delivery of silver arrived in a silver truck", // Doc 2
            "shipment of gold arrived in a truck"           // Doc 3
        };

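        static void Main(string[] args)
        {
            createWordList();
            createVector();
            classify();

            // Print the documents from most to least similar to the query.
            // A Dictionary has no defined order, so sort explicitly by similarity.
            foreach (var x in sortedList.OrderByDescending(pair => pair.Key))
            {
                Console.WriteLine("{0} -> Doc{1}", x.Key, x.Value);
            }

            Console.ReadLine();
        }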


        // Builds the list of distinct terms across the query and all documents.
        public static void createWordList()
        {
            foreach (string doc in docs)
            {
                wordlist = getWordList(wordlist, doc);
            }
        }

        // Adds every word of the given text that is not yet in the list.
        public static List<string> getWordList(List<string> wordlist, string query)
        {
            Regex exp = new Regex("\\w+");
            MatchCollection MCollection = exp.Matches(query);

            foreach (Match match in MCollection)
            {
                if (!wordlist.Contains(match.Value))
                {
                    wordlist.Add(match.Value);
                }
            }

            return wordlist;
        }

        // Builds one tf-idf vector per entry in docs and stores it in DTVector.
        public static void createVector()
        {
            double[] queryvector;

            for (int j = 0; j < docs.Length; j++)
            {
                queryvector = new double[wordlist.Count];

                for (int i = 0; i < wordlist.Count; i++)
                {
                    // Weight of term i in text j: term frequency times inverse document frequency.
                    double tfIDF = getTF(docs[j], wordlist[i]) * getIDF(wordlist[i]);
                    queryvector[i] = tfIDF;
                }

                if (j == 0) // index 0 holds the query
                {
                    DTVector.Add("Query", queryvector);
                }
                else
                {
                    DTVector.Add(j.ToString(), queryvector);
                }
            }
        }

        // Computes the cosine similarity between the query and every document.
        public static void classify()
        {
            double[] queryvector = (double[])DTVector["Query"];

            foreach (DictionaryEntry entry in DTVector)
            {
                if (entry.Key.ToString() != "Query")
                {
                    double temp = cosinetheta(queryvector, (double[])entry.Value);

                    // Note: the similarity is the key, so Add throws if two
                    // documents have exactly the same similarity value.
                    sortedList.Add(temp, entry.Key.ToString());
                }
            }
        }

        public static double dotproduct(double[] v1, double[] v2)
        {
            double product = 0.0;
            if (v1.Length == v2.Length)
            {
                for (int i = 0; i < v1.Length; i++)
                {
                    product += v1[i] * v2[i];
                }
            }
            return product;
        }

        public static double vectorlength(double[] vector)
        {
            double length = 0.0;
            for (int i = 0; i < vector.Length; i++)
            {
                length += Math.Pow(vector[i], 2);
            }

            return Math.Sqrt(length);
        }

        private static double getTF(string document, string term)
        {
            string[] queryTerms = Regex.Split(document, "\\s");
            double count = 0;

            foreach (string t in queryTerms)
            {
                if (t == term)
                {
                    count++;
                }
            }
            return count;
        }

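        private static double getIDF(string term)
        {
            // Document frequency: number of documents (query excluded) that contain the term.
            double df = 0.0;
            for (int i = 1; i < docs.Length; i++)
            {
                // Match whole terms only; string.Contains would also count
                // substrings, such as "a" inside "damaged".
                if (getTF(docs[i], term) > 0)
                {
                    df++;
                }
            }

            // Number of documents, excluding the query.
            double D = docs.Length - 1;

            // Terms that occur in no document get an IDF of zero.
            double IDF = 0.0;

            if (df > 0)
            {
                IDF = Math.Log(D / df);
            }

            return IDF;
        }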

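        public static double cosinetheta(double[] v1, double[] v2)
        {
            double lengthV1 = vectorlength(v1);
            double lengthV2 = vectorlength(v2);

            // A zero-length vector has no weighted terms; return 0 instead of dividing by zero.
            if (lengthV1 == 0.0 || lengthV2 == 0.0)
            {
                return 0.0;
            }

            double dotprod = dotproduct(v1, v2);

            return dotprod / (lengthV1 * lengthV2);
        }
    }
}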
