Mining the Web: September 2010

This year I have been really busy working on some research projects, hence the delay in blog posts. Recently, one of my research papers was accepted at the 2010 10th IEEE International Conference on Computer and Information Technology (CIT 2010). The conference proceedings were published by IEEE-CS, and the paper is available there. I am really grateful to my advisor, Dr. Shawn X. Wang, whose guidance and support helped me achieve my goals. The paper is about a new indexing structure for scientific documents that we developed. Following is the abstract of the paper:

With the tremendous growth in electronic publication, locating the most relevant references is becoming a challenging task. Most effective document indexing structures represent a document as a vector of very high dimensionality. It is well known that such a representation suffers from the curse of dimensionality. In this paper, we introduce DT-Tree (DocumentTerm-Tree) - a new structure for the representation of scientific documents. DT-Tree represents a document using the 50 most frequent terms in that specific document. These terms are grouped into a tree structure according to where they appear in the document, such as title, abstract, or section title, etc. The distance between two documents is calculated based on their DT-Trees. Two DT-Trees are compared using the Dice coefficient between the corresponding nodes of the trees. To verify the effectiveness of our similarity measure, we conducted experiments to cluster 150 documents in three categories, namely biology [1], chemistry [2-3] and physics [3]. The experimental results indicated 100% accuracy.
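
To make the idea in the abstract a bit more concrete, here is a minimal Python sketch of comparing two documents node by node with the Dice coefficient. This is not the code from the paper. The node names, the tokenizer, the convention that two empty nodes count as identical, and the choice to average the per-node scores into one distance are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical node names; the abstract only gives title, abstract and
# section title as examples of where a term can appear.
NODES = ("title", "abstract", "section_titles")


def tokenize(text):
    """Lowercase the text and split it into alphabetic terms (an assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def build_dt_tree(doc, top_k=50):
    """Build a simplified DT-Tree from a dict mapping part name -> text.

    The top_k most frequent terms of the whole document are kept, and each
    node holds the subset of those terms that appears in that part.
    """
    counts = Counter()
    for text in doc.values():
        counts.update(tokenize(text))
    top_terms = {term for term, _ in counts.most_common(top_k)}
    return {node: top_terms & set(tokenize(doc.get(node, ""))) for node in NODES}


def dice(a, b):
    """Dice coefficient of two term sets: 2|A & B| / (|A| + |B|)."""
    if not a and not b:
        return 1.0  # assumption: two empty nodes are treated as identical
    return 2 * len(a & b) / (len(a) + len(b))


def dt_tree_distance(t1, t2):
    """Distance as 1 minus the mean per-node Dice score (averaging is an assumption)."""
    sims = [dice(t1[node], t2[node]) for node in NODES]
    return 1.0 - sum(sims) / len(sims)


bio = build_dt_tree({
    "title": "protein folding",
    "abstract": "protein folding in cells",
    "section_titles": "methods results",
})
phys = build_dt_tree({
    "title": "quantum optics",
    "abstract": "photon entanglement",
    "section_titles": "theory experiment",
})
print(dt_tree_distance(bio, bio), dt_tree_distance(bio, phys))
```

With these toy inputs, a document compared against itself gets distance 0, and two documents that share no terms in any node get distance 1. A clustering algorithm could then work from the pairwise distances.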

I am still working on some more research projects and will post more information about them here.

Sunday, September 26, 2010

DT-Tree: A Semantic Representation of Scientific Papers