There are a number of partitional techniques, but we describe only the K-means algorithm, which is widely used in document clustering. K-means is based on the idea that a single center point can represent a cluster. In particular, K-means uses the notion of a centroid, which is the mean or median point of a group of points. Note that a centroid almost never corresponds to an actual data point. The algorithm is discussed in detail in Section 5.

3. Methodology:

3.1 Data collection

This work experiments with two benchmark datasets, “Reuters-21578 Distribution 1.0” and the Classic dataset, collected from the UCI KDD repository. The Reuters-21578 collection is distributed in 22 files. Each of the first 21 files (reut2-000.sgm through reut2-020.sgm) contains 1000 documents, while the last (reut2-021.sgm) contains 578 documents.

3.2 Document Representation

In order to reduce the complexity of the documents and make them easier to handle, each document has to be transformed from its full-text version to a document vector that describes its contents. The representation of a set of documents as vectors in a common vector space is known as the vector space model. In the vector space model of IR, documents are represented as vectors of features representing the terms that occur within the collection. It is also termed a bag of words, where words are assumed to appear independently and their order is immaterial. The value of each feature is called the term weight and is usually a function of the term’s frequency (or tf-idf) in the document, along with other factors.

Vector space representation of a document involves three steps [7]. The first step is document indexing, where content-bearing terms are extracted from the documents. The second step is to compute the weights of the indexed terms to enhance retrieval of documents relevant to the user. The final step is identifying the similarities between the documents.

The vector space model is a common representation of text documents. Let D be a collection of documents and let

T = {t1, t2, …, tn} be the set of terms appearing in D. A document x ∈ D can be represented as an n-dimensional vector in the term space T. Let wx,ti be the number of times a term ti ∈ T appears in x; then the vector of x is defined as:

x = {wx,t1 , wx,t2 , …, wx,tn } (1)
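A minimal sketch of Eq. (1): counting how often each vocabulary term appears in one tokenized document. The vocabulary and document here are illustrative toy data, not drawn from the paper's corpora.

```python
from collections import Counter

def count_vector(tokens, terms):
    """Return [w_{x,t1}, ..., w_{x,tn}]: raw counts of each vocabulary term in the document."""
    counts = Counter(tokens)
    return [counts[t] for t in terms]

# Hypothetical vocabulary T shared by the collection, and one tokenized document x.
terms = ["cluster", "data", "mean", "text"]
doc = ["data", "cluster", "data", "text"]
print(count_vector(doc, terms))  # -> [1, 2, 0, 1]
```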

A. Extracting Index Terms

This involves preprocessing the text documents: applying stemming, removing stop words, and tokenizing the text. Documents in vector space can be represented using Boolean, Term Frequency, and Term Frequency–Inverse Document Frequency weighting.
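The preprocessing steps above can be sketched as follows. The stop-word list and suffix stripper are illustrative assumptions; a real pipeline would use a full stop list and a proper stemmer such as Porter's.

```python
import re

# Illustrative stop-word list (assumption, not the paper's exact list).
STOP_WORDS = {"the", "a", "is", "of", "and", "in", "to"}

def crude_stem(word):
    """Strip a few common suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def extract_index_terms(text):
    tokens = re.findall(r"[a-z]+", text.lower())          # tokenize
    tokens = [t for t in tokens if t not in STOP_WORDS]   # remove stop words
    return [crude_stem(t) for t in tokens]                # stem

print(extract_index_terms("Clustering of the documents"))  # -> ['cluster', 'document']
```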

In the Boolean representation, if a term exists in a document, the corresponding term value is set to one; otherwise it is set to zero. The Boolean representation is used when every term has equal importance and is applied when the documents are small.
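A sketch of the Boolean representation over a hypothetical vocabulary: presence or absence only, ignoring how often a term occurs.

```python
def boolean_vector(tokens, terms):
    """1 if the term occurs in the document at all, else 0."""
    present = set(tokens)
    return [1 if t in present else 0 for t in terms]

terms = ["cluster", "data", "mean"]  # illustrative vocabulary
print(boolean_vector(["data", "data", "cluster"], terms))  # -> [1, 1, 0]
```

Note that repeated occurrences ("data" twice) still map to 1, which is exactly why this weighting treats every term as equally important.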

In the Term Frequency and Term Frequency–Inverse Document Frequency representations, the term weights have to be set. The simplest choice is the raw frequency count of each term in the document. This reflects the intuition that terms that occur frequently within a document may reflect its meaning more strongly than terms that occur less frequently, and should thus have higher weights.

Each document d is considered as a vector in the term-space and represented by the term frequency (TF) vector:

dtf = ( tf1 , tf2 , … , tfD )

where tfi is the frequency of term i in the document and D is the total number of unique terms in the text database.

The second factor is used to give a higher weight to words that occur in only a few documents. Terms that are limited to a few documents are useful for discriminating those documents from the rest of the collection, while terms that occur frequently across the entire collection are not helpful. The inverse document frequency term weight is one way of assigning higher weights to these more discriminative words. IDF is defined via the fraction N/ni, where N is the total number of documents in the collection and ni is the number of documents in which term i occurs.
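The IDF factor log(N/ni) can be sketched directly from that definition, using a toy collection of term sets (illustrative data, not the paper's corpora):

```python
import math

def idf(term, documents):
    """log(N / n_i): N documents in total, n_i of them containing the term."""
    N = len(documents)
    n_i = sum(1 for doc in documents if term in doc)
    return math.log(N / n_i)  # n_i must be > 0, i.e. the term occurs somewhere

# Toy collection: "data" is common (low weight), "cluster" is rare (high weight).
docs = [{"data", "cluster"}, {"data", "mean"}, {"data"}, {"mean"}]
print(idf("data", docs))     # log(4/3), small
print(idf("cluster", docs))  # log(4/1), larger
```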

Thus, the tf–idf representation of the document d is:

dtf–idf = ( tf1 · log( N / df1 ), tf2 · log( N / df2 ), … , tfD · log( N / dfD ) )
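Putting the two factors together, the tf–idf vector of a document can be sketched as below; the corpus and vocabulary are toy examples, and terms absent from every document are given weight zero to avoid division by zero.

```python
import math
from collections import Counter

def tfidf_vector(tokens, terms, documents):
    """[tf_i * log(N / df_i)] over the vocabulary, per the representation above."""
    N = len(documents)
    tf = Counter(tokens)
    df = {t: sum(1 for d in documents if t in d) for t in terms}
    return [tf[t] * math.log(N / df[t]) if df[t] else 0.0 for t in terms]

corpus = [["data", "cluster"], ["data", "mean"], ["data", "text"]]  # toy collection
# "data" appears in every document, so its tf-idf weight collapses to zero.
print(tfidf_vector(corpus[0], ["data", "cluster", "mean"], corpus))
```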

To account for documents of different lengths, each document vector is normalized to a unit vector (i.e., ||dtf–idf|| = 1). In the rest of this paper, we assume that this vector space model is used to represent documents during clustering. Given a set Cj of documents and their corresponding vector representations, the centroid vector cj is defined as:

cj = ( 1 / |Cj| ) Σ di ,  di ∈ Cj

where each di is a document vector in the set Cj, and |Cj| is the number of documents in cluster Cj. It should be noted that even though each document vector di is of unit length, the centroid vector cj is not necessarily of unit length. In this paper we experiment with three vector space model representations: Boolean, Term Frequency, and Term Frequency–Inverse Document Frequency.
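A small sketch of the centroid computation, illustrating the remark that averaging unit-length document vectors generally yields a centroid shorter than unit length:

```python
import math

def unit(v):
    """Normalize a document vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def centroid(vectors):
    """Mean of the document vectors in cluster C_j: c_j = (1/|C_j|) * sum(d_i)."""
    m = len(vectors)
    return [sum(col) / m for col in zip(*vectors)]

# Two orthogonal unit-length documents (toy 2-term space).
cluster = [unit([1.0, 0.0]), unit([0.0, 1.0])]
c = centroid(cluster)
print(c)                                 # [0.5, 0.5]
print(math.sqrt(sum(x * x for x in c)))  # ~0.707, i.e. not unit length
```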

4. Similarity Measures:

There are many metrics for measuring document similarity. We focus on four measures common in this domain: cosine similarity [7], the Jaccard similarity coefficient, the Euclidean measure, and the correlation coefficient.

4.1 Cosine Similarity Measure

For document clustering, different similarity measures are available. The most commonly used is the cosine function. For two documents di and dj, the similarity between them is calculated as:

cos( di , dj ) = ( di · dj ) / ( || di || || dj || )

Since the document vectors are of unit length, the above equation is simplified to:

cos (di , dj) = di . dj

When the cosine value is 1, the two documents are identical; it is 0 when there is nothing in common between them (i.e., their document vectors are orthogonal to each other).
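The cosine measure can be sketched directly from the formula above; the example vectors are illustrative term-count vectors, and the general (unnormalized) form is shown so it also applies to vectors that are not unit length.

```python
import math

def cosine(d_i, d_j):
    """cos(d_i, d_j) = (d_i . d_j) / (||d_i|| * ||d_j||)."""
    dot = sum(a * b for a, b in zip(d_i, d_j))
    norm_i = math.sqrt(sum(a * a for a in d_i))
    norm_j = math.sqrt(sum(b * b for b in d_j))
    return dot / (norm_i * norm_j)

print(cosine([1, 2, 0], [1, 2, 0]))  # 1.0: identical documents
print(cosine([1, 0, 0], [0, 1, 0]))  # 0.0: orthogonal, nothing in common
```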

4.2 Jaccard Coefficient

The Jaccard coefficient, which is sometimes referred to as the Tanimoto coefficient, measures