Empirical Analysis of Single and Multi Document Summarization using Clustering Algorithms

The availability of various digital sources has created a demand for text mining mechanisms. Effective summary generation mechanisms are needed in order to utilize relevant information from often overwhelming digital data sources. In this view, this paper conducts a survey of various single- as well as multi-document text summarization techniques. It also analyzes the treatment of a query sentence as a common sentence, segmented from the documents, for text summarization. Experimental results show the degree of effectiveness of text summarization across different clustering algorithms.


I. INTRODUCTION
The extensive use of the Internet has caused vast growth in the use of digital information. People use online information services, such as social media, every day, which results in a huge amount of unstructured digital information. This information is directly accessible to a large number of end users [1][2]. Users access it through queries, but improving precision and speed remains an issue. Information retrieval (IR) systems have resolved this to some extent. The information overload problem becomes more acute when a decision must be taken or a problem must be understood in depth. IR systems address this through user-issued queries, but the results often overwhelm users with too many answers and may include documents that are not relevant to the topic asked. Multi-document summarization can summarize a complete document set. Ideally, it is a process of extracting query-relevant shared information from a set of multiple text documents. The techniques used in single-document summarization can also be used in multi-document summarization [3]. A comparison of single- and multi-document summarization is presented in Table I.
Retrieving web information relevant to an issued query is a tedious task. Information retrieval tools can be used to retrieve content relevant to the topic specified by the query, but the results obtained may not always preserve the required content. Summary generation, or automatic text summarization, is the creation of abstracts or summaries from one or more documents with the help of a computer program. There are two main types of text summarization techniques: generic and query specific [4]. It is difficult for the user to go through a large number of retrieved documents [5]. This difficulty can be resolved with query-specific document summary generation. The generated summary or abstract must preserve the semantics and central idea of the input text [6]. The main existing approaches to multi-document summarization are presented below.

A. Feature Based Method
The extractive summarization approach identifies the most relevant sentences in the original text and places them together to generate a concise summary. The process identifies relevant sentences based on features such as sentence length, word frequency, title word, sentence position, cue word, proper noun, etc.
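As an illustration of feature-based scoring, the following minimal Python sketch combines four of the features listed above (word frequency, title word, sentence position and cue word). The function names, the cue word list and the equal weighting of features are assumptions made for this example and are not taken from the surveyed methods.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase word tokens; a deliberately simple stand-in for a real tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())

def score_sentences(sentences, title, cue_words=("significant", "important", "conclusion")):
    """Score each sentence with a few surface features.

    Assumption: all features are normalized to [0, 1] and weighted equally.
    """
    word_freq = Counter(w for s in sentences for w in tokenize(s))
    max_freq = max(word_freq.values(), default=1)
    title_words = set(tokenize(title))
    scores = []
    for pos, sent in enumerate(sentences):
        words = tokenize(sent)
        if not words:
            scores.append(0.0)
            continue
        # Word frequency: average corpus frequency of the sentence's words.
        freq = sum(word_freq[w] for w in words) / (len(words) * max_freq)
        # Title word: share of title words that occur in the sentence.
        title_overlap = len(title_words & set(words)) / max(len(title_words), 1)
        # Sentence position: earlier sentences score higher.
        position = 1.0 - pos / len(sentences)
        # Cue word: 1 if any cue word is present.
        cue = 1.0 if any(c in words for c in cue_words) else 0.0
        scores.append(freq + title_overlap + position + cue)
    return scores

def summarize(sentences, title, k=2):
    """Return the k best-scoring sentences in their original order."""
    scores = score_sentences(sentences, title)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

In practice the feature weights would be tuned or learned; equal weights are used here only to keep the sketch short.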

B. Cluster Based Method
The aim of clustering is to group similar objects into classes. In multi-document summarization, these objects are the sentences and the classes are the clusters to which the sentences belong. For documents that cover different subjects or topics, some researchers integrate clustering based on sentence similarity. The most common similarity measure is cosine similarity. Sentences are then selected from each cluster on the basis of their tf-idf ranking within that cluster.
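The cluster-based idea can be sketched in Python as follows: sentences are turned into tf-idf vectors, grouped by cosine similarity, and one sentence per cluster is selected by its tf-idf weight. The greedy single-pass grouping, the smoothed idf formula and the threshold value are assumptions chosen to keep the example self-contained; they stand in for whichever clustering algorithm a given method actually uses.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf_vectors(sentences):
    """Build one sparse tf-idf vector (a dict) per sentence, using a smoothed idf."""
    docs = [Counter(tokenize(s)) for s in sentences]
    n = len(docs)
    df = Counter(w for d in docs for w in d)  # number of sentences containing each word
    return [{w: tf * (math.log((1 + n) / (1 + df[w])) + 1.0) for w, tf in d.items()}
            for d in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def cluster_sentences(vectors, threshold=0.2):
    """Greedy single-pass clustering (an assumption for this sketch):
    a sentence joins the first cluster whose first member is similar enough."""
    clusters = []
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters

def summarize(sentences, threshold=0.2):
    """Pick, from each cluster, the sentence with the largest total tf-idf weight."""
    vectors = tfidf_vectors(sentences)
    picked = [max(members, key=lambda i: sum(vectors[i].values()))
              for members in cluster_sentences(vectors, threshold)]
    return [sentences[i] for i in sorted(picked)]
```

Selecting by total tf-idf weight favors longer sentences; a length-normalized score is an equally plausible reading of "ranking tf-idf in that cluster".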

C. Graph Based Method
This method uses a graph to represent the relationships between elements. Related elements in the graph are linked. For text, the relationship between elements is the similarity between sentences. The text is represented as a graph G=(V, E), where V is the set of vertices representing sentences and E is the set of edges representing the association among the sentences. Strongly connected sentences are considered for the summary. Many graph-based approaches use cosine similarity to identify the association.
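A minimal Python sketch of the graph G=(V, E) described above is given below. Edges are created when the cosine similarity of two sentences reaches a threshold, and sentences are ranked by the total weight of their edges. The bag-of-words vectors, the threshold value and the use of weighted degree as the measure of "strongly connected" are assumptions for illustration; published graph-based methods often use more elaborate ranking schemes.

```python
import math
import re
from collections import Counter

def bag_of_words(sentence):
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))

def cosine(u, v):
    """Cosine similarity between two word-count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def build_graph(sentences, threshold=0.1):
    """Return (V, E): V lists sentence indices, E maps (i, j) with i < j to an edge weight."""
    vectors = [bag_of_words(s) for s in sentences]
    V = list(range(len(sentences)))
    E = {}
    for i in V:
        for j in V[i + 1:]:
            weight = cosine(vectors[i], vectors[j])
            if weight >= threshold:
                E[(i, j)] = weight
    return V, E

def rank_by_connectivity(V, E):
    """Order vertices by the sum of their edge weights, strongest first."""
    strength = {i: 0.0 for i in V}
    for (i, j), weight in E.items():
        strength[i] += weight
        strength[j] += weight
    return sorted(V, key=lambda i: strength[i], reverse=True)
```

The first few entries of the ranking would be taken as summary sentences; a sentence with no edges always ranks last.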

D. Knowledge based Method
The documents are organized with text content related to a specific topic belonging to a particular domain. Every domain has a common knowledge structure. Researchers have used a common background knowledge structure (i.e., an ontology) to improve summary results. There have been efforts to utilize this background knowledge, and many applications have tailored their models to be ontology-driven [4]. An ontology can be useful for domain-specific documents, where key concepts corresponding to the domain can be identified. The technique is implemented as a query-specific one for the respective domain by identifying keywords.

E. Our Approach
This paper presents a combined approach that uses topic queries or important keywords corresponding to the document set, together with the fundamental concept of clustering and language features, to extract the relevant sentences from the original document set. The features of clustering algorithms and NLP-based retrieval can be useful in preserving the context of the information in the retrieval process [7].

Degree of coherence
Single document: A change in the sequence of sentence selection does not affect the degree of coherence.
Multi document: The order of sentence selection may affect the degree of coherence.

Redundancy
Single document: Topics in a single document are related. The degree of redundancy is high.
Multi document: Some information that may be seen as redundant might be important and vice versa.

Compression ratio
Single document: Usually much smaller.
Multi document: Usually higher.

Cross-reference resolution
Single document: Can be easily resolved.
Multi document: Is a greater challenge.

II. SYSTEM ARCHITECTURE
The process of automatic text summarization consists of two main tasks. The first is to recognize the most significant text portions and the second is to produce coherent summaries. The information retrieval (IR) process is used to search documents on the web. Since a massive amount of vague data is available on the web [8], the use of IR tools has given rise to the need for query-dependent document summarization.
IR systems have exploited natural language processing (NLP) techniques to support a range of natural language queries. Query processing for text summarization without NLP support may result in an imprecise summary, and the user may not see correct or reliable results [9].
The system takes the original document or document set in txt form. Documents are preprocessed using basic NLP steps such as sentence detection, tokenization, part-of-speech tagging, chunking and parsing. NLP is implemented using the OpenNLP tool. These NLP steps help to identify the correct word match with respect to the context within the document by removing ambiguity, if any [10]. The result of preprocessing is passed to the clustering algorithms along with the keywords used for the summarization process. Clustering algorithms such as EM, the graph-based method, Fuzzy C-Means, DBSCAN, and hierarchical clustering are used to obtain the summary. The summary is also computed with a simple query-specific approach [11]. The resulting summary can be evaluated on the basis of qualitative and quantitative metrics. Precision, recall and F-measure are quality-measuring metrics, and compression ratio and retention ratio are quantity-measuring metrics [12].
[Figure: System Architecture]
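The flow from preprocessing to a simple query-specific summary can be sketched in Python as shown below. The regular-expression sentence detector and tokenizer are simplified stand-ins for the OpenNLP steps named above (part-of-speech tagging, chunking and parsing are omitted), and the keyword-hit scoring is an assumed reading of the "simple query-specific approach"; all function names are hypothetical.

```python
import re

def detect_sentences(text):
    """Naive sentence detection: split after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())

def preprocess(text):
    """Return (sentence, tokens) pairs, the input handed on to later stages."""
    return [(s, tokenize(s)) for s in detect_sentences(text)]

def query_specific_summary(text, keywords, max_sentences=3):
    """Keep the sentences with the most keyword hits, in document order."""
    keys = {k.lower() for k in keywords}
    scored = []
    for pos, (sentence, tokens) in enumerate(preprocess(text)):
        hits = sum(1 for t in tokens if t in keys)
        if hits:
            scored.append((hits, pos, sentence))
    # Most hits first; ties are broken by earlier position.
    best = sorted(scored, key=lambda item: (-item[0], item[1]))[:max_sentences]
    return [sentence for _, _, sentence in sorted(best, key=lambda item: item[1])]
```

In the clustering variants, the output of `preprocess` and the keyword list would be passed to a clustering step instead of the keyword-hit filter.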

III. IMPACT OF CLUSTERING IN AUTO TEXT SUMMARIZATION
Clustering is very helpful in the text domain because the objects to be clustered (words, sentences, paragraphs) are of varying granularities. Clustering is particularly useful for grouping documents to improve retrieval and support browsing. In [13], the authors identified and selected clustering algorithms for obtaining a document summary. The main motive of the research was to extract into the summary those sentences that are most relevant to the original input text, using clustering algorithms that group objects based on relevance [14]. The approach tries to combine the two major approaches to summary generation: extractive and abstractive. The results obtained with the different methods are then evaluated for summary quality.

IV. EXPERIMENTATION AND RESULT DISCUSSION
Results are tested on inputs from existing datasets such as Reuters. The Reuters dataset is a popular dataset for text mining experiments. Different splits into training, test, and unused data have been considered. In the case of abstractive summarization, the quality of the summary obtained is more important. This quality is judged on the basis of qualitative evaluation parameters. The length or size of the summarized text is evaluated on the basis of the quantitative measures compression ratio and retention ratio. The results obtained are compared on the basis of these evaluation measures. The generated summaries are also compared with those of two existing query-based summarizers, Copernic Summarizer and Web Summarizer [15]. The values of these parameters for the corresponding document are calculated as shown below.

www.etasr.com Bewoor and Patil: Empirical Analysis of Single and Multi Document Summarization using Clustering …
Precision indicates the probability that a retrieved document is relevant in the search:
Precision = (No. of different terms in summary) / (No. of different terms in query)
Recall is the probability that a relevant document is retrieved in the search:
Recall = (No. of correctly matching sentences in the summary) / (No. of all relevant sentences in the original document)
F-measure is the harmonic mean of precision and recall:
F-Measure = 2 × (Precision × Recall) / (Precision + Recall)
Precision, recall and F-measure measure the quality of the summary, and so they, along with execution time, are called qualitative parameters [16].
Compression Ratio = (No. of sentences in summary) / (Total No. of sentences in original document)
Retention Ratio = (No. of relevant query words in summary) / (No. of query terms in original data)
Compression ratio and retention ratio measure the length or quantity of the sentences in the summary and are therefore called quantitative parameters.
Results are shown in Tables II-IV and Figures 2-7 for various methodologies on data obtained from the WikiArt (data set A, single-document summarization), Reuters (data set B, multi-document summarization) and Wiki Internet (data set C, multi-document summarization) data sets. Considering both the quantity and quality parameters, clustering is an unsupervised text summarization technique that can be used in a supervised manner by integrating it with a supervised approach. This may give an optimal solution to the problem. Future research should focus on improving the quality of clusters, which relates directly to the gist of the original input document.
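The evaluation measures defined in this section can be computed as in the following Python sketch. The functions take the raw counts named in the definitions as arguments; the function and parameter names are chosen for this example, and returning 0.0 for a zero denominator is an assumption, since the text does not say how that case is handled.

```python
def _ratio(numerator, denominator):
    """Safe division; returns 0.0 when the denominator is zero (an assumption)."""
    return numerator / denominator if denominator else 0.0

def precision(terms_in_summary, terms_in_query):
    """Different terms in the summary over different terms in the query."""
    return _ratio(terms_in_summary, terms_in_query)

def recall(matching_sentences_in_summary, relevant_sentences_in_document):
    """Correctly matching summary sentences over all relevant sentences."""
    return _ratio(matching_sentences_in_summary, relevant_sentences_in_document)

def f_measure(p, r):
    """Harmonic mean of precision and recall."""
    return _ratio(2 * p * r, p + r)

def compression_ratio(sentences_in_summary, sentences_in_document):
    """Summary length relative to the original document, in sentences."""
    return _ratio(sentences_in_summary, sentences_in_document)

def retention_ratio(query_words_in_summary, query_terms_in_document):
    """Relevant query words kept in the summary over query terms in the original data."""
    return _ratio(query_words_in_summary, query_terms_in_document)
```

For example, a 3-sentence summary of a 12-sentence document has a compression ratio of 3/12 = 0.25, and precision 0.5 with recall 1.0 gives an F-measure of 2/3.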

TABLE I. COMPARISON OF SINGLE AND MULTIPLE DOCUMENT SUMMARIZATION PROCESS

TABLE II. PERFORMANCE EVALUATION FOR DATA SET A