Extractive summarization of English, general-purpose web pages

Automatic summarization is the task of reducing a textual document, or a larger corpus of multiple documents, into a short set of words, phrases or paragraphs that conveys the main meaning of the text.
There are two families of automatic text summarization methods: extractive and abstractive. Extractive methods work by selecting a subset of existing words, phrases, or sentences in the original text to form the summary. In contrast, abstractive methods build an internal semantic representation and then use natural language generation techniques to create a summary that is closer to what a human might produce. Such a summary might contain words not explicitly present in the original. The current paper is devoted to extractive methods.

Extractive methods can be divided into two groups: single-document-oriented and corpus-oriented ones. The first approach focuses on discovering meaningful words inside a document in order to index it. The second approach assumes that there is a representative corpus (a set of documents) which may be mined to obtain a collection of significant concepts and their potential relations. Extracting terms from a corpus is frequently used in the problem of ontology learning from text. Our survey focuses on keyword extraction from a single document (specifically, the text of a web page).

Keyword extraction methods may be categorized by the type of technique used to identify important words. The most popular keyword extraction methods can be categorized as:
(1) linguistic methods based on PoS (Part of Speech) patterns;
(2) unsupervised methods exploiting statistical/graph properties, e.g. RAKE (a minimal sketch of this style of scoring follows the list);
(3) supervised methods using Bayesian or Random Forest approaches, e.g. KEA, MAUI, TKE;
(4) knowledge-rich methods based on Wikipedia or BabelNet resources.
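As an illustration of class (2), the following is a minimal, RAKE-style sketch (not the exact implementation evaluated here): candidate phrases are obtained by splitting on stopwords and punctuation, each word is scored as degree/frequency over the candidates, and a phrase score is the sum of its word scores. The stopword list and sample text are placeholders.

```python
import re
from collections import defaultdict

# Tiny stopword list for illustration; a real run would use a full English stopword set.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "are", "for", "on", "that", "with"}

def rake_keywords(text, top_n=5):
    """Score candidate phrases RAKE-style: split on stopwords/punctuation,
    score each word by degree/frequency, sum word scores per phrase."""
    tokens = re.findall(r"[a-zA-Z]+|[.,;:!?]", text.lower())
    # Build candidate phrases by breaking at stopwords and punctuation.
    phrases, current = [], []
    for token in tokens:
        if token in STOPWORDS or not token.isalpha():
            if current:
                phrases.append(tuple(current))
                current = []
        else:
            current.append(token)
    if current:
        phrases.append(tuple(current))

    # Word scores: degree (co-occurrence within candidate phrases) / frequency.
    freq, degree = defaultdict(int), defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)
    word_score = {w: degree[w] / freq[w] for w in freq}

    # Phrase score = sum of its word scores; return the top-scoring phrases.
    scored = {p: sum(word_score[w] for w in p) for p in set(phrases)}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

if __name__ == "__main__":
    sample = ("Extractive summarization selects a subset of existing words, "
              "phrases or sentences in the original text to form the summary.")
    for phrase, score in rake_keywords(sample):
        print(" ".join(phrase), round(score, 2))
```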

In our work, apart from comparing the above-mentioned methods, we also introduce the emerging class of neural-network-based approaches.
We propose new methods exploiting:
a) word embeddings and paragraph vectors (a similarity-based sketch follows this list);
b) deep belief and convolutional networks;
c) an ensemble approach combining different classes of methods in order to overcome their individual pitfalls.
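As a sketch of approach a), keyword candidates can be ranked by the cosine similarity between their averaged word vectors and the document centroid vector. This is only an illustration of the embedding-based idea under simplified assumptions; the tiny embedding table stands in for real pretrained word embeddings or a paragraph vector model.

```python
import numpy as np

# Placeholder 4-dimensional embeddings; in practice these would come from
# pretrained word2vec/GloVe vectors or a doc2vec (paragraph vector) model.
EMBEDDINGS = {
    "neural":     np.array([0.9, 0.1, 0.0, 0.2]),
    "networks":   np.array([0.8, 0.2, 0.1, 0.1]),
    "keyword":    np.array([0.1, 0.9, 0.2, 0.0]),
    "extraction": np.array([0.2, 0.8, 0.1, 0.1]),
    "news":       np.array([0.0, 0.1, 0.9, 0.3]),
    "pages":      np.array([0.1, 0.0, 0.8, 0.4]),
}

def embed(words):
    """Average the known word vectors; unknown words are skipped."""
    vectors = [EMBEDDINGS[w] for w in words if w in EMBEDDINGS]
    return np.mean(vectors, axis=0) if vectors else None

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_candidates(document_words, candidates, top_n=3):
    """Rank candidate phrases by similarity to the document centroid vector."""
    doc_vec = embed(document_words)
    scored = []
    for phrase in candidates:
        vec = embed(phrase.split())
        if vec is not None:
            scored.append((phrase, cosine(vec, doc_vec)))
    return sorted(scored, key=lambda kv: kv[1], reverse=True)[:top_n]

if __name__ == "__main__":
    doc = "neural networks for keyword extraction from news pages".split()
    candidates = ["neural networks", "keyword extraction", "news pages"]
    for phrase, sim in rank_candidates(doc, candidates):
        print(phrase, round(sim, 3))
```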

All the above-mentioned algorithms were trained, tuned and tested on a large corpus of general-purpose, English web pages coming from news providers (almost 100,000 web pages with predefined manual keywords). The evaluation was performed using quality measures such as
precision, recall and ROUGE metrics.
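For concreteness, the sketch below computes exact-match precision, recall and F1 for a set of predicted keywords against gold keywords, plus a simple ROUGE-1-style unigram recall. It illustrates the metric definitions only and is not the exact evaluation pipeline used here; the sample keyword lists are made up.

```python
def precision_recall_f1(predicted, gold):
    """Exact-match precision/recall/F1 over sets of keyphrases."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

def rouge1_recall(predicted, gold):
    """ROUGE-1-style recall: fraction of gold unigrams covered by predictions."""
    pred_unigrams = {w for phrase in predicted for w in phrase.split()}
    gold_unigrams = [w for phrase in gold for w in phrase.split()]
    if not gold_unigrams:
        return 0.0
    return sum(w in pred_unigrams for w in gold_unigrams) / len(gold_unigrams)

if __name__ == "__main__":
    predicted = ["keyword extraction", "web pages", "neural networks"]
    gold = ["keyword extraction", "news web pages"]
    p, r, f = precision_recall_f1(predicted, gold)
    print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
    print(f"rouge-1 recall={rouge1_recall(predicted, gold):.2f}")
```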

Author: Marek Kozłowski
Conference: Title