Module for automatic summarization of text documents and HTML pages.
Sumy started as my diploma thesis, born from the need to shorten articles in Czech and Slovak. Although its source code has always been publicly available on GitHub, I didn't expect so many people to adopt it. Don't get me wrong, I am happy about it, but it also explains why the documentation is lacking and why some features hardcoded for Slovak/Czech can still be found in the codebase. Because the thesis is written in Slovak, I will try to describe the practical parts here for people using the library.
Sumy creates extractive summaries. That means it tries to find the most significant sentences in the document(s) and composes them into a shortened text. There is another approach called abstractive summarization, but to produce such a summary one needs to understand the topic and write a new, shorter text about it. That is beyond Sumy's current capabilities.
Even though I focused on Czech and Slovak in my work, I wanted Sumy to be extensible to other languages from the start. That's why I designed it as a set of independent objects that the user of the library can replace to add new or better capabilities.
The central object is the Document, which represents the whole document ready to be summarized. It consists of a collection of Paragraphs, each of which consists of a collection of Sentences. Every sentence has a boolean flag is_heading indicating whether it is a normal sentence or a heading. It also has a tokenizer attached, so you can get a list of words from it. A Word, however, is represented as a simple string.
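The hierarchy described above can be pictured with a few small classes. This is only a simplified stand-in, assuming frozen dataclasses and a plain callable in place of the attached tokenizer; the field names and the sentences/headings helpers are illustrative and are not Sumy's actual model classes.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Sentence:
    """Illustrative stand-in for a sentence; not Sumy's own class."""
    text: str
    to_words: Callable[[str], List[str]]  # plays the role of the attached tokenizer
    is_heading: bool = False

    @property
    def words(self) -> List[str]:
        # Words are plain strings, produced on demand from the sentence text.
        return self.to_words(self.text)


@dataclass(frozen=True)
class Paragraph:
    sentences: Tuple[Sentence, ...]


@dataclass(frozen=True)
class Document:
    paragraphs: Tuple[Paragraph, ...]

    @property
    def sentences(self) -> List[Sentence]:
        # Normal sentences only; headings are reported separately.
        return [s for p in self.paragraphs for s in p.sentences if not s.is_heading]

    @property
    def headings(self) -> List[Sentence]:
        return [s for p in self.paragraphs for s in p.sentences if s.is_heading]


heading = Sentence("INTRODUCTION", str.split, is_heading=True)
body = Sentence("Sumy picks significant sentences", str.split)
document = Document((Paragraph((heading, body)),))
print(body.words)
```

Here str.split stands in for a real tokenizer just to keep the sketch short.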
To create a Document (or a Parser) you will need a Tokenizer. The Tokenizer is one of the language-specific pieces of the puzzle. I use the nltk library for this, so there is a good chance your language is covered. Simply try passing your language name to it and you will see whether it works :) If it raises an exception, you have two choices. The first is to send a pull request to Sumy with a new Tokenizer for your language. The second is to create your own Tokenizer and pass it to Sumy. And once you have it, it should be easy to send the pull request with your code anyway. A tokenizer is any object with two methods: to_sentences(paragraph: str) and to_words(sentence: str).
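As a rough illustration of that two-method interface, here is a minimal regex-based tokenizer. The class name RegexTokenizer and its splitting rules are assumptions made for this sketch; it is far cruder than what nltk does and ignores abbreviations, quotes, and similar cases.

```python
import re
from typing import List


class RegexTokenizer:
    """Hypothetical tokenizer exposing to_sentences() and to_words()."""

    # Split after ., ! or ? when followed by whitespace.
    _SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
    # A word is any run of letters, digits, or underscores.
    _WORD = re.compile(r"\w+")

    def to_sentences(self, paragraph: str) -> List[str]:
        parts = self._SENTENCE_END.split(paragraph.strip())
        return [part for part in parts if part]

    def to_words(self, sentence: str) -> List[str]:
        return self._WORD.findall(sentence)


tokenizer = RegexTokenizer()
print(tokenizer.to_sentences("Sumy is small. It works! Does it?"))
print(tokenizer.to_words("Hello, world!"))
```

Because only the two methods matter, an object like this could be handed over wherever a Tokenizer is expected.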
You can create the Document by hand, but that would not be very convenient. That's why there is the DocumentParser for the job. It is the base class you can inherit from and extend to transform your input document format into a Document object. Sumy provides two implementations. The first one is the PlainTextParser. The name is not entirely accurate because some very simple formatting is expected: Paragraphs are separated by a single empty line, and a heading can be created by writing the whole sentence in UPPER CASE letters. But that's all. The more interesting implementation is the HtmlParser. It extracts the main article from an HTML page with the help of the breadability library and returns a Document with useful meta-information extracted from the HTML markup. Many other summarizers use an XML format for their input documents, and it should not be hard to implement a parser for it if you want to. All you have to do is inherit from DocumentParser and define the property DocumentParser.document, which returns a Document object.
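To show the general shape of such a parser, the sketch below turns the simple plain-text format described above into a nested structure through a document property. Everything here is an assumption for illustration: SimpleTextParser does not inherit from Sumy's DocumentParser, the Sentence tuple is a stand-in, and each non-empty line is treated as one sentence instead of being run through a tokenizer.

```python
from typing import List, NamedTuple


class Sentence(NamedTuple):
    """Stand-in for a sentence: its text and whether it is a heading."""
    text: str
    is_heading: bool


class SimpleTextParser:
    """Hypothetical parser; mimics only the idea of a `document` property."""

    def __init__(self, text: str):
        self._text = text

    @property
    def document(self) -> List[List[Sentence]]:
        paragraphs = []
        # Paragraphs are separated by a single empty line.
        for block in self._text.strip().split("\n\n"):
            lines = [line.strip() for line in block.splitlines() if line.strip()]
            if lines:
                # A line written entirely in upper case counts as a heading.
                paragraphs.append([Sentence(line, line.isupper()) for line in lines])
        return paragraphs


text = "INTRODUCTION\nSumy shortens text.\n\nIt picks sentences."
for paragraph in SimpleTextParser(text).document:
    print(paragraph)
```

A parser for another input format would keep the same outline and change only how the input is split into paragraphs and sentences.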
OK, now you know how to create a Document from your text. Next, you probably want to summarize it. Before we do that, you should know that the Document can be preprocessed in any way. You can transform it or enrich it with information that is important to you. You can even add or remove parts of it. Whatever you need. In some edge cases, you can even create a new Document, as long as you adhere to the API.
Then you need a Stemmer. Stemmer is just a fancy word for an algorithm that tries to normalize different forms of a word into a single one. The simplest stemmer implementation in Sumy is the so-called null_stemmer. It is handy for languages like Chinese, Japanese, or Korean, where words do not need to be unified. Czech and Slovak have a custom Stemmer in Sumy. All other languages use nltk for this. So again, there is a good chance your language is covered. But a stemmer is any callable that takes a word and returns a word. That is good news for you, because you can implement your own simply by writing a new function.
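Since a stemmer is just a callable from word to word, a custom one can be very small. The function below is a deliberately naive sketch: the name suffix_stemmer and the suffix list are made up for this example, and real stemming algorithms handle many more cases.

```python
# Hypothetical suffix list, checked in order from longest to shortest.
SUFFIXES = ("ing", "ed", "es", "s")


def suffix_stemmer(word: str) -> str:
    """Lower-case the word and strip one common English suffix, if any."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters of the stem so short words survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for form in ("Walking", "walked", "walks", "walk"):
    print(form, "->", suffix_stemmer(form))
```

All four forms end up as the same string, which is exactly the kind of unification a summarizer relies on when it counts words.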
And we are reaching the finish line. You have created your Document and you are not afraid to use your Stemmer. Now you are ready to choose one of the Summarizers. Probably any except the RandomSummarizer, which serves only as a lower bound when evaluating the quality of summaries. The Summarizer needs a Stemmer as its dependency and, optionally, a list of stop-words. Although the stop-words are optional, I really recommend using them to get better results. You can use sumy.utils.get_stop_words(language: str) or simply provide your own list of words. After all of this, your summarizer is ready to serve you. Simply give it the Document and the number of sentences you want it to return, and you are done.
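To make the roles of the stemmer, the stop-words, and the sentence count concrete, here is a toy extractive summarizer built on word frequencies. It is not one of Sumy's summarizers: the FrequencySummarizer class, its scoring rule, and the fact that it takes a plain list of sentence strings instead of a Document are all assumptions made to keep the sketch self-contained.

```python
import re
from collections import Counter
from typing import Callable, List


class FrequencySummarizer:
    """Toy extractive summarizer: a stemmer dependency plus optional stop-words."""

    def __init__(self, stemmer: Callable[[str], str] = str.lower):
        self._stemmer = stemmer
        self.stop_words = frozenset()  # optional; set it after construction

    def _terms(self, sentence: str) -> List[str]:
        # Normalize every word with the stemmer, then drop stop-words.
        stems = (self._stemmer(word) for word in re.findall(r"\w+", sentence))
        return [stem for stem in stems if stem not in self.stop_words]

    def __call__(self, sentences: List[str], count: int) -> List[str]:
        frequencies = Counter(t for s in sentences for t in self._terms(s))

        def score(index: int) -> int:
            # A sentence is as significant as the terms it contains are frequent.
            return sum(frequencies[t] for t in self._terms(sentences[index]))

        best = sorted(range(len(sentences)), key=score, reverse=True)[:count]
        # Return the selected sentences in their original order.
        return [sentences[i] for i in sorted(best)]


sentences = [
    "Cats chase mice.",
    "The weather is nice.",
    "Cats and mice are the enemies.",
]
summarizer = FrequencySummarizer()
summarizer.stop_words = frozenset({"the", "is", "and", "are"})
for sentence in summarizer(sentences, 2):
    print(sentence)
```

Without the stop-words, frequent but meaningless words such as "the" would push the weather sentence up the ranking, which is why supplying them tends to improve the result.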
You can find specifics of the individual summarizers on a separate page.