Module for automatic summarization of text documents and HTML pages.
Sumy was created as part of my diploma thesis, out of the need to shorten articles in Czech and Slovak. Although its source code was always publicly available on GitHub, I didn’t expect so many people to adopt it. Don’t get me wrong, I am happy about it, but that is also why you may find a lack of documentation and some hardcoded features for Czech/Slovak in the codebase. Because the thesis is written in Slovak, I will try to describe some of the practical parts here for people using the library.
Sumy creates extractive summaries. That means it tries to find the most significant sentences in the document(s) and composes them into a shortened text. There is another approach called abstractive summarization, but to create such a summary one needs to understand the topic and write new, shorter text about it. This is outside the scope of Sumy’s current capabilities.
Even though I focused on Czech/Slovak in my work, I wanted Sumy to be extensible to other languages from the start. That’s why I built it as a set of independent objects that the user of the library can replace to add new or better capabilities.
The central object is the Document, which represents the whole document ready to be summarized. It consists of a collection of Paragraphs, each of which consists of a collection of Sentences. Every sentence has a boolean flag is_heading indicating whether it is a normal sentence or a heading. It also has a tokenizer attached, so you can get a list of words from it. A Word, however, is represented as a simple string.
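The nesting described above can be pictured with a few small classes. This is only a minimal sketch to illustrate the shape of the model: the class bodies, the to_words hook and the whitespace_words helper are assumptions made for the example, not Sumy’s actual implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List


def whitespace_words(text: str) -> List[str]:
    # Stand-in for a real tokenizer: split on whitespace only.
    return text.split()


@dataclass
class Sentence:
    text: str
    is_heading: bool = False
    # Hypothetical hook that splits the sentence text into words.
    to_words: Callable[[str], List[str]] = whitespace_words

    @property
    def words(self) -> List[str]:
        # Words are plain strings, there is no dedicated Word class.
        return self.to_words(self.text)


@dataclass
class Paragraph:
    sentences: List[Sentence] = field(default_factory=list)


@dataclass
class Document:
    paragraphs: List[Paragraph] = field(default_factory=list)

    @property
    def sentences(self) -> List[Sentence]:
        # Flatten all non-heading sentences across the paragraphs.
        return [s for p in self.paragraphs for s in p.sentences if not s.is_heading]


doc = Document([
    Paragraph([
        Sentence("INTRODUCTION", is_heading=True),
        Sentence("Sumy shortens long articles."),
    ]),
])
```

Here doc.sentences skips the heading and returns only the normal sentence, and each sentence yields its words as plain strings.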
To create a Parser you will need a Tokenizer. The Tokenizer is one of the language-specific parts of the puzzle. I use the nltk library for that, so there is a good chance your language is covered by it. Simply try to pass your language name to it and you will see whether it works :) If it raises an exception, you have two choices. The first is to send a pull request to Sumy with a new Tokenizer for your language. The second is to create your own Tokenizer and pass it to Sumy. And you know, once you have it, it should be easy to send the pull request with your code anyway. The tokenizer is any object with two methods: to_sentences(paragraph: str) and a second one that splits a sentence into words.
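A custom tokenizer following the interface described above could look roughly like this. It is a rough sketch, not a tokenizer shipped with Sumy: the class name SimpleTokenizer and the regular expressions are made up for the example, and the name of the second method, to_words, is an assumption because the text above names only to_sentences.

```python
import re
from typing import List


class SimpleTokenizer:
    """Hypothetical regex-based tokenizer; not part of Sumy."""

    # Split on whitespace that follows '.', '!' or '?'.
    _SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
    # A word is a run of letters, digits or underscores (Unicode-aware).
    _WORD = re.compile(r"\w+")

    def to_sentences(self, paragraph: str) -> List[str]:
        parts = self._SENTENCE_END.split(paragraph.strip())
        return [p for p in parts if p]

    def to_words(self, sentence: str) -> List[str]:
        # Method name assumed; the text names only to_sentences.
        return self._WORD.findall(sentence)
```

A splitter this naive breaks on abbreviations such as "e.g.", which is one reason to prefer a language-aware tokenizer when one exists for your language.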
You can create the Document by hand, but that would not be very convenient. That’s why there is the DocumentParser for the job. It is the base class you can inherit from and extend to create your own transformation from the input document format to the Document object. Sumy provides two implementations. The first one is the PlainTextParser. The name is not quite accurate because some very simple formatting is expected: paragraphs are separated by a single empty line, and a heading can be created by writing the whole sentence in UPPER CASE letters. But that’s all. The more interesting implementation is the HtmlParser. It extracts the main article from an HTML page with the help of the breadability library and returns a Document with useful meta-information about the document extracted from the HTML markup. Many other summarizers use an XML format for input documents, and it should not be hard to implement a parser for it if you want to. All you have to do is inherit from DocumentParser and define the property that returns the Document.
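The plain-text conventions mentioned above (an empty line between paragraphs, UPPER CASE for headings) can be illustrated with a small standalone function. This is a simplified sketch and not Sumy’s parser: parse_plain_text is a hypothetical name, every non-empty line is treated as one sentence, and the result is a plain list structure instead of a Document.

```python
from typing import List, Tuple

# Each sentence is a (text, is_heading) pair; a paragraph is a list of them.
ParsedSentence = Tuple[str, bool]


def parse_plain_text(text: str) -> List[List[ParsedSentence]]:
    paragraphs: List[List[ParsedSentence]] = []
    current: List[ParsedSentence] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # An empty line closes the current paragraph.
            if current:
                paragraphs.append(current)
                current = []
            continue
        # A line written entirely in upper case is treated as a heading.
        current.append((line, line.isupper()))
    if current:
        paragraphs.append(current)
    return paragraphs
```

A real parser would additionally run a tokenizer over each line to split it into sentences and would wrap the result in the document model.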
OK, now you know how to create the Document from your text. Next, you probably want to summarize it. Before we do that, you should know that the Document can be preprocessed in any way. You can transform or enhance it with information that is important to you. You can even add or remove parts of it. Whatever you need. In some edge cases, you can even create a new Document, as long as you adhere to the API.
Then you need a Stemmer. Stemmer is just a fancy word for an algorithm that tries to normalize different forms of a word into a single one. The simplest stemmer implementation in Sumy is the so-called null_stemmer. It is handy for languages such as Chinese, Japanese, or Korean, where words do not need to be unified. The Czech and Slovak languages have a custom Stemmer in Sumy. All other languages use nltk for this, so again, there is a good chance your language is covered. A stemmer is simply any callable that takes a word and returns a word. That is good news for you, because you can implement your own by writing a new function.
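Since a stemmer is just a callable from word to word, writing one can be as small as the functions below. Both are illustrative sketches with made-up names (identity_stemmer, suffix_stemmer); the suffix rules are a deliberately naive English-only heuristic and are not taken from Sumy or nltk.

```python
from typing import Callable

# Any callable with this shape can act as a stemmer.
Stemmer = Callable[[str], str]


def identity_stemmer(word: str) -> str:
    # Leaves the word as it is apart from lower-casing it.
    return word.lower()


def suffix_stemmer(word: str) -> str:
    # Naive rule: strip the first matching suffix, keeping at least 3 characters.
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word
```

With these rules "Walking" and "walked" both map to "walk", which is the kind of unification a summarizer benefits from when it counts words.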
And we are reaching the finish line. You have your Document created and you are not afraid to use your Stemmer. Now you are ready to choose one of the Summarizers, probably any except the RandomSummarizer, which serves only as a lower bound when evaluating the quality of summaries. The Summarizer needs a Stemmer as its dependency and, optionally, a list of stop-words. Although the list is optional, I really recommend using it to get better results. You can use sumy.utils.get_stop_words(language: str) or simply provide your own list of words. After all this, your summarizer is ready to serve you. Simply give it the Document and the number of sentences you want back, and you are done.
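To show how the pieces fit together (a stemmer, optional stop-words, and a sentence count), here is a toy extractive summarizer. It is a sketch under stated assumptions: the summarize function and its word-frequency scoring are invented for illustration and are not presented as any of Sumy’s algorithms, and it works on a plain list of sentence strings instead of a Document.

```python
import re
from collections import Counter
from typing import Callable, Iterable, List


def summarize(
    sentences: List[str],
    count: int,
    stemmer: Callable[[str], str] = str.lower,
    stop_words: Iterable[str] = (),
) -> List[str]:
    # Stem the stop-words too, so they match the stemmed tokens.
    stop = {stemmer(w) for w in stop_words}

    def stems(sentence: str) -> List[str]:
        words = re.findall(r"\w+", sentence)
        return [s for s in map(stemmer, words) if s not in stop]

    # Count how often each stem occurs in the whole document.
    frequency = Counter(s for sentence in sentences for s in stems(sentence))

    def score(sentence: str) -> float:
        tokens = stems(sentence)
        if not tokens:
            return 0.0
        # Average stem frequency, so long sentences are not favoured.
        return sum(frequency[t] for t in tokens) / len(tokens)

    # Pick the best-scoring sentences, then restore document order.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:count])
    return [sentences[i] for i in chosen]
```

Without the stop-word list, frequent filler words such as "the" would dominate the scores, which matches the recommendation above to always supply one.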
You can find some specifics of the individual summarizers on a separate page.