Bistoon summarization corpus

Bistoon is a comprehensive, human-labeled summarization corpus collected through crowdsourcing.


Saeed Farzi, Sahar Kianian

Full Description

A Web crawler was used to collect news texts from several well-known Persian news agencies, including Tabnak, ISNA, and IRNA. For every news article, the title, body, journalist’s name, publication date, and news domain are kept. Unicode conversion, tokenization, sentence splitting, and spell-checking were then performed. Next, the normalized texts were summarized and then verified by human experts through a Web-based crowdsourcing tool. In total, 285 human experts were involved in the annotation, and each expert summarized approximately 52 texts. Some statistics about the corpus follow.
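The preprocessing steps named above (Unicode conversion, sentence splitting, tokenization) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the `NewsRecord` fields simply mirror the metadata listed above, and the function names, the character mapping, and the regular expressions are assumptions chosen for illustration. Spell-checking is omitted because it depends on an external lexicon.

```python
import re
import unicodedata
from dataclasses import dataclass


@dataclass
class NewsRecord:
    """Hypothetical record layout mirroring the metadata kept per article."""
    title: str
    body: str
    journalist: str
    date: str
    domain: str


# Assumed mapping: Arabic yeh and kaf are replaced by their Persian forms.
_CHAR_MAP = str.maketrans({"\u064A": "\u06CC", "\u0643": "\u06A9"})

# Split after Latin or Persian sentence-final punctuation followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?\u061F])\s+")

# A token is a run of word characters (zero-width non-joiner allowed inside),
# or a single punctuation mark.
_TOKEN = re.compile(r"[\w\u200c]+|[^\w\s]")


def normalize(text: str) -> str:
    """Apply NFC, map Arabic letter variants to Persian, collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = text.translate(_CHAR_MAP)
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text: str) -> list[str]:
    """Split normalized text into sentences at sentence-final punctuation."""
    return [s for s in _SENTENCE_END.split(text) if s]


def tokenize(sentence: str) -> list[str]:
    """Split a sentence into word and punctuation tokens."""
    return _TOKEN.findall(sentence)


def preprocess(record: NewsRecord) -> list[list[str]]:
    """Return the article body as a list of tokenized sentences."""
    return [tokenize(s) for s in split_sentences(normalize(record.body))]
```

Calling `preprocess` on a `NewsRecord` returns one token list per sentence of the body, which is a convenient input shape for a summarizer or an annotation tool. A production pipeline would likely rely on a dedicated Persian NLP toolkit rather than these simplified rules.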
The correct paper to cite for the Bistoon summarization corpus is:
Farzi, Saeed, and Sahar Kianian. "Katibeh: A Persian news summarizer using the novel semi-supervised approach." Digital Scholarship in the Humanities (2018).

Download Page