ABSTRACTIVE SUMMARIZATION OF HISTORICAL DOCUMENTS: A NEW DATASET AND NOVEL METHOD USING A DOMAIN-SPECIFIC PRETRAINED MODEL

Automatic Text Summarization (ATS) systems aim to generate concise summaries of documents while preserving their essential aspects, using either extractive or abstractive approaches. Transformer-based ATS methods have achieved success in various domains; however, there is a lack of research in the historical domain. In this paper, we introduce HistBERTSum-Abs, a novel method for abstractive historical single-document summarization.

A major challenge in this task is the lack of annotated datasets for historical text summarization. To address this issue, we create a new dataset using archived documents obtained from the Centre Virtuel de la Connaissance sur l’Europe group at the University of Luxembourg. Furthermore, we leverage the potential of HistBERT, a domain-specific bidirectional language model trained on the balanced Corpus of Historical American English (https://www.english-corpora.org/coha/), to capture the semantics of the input documents. Specifically, our method adopts an encoder-decoder architecture, combining the pre-trained HistBERT encoder with a randomly initialized Transformer decoder.

To address the mismatch between the pre-trained encoder and the non-pre-trained decoder, we employ a novel fine-tuning schedule that uses different optimizers for each component. Experimental results on our constructed dataset demonstrate that our HistBERTSum-Abs method outperforms recent state-of-the-art deep learning-based methods and achieves results comparable to state-of-the-art LLMs in zero-shot settings in terms of ROUGE-1, ROUGE-2, and ROUGE-L F1 scores. To the best of our knowledge, this is the first work on abstractive historical text summarization.
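For readers unfamiliar with the reported metrics, the following is a minimal sketch of how ROUGE-1, ROUGE-2, and ROUGE-L F1 scores can be computed for a single candidate summary against a single reference. It is an illustration only and not the evaluation code behind the reported results: it assumes plain whitespace tokenization with no stemming or lowercasing, and the function names (`rouge_n_f1`, `rouge_l_f1`) and the example sentences are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(matches, candidate_total, reference_total):
    """Harmonic mean of precision and recall; 0.0 when nothing matches."""
    if matches == 0 or candidate_total == 0 or reference_total == 0:
        return 0.0
    precision = matches / candidate_total
    recall = matches / reference_total
    return 2 * precision * recall / (precision + recall)


def rouge_n_f1(candidate, reference, n):
    """ROUGE-N F1: overlap of n-grams, with counts clipped to the reference."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    overlap = sum((cand & ref).values())  # Counter & keeps the minimum count
    return f1(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1: based on the longest common subsequence of tokens."""
    cand, ref = candidate.split(), reference.split()
    return f1(lcs_length(cand, ref), len(cand), len(ref))


if __name__ == "__main__":
    # Made-up sentences, used only to exercise the functions.
    candidate = "the treaty was signed in paris"
    reference = "the treaty was signed in rome in 1957"
    print("ROUGE-1 F1:", rouge_n_f1(candidate, reference, 1))
    print("ROUGE-2 F1:", rouge_n_f1(candidate, reference, 2))
    print("ROUGE-L F1:", rouge_l_f1(candidate, reference))
```

ROUGE-1 and ROUGE-2 reward shared words and word pairs, while ROUGE-L rewards words that appear in the same order even when they are not adjacent. Published evaluations typically rely on an established ROUGE package, which may apply stemming and other preprocessing that this sketch omits.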
