TY - JOUR
T1 - Improving news headline text generation quality through frequent POS-tag patterns analysis
AU - Fatima, Noureen
AU - Daudpota, Sher Muhammad
AU - Kastrati, Zenun
AU - Imran, Ali Shariq
AU - Hassan, Saif
AU - Elmitwally, Nouh Sabri
Funding Information:
This work was supported in part by the Department of Computer Science (IDI), Faculty of Information Technology and Electrical Engineering, Norwegian University of Science and Technology (NTNU), Gjøvik, Norway; and in part by the Curricula Development and Capacity Building in Applied Computer Science for Pakistani Higher Education Institutions (CONNECT) Project NORPART-2021/10502, funded by DIKU, Norway. We are also thankful to Sukkur IBA University students and faculty members who participated in the Turing Test for the subjective evaluation of generated text before and after applying POS tags.
Publisher Copyright:
© 2023 The Author(s)
PY - 2023/07/17
Y1 - 2023/07/17
N2 - Original synthetic content writing is one of the human abilities that algorithms aspire to emulate. Sophisticated algorithms, especially those based on neural networks, have shown promising results in recent times. A watershed moment came with the introduction of the attention mechanism, which paved the way for transformers, an exciting new architecture in natural language processing. Recent successes in synthetic text generation, such as GPT and BERT, rely on these transformers. Although GPT- and BERT-based models can generate creative text when properly trained on abundant data, the quality of the generated text suffers when only limited data is available. This is a particular issue for low-resource languages, where labeled data remains scarce. In such cases, the generated text often lacks proper sentence structure and is therefore unreadable. This study proposes a post-processing step for text generation that improves the quality of text generated by a GPT model. The proposed step is based on an analysis of POS-tag patterns in the original text: only those GPT-generated sentences that satisfy POS patterns learned from the data are accepted. We use the GPT model to generate English headlines from the Australian Broadcasting Corporation (ABC) news dataset. Furthermore, to assess the applicability of the model to low-resource languages, we also train it on an Urdu news dataset for Urdu headline generation. Experiments on these datasets from high- and low-resource languages show that the quality of generated headlines improves significantly with the proposed headline POS pattern extraction. We evaluate performance through subjective evaluation as well as text generation quality metrics such as BLEU and ROUGE.
AB - Original synthetic content writing is one of the human abilities that algorithms aspire to emulate. Sophisticated algorithms, especially those based on neural networks, have shown promising results in recent times. A watershed moment came with the introduction of the attention mechanism, which paved the way for transformers, an exciting new architecture in natural language processing. Recent successes in synthetic text generation, such as GPT and BERT, rely on these transformers. Although GPT- and BERT-based models can generate creative text when properly trained on abundant data, the quality of the generated text suffers when only limited data is available. This is a particular issue for low-resource languages, where labeled data remains scarce. In such cases, the generated text often lacks proper sentence structure and is therefore unreadable. This study proposes a post-processing step for text generation that improves the quality of text generated by a GPT model. The proposed step is based on an analysis of POS-tag patterns in the original text: only those GPT-generated sentences that satisfy POS patterns learned from the data are accepted. We use the GPT model to generate English headlines from the Australian Broadcasting Corporation (ABC) news dataset. Furthermore, to assess the applicability of the model to low-resource languages, we also train it on an Urdu news dataset for Urdu headline generation. Experiments on these datasets from high- and low-resource languages show that the quality of generated headlines improves significantly with the proposed headline POS pattern extraction. We evaluate performance through subjective evaluation as well as text generation quality metrics such as BLEU and ROUGE.
KW - POS tagging
KW - text generation
KW - low-resource language
KW - generative pre-trained transformer
KW - attention mechanism
UR - https://www.open-access.bcu.ac.uk/14542/
U2 - 10.1016/j.engappai.2023.106718
DO - 10.1016/j.engappai.2023.106718
M3 - Article
SN - 0952-1976
VL - 125
JO - Engineering Applications of Artificial Intelligence
JF - Engineering Applications of Artificial Intelligence
M1 - 106718
ER -