TY - JOUR
T1 - Deep learning to refine the identification of high-quality clinical research articles from the biomedical literature
T2 - Performance evaluation
AU - Lokker, Cynthia
AU - Bagheri, Elham
AU - Abdelkader, Wael
AU - Parrish, Rick
AU - Afzal, Muhammad
AU - Navarro, Tamara
AU - Cotoi, Chris
AU - Germini, Federico
AU - Linkins, Lori
AU - Haynes, R. Brian
AU - Chu, Lingyang
AU - Iorio, Alfonso
N1 - Publisher Copyright:
© 2023 The Author(s)
PY - 2023/5/8
Y1 - 2023/5/8
N2 - Background: Identifying practice-ready, evidence-based journal articles in medicine is a challenge due to the sheer volume of biomedical research publications. Newer approaches to support evidence discovery apply deep learning techniques to improve the efficiency and accuracy of classifying sound evidence. Objective: To determine how well deep learning models using variants of Bidirectional Encoder Representations from Transformers (BERT) identify high-quality evidence with high clinical relevance from the biomedical literature for consideration in clinical practice. Methods: We fine-tuned variations of BERT models (BERTBASE, BioBERT, BlueBERT, and PubMedBERT) and compared their performance in classifying articles based on methodological quality criteria. The dataset used for fine-tuning included titles and abstracts of >160,000 PubMed records from 2012-2020 that were relevant to human health and had been manually labeled against established critical appraisal criteria for methodological rigor. The data were randomly divided into 80:10:10 sets for training, validation, and testing. In addition to the full unbalanced set, the training data were randomly undersampled into four balanced datasets to assess performance and select the best performing model. From each of the four sets, one model that maintained sensitivity (recall) at ≥99% was selected, and the selected models were ensembled. The best performing model was evaluated in a prospective, blinded test and applied to an established reference standard, the Clinical Hedges dataset. Results: In training, three of the four selected best performing models were trained using BioBERTBASE. The ensembled model did not boost performance compared with the best individual model, so a solo BioBERT-based model (named DL-PLUS) was selected for further testing, as it was computationally more efficient. The model had high recall (>99%) and 60% to 77% specificity in a prospective evaluation conducted with blinded research associates, saving >60% of the work required to identify high-quality articles. Conclusions: Deep learning using pretrained language models and a large dataset of classified articles produced models with improved specificity while maintaining >99% recall. The resulting DL-PLUS model identifies high-quality, clinically relevant articles from PubMed at the time of publication. The model improves the efficiency of a literature surveillance program, allowing faster dissemination of appraised research.
AB - Background: Identifying practice-ready, evidence-based journal articles in medicine is a challenge due to the sheer volume of biomedical research publications. Newer approaches to support evidence discovery apply deep learning techniques to improve the efficiency and accuracy of classifying sound evidence. Objective: To determine how well deep learning models using variants of Bidirectional Encoder Representations from Transformers (BERT) identify high-quality evidence with high clinical relevance from the biomedical literature for consideration in clinical practice. Methods: We fine-tuned variations of BERT models (BERTBASE, BioBERT, BlueBERT, and PubMedBERT) and compared their performance in classifying articles based on methodological quality criteria. The dataset used for fine-tuning included titles and abstracts of >160,000 PubMed records from 2012-2020 that were relevant to human health and had been manually labeled against established critical appraisal criteria for methodological rigor. The data were randomly divided into 80:10:10 sets for training, validation, and testing. In addition to the full unbalanced set, the training data were randomly undersampled into four balanced datasets to assess performance and select the best performing model. From each of the four sets, one model that maintained sensitivity (recall) at ≥99% was selected, and the selected models were ensembled. The best performing model was evaluated in a prospective, blinded test and applied to an established reference standard, the Clinical Hedges dataset. Results: In training, three of the four selected best performing models were trained using BioBERTBASE. The ensembled model did not boost performance compared with the best individual model, so a solo BioBERT-based model (named DL-PLUS) was selected for further testing, as it was computationally more efficient. The model had high recall (>99%) and 60% to 77% specificity in a prospective evaluation conducted with blinded research associates, saving >60% of the work required to identify high-quality articles. Conclusions: Deep learning using pretrained language models and a large dataset of classified articles produced models with improved specificity while maintaining >99% recall. The resulting DL-PLUS model identifies high-quality, clinically relevant articles from PubMed at the time of publication. The model improves the efficiency of a literature surveillance program, allowing faster dissemination of appraised research.
KW - bioinformatics
KW - machine learning
KW - evidence-based medicine
KW - literature retrieval
KW - medical informatics
KW - natural language processing
UR - http://www.scopus.com/inward/record.url?scp=85160008686&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85160008686&partnerID=8YFLogxK
UR - https://www.open-access.bcu.ac.uk/14398
U2 - 10.1016/j.jbi.2023.104384
DO - 10.1016/j.jbi.2023.104384
M3 - Article
C2 - 37164244
SN - 1532-0464
VL - 142
JO - Journal of Biomedical Informatics
JF - Journal of Biomedical Informatics
M1 - 104384
ER -