Objective: Laryngoscopy imaging is essential for patients presenting with vocal cord lesions. However, factors such as reduced patient compliance due to coughing and fogging of the endoscope surface often result in low-quality endoscopic images in which lesions cannot be identified. These low-quality images do not aid subsequent review of the patient's history and unnecessarily consume data storage. In addition, the time required to identify high-quality endoscopic images for inclusion in medical records delays medical consultations. This study aimed to develop a model that automatically classifies high-quality laryngeal endoscopic images using a recent AI technique, the Swin Transformer.

Methods: This study was conducted based on laryngoscopy images captured
from patients at the Department of Otolaryngology, Korea University Ansan Hospital, between 2010 and 2023 who had confirmed diagnosis codes for vocal cord diseases. The dataset includes not only laryngeal images but also images of various lesions captured during examination. Each image was reviewed by an otolaryngology specialist and classified into three quality levels from the perspective of diagnosing vocal cord lesions. Pre-trained models, including ResNet50, Vision Transformer, and Swin Transformer (versions 1 and 2), were applied to this dataset via transfer learning to categorize the images into the three quality levels. Additionally, a binary classification model was developed to isolate only high-quality images. The final models were evaluated with Grad-CAM, an explainable AI (XAI) technique, to verify whether the models made appropriate
judgments.

Results: A total of 4,711 laryngeal endoscopic images were collected, of which 3,099 (65.8%) were low quality, 698 (14.8%) medium quality, and 914 (19.4%) high quality. Among the applied models, the Swin Transformer (version 1)-based model with data augmentation exhibited the best performance, achieving an accuracy of 0.8602, precision of 0.8582, recall of 0.8602, F1 score of 0.8587, AUROC of 0.9558, and AUPRC of 0.9023. For binary classification of high-quality images, the best results were obtained when images were first categorized into three quality levels and the classified low- and medium-quality images were then merged into a single category. This approach achieved an accuracy of 0.9513, precision of 0.8485, recall of 0.9130, F1 score of 0.8796, AUROC of 0.9791, and AUPRC of 0.9270. Applying Grad-CAM confirmed that the model determines quality by focusing on vocal cord lesions and performs the classification task in the same manner as clinical physicians, demonstrating that the model operates with medical validity. In the Swin Transformer (version 2)-based model, performance was comparable, but the Grad-CAM results revealed a mechanism of operation that was not medically valid.

Conclusion: In this study, a high-performance model was developed to classify
high-quality laryngeal endoscopic images in a medically explainable manner. Future integration of this model into PACS and EMR systems to automate the insertion of high-quality laryngoscopy images into medical records could reduce consultation delays and enable more efficient data management.
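The two-stage binary evaluation described in the Results (classify each image into three quality levels, then merge the low- and medium-quality classes into a single "not high quality" category) can be sketched in plain Python. The class encoding (0 = low, 1 = medium, 2 = high) and the sample label lists below are illustrative assumptions, not the study's actual data or code:

```python
def merge_to_binary(labels, high=2):
    # Collapse three quality levels into a binary label:
    # 1 = high quality, 0 = not high quality (low and medium merged).
    return [1 if y == high else 0 for y in labels]

def binary_metrics(y_true, y_pred):
    # Confusion-matrix counts for the positive (high-quality) class.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

# Hypothetical three-class labels and model predictions.
true3 = [2, 2, 2, 1, 0, 0, 2, 1]
pred3 = [2, 2, 1, 1, 0, 2, 2, 0]
acc, prec, rec, f1 = binary_metrics(merge_to_binary(true3),
                                    merge_to_binary(pred3))
```

Evaluating in the merged binary space, rather than retraining a separate two-class model, reuses the three-level classifier's outputs directly, which matches the approach that performed best in the study.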