
FPT216

Free Paper (Tumor)

Development and validation of a deep learning-based algorithm for differentiating malignant vertebral metastasis from benign vertebral fracture on plain spine radiographs

Po-Hsin Chou, Shih-Tien Wang, Yu-Cheng Yao. Division of Spine Surgery, Department of Orthopedics and Traumatology, Taipei, Taiwan; School of Medicine, National Yang Ming Chiao Tung University, Taipei, Taiwan

Study Design: A retrospective, single-center study with external validation, using plain spine radiographs (PSR) as model input and the corresponding MRI/CT as the reference standard.

Objective: To develop and validate deep learning (DL) models for differentiating malignant vertebral metastases (MVM) from benign vertebral fractures (BVF) on PSR, and to assess their diagnostic performance and generalizability.

Summary of Background Data: MVM is a common cancer complication, with spinal involvement in up to 20% of patients. Early detection is crucial for treatment, yet PSR, the most accessible imaging modality, often fails to reveal early-stage metastases. DL has demonstrated expert-level performance in orthopedic imaging but has not been applied specifically to MVM detection on PSR.

Methods: We included 426 patients with MRI-confirmed MVM and 1,088 patients with CT/MRI-confirmed BVF treated between 2016 and 2022. PSR images were annotated by an orthopedic spine surgeon. Multiple single-model configurations (AP, Lateral, Merge, Concat, ChConcat) and ensemble models were developed using EfficientNet-B0, with contrast-limited adaptive histogram equalization (CLAHE) and down-sampling as preprocessing. Performance metrics included accuracy, sensitivity, specificity, precision, F1 score, and area under the receiver operating characteristic curve (AUC). External validation was performed with an independent dataset from another institution.
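The metrics listed above all follow from a 2x2 confusion matrix and from the ranking of the model's scores. The sketch below is a minimal illustration only, not the authors' evaluation code: it assumes that MVM is the positive class (label 1) and BVF the negative class (label 0), and the function names `binary_metrics` and `roc_auc` and the toy data are hypothetical.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Confusion-matrix metrics. Assumption: 1 = MVM (positive), 0 = BVF (negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num: float, den: float) -> float:
        # Return 0.0 instead of raising when a denominator is empty.
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)  # recall on MVM cases
    specificity = ratio(tn, tn + fp)  # recall on BVF cases
    precision = ratio(tp, tp + fp)
    f1 = ratio(2 * precision * sensitivity, precision + sensitivity)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win. This pairwise form is O(n_pos * n_neg), which is
    fine for a sketch but slow for large datasets.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): true labels and model scores for eight cases.
toy_labels = [1, 1, 1, 0, 0, 0, 0, 1]
toy_scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]
toy_preds = [1 if s >= 0.5 else 0 for s in toy_scores]  # 0.5 cut-off is an assumption

print(binary_metrics(toy_labels, toy_preds))
print(roc_auc(toy_labels, toy_scores))
```

Accuracy, sensitivity, specificity, precision, and F1 depend on the chosen decision threshold, whereas AUC depends only on how the scores rank the cases.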

Results: The AP Views ensemble model with down-sampling achieved the best overall performance (F1 score 0.7786, AUC 0.8559). The Merge (Lateral) model demonstrated stable sensitivity for “sign” cases (0.833) and “no-sign” cases (0.8037), suggesting utility in early-stage detection. External validation showed fair performance, with an accuracy of 0.67, a sensitivity of 0.76, and an F1 score of 0.76.
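The abstract does not state how the ensemble models combine their member networks. One common option is soft voting, in which the per-case MVM probabilities from each member are averaged and the mean is then thresholded. The sketch below illustrates that option only; the function name `soft_vote`, the 0.5 threshold, and the example probabilities are all assumptions, not details from the study.

```python
from typing import List, Sequence, Tuple


def soft_vote(
    member_probs: Sequence[Sequence[float]], threshold: float = 0.5
) -> Tuple[List[float], List[int]]:
    """Average per-case MVM probabilities across models, then threshold.

    member_probs[i][j] is the probability that model i assigns to case j.
    Returns (mean probabilities, predicted labels), with 1 = MVM and 0 = BVF.
    """
    if not member_probs:
        raise ValueError("at least one member model is required")
    n_cases = len(member_probs[0])
    if any(len(p) != n_cases for p in member_probs):
        raise ValueError("all members must score the same number of cases")

    # zip(*member_probs) groups the scores of all models for each case.
    mean_probs = [sum(case) / len(member_probs) for case in zip(*member_probs)]
    labels = [1 if p >= threshold else 0 for p in mean_probs]
    return mean_probs, labels


# Hypothetical outputs of three member models on three cases.
example_probs = [
    [0.9, 0.2, 0.6],
    [0.7, 0.4, 0.3],
    [0.8, 0.3, 0.9],
]
mean_probs, labels = soft_vote(example_probs)
print(mean_probs, labels)
```

Averaging probabilities rather than hard labels lets a confident member outweigh two uncertain ones, as in the third example case.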

Conclusion: DL models showed fair diagnostic performance in differentiating MVM from BVF on PSR, with the potential to assist clinicians in both specialized and resource-limited settings. Expanding datasets and conducting multi-center validations are essential to enhance
