Evaluation of the Wesper Lab Artificial Intelligence Automated Sleep Apnea Scoring Algorithm

Chelsie Rohrscheib, Ph.D., Antonio Artur Moura, M.S., Janna Raphelson, M.D., Jeremy E. Orr, M.D., Ruchir P. Patel, M.D., Atul Malhotra, M.D.

Study Objectives: To evaluate the accuracy, reliability, and generalizability of the Wesper Lab home sleep apnea test (HSAT) artificial intelligence (AI) automated scoring algorithm for detecting sleep-disordered breathing events. The study aimed to determine how well the algorithm’s outputs, specifically apnea-hypopnea index (AHI) and central apnea index (CAI), agree with gold-standard polysomnography (PSG) and expert human scoring across both controlled laboratory settings and real-world clinical environments.

Methods: A multi-tiered validation was conducted using two datasets and three analyses. The primary analysis compared AHI and CAI from Wesper HSATs with simultaneous PSG in 45 participants. The secondary analysis assessed inter-scorer consistency by comparing blinded scoring of raw Wesper signals among three independent scorers. The tertiary analysis evaluated clinical HSATs (n = 139) by comparing algorithm-derived AHI with expert rescoring across 11 independent clinics. Agreement metrics included correlation coefficients, Bland–Altman analysis, sensitivity and specificity at diagnostic thresholds (AHI ≥ 5 and ≥ 15 events/hour), and confusion matrices.
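The agreement metrics named above can be illustrated with a minimal sketch. This is not the study's analysis code: the function names and the paired AHI values are hypothetical, and the Bland–Altman limits assume the conventional bias ± 1.96 standard deviations of the paired differences.

```python
import math
from statistics import mean, stdev


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def bland_altman(test, reference):
    """Return (bias, lower limit, upper limit) of agreement.

    Assumes the conventional limits: bias +/- 1.96 * SD of the differences.
    """
    diffs = [t - r for t, r in zip(test, reference)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


def sensitivity_specificity(test, reference, threshold):
    """Treat the reference as truth; a case is positive when AHI >= threshold."""
    tp = fn = tn = fp = 0
    for t, r in zip(test, reference):
        if r >= threshold:
            if t >= threshold:
                tp += 1
            else:
                fn += 1
        else:
            if t >= threshold:
                fp += 1
            else:
                tn += 1
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


if __name__ == "__main__":
    # Hypothetical paired AHI values (events/hour), for illustration only.
    psg_ahi = [2.0, 4.0, 8.0, 12.0, 20.0, 35.0]
    hsat_ahi = [3.0, 6.0, 7.0, 16.0, 18.0, 40.0]

    print("r =", round(pearson_r(hsat_ahi, psg_ahi), 3))
    print("bias, lower, upper =", bland_altman(hsat_ahi, psg_ahi))
    for cutoff in (5, 15):
        print("AHI >=", cutoff, sensitivity_specificity(hsat_ahi, psg_ahi, cutoff))
```

In this sketch the reference method (PSG or expert scoring) defines the true class at each threshold, so sensitivity and specificity describe how often the automated score lands on the same side of the cut-off.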

Results: The Wesper Lab algorithm demonstrated strong agreement with PSG and expert scoring. In the primary analysis, AHI correlation was r = 0.90 (Figure 1) and CAI correlation was r = 0.82 (Figure 2), with sensitivity/specificity of 0.90/0.60 (AHI ≥ 5) and 0.67/0.86 (AHI ≥ 15). In the secondary analysis, correlation was r = 0.95 and remained ≥ 0.93 across scorers. The tertiary real-world analysis achieved r = 0.98, with sensitivity/specificity of 0.83/0.93 (AHI ≥ 5) and 0.93/1.00 (AHI ≥ 15).

Figure 1. Agreement between Wesper automated AHI and human-scored PSG AHI. (A) Pearson’s Correlation. (B) Bland–Altman analysis. (C) Confusion matrix illustrating agreement between Wesper and PSG apnea severity classification (none, mild, moderate, severe).

Figure 2. Agreement between Wesper automated CAI and human-scored PSG CAI. (A) Pearson’s Correlation. (B) Bland–Altman analysis.
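A severity confusion matrix like the one in Figure 1C can be built as in the sketch below. The function names and data are hypothetical. The 5 and 15 events/hour cut-offs come from the Methods; the 30 events/hour boundary between moderate and severe is the conventional adult cut-off and is an assumption here, since the abstract does not state it.

```python
SEVERITY_LABELS = ["none", "mild", "moderate", "severe"]


def classify_severity(ahi):
    """Map an AHI (events/hour) to a severity class.

    Cut-offs of 5 and 15 follow the abstract; 30 is an assumed
    conventional boundary between moderate and severe.
    """
    if ahi < 5:
        return "none"
    if ahi < 15:
        return "mild"
    if ahi < 30:
        return "moderate"
    return "severe"


def severity_confusion_matrix(test_ahi, reference_ahi):
    """Rows index the reference class, columns index the test class."""
    index = {label: i for i, label in enumerate(SEVERITY_LABELS)}
    size = len(SEVERITY_LABELS)
    matrix = [[0] * size for _ in range(size)]
    for t, r in zip(test_ahi, reference_ahi):
        row = index[classify_severity(r)]
        col = index[classify_severity(t)]
        matrix[row][col] += 1
    return matrix


if __name__ == "__main__":
    # Hypothetical paired AHI values (events/hour), for illustration only.
    psg_ahi = [2.0, 4.0, 8.0, 12.0, 20.0, 35.0]
    hsat_ahi = [3.0, 6.0, 7.0, 16.0, 18.0, 40.0]
    for label, row in zip(SEVERITY_LABELS, severity_confusion_matrix(hsat_ahi, psg_ahi)):
        print(f"{label:>8}: {row}")
```

Counts on the diagonal are cases where both methods assign the same severity class. Off-diagonal counts show the direction of disagreement: entries to the right of the diagonal mean the test scored a higher severity than the reference.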

Conclusions: The Wesper Lab AI scoring algorithm shows robust agreement with gold-standard polysomnography and expert human scoring across both laboratory and real-world datasets. These findings support its reliability, reproducibility, and suitability for clinical use in diagnosing sleep apnea.
