The Problem With 'Good Enough' Sleep Testing — And the Stakes for Millions of Americans

Dr. Chelsie Rohrscheib, Ph.D.

 

Millions of Americans are being tested for sleep apnea at home — but the device doing the testing may be guessing. Most home sleep tests on the market don't actually measure breathing directly; instead, they track secondary signals like heart rate and blood oxygen changes, then use algorithms to infer whether a patient stopped breathing.

 

New peer-reviewed research reveals this indirect approach produces error ranges nearly three times wider than technologies that measure breathing at the source — meaning a patient could be told they're fine when they have a serious condition, or steered toward the wrong treatment entirely. With sleep apnea linked to heart disease, diabetes, and cognitive decline, and affecting an estimated 936 million people worldwide, the difference between an accurate diagnosis and an educated guess has never mattered more.

 

Home sleep apnea tests (HSATs) may appear operationally similar, but the physiologic signals used to derive the apnea–hypopnea index (AHI) fundamentally determine diagnostic reliability. For clinicians, this distinction becomes critical in borderline disease, in central apnea evaluation, and in patients with common cardiometabolic or autonomic comorbidities, where small differences in AHI can materially alter diagnosis and management.

At a high level, contemporary HSAT technologies fall into two broad categories:

  • Direct respiratory sensing systems, such as Wesper Lab, which detect disordered breathing using primary respiratory physiology, including airflow and/or respiratory effort signals, combined with oximetry [1, 2]. Signals are then evaluated against American Academy of Sleep Medicine (AASM) guidelines to determine the severity and type of disorder.

  • Biosignal-inference systems, including photoplethysmography-based (PPG-HSAT) platforms, which estimate respiratory events indirectly using cardiovascular and autonomic proxies such as the PPG waveform, ECG-derived features, heart-rate variability, and oxygen desaturation [3].

While both approaches generate an AHI value, they differ meaningfully in how respiratory events are detected, classified, and interpreted, particularly when diagnostic precision rather than screening alone is required.


Why Direct Respiratory Sensing Is Fundamentally Different

Measuring Cause vs. Consequence

Respiratory events are defined physiologically by reductions in airflow and/or ventilation, often accompanied by characteristic respiratory effort patterns, followed by secondary consequences such as oxygen desaturation and autonomic activation. Systems based on direct respiratory sensing detect events at their physiologic source, enabling:

  • Direct identification of apneas and hypopneas

  • Differentiation between obstructive and central mechanisms

  • More stable performance across varying autonomic states

By contrast, PPG-HSAT and other biosignal-inference approaches primarily detect downstream consequences of respiratory events, including:

  • Heart-rate variability

  • ECG-derived features

  • Peripheral or autonomic responses

  • Oxygen desaturation

These signals are informative, but they represent proxies rather than the respiratory event itself, introducing inherent limitations related to specificity and physiologic dependence.

General Limitations of PPG-Based HSAT Technologies

PPG-based HSAT systems rely on optical detection of blood volume changes to infer respiratory disturbances. While PPG can provide useful physiologic information, its performance is fundamentally constrained by signal quality, sensor location, and susceptibility to motion and perfusion artifacts. These constraints are amplified in real-world, unsupervised home testing environments and directly influence AHI accuracy.

Finger-Based vs Chest-Based PPG: Not Equivalent Signals

PPG signal fidelity is highly dependent on sensor placement. Finger-based PPG sensors benefit from relatively high and stable perfusion, which is why they are widely used for pulse oximetry. In contrast, chest-mounted PPG sensors operate in a substantially lower-perfusion environment and are more susceptible to motion artifact, posture-related signal distortion, and variable tissue contact during sleep.

As a result, chest-based PPG signals typically demonstrate:

  • Lower signal-to-noise ratio

  • Increased susceptibility to motion and positional artifact

  • Greater signal dropout during sleep-related movement

  • Reduced stability in real-world home use

These signal-level limitations constrain the reliability of downstream features derived from PPG, including oxygen desaturation timing, pulse-rate variability, and cardiopulmonary coupling metrics.

Impact on Respiratory Event Inference

Because PPG-HSAT systems do not measure airflow or respiratory effort directly, respiratory events are inferred from secondary physiologic responses. When the underlying PPG signal is degraded, as is more common with chest-based PPG, this inference becomes less reliable.

In practical terms, degraded PPG signal quality leads to:

  • Missed hypopneas without prominent desaturation

  • False event detection driven by motion or non-respiratory autonomic arousals

  • Imprecise event timing relative to true airflow cessation

  • Increased dispersion of individual AHI estimates

These limitations help explain why PPG-HSAT platforms may demonstrate acceptable performance under controlled validation conditions yet show wider limits of agreement and greater diagnostic variability in real-world clinical use.

Central Sleep Apnea: A Structural Limitation of Biosignal Inference

Central sleep apnea (CSA) requires documentation of absent respiratory effort during airflow cessation. Without direct airflow and respiratory effort measurements, reliable differentiation between central and obstructive events is inherently constrained.

Biosignal-based systems may detect that a physiologic disturbance occurred (e.g., desaturation or autonomic change), but they lack the mechanistic context necessary to determine whether airflow cessation occurred in the presence or absence of respiratory effort. This limitation is structural rather than algorithmic.

Clinically, this creates risk in patients with:

  • Heart failure

  • Opioid exposure

  • Neurologic disease

  • Mixed or complex sleep apnea

In these populations, accurate event typing is essential for selecting appropriate therapy (e.g., CPAP optimization versus adaptive servo-ventilation or further cardiopulmonary evaluation). Systems relying primarily on inferred biosignals cannot consistently provide this distinction. Direct respiratory sensing allows central and obstructive patterns to be evaluated using physiologic criteria rather than secondary autonomic markers.
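The event-typing logic described in this section can be sketched in a few lines. This is a simplified illustration loosely modeled on the general AASM distinction between obstructive, central, and mixed apneas, not Wesper's or any vendor's scoring algorithm. The function name `classify_apnea` and the per-sample boolean representation of respiratory effort are assumptions made for the example.

```python
from typing import Optional, Sequence


def classify_apnea(effort_present: Optional[Sequence[bool]]) -> str:
    """Type one apnea from respiratory effort sampled during airflow cessation.

    effort_present is a hypothetical input: one boolean per sample within the
    event, True where inspiratory effort is detected. None means the device
    has no effort channel at all.
    """
    if effort_present is None:
        # Without an effort signal the mechanism cannot be determined,
        # however good the downstream inference model is.
        return "unclassifiable"
    if len(effort_present) == 0:
        raise ValueError("event contains no samples")
    if all(effort_present):
        return "obstructive"   # effort continues throughout the event
    if not any(effort_present):
        return "central"       # effort absent throughout the event
    if not effort_present[0] and effort_present[-1]:
        return "mixed"         # effort absent at first, then resumes
    return "indeterminate"     # pattern does not fit the simplified rules


print(classify_apnea([True, True, True, True]))
print(classify_apnea([False, False, False, False]))
print(classify_apnea([False, False, True, True]))
print(classify_apnea(None))
```

The last call is the point of the sketch: a system that records only desaturation or autonomic responses has nothing to pass in as `effort_present`, so the central-versus-obstructive branch can never be evaluated directly.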

Impact of Common Comorbidities on Biosignal-Based Systems

Because PPG-HSAT platforms rely heavily on cardiovascular and autonomic features, performance may degrade in conditions that alter these systems, including:

Arrhythmias (particularly atrial fibrillation).
Irregular RR intervals distort ECG-derived features and disrupt temporal coupling between respiratory events and autonomic responses, increasing both false positives and false negatives. Large multi-center studies have found that ~34% of patients with AFib have comorbid OSA [4].

Peripheral vascular disease and impaired perfusion.
Reduced or unstable peripheral circulation compromises PPG signal fidelity, particularly during sleep-related vasoconstriction.

Autonomic dysfunction.
Seen in diabetes, dysautonomia, Parkinsonian syndromes, and post-viral states, autonomic impairment blunts sympathetic responses that inference algorithms depend on to identify respiratory events.

These are not rare edge cases; they represent a substantial portion of real-world sleep clinic populations [5].

Importantly, PPG-HSAT systems do not typically adjust their AHI algorithms to compensate for rhythm irregularity. The same inference model and respiratory proxies are applied regardless of arrhythmia status, despite the physiologic impact of irregular cardiac timing on cardiopulmonary coupling. Direct respiratory sensing is far less vulnerable to these confounders because it does not depend on intact autonomic signaling to detect airflow limitation.

What This Means for Diagnostic Accuracy

Validation studies of HSAT technologies frequently emphasize correlation coefficients (e.g., Pearson r) to demonstrate agreement with polysomnography (PSG). Correlation reflects population-level tracking, that is, whether PSG-derived apnea–hypopnea index (AHI) values are generally associated with device-derived AHI values, but it does not quantify the accuracy of an individual patient’s measurement. Individual-level diagnostic uncertainty is instead captured by Bland–Altman limits of agreement (LoA), the interval expected to contain about 95% of the differences between device-derived and PSG-derived AHI values.
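The difference between the two metrics is easier to see when both are computed on the same paired data. The sketch below uses the conventional Bland–Altman definition (mean difference ± 1.96 standard deviations of the differences) and an entirely hypothetical set of paired AHI values; none of the numbers come from the cited validation studies, and the difference is assumed to be defined as device minus PSG.

```python
import math
import statistics


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def bland_altman(device, reference):
    """Return (bias, lower LoA, upper LoA) for device-minus-reference differences."""
    diffs = [d - r for d, r in zip(device, reference)]
    bias = statistics.fmean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical paired AHI values (events/hour), for illustration only.
psg_ahi = [2.0, 6.0, 11.0, 18.0, 27.0, 41.0, 60.0]
device_ahi = [8.0, 1.0, 19.0, 10.0, 36.0, 33.0, 68.0]

r = pearson_r(device_ahi, psg_ahi)
bias, lower, upper = bland_altman(device_ahi, psg_ahi)
print(f"r = {r:.2f}")
print(f"bias = {bias:.1f}, LoA = {lower:.1f} to {upper:.1f} events/hour")
```

In this made-up dataset the device tracks PSG closely across the population (r above 0.9), yet individual readings miss by 5 to 9 events/hour in either direction, which the limits of agreement expose and the correlation coefficient does not.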

When evaluated through this framework, meaningful differences emerge between Wesper Lab performance, Wesper automated scoring performance, and biosignal-inference–based autoscoring platforms like PPG-HSAT. Published validation of PPG-HSAT autoscoring demonstrates population-level correlation with PSG AHI of approximately r = 0.90; however, Bland–Altman analysis reveals wide limits of agreement, on the order of −20.4 to +23.9 events/hour (Table 1) [3]. This degree of dispersion indicates substantial individual-level variability, such that a patient with a true PSG AHI of 10 could reasonably be reported anywhere from near-normal to moderate OSA, and a patient with moderate disease could be shifted into mild or severe categories depending on the direction of error. Severity classification analyses further illustrate this limitation, particularly under 3% criteria, where misclassification is concentrated in the none-to-mild and mild-to-moderate ranges, precisely where small absolute differences in AHI materially affect diagnosis, payer eligibility, and treatment decisions.

In contrast, Wesper demonstrates substantially tighter agreement with PSG across both device-level and autoscoring validation. In direct Wesper versus PSG comparison using 3% hypopnea criteria, Pearson correlation was 0.95, with Bland–Altman limits of agreement of −8.05 to +6.38 events/hour, indicating minimal dispersion around PSG-derived AHI values [1]. These limits are narrower than those reported for independent PSG interscorer variability [6], supporting the interpretation that Wesper device-level measurements perform comparably to human PSG interpretation. Correlation was also excellent when the Wesper automated scoring algorithm was evaluated against scoring by sleep physicians (r = 0.98) [2]. Bland–Altman analysis of the autoscoring demonstrated modestly wider but still clinically constrained limits of agreement (−8.88 to +8.32 events/hour), which remain roughly 2.6× tighter than those reported for PPG-HSAT autoscoring, substantially reducing the likelihood that individual patients will cross clinically meaningful severity thresholds due to measurement error alone (Table 1).

 

Table 1: Validation results of Wesper and PPG-HSAT

Comparison | N | Correlation (r) | LoA lower (events/hour) | LoA upper (events/hour) | Total AHI spread (events/hour) | Ref.
--- | --- | --- | --- | --- | --- | ---
Wesper vs. PSG | 44 | 0.95 | −8.05 | +6.38 | 14.4 | [1]
Wesper autoscoring vs. physician scoring | 139 | 0.98 | −8.88 | +8.32 | 17.2 | [2]
PPG-HSAT autoscoring vs. PSG | 340 | 0.90 | −20.4 | +23.9 | 44.3 | [3]
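The "Total AHI Spread" column is simply the upper limit of agreement minus the lower one. A short sketch of that arithmetic, using only the limits listed in Table 1, also gives the relative width of the PPG-HSAT interval; the variable names are illustrative.

```python
# Limits of agreement (events/hour) as listed in Table 1.
limits = {
    "Wesper vs. PSG": (-8.05, 6.38),
    "Wesper autoscoring vs. physician scoring": (-8.88, 8.32),
    "PPG-HSAT autoscoring vs. PSG": (-20.4, 23.9),
}

# Spread = upper limit minus lower limit.
spreads = {name: upper - lower for name, (lower, upper) in limits.items()}

ppg_spread = spreads["PPG-HSAT autoscoring vs. PSG"]
for name, width in spreads.items():
    ratio = ppg_spread / width
    print(f"{name}: spread {width:.1f} events/hour, PPG-HSAT is {ratio:.2f}x as wide")
```

On these figures the PPG-HSAT interval is about 3.1 times as wide as the Wesper device-level interval and about 2.6 times as wide as the Wesper autoscoring interval.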

Severity classification analyses reinforce these findings. Under AASM 3% criteria, PPG-HSAT autoscoring demonstrates greater category overlap and instability outside of severe disease, whereas Wesper autoscoring shows higher concordance with expert scoring and misclassifications largely confined to adjacent severity categories. This distinction reflects fundamental physiologic differences in signal acquisition and event detection. 

Biosignal-inference platforms rely on downstream cardiovascular and autonomic responses to infer respiratory events, making them vulnerable to signal degradation, rhythm irregularity, autonomic dysfunction, and perfusion variability. Wesper, by contrast, detects respiratory events at their physiologic source (airflow limitation and respiratory effort), minimizing reliance on secondary proxies and supporting more consistent event detection, classification, and severity stratification.

From a diagnostic standpoint, these differences translate into concrete clinical implications. Wider limits of agreement increase the risk of misclassification near AHI thresholds of 5, 15, and 30 events/hour; reduce reliability for longitudinal monitoring where physiologic night-to-night variability is compounded by measurement variability; and introduce uncertainty in patients with borderline disease, REM- or position-dependent OSA, or cardiovascular and autonomic comorbidities. By providing tighter individual-level agreement with PSG across both device output and automated scoring, Wesper enables more stable severity classification and greater confidence in clinical decision-making.
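To make the threshold effect concrete, the sketch below lists which severity categories a device reading could fall into for a given PSG AHI once the limits of agreement are applied. It rests on several assumptions that are not from the cited studies: differences are taken as device minus PSG, the limits are treated as constant across the AHI range (a simplification), and the patient with a PSG AHI of 14 is hypothetical. The function name `plausible_categories` is likewise illustrative.

```python
# Severity bands built from the AHI thresholds of 5, 15, and 30 events/hour.
SEVERITY_BANDS = [
    ("none", 0.0, 5.0),
    ("mild", 5.0, 15.0),
    ("moderate", 15.0, 30.0),
    ("severe", 30.0, float("inf")),
]


def plausible_categories(psg_ahi, loa_lower, loa_upper):
    """Severity categories overlapping the interval psg_ahi + [loa_lower, loa_upper]."""
    low = max(0.0, psg_ahi + loa_lower)  # a reported AHI cannot be negative
    high = psg_ahi + loa_upper
    # A band [start, end) overlaps [low, high] when low < end and high >= start.
    return [name for name, start, end in SEVERITY_BANDS if low < end and high >= start]


# Hypothetical patient just below the moderate threshold (PSG AHI = 14).
narrow = plausible_categories(14.0, -8.05, 6.38)   # Wesper vs. PSG limits
wide = plausible_categories(14.0, -20.4, 23.9)     # PPG-HSAT autoscoring limits
print("narrow limits:", narrow)
print("wide limits:", wide)
```

With the narrower limits, the hypothetical patient can land only in the two categories adjacent to the 15 events/hour threshold. With the wider limits, every category from none to severe is within reach.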

In simple terms, correlation answers whether a device trends with PSG across a population, whereas limits of agreement determine how far off a result may be for an individual patient. By grounding automated scoring in direct respiratory physiology rather than inferred biosignals, Wesper minimizes individual-level error and delivers more actionable diagnostic accuracy in real-world clinical practice.

In practical terms:

  • PPG-HSAT demonstrates reasonable population-level correlation but higher individual diagnostic uncertainty.

  • Wesper’s direct respiratory sensing produces both higher correlation and markedly narrower LoA, supporting more stable severity classification and greater confidence in treatment decisions.

This distinction becomes particularly important in patients with borderline or mild OSA, REM- or position-dependent disease, and in those with cardiovascular or autonomic comorbidities, where small absolute AHI differences can materially alter diagnosis and care pathways.

In simple terms:

  • Correlation answers: Does this device trend with PSG across a population?

  • Limits of agreement answer: How far off could this be for my patient?

By detecting respiratory events at their physiologic source (airflow and effort), Wesper minimizes reliance on downstream cardiovascular or autonomic proxies, resulting in tighter agreement with PSG and more actionable AHI accuracy for individual patients.

Diagnostic Implications for Clinical Practice

These physiologic design differences are directly reflected in validation performance metrics. Biosignal-inference platforms typically demonstrate lower correlation and wider limits of agreement relative to systems based on direct respiratory sensing. Clinically, this translates to greater individual-level AHI uncertainty and increased likelihood of diagnostic ambiguity.

By measuring respiratory events at their physiologic source, Wesper supports more consistent severity stratification, improved central-versus-obstructive differentiation, and more confident clinical decision-making.

PPG-HSAT may be reasonable for:

  • Initial screening in patients with high pretest probability of moderate–severe OSA

  • Scenarios where concurrent ECG monitoring, when available, is a secondary objective

Wesper is better suited for:

  • Borderline or mild OSA where small AHI errors change diagnosis

  • Suspected central or complex sleep apnea

  • Patients with arrhythmias, vascular disease, or autonomic dysfunction

  • Longitudinal monitoring where consistent respiratory physiology is required

  • Clinical pathways requiring reliable event classification, not just event detection

Conclusion

Although PPG-HSAT demonstrates population-level correlation with PSG for AHI, its reliance on biosignal inference introduces clinically meaningful uncertainty at the individual level, particularly near diagnostic thresholds and in patients with cardiovascular or autonomic comorbidities.

Wesper’s approach, built around direct respiratory sensing, aligns more closely with the physiologic definition of sleep-disordered breathing, enabling more robust event detection, more reliable severity classification, and improved identification of central apnea.

For clinicians, the key question is not whether an HSAT can approximate PSG on average, but whether it can provide actionable accuracy for the specific patient in front of you.

References

  1. Raphelson JR, Ahmed IM, Ancoli-Israel S, Ojile J, Pearson S, Bennett N, Uhles ML, Rohrscheib C, Malhotra A. Evaluation of a novel device to assess obstructive sleep apnea and body position. J Clin Sleep Med. 2023 Sep 1;19(9):1643-1649. doi: 10.5664/jcsm.
  2. Rohrscheib C, Moura AA, Raphelson J, Orr JE, Patel RP, Malhotra A. Evaluation of an automated sleep apnea scoring algorithm via the Wesper Lab home sleep apnea test. Sleep Med. 2026 Feb 9;141:108828. doi: 10.1016/j.sleep.2026.108828. 
  3. Goldstein C, Ghanbari H, Sharma S, Collop N, Namen A, Kirsch DB, Drucker M, Khayat R, Pollock M, Torstrick B, Walsh C, Herreshoff E, Frankel DS, Rosen IM. Polysomnography validation of SANSA to detect obstructive sleep apnea. Front Neurol. 2025 Jun 16;16:1592690. doi: 10.3389/fneur.2025.1592690. 
  4. Zhang D, Ma Y, Xu J, Yi F. Association between obstructive sleep apnea (OSA) and atrial fibrillation (AF): A dose-response meta-analysis. Medicine (Baltimore). 2022 Jul 29;101(30):e29443. doi: 10.1097/MD.0000000000029443.
  5. Steinberg R, Spector AR, McVeigh T, Fudim M. Home Sleep Apnoea Testing: Advances, Challenges and Considerations in Heart Failure. Card Fail Rev. 2025 Nov 20;11:e29. doi: 10.15420/cfr.2025.29. 
  6. Magalang UJ, Chen NH, Cistulli PA, Fedson AC, Gíslason T, Hillman D, Penzel T, Tamisier R, Tufik S, Phillips G, Pack AI; SAGIC Investigators. Agreement in the scoring of respiratory events and sleep among international sleep centers. Sleep. 2013 Apr 1;36(4):591-6. doi: 10.5665/sleep.2552.