This difference reflects a strategic design choice. Robot Screener prioritizes high recall to ensure no crucial studies are missed in the initial screening phase. The adjudication stage then focuses on the flagged articles, maintaining the rigor of the SLR process and ensuring that irrelevant articles are screened out even when recommended for inclusion by either the human or Robot Screener. This prioritization of comprehensiveness is particularly valuable in HEOR, where missing relevant studies could lead to incomplete HTAs with potentially significant implications for interpretation.
External Study: Comparable Recall with Focus on Up-to-Date Evidence
The external study, conducted by Cichewicz et al. and designed to simulate updating existing HEOR-based SLRs, also produced promising results. Recall and precision were reported as mean ± SD on a scale from 0 to 1, with higher values indicating better performance. Robot Screener’s Recall (0.79 ± 0.18) was comparable to that of human reviewers (0.80 ± 0.20). While overall Recall for both humans and Robot Screener was lower than in the internal study, this likely reflects different proportions of final-included studies, and Robot Screener’s consistency in matching or exceeding human Recall in HEOR-focused SLRs suggests that the tool generalizes across disciplines. Furthermore, rather than employing Robot Screener throughout an entire initial review, the Cichewicz et al. study focused on updates, demonstrating Robot Screener’s ability to efficiently identify potentially relevant studies even when refreshing older reviews with new research.
As with the internal study, Robot Screener’s precision (0.46 ± 0.13) was lower than that of human reviewers (0.77 ± 0.19) in Cichewicz et al. However, the study’s conclusions emphasized that the time saved through efficient screening outweighs the additional human effort needed to adjudicate these “false positives.” Moreover, the low false negative rate (2%) for Robot Screener underscores its effectiveness in minimizing the risk of excluding truly relevant studies.
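For readers who want to see how these metrics relate to one another, the following is a minimal sketch of how recall, precision, and specificity fall out of a screener's decisions when compared against the final, adjudicated result. The class name and the counts are hypothetical and chosen only for illustration; they are not data from either study.

```python
from dataclasses import dataclass


@dataclass
class ScreeningCounts:
    """Counts for one screener compared against the final, adjudicated decisions."""
    true_pos: int   # screener included, final decision included
    false_pos: int  # screener included, final decision excluded
    false_neg: int  # screener excluded, final decision included
    true_neg: int   # screener excluded, final decision excluded

    def recall(self) -> float:
        # Share of truly relevant records that the screener caught.
        return self.true_pos / (self.true_pos + self.false_neg)

    def precision(self) -> float:
        # Share of the screener's inclusions that were truly relevant.
        return self.true_pos / (self.true_pos + self.false_pos)

    def specificity(self) -> float:
        # Share of truly irrelevant records that the screener excluded.
        return self.true_neg / (self.true_neg + self.false_pos)


# Hypothetical counts for illustration only (not data from either study).
counts = ScreeningCounts(true_pos=40, false_pos=45, false_neg=10, true_neg=905)
print(f"recall={counts.recall():.2f}")
print(f"precision={counts.precision():.2f}")
print(f"specificity={counts.specificity():.2f}")
```

In this toy example, false negatives make up only 10 of 1,000 records (1%) even though recall is 0.80, because relevant records are a small share of the total. Whether the 2% false negative rate quoted above uses the same denominator is an assumption, not something stated here.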
Large Language Models for Screening and the “Recall-First” Philosophy
Comparing Robot Screener to human screeners is helpful in determining workflows that can improve accuracy and timelines in SLRs. However, Robot Screener, as a machine learning algorithm, depends on project-specific training on users’ decisions. A recent study by Tran et al. (2024) explored a related question: the effectiveness of GPT-3.5 Turbo models for title and abstract screening in five systematic reviews without any project-specific training. In different configurations, GPT-3.5 achieved high sensitivity (recall), ranging from 81.1% to 99.8% depending on the optimization rule, though specificity ranged from as low as 2.2% to as high as 80.4%. This suggests that Large Language Models (LLMs) show promise for screening without the need for training data, but that overall accuracy varies widely depending on whether the model is optimized for maximizing recall or for maximizing ‘correct’ exclusions.
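The trade-off described above can be illustrated with a small sketch. It assumes a screener that assigns each record an inclusion score between 0 and 1 and includes records at or above a cutoff. The scores, labels, and function name are hypothetical, and the sketch is not intended to reproduce the procedure used in Tran et al.

```python
def recall_and_specificity(scores, labels, threshold):
    """Include a record when its score is at or above the threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if y and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if not y and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if not y and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical scores; a True label means the record was ultimately included.
scores = [0.95, 0.80, 0.62, 0.35, 0.70, 0.55, 0.40, 0.30, 0.20, 0.10]
labels = [True, True, True, True, False, False, False, False, False, False]

for threshold in (0.3, 0.5, 0.7):
    recall, specificity = recall_and_specificity(scores, labels, threshold)
    print(f"threshold={threshold:.1f} recall={recall:.2f} specificity={specificity:.2f}")
```

Lowering the cutoff raises recall at the cost of specificity, so the same underlying model can look very different depending on the rule used to set its threshold.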
We look forward to the further development of both machine learning and LLM approaches to screening. This publication also drives home that optimizing the thresholds used by models is likely as important as training or tuning them correctly; in Tran et al., the choice of threshold itself had a large impact on recall and specificity. To paraphrase Cichewicz et al., a more conservative approach (prioritizing and optimizing recall, so that no includable study is missed) mitigates the risk of non-comprehensiveness and, when used in a dual-screening approach, does not compromise the final review screening accuracy despite presenting lower precision (and specificity). Thus, approaches that include human adjudication and maximize recall can provide AI-assisted screening while maintaining review quality.
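To make the dual-screening argument concrete, here is a minimal sketch under simplifying assumptions: two screeners vote independently, records that both exclude are dropped, and every record flagged by either screener goes to an adjudicator who is assumed to decide correctly. The function and the example decisions are hypothetical, and real workflows may handle agreements and adjudication differently.

```python
def dual_screen(human, robot, truth):
    """Return (final_includes, adjudication_queue) as sets of record ids.

    human, robot, truth: dicts mapping record id -> bool (True = include).
    Assumes the adjudicator always matches `truth`, which is a simplification.
    """
    # Anything flagged for inclusion by either screener is sent to adjudication.
    queue = {rid for rid in truth if human[rid] or robot[rid]}
    # The (idealized) adjudicator keeps only the truly relevant records.
    final_includes = {rid for rid in queue if truth[rid]}
    return final_includes, queue


# Hypothetical decisions for six records (illustration only).
truth = {1: True, 2: True, 3: True, 4: False, 5: False, 6: False}
human = {1: True, 2: True, 3: False, 4: False, 5: False, 6: False}
robot = {1: True, 2: False, 3: True, 4: True, 5: True, 6: False}

final_includes, queue = dual_screen(human, robot, truth)
relevant = {rid for rid, keep in truth.items() if keep}
final_recall = len(final_includes & relevant) / len(relevant)
final_precision = len(final_includes & relevant) / len(final_includes)
print(sorted(queue), sorted(final_includes), final_recall, final_precision)
```

In this toy case the robot's own precision is only 0.5 (two of its four inclusions are relevant), yet the adjudicated result contains no false positives. The cost of low precision shows up as a longer adjudication queue rather than as errors in the final set, while each screener catches a relevant record the other missed.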
Confidence in Your HEOR Research: The Power of Robot Screener
These studies highlight the significant benefits Robot Screener offers researchers conducting HEOR-focused SLRs within Nested Knowledge. Here’s a quick recap of the key takeaways:
- High Recall Rates: Robot Screener excels at capturing relevant studies, with a recall rate exceeding 97% (significantly higher than human screeners) in the internal study. This ensures a comprehensive foundation for your research.
- Focus on Up-to-Date Evidence: The external study demonstrates Robot Screener’s effectiveness in keeping your reviews current, crucial for informing HTAs with the latest research.
- Prioritizing Comprehensiveness: While precision may be lower, the studies emphasize the strategic choice to prioritize capturing all potentially relevant studies (high recall) to safeguard the comprehensiveness of your research.
- Time Savings and Efficiency: Robot Screener’s efficient screening translates to faster completion of SLRs, allowing you to focus your expertise on critical analysis.
- LLM Screening: Recent research shows GPT-3.5 can achieve high recall in screening without training data. LLMs and machine learning approaches are both powerful assets for streamlining research workflows within SLRs for publication, HTA, gap analysis, and other purposes without sacrificing overall review quality, so long as adjudicated methods with high recall are employed.
The Future of SLR Research: A Collaborative Approach
These validation studies pave the way for adoption of Robot Screener in adjudicated systems where humans and AI screen collaboratively. Such systems bring time savings and combine different strengths, with humans outperforming in Precision and AI potentially providing higher Recall. The studies also open the door to future research exploring the use of Robot Screener and other AI screening tools in different configurations, with different oversight or workflows, and in review types beyond publishable SLRs and SLRs for HTA.
By embracing Robot Screener, researchers can leverage the power of AI to streamline their workflows, ensure that high-quality evidence informs SLRs and evidence synthesis generally, and ultimately contribute to better health outcomes.