NICE Guidelines for AI Methods: The Nested Knowledge Approach

New AI Guidance from NICE

In the rapidly evolving landscape of health economics and outcomes research (HEOR), artificial intelligence (AI) is becoming an increasingly valuable tool for systematic literature reviews (SLRs). At Nested Knowledge, we are committed to providing AI that users can employ responsibly to enhance the efficiency and quality of their evidence synthesis projects. Toward that goal, we recognize that ensuring compliance with leading guidelines for developing and employing AI is critical.

This article outlines how to achieve compliance with HEOR and SLR guidelines for employing AI:

  • AI Approach: First, we outline our development practices and our approach to integrating AI into human workflows for SLR and evidence synthesis, with a focus on human oversight, transparency, validation, and compliance with emerging industry standards.
  • NICE Compliance: In part two of this article, we’ll examine how our philosophy stacks up against the latest position statement from the National Institute for Health and Care Excellence (NICE), including AI-enabled practices that fall within specific guidance from this statement.

Part One: AI Philosophy and Oversight Practices

At Nested Knowledge, we adhere to three core principles when implementing any new AI feature:

Data Provenance: For each AI-generated finding, we provide the source and the specific data from that source that informs it. This ensures full traceability of information. In practice, this means that when our AI tools identify a relevant piece of information or make a recommendation, users can easily trace it back to the original source document and language. This level of transparency is crucial for maintaining the integrity of the systematic review process and allows researchers to verify the accuracy of AI-generated recommendations.
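As an illustration only, a provenance record for an AI-generated finding might be structured like the minimal sketch below; the field names and the example identifier are hypothetical and do not reflect Nested Knowledge's actual schema.

```python
from dataclasses import dataclass

@dataclass
class ProvenanceRecord:
    """Hypothetical structure tying an AI finding back to its source."""
    finding: str      # the AI-generated recommendation or extracted value
    source_id: str    # identifier of the source document (e.g., a PubMed ID or DOI)
    source_text: str  # the exact passage that supports the finding
    location: str     # where the passage appears (e.g., "Abstract")

# Example: an extracted sample size, traceable to its supporting sentence
record = ProvenanceRecord(
    finding="Sample size: 250 patients",
    source_id="PMID:12345678",
    source_text="A total of 250 patients were enrolled across 12 sites.",
    location="Abstract",
)
print(f"{record.finding} <- {record.source_id} ({record.location})")
```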

Methodological Transparency: We offer complete methodological information on how our AI is trained and employed. Where applicable, we also provide validation and accuracy data on its performance. This transparency extends to the algorithms used, the training data sets, and any known biases or limitations in our AI models. By sharing this information, we enable users to make informed decisions about how to best utilize our AI tools within their research workflows. Additionally, this openness fosters trust and allows for continuous improvement based on user feedback and evolving industry standards.

Human Oversight: In tasks where AI takes the place of human effort, we ensure that AI outputs are placed into an oversight workflow so that every AI output can be reviewed by a human expert. This maintains the critical balance between efficiency and accuracy. Our AI tools are designed to augment human expertise, not replace it. For example, in the screening process, while our Robot Screener can rapidly process thousands of articles, the Robot’s recommendations are surfaced to a human Adjudicator for oversight. This dual-layer approach combines the speed and consistency of AI with the nuanced understanding and critical thinking of experienced researchers.
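To make the recommend-then-adjudicate pattern concrete, here is a purely illustrative sketch of such a loop; the function names and inclusion rule are hypothetical and are not Nested Knowledge's API.

```python
def robot_recommendation(record: dict) -> str:
    """Placeholder for an AI screening recommendation (Include/Exclude)."""
    return "Include" if "randomized" in record["abstract"].lower() else "Exclude"

def adjudicate(record: dict, recommendation: str) -> str:
    """A human expert reviews the AI recommendation and makes the final call."""
    # In practice this is an interactive review step; here we simply accept the suggestion.
    return recommendation

records = [
    {"title": "Trial A", "abstract": "A randomized controlled trial of ..."},
    {"title": "Review B", "abstract": "A narrative overview of ..."},
]

for rec in records:
    suggestion = robot_recommendation(rec)
    decision = adjudicate(rec, suggestion)
    print(rec["title"], "->", decision)
```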

By following these principles, we provide an evidence synthesis solution with AI enhancements that are supported by:

  • Full transparency on data sources: Users can always trace back to the original documents and data points that informed AI decisions or recommendations.
  • Complete AI methods disclosure: We provide detailed documentation on our AI methodologies, including model architectures, training procedures, and performance metrics.
  • Certainty that AI is not operating without supervision: our system is designed with multiple checkpoints where human experts can intervene, validate, or override AI decisions when necessary.

Validations to Date: Robot Screener

Our commitment to transparency extends to sharing our validation results:

Internal Validation: Published at ISPOR 2024, our team demonstrated improved Recall over humans, albeit with lower Precision, across approximately 100,000 decisions in 19 SLRs. This extensive validation process involved comparing the Robot Screener’s performance against that of experienced human reviewers. The results showed that our AI tool was able to identify a significantly higher number of relevant studies (higher Recall) compared to human reviewers. However, it also included more irrelevant studies (lower Precision). This trade-off is often acceptable and even desirable for recommendations and for title/abstract screening, as it’s generally preferable to cast a wider net and then refine the selection in subsequent stages.
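For readers less familiar with these metrics, Recall and Precision are computed from screening decisions as shown below; the counts are illustrative only, not figures from the ISPOR 2024 validation.

```python
# Illustrative confusion-matrix counts (not data from the validation studies)
true_positives = 180   # relevant studies correctly recommended for inclusion
false_negatives = 10   # relevant studies missed
false_positives = 60   # irrelevant studies recommended for inclusion

recall = true_positives / (true_positives + false_negatives)     # sensitivity
precision = true_positives / (true_positives + false_positives)  # positive predictive value

print(f"Recall:    {recall:.2f}")    # ~0.95: few relevant studies missed
print(f"Precision: {precision:.2f}") # 0.75: some irrelevant studies let through
```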

External Validation: An external validation, also presented at ISPOR 2024, showed similar results across 15 SLRs. This external validation was crucial in demonstrating the consistency and generalizability of our Robot Screener’s performance across different types of systematic reviews and research questions, though for these externally validated reviews, Robot Screener’s Recall was equivalent to, not better than, that of humans. The alignment between internal and external validation results strengthens confidence in the tool’s reliability and effectiveness in diverse research contexts.

Time Savings: When used in Dual Screening, our Robot Screener saves roughly 45% to 50% of screening time. This significant time reduction is achieved without compromising the quality of the review process. By automating the initial screening of large volumes of literature, researchers can focus their time and expertise on more complex tasks such as data extraction, quality assessment, and synthesis of findings. This efficiency gain is particularly valuable in the context of rapid reviews or when dealing with fields that have a fast-growing body of literature.
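The arithmetic behind that estimate is straightforward: in a dual-screening design, two reviewers each screen every record, so substituting the Robot for one reviewer removes roughly half of the human screening decisions. The numbers below are illustrative.

```python
records = 4000                             # illustrative number of records to screen
human_decisions_dual = records * 2         # two human reviewers in conventional dual screening
human_decisions_with_robot = records * 1   # Robot Screener replaces one reviewer

savings = 1 - human_decisions_with_robot / human_decisions_dual
print(f"Human screening decisions saved: {savings:.0%}")  # 50%, before adjudication overhead
```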

AI Methods

For a comprehensive understanding of our AI methods across all AI models employed in Nested Knowledge’s software, including Bibliomine and RoboPICO, please check out our AI Methods and AI Disclosures. Our flagship features – Robot Screener for screening assistance, the recently introduced Core Smart Tags for classification, and Custom Smart Tagging Recommendations for extraction – are at the forefront of our AI integration efforts.

  • Robot Screener: This tool uses advanced machine learning algorithms to rapidly screen large volumes of literature. It learns from human decisions to improve its accuracy over time. Robot Screener can be used to provide early scores across all candidate records, and can be substituted for one team member in a dual-review process. Best of all, Robot Screener provides constant Cross-Validation on a project-by-project basis to enable responsible decision-making about when and how to employ it; see here for interpretation of Cross-Validation statistics.
  • Core Smart Tags: This system reads study abstracts to automatically identify and categorize key information within studies, including automatic hierarchy generation showing the relationships between core concepts (a hypothetical illustration of such a hierarchy follows this list). Core Smart Tags cover common elements such as study design, location, and size, as well as populations, interventions, comparators, and outcomes (PICO). By automating this initial categorization, researchers can quickly gain an overview of the literature landscape and identify trends or gaps in the evidence, or use these tags as extraction recommendations. See here for more information!
  • Custom Smart Tagging Recommendations: This feature allows users to create tailored tags specific to their research question or field of study. The AI takes in all user instructions to apply these custom tags across the studies in a project. This flexibility enables researchers to efficiently extract and organize information that is uniquely relevant to their specific review objectives. See here for an overview of how these recommendations work.
  • Bibliomine: This tool leverages natural language processing to analyze and extract key information from bibliographies and reference lists. It can help identify additional relevant studies that might have been missed in the initial database searches, enhancing the comprehensiveness of the review.
  • RoboPICO: This AI-powered tool assists in structuring research questions and defining inclusion/exclusion criteria. It helps ensure that the PICO (Population, Intervention, Comparison, Outcome) elements are clearly defined, which is crucial for maintaining consistency throughout the review process.
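For illustration, a tag hierarchy like the one Core Smart Tags generates can be thought of as a simple tree of concepts. The structure below is a hypothetical example for an imagined review topic, not output from the software.

```python
# Hypothetical tag hierarchy for a review of treatments for atrial fibrillation
tag_hierarchy = {
    "Population": {"Atrial fibrillation": {}, "Elderly (65+)": {}},
    "Intervention": {"Catheter ablation": {"Cryoballoon": {}, "Radiofrequency": {}}},
    "Comparator": {"Antiarrhythmic drugs": {}},
    "Outcomes": {"Freedom from AF": {}, "Major bleeding": {}},
    "Study design": {"Randomized controlled trial": {}, "Observational cohort": {}},
}

def print_tree(node: dict, depth: int = 0) -> None:
    """Print nested tags with indentation to show parent/child relationships."""
    for tag, children in node.items():
        print("  " * depth + tag)
        print_tree(children, depth + 1)

print_tree(tag_hierarchy)
```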

In summary, the AI tools in Nested Knowledge cover the major steps in an SLR or evidence synthesis project (Search strategy, Screening, and Extraction/Tagging of key data from sources). Each tool is integrated into a workflow for human oversight, and if you need a reference for full transparency, look no further than our Disclosure.

Part Two: Compliance with the NICE Position Statement

The National Institute for Health and Care Excellence (NICE) recently published a position statement on the use of AI in evidence synthesis. We’re pleased to note that our approach aligns closely with NICE’s recommendations, as well as those outlined in the key reference, “Generative AI for Health Technology Assessment: Opportunities, Challenges, and Policy Considerations,” by Fleurence et al. Both emphasize methodological transparency and human oversight, which are cornerstones of our AI implementation.

Using AI under the NICE Position Statement

Here’s how users of Nested Knowledge can ensure compliance with key positions from the NICE statement:

Augmentation, Not Replacement: NICE states: “Any use of AI methods should be based on the principle of augmentation, not replacement, of human involvement.” 

  • Our approach: Across each step (Search, Screen, Tag/Extract), we maintain human oversight and use a recommendation-based approach. AI augments human expertise rather than replacing it. In the Search phase, RoboPICO suggests search terms and strategies, but researchers can modify and refine these suggestions based on their expertise and the specific requirements of the review. During Screening, the Robot Screener provides initial recommendations, but human reviewers make the final decisions on inclusion or exclusion. This dual-screening approach combines the efficiency of AI with the nuanced judgment of experienced researchers. In the Tag/Extract phase, our Core Smart Tags and Custom Smart Tags provide initial categorization and extract data based on user queries, but researchers can verify or modify these extractions. This ensures that the nuanced interpretation required in data extraction benefits from both the efficiency of state-of-the-art AI and the judgment of human experts.

Use of ML and LLMs in Various Stages: NICE recommends: “Machine learning methods and large language model prompts may be able to support evidence identification by generating search strategies, automating the classification of studies, the primary and full-text screening of records to identify eligible studies, and the visualisation of search results.” 

Our implementation:

  1. Generating Search Strategies: RoboPICO and Core Smart Tags assist with Search Exploration, generating preliminary search strategies by providing feedback on the contents of a starter search. These tools analyze existing literature to suggest relevant search terms and help refine search strategies. For instance, RoboPICO helps structure PICO elements, ensuring comprehensive coverage of all relevant aspects of the research question (a hypothetical sketch of turning PICO elements into a search string follows this list).
  2. Primary screening of records: Robot Screener performs primary screening with human oversight and cross-validation. It rapidly processes large volumes of literature, flagging potentially relevant studies. However, human reviewers always verify these selections, and our cross-validation process ensures consistency and accuracy across different reviewers or AI iterations.
  3. Classification and visualization of studies: Core Smart Tags, Custom Smart Tags, and RoboPICO help classify and visualize results. These tools tag or extract from underlying studies and provide interactive visualizations of the literature landscape, from maps to histograms to interactive sunburst diagrams of tagged concepts. These tools enable researchers to quickly identify trends, gaps, and patterns in the evidence. Following user curation, custom Dashboards of interactive tables and figures, as well as interactive network diagrams of interventions and outcomes, are available as outputs.
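As a hypothetical illustration of item 1 above, structured PICO elements can be combined into a Boolean search string for a database such as PubMed. The terms and the helper function below are examples only, not RoboPICO's actual output.

```python
# Hypothetical PICO elements for a starter search strategy
pico = {
    "Population": ["atrial fibrillation"],
    "Intervention": ["catheter ablation", "cryoballoon ablation"],
    "Comparator": ["antiarrhythmic drugs"],
    "Outcome": ["stroke", "mortality"],
}

def build_query(pico: dict) -> str:
    """Join synonyms with OR within each element, and elements with AND."""
    blocks = []
    for terms in pico.values():
        if terms:
            blocks.append("(" + " OR ".join(f'"{t}"' for t in terms) + ")")
    return " AND ".join(blocks)

print(build_query(pico))
# ("atrial fibrillation") AND ("catheter ablation" OR "cryoballoon ablation") AND ...
```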

Data Extraction Automation: NICE notes: “Large language models could be used to automate data extraction,” while acknowledging this as a less established use. 

  • Our approach: Because this use is more novel, our Core and Custom Smart Tagging Recommendations are designed to assist with extraction under human oversight. These models can recognize and extract key data points from abstracts, full texts, and bibliographic information, such as PICO elements, but are not limited to pre-structured extraction. Our custom tools employ large language models to read user-generated queries and extract relevant data, with exact provenance. However, we recognize the complexity and nuance involved in data extraction for systematic reviews. Therefore, all AI-extracted data is presented as a recommendation, which researchers can easily verify, modify, or expand upon. This semi-automated approach significantly speeds up the data extraction process while maintaining the high level of accuracy required for systematic reviews.
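A minimal sketch of what recommendation-style extraction with provenance can look like is shown below; the query, fields, and values are hypothetical and do not represent Nested Knowledge's internal implementation.

```python
from dataclasses import dataclass

@dataclass
class ExtractionRecommendation:
    """Hypothetical shape of an AI extraction surfaced to a reviewer."""
    question: str       # the user-defined extraction query
    value: str          # the AI-suggested answer
    source_text: str    # exact supporting passage, for provenance
    status: str = "pending review"  # a human accepts, edits, or rejects it

recommendation = ExtractionRecommendation(
    question="What was the primary endpoint?",
    value="Freedom from atrial arrhythmia at 12 months",
    source_text="The primary endpoint was freedom from atrial arrhythmia at 12 months.",
)

# A reviewer verifies the suggestion against the quoted passage before accepting it
recommendation.status = "accepted"
print(recommendation)
```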

AI Methods Disclosure: NICE requires: “When AI is used, the submitting organisation and authors should clearly declare its use, explain the choice of method and report how it was used, including human input.” 

  • To ensure full disclosure and transparency, we are committed to providing AI methods disclosures that detail how AI is used in Nested Knowledge as a way to help our users fully explain their methods. In addition, for each systematic review conducted using Nested Knowledge tools, we can offer a full audit trail containing a history of all the actions a review team undertook to complete it (including Robot screening decisions, human intervention to override AI recommendations, and the history of Robot screening performance on a project-by-project basis). This level of disclosure ensures full transparency and allows for proper evaluation of the review methodology by peers, readers, and regulatory bodies. It also facilitates reproducibility, as other researchers can understand exactly how a given review was completed, regardless of how or whether AI was employed.
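To illustrate what such an audit trail can capture, the sketch below logs screening actions as timestamped entries; the actor names, event wording, and record identifier are hypothetical, not the platform's actual log format.

```python
from datetime import datetime, timezone

audit_trail = []

def log_event(actor: str, action: str, record_id: str) -> None:
    """Append a timestamped entry describing who did what to which record."""
    audit_trail.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "actor": actor,        # e.g., "Robot Screener" or a named reviewer
        "action": action,      # e.g., "recommended exclude", "final include"
        "record_id": record_id,
    })

log_event("Robot Screener", "recommended exclude", "PMID:12345678")
log_event("Reviewer A", "overrode recommendation: final include", "PMID:12345678")

for entry in audit_trail:
    print(entry)
```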

AI Philosophy and Approach: Considerations for Implementation

In addition, we read with interest the work of Fleurence et al. in “Generative AI for Health Technology Assessment: Opportunities, Challenges, and Policy Considerations,” which was cited in the NICE position statement and which called out several key areas that are promising for AI involvement in evidence synthesis and SLR with appropriate oversight:

  • The authors note “LLMs can assist in generating a search strategy by proposing MeSH terms and keywords to input in biomedical search engines such as PubMed,” as executed in Search Exploration by RoboPICO and Core Smart Tags;
  • The authors cite several studies that validate various language models for screening with comparable accuracy to humans; for comparison, see the Robot Screener validation studies above;
  • The authors found a wide range of performance in data extraction tasks, from near-perfect meta-analytical extraction to moderate extraction performance. Our Core and Custom Smart Tagging Recommendations present such extractions as recommendations, ensuring that humans can capture the correct data when AI extraction of custom concepts falls short;
  • The authors supported data provenance, stating “Strategies proposed to improve transparency have included requirements for the outputs to cite which part of the dataset contributed to the answer,” which is one of our AI Philosophies and is enacted in Core and Custom Smart Tagging Recommendations.

The authors finish their evaluation of AI for SLR with the following statement that matches closely with our own Human Oversight philosophy: “In summary, these early applications show that there is promise in using foundation models to support a range of tasks required in SLRs but this rapid overview indicates that … human verification is necessary.”

Conclusion

At Nested Knowledge, we are committed to leveraging AI responsibly to enhance the efficiency and quality of systematic literature reviews and evidence synthesis generally. Our approach aligns closely with industry standards, including NICE’s position statement, ensuring that HEOR professionals can confidently use our tools for evidence synthesis in HTA submissions.

By striking a balance between the speed of AI and the judgment of human experts, we are driving forward the field of evidence synthesis while upholding the highest standards of transparency and accuracy as set forth by NICE and other guideline developers. The Nested Knowledge platform is designed from the ground up to handle the time-consuming, repetitive tasks in the systematic review process, allowing researchers to focus their expertise on critical analysis and interpretation of the evidence.

The integration of AI in systematic reviews represents a significant advancement in the field of HEOR. It not only accelerates the review process but also enhances the comprehensiveness and consistency of reviews when used under human oversight. However, we recognize that the responsible use of AI requires ongoing validation, transparent reporting, and a clear understanding of its capabilities and limitations. We invite HEOR professionals and regulatory bodies alike to explore our implementation of AI features and provide feedback directly to us, as well as to conduct external validation studies. We remain committed to working with our users to shape the future of evidence synthesis, ensuring that we can meet the evolving needs of researchers, policymakers, and ultimately, patients.

If you would like a demonstration of our software, fill out the form below, and we will be happy to meet with you individually. Or, if you would prefer to explore our platform on your own, sign up and pilot these AI-assisted tools for free.
