Model Name: Screening Model #
Version: 1.0 #
Overview #
The Screening Model is a machine learning system designed to assist systematic review teams in prioritizing and screening literature records based on their likelihood of inclusion in a review.
The model learns from screening decisions made within a specific nest, identifying patterns associated with included, excluded, or advanced records. It then generates inclusion or advancement probabilities for unscreened records.
These probabilities can be used to:
- Assist manual screening workflows by prioritizing records based on predicted relevance, or
- Power the Robot Screener, an automated reviewer that can act as a second reviewer in dual screening workflows.
The system is designed to accelerate the screening stage of evidence synthesis while maintaining a human-in-the-loop workflow.
Intended Use #
Primary Purpose #
To support systematic review workflows by predicting the probability that a study should be included or advanced during screening, enabling reviewers to prioritize high-relevance records and reduce manual screening workload.
Intended Users #
- Systematic review teams
- Clinical researchers
- Evidence synthesis professionals
- Health technology assessment groups
- Enterprise research organizations conducting literature reviews
Limitations #
- Model performance depends on the quantity and quality of prior screening decisions in the nest.
- Early in a review, limited training data may reduce model reliability.
- Screening decisions rely on human-defined inclusion criteria, which may be complex or difficult for the model to learn.
- The model should not replace expert review for final screening decisions unless used within controlled workflows (e.g., Robot Screener in dual-review mode).
Training Data #
Dataset #
The Screening Model is trained dynamically using screening decisions within a specific nest, including:
- Included records
- Excluded records
- Advanced records (in two-pass screening)
Training data is generated from reviewer decisions made during the screening process.
Input Features #
The model uses multiple features derived from bibliographic metadata and textual information, including:
- Bibliographic metadata
- Publication age
- Page count
- Keywords and descriptors
- Abstract content
- Text n-grams
- Text embeddings (OpenAI embedding model)
- Citation metrics from Scite
- Number of citing publications
- Supporting and contrasting citation statements
Missing features are imputed using distributions derived from other records within the nest.
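As an illustrative sketch only (not the production implementation), distribution-based imputation can be as simple as filling a record's missing numeric features from the nest-wide values of the same feature; the median below is one such distribution-derived choice.

```python
import numpy as np
import pandas as pd

# Hypothetical nest with two numeric features and scattered missing values.
nest = pd.DataFrame({
    "publication_age": [2.0, 5.0, np.nan, 8.0],
    "page_count":      [12.0, np.nan, 7.0, 9.0],
})

# Replace each missing value with the nest-wide median of that feature,
# so imputed values reflect the distribution of the other records.
imputed = nest.fillna(nest.median(numeric_only=True))
```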
Language #
Primarily English.
Evaluation #
Performance Metrics #
Internal testing across several hundred systematic review projects produced the following representative performance metrics.
Standard Screening #
| Metric | Value |
|---|---|
| AUC | 0.88 |
| Classification Accuracy | 0.92 |
| Recall | 0.76 |
| Precision | 0.40 |
| F1 Score | 0.51 |
Two-Pass Screening #
| Metric | Value |
|---|---|
| AUC | 0.88 |
| Classification Accuracy | 0.93 |
| Recall | 0.81 |
| Precision | 0.44 |
| F1 Score | 0.56 |
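The tabled metrics are related through standard confusion-matrix definitions. The counts below are hypothetical, chosen only to show how recall, precision, accuracy, and F1 are derived; they are not the actual evaluation data behind the tables.

```python
# Hypothetical screening outcomes: true/false positives and negatives.
tp, fp, fn, tn = 76, 114, 24, 786

recall    = tp / (tp + fn)                       # share of relevant records found
precision = tp / (tp + fp)                       # share of flagged records that are relevant
accuracy  = (tp + tn) / (tp + fp + fn + tn)      # overall agreement with reviewer labels
f1        = 2 * precision * recall / (precision + recall)  # harmonic mean of P and R
```

Note that with heavily imbalanced screening data (many true negatives), accuracy stays high even when precision is modest.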
Performance Characteristics #
- High recall by design, minimizing the risk of excluding relevant studies.
- Precision is lower due to class imbalance and the model’s deliberate bias against excluding records: uncertain records are kept in rather than screened out.
- Accuracy is typically high due to the large proportion of excluded records in screening datasets.
Known Issues #
- Performance varies based on the size of the screened dataset used for training.
- Class imbalance (many more exclusions than inclusions) can reduce precision.
- Some records may lack sufficient metadata or abstract text for reliable prediction.
- Predictions early in the screening process may be unstable due to limited training examples.
Ethical Considerations #
Human-in-the-Loop Limitations #
The Screening Model is intended to augment human screening workflows, not replace expert judgment.
When used solely to generate inclusion probabilities, the model provides decision support that reviewers may use to prioritize records.
When used as the Robot Screener, the model may act as a second reviewer in dual-review workflows. While this can significantly accelerate screening, it increases reliance on automated predictions.
Best practice is to:
- Monitor model performance metrics
- Validate screening outcomes
- Use human adjudication where disagreements occur
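The adjudication rule above can be sketched as a simple decision function, assuming the Robot Screener acts as the second reviewer; the function and status names here are hypothetical.

```python
def resolve(human_decision: str, robot_decision: str) -> str:
    """Dual-review sketch: agreement finalizes the decision,
    disagreement routes the record to a human adjudicator."""
    if human_decision == robot_decision:
        return human_decision      # both reviewers agree
    return "adjudicate"            # conflict: escalate to human review

# Example outcomes:
assert resolve("include", "include") == "include"
assert resolve("include", "exclude") == "adjudicate"
```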
Limitations #
- Model accuracy depends on the number of screened records available for training.
- Early-stage reviews may not provide sufficient training data for strong performance.
- Abstract-only information may not fully capture eligibility criteria.
- Missing metadata may reduce prediction reliability.
- Precision is intentionally lower than recall, meaning the model may suggest inclusion of some irrelevant records.
Contact Information #
For questions, feedback, or support, please contact support@nested-knowledge.com.
PALISADE Compliance #
Purpose #
To assist literature screening during systematic reviews by estimating the probability that a record should be included or advanced based on patterns learned from prior screening decisions.
Appropriateness #
The Screening Model is appropriate for:
- Evidence synthesis workflows
- Systematic literature reviews
- Research prioritization tasks
It is not intended for clinical decision-making or diagnostic use.
Limitations #
- Predictions depend on patterns learned from reviewer behavior within a specific nest.
- Performance improves as more records are screened.
- Automated screening workflows (e.g., Robot Screener) may introduce unreviewed errors if not monitored.
Implementation #
The Screening Model uses a gradient-boosted decision tree ensemble trained on screening decisions within each nest.
At a high level, the model evaluates records by asking a series of binary questions about their characteristics (e.g., metadata, textual features, citation signals). These decisions collectively produce an estimated probability of inclusion or advancement.
Model characteristics include:
- Gradient-boosted decision tree ensemble
- Logistic loss optimization
- Cross-validation–based hyperparameter tuning
- SMOTE oversampling to address class imbalance
- Per-nest model training
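A minimal sketch of per-nest training, assuming scikit-learn. The real pipeline applies SMOTE and cross-validated hyperparameter tuning; here those are approximated by naive minority oversampling and default settings, and the feature matrix is synthetic.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier  # logistic (log-loss) objective by default

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                             # stand-in features for one nest
y = (X[:, 0] + rng.normal(size=200) > 1.2).astype(int)    # imbalanced include/exclude labels

# Naive duplication-based oversampling as a stand-in for SMOTE:
# replicate minority examples until classes are balanced.
minority = np.flatnonzero(y == 1)
extra = rng.choice(minority, size=max(0, (y == 0).sum() - minority.size), replace=True)
X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])

model = GradientBoostingClassifier().fit(X_bal, y_bal)
proba = model.predict_proba(X)[:, 1]                      # inclusion/advancement probabilities
```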
Training begins once the following thresholds are met:
- 50 screened records
- 10 included or advanced records
After training, the model may update automatically as additional screening decisions are made.
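The activation thresholds stated above amount to a simple gate; the function name is hypothetical.

```python
def ready_to_train(n_screened: int, n_included_or_advanced: int) -> bool:
    """Training begins only once a nest has at least 50 screened
    records, at least 10 of them included or advanced."""
    return n_screened >= 50 and n_included_or_advanced >= 10
```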
Sensitivity and Specificity #
The Screening Model is designed to prioritize high recall (sensitivity) over precision.
This design reflects the asymmetric cost of screening errors:
- False exclusions (missing a relevant study) are highly costly.
- False inclusions can be corrected later during downstream review stages.
Typical performance characteristics include:
- Recall typically between 0.75 and 0.80
- Precision typically between 0.40 and 0.45
- AUC around 0.88
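One way such a recall-first operating point can be chosen is to lower the probability cutoff until a target recall is met on held-out labels. This sketch and its target value are illustrative, not the product's actual configuration.

```python
def threshold_for_recall(probs, labels, target_recall):
    """Scan candidate cutoffs from high to low and return the highest
    probability cutoff whose recall meets the target."""
    for cut in sorted(set(probs), reverse=True):
        preds = [p >= cut for p in probs]
        tp = sum(p and l for p, l in zip(preds, labels))
        fn = sum((not p) and l for p, l in zip(preds, labels))
        if tp / (tp + fn) >= target_recall:
            return cut
    return 0.0  # no cutoff reaches the target: include everything
```

Lowering the cutoff trades precision for recall, which is exactly the asymmetry the screening design accepts.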
Algorithm Characteristics #
- Gradient-boosted decision tree ensemble
- Probabilistic prediction of inclusion or advancement
- Trained on reviewer decisions within each nest
- Cross-validation–based performance evaluation
- Class imbalance correction via SMOTE
The model is deterministic once trained but continuously updated as new screening data becomes available.
Data Characteristics #
The model processes data derived from records within the nest, including:
- Bibliographic metadata
- Abstract text
- Keyword and descriptor fields
- Citation metrics
- Text embeddings
- Derived linguistic features (n-grams)
Records may contain incomplete metadata; missing values are imputed during training.
Explainability #
The Screening Model provides transparency through:
- Probability scores representing likelihood of inclusion or advancement
- Cross-validation performance metrics
These outputs allow reviewers to:
- Assess model reliability
- Monitor screening progress
- Identify high-relevance records
Model predictions should be interpreted as decision-support signals rather than definitive screening decisions.
Additional Notes on Compliance #
The Screening Model operates only on user-provided records within a nest and does not retain content beyond the review workflow.
Because the model learns from reviewer behavior and dataset characteristics, performance may vary between reviews. Human oversight remains essential to ensure that relevant studies are not inadvertently excluded.