Model Name: Smart Study Size #
Version: 1.0
Overview #
Smart Study Size is a heuristic-based algorithm designed to extract participant numbers from study abstracts. It extracts participant (experimental unit) counts using a three-phase approach:
- Checks for numbers in the text. If there are none, the abstract contains no sample size value.
- Applies regular expression (regex) matching for common phrases indicating participant numbers.
- Selects the largest number amongst population entities, with filters to exclude non-integers or irrelevant large values.
The output is a single numeric value representing the estimated participant count.
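The three phases above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual implementation: the cue patterns, the `max_plausible` threshold, and the function names are all hypothetical, and the fallback here simply scans every number in the text rather than only numbers inside population entities.

```python
import re

# Hypothetical cue patterns; the tool's real regex list is not given in this card.
CUE_PATTERNS = [
    re.compile(r"\bn\s*=\s*(\d[\d,]*)", re.IGNORECASE),
    re.compile(r"\bsample size of\s+(\d[\d,]*)", re.IGNORECASE),
    re.compile(r"(?<![\d.,])(\d[\d,]*)\s+(?:patients|participants|subjects)\b",
               re.IGNORECASE),
]

# A run of digits, optionally with thousands separators and a decimal part.
NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")


def to_int(token):
    """Convert a digit string such as '1,204' to an integer."""
    return int(token.replace(",", ""))


def estimate_study_size(abstract, max_plausible=1_000_000):
    """Return an estimated participant count, or None if none is found."""
    # Phase 1: without any digits there is nothing to extract.
    if not re.search(r"\d", abstract):
        return None

    # Phase 2: try each cue pattern in order; the first hit wins.
    for pattern in CUE_PATTERNS:
        match = pattern.search(abstract)
        if match:
            return to_int(match.group(1))

    # Phase 3 (fallback): take the largest remaining integer, skipping
    # decimals, percentages, and values above the assumed threshold.
    candidates = []
    for match in NUMBER.finditer(abstract):
        token = match.group(0)
        next_char = abstract[match.end():match.end() + 1]
        if "." in token or next_char == "%":
            continue
        value = to_int(token)
        if value <= max_plausible:
            candidates.append(value)
    return max(candidates) if candidates else None
```

In this sketch an abstract containing "(n = 1,204)" is resolved in phase 2, while an abstract with no recognized cue phrase falls through to the largest-number rule, which is where the card reports all errors occurring.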
Intended Use #
- Primary Purpose: Automatically extract participant numbers from research abstracts for meta-analyses or systematic reviews.
- Intended Users: Researchers, healthcare analysts, and systematic review teams.
Training Data #
- Dataset: This algorithm was not trained in the traditional sense. Instead, it was developed heuristically by analyzing approximately 300 research papers.
- Validation Dataset: The PICO Corpus dataset, comprising 1,011 abstracts.
- Data Limitations:
  - Heuristics were developed exclusively on English-language abstracts.
Evaluation #
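- Performance Metrics:
  - Accuracy: 91% on the PICO Corpus dataset.
  - Error Rate: 23% among cases handled by the third method (largest number), which triggers in 40% of cases. All errors arise from this method, so the overall error rate is roughly 0.23 × 0.40 ≈ 9%, consistent with the 91% accuracy.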
Ethical Considerations #
- Bias: The algorithm uses no training data and does not store or process data in a way that introduces bias. However, the heuristics were created manually, which is itself a potential source of bias.
Limitations #
- Works exclusively with English-language abstracts.
- Regex patterns are fixed, potentially missing less common phrasing.
- The fallback method of selecting the largest number may misidentify participant counts in some cases.
- Because the algorithm lacks contextual understanding, conversion of numbers written as words (e.g., “four thousand and twenty five”) to digits remains imperfect.
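To illustrate why context-free conversion of written numbers is imperfect, here is a small sketch of a word-to-digit converter. It is an assumption made for illustration only, not the converter the tool uses; the lookup tables and the `words_to_int` name are hypothetical.

```python
UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}
SCALES = {"thousand": 1_000, "million": 1_000_000}


def words_to_int(phrase):
    """Convert an English number phrase to an integer, ignoring context."""
    total = 0    # completed groups (thousands, millions)
    current = 0  # group currently being built (below one thousand)
    for word in phrase.lower().replace("-", " ").split():
        if word == "and":
            continue  # "and" is always treated as filler, never as a separator
        if word in UNITS:
            current += UNITS[word]
        elif word in TENS:
            current += TENS[word]
        elif word == "hundred":
            current = max(current, 1) * 100
        elif word in SCALES:
            total += max(current, 1) * SCALES[word]
            current = 0
        else:
            raise ValueError(f"not a number word: {word!r}")
    return total + current
```

The sketch reads “four thousand and twenty five” as 4025, but because it cannot tell when “and” joins two separate quantities, it also merges “sixty and forty” into 100. Resolving that kind of ambiguity requires the contextual understanding this limitation refers to.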
Planned Improvements #
- Allow for contextual understanding in the text-to-digit conversion.
- Extend capabilities to non-English abstracts to increase global applicability.
Contact Information #
For questions, feedback, or support, please contact support@nested-knowledge.com.
PALISADE Compliance #
Purpose
The purpose of Smart Study Size is clearly defined: extract participant numbers from research abstracts. Its implementation is ethically sound, aiming to assist in research synthesis without introducing undue risk or bias.
Appropriateness
The tool is appropriate for the task of extracting study sample sizes from research abstracts because it is designed to recognize common patterns in the reporting of sample sizes, which are typically expressed in a consistent manner (e.g., “n = 100” or “sample size of 150”). The heuristics employed are based on the most frequent ways sample size information is presented in academic literature, making the tool well-suited for extracting data from abstracts in the target domain.
Limitations
- The algorithm relies on predefined regex patterns and may not generalize to abstracts with uncommon phrasing.
- The fallback method (largest number) can occasionally lead to errors.
- Limitations of the data: Restricted to English-language abstracts; performance may vary for ambiguous or poorly structured abstracts.
Implementation
The model is readily available in cloud software and can run on standard hardware, or on GPUs for faster computation.
Sensitivity and Specificity
Evaluation on the PICO Corpus dataset showed 91% accuracy. Because extracting a participant count is ultimately a regression problem, sensitivity and specificity are not meaningful metrics in this case.
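The card does not state how accuracy is computed. One plausible reading, assumed here purely for illustration, is exact-match accuracy: a prediction counts as correct only when it equals the annotated count. The `exact_match_accuracy` helper below is hypothetical.

```python
def exact_match_accuracy(predicted, actual):
    """Fraction of abstracts whose predicted count equals the annotated count."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("at least one abstract is required")
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(actual)
```

Since every output is a number rather than a positive or negative label, there is no natural confusion matrix from which sensitivity or specificity could be derived; a per-abstract match rate is the simpler summary.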
Algorithm Characteristics
- Design: Heuristic-based, with three-phase processing.
- Transparency: Regex patterns and fallback rules are explicitly documented for reproducibility.
Data Characteristics
- Development: Regex patterns were based on manual analysis of approximately 300 abstracts, specifically of RCTs covering a range of interventions.
- Evaluation Dataset: PICO Corpus dataset (1,011 abstracts).
Explainability
The algorithm’s operation is simple and transparent. Outputs (a single numeric value, and its location in the abstract) are straightforward and understandable by end-users.
Additional Notes on Compliance #
This algorithm is not trained on the input data. It does not learn, store, or transmit input data, ensuring privacy and avoiding potential biases or ethical concerns related to training data.