So, how accurate are Core Smart Tags? Accuracy varies across Core Tag types, in part because we use an amalgamation of models and methods, but also because each Core Tag type poses its own extraction challenges. While our accuracy statistics should be considered preliminary (we welcome external validation studies), they should give you a good sense of what this new feature is capable of.
PICOs
The primary powerhouse behind the PICO Core Tag is an open-source entity extraction model with an F1 score of 0.74 (state of the art at the time of writing) on the EBM PICO dataset.
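For readers unfamiliar with the metric, here is a minimal sketch of how a token-level F1 score of the kind cited above is computed. The labels and annotations below are hypothetical, not drawn from the EBM PICO dataset:

```python
def token_f1(gold, pred):
    """Compute precision, recall, and F1 over labeled tokens.

    gold, pred: sets of (token_index, label) pairs marking which
    tokens belong to which PICO element.
    """
    tp = len(gold & pred)  # tokens labeled correctly
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical gold annotations vs. model predictions
gold = {(0, "POP"), (1, "POP"), (5, "INT"), (6, "INT")}
pred = {(0, "POP"), (1, "POP"), (5, "INT"), (7, "OUT")}

p, r, f1 = token_f1(gold, pred)
print(round(f1, 2))  # 0.75
```

F1 balances precision (how many predicted spans were right) against recall (how many true spans were found), which is why it is the standard summary metric for span extraction tasks like this one.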
However, we add a bit more magic to build a hierarchy and context around the PICO terms once they are recognized; because that process is qualitative, we do not have a quantitative measure of the accuracy of its output. We will therefore rely on user feedback to find ways to improve those results. If you notice anything funky or undesirable in your automatically generated PICO hierarchy, please let us know; we’d love to hear from you.
Study Type
In the process of developing the Study Type Core Tag, we built a test data set composed of 1,000 randomly sampled studies from PubMed that were coded for study type by experts. Our supervised model achieved an exact match 74% of the time on this data set. In practice, we expect higher accuracy in SLR settings, where the studies under review are not a random sample (the model is biased toward clinical study types) and where hierarchical accuracy is higher.
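The distinction between exact-match and hierarchical accuracy can be sketched as follows. The study-type hierarchy and labels here are a hypothetical toy example, not our production taxonomy:

```python
# Toy child -> parent hierarchy of study types (illustrative only)
PARENT = {
    "RCT": "Clinical Trial",
    "Cohort": "Observational",
    "Case-Control": "Observational",
}

def is_hierarchical_match(gold, pred):
    """True if the prediction equals the true type or one of its ancestors."""
    node = gold
    while node is not None:
        if node == pred:
            return True
        node = PARENT.get(node)
    return False

gold_labels = ["RCT", "Cohort", "Case-Control", "RCT"]
pred_labels = ["RCT", "Observational", "Cohort", "Clinical Trial"]

exact = sum(g == p for g, p in zip(gold_labels, pred_labels)) / len(gold_labels)
hier = sum(is_hierarchical_match(g, p)
           for g, p in zip(gold_labels, pred_labels)) / len(gold_labels)
print(exact, hier)  # 0.25 0.75
```

Hierarchical accuracy gives credit when the model lands on a correct but less specific ancestor type (e.g., "Clinical Trial" for an RCT), which is often still useful for screening.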
Study Size
Using the PICO Corpus dataset, our model achieved 91% accuracy. While we’re proud of the headline number, it’s worth noting that this dataset is composed entirely of RCTs. We expect our NER-plus-heuristics approach to generalize well to other study types, but performance may vary. It’s also worth noting that “size” is ambiguous for certain study types; in a meta-analysis, for example, the model prefers the pooled number of patients when reported and otherwise falls back to the number of studies.
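To give a flavor of the heuristics side of that approach, here is a simplified regex sketch for pulling a candidate sample size out of abstract text. The pattern and phrasings are illustrative assumptions; the production pipeline combines NER with richer heuristics:

```python
import re

# Matches "n = 250" style statements or "250 patients/participants/subjects"
SIZE_PATTERN = re.compile(
    r"(?:n\s*=\s*(\d[\d,]*))|(\d[\d,]*)\s+(?:patients|participants|subjects)",
    re.IGNORECASE,
)

def extract_study_size(text):
    """Return the first candidate sample size found, or None."""
    m = SIZE_PATTERN.search(text)
    if not m:
        return None
    raw = m.group(1) or m.group(2)
    return int(raw.replace(",", ""))

print(extract_study_size("We randomized 1,204 patients to two arms."))  # 1204
print(extract_study_size("A total of n = 96 completed follow-up."))     # 96
```

A pattern like this is precise on RCT-style reporting but brittle elsewhere, which is consistent with the caveat above about generalizing beyond RCTs.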
Study Location
The Study Location Core Tag model achieves an overall accuracy of 78% (an F1 of 0.8) on a random sample of PubMed records with NCT ID linkage. This statistic will be most reliable for RCT and cohort designs, which are the most frequently registered. What counts as a “study location” varies with study type; when in doubt, our model attempts to provide the most sensible answer. The location reported will always be a country, but in a narrative review, for example, it may be the author’s location.
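Since the tag’s output is always a country, its simplest form can be sketched as matching a fixed set of country names against abstract text. The country list and abstract below are hypothetical, and the production model is considerably more involved:

```python
# Illustrative lookup list; a real system would cover all countries
# and their aliases.
COUNTRIES = {"United States", "United Kingdom", "Germany", "Japan", "Brazil"}

def find_locations(text):
    """Return the countries mentioned, in order of first appearance."""
    found = [(text.find(c), c) for c in COUNTRIES if c in text]
    return [c for _, c in sorted(found)]

abstract = "Participants were recruited at sites in Germany and Brazil."
print(find_locations(abstract))  # ['Germany', 'Brazil']
```

Returning a list rather than a single value reflects that multi-country trials are common, which is why the Location tag can appear many times per study.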
Other Details
Core Smart Tags are either applied directly to studies or offered to reviewers as recommendations, and each tag type behaves slightly differently. The PICO Core Tag and its children, for example, always expect text contents and are always accompanied by an abstract annotation; depending on the contents of the study’s abstract, you may find many recommendations or applications for that study. The following table provides a rough breakdown of what you can expect for each tag type:
| CST | Content Type | Annotation | Multiplicity in study |
| --- | --- | --- | --- |
| PICO | Text | Yes | Many |
| Type | – | No | One |
| Size | Numeric | Yes | One |
| Location | Text Options | When in abstract | Many |
Ongoing Testing and Backfill Timeline
While Core Smart Tags are still in beta, we are pleased to offer the feature to all users, though this may change as we evaluate costs and continue to develop new Core Tag types. Core Smart Tags are available now in any newly created nest, and we are working on a backfill so that existing nests will have them in the next few months. If you are an existing customer looking for an expedited backfill, feel free to reach out.