Smart Tag Recommendations Explained: How GPT-4 Extracts Data From Full Texts
So...can ChatGPT Do Your Systematic Review?
The short answer is no…but it might be able to help speed up the process for you. When we first started to experiment with ChatGPT back in January, we knew the technology had serious promise, and even more serious flaws. For one, we noted even then its impressive classification and identification capabilities. However, it was (and still is!) equally capable of just making up false answers, and even false citations, that look and sound true. Since January, the technology has improved, in fits and starts. GPT-3.5 has received numerous updates under the hood, making it faster and cheaper. GPT-4 is also now publicly available to anyone with a premium subscription to ChatGPT, and API access is slowly being rolled out to developers (like us!). As predicted, GPT-4 is superior to its predecessor in many ways, so much so that we recently incorporated it into the Nested Knowledge platform…though not without thinking carefully about how to position it in order to take advantage of its strengths and account for its weaknesses.
Why Tag Recommendations?
From the beginning, it was apparent that data extraction is where AI could save reviewers the most time. Screening and tagging are far and away the most time-intensive parts of conducting a systematic review, and we already have a Robot Screener with excellent accuracy. Given our experience testing a few studies in ChatGPT, we knew a large language model could not outperform human accuracy, but it also didn’t need to in order to be useful. After all, we employ far less powerful AI/ML models elsewhere in the product to simply lighten the load: classifying, reordering, or making preliminary decisions for humans to review later. In our initial testing, it seemed as though GPT-3.5 was pretty good at rapidly comparing one word (a tag name, for example), or even one word and one sentence (a tag name and tag question), against the full text of a study in order to pluck out the exact snippet of text which is most semantically similar to that tag. So we set out to augment data extraction using GPT-3.5…and immediately ran into issues.
If you think about it, the fact that GPT-3.5 struggles with tag recommendations makes sense: inferring from minimal context is tricky, even for human reviewers. Even if the tag names are chosen carefully, one word can only hold so much meaning, and much is left up to the expertise of the reviewer. That said, the performance of GPT-3.5 was bad. So bad, in fact, that we almost shelved the half-built “Smart Tag Recommendations” feature because it simply was not useful enough to launch. However, a few days later, OpenAI gave us API access to GPT-4, we swapped out the model, and saw an immediate, massive leap in accuracy.
For this particular task at least, GPT-4 demonstrates much better results when compared to GPT-3.5. In our internal testing, we see it identify approximately 80% of the same tag contents as human reviewers.
How Does It Work?
The first thing to know about Smart Tag Recommendations is that the feature is off by default. This is to ensure that review administrators make a conscious decision about whether or not to employ AI augmentation workflows in the study design. Once turned on in settings, a background batch job is run on every included record that has a full text attached to it.
Smart Tag Recommendations are derived from a number of types of data:
- Tag Names
- Tag Questions (If Forms-based Tagging is on)
- Full Texts
- A Prompt
Both the overall ontological structure and the tag names themselves are sent to the model. You can imagine the structure that GPT-4 receives by taking each of your tags, and representing the parent/child relationships with indentations:
Root Parent Tag
Child Tag 1
Child Tag 2
Child Tag 3
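To make the indented representation concrete, here is a minimal sketch of how a tag hierarchy might be flattened into that form. The `Tag` class, its field names, and `render_ontology` are illustrative assumptions for this post, not Nested Knowledge's actual data model:

```python
# Hypothetical sketch: flatten a tag tree into the indented text layout
# shown above. Field names are illustrative, not the real data model.
from dataclasses import dataclass, field

@dataclass
class Tag:
    name: str
    question: str = ""                      # optional Forms-based Tagging question
    children: list["Tag"] = field(default_factory=list)

def render_ontology(tag: Tag, depth: int = 0) -> str:
    """Render one tag per line, indenting children under their parent."""
    line = "  " * depth + tag.name
    if tag.question:                        # include the question only when present
        line += f" ({tag.question})"
    lines = [line]
    for child in tag.children:
        lines.append(render_ontology(child, depth + 1))
    return "\n".join(lines)

ontology = Tag("Root Parent Tag", children=[
    Tag("Child Tag 1"),
    Tag("Child Tag 2"),
    Tag("Child Tag 3"),
])
print(render_ontology(ontology))
```

The indentation is what conveys the parent/child relationships to the model, so a well-organized tag hierarchy directly improves the context GPT-4 receives.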
For most use cases, the tag names and structure provide enough context for GPT-4 to make useful recommendations. However, in some cases, its performance can be improved by adding additional context in the form of Tag Questions. Note that these are only sent as part of the API request if Forms-based Tagging is enabled in Settings.
For each full text, the Nested Knowledge software will parse the full text PDF (pull out all of the textual information) and keep track of where each letter, word, sentence, etc. is located in the PDF file. This enables us to locate any recommendations generated by GPT-4 in order to confirm that each recommendation does in fact exist in the document.
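The position tracking described above can be sketched in a few lines. This is only an illustration of the bookkeeping idea, assuming page texts have already been extracted; real PDF parsing (fonts, columns, line breaks) is far more involved, and these function names are not our actual internals:

```python
# Illustrative sketch: concatenate page texts while recording which page
# and character offset each position came from, so any snippet found in
# the combined text can be traced back to its place in the PDF.

def build_index(pages: list[str]) -> tuple[str, list[tuple[int, int]]]:
    """Return the full text plus a per-character (page, offset) index."""
    full_text, index = "", []
    for page_no, page_text in enumerate(pages):
        full_text += page_text
        index.extend((page_no, off) for off in range(len(page_text)))
    return full_text, index

def locate(snippet: str, full_text: str, index: list[tuple[int, int]]):
    """Map a snippet to its (page, offset) in the PDF, or None if absent."""
    pos = full_text.find(snippet)
    if pos == -1:
        return None                # not present verbatim anywhere
    return index[pos]

pages = ["Stroke outcomes were assessed at 90 days. ", "Mortality was 12%."]
text, idx = build_index(pages)
print(locate("Mortality", text, idx))  # → (1, 0): start of the second page
```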
The prompt is the only aspect of the process that is not user configurable; we use it to tell GPT-4 to return the tag recommendations in a format that our software can interpret. As the background job runs, we make one request to the OpenAI API per study, then once OpenAI has returned the result, we attempt to match each generated recommendation back to its location in the PDF. If no location can be found, or if the recommendation does not exactly match the text in that location, it is discarded.
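The exact-match filtering step might look roughly like the following. The recommendation format and the stub standing in for the model call are placeholders, not the actual Nested Knowledge or OpenAI interfaces; the point is that anything the model returns which does not appear verbatim in the parsed text is thrown away:

```python
# Hedged sketch of the validation step: keep a recommendation only if its
# quoted snippet appears verbatim in the parsed full text; otherwise it is
# treated as a potential hallucination and discarded.

def validate(recommendations: list[dict], full_text: str) -> list[dict]:
    kept = []
    for rec in recommendations:
        pos = full_text.find(rec["text"])
        if pos == -1:
            continue                         # no exact match: discard
        kept.append({**rec, "offset": pos})  # record where it was found
    return kept

full_text = "Patients were followed for 90 days after discharge."
recs = [
    {"tag": "Follow-up Duration", "text": "90 days"},
    {"tag": "Sample Size", "text": "250 patients"},  # not in the document
]
print(validate(recs, full_text))  # only the verbatim match survives
```

This conservative filter trades recall for trustworthiness: a discarded recommendation costs a reviewer nothing, while a fabricated one could mislead them.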
Caveats and Known Issues
With this new integration, it’s important to note that we are sending some information to a third party, namely OpenAI. However, we take care to limit that data to only the items mentioned above; none of your personally identifiable data is shared as part of our API requests. Additionally, according to the OpenAI API data use policies at the time of this writing, none of the data sent to OpenAI will be used to train or improve their models, and it will not be retained for more than 30 days. However, if you are uncomfortable with sharing any data with OpenAI, simply don’t turn on Smart Tag Recommendations.
For the time being, we are still rate limited by OpenAI, which means there is a delay between each API call. That delay can add up when you are requesting lots of tag recommendations, so we’ve built a number of (hopefully temporary!) limitations into the feature. First, we limit the size of the nest to 100 studies. Second, we limit the number of times you can request a “regeneration” of tag recommendations. Lastly, we make only one API call per study, regardless of how many words the full text of that study may contain. This means that only the first 5-6 pages of each study will receive tag recommendations; any subsequent pages will not be read. Additionally, abstracts will not be sent along with the full text. As OpenAI builds more capacity and begins to lift their restrictions, we will lift our self-imposed restrictions in order to make the feature as performant as possible. Given these limitations, this feature is still in Beta.
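To see why per-call delays add up, here is a minimal sketch of a rate-limited batch job, assuming one sequential request per study with a fixed pause between calls. The one-second delay and the `request_recommendations` stub are illustrative assumptions, not OpenAI's actual limits or client API:

```python
# Minimal sketch: process studies one at a time, pausing between API
# calls to stay under a provider rate limit. At 1 s/call, a 100-study
# nest already accumulates ~100 s of pure waiting, which is one reason
# to cap the nest size.
import time

def process_nest(studies, request_recommendations, delay_seconds=1.0):
    """Call the model once per study, sleeping between consecutive calls."""
    results = {}
    for i, study in enumerate(studies):
        results[study] = request_recommendations(study)
        if i < len(studies) - 1:        # no need to sleep after the last call
            time.sleep(delay_seconds)
    return results

# Example with a stub in place of the real API call:
out = process_nest(["study-1", "study-2"], lambda s: f"recs for {s}",
                   delay_seconds=0.01)
print(out)
```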
This feature is available to Nested Knowledge users with Enterprise subscriptions.
While this is our first implementation of GPT-4 into the Nested Knowledge platform, it is likely not the last. We have several R&D projects in the works; short term, expect to see a revamped Nest Guidance, which will allow you to ask questions about how Nested Knowledge works. Longer term, we’re working on a “Chat with your papers” feature, which will allow you to ask questions about key concepts and trends in full texts.
A blog about systematic literature reviews?
Yep, you read that right. We started making software for conducting systematic reviews because we like doing systematic reviews. And we bet you do too.
If you do, check out this featured post and come back often! We post all the time about best practices, new software features, and upcoming collaborations (that you can join!).
Better yet, subscribe to our blog, and get each new post straight to your inbox.