Understanding Deduplication in Nested Knowledge

If you’ve ever uploaded a set of articles into Nested Knowledge and noticed that the numbers shown in your Literature Search, PRISMA diagram, or uploaded files don’t quite match, you’re not alone. Deduplication is a nuanced process, and while Nested Knowledge handles it automatically, several key implementation details keep it sane in a newer paradigm: the living systematic review, where searches are updated and re-run over time. Understanding how and where deduplication happens will help you interpret your record counts with confidence and avoid panic when things don’t add up at first glance.

This guide explains how deduplication works in Nested Knowledge, where each number comes from, and how to set up your project for clean, auditable reporting.

Interpreting Deduplicated Record Counts in Nested Knowledge

Deduplication is central to producing a trustworthy foundation for your review. But it can also be one of the most confusing aspects of systematic review software. Nested Knowledge automatically deduplicates records so that you screen and extract from unique evidence, but different types of duplication are handled differently depending on their relevance for audit and reporting.

This guide explains:

  1. The two types of duplication that matter (and why they are treated differently).

  2. Where you see record counts across the platform, and how each relates to those duplication types.

  3. How to think about audit outputs, especially PRISMA, and how this differs from concepts like Related Reports.

1. The Key Concept: Between-Database Duplicates vs. Within-Search Matches

Not all “duplicates” are created equal. While any repeated record must be collapsed to enable screening of unique evidence, only some duplicates are meaningful for audit and reporting purposes.

Nested Knowledge distinguishes between two categories by design.

Between-Database Duplicates (Audit-Relevant)

A between-database duplicate occurs when the same record appears in two or more distinct databases that were searched. For example, a study indexed in both PubMed and Embase.

How Nested Knowledge treats them

  • Automatically identified using metadata such as DOI, title, and abstract

  • Displayed transparently in Duplicate Review, including which database(s) contributed the record

  • Explicitly counted in audit outputs, including the PRISMA diagram

Between-database duplicates represent overlap between independent sources. Reporting this overlap is essential for demonstrating the breadth of a search strategy and for tracing how records flowed from multiple databases into a single, deduplicated evidence set.

Within-Search Matches (Not Audit-Relevant)

A within-search match occurs when the same record appears more than once within a single database search, such as:

  • A PubMed export returning the same PMID multiple times

  • Multiple uploads from the same database that overlap with one another

These are not independent records. They are effectively repeated copies of the same metadata originating from a single source.

How Nested Knowledge treats them

  • Automatically collapsed during import

  • Visible in Duplicate Review for transparency or manual inspection

  • Excluded from external duplicate counts, including PRISMA

Counting within-search matches as “duplicates” would artificially inflate duplicate totals and misrepresent search overlap. Conceptually, these matches are equivalent to a database internally deduplicating results when multiple queries are ORed together.
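
To make this distinction concrete, here is a minimal sketch of the counting logic in Python. It is an illustration of the concept only, not Nested Knowledge’s actual implementation: real matching uses metadata such as DOI, title, and abstract rather than a single ID, and the names below are assumptions.

from collections import defaultdict

def count_for_prisma(records):
    """records: list of (source_label, record_id) pairs, e.g. ("PubMed", "PMID:123")."""
    ids_per_source = defaultdict(set)
    for source, record_id in records:
        ids_per_source[source].add(record_id)   # within-search repeats vanish here, uncounted

    records_identified = sum(len(ids) for ids in ids_per_source.values())
    unique_records = len(set().union(*ids_per_source.values()))
    duplicates_removed = records_identified - unique_records   # between-database overlap only
    return records_identified, duplicates_removed, unique_records

The key point: within-search repeats disappear before anything is counted, so only the cross-source overlap can ever show up as “duplicates removed.”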

2. Where You See Deduplication and Duplicate Counts

There are several locations across the platform where record counts appear. Each reflects a different stage of import and deduplication, and each aligns differently with the between-database vs. within-search distinction.

Execution History

Execution History shows the raw number of records imported from each uploaded file. This reflects the total rows present in the file, regardless of duplication.

  • No deduplication is applied here

  • Within-search matches are counted in full

  • Between-database duplication is not evaluated at this stage

Execution History represents input volume, not usable evidence.

Results Column

The “Results” column shows the number of unique studies attributed to a search after within-search deduplication.

  • Within-search matches are collapsed

  • Between-database duplicates may still exist until cross-source comparison occurs

  • This count reflects what is actually included from that source for screening

This number matches the count shown in Study Inspector when filtered to the same search.

Intersections Diagram

The Intersections view displays a Venn diagram of overlap between searches.

  • Based on deduplicated, attributed records

  • Reflects the same logic as the Results column

  • Useful for understanding how searches contribute unique vs. shared records

This view aligns with within-search deduplicated counts, not raw imports.

Duplicate Queue

The Duplicate Queue is a manual review and auditing interface used to inspect or override deduplication decisions.

  • May include records from deleted searches

  • Must be manually cleared

  • Not intended for reporting or reconciliation of counts

3. Audit Reporting, PRISMA, and Related Reports

PRISMA Diagram (Synthesis)

The PRISMA flowchart in Nested Knowledge follows the PRISMA 2020 guidelines, so its “Duplicates removed” line counts only between-database duplicates: records that appear across distinct sources such as PubMed and Embase.

  • Within-search matches are removed before PRISMA accounting

  • They do not appear in the “Duplicates removed” line

This is intentional. PRISMA is designed to show how many records were uniquely identified across sources, not how many times a single database repeated the same record.
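
To make the “identified vs. removed” relationship explicit, here is a tiny illustration in Python, with made-up record IDs standing in for real metadata matching:

pubmed = {"rec1", "rec2", "rec3"}   # unique records attributed to the PubMed search
embase = {"rec2", "rec3", "rec4"}   # unique records attributed to the Embase search

records_identified  = len(pubmed) + len(embase)                 # 6: each source reported separately
records_after_dedup = len(pubmed | embase)                      # 4 unique records to screen
duplicates_removed  = records_identified - records_after_dedup  # 2: the between-database overlap

Any within-search repeats would already have been collapsed inside each set, so they never enter this arithmetic.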

Important Clarification: Duplicates vs. Related Reports

Duplicate records and related reports are often confused, but they represent fundamentally different concepts:

  • Duplicate records

    • Same metadata

    • Same contents

    • Same record appearing more than once

  • Related reports

    • Different metadata

    • Independent records

    • Report the same underlying study or overlapping patient populations

Related reports are not a deduplication issue. They should be handled using the Related Reports feature, which ensures data are extracted once per underlying study.
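
If it helps to picture the difference in data-model terms, here is a rough sketch; the class and field names are illustrative assumptions, not Nested Knowledge’s internal model.

from dataclasses import dataclass, field

@dataclass
class Record:
    doi: str
    title: str
    sources: set = field(default_factory=set)   # duplicates collapse into one Record that
                                                 # remembers every database that supplied it

@dataclass
class Study:
    reports: list = field(default_factory=list)  # related reports stay separate Records (different
                                                  # DOIs and titles) linked to one underlying Study

Data extraction then happens once per Study, no matter how many reports point to it.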

Deduplication in Action: Examples

Example A: Single Database, One File (Within-Search Deduplication)

  • File: PubMed_A.nbib, contains 1,000 records

  • Within-search duplicates: 50

  • Execution History: 1,000

  • Results: 950

  • PRISMA:

    • Records identified: 950

    • Duplicates removed: 0

What happened: The 50 duplicates were within a single file, so they were removed silently and never show up in PRISMA’s “Duplicates removed” line.

Example B: Same Database, Two Files

  • Files: PubMed_A.nbib (800 records), PubMed_B.nbib (300 records)

  • Overlap between the files: 100

  • Execution History: 1,100

  • Results: 1,000

  • PRISMA:

    • Records identified: 1,000

    • Duplicates removed: 0

What happened: Since both files come from the same database, the overlap is treated as a within-search match and excluded from PRISMA’s duplicate count.

Example C: Two Distinct Databases (Between-Database Deduplication)

  • Files: PubMed.nbib (1,000 records), Embase.xml (900 records)

  • Between-database duplicates: 200

  • Execution History: 1,900

  • Results: 1,700

  • PRISMA:

    • Records identified: 1,900

    • Duplicates removed: 200

    • Records after deduplication: 1,700

What happened: 200 records appeared in both databases, and that overlap is correctly reported in PRISMA’s “Duplicates removed” line.

Example D: Mislabeling Databases

  • Files: PubMed export (1,000 records), Embase export (900 records)

  • Uploaded under label: “Other” (for both)

  • Actual overlap: 200

  • Execution History: 1,900

  • Results: 1,700

  • PRISMA:

    • Records identified: 1,700

    • Duplicates removed: 0

What happened: The system treated both files as one source (since the user labeled them as the same database), so the between-database duplicates were collapsed silently. This setup prevents PRISMA from showing the duplicate count correctly.
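
A quick back-of-the-envelope check of Examples C and D, using the numbers above; the arithmetic is hypothetical shorthand, but it shows how the source labels alone change what PRISMA can report.

pubmed_records, embase_records, overlap = 1000, 900, 200

# Example C: files labeled as two distinct databases
identified_c = pubmed_records + embase_records        # 1,900 records identified
after_dedup_c = identified_c - overlap                 # 1,700 unique records
duplicates_removed_c = identified_c - after_dedup_c    # 200, reported in PRISMA

# Example D: both files labeled "Other", i.e. treated as one source
identified_d = pubmed_records + embase_records - overlap   # 1,700: overlap collapsed as a within-search match
duplicates_removed_d = 0                                    # nothing left to report as between-database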

Avoiding Panic Points

If you see that Execution History = 1,900, but PRISMA only shows 1,700 records identified and no duplicates removed, that doesn’t mean records disappeared. Most likely, those duplicates occurred within a single labeled source and were removed during import, but not counted in the PRISMA duplicate line because they were not cross-database.

How to Set Up Your Project for Clean, Auditable Counts

Recommended Practices

  • Label each database accurately (e.g., “PubMed,” “Embase,” not “Other”).
    → Ensures that cross-database duplicates are reflected in PRISMA.

  • Name files clearly with query type and date (e.g., PubMed_2025-07-28_main-query.nbib).
    → Makes Execution History traceable and search updates auditable.

  • For updates, upload a new file under the same database label.
    → Prevents unintended inflation of record counts.

  • Keep grey literature and hand-searches separate, labeled distinctly.
    → PRISMA displays these separately, maintaining transparency.

  • Use Study Inspector or the Intersections view to verify attribution and clustering.
    → Helpful for reconciliation and record traceability.

Common Mistakes to Avoid

  • Uploading multiple databases under a generic label like “Other”
    → Hides true between-database duplication and skews PRISMA output.

  • Re-uploading the same export repeatedly under a new name
    → Silently creates within-search matches that won’t appear in PRISMA.

  • Mixing grey literature and database records under the same source
    → Breaks the reporting structure and makes origin tracking harder.

Reconciliation Checklist

Use this workflow when you’re comparing counts or validating your setup (a worked arithmetic check follows the checklist):

  1. Check Execution History for the raw number of records imported from each file.

  2. Confirm correct source labels for each upload (PubMed, Embase, etc.).

  3. Compare to PRISMA:

    1. “Records identified” ≈ the sum of per-source deduplicated counts (within-search matches already collapsed)

    2. “Duplicates removed” = between-database overlap only

  4. Review the Results column to see unique studies per search.

  5. Use Study Inspector or the Intersections view to trace which searches contributed to each unique record.
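
Taken together, the counts should satisfy a simple identity. Here is the check with hypothetical numbers: Example C plus 50 within-search repeats in the PubMed file.

raw_imported   = 1950   # Execution History total across all files
within_matches = 50     # collapsed silently on import; never shown in PRISMA
between_dups   = 200    # PRISMA "Duplicates removed"
unique_records = 1700   # PRISMA "Records after deduplication" / what you screen

records_identified = raw_imported - within_matches          # 1,900 on the PRISMA diagram
assert raw_imported == unique_records + within_matches + between_dups

If PRISMA’s numbers come out lower than you expect, check whether overlap you think of as between-database is actually being collapsed as within-search because of shared source labels (Example D).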

Still Have Questions?

We know how important accurate record counts are in your review. If you’re unsure about what you’re seeing or how to resolve a discrepancy, see our Duplicate Review documentation, reach out to our support team, or request a walkthrough. We’re happy to help!
