
Tracking how a literature changes between retrievals
Source: vignettes/tracking-literature-change.Rmd
A literature is a moving target. Run the same search a few months
apart and the result will have grown, and perhaps lost a record that was
re-indexed. This article shows how to see exactly what changed and how
to merge retrievals safely. It runs offline: the baseline is the bundled
example_records, and the later retrieval is built from a
synthetic entry list of the same shape the API returns.
The baseline
baseline <- example_records
nrow(baseline)
#> [1] 6
A later retrieval
Months on, the search is repeated. Here we mimic that second pull: it keeps three of the six original records, drops the other three (as can happen when records are re-indexed) and adds two new papers.
later_raw <- list(entry = list(
  # carried over from the baseline
  list(`dc:identifier` = "SCOPUS_ID:85000000001", `prism:doi` = "10.1038/s41586-019-0001-1",
       `dc:title` = "Genome editing with CRISPR-Cas9: principles and applications",
       `prism:coverDate` = "2019-04-12"),
  list(`dc:identifier` = "SCOPUS_ID:85000000002", `prism:doi` = "10.1038/s41586-020-0002-2",
       `dc:title` = "Deep learning for medical image analysis: a review",
       `prism:coverDate` = "2020-02-20"),
  list(`dc:identifier` = "SCOPUS_ID:85000000006", `prism:doi` = "10.1103/PhysRevLett.116.061102",
       `dc:title` = "Observation of gravitational waves from a binary black hole merger",
       `prism:coverDate` = "2016-02-11"),
  # newly indexed since the baseline
  list(`dc:identifier` = "SCOPUS_ID:85000000007", `prism:doi` = "10.1126/science.abc1234",
       `dc:title` = "A room-temperature superconductor candidate",
       `prism:coverDate` = "2023-03-08"),
  list(`dc:identifier` = "SCOPUS_ID:85000000008", `prism:doi` = "10.1038/s41586-023-0008-8",
       `dc:title` = "Large language models for scientific discovery",
       `prism:coverDate` = "2023-06-01")
))
later <- scopus_records(later_raw, query = "illustrative later retrieval")
nrow(later)
#> [1] 5
What changed
scopus_diff_dois() reports which DOIs were added,
removed or unchanged between the two retrievals, and prints the counts
in each category.
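Conceptually, the three categories are set operations on the two DOI vectors. The following is a minimal base-R sketch of that idea, not the package's implementation: `classify_dois()` is a hypothetical helper, and the DOI vectors are copied by hand from the example records used in this article.

```r
# DOIs of the six baseline records and the five later records,
# copied from the example used in this article.
old_dois <- c(
  "10.1002/adma.202100001",
  "10.1016/S1470-2045(20)30013-9",
  "10.1038/s41558-018-0085-1",
  "10.1038/s41586-019-0001-1",
  "10.1038/s41586-020-0002-2",
  "10.1103/PhysRevLett.116.061102"
)
new_dois <- c(
  "10.1038/s41586-019-0001-1",
  "10.1038/s41586-020-0002-2",
  "10.1103/PhysRevLett.116.061102",
  "10.1126/science.abc1234",
  "10.1038/s41586-023-0008-8"
)

# Hypothetical helper (not part of the package): classify each DOI
# by whether it appears only in `new`, only in `old`, or in both.
classify_dois <- function(old, new) {
  added     <- sort(setdiff(new, old))    # in new, not in old
  removed   <- sort(setdiff(old, new))    # in old, not in new
  unchanged <- sort(intersect(old, new))  # in both
  data.frame(
    doi = c(added, removed, unchanged),
    status = factor(
      rep(c("added", "removed", "unchanged"),
          times = c(length(added), length(removed), length(unchanged))),
      levels = c("added", "removed", "unchanged")
    ),
    stringsAsFactors = FALSE
  )
}

sketch <- classify_dois(old_dois, new_dois)
table(sketch$status)
```

The sketch compares DOIs as exact strings. Whether the package normalises case or whitespace before comparing is not stated here, so treat that as an open detail. The package call below performs the comparison on the record tables themselves.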
changes <- scopus_diff_dois(old = baseline, new = later)
changes
#> <scopus_doi_diff> 2 added, 3 removed, 3 unchanged
#> # A tibble: 8 × 2
#>   doi                            status
#>   <chr>                          <fct>
#> 1 10.1038/s41586-023-0008-8      added
#> 2 10.1126/science.abc1234        added
#> 3 10.1002/adma.202100001         removed
#> 4 10.1016/S1470-2045(20)30013-9  removed
#> 5 10.1038/s41558-018-0085-1      removed
#> 6 10.1038/s41586-019-0001-1      unchanged
#> 7 10.1038/s41586-020-0002-2      unchanged
#> 8 10.1103/PhysRevLett.116.061102 unchanged
The newly indexed papers come back as added, the records
present both times as unchanged, and anything dropped from
the later pull as removed. To act on one category, filter
the table.
changes[changes$status == "added", ]
#> <scopus_doi_diff> 2 added, 0 removed, 0 unchanged
#> # A tibble: 2 × 2
#>   doi                       status
#>   <chr>                     <fct>
#> 1 10.1038/s41586-023-0008-8 added
#> 2 10.1126/science.abc1234   added
Merging without duplicates
To keep a cumulative set across retrievals, combine them.
scopus_combine() renumbers the records and, with
dedupe = TRUE, keeps each one once by ‘Scopus’ identifier
or DOI, so the records the two pulls share are not doubled.
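To make "once by identifier or DOI" concrete, here is a rough base-R sketch of one way such a rule could work. It is an illustration under assumptions, not the package's code: `dedupe_records()` is a hypothetical helper, the column names `scopus_id` and `doi` are assumed, and the records are made up.

```r
# Made-up records for illustration; column names are assumptions.
old <- data.frame(
  scopus_id = c("id-1", "id-2", "id-3"),
  doi       = c("10.1000/a", "10.1000/b", NA),
  stringsAsFactors = FALSE
)
new <- data.frame(
  scopus_id = c("id-2", "id-9", "id-4", "id-5"),
  doi       = c("10.1000/b", "10.1000/a", "10.1000/d", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: stack the retrievals, then drop any row whose
# identifier OR DOI has already appeared in an earlier row.
dedupe_records <- function(...) {
  x <- rbind(...)
  # Missing keys never count as matches, so two records that both
  # lack a DOI are not collapsed into one.
  dup_id  <- !is.na(x$scopus_id) & duplicated(x$scopus_id)
  dup_doi <- !is.na(x$doi) & duplicated(x$doi)
  x <- x[!(dup_id | dup_doi), , drop = FALSE]
  rownames(x) <- NULL  # renumber the surviving rows
  x
}

deduped <- dedupe_records(old, new)
deduped$scopus_id
```

In this toy data `id-2` is dropped as a repeat identifier, and `id-9` is dropped because its DOI matches `id-1`, the situation that arises when a paper is re-indexed under a new identifier. The two records without a DOI are both kept. Earlier rows win, so the order of the arguments decides which copy survives.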
combined <- scopus_combine(baseline, later, dedupe = TRUE)
nrow(combined)
#> [1] 8
Keeping a record of each pull
Saving each retrieval lets you compare against it next time. The
.rds form round-trips exactly.
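Why an exact round trip matters can be seen with base R alone. The sketch below assumes nothing about how the package writes its files. It uses a small hand-made data frame (the `cover_date` column and the `query` attribute are assumptions chosen for illustration) and compares R's native serialisation with a CSV export.

```r
# A small stand-in for a retrieval: a Date column plus an attribute.
records <- data.frame(
  doi = c("10.1038/s41586-019-0001-1", "10.1126/science.abc1234"),
  cover_date = as.Date(c("2019-04-12", "2023-03-08")),
  stringsAsFactors = FALSE
)
attr(records, "query") <- "illustrative later retrieval"

# Native serialisation keeps classes and attributes.
rds_path <- tempfile(fileext = ".rds")
saveRDS(records, rds_path)
from_rds <- readRDS(rds_path)

# A CSV export keeps only the cell values as text.
csv_path <- tempfile(fileext = ".csv")
write.csv(records, csv_path, row.names = FALSE)
from_csv <- read.csv(csv_path, stringsAsFactors = FALSE)

identical(from_rds, records)  # TRUE: object restored exactly
identical(from_csv, records)  # FALSE: Date class and attribute are lost
class(from_csv$cover_date)    # "character"
```

Because the restored object is identical, a later diff against a saved baseline compares like with like.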
path <- file.path(tempdir(), "baseline.rds")
write_scopus_records(baseline, path)
identical(read_scopus_records(path), baseline)
#> [1] TRUE
In a live setting, the later retrieval would come from the API rather than a synthetic list, with everything else unchanged.
later <- scopus_fetch("TITLE-ABS-KEY(CRISPR)", field = "TITLE-ABS-KEY")
scopus_diff_dois(old = read_scopus_records(path), new = later)