The Elsevier Scopus Search API is generous but bounded. A weekly quota limits how many requests you may make, a short-term rate limit caps how fast you may make them, and no single query will return more than its first 5000 records. This article shows how scopusflow works within those bounds so that a large retrieval is reproducible, efficient and resumable. The steps that contact the API need a key and are not run here; everything else runs offline.
A query, built safely
Most queries combine a few terms under a field tag.
scopus_query() assembles them without the bracket and tag
mistakes that creep in when fragments are pasted together by hand.
q <- scopus_query("language learning", "effect size", .field = "TITLE-ABS-KEY")
q
#> [1] "TITLE-ABS-KEY(language learning) AND TITLE-ABS-KEY(effect size)"
The recognised field tags, and what each one searches, are listed by
scopus_field_tags().
scopus_field_tags()
#> # A tibble: 12 × 2
#>    tag                searches
#>    <chr>              <chr>
#>  1 TITLE              Words in the document title
#>  2 TITLE-ABS-KEY      Title, abstract and keywords
#>  3 TITLE-ABS-KEY-AUTH Title, abstract, keywords and author names
#>  4 ABS                Abstract text
#>  5 KEY                Indexed and author keywords
#>  6 AUTH               Author names
#>  7 AUTHKEY            Author-supplied keywords
#>  8 AFFIL              Affiliation, any part
#>  9 AFFILORG           Affiliation organisation name
#> 10 SRCTITLE           Source (publication) title
#> 11 DOI                Digital Object Identifier
#> 12 ALL                All available fields
Describing the search as a plan
A plan records exactly what will be fetched, so it can be saved, reviewed and re-run. Partitioning by year is the recommended way to stay under the 5000-record ceiling: each year becomes its own cell, and the ceiling applies to each cell's query separately.
plan <- scopus_plan(q, years = 2010:2020, partition = "year")
plan
#> <scopus_plan> (11 cells, view "STANDARD", partition "year")
#> # A tibble: 11 × 6
#>     cell query                                       date   year view  page_size
#>  * <int> <chr>                                       <chr> <int> <chr>     <int>
#>  1     1 TITLE-ABS-KEY(language learning) AND TITLE… 2010   2010 STAN…       200
#>  2     2 TITLE-ABS-KEY(language learning) AND TITLE… 2011   2011 STAN…       200
#>  3     3 TITLE-ABS-KEY(language learning) AND TITLE… 2012   2012 STAN…       200
#>  4     4 TITLE-ABS-KEY(language learning) AND TITLE… 2013   2013 STAN…       200
#>  5     5 TITLE-ABS-KEY(language learning) AND TITLE… 2014   2014 STAN…       200
#>  6     6 TITLE-ABS-KEY(language learning) AND TITLE… 2015   2015 STAN…       200
#>  7     7 TITLE-ABS-KEY(language learning) AND TITLE… 2016   2016 STAN…       200
#>  8     8 TITLE-ABS-KEY(language learning) AND TITLE… 2017   2017 STAN…       200
#>  9     9 TITLE-ABS-KEY(language learning) AND TITLE… 2018   2018 STAN…       200
#> 10    10 TITLE-ABS-KEY(language learning) AND TITLE… 2019   2019 STAN…       200
#> 11    11 TITLE-ABS-KEY(language learning) AND TITLE… 2020   2020 STAN…       200
Each cell carries the query, the year, the view and the page size. The page size deserves a moment’s attention, because it is where quota is won or lost.
Why page size is a quota decision
Scopus charges quota per request, not per record. A page may hold up
to 200 records under the STANDARD view, or 25 under
COMPLETE. Retrieving a thousand records in pages of 200
therefore costs five requests, where pages of 25 would cost forty. For
that reason page_size defaults to the largest value the view
allows, which is the same efficiency rscopus relies on. This
is in no sense an evasion of the quota: every request is counted, and
the 5000-record ceiling still holds.
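The arithmetic is easy to check by hand. The following is a minimal base R sketch, not part of scopusflow: requests_needed() and page_limits are illustrative names, and the page sizes are the ones quoted above.

```r
# Requests needed to page through n_records at a given page size.
# Illustrative helper, not a scopusflow function.
requests_needed <- function(n_records, page_size) {
  ceiling(n_records / page_size)
}

# Page-size limits per view, as stated in this article.
page_limits <- c(STANDARD = 200, COMPLETE = 25)

# A thousand records: 5 requests under STANDARD, 40 under COMPLETE.
requests_needed(1000, page_limits)

# A query at the 5000-record ceiling: 25 requests versus 200.
requests_needed(5000, page_limits)
```

A partly filled final page still costs a whole request, which is why the helper rounds up.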
scopus_plan(q, view = "STANDARD")$page_size[1]
#> [1] 200
scopus_plan(q, view = "COMPLETE")$page_size[1]
#> [1] 25
Sizing before spending
Counting is cheap and does not download records, so it is worth doing first. The count comes back with the parsed quota attached, which lets a workflow decide whether it has the allowance to proceed.
n <- scopus_count(q, years = 2010:2020)
n
attr(n, "quota")
Fetching, with caching and resume
scopus_fetch_plan() runs each cell in turn. Given a
cache directory it writes each cell to disk as it completes, so a run
interrupted halfway, or stopped by the quota, resumes from where it left
off rather than paying for the same cells again.
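The idea behind cell-level caching can be sketched in base R. This is an illustration of the general technique only, not scopusflow's implementation: fetch_cell() is a hypothetical stand-in for an API call, and the file naming scheme is an assumption.

```r
# Hypothetical stand-in for an API call; it records which cells were fetched.
fetched <- integer()
fetch_cell <- function(cell) {
  fetched <<- c(fetched, cell)
  data.frame(cell = cell, value = cell * 10)
}

# Run each cell in turn, reusing the cached .rds file when one exists.
run_cells <- function(cells, cache_dir) {
  dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)
  results <- lapply(cells, function(cell) {
    path <- file.path(cache_dir, sprintf("cell-%03d.rds", cell))
    if (file.exists(path)) {
      return(readRDS(path))   # already paid for: read it back from disk
    }
    out <- fetch_cell(cell)   # not cached yet: fetch, then save
    saveRDS(out, path)
    out
  })
  do.call(rbind, results)
}

cache  <- tempfile("cells-")
first  <- run_cells(1:2, cache)  # fetches cells 1 and 2
second <- run_cells(1:3, cache)  # reuses 1 and 2, fetches only cell 3
fetched                          # each cell was fetched exactly once
```

Writing the file only after a cell completes means an interrupted cell leaves nothing behind and is simply fetched again on the next run.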
records <- scopus_fetch_plan(
  plan,
  cache_dir = scopus_cache_dir(),
  resume = TRUE
)
records
The result is a scopus_records tibble, the same shape
returned by scopus_fetch() for a single query and by the
bundled example_records.
example_records
#> <scopus_records> 6 records
#> query: "illustrative multi-disciplinary sample"
#> # A tibble: 6 × 9
#>   entry_number scopus_id   doi   title authors  year date  publication citations
#>          <int> <chr>       <chr> <chr> <chr>   <int> <chr> <chr>           <int>
#> 1            1 85000000001 10.1… Geno… Zhang …  2019 2019… Nature            540
#> 2            2 85000000002 10.1… Deep… Kumar …  2020 2020… Nature            210
#> 3            3 85000000003 10.1… Clim… Okafor…  2018 2018… Nature Cli…       122
#> 4            4 85000000004 10.1… Grap… Tanaka…  2021 2021… Advanced M…        45
#> 5            5 85000000005 10.1… Chec… Garcia…  2020 2020… The Lancet…       388
#> 6            6 85000000006 10.1… Obse… Abbott…  2016 2016… Physical R…      4200
Combining separate retrievals
Results gathered in separate runs combine safely with
scopus_combine(), which renumbers the records and can drop
duplicates by Scopus identifier or DOI. This is preferable to
rbind(), which would leave duplicate entry numbers.
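To see what goes wrong with a plain rbind(), and what a combine step has to do, here is a minimal base R sketch. It is not scopus_combine() itself: combine_records() is a hypothetical helper, the two data frames hold made-up values, and only the column names are taken from the record tables in this article.

```r
# Hypothetical helper: bind, drop duplicates by Scopus ID or DOI, renumber.
combine_records <- function(..., dedupe = TRUE) {
  out <- do.call(rbind, list(...))
  if (dedupe) {
    dup_id  <- duplicated(out$scopus_id)
    # A missing DOI must never count as a match for another missing DOI.
    dup_doi <- !is.na(out$doi) & duplicated(out$doi)
    out <- out[!(dup_id | dup_doi), , drop = FALSE]
  }
  out$entry_number <- seq_len(nrow(out))
  rownames(out) <- NULL
  out
}

# Two made-up retrievals that overlap on record "S2".
a <- data.frame(entry_number = 1:2, scopus_id = c("S1", "S2"),
                doi = c("10.1/a", NA), stringsAsFactors = FALSE)
b <- data.frame(entry_number = 1:2, scopus_id = c("S2", "S3"),
                doi = c(NA, "10.1/c"), stringsAsFactors = FALSE)

rbind(a, b)$entry_number   # 1 2 1 2: entry numbers repeat
combined <- combine_records(a, b)
combined$entry_number      # 1 2 3: renumbered after deduplication
combined$scopus_id         # "S1" "S2" "S3"
```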
scopus_combine(example_records, example_records, dedupe = TRUE)
#> <scopus_records> 6 records
#> query: "illustrative multi-disciplinary sample"
#> # A tibble: 6 × 9
#>   entry_number scopus_id   doi   title authors  year date  publication citations
#>          <int> <chr>       <chr> <chr> <chr>   <int> <chr> <chr>           <int>
#> 1            1 85000000001 10.1… Geno… Zhang …  2019 2019… Nature            540
#> 2            2 85000000002 10.1… Deep… Kumar …  2020 2020… Nature            210
#> 3            3 85000000003 10.1… Clim… Okafor…  2018 2018… Nature Cli…       122
#> 4            4 85000000004 10.1… Grap… Tanaka…  2021 2021… Advanced M…        45
#> 5            5 85000000005 10.1… Chec… Garcia…  2020 2020… The Lancet…       388
#> 6            6 85000000006 10.1… Obse… Abbott…  2016 2016… Physical R…      4200
When the ceiling bites
A query matching more than 5000 records cannot be retrieved in full
from a single call; scopus_fetch() returns the first 5000
and warns. The remedy is the plan: split the search by year, or by any
other facet, so that each cell stays under the ceiling.
scopus_count() tells you in advance whether a split is
needed.
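Once per-year counts are in hand, the decision is mechanical. The sketch below uses base R only: the counts are made up for illustration, and cells_over_ceiling() is a hypothetical helper, not a scopusflow function.

```r
ceiling_limit <- 5000

# Made-up per-year counts, standing in for what a count step would report.
counts <- c("2018" = 3200, "2019" = 4800, "2020" = 7400)

# Names of the cells that would still be truncated at the ceiling.
cells_over_ceiling <- function(counts, limit = ceiling_limit) {
  names(counts)[counts > limit]
}

total <- sum(counts)
total > ceiling_limit        # TRUE: the unsplit query would be truncated
cells_over_ceiling(counts)   # "2020": this year needs a further facet
```

In this example a split by year rescues 2018 and 2019, while 2020 would need to be divided again by some other facet.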
Handling interruptions
Network and API problems are raised as typed conditions, all
inheriting from scopus_error, so a long retrieval can
respond to them rather than stopping dead.
result <- tryCatch(
  scopus_fetch_plan(plan, cache_dir = scopus_cache_dir()),
  scopus_error_rate_limit = function(e) {
    message("Rate limited; the cached cells are safe. Try again later.")
    NULL
  }
)
Because each completed cell is already cached, resuming after such a pause costs nothing for the work already done.
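For readers unfamiliar with typed conditions, the mechanism can be shown in base R without touching the API. This is a sketch under stated assumptions: rate_limit_error(), flaky_fetch() and with_retry() are hypothetical, and only the two class names are taken from this article.

```r
# Hypothetical constructor for a typed condition using the article's class names.
rate_limit_error <- function(msg) {
  structure(
    class = c("scopus_error_rate_limit", "scopus_error", "error", "condition"),
    list(message = msg, call = NULL)
  )
}

# Stand-in for a fetch that is rate limited twice before it succeeds.
attempts <- 0
flaky_fetch <- function() {
  attempts <<- attempts + 1
  if (attempts < 3) stop(rate_limit_error("rate limited"))
  "done"
}

# Retry only on the rate-limit class; any other error propagates as usual.
with_retry <- function(f, max_tries = 5, wait = 0) {
  for (i in seq_len(max_tries)) {
    outcome <- tryCatch(
      list(ok = TRUE, value = f()),
      scopus_error_rate_limit = function(e) list(ok = FALSE)
    )
    if (outcome$ok) return(outcome$value)
    Sys.sleep(wait)   # in real use, wait long enough for the limit to reset
  }
  stop("still rate limited after ", max_tries, " tries")
}

outcome <- with_retry(flaky_fetch)
outcome    # "done", reached on the third attempt

# A handler for the parent class catches the more specific condition too.
caught <- tryCatch(stop(rate_limit_error("x")),
                   scopus_error = function(e) class(e)[1])
caught     # "scopus_error_rate_limit"
```

Handling the specific class lets a workflow pause and retry on rate limits while still failing fast on anything unexpected.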
