Title: | NEON Data Store |
---|---|
Description: | The National Ecological Observatory Network (NEON) provides access to its numerous data products through its REST API, <https://data.neonscience.org/data-api/>. This package provides a high-level user interface for downloading and storing NEON data products. Unlike 'neonUtilities', this package avoids repeated downloading, provides persistent storage, and improves performance. 'neonstore' can also construct a local 'duckdb' database of stacked tables, making it possible to work with tables that are far too big to fit into memory. |
Authors: | Carl Boettiger [aut, cre] , Quinn Thomas [aut] , Christine Laney [aut] , Claire Lunch [aut] , Noam Ross [ctb] |
Maintainer: | Carl Boettiger <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.5.1 |
Built: | 2024-11-03 03:56:20 UTC |
Source: | https://github.com/cboettig/neonstore |
Generate the appropriate citation for your data
neon_citation(product = NULL, download_date = Sys.Date(), dir = neon_dir())
product |
A NEON |
download_date |
Date of download to be included in citation. default is today's date, see details. |
dir |
Location where files should be downloaded. By default will
use the appropriate applications directory for your system
(see |
Note that neon_download() does not record the download date for each file. Citing a single download date for a product is, after all, rather meaningless, as parts of a product may have been downloaded on different dates. Indeed, neon_download() is designed in precisely this way, to allow easy updating of downloads without re-downloading older data.
returns a utils::bibentry object, which can be used as text or formatted for bibtex.
https://www.neonscience.org/data-samples/data-policies-citation
# may be slow
neon_citation("DP1.10003.001")
## or the citation for all products in store:
neon_citation()
## as bibtex
format(neon_citation("DP1.10003.001"), "bibtex")
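Because the return value is a utils::bibentry object, it can be handled with base R alone. The following is a minimal sketch that builds a hypothetical entry by hand (the title, author, year, and URL are made up and are not real NEON metadata) and renders it both ways. It does not call neon_citation() and needs no network access.

```r
# Hypothetical entry for illustration; these field values are NOT real NEON metadata.
cite <- utils::bibentry(
  bibtype = "Misc",
  title   = "Example data product (hypothetical)",
  author  = utils::person("Example Observatory"),
  year    = "2024",
  url     = "https://example.org/data-product"
)

# Plain-text rendering, suitable for pasting into a manuscript
cite_text <- format(cite, style = "text")

# BibTeX rendering, suitable for a .bib file
cite_bibtex <- format(cite, style = "bibtex")
writeLines(cite_bibtex)
```

The same two format() calls should apply to the object returned by neon_citation().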
neon cloud
neon_cloud( table, product, start_date = NA, end_date = NA, site = NA, type = "basic", release = NA, quiet = FALSE, api = "https://data.neonscience.org/api/v0", unify_schemas = FALSE, .token = Sys.getenv("NEON_TOKEN") )
table |
NEON table name |
product |
A NEON |
start_date |
Download only files as recent as ( |
end_date |
Download only files up to end_date ( |
site |
4-letter site code(s) to filter on. Leave as |
type |
Should we prefer the basic or expanded version of this product? Note that not all products have expanded formats. |
release |
Select only data files associated with a particular release tag, see https://www.neonscience.org/data-samples/data-management/data-revisions-releases, e.g. "RELEASE-2021". Releases are associated with a specific DOI and the promise that files associated with a particular release will not change. |
quiet |
Should download progress be displayed? |
api |
the URL to the NEON API, leave as default. |
unify_schemas |
if cloud-read fails to collect data due to mismatched
schemas, set this to |
.token |
an authentication token from NEON. A token is not
required but will allow access to a higher number of requests before
rate limiting applies, see
https://data.neonscience.org/data-api/rate-limiting/#api-tokens.
Note that once files are downloaded once, |
lazy data frame
Query the NEON API for URLs of matching data products. Repeated requests will be cached.
neon_data( product, start_date = NA, end_date = NA, site = NA, type = NA, release = NA, quiet = FALSE, api = "https://data.neonscience.org/api/v0", .token = Sys.getenv("NEON_TOKEN") )
product |
A NEON |
start_date |
Download only files as recent as ( |
end_date |
Download only files up to end_date ( |
site |
4-letter site code(s) to filter on. Leave as |
type |
Should we prefer the basic or expanded version of this product? Note that not all products have expanded formats. |
release |
Select only data files associated with a particular release tag, see https://www.neonscience.org/data-samples/data-management/data-revisions-releases, e.g. "RELEASE-2021". Releases are associated with a specific DOI and the promise that files associated with a particular release will not change. |
quiet |
Should download progress be displayed? |
api |
the URL to the NEON API, leave as default. |
.token |
an authentication token from NEON. A token is not
required but will allow access to a higher number of requests before
rate limiting applies, see
https://data.neonscience.org/data-api/rate-limiting/#api-tokens.
Note that once files are downloaded once, |
a data.frame containing the name, filesize (in bytes), checksums (columns md5, crc32, or crc32c, though each product will use only one of these), url, and release status.
x <- neon_data("DP1.10003.001")
x <- neon_data("DP1.10003.001", release="RELEASE-2021")
Cache-able duckdb database connection
neon_db( dir = neon_db_dir(), read_only = TRUE, memory_limit = getOption("duckdb_memory_limit", NA), ... )
dir |
Location where files should be downloaded. By default will
use the appropriate applications directory for your system
(see |
read_only |
allow concurrent connections by enforcing read_only. See details. |
memory_limit |
Set a memory limit for duckdb, in GB. This can
also be set for the session by using options, e.g.
|
... |
additional arguments to dbConnect |
Creates a connection to a permanent duckdb database instance in the provided directory (see neon_dir()). This connection is also cached, so that code which repeatedly calls neon_db() will not stall or hang. Only read_only connections will be cached.
NOTE: duckdb::duckdb() can only support a single read-write connection at a time. The default option of read_only = TRUE allows multiple connections. neon_store() will automatically set this to FALSE to allow data import.
# tempfile used for illustration only
neon_db(tempfile())
Use neon_db_dir() to view or access the currently active database directory. By default, this uses the appropriate application directory for your operating system, see tools::R_user_dir(). This location can be overridden by setting the environmental variable NEONSTORE_DB.
neon_db_dir()
the active neonstore directory.
neon_db_dir()
## Override with an environmental variable:
Sys.setenv(NEONSTORE_DB = tempdir())
neon_db_dir()
## Unset
Sys.unsetenv("NEONSTORE_DB")
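The override behaviour described above can be pictured with a small base-R sketch. The helper resolve_db_dir() is hypothetical and is not the package's implementation. It only assumes that the environment variable, when set, takes precedence over a per-user directory from tools::R_user_dir() (available in R 4.0.0 and later).

```r
# Hypothetical helper, not part of neonstore: prefer the environment variable,
# otherwise fall back to a per-user application data directory.
resolve_db_dir <- function() {
  Sys.getenv("NEONSTORE_DB",
             unset = tools::R_user_dir("neonstore", which = "data"))
}

default_dir <- resolve_db_dir()

# Override for this session, then unset the variable again
override <- file.path(tempdir(), "neon-db")
Sys.setenv(NEONSTORE_DB = override)
overridden_dir <- resolve_db_dir()
Sys.unsetenv("NEONSTORE_DB")
```

Setting the variable in .Renviron rather than with Sys.setenv() would make the override persist across sessions.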
delete the local NEON database
neon_delete_db(db_dir = neon_db_dir(), ask = interactive())
db_dir |
neon database location (configurable with the NEONSTORE_DB environmental variable) |
ask |
Ask for confirmation first? |
Just a helper function that deletes the NEON database files, which are found under file.path(neon_dir(), "database"). This does not delete downloaded raw data, which can easily be re-loaded with neon_store(). Usually unnecessary but can be helpful in resetting a corrupt database.
If you want to delete all raw data files downloaded by neonstore as well, simply delete the entire directory given by neon_dir().
# Create a db
dir <- tempfile()
db <- neon_db(dir)
# Delete it
neon_delete_db(dir, ask = FALSE)
Use neon_dir() to view or access the currently active local store. By default, neon_download() downloads files into the neon_dir(), which uses an appropriate application directory for your operating system, see tools::R_user_dir(). This location can be overridden by setting the environmental variable NEONSTORE_HOME. neonstore functions (e.g. neon_index() and neon_read()) look for files in the neon_dir() directory by default. (All functions can also take a one-off dir argument in the function call instead of calling neon_dir() to access the default.)
neon_dir()
the active neonstore directory.
neon_dir()
## Override with an environmental variable:
Sys.setenv(NEONSTORE_HOME = tempdir())
neon_dir()
## Unset
Sys.unsetenv("NEONSTORE_HOME")
Disconnect from the neon database
neon_disconnect(db = neon_db())
db |
link to an existing database connection |
Download NEON data products into a local store
neon_download( product, table = NA, site = NA, start_date = NA, end_date = NA, type = "basic", release = NA, quiet = FALSE, verify = TRUE, unique = TRUE, dir = neon_dir(), get_zip = FALSE, unzip = FALSE, api = "https://data.neonscience.org/api/v0", .token = Sys.getenv("NEON_TOKEN") )
product |
A NEON |
table |
Include only files matching this table name (or regex pattern). (optional). |
site |
4-letter site code(s) to filter on. Leave as |
start_date |
Download only files as recent as ( |
end_date |
Download only files up to end_date ( |
type |
Should we prefer the basic or expanded version of this product? Note that not all products have expanded formats. |
release |
Select only data files associated with a particular release tag, see https://www.neonscience.org/data-samples/data-management/data-revisions-releases, e.g. "RELEASE-2021". Releases are associated with a specific DOI and the promise that files associated with a particular release will not change. |
quiet |
Should download progress be displayed? |
verify |
Should downloaded files be compared against the MD5 hash
reported by the NEON API to verify integrity? (default |
unique |
Should we skip downloads of files we already have? Note: file comparisons are based on file hash, which will omit files that have identical content but different names. |
dir |
Location where files should be downloaded. By default will
use the appropriate applications directory for your system
(see |
get_zip |
should we attempt to download .zip archive versions of files?
default |
unzip |
should we extract .zip files? (default |
api |
the URL to the NEON API, leave as default. |
.token |
an authentication token from NEON. A token is not
required but will allow access to a higher number of requests before
rate limiting applies, see
https://data.neonscience.org/data-api/rate-limiting/#api-tokens.
Note that once files are downloaded once, |
Each NEON data product consists of a collection of objects (e.g. tables), which are in turn broken into individual files by site and sampling month. Additionally, many NEON products have been expanded, including some additional columns. Consequently, users must specify if they want the "basic" or "expanded" version of this data.
In the products table (see neon_products), the productHasExpanded column indicates if the data product has an expanded version, and the columns productHasBasicDescription and productHasExpandedDescription provide a detailed explanation of the differences between the "expanded" and "basic" versions of that particular product.
The API allows users to request component files directly. By default, neon_download() will download all available extensions. Users can request only products of a certain format (e.g. .csv or .h5) by altering the file_regex argument (see examples). Prior to 2021, the API provided access to a .zip file containing all the component objects (e.g. tables) for that product at that site and sampling month.
neon_download() will avoid downloading metadata files which are bitwise identical to other files in the same download request, as indicated by the crc32 hash reported by the API. These typically include metadata that are shared across the product as a whole, but are for some reason included in each sampling month for each site – potentially thousands of duplicates. These duplicates are also packaged within the .zip downloads, where it is not possible to exclude them from the download.
## Omit dir=tempfile() to use persistent storage
neon_download("DP1.10003.001",
              start_date = "2018-01-01",
              end_date = "2019-01-01",
              site = "YELL",
              dir = tempfile())
## Advanced use: filter for a particular table in the product
neon_download(product = "DP1.10003.001",
              start_date = "2018-01-01",
              end_date = "2019-01-01",
              site = "YELL",
              table = "countdata",
              dir = tempfile())
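The hash-based skipping of duplicate metadata files described above can be illustrated on a toy listing. This is only a sketch of the idea, not the package's code. The file names, hashes, and URLs below are made up, and the columns simply mirror those documented for the neon_data() result.

```r
# Toy listing with made-up, simplified file names, hashes, and URLs.
files <- data.frame(
  name  = c("NEON.D12.YELL.variables.2018-06.csv",
            "NEON.D12.YELL.variables.2018-07.csv",
            "NEON.D12.YELL.brd_countdata.2018-06.csv"),
  crc32 = c("1a2b3c4d", "1a2b3c4d", "9f8e7d6c"),
  url   = paste0("https://example.org/file", 1:3),
  stringsAsFactors = FALSE
)

# Keep the first file for each distinct hash; later bitwise-identical
# copies are dropped from the download list.
to_download <- files[!duplicated(files$crc32), , drop = FALSE]
to_download$name
```

Here the two "variables" files share a hash, so only the first of them stays in the download list.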
Export all or select files from your neon store as a zip archive. This can be useful if you want to bypass accessing the API, such as for archiving the files required for your analysis so that they can be re-created by other users without an API key, or without waiting for the individual download, or any other time you want to share or distribute your local store.
neon_export( archive = paste(Sys.Date(), "neonstore.zip", sep = "-"), product = NA, table = NA, site = NA, start_date = NA, end_date = NA, type = NA, ext = NA, timestamp = NA, hash = NULL, dir = neon_dir() )
archive |
path to the zip archive to be created. |
product |
A NEON |
table |
Include only files matching this table name (or regex pattern). (optional). |
site |
4-letter site code(s) to filter on. Leave as |
start_date |
Download only files as recent as ( |
end_date |
Download only files up to end_date ( |
type |
Should we prefer the basic or expanded version of this product? Note that not all products have expanded formats. |
ext |
only match files with this file extension(s) |
timestamp |
only match timestamps prior to this. See details in |
hash |
name of a hashing algorithm to check file integrity. Can be
|
dir |
Location where files should be downloaded. By default will
use the appropriate applications directory for your system
(see |
table of selected files and metadata, from neon_index(), invisibly.
neon_import(), neon_citation()
archive <- tempfile()
dir <- tempdir()
neon_export(archive, dir = dir)
Export your current database. This can be important to (1) archive and share your database files with another user or machine, (2) expose your database using an S3 bucket using neon_remote_db(), (3) assist in upgrading your duckdb version.
neon_export_db(dir = file.path(neon_dir(), "parquet"), db = neon_db())
dir |
directory to which parquet export is written. |
db |
Connection to your local NEON database |
Parse filenames into their component metadata. See details for definition of each metadata field, or consult the NEON documentation linked below. https://data.neonscience.org/file-naming-conventions
neon_filename_parser(x)
x |
vector of NEON filenames |
NEON A four-character alphanumeric code, denoting the organizational origin of the data product and identifying the product as operational; data collected as part of a special data collection exercise are designated by a separate, unique alphanumeric code created by the PI.
DOM A three-character alphanumeric code, referring to the domain of data acquisition (D01 - D20).
SITE A four-character alphanumeric code, referring to the site of data acquisition; all sites are designated by a standardized four-character alphabetic code.
DPL A three-character alphanumeric code, referring to data product processing level.
PRNUM A five-character numeric code, referring to the data product number (see the Data Product Catalog at http://data.neonscience.org/data-product-catalog).
REV A three-digit designation, referring to the revision number of the data product. The REV value is incremented by 1 each time a major change is made in instrumentation, data collection protocol, or data processing such that data from the preceding revision is not directly comparable to the new.
HOR A three-character alphanumeric code for Spatial Index #1. Refers to measurement locations within one horizontal plane. For example, if five surface measurements were taken, one at each of the five soil array plots, the number in the HOR field would range from 001-005.
VER A three-character alphanumeric code for Spatial Index #2. Refers to measurement locations within one vertical plane. For example, if eight temperature measurements are collected, one at each tower vertical level, the number in the VER field would range from 010-080.
TMI A three-character alphanumeric code for the Temporal Index. Refers to the temporal representation, averaging period, or coverage of the data product (e.g., minute, hour, month, year, sub-hourly, day, lunar month, single instance, seasonal, annual, multi-annual). 000 = native resolution, 001 = native resolution or 1 minute, 002 = 2 minute, 005 = 5 minute, 015 = 15 minute, 030 = 30 minute, 060 = 60 minutes or 1 hour, 101-103 = native resolution of replicate sensor 1, 2, and 3 respectively, 999 = Sensor conducts measurements at varied interval depending on air mass.
DESC An abbreviated description of the data file or table.
YYYY-MM Represents the year and month of the data in the file.
PKGTYPE The type of data package downloaded. Options are 'basic', representing the basic download package, or 'expanded', representing the expanded download package (see more information below).
GENTIME The date-time stamp when the file was generated, in UTC. The format of the date-time stamp is YYYYMMDDTHHmmSSZ.
FLHTDATE Date of flight, YYYYMMDD
FLIGHTSTRT Start time of flight, YYYYMMDDHH
FLHTSTRT Start time of flight, YYMMDDHH
IMAGEDATETIME Date and time of image capture, YYYYMMDDHHmmSS
CCCCCC Digital camera serial number
NNNN Sequential number for indexing files
NNN Planned flightline number
R Repeat number
FFFFFF Numeric code for an individual flightline
EEEEEE UTM easting of lower left corner
NNNNNNN UTM northing of lower left corner
a data frame in which filenames have been split into metadata components. Column names indicate the metadata field code, see details section for complete descriptions.
https://data.neonscience.org/file-naming-conventions
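To make the field codes above concrete, here is a rough base-R sketch that splits two hypothetical file names on the "." separator. It assumes the observational-data pattern NEON.DOM.SITE.DPL.PRNUM.REV.DESC.YYYY-MM.PKGTYPE.GENTIME.csv, and the names and timestamps are invented. It is not how neon_filename_parser() is implemented, and it does not handle the sensor or airborne naming patterns.

```r
# Hypothetical file names following the assumed observational-data pattern
x <- c(
  "NEON.D01.HARV.DP1.10003.001.brd_countdata.2019-06.basic.20191205T150213Z.csv",
  "NEON.D12.YELL.DP1.10003.001.brd_countdata.2018-06.expanded.20191205T150213Z.csv"
)

# Field codes as listed in the details; "EXT" is a label added here
# for the file extension.
fields <- c("NEON", "DOM", "SITE", "DPL", "PRNUM", "REV",
            "DESC", "YYYY-MM", "PKGTYPE", "GENTIME", "EXT")

# Split on literal dots and check that every name has the expected field count
parts <- strsplit(x, ".", fixed = TRUE)
stopifnot(all(lengths(parts) == length(fields)))

# One row per file name, one column per metadata field
parsed <- do.call(rbind, parts)
colnames(parsed) <- fields
parsed[, c("SITE", "DESC", "YYYY-MM", "PKGTYPE")]
```

A simple split like this breaks as soon as a name has a different number of fields, so use neon_filename_parser() for real files.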
neon_import() only reads in previously saved archives from neon_export(). See neon_download() to download data directly from NEON.
neon_import(archive, overwrite = TRUE, dir = neon_dir())
archive |
path to the zip archive to be imported |
overwrite |
should we overwrite any existing files? |
dir |
Location where files should be downloaded. By default will
use the appropriate applications directory for your system
(see |
## tempfiles for example purposes only!
archive <- tempfile()
neondir <- tempdir()
neon_export(archive, dir = neondir)
neon_import(archive)
Import a NEON database exported from neon_export_db()
neon_import_db( dir = file.path(neon_dir(), "parquet"), db = neon_db(read_only = FALSE) )
dir |
directory to which parquet export is written. |
db |
Connection to your local NEON database |
NEON products consist of several individual components, which are in turn broken up by site and sampling month. By storing these individual files, neonstore enables more reproducible workflows that can be traced back to original, unaltered input data. These atomized files can be quickly and easily combined into unified tables, see neon_read.
neon_index( product = NA, table = NA, site = NA, start_date = NA, end_date = NA, type = NA, ext = NA, timestamp = NA, release = NA, hash = NULL, dir = neon_dir(), deprecated = TRUE )
product |
A NEON |
table |
Include only files matching this table name (or regex pattern). (optional). |
site |
4-letter site code(s) to filter on. Leave as |
start_date |
Download only files as recent as ( |
end_date |
Download only files up to end_date ( |
type |
Should we prefer the basic or expanded version of this product? Note that not all products have expanded formats. |
ext |
only match files with this file extension(s) |
timestamp |
only match timestamps prior to this. See details in |
release |
Select only data files associated with a particular release tag, see https://www.neonscience.org/data-samples/data-management/data-revisions-releases, e.g. "RELEASE-2021". Releases are associated with a specific DOI and the promise that files associated with a particular release will not change. |
hash |
name of a hashing algorithm to check file integrity. Can be
|
dir |
Location where files should be downloaded. By default will
use the appropriate applications directory for your system
(see |
deprecated |
Should the index include files that have since been deprecated by more recent downloads? logical, default TRUE. |
File names include metadata such as the file productCode, table name, site, and sampling month, as well as timestamp of creation. neon_index() parses this metadata from the file name string and returns the information in a convenient table, along with a path to each file.
Regarding timestamps: NEON will occasionally publish new versions of previously-released raw data files (which may or may not actually differ). The NEON download API, and hence neon_download(), only serve the most recent of such files, but earlier versions may still exist in your local neonstore if you downloaded them before the updated files were released. By default, neon_read() will always select the most recent of such files, thus avoiding duplication and providing the most updated data. For reproducibility, however, it may be necessary to access an older version instead. Setting the timestamp argument allows the user to filter out newer files and select the original ones instead. Unfortunately, at this time users cannot request the outdated data files from the NEON API. For strict reproducibility, users should also archive their local store.
neon_index()
## Just bird survey product
neon_index("DP1.10003.001")
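The timestamp logic described in the details can be sketched on a toy index. The data frame and the latest_files() helper below are hypothetical and only illustrate the selection rule: newest file per table, site, and month, optionally ignoring files generated after a cutoff. The real neon_index() output has more columns, and the package's own filtering may differ in detail.

```r
# Toy index: two versions of the 2018-06 file, one version of the 2019-06 file.
index <- data.frame(
  table     = c("brd_countdata", "brd_countdata", "brd_countdata"),
  site      = c("YELL", "YELL", "YELL"),
  month     = c("2018-06", "2018-06", "2019-06"),
  timestamp = as.POSIXct(c("2019-01-10 00:00:00", "2020-03-01 00:00:00",
                           "2020-03-01 00:00:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the newest file per table/site/month,
# optionally ignoring files generated after `cutoff`.
latest_files <- function(index, cutoff = NA) {
  if (!is.na(cutoff)) {
    index <- index[index$timestamp <= cutoff, , drop = FALSE]
  }
  index <- index[order(index$timestamp, decreasing = TRUE), , drop = FALSE]
  key <- paste(index$table, index$site, index$month)
  index[!duplicated(key), , drop = FALSE]
}

# Default behaviour: most recent version of each file
current <- latest_files(index)

# Reproducing an earlier analysis: only files that existed at the end of 2019
original <- latest_files(index, cutoff = as.POSIXct("2019-12-31", tz = "UTC"))
```

With the cutoff applied, only the first version of the 2018-06 file is selected, because the other two files were generated later.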
This function launches the RStudio "Connection" pane to interactively explore the database.
neon_pane()
if (!is.null(getOption("connectionObserver"))) neon_pane()
Return a table of all NEON Data Products (including list-columns), with product descriptions and the productCode needed for neon_download().
neon_products( api = "https://data.neonscience.org/api/v0", .token = Sys.getenv("NEON_TOKEN") )
api |
the URL to the NEON API, leave as default. |
.token |
an authentication token from NEON. A token is not
required but will allow access to a higher number of requests before
rate limiting applies, see
https://data.neonscience.org/data-api/rate-limiting/#api-tokens.
Note that once files are downloaded once, |
products <- neon_products()
# Or search for a keyword
i <- grepl("bird", products$keywords)
products[i, c("productCode", "productName")]
read in neon tabular data
neon_read( table = NA, product = NA, site = NA, start_date = NA, end_date = NA, ext = NA, timestamp = NA, release = NA, dir = neon_dir(), files = NULL, sensor_metadata = TRUE, keep_filename = FALSE, altrep = FALSE, ... )
table |
the name of a downloaded NEON table in the store, see neon_index |
product |
A NEON |
site |
4-letter site code(s) to filter on. Leave as |
start_date |
Download only files as recent as ( |
end_date |
Download only files up to end_date ( |
ext |
only match files with this file extension(s) |
timestamp |
only match timestamps prior to this. See details in |
release |
Select only data files associated with a particular release tag, see https://www.neonscience.org/data-samples/data-management/data-revisions-releases, e.g. "RELEASE-2021". Releases are associated with a specific DOI and the promise that files associated with a particular release will not change. |
dir |
Location where files should be downloaded. By default will
use the appropriate applications directory for your system
(see |
files |
optionally, specify a vector of file paths directly (e.g. as
provided from neon_index) and specify |
sensor_metadata |
logical, default TRUE. Should we add metadata fields from file names of sensor data into the table? Adds DomainID, SiteID, horizontalPosition, verticalPosition, and publicationDate. Results in slower parsing. |
keep_filename |
Should we include a column indicating the original
file name for each row? Can be a useful source of additional metadata that
NEON may omit from the raw files (i.e. |
altrep |
enable or disable altrep. Logical, default |
... |
additional arguments to vroom::vroom, can usually be omitted. |
NEON's tabular data files are separated out into separate .csv files for each site for each month of sampling. In principle, each file has identical columns. vroom::vroom can read in a data table that has been sharded into many files like this much faster than other parsers can read in each table iteratively (and thus can greatly outperform the "stacking" methods in neonUtilities).
When reading in very large numbers of files, it may be helpful to set altrep = FALSE to opt out of vroom's fast altrep mechanism, which can cause neon_read() to fail when stacking thousands of files.
Unfortunately, not all datasets are entirely consistent in their use of columns. neon_read works around this by parsing such tables in groups of matching schema, which is still reasonably fast.
NEON sensor data products currently do not include important metadata columns containing DomainID, SiteID, horizontalPosition, verticalPosition, and publicationDate in the data files themselves, but only encode this in the raw file names. Although these values are shared across a raw data file, this information is lost when stacking the tables unless explicit columns are added to the data. This requires us to parse the files one-by-one, which is much slower. By default this information is added to the table, altering the stacked table schema from that of the raw table. Disable this behavior by setting sensor_metadata = FALSE. Future NEON sensor data products may start including this information in the raw data files, as is already the case for observational data.
neon_read("brd_countdata-expanded")
## Sensor inputs will add metadata columns by default
neon_read("waq_instantaneous", site = c("CRAM","SUGG"))
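The idea of reading shards "in groups of matching schema" can be sketched with base R alone. This is an illustration under simplifying assumptions, not the package's implementation: it uses read.csv rather than vroom, treats the header line as the schema signature, and writes its own toy files with made-up values.

```r
# Write three small CSV shards; the third has an extra column.
shard_dir <- tempfile("shards")
dir.create(shard_dir)
write.csv(data.frame(siteID = "YELL", count = 1L),
          file.path(shard_dir, "a.csv"), row.names = FALSE)
write.csv(data.frame(siteID = "HARV", count = 2L),
          file.path(shard_dir, "b.csv"), row.names = FALSE)
write.csv(data.frame(siteID = "CRAM", count = 3L, remarks = "note"),
          file.path(shard_dir, "c.csv"), row.names = FALSE)
files <- list.files(shard_dir, pattern = "\\.csv$", full.names = TRUE)

# Use the header line of each file as its schema signature.
schema <- vapply(files, function(f) readLines(f, n = 1L), character(1))

# Stack the files within each schema group; one data frame per group.
stacked <- lapply(split(files, schema), function(group) {
  do.call(rbind, lapply(group, read.csv, stringsAsFactors = FALSE))
})
vapply(stacked, nrow, integer(1))
```

Each group can then be stacked in a single pass, and the differing groups are reconciled afterwards, for example by filling missing columns with NA.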
neon_remote
select a table from the remote connection
neon_remote(table = "", product = "", type = "", db = neon_remote_db())
table |
table name (pattern match regex) |
product |
product code |
type |
basic or expanded (if necessary to distinguish) |
db |
a neon_remote_db connection. If not provided, one will be created,
but it is faster to pass this on for re-use in multiple |
an arrow::FileSystemDataset object, or a named list of such objects if multiple matches are found. This table is not downloaded but remains on the remote storage location. It can be filtered with dplyr functions like filter and select, and can also be grouped and summarised, all without ever downloading the whole table. Use dplyr::collect() to download the (possibly filtered) table and pull it into memory.
arrow
Establish a remote database connection using arrow
neon_remote_db( bucket = arrow::s3_bucket("neon4cast-targets/neon", endpoint_override = "data.ecoforecast.org") )
bucket |
an |
db <- neon_remote_db()
Returns a table of all NEON sites by making a single API call to the /sites endpoint.
neon_sites( api = "https://data.neonscience.org/api/v0", .token = Sys.getenv("NEON_TOKEN") )
api |
the URL to the NEON API, leave as default. |
.token |
an authentication token from NEON. A token is not
required but will allow access to a higher number of requests before
rate limiting applies, see
https://data.neonscience.org/data-api/rate-limiting/#api-tokens.
Note that once files are downloaded once, |
import neon data into a local database
neon_store( table = NA, product = NA, type = NA, dir = neon_dir(), db = neon_db(neon_db_dir(), read_only = FALSE), n = 500L, quiet = FALSE, ... )
table |
Include only files matching this table name (or regex pattern). (optional). |
product |
A NEON |
type |
Should we prefer the basic or expanded version of this product? Note that not all products have expanded formats. |
dir |
Location where files should be downloaded. By default will
use the appropriate applications directory for your system
(see |
db |
A connection to a write-able relational database backend,
see |
n |
number of files that should be read per iteration |
quiet |
show progress? |
... |
Arguments passed on to
|
the index of files read in (invisibly)
sync local parquet export to an S3 database
neon_sync_db(s3, dir = file.path(neon_dir(), "parquet"))
s3 |
an |
dir |
directory to which parquet export is written. |
Remote files are named according to the table name (including product id), not according to the 'sanitized' file name duckdb uses when generating exports.
Return a neon table from the database
neon_table( table, product = NA, type = NA, site = NA, db = neon_db(), lazy = FALSE )
table |
the name of a downloaded NEON table in the store, see neon_index |
product |
A NEON |
type |
filter for basic or expanded. Can be omitted unless you have imported both types of a given table into your database. |
site |
4-letter site code(s) to filter on. Leave as |
db |
a connection to the database, see |
lazy |
logical, default FALSE. Should we return a remote dplyr
connection to the table in duckdb? This can substantially improve
performance and avoid out-of-memory errors when working with very large
tables. However, not all R operations can be performed on a remote table,
only (most) functions from |
We cannot filter on start_date or end_date, since these come only from the filename metadata and are added only to instrument tables, not to observation tables.
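To make the `lazy = TRUE` option more concrete, here is a minimal sketch of summarising a large table inside duckdb and pulling only the small result into R. The helper name `rows_per_site` is hypothetical, the `siteID` column is an assumption about the table being read, and the code assumes dplyr and dbplyr are installed.

```r
# Hypothetical helper (not part of neonstore): count rows per site without
# loading the full table into memory.
rows_per_site <- function(table, db = neonstore::neon_db()) {
  # lazy = TRUE returns a remote dplyr table backed by duckdb.
  remote <- neonstore::neon_table(table, db = db, lazy = TRUE)

  # Assumption: the table has a siteID column.
  # The count is computed by the database, not by R.
  counts <- dplyr::count(remote, siteID)

  # collect() brings only the summarised rows into an in-memory data frame.
  dplyr::collect(counts)
}

# Example call, left commented out because it needs a populated database:
# rows_per_site("brd_countdata")
```

Operations that have no database translation would need a `collect()` first, at which point the memory savings no longer apply.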
Show the file information for any raw data files which have been deprecated by the release of modified historical data to the NEON API.
show_deprecated_data( product = NA, table = NA, site = NA, start_date = NA, end_date = NA, type = NA, ext = NA, timestamp = NA, release = NA, dir = neon_dir() )
product |
A NEON |
table |
Include only files matching this table name (or regex pattern). Optional. |
site |
4-letter site code(s) to filter on. Leave as |
start_date |
Download only files as recent as ( |
end_date |
Download only files up to end_date ( |
type |
Should we prefer the basic or expanded version of this product? Note that not all products have expanded formats. |
ext |
only match files with the given file extension(s) |
timestamp |
only match timestamps prior to this. See details in |
release |
Select only data files associated with a particular release tag, see https://www.neonscience.org/data-samples/data-management/data-revisions-releases, e.g. "RELEASE-2021". Releases are associated with a specific DOI and the promise that files associated with a particular release will not change. |
dir |
Location where files should be downloaded. By default will
use the appropriate applications directory for your system
(see |
NEON data files are sometimes updated to correct errors. Old files are no longer accessible through the API, but may be present in your local store from an earlier download. The neonstore stacking functions (neon_read() and neon_store()) automatically exclude these deprecated files, though neon_read() can be instructed to use older files by passing a file list.
A data file is identified as deprecated whenever the local file store contains a second data file with the same product, table, site, month, and position (sensor products only) information, but having an updated timestamp. If such a change occurs in a file with a non-missing "month" code, it may indicate a data file has been updated. This could result in changes to the results of any previous analyses.
Note that metadata files (readme, variables, positions) are 'pre-stacked': the metadata file in a given product-site-month set contains metadata going back to the start and not just for that month. As a result, each new version deprecates the old metadata file, but the old files are always available from the NEON API and always present in the store. Users only need to care about the most recent ones, and the presence of old files is no cause for concern. This function shows only data files that have changed, not metadata files, which can help pinpoint specific altered data.
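The rule described above (same product, table, site and month, but an older timestamp) can be sketched in base R. This is only an illustration of the idea, not neonstore's implementation: the data frame, its column names, and the helper `flag_deprecated` are all hypothetical, and sensor products would also need the position in the grouping key.

```r
# Hypothetical file index; column names are illustrative, not neonstore's schema.
files <- data.frame(
  path      = c("a_v1.csv", "a_v2.csv", "b_v1.csv", "c_v1.csv"),
  product   = "DP1.10003.001",
  table     = "brd_countdata",
  site      = c("HARV", "HARV", "HARV", "BART"),
  month     = c("2019-06", "2019-06", "2019-07", "2019-06"),
  timestamp = as.POSIXct(
    c("2020-01-01", "2021-01-01", "2020-01-01", "2020-01-01"),
    tz = "UTC"
  ),
  stringsAsFactors = FALSE
)

# Flag a file as deprecated when another file with the same
# product/table/site/month key has a newer timestamp.
flag_deprecated <- function(index) {
  key <- with(index, paste(product, table, site, month, sep = "|"))
  stamp <- as.numeric(index$timestamp)
  newest <- ave(stamp, key, FUN = max)  # newest timestamp within each key
  index$deprecated <- stamp < newest
  index
}

flagged <- flag_deprecated(files)
flagged$path[flagged$deprecated]
# → "a_v1.csv"
```

Only `a_v1.csv` is flagged, because `a_v2.csv` shares its key and carries a later timestamp; the other two files are the only ones in their groups.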
neon_index, neon_read
show_deprecated_data()
Standardize export names
standardize_export_names(dir = file.path(neon_dir(), "parquet"))
dir |
directory to which parquet export is written. |
DuckDB mangles the file names it writes on export to avoid potentially incompatible characters. This is pretty unnecessary, so we can restore the original table names for use with S3-based remote access, which assumes parquet files map to the desired table names (i.e. including product numbers).
However, note that neon_import_db() uses native duckdb functions that assume the original mangled names.