Title: | High Performance Remote File System, Database and 'Geospatial' Access Using 'duckdb' |
---|---|
Description: | Provides friendly wrappers for creating 'duckdb'-backed connections to tabular datasets ('csv', parquet, etc) on local or remote file systems. This mimics the behaviour of "open_dataset" in the 'arrow' package, but in addition to 'S3' file system also generalizes to any list of 'http' URLs. |
Authors: | Carl Boettiger [aut, cre] , Michael D. Sumner [ctb] |
Maintainer: | Carl Boettiger <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.0.7 |
Built: | 2024-10-28 05:51:48 UTC |
Source: | https://github.com/cboettig/duckdbfs |
Push a local (in-memory) dataset into the duckdb database as a table. This enables it to share the connection source with other data. This is equivalent to the behavior of copy=TRUE on many (but not all) of the two-table verbs in dplyr.
as_dataset(df, conn = cached_connection())
df |
a local data frame. Other (already remote) objects are passed back unchanged, without side effects. |
conn |
A connection to a database. |
a remote dplyr::tbl
connection to the table.
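A minimal usage sketch (the data frame and column names below are illustrative, not part of the package):

library(dplyr)
local_df <- data.frame(id = 1:3, label = c("a", "b", "c"))
# copy the in-memory data frame into the cached duckdb connection
remote_tbl <- as_dataset(local_df)
remote_tbl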
Create a View of the current query. This can be an effective way to allow a query chain to remain lazy.
as_view(x, tblname = tmp_tbl_name(), conn = cached_connection())
x |
a duckdb spatial dataset |
tblname |
The name of the table to create in the database. |
conn |
A connection to a database. |
path <- system.file("extdata/spatial-test.csv", package = "duckdbfs")
df <- open_dataset(path)
library(dplyr)
df |> filter(latitude > 5) |> as_view()
This function is primarily intended for internal use by other
duckdbfs
functions. However, it can be called directly by
the user whenever it is desirable to have direct access to the
connection object.
cached_connection(
  dbdir = ":memory:",
  read_only = FALSE,
  bigint = "numeric",
  config = list(temp_directory = tempfile())
)
dbdir |
Location for database files. Should be a path to an existing
directory in the file system. With the default (":memory:"), all data is kept in RAM. |
read_only |
Set to TRUE for read-only operation (default FALSE). |
bigint |
How 64-bit integers should be returned. There are two options: "numeric" (the default here) and "integer64", which requires the bit64 package. |
config |
Named list with DuckDB configuration flags, see https://duckdb.org/docs/configuration/overview#configuration-reference for the possible options. These flags are only applied when the database object is instantiated. Subsequent connections will silently ignore these flags. |
When first called (by a user or internal function),
this function both creates a duckdb connection and places
that connection into a cache (duckdbfs_conn
option).
On subsequent calls, this function returns the cached connection,
rather than recreating a fresh connection.
This frees the user from the responsibility of managing a connection object, because functions needing access to the connection can use this to create or access the existing connection. At the close of the global environment, this function's finalizer should gracefully shut down the connection before removing the cache.
By default, this function creates an in-memory connection. When reading from on-disk or remote files (parquet or csv), this option can still effectively support most operations on much-larger-than-RAM data. However, some operations require additional working space, so by default we set a temporary storage location in configuration as well.
a duckdb::duckdb()
connection object
con <- cached_connection()
close_connection(con)
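As a further sketch, the same caching mechanism can back an on-disk database by supplying a dbdir path (the file name below is illustrative); tables written to such a database persist between sessions:

# create (or reuse) a cached connection backed by an on-disk database file
con <- cached_connection(dbdir = file.path(tempdir(), "local.duckdb"))
close_connection(con)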
close connection
close_connection(conn = cached_connection())
conn |
a duckdb connection (leave blank) |
Closes the invisible cached connection to duckdb. Shuts down the connection before garbage collection removes it, then clears the cached reference to avoid reusing a stale connection. This also avoids complaints about the connection being garbage collected.
returns nothing.
close_connection()
This function is used to configure S3 settings for a database connection. It allows you to set various S3-related parameters such as access key, secret access key, endpoint, region, session token, uploader settings, URL compatibility mode, URL style, and SSL usage.
duckdb_s3_config(
  conn = cached_connection(),
  s3_access_key_id = NULL,
  s3_secret_access_key = NULL,
  s3_endpoint = NULL,
  s3_region = NULL,
  s3_session_token = NULL,
  s3_uploader_max_filesize = NULL,
  s3_uploader_max_parts_per_file = NULL,
  s3_uploader_thread_limit = NULL,
  s3_url_compatibility_mode = NULL,
  s3_url_style = NULL,
  s3_use_ssl = NULL,
  anonymous = NULL
)
conn |
A database connection object, created by default using cached_connection(). |
s3_access_key_id |
The S3 access key ID (default: NULL). |
s3_secret_access_key |
The S3 secret access key (default: NULL). |
s3_endpoint |
The S3 endpoint (default: NULL). |
s3_region |
The S3 region (default: NULL). |
s3_session_token |
The S3 session token (default: NULL). |
s3_uploader_max_filesize |
The maximum filesize for S3 uploader (between 50GB and 5TB, default 800GB). |
s3_uploader_max_parts_per_file |
The maximum number of parts per file for S3 uploader (between 1 and 10000, default 10000). |
s3_uploader_thread_limit |
The thread limit for S3 uploader (default: 50). |
s3_url_compatibility_mode |
Disable Globs and Query Parameters on S3 URLs (default: 0, allows globs/queries). |
s3_url_style |
The style of S3 URLs to use. Default is "vhost" unless s3_endpoint is set, in which case the default becomes "path" (e.g. for MinIO systems). |
s3_use_ssl |
Enable or disable SSL for S3 connections (default: 1 (TRUE)). |
anonymous |
request anonymous access (sets credentials appropriately so that public buckets can be read without authentication; default: NULL). |
see https://duckdb.org/docs/sql/configuration.html
Returns silently (NULL) if successful.
# Configure S3 settings
duckdb_s3_config(
  s3_access_key_id = "YOUR_ACCESS_KEY_ID",
  s3_secret_access_key = "YOUR_SECRET_ACCESS_KEY",
  s3_endpoint = "YOUR_S3_ENDPOINT",
  s3_region = "YOUR_S3_REGION",
  s3_uploader_max_filesize = "800GB",
  s3_uploader_max_parts_per_file = 100,
  s3_uploader_thread_limit = 8,
  s3_url_compatibility_mode = FALSE,
  s3_url_style = "vhost",
  s3_use_ssl = TRUE,
  anonymous = TRUE)
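For public buckets, a shorter sketch; the endpoint is borrowed from the open_dataset() example later in this reference and is assumed, not guaranteed, to remain publicly readable:

# Anonymous access to a public, independently-hosted (MinIO) endpoint
duckdb_s3_config(
  s3_endpoint = "data.ecoforecast.org",
  anonymous = TRUE)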
load the duckdb spatial data extension
load_spatial(
  conn = cached_connection(),
  nightly = getOption("duckdbfs_use_nightly", FALSE)
)
conn |
A database connection object, created by default using cached_connection(). |
nightly |
should we use the nightly version of the extension or not?
Default FALSE; configurable via the duckdbfs_use_nightly option. |
loads the extension and returns status invisibly.
https://duckdb.org/docs/extensions/spatial.html
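A minimal usage sketch, loading the extension on the cached connection before working with spatial data:

con <- cached_connection()
load_spatial(con)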
This function opens a dataset from a variety of sources, including Parquet, CSV, etc, using either local file system paths, URLs, or S3 bucket URI notation.
open_dataset(
  sources,
  schema = NULL,
  hive_style = TRUE,
  unify_schemas = FALSE,
  format = c("parquet", "csv", "tsv", "sf"),
  conn = cached_connection(),
  tblname = tmp_tbl_name(),
  mode = "VIEW",
  filename = FALSE,
  recursive = TRUE,
  ...
)
sources |
A character vector of paths to the dataset files. |
schema |
The schema for the dataset. If NULL, the schema will be inferred from the dataset files. |
hive_style |
A logical value indicating whether the dataset uses Hive-style partitioning. |
unify_schemas |
A logical value indicating whether to unify the schemas of the dataset files (union_by_name). If TRUE, will execute a UNION by column name across all files (NOTE: this can add considerably to the initial execution time) |
format |
The format of the dataset files. One of "parquet", "csv", "tsv", or "sf" (spatial vector formats, read via the spatial extension). |
conn |
A connection to a database. |
tblname |
The name of the table to create in the database. |
mode |
The mode to create the table in. One of "VIEW" or "TABLE". |
filename |
A logical value indicating whether to include the filename in the table name. |
recursive |
should the path be searched recursively (as for a partitioned dataset)? Default TRUE. Set to FALSE if trying to open a single, un-partitioned file. |
... |
optional additional arguments passed to duckdb_s3_config(), e.g. S3 credentials (see the examples below). |
A lazy dplyr::tbl
object representing the opened dataset backed
by a duckdb SQL connection. Most dplyr
(and some tidyr
) verbs can be
used directly on this object, as they can be translated into SQL commands
automatically via dbplyr
. Generic R commands require using
dplyr::collect()
on the table, which forces evaluation and reads the resulting data into memory.
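A sketch of this lazy workflow, using the small CSV file bundled with the package (the latitude column also appears in the other examples in this reference):

library(dplyr)
path <- system.file("extdata/spatial-test.csv", package = "duckdbfs")
df <- open_dataset(path, format = "csv")  # lazy, duckdb-backed tbl
df |>
  filter(latitude > 5) |>  # translated to SQL, not yet executed
  collect()                # forces evaluation, returns data in memory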
# A remote, hive-partitioned Parquet dataset
base <- paste0("https://github.com/duckdb/duckdb/raw/main/",
               "data/parquet-testing/hive-partitioning/union_by_name/")
f1 <- paste0(base, "x=1/f1.parquet")
f2 <- paste0(base, "x=1/f2.parquet")
f3 <- paste0(base, "x=2/f2.parquet")
open_dataset(c(f1, f2, f3), unify_schemas = TRUE)

# Access an S3 database specifying an independently-hosted (MINIO) endpoint
efi <- open_dataset("s3://neon4cast-scores/parquet/aquatics",
                    s3_access_key_id = "",
                    s3_endpoint = "data.ecoforecast.org")
spatial_join
spatial_join(
  x,
  y,
  by = c("st_intersects", "st_within", "st_dwithin", "st_touches",
         "st_contains", "st_containsproperly", "st_covers", "st_overlaps",
         "st_crosses", "st_equals", "st_disjoint"),
  args = "",
  join = "left",
  tblname = tmp_tbl_name(),
  conn = cached_connection()
)
x |
a duckdb table with a spatial geometry column called "geom" |
y |
a duckdb table with a spatial geometry column called "geom" |
by |
A spatial join function, see details. |
args |
additional arguments to join function (e.g. distance for st_dwithin) |
join |
JOIN type (left, right, inner, full) |
tblname |
name for the temporary view |
conn |
the duckdb connection (imputed by duckdbfs by default, must be shared across both tables) |
Possible spatial joins include:
Function | Description |
st_intersects | Geometry A intersects with geometry B |
st_disjoint | The complement of intersects |
st_within | Geometry A is within geometry B (complement of contains) |
st_dwithin | Geometries are within a specified distance, expressed in the same units as the coordinate reference system. |
st_touches | Two polygons touch if they have at least one point in common, even if their interiors do not intersect. |
st_contains | Geometry A entirely contains geometry B (complement of within). |
st_containsproperly | stricter version of st_contains (boundary counts as external) |
st_covers | geometry B is inside or on boundary of A. (A polygon covers a point on its boundary but does not contain it.) |
st_overlaps | geometry A intersects but does not completely contain geometry B |
st_equals | geometry A is equal to geometry B |
st_crosses | Lines or points in geometry A cross geometry B. |
Although SQL is not case sensitive, this function expects only lower-case names for the "by" functions.
a (lazy) view of the resulting table. Users can continue to operate on it using dplyr operations and call to_sf() to collect the result as an sf object.
# note we can read in remote data in a variety of vector formats:
countries <- paste0("/vsicurl/",
                    "https://github.com/cboettig/duckdbfs/",
                    "raw/spatial-read/inst/extdata/world.gpkg") |>
  open_dataset(format = "sf")

cities <- paste0("/vsicurl/https://github.com/cboettig/duckdbfs/raw/",
                 "spatial-read/inst/extdata/metro.fgb") |>
  open_dataset(format = "sf")

countries |>
  dplyr::filter(iso_a3 == "AUS") |>
  spatial_join(cities)
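Building on the example above, a sketch of a distance-based join; the distance value is illustrative, and its units follow the coordinate reference system of the data (degrees for unprojected lon/lat):

# cities within roughly one degree of the Australian boundary
countries |>
  dplyr::filter(iso_a3 == "AUS") |>
  spatial_join(cities, by = "st_dwithin", args = "1")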
At this time, reads a subset of spatial metadata.
This is similar to what is reported by ogrinfo -json.
st_read_meta(
  path,
  layer = 1L,
  tblname = tbl_name(path),
  conn = cached_connection(),
  ...
)
path |
URL or path to spatial data file |
layer |
layer number to read metadata for, defaults to first layer. |
tblname |
metadata will be stored as a view with this name, by default this is based on the name of the file. |
conn |
A connection to a database. |
... |
optional additional arguments passed to |
A lazy dplyr::tbl
object containing core spatial metadata such
as projection information.
st_read_meta("https://github.com/duckdb/duckdb_spatial/raw/main/test/data/amsterdam_roads.fgb")
Convert output to sf object
to_sf(x, crs = NA, conn = cached_connection())
x |
a remote duckdb tbl, e.g. as created by open_dataset(). |
crs |
The coordinate reference system, in any format understood by sf::st_crs(). |
conn |
the connection object from the tbl. |
Takes a duckdb table (e.g. from open_dataset() or built up with dplyr verbs) and collects it into R as an sf object. Note that this forces evaluation of the query and reads the result into memory.
an sf
class object (in memory).
library(dplyr)
csv_file <- system.file("extdata/spatial-test.csv", package = "duckdbfs")

# Note that we almost always must first create a `geometry` column, e.g.
# from lat/long columns using the `st_point` method.
sf <- open_dataset(csv_file, format = "csv") |>
  mutate(geom = ST_Point(longitude, latitude)) |>
  to_sf()

# We can use the full space of spatial operations, including spatial
# and normal dplyr filters. All operations are translated into a
# spatial SQL query by `to_sf`:
open_dataset(csv_file, format = "csv") |>
  mutate(geom = ST_Point(longitude, latitude)) |>
  mutate(dist = ST_Distance(geom, ST_Point(0, 0))) |>
  filter(site %in% c("a", "b", "e")) |>
  to_sf()
write_dataset
write_dataset(
  dataset,
  path,
  conn = cached_connection(),
  format = c("parquet", "csv"),
  partitioning = dplyr::group_vars(dataset),
  overwrite = TRUE,
  ...
)
dataset |
a remote tbl object from open_dataset(), or an in-memory data.frame. |
path |
a local file path or S3 path with write credentials |
conn |
duckdbfs database connection |
format |
export format |
partitioning |
names of columns to use as partition variables |
overwrite |
allow overwriting of existing files? |
... |
additional arguments to |
Returns the path, invisibly.
write_dataset(mtcars, tempfile())
write_dataset(mtcars, tempdir())
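A further sketch of a hive-partitioned write; the choice of cyl as the partition column is illustrative:

# write mtcars as a hive-partitioned parquet dataset, split by cyl
write_dataset(mtcars, tempdir(), partitioning = "cyl")

# equivalently, set the partition columns by grouping first,
# since partitioning defaults to dplyr::group_vars(dataset)
library(dplyr)
mtcars |> group_by(cyl) |> write_dataset(tempdir())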