| Title: | High Performance Remote File System, Database and 'Geospatial' Access Using 'duckdb' |
|---|---|
| Description: | Provides friendly wrappers for creating 'duckdb'-backed connections to tabular datasets ('csv', parquet, etc) on local or remote file systems. This mimics the behaviour of "open_dataset" in the 'arrow' package, but in addition to 'S3' file system also generalizes to any list of 'http' URLs. |
| Authors: | Carl Boettiger [aut, cre] (ORCID: <https://orcid.org/0000-0002-1642-628X>), Michael D. Sumner [ctb] (ORCID: <https://orcid.org/0000-0002-2471-7511>) |
| Maintainer: | Carl Boettiger <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.1.2.99 |
| Built: | 2026-05-21 10:08:00 UTC |
| Source: | https://github.com/cboettig/duckdbfs |
Push a local (in-memory) dataset into a the duckdb database as a table. This enables it to share the connection source with other data. This is equivalent to the behavior of copy=TRUE on many (but not all) of the two-table verbs in dplyr.
as_dataset(df, conn = cached_connection())as_dataset(df, conn = cached_connection())
df |
a local data frame. Otherwise will be passed back without side effects |
conn |
A connection to a database. |
a remote dplyr::tbl connection to the table.
Create a View of the current query. This can be an effective way to allow a query chain to remain lazy
as_view(x, tblname = tmp_tbl_name(), conn = cached_connection())as_view(x, tblname = tmp_tbl_name(), conn = cached_connection())
x |
a duckdb spatial dataset |
tblname |
The name of the table to create in the database. |
conn |
A connection to a database. |
path <- system.file("extdata/spatial-test.csv", package="duckdbfs") df <- open_dataset(path) library(dplyr) df |> filter(latitude > 5) |> as_view()path <- system.file("extdata/spatial-test.csv", package="duckdbfs") df <- open_dataset(path) library(dplyr) df |> filter(latitude > 5) |> as_view()
This function is primarily intended for internal use by other
duckdbfs functions. However, it can be called directly by
the user whenever it is desirable to have direct access to the
connection object.
cached_connection( dbdir = ":memory:", read_only = FALSE, bigint = "numeric", config = list(temp_directory = tempfile()), autoload_exts = getOption("duckdbfs_autoload_extensions", TRUE), with_spatial = not_windows() && getOption("duckdbfs_autoload_extensions", TRUE), with_h3 = not_windows() && getOption("duckdbfs_autoload_extensions", TRUE) )cached_connection( dbdir = ":memory:", read_only = FALSE, bigint = "numeric", config = list(temp_directory = tempfile()), autoload_exts = getOption("duckdbfs_autoload_extensions", TRUE), with_spatial = not_windows() && getOption("duckdbfs_autoload_extensions", TRUE), with_h3 = not_windows() && getOption("duckdbfs_autoload_extensions", TRUE) )
dbdir |
Location for database files. Should be a path to an existing
directory in the file system. With the default (or |
read_only |
Set to |
bigint |
How 64-bit integers should be returned. There are two options: |
config |
Named list with DuckDB configuration flags, see https://duckdb.org/docs/configuration/overview#configuration-reference for the possible options. These flags are only applied when the database object is instantiated. Subsequent connections will silently ignore these flags. |
autoload_exts |
should we auto-load extensions? TRUE by default,
can be configured with |
with_spatial |
install (if missing) and load spatial extension, default TRUE
Opt out by closing any active cached connection first (with
|
with_h3 |
install (if missing) and load the h3 spatial index extension. Default TRUE |
When first called (by a user or internal function),
this function both creates a duckdb connection and places
that connection into a cache (duckdbfs_conn option).
On subsequent calls, this function returns the cached connection,
rather than recreating a fresh connection.
This frees the user from the responsibility of managing a connection object, because functions needing access to the connection can use this to create or access the existing connection. At the close of the global environment, this function's finalizer should gracefully shutdown the connection before removing the cache.
By default, this function creates an in-memory connection. When reading from on-disk or remote files (parquet or csv), this option can still effectively support most operations on much-larger-than-RAM data. However, some operations require additional working space, so by default we set a temporary storage location in configuration as well.
a duckdb::duckdb() connection object
con <- cached_connection() close_connection(con)con <- cached_connection() close_connection(con)
close connection
close_connection(conn = cached_connection())close_connection(conn = cached_connection())
conn |
a duckdb connection (leave blank) Closes the invisible cached connection to duckdb |
Shuts down connection before gc removes it. Then clear cached reference to avoid using a stale connection This avoids complaint about connection being garbage collected.
returns nothing.
close_connection()close_connection()
duckdb configuration
duckdb_config(..., conn = cached_connection())duckdb_config(..., conn = cached_connection())
... |
named argument of the parameters to set, see examples see all possible configuration options at https://duckdb.org/docs/sql/configuration.html |
conn |
A connection to a database. |
Accepts any duckdb setting as a named argument. Commonly useful:
threads, memory_limit, temp_directory, max_temp_directory_size —
resource tuning for large local jobs (see second example).
http_retries, http_retry_wait_ms — back off and retry when a remote
server returns HTTP 429 (rate-limited) or other transient errors while
streaming parquet / CSV over HTTP(S).
Note: in I/O bound tasks such as streaming data, it can be helpful to set thread parallelism significantly higher than available CPU cores.
the active duckdb connection, invisibly
duckdb_reset, duckdb_get_config
duckdb_config(threads = 1, memory_limit = '10GB') duckdb_get_config("threads") duckdb_reset("threads") # Common settings for larger local jobs duckdb_config( threads = 64, memory_limit = "30GB", temp_directory = "/tmp/duckdb_swap", max_temp_directory_size = "100GB" ) # Retry HTTP requests (e.g. a server returning HTTP 429 on parquet reads) duckdb_config(http_retries = 5, http_retry_wait_ms = 2000)duckdb_config(threads = 1, memory_limit = '10GB') duckdb_get_config("threads") duckdb_reset("threads") # Common settings for larger local jobs duckdb_config( threads = 64, memory_limit = "30GB", temp_directory = "/tmp/duckdb_swap", max_temp_directory_size = "100GB" ) # Retry HTTP requests (e.g. a server returning HTTP 429 on parquet reads) duckdb_config(http_retries = 5, http_retry_wait_ms = 2000)
show duckdb extensions
duckdb_extensions(conn = cached_connection())duckdb_extensions(conn = cached_connection())
conn |
A connection to a database. |
a data frame listing all available extensions, with boolean columns indicating which extensions are installed or loaded, and a description of each extension.
duckdb_extensions()duckdb_extensions()
duckdb reset configuration to default
duckdb_get_config(x = NULL, conn = cached_connection())duckdb_get_config(x = NULL, conn = cached_connection())
x |
parameter name. Omit to see a table of all settings. |
conn |
A connection to a database. |
duckdb_config, duckdb_get_config
# Full config table duckdb_get_config() # look up single config value duckdb_get_config("threads") # set a different value, test, reset. duckdb_config(threads = 10) duckdb_get_config("threads") duckdb_reset("threads")# Full config table duckdb_get_config() # look up single config value duckdb_get_config("threads") # set a different value, test, reset. duckdb_config(threads = 10) duckdb_get_config("threads") duckdb_reset("threads")
duckdb reset configuration to default
duckdb_reset(x, conn = cached_connection())duckdb_reset(x, conn = cached_connection())
x |
parameter name |
conn |
A connection to a database. |
duckdb_config, duckdb_get_config
duckdb_config(threads = 10) duckdb_get_config("threads") duckdb_reset("threads")duckdb_config(threads = 10) duckdb_get_config("threads") duckdb_reset("threads")
This function is used to configure S3 settings for a database connection. It allows you to set various S3-related parameters such as access key, secret access key, endpoint, region, session token, uploader settings, URL compatibility mode, URL style, and SSL usage.
duckdb_s3_config( conn = cached_connection(), s3_access_key_id = NULL, s3_secret_access_key = NULL, s3_endpoint = NULL, s3_region = NULL, s3_session_token = NULL, s3_uploader_max_filesize = NULL, s3_uploader_max_parts_per_file = NULL, s3_uploader_thread_limit = NULL, s3_url_compatibility_mode = NULL, s3_url_style = NULL, s3_use_ssl = NULL, anonymous = NULL )duckdb_s3_config( conn = cached_connection(), s3_access_key_id = NULL, s3_secret_access_key = NULL, s3_endpoint = NULL, s3_region = NULL, s3_session_token = NULL, s3_uploader_max_filesize = NULL, s3_uploader_max_parts_per_file = NULL, s3_uploader_thread_limit = NULL, s3_url_compatibility_mode = NULL, s3_url_style = NULL, s3_use_ssl = NULL, anonymous = NULL )
conn |
A database connection object created using the
|
s3_access_key_id |
The S3 access key ID (default: |
s3_secret_access_key |
The S3 secret access key (default: |
s3_endpoint |
The S3 endpoint (default: |
s3_region |
The S3 region (default: |
s3_session_token |
The S3 session token (default: |
s3_uploader_max_filesize |
The maximum filesize for S3 uploader (between 50GB and 5TB, default 800GB). |
s3_uploader_max_parts_per_file |
The maximum number of parts per file for S3 uploader (between 1 and 10000, default 10000). |
s3_uploader_thread_limit |
The thread limit for S3 uploader (default: 50). |
s3_url_compatibility_mode |
Disable Globs and Query Parameters on S3 URLs (default: 0, allows globs/queries). |
s3_url_style |
The style of S3 URLs to use. Default is "vhost" unless s3_endpoint is set, which makes default "path" (i.e. MINIO systems). |
s3_use_ssl |
Enable or disable SSL for S3 connections (default: 1 (TRUE)). |
anonymous |
request anonymous access (sets |
see https://duckdb.org/docs/sql/configuration.html
Returns silently (NULL) if successful.
# Configure S3 settings duckdb_s3_config( s3_access_key_id = "YOUR_ACCESS_KEY_ID", s3_secret_access_key = "YOUR_SECRET_ACCESS_KEY", s3_endpoint = "YOUR_S3_ENDPOINT", s3_region = "YOUR_S3_REGION", s3_uploader_max_filesize = "800GB", s3_uploader_max_parts_per_file = 100, s3_uploader_thread_limit = 8, s3_url_compatibility_mode = FALSE, s3_url_style = "vhost", s3_use_ssl = TRUE, anonymous = TRUE)# Configure S3 settings duckdb_s3_config( s3_access_key_id = "YOUR_ACCESS_KEY_ID", s3_secret_access_key = "YOUR_SECRET_ACCESS_KEY", s3_endpoint = "YOUR_S3_ENDPOINT", s3_region = "YOUR_S3_REGION", s3_uploader_max_filesize = "800GB", s3_uploader_max_parts_per_file = 100, s3_uploader_thread_limit = 8, s3_url_compatibility_mode = FALSE, s3_url_style = "vhost", s3_use_ssl = TRUE, anonymous = TRUE)
Configure the duckdb secrets for remote access.
duckdb_secrets( key = Sys.getenv("AWS_ACCESS_KEY_ID", ""), secret = Sys.getenv("AWS_SECRET_ACCESS_KEY", ""), endpoint = Sys.getenv("AWS_S3_ENDPOINT", "s3.amazonaws.com"), region = Sys.getenv("AWS_REGION", "us-east-1"), bucket = NULL, url_style = NULL, use_ssl = Sys.getenv("AWS_HTTPS", "TRUE"), url_compatibility_mode = TRUE, session_token = Sys.getenv("AWS_SESSION_TOKEN", ""), type = "S3", conn = cached_connection() )duckdb_secrets( key = Sys.getenv("AWS_ACCESS_KEY_ID", ""), secret = Sys.getenv("AWS_SECRET_ACCESS_KEY", ""), endpoint = Sys.getenv("AWS_S3_ENDPOINT", "s3.amazonaws.com"), region = Sys.getenv("AWS_REGION", "us-east-1"), bucket = NULL, url_style = NULL, use_ssl = Sys.getenv("AWS_HTTPS", "TRUE"), url_compatibility_mode = TRUE, session_token = Sys.getenv("AWS_SESSION_TOKEN", ""), type = "S3", conn = cached_connection() )
key |
key |
secret |
secret |
endpoint |
endpoint address |
region |
AWS region (ignored by some other S3 providers) |
bucket |
restricts the "SCOPE" of this key to only objects in this bucket-name. note that the bucket name is currently insensitive to endpoint |
url_style |
path or vhost, for S3 |
use_ssl |
Use SSL address (https instead of http), default TRUE |
url_compatibility_mode |
optional mode for increased compatibility with some endpoints |
session_token |
AWS session token, used in some AWS authentication with short-lived tokens |
type |
Key type, e.g. S3. See duckdb docs for details. references https://duckdb.org/docs/configuration/secrets_manager.html |
conn |
A connection to a database. |
load the duckdb geospatial data plugin
load_h3( conn = cached_connection(), repo = "http://community-extensions.duckdb.org" )load_h3( conn = cached_connection(), repo = "http://community-extensions.duckdb.org" )
conn |
A database connection object created using the
|
repo |
repository path for community extensions |
loads the extension and returns status invisibly.
https://github.com/isaacbrodsky/h3-duckdb
library(dplyr) load_h3() ex <- system.file("extdata/spatial-test.csv", package="duckdbfs") zoom <- 9L # Zoom must be explicit integer, L query <- ex |> open_dataset(format = "csv") |> mutate(h3id = h3_latlng_to_cell_string(latitude, longitude, zoom)) # as data.frame collect(query) # write to a file path <- tempfile(fileext = ".h3j") query |> to_h3j(path)library(dplyr) load_h3() ex <- system.file("extdata/spatial-test.csv", package="duckdbfs") zoom <- 9L # Zoom must be explicit integer, L query <- ex |> open_dataset(format = "csv") |> mutate(h3id = h3_latlng_to_cell_string(latitude, longitude, zoom)) # as data.frame collect(query) # write to a file path <- tempfile(fileext = ".h3j") query |> to_h3j(path)
load the duckdb geospatial data plugin
load_spatial( conn = cached_connection(), nightly = getOption("duckdbfs_use_nightly", FALSE), force = FALSE )load_spatial( conn = cached_connection(), nightly = getOption("duckdbfs_use_nightly", FALSE), force = FALSE )
conn |
A database connection object created using the
|
nightly |
should we use the nightly version or not?
default FALSE, configurable as |
force |
force re-install? |
loads the extension and returns status invisibly.
https://duckdb.org/docs/extensions/spatial.html
This function opens a dataset from a variety of sources, including Parquet, CSV, etc, using either local file system paths, URLs, or S3 bucket URI notation.
open_dataset( sources, schema = NULL, hive_style = TRUE, unify_schemas = FALSE, format = c("parquet", "csv", "tsv", "sf"), conn = cached_connection(), tblname = tmp_tbl_name(), mode = "VIEW", filename = FALSE, recursive = TRUE, parser_options = list(), ... )open_dataset( sources, schema = NULL, hive_style = TRUE, unify_schemas = FALSE, format = c("parquet", "csv", "tsv", "sf"), conn = cached_connection(), tblname = tmp_tbl_name(), mode = "VIEW", filename = FALSE, recursive = TRUE, parser_options = list(), ... )
sources |
A character vector of paths to the dataset files. |
schema |
The schema for the dataset. If NULL, the schema will be inferred from the dataset files. |
hive_style |
A logical value indicating whether to the dataset uses Hive-style partitioning. |
unify_schemas |
A logical value indicating whether to unify the schemas of the dataset files (union_by_name). If TRUE, will execute a UNION by column name across all files (NOTE: this can add considerably to the initial execution time) |
format |
The format of the dataset files. One of |
conn |
A connection to a database. |
tblname |
The name of the table to create in the database. |
mode |
The mode to create the table in. One of |
filename |
A logical value indicating whether to include the filename in the table name. |
recursive |
should we assume recursive path? default TRUE. Set to FALSE if trying to open a single, un-partitioned file. |
parser_options |
additional options passed to the parser, e.g. to read_csv(), see https://duckdb.org/docs/stable/data/csv/overview.html#parameters |
... |
optional additional arguments passed to |
A lazy dplyr::tbl object representing the opened dataset backed
by a duckdb SQL connection. Most dplyr (and some tidyr) verbs can be
used directly on this object, as they can be translated into SQL commands
automatically via dbplyr. Generic R commands require using
dplyr::collect() on the table, which forces evaluation and reading the
resulting data into memory.
# A remote, hive-partitioned Parquet dataset base <- paste0("https://github.com/duckdb/duckdb/raw/main/", "data/parquet-testing/hive-partitioning/union_by_name/") f1 <- paste0(base, "x=1/f1.parquet") f2 <- paste0(base, "x=1/f2.parquet") f3 <- paste0(base, "x=2/f2.parquet") open_dataset(c(f1,f2,f3), unify_schemas = TRUE) # Access an S3 database specifying an independently-hosted (MINIO) endpoint efi <- open_dataset("s3://neon4cast-scores/parquet/aquatics", s3_access_key_id="", s3_endpoint="data.ecoforecast.org") # Use parser-options for non-standard csv: cars <- tempfile() # dummy data write.table(mtcars, cars, row.names = FALSE) # Note nested quotes on parser option for delimiter: df <- open_dataset(cars, format = "csv", parser_options = c(delim = "' '", header = TRUE))# A remote, hive-partitioned Parquet dataset base <- paste0("https://github.com/duckdb/duckdb/raw/main/", "data/parquet-testing/hive-partitioning/union_by_name/") f1 <- paste0(base, "x=1/f1.parquet") f2 <- paste0(base, "x=1/f2.parquet") f3 <- paste0(base, "x=2/f2.parquet") open_dataset(c(f1,f2,f3), unify_schemas = TRUE) # Access an S3 database specifying an independently-hosted (MINIO) endpoint efi <- open_dataset("s3://neon4cast-scores/parquet/aquatics", s3_access_key_id="", s3_endpoint="data.ecoforecast.org") # Use parser-options for non-standard csv: cars <- tempfile() # dummy data write.table(mtcars, cars, row.names = FALSE) # Note nested quotes on parser option for delimiter: df <- open_dataset(cars, format = "csv", parser_options = c(delim = "' '", header = TRUE))
An escape hatch for SQL that can't easily be expressed through the
dplyr/dbplyr translation layer, analogous to ibis's con.sql().
Useful for DuckDB-specific syntax such as UNION ALL BY NAME, which
aligns columns by name across queries with differing schemas and fills
missing columns with NULL.
raw_sql(query, tblname = NULL, conn = cached_connection())raw_sql(query, tblname = NULL, conn = cached_connection())
query |
A character string of SQL to execute against |
tblname |
Optional name for a temporary view. If |
conn |
A duckdb connection, by default |
A lazy dplyr::tbl object that can be further refined with
dplyr verbs or collected into memory with dplyr::collect().
# Combine two datasets with differing schemas using UNION ALL BY NAME base <- paste0("https://github.com/duckdb/duckdb/raw/main/", "data/parquet-testing/hive-partitioning/union_by_name/") a <- open_dataset(paste0(base, "x=1/f1.parquet"), tblname = "a") b <- open_dataset(paste0(base, "x=2/f2.parquet"), tblname = "b") raw_sql("SELECT * FROM a UNION ALL BY NAME SELECT * FROM b")# Combine two datasets with differing schemas using UNION ALL BY NAME base <- paste0("https://github.com/duckdb/duckdb/raw/main/", "data/parquet-testing/hive-partitioning/union_by_name/") a <- open_dataset(paste0(base, "x=1/f1.parquet"), tblname = "a") b <- open_dataset(paste0(base, "x=2/f2.parquet"), tblname = "b") raw_sql("SELECT * FROM a UNION ALL BY NAME SELECT * FROM b")
spatial_join
spatial_join( x, y, by = c("st_intersects", "st_within", "st_dwithin", "st_touches", "st_contains", "st_containsproperly", "st_covers", "st_overlaps", "st_crosses", "st_equals", "st_disjoint"), args = "", join = "left", tblname = tmp_tbl_name(), conn = cached_connection() )spatial_join( x, y, by = c("st_intersects", "st_within", "st_dwithin", "st_touches", "st_contains", "st_containsproperly", "st_covers", "st_overlaps", "st_crosses", "st_equals", "st_disjoint"), args = "", join = "left", tblname = tmp_tbl_name(), conn = cached_connection() )
x |
a duckdb table with a spatial geometry column called "geom" |
y |
a duckdb table with a spatial geometry column called "geom" |
by |
A spatial join function, see details. |
args |
additional arguments to join function (e.g. distance for st_dwithin) |
join |
JOIN type (left, right, inner, full) |
tblname |
name for the temporary view |
conn |
the duckdb connection (imputed by duckdbfs by default, must be shared across both tables) |
Possible spatial joins include:
| Function | Description |
| st_intersects | Geometry A intersects with geometry B |
| st_disjoint | The complement of intersects |
| st_within | Geometry A is within geometry B (complement of contains) |
| st_dwithin | Geometries are within a specified distance, expressed in the same units as the coordinate reference system. |
| st_touches | Two polygons touch if the that have at least one point in common, even if their interiors do not touch. |
| st_contains | Geometry A entirely contains to geometry B. (complement of within) |
| st_containsproperly | stricter version of st_contains (boundary counts as external) |
| st_covers | geometry B is inside or on boundary of A. (A polygon covers a point on its boundary but does not contain it.) |
| st_overlaps | geometry A intersects but does not completely contain geometry B |
| st_equals | geometry A is equal to geometry B |
| st_crosses | Lines or points in geometry A cross geometry B. |
All though SQL is not case sensitive, this function expects only lower case names for "by" functions.
Geometry columns are cast to plain GEOMETRY before the join, dropping
any CRS type annotation. Newer versions of the DuckDB spatial extension
refuse to call ST_* functions when the two inputs have differing CRS
tags, even for semantically equivalent labels such as EPSG:4326 and
OGC:CRS84. The cast compares coordinates directly, so ensuring that
both datasets are in a compatible spatial reference system remains the
caller's responsibility.
a (lazy) view of the resulting table. Users can continue to operate on using dplyr operations and call to_st() to collect this as an sf object.
# note we can read in remote data in a variety of vector formats: countries <- paste0("/vsicurl/", "https://github.com/cboettig/duckdbfs/", "raw/spatial-read/inst/extdata/world.gpkg") |> open_dataset(format = "sf") cities <- paste0("/vsicurl/https://github.com/cboettig/duckdbfs/raw/", "spatial-read/inst/extdata/metro.fgb") |> open_dataset(format = "sf") countries |> dplyr::filter(iso_a3 == "AUS") |> spatial_join(cities)# note we can read in remote data in a variety of vector formats: countries <- paste0("/vsicurl/", "https://github.com/cboettig/duckdbfs/", "raw/spatial-read/inst/extdata/world.gpkg") |> open_dataset(format = "sf") cities <- paste0("/vsicurl/https://github.com/cboettig/duckdbfs/raw/", "spatial-read/inst/extdata/metro.fgb") |> open_dataset(format = "sf") countries |> dplyr::filter(iso_a3 == "AUS") |> spatial_join(cities)
At this time, reads a subset of spatial metadata.
This is similar to what is reported by ogrinfo -json
st_read_meta( path, layer = 1L, tblname = tbl_name(path), conn = cached_connection(), ... )st_read_meta( path, layer = 1L, tblname = tbl_name(path), conn = cached_connection(), ... )
path |
URL or path to spatial data file |
layer |
layer number to read metadata for, defaults to first layer. |
tblname |
metadata will be stored as a view with this name, by default this is based on the name of the file. |
conn |
A connection to a database. |
... |
optional additional arguments passed to |
A lazy dplyr::tbl object containing core spatial metadata such
as projection information.
st_read_meta("https://github.com/duckdb/duckdb_spatial/raw/main/test/data/amsterdam_roads.fgb")st_read_meta("https://github.com/duckdb/duckdb_spatial/raw/main/test/data/amsterdam_roads.fgb")
Write geojson using duckdb's native JSON writer
to_geojson( dataset, path, conn = cached_connection(), id_col = NULL, as_http = FALSE, server = Sys.getenv("AWS_S3_ENDPOINT", "s3.amanzonaws.com"), use_ssl = Sys.getenv("AWS_HTTPS", "TRUE") )to_geojson( dataset, path, conn = cached_connection(), id_col = NULL, as_http = FALSE, server = Sys.getenv("AWS_S3_ENDPOINT", "s3.amanzonaws.com"), use_ssl = Sys.getenv("AWS_HTTPS", "TRUE") )
dataset |
a remote tbl object from |
path |
a local file path or S3 path with write credentials |
conn |
duckdbfs database connection |
id_col |
(deprecated). to_geojson() will preserve all atomic columns as properties. |
as_http |
convert returned S3 path to URL (e.g. for public buckets) |
server |
aws endpoint if converting s3 path to URL |
use_ssl |
should url use https |
path, invisibly
hexid
Write H3 hexagon data out as an h3j-compliant JSON file
NOTE: the column containing H3 hashes must be named hexid
to_h3j(dataset, path, conn = cached_connection(), as_http = FALSE)to_h3j(dataset, path, conn = cached_connection(), as_http = FALSE)
dataset |
a remote tbl object from |
path |
a local file path or S3 path with write credentials |
conn |
duckdbfs database connection |
as_http |
if path is an S3 location, will return corresponding HTTP address. |
# example code# example code
to_json write data out as a JSON object
to_json( dataset, path, conn = cached_connection(), array = TRUE, options = NULL, as_http = FALSE )to_json( dataset, path, conn = cached_connection(), array = TRUE, options = NULL, as_http = FALSE )
dataset |
a remote tbl object from |
path |
a local file path or S3 path with write credentials |
conn |
duckdbfs database connection |
array |
generate a JSON array? |
options |
additional options as a char string, see |
as_http |
if path is an S3 location, will return corresponding HTTP address. |
path, invisibly
Convert output to sf object
to_sf(x, crs = NA, conn = cached_connection())to_sf(x, crs = NA, conn = cached_connection())
x |
a remote duckdb |
crs |
The coordinate reference system, any format understood by |
conn |
the connection object from the tbl.
Takes a duckdb table (from Note: |
an sf class object (in memory).
library(dplyr) csv_file <- system.file("extdata/spatial-test.csv", package="duckdbfs") # Note that we almost always must first create a `geometry` column, e.g. # from lat/long columns using the `st_point` method. sf <- open_dataset(csv_file, format = "csv") |> mutate(geom = ST_Point(longitude, latitude)) |> to_sf() # We can use the full space of spatial operations, including spatial # and normal dplyr filters. All operations are translated into a # spatial SQL query by `to_sf`: open_dataset(csv_file, format = "csv") |> mutate(geom = ST_Point(longitude, latitude)) |> mutate(dist = ST_Distance(geom, ST_Point(0,0))) |> filter(site %in% c("a", "b", "e")) |> to_sf()library(dplyr) csv_file <- system.file("extdata/spatial-test.csv", package="duckdbfs") # Note that we almost always must first create a `geometry` column, e.g. # from lat/long columns using the `st_point` method. sf <- open_dataset(csv_file, format = "csv") |> mutate(geom = ST_Point(longitude, latitude)) |> to_sf() # We can use the full space of spatial operations, including spatial # and normal dplyr filters. All operations are translated into a # spatial SQL query by `to_sf`: open_dataset(csv_file, format = "csv") |> mutate(geom = ST_Point(longitude, latitude)) |> mutate(dist = ST_Distance(geom, ST_Point(0,0))) |> filter(site %in% c("a", "b", "e")) |> to_sf()
write_dataset
write_dataset( dataset, path, conn = cached_connection(), format = c("parquet", "csv"), partitioning = dplyr::group_vars(dataset), overwrite = TRUE, options = list(), as_http = FALSE, ... )write_dataset( dataset, path, conn = cached_connection(), format = c("parquet", "csv"), partitioning = dplyr::group_vars(dataset), overwrite = TRUE, options = list(), as_http = FALSE, ... )
dataset |
a remote tbl object from |
path |
a local file path or S3 path with write credentials |
conn |
duckdbfs database connection |
format |
export format |
partitioning |
names of columns to use as partition variables |
overwrite |
allow overwriting of existing files? |
options |
Additional arguments to COPY, see https://duckdb.org/docs/stable/sql/statements/copy.html#copy–to-options Note, uses duckdb native syntax, e.g. c("PER_THREAD_OUTPUT false"), for named arguments, see examples. (Recall SQL is case-insensitive). |
as_http |
if path is an S3 location, will return corresponding HTTP address. |
... |
additional arguments to |
Returns the path, invisibly.
to_sf to_json to_geojson write_geo
write_dataset(mtcars, tempfile()) write_dataset(mtcars, tempdir()) write_dataset(mtcars, tempdir(), options = c("PER_THREAD_OUTPUT FALSE", "RETURN_STATS TRUE"))write_dataset(mtcars, tempfile()) write_dataset(mtcars, tempdir()) write_dataset(mtcars, tempdir(), options = c("PER_THREAD_OUTPUT FALSE", "RETURN_STATS TRUE"))
Write out to any spatial data format supported by GDAL.
write_geo( dataset, path, conn = cached_connection(), driver = "GeoJSON", layer_creation_options = "WRITE_BBOX=YES", srs = "ESPG:4326", as_http = FALSE )write_geo( dataset, path, conn = cached_connection(), driver = "GeoJSON", layer_creation_options = "WRITE_BBOX=YES", srs = "ESPG:4326", as_http = FALSE )
dataset |
a remote tbl object from |
path |
a local file path or S3 path with write credentials |
conn |
duckdbfs database connection |
driver |
driver, see https://duckdb.org/docs/stable/extensions/spatial/gdal |
layer_creation_options |
to GDAL, see https://duckdb.org/docs/stable/extensions/spatial/gdal |
srs |
Set a spatial reference system as metadata to use for the export. This can be a WKT string, an EPSG code or a proj-string, basically anything you would normally be able to pass to GDAL. Note that this will not perform any reprojection of the input geometry, it just sets the metadata if the target driver supports it. |
as_http |
if path is an S3 location, will return corresponding HTTP address. |
NOTE: This uses the version of GDAL packaged inside of duckdb, and not the system GDAL. At this time, duckdb's pre-packaged GDAL does not support s3 writes, and will produce a "Error: Not implemented Error: GDAL Error (6): Seek not supported on writable /vsis3/ files". Use to_geojson() to export using duckdb's native JSON serializer instead.
path, invisibly
local_file <- system.file("extdata/spatial-test.csv", package="duckdbfs") load_spatial() tbl <- open_dataset(local_file, format='csv') write_geo(tbl, "spatial.geojson")local_file <- system.file("extdata/spatial-test.csv", package="duckdbfs") load_spatial() tbl <- open_dataset(local_file, format='csv') write_geo(tbl, "spatial.geojson")