Interacting with GeoParquet files with R

The Atlas database is regularly backed up in GeoParquet format. This file format, combined with DuckDB, allows efficient connection to a remote dataset without needing to download all the data. It is also possible to download the GeoParquet file containing all public Atlas data to perform queries more quickly.

Here are the instructions for interacting with the GeoParquet file stored in the cloud from R using the dplyr, duckdb and and duckdbfs packages.


To start, install the following packages. Once the packages are installed, it is not necessary to re-install them at each R session.

install.packages(c('dplyr','duckdb','duckdbfs', 'sf'))

At each R session, use the necessary librairies.

library(dplyr)
library(duckdb)
library(duckdbfs)
library(sf)

Load the necessary functions to access the data.

source("http://atlas.biodiversite-quebec.ca/bq-atlas-parquet.R")

Remote connection to the GeoParquet file.

The atlas_dates object contains the backup dates of the available files.

atlas_dates

Establish connection to the most recent version of Atlas data.

atlas_rem <- atlas_remote(tail(atlas_dates$dates,n=1))

View the names of the different fields in the Parquet file.

colnames(atlas_rem)

See the different datasets that are included in the Atlas and the number of observations for each.

datasets <- atlas_rem |> group_by(dataset_name) |> summarize(cnt=count()) |> arrange(desc(cnt))

Load only the observations of snowy owls into a data frame. This command can take several minutes depending on the internet connection speed. The snowy owl is given as an example here. It is possible to load any species.

bubo_sca <- atlas_rem |> filter(valid_scientific_name == 'Bubo scandiacus') |> collect()

Load only the observations of snowy owl into a data frame while selecting only a few columns.

bubo_sca <- atlas_rem |> filter(valid_scientific_name == 'Bubo scandiacus') |> select(valid_scientific_name, latitude, longitude, 
dataset_name, year_obs, day_obs) |> collect()

Load observations of Iris versicolor into an sf spatial object. This command can take several minutes depending on the internet connection speed.

iris_vers <- atlas_rem |> filter(valid_scientific_name == 'Iris versicolor') |> 
        mutate(geom = ST_Point(as.numeric(longitude), as.numeric(latitude))) |> 
        to_sf() |> collect()

View on a map

plot(iris_vers['year_obs'])

Get the number of observations for each species present in Atlas and sort by decreasing observation count. Note that this command will not load Atlas data onto your computer. The calculation will be done remotely using DuckDB and the GeoParquet format.

sp_count <- atlas_rem |> group_by(valid_scientific_name) |> summarize(cnt=count()) |> arrange(desc(cnt)) |> collect()

Download the GeoParquet file locally.

For heavier operations and faster interaction with the file, you can download it locally to your computer. Note that the file is approximately 800 MB, so make sure you have enough disk space on your computer.

# Replace " ~/Downloads " for the directory of your choice on your computer.
atlas <- atlas_local(tail(atlas_dates$dates,n=1),'~/Downloads/')

Redo the observation count by species using the local file. The command using the local file is much faster than with the remote file.

sp_count_local <- atlas |> group_by(valid_scientific_name) |> summarize(cnt=count()) |> arrange(desc(cnt)) |> collect()

Load only the observations of snowy owls into a data frame using the local file . The command using the local file is much faster than with the remote file. The snowy owl is given as an example here. It is possible to load any species.

bubo_sca_local <- atlas |> filter(valid_scientific_name == 'Bubo scandiacus') |>  collect()

SQL queries

If you prefer using SQL queries instead of using the dplyr package, the object atlas_con establishes a connection to DuckDB and the function atlas_dbi() will connect to the parquet file and create a VIEW named atlas.

atlas_dbi(tail(atlas_dates$dates,n=1))
acci <- dbGetQuery(atlas_con,"SELECT * FROM atlas WHERE valid_scientific_name = 'Iris versicolor'")