Title: | Spatial and Environmental Blocking for K-Fold and LOO Cross-Validation |
---|---|
Description: | Creating spatially or environmentally separated folds for cross-validation to provide a robust error estimation in spatially structured environments; Investigating and visualising the effective range of spatial autocorrelation in continuous raster covariates and point samples to find an initial realistic distance band to separate training and testing datasets spatially described in Valavi, R. et al. (2019) <doi:10.1111/2041-210X.13107>. |
Authors: | Roozbeh Valavi [aut, cre] , Jane Elith [aut], José Lahoz-Monfort [aut], Ian Flint [aut], Gurutzeta Guillera-Arroita [aut] |
Maintainer: | Roozbeh Valavi <[email protected]> |
License: | GPL (>= 3) |
Version: | 3.1-5 |
Built: | 2024-11-21 06:07:41 UTC |
Source: | https://github.com/rvalavi/blockcv |
Simple random selection of training and testing folds in a spatially structured environment leads to
an underestimation of error in the evaluation of spatial
predictions and may result in inappropriate model selection (Telford and Birks, 2009; Roberts et al., 2017). The use of spatial and
environmental blocks to separate training and testing sets has been suggested as a good strategy for realistic error estimation in datasets
with dependence structures, and more generally as a robust method for estimating the predictive performance of models used to predict mapped
distributions (Roberts et al., 2017). The package blockCV
offers
a range of functions for generating train and test folds
for k-fold and leave-one-out (LOO) cross-validation (CV). It allows for separation
of data spatially and environmentally, with various options for block construction.
Additionally, it includes a function for assessing the level of spatial autocorrelation
in response or raster covariates, to aid in selecting an appropriate distance band for
data separation. The blockCV
package is suitable for the evaluation of a variety of
spatial modelling applications, including classification of remote sensing imagery,
soil mapping, and species distribution modelling (SDM). It also provides support for
different SDM scenarios, including presence-absence and presence-background species
data, rare and common species, and raster data for predictor variables.
Roozbeh Valavi, Jane Elith, José Lahoz-Monfort, Ian Flint, and Gurutzeta Guillera-Arroita
Valavi, R., Elith, J., Lahoz-Monfort, J. J., & Guillera-Arroita, G. (2019). blockCV: An R package for generating spatially or environmentally separated folds for k-fold cross-validation of species distribution models. Methods in Ecology and Evolution, 10(2), 225-232. doi:10.1111/2041-210X.13107.
cv_spatial
, cv_cluster
, cv_buffer
, and cv_nndm
for blocking strategies.
This function is deprecated and will be removed in future updates! Please use cv_buffer
instead!
buffering( speciesData, species = NULL, theRange, spDataType = "PA", addBG = TRUE, progress = TRUE )
speciesData |
A simple features (sf) or SpatialPoints object containing species data (response variable). |
species |
Character. Indicating the name of the field in which species data (binary response i.e. 0 and 1) is stored. If |
theRange |
Numeric value of the specified range by which the training and testing datasets are separated.
This distance should be in metres no matter what the coordinate system is. The range can be explored by |
spDataType |
Character input indicating the type of species data. It can take two values, PA for presence-absence data and PB for
presence-background data, when |
addBG |
Logical. Add background points to the test set when |
progress |
Logical. If TRUE a progress bar will be shown. |
This function assists selection of block size. It allows the user to visualise the blocks interactively, viewing the impact of block size on number and arrangement of blocks in the landscape (and optionally on the distribution of species data in those blocks). Slide to the selected block size, and click Apply Changes to change the block size.
cv_block_size(r, x = NULL, column = NULL, min_size = NULL, max_size = NULL)
r |
a terra SpatRaster object (optional). If provided, its extent will be used to specify the blocks. It also supports stars, raster, or path to a raster file on disk. |
x |
a simple features (sf) or SpatialPoints object of spatial sample data. If |
column |
character (optional). Indicating the name of the column in which response variable (e.g. species data as a binary response i.e. 0s and 1s) is stored to be shown on the plot. |
min_size |
numeric; the minimum size of the blocks (in metres) to explore. |
max_size |
numeric; the maximum size of the blocks (in metres) to explore. |
an interactive shiny session
if(interactive()){
  library(blockCV)

  # import presence-absence species data
  points <- read.csv(system.file("extdata/", "species.csv", package = "blockCV"))
  pa_data <- sf::st_as_sf(points, coords = c("x", "y"), crs = 7845)

  # manually choose the size of spatial blocks
  cv_block_size(x = pa_data,
                column = "occ",
                min_size = 2e5,
                max_size = 9e5)
}
This function generates spatially separated train and test folds by considering buffers of
the specified distance (size
parameter) around each observation point.
This approach is a form of leave-one-out cross-validation. Each fold is generated by excluding
nearby observations around each testing point within the specified distance (ideally the range of
spatial autocorrelation, see cv_spatial_autocor
). In this method, the testing set never
directly abuts a training sample (e.g. presence or absence; 0s and 1s). For more information see the details section.
cv_buffer( x, column = NULL, size, presence_bg = FALSE, add_bg = FALSE, progress = TRUE, report = TRUE )
x |
a simple features (sf) or SpatialPoints object of spatial sample data (e.g., species data or ground truth sample for image classification). |
column |
character; indicating the name of the column in which response variable (e.g. species data as a binary
response i.e. 0s and 1s) is stored. This is required when |
size |
numeric value of the specified range by which training/testing data are separated.
This distance should be in metres. The range could be explored by |
presence_bg |
logical; whether to treat data as species presence-background data. For all other data
types (presence-absence, continuous, count or multi-class responses), this option should be |
add_bg |
logical; add background points to the test set when |
progress |
logical; whether to show a progress bar. |
report |
logical; whether to print a summary of records in each fold; for very big
datasets, set to |
When working with presence-background (presence and pseudo-absence) species distribution
data (should be specified by presence_bg = TRUE
argument), only presence records are used
for specifying the folds (recommended). Consider a target presence point. The buffer is defined around this target point,
using the specified range (size
). By default, the testing fold comprises only the target presence point (all background
points within the buffer are also added when add_bg = TRUE
).
Any non-target presence points inside the buffer are excluded.
All points (presence and background) outside the buffer are used for the training set.
The method cycles through all the presence data, so the number of folds is equal to
the number of presence points in the dataset.
For presence-absence data (and all other types of data), folds are created based on all records, both
presences and absences. As above, a target observation (presence or absence) forms a test point, all
presence and absence points other than the target point within the buffer are ignored, and the training
set comprises all presences and absences outside the buffer. Apart from the folds, the number
of training-presence, training-absence, testing-presence and testing-absence
records is stored and returned in the records
table. If column = NULL
and presence_bg = FALSE
,
the procedure is the same as for presence-absence data. All other data types (continuous, count or multi-class responses) should also
be handled with presence_bg = FALSE
.
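The buffering logic described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: the coordinates are simulated, they are assumed to be in a projected CRS with metre units so that plain Euclidean distances apply, and the names `xy` and `buffer_folds` are hypothetical.

```r
# Toy projected coordinates in metres (simulated, not package data)
set.seed(1)
xy <- cbind(x = runif(30, 0, 1e6), y = runif(30, 0, 1e6))
size <- 2e5  # buffer radius in metres (assumed autocorrelation range)

# Pairwise Euclidean distances between all sample points
d <- as.matrix(dist(xy))

# One fold per point: the target point is the test set, points inside the
# buffer are dropped, and points beyond the buffer form the training set
buffer_folds <- lapply(seq_len(nrow(xy)), function(i) {
  list(train = unname(which(d[i, ] > size)), test = i)
})

# Number of training points left for the first few folds
sapply(buffer_folds[1:5], function(f) length(f$train))
```

Larger values of `size` leave fewer training points per fold, which is the trade-off to keep in mind when choosing the buffer distance.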
An object of class S3. A list of objects including:
folds_list - a list containing the folds. Each fold has two vectors with the training (first) and testing (second) indices
k - number of the folds
size - the defined range of spatial autocorrelation
column - the name of the column if provided
presence_bg - whether this was treated as presence-background data
records - a table with the number of points in each category of training and testing
Radosavljevic, A., & Anderson, R. P. (2014). Making better Maxent models of species distributions: Complexity, overfitting and evaluation. Journal of Biogeography, 41, 629–643. https://doi.org/10.1111/jbi.12227
cv_nndm
, cv_spatial
, and cv_spatial_autocor
library(blockCV)

# import presence-absence species data
points <- read.csv(system.file("extdata/", "species.csv", package = "blockCV"))
# make an sf object from data.frame
pa_data <- sf::st_as_sf(points, coords = c("x", "y"), crs = 7845)

bloo <- cv_buffer(x = pa_data,
                  column = "occ",
                  size = 350000, # size in metres no matter the CRS
                  presence_bg = FALSE)
This function uses clustering methods to specify sets of similar environmental
conditions based on the input covariates, or cluster of spatial coordinates of the sample data.
Sample data (i.e. species data) corresponding to any of
these groups or clusters are assigned to a fold. Clustering is done
using kmeans
for both approaches. The only required argument is x
; supplying it alone leads to
a clustering of the coordinates of the sample data. Otherwise, by providing r
, environmental
clustering is done.
cv_cluster( x, column = NULL, r = NULL, k = 5L, scale = TRUE, raster_cluster = FALSE, num_sample = 10000L, biomod2 = TRUE, report = TRUE, ... )
x |
a simple features (sf) or SpatialPoints object of spatial sample data (e.g., species data or ground truth sample for image classification). |
column |
character (optional). Indicating the name of the column in which response variable (e.g. species data as a binary response i.e. 0s and 1s) is stored. This is only used to see whether all the folds contain all the classes in the final report. |
r |
a terra SpatRaster object of covariates to identify environmental groups. If provided, clustering will be done in environmental space rather than spatial coordinates of sample points. |
k |
integer value. The number of desired folds for cross-validation. The default is |
scale |
logical; whether to scale the input rasters (recommended) for clustering. |
raster_cluster |
logical; if |
num_sample |
integer; the number of samples from raster layers to build the clusters (when |
biomod2 |
logical. Creates a matrix of folds that can be directly used in the biomod2 package as a CV.user.table for cross-validation. |
report |
logical; whether to print the report of the records per fold. |
... |
additional arguments for |
As k-means algorithms use Euclidean distance to estimate clusters, the input raster covariates should be quantitative variables.
Since variables with wider ranges of values might dominate the clusters and bias the environmental clustering (Hastie et al., 2009),
all the input rasters are first scaled and centred (scale = TRUE
) within the function.
If raster_cluster = TRUE
, the clustering is done in the raster space. In this approach the clusters will be consistent throughout the region
and across different sample datasets in the same region (useful for comparison). However, this may result in one or more clusters
that cover none of the species records (the spatial location of response samples),
especially when species data is not dispersed throughout the region or the number of clusters (k or folds) is high. In this
case, the number of folds is less than specified k
. If raster_cluster = FALSE
, the clustering will be done in
species points and the number of the folds will be the same as k
.
Note that the input raster layer should cover all the species points, otherwise an error will be raised. The records with no raster value should be deleted prior to the analysis or another raster layer must be provided.
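The idea of environmental clustering on scaled covariates can be sketched with base R alone. This is an illustration under assumptions rather than the function's internals: the covariate values in `env` are simulated stand-ins for values extracted at the sample points, and the fold objects only mimic the documented shape (training indices first, testing indices second).

```r
set.seed(42)
# Hypothetical covariate values at 100 sample points, on very different scales
env <- data.frame(temp = rnorm(100, mean = 20, sd = 5),
                  rain = rnorm(100, mean = 1000, sd = 300))
k <- 5

# Centre and scale so that no single variable dominates the Euclidean distance
env_scaled <- scale(env)

# k-means in environmental space; each cluster becomes one fold
km <- kmeans(env_scaled, centers = k, iter.max = 50)
folds_ids <- km$cluster

# Training indices first, testing indices second
folds_list <- lapply(seq_len(k), function(f) {
  list(which(folds_ids != f), which(folds_ids == f))
})

table(folds_ids)  # number of records per fold
```

Without the `scale()` step, `rain` would dominate the clustering simply because its values span a wider numeric range.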
An object of class S3. A list of objects including:
folds_list - a list containing the folds. Each fold has two vectors with the training (first) and testing (second) indices
folds_ids - a vector of values indicating the number of the fold for each observation (each number corresponds to the same point in x)
biomod_table - a matrix with the folds to be used in biomod2 package
k - number of the folds
column - the name of the column if provided
type - indicates whether spatial or environmental clustering was done.
records - a table with the number of points in each category of training and testing
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: Data mining, inference, and prediction (2nd ed., Vol. 1).
cv_buffer
and cv_spatial
library(blockCV)

# import presence-absence species data
points <- read.csv(system.file("extdata/", "species.csv", package = "blockCV"))
# make an sf object from data.frame
pa_data <- sf::st_as_sf(points, coords = c("x", "y"), crs = 7845)

# load raster data
path <- system.file("extdata/au/", package = "blockCV")
files <- list.files(path, full.names = TRUE)
covars <- terra::rast(files)

# spatial clustering
set.seed(6)
sc <- cv_cluster(x = pa_data,
                 column = "occ", # optional; name of the column with response
                 k = 5)

# environmental clustering
set.seed(6)
ec <- cv_cluster(r = covars, # if provided will be used for environmental clustering
                 x = pa_data,
                 column = "occ", # optional; name of the column with response
                 k = 5,
                 scale = TRUE)
A fast implementation of the Nearest Neighbour Distance Matching (NNDM) algorithm (Milà et al., 2022) in C++. Similar
to cv_buffer
, this is a variation of leave-one-out (LOO) cross-validation. It tries to match the
nearest neighbour distance distribution function between the test and training data to the nearest neighbour
distance distribution function between the target prediction and training points (Milà et al., 2022).
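The two nearest neighbour distance distributions that NNDM compares can be illustrated in base R. The sketch below only computes and contrasts the two empirical distribution functions; it does not perform the matching step of the algorithm. All data are simulated, the coordinates are assumed to be projected (metres), and the names `samples`, `pred`, `G_loo` and `G_pred` are hypothetical.

```r
set.seed(11)
# Samples clustered in one corner of the study area; prediction points everywhere
samples <- cbind(runif(50, 0, 3e5), runif(50, 0, 3e5))
pred <- cbind(runif(500, 0, 1e6), runif(500, 0, 1e6))

# Plain LOO: distance from each sample to its nearest other sample
d_ss <- as.matrix(dist(samples))
diag(d_ss) <- Inf
nn_loo <- apply(d_ss, 1, min)

# Prediction task: distance from each prediction point to its nearest sample
nn_pred <- apply(pred, 1, function(p) {
  min(sqrt((samples[, 1] - p[1])^2 + (samples[, 2] - p[2])^2))
})

# Empirical nearest neighbour distance distribution functions
G_loo <- ecdf(nn_loo)
G_pred <- ecdf(nn_pred)

# Proportion of nearest neighbour distances within 50 km in each setting
r <- 5e4
c(loo = G_loo(r), prediction = G_pred(r))
```

When the samples are clustered, plain LOO test points sit much closer to training data than typical prediction locations do, which is the mismatch that excluding nearby training points is meant to reduce.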
cv_nndm( x, column = NULL, r, size, num_sample = 10000, sampling = "random", min_train = 0.05, presence_bg = FALSE, add_bg = FALSE, plot = TRUE, report = TRUE )
x |
a simple features (sf) or SpatialPoints object of spatial sample data (e.g., species data or ground truth sample for image classification). |
column |
character; indicating the name of the column in which response variable (e.g. species data as a binary
response i.e. 0s and 1s) is stored. This is required when |
r |
a terra SpatRaster object of a predictor variable. This defines the area that model is going to predict. |
size |
numeric value of the range of spatial autocorrelation (the |
num_sample |
integer; the number of sample points from predictor ( |
sampling |
either |
min_train |
numeric; between 0 and 1. A constraint on the minimum proportion of train points in each fold. |
presence_bg |
logical; whether to treat data as species presence-background data. For all other data
types (presence-absence, continuous, count or multi-class responses), this option should be |
add_bg |
logical; add background points to the test set when |
plot |
logical; whether to plot the G functions. |
report |
logical; whether to print a summary of records in each fold; for very big
datasets, set to |
When working with presence-background (presence and pseudo-absence) species distribution
data (should be specified by presence_bg = TRUE
argument), only presence records are used
for specifying the folds (recommended). The testing fold comprises only the target presence point (optionally,
all background points within the distance are also included when add_bg = TRUE
; this is the
distance that matches the nearest neighbour distance distribution function of training-testing presences and
training-presences and prediction points; often lower than size
).
Any non-target presence points inside the distance are excluded.
All points (presence and background) outside this distance are used for the training set.
The method cycles through all the presence data, so the number of folds is equal to
the number of presence points in the dataset.
For all other types of data (including presence-absence, count, continuous, and multi-class)
set presence_bg = FALSE
, and the function behaves similarly to the methods
explained by Milà and colleagues (2022).
An object of class S3. A list of objects including:
folds_list - a list containing the folds. Each fold has two vectors with the training (first) and testing (second) indices
k - number of the folds
size - the distance band used to separate training and testing folds
column - the name of the column if provided
presence_bg - whether this was treated as presence-background data
records - a table with the number of points in each category of training and testing
C. Milà, J. Mateu, E. Pebesma, and H. Meyer, Nearest Neighbour Distance Matching Leave-One-Out Cross-Validation for map validation, Methods in Ecology and Evolution (2022).
cv_buffer
and cv_spatial_autocor
library(blockCV)

# import presence-absence species data
points <- read.csv(system.file("extdata/", "species.csv", package = "blockCV"))
# make an sf object from data.frame
pa_data <- sf::st_as_sf(points, coords = c("x", "y"), crs = 7845)

# load raster data
path <- system.file("extdata/au/bio_5.tif", package = "blockCV")
covar <- terra::rast(path)

nndm <- cv_nndm(x = pa_data,
                column = "occ", # optional
                r = covar,
                size = 350000, # size in metres no matter the CRS
                num_sample = 10000,
                sampling = "regular",
                min_train = 0.1)
This function visualises the folds created by blockCV. It also accepts a raster layer to be used as background in the output plot.
cv_plot( cv, x, r = NULL, nrow = NULL, ncol = NULL, num_plots = 1:10, max_pixels = 3e+05, remove_na = TRUE, raster_colors = gray.colors(10, alpha = 1), points_colors = c("#E69F00", "#56B4E9"), points_alpha = 0.7, label_size = 4 )
cv |
a blockCV cv_* object; a |
x |
a simple features (sf) or SpatialPoints object of the spatial sample data used for creating
the |
r |
a terra SpatRaster object (optional). If provided, it will be used as background of the plots. It also supports stars, raster, or path to a raster file on disk. |
nrow |
integer; number of rows for facet plot |
ncol |
integer; number of columns for facet plot |
num_plots |
a vector of indices of folds; by default the first 10 are shown (if available).
You can choose any of the folds to be shown e.g. |
max_pixels |
integer; maximum number of pixels used for plotting |
remove_na |
logical; whether to remove excluded points in |
raster_colors |
character; a character vector of colours for raster background e.g. |
points_colors |
character; two colours to be used for train and test points |
points_alpha |
numeric; the opacity of points |
label_size |
integer; size of fold labels when a |
a ggplot object
library(blockCV)

# import presence-absence species data
points <- read.csv(system.file("extdata/", "species.csv", package = "blockCV"))
pa_data <- sf::st_as_sf(points, coords = c("x", "y"), crs = 7845)

# spatial clustering
sc <- cv_cluster(x = pa_data, k = 5)

# now plot the created folds
cv_plot(cv = sc,
        x = pa_data, # sample points
        nrow = 2,
        points_alpha = 0.5)
This function computes multivariate environmental similarity surface (MESS) as described
in Elith et al. (2010). MESS represents how similar a point in a testing fold is to a training
fold (as a reference set of points), with respect to a set of predictor variables in r
.
The negative values are the sites where at least one variable has a value that is outside
the range of environments over the reference set, so these are novel environments.
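The sign behaviour described above can be illustrated with a small base R sketch of the MESS calculation, following the definition in Elith et al. (2010) as commonly stated. This is not the code used by the function: the helper `mess_one` and the toy `train` and `test` tables are hypothetical, and each predictor is assumed to be numeric with a non-zero range in the reference set.

```r
# Similarity of one value p to a reference vector ref for a single variable
mess_one <- function(p, ref) {
  f <- mean(ref < p) * 100          # percentage of reference values below p
  rng <- max(ref) - min(ref)
  if (f == 0) {
    (p - min(ref)) / rng * 100      # at or below the reference minimum
  } else if (f <= 50) {
    2 * f
  } else if (f < 100) {
    2 * (100 - f)
  } else {
    (max(ref) - p) / rng * 100      # above the reference maximum
  }
}

# Toy training (reference) and testing points with two predictors
train <- data.frame(temp = c(10, 12, 14, 16, 18),
                    rain = c(200, 400, 600, 800, 1000))
test <- data.frame(temp = c(14, 20),
                   rain = c(500, 900))

# MESS of each test point is the minimum similarity over all predictors
mess <- sapply(seq_len(nrow(test)), function(i) {
  min(mapply(function(v) mess_one(test[i, v], train[[v]]), names(train)))
})
mess
```

The first test point lies inside the training range of both predictors and gets a positive value; the second has a temperature above the training maximum, so its MESS is negative.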
cv_similarity( cv, x, r, num_plot = seq_along(cv$folds_list), jitter_width = 0.1, points_size = 2, points_alpha = 0.7, points_colors = NULL, progress = TRUE )
cv |
a blockCV cv_* object; a |
x |
a simple features (sf) or SpatialPoints object of the spatial sample data used for creating
the |
r |
a terra SpatRaster object of environmental predictors that are going to be used for modelling. This is used to calculate similarity between the training and testing points. |
num_plot |
a vector of indices of folds. |
jitter_width |
numeric; the width of jitter points. |
points_size |
numeric; the size of points. |
points_alpha |
numeric; the opacity of points |
points_colors |
character; a character vector of colours for points |
progress |
logical; whether to show a progress bar for random fold selection. |
a ggplot object
library(blockCV)

# import presence-absence species data
points <- read.csv(system.file("extdata/", "species.csv", package = "blockCV"))
# make an sf object from data.frame
pa_data <- sf::st_as_sf(points, coords = c("x", "y"), crs = 7845)

# load raster data
path <- system.file("extdata/au/", package = "blockCV")
files <- list.files(path, full.names = TRUE)
covars <- terra::rast(files)

# hexagonal spatial blocking by specified size and random assignment
sb <- cv_spatial(x = pa_data,
                 column = "occ",
                 size = 450000,
                 k = 5,
                 iteration = 1)

# compute extrapolation
cv_similarity(cv = sb, r = covars, x = pa_data)
This function creates spatially separated folds based on a specified distance or a number of rows and/or columns.
It assigns blocks to the training and testing folds randomly, systematically or
in a checkerboard pattern. The distance (size
)
should be in metres, regardless of the unit of the reference system of
the input data (for more information see the details section). By default,
the function creates blocks according to the extent and shape of the spatial sample data (x
e.g.
the species occurrence). Alternatively, blocks can be created based on r
assuming that the
user has considered the landscape for the given species and case study.
Blocks can also be offset so the origin is not at the outer corner of the rasters.
Instead of providing a distance, the blocks can also be created by specifying a number of rows and/or
columns and divide the study area into vertical or horizontal bins, as presented in Wenger & Olden (2012)
and Bahn & McGill (2012). Finally, the blocks can be specified by a user-defined spatial polygon layer.
cv_spatial( x, column = NULL, r = NULL, k = 5L, hexagon = TRUE, flat_top = FALSE, size = NULL, rows_cols = c(10, 10), selection = "random", iteration = 100L, user_blocks = NULL, folds_column = NULL, deg_to_metre = 111325, biomod2 = TRUE, offset = c(0, 0), extend = 0, seed = NULL, progress = TRUE, report = TRUE, plot = TRUE, ... )
x |
a simple features (sf) or SpatialPoints object of spatial sample data (e.g., species data or ground truth sample for image classification). |
column |
character (optional). Indicating the name of the column in which response variable (e.g. species data as a binary
response i.e. 0s and 1s) is stored to find balanced records in cross-validation folds. If |
r |
a terra SpatRaster object (optional). If provided, its extent will be used to specify the blocks. It also supports stars, raster, or path to a raster file on disk. |
k |
integer value. The number of desired folds for cross-validation. The default is |
hexagon |
logical. Creates hexagonal (default) spatial blocks. If |
flat_top |
logical. Creates hexagonal blocks with flat tops. |
size |
numeric value of the specified range by which blocks are created and training/testing data are separated.
This distance should be in metres. The range could be explored by |
rows_cols |
integer vector. Two integers to define the blocks based on row and
column e.g. |
selection |
type of assignment of blocks into folds. Can be random (default), systematic, checkerboard, or predefined.
The checkerboard does not work with hexagonal and user-defined spatial blocks. If the |
iteration |
integer value. The number of attempts to create folds with balanced records. Only works when |
user_blocks |
an sf or SpatialPolygons object to be used as the blocks (optional). This can be a user defined polygon and it must cover all
the species (response) points. If |
folds_column |
character. Indicating the name of the column (in |
deg_to_metre |
integer. The number of metres per degree, used to convert size to degrees. See the details section for more information. |
biomod2 |
logical. Creates a matrix of folds that can be directly used in the biomod2 package as a CV.user.table for cross-validation. |
offset |
two numbers between 0 and 1 to shift blocks by that proportion of block size.
This option only works when |
extend |
numeric; This parameter specifies the percentage by which the map's extent is expanded to increase the size of the square spatial blocks, ensuring that all points fall within a block. The value should be a numeric between 0 and 5. |
seed |
integer; a random seed for reproducibility (although an external seed should also work). |
progress |
logical; whether to show a progress bar for random fold selection. |
report |
logical; whether to print the report of the records per fold. |
plot |
logical; whether to plot the final blocks with fold numbers in ggplot.
You can re-create this with |
... |
additional option for |
To maintain consistency, all functions in this package use metres as their unit of
measurement. However, when the input map has a geographic coordinate system (in decimal degrees),
the block size is calculated by dividing the size
parameter by deg_to_metre
(which
defaults to 111325 metres, approximately the length of one degree of longitude at the Equator).
In reality, this value varies by a factor of the cosine of the latitude. So, an alternative sensible
value could be cos(mean(sf::st_bbox(x)[c(2,4)]) * pi/180) * 111325
.
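The conversion above can be worked through with plain arithmetic. In this sketch the latitude limits in `lat_range` are hypothetical and stand in for the `ymin` and `ymax` values that `sf::st_bbox(x)[c(2, 4)]` would return for data in a geographic coordinate system.

```r
# Hypothetical southern and northern limits of the data, in decimal degrees
lat_range <- c(ymin = -44, ymax = -10)

# Latitude-adjusted number of metres per degree (formula from the text above)
deg_to_metre <- cos(mean(lat_range) * pi / 180) * 111325

# Block size requested in metres and its equivalent in degrees
size <- 450000
size_deg <- size / deg_to_metre

round(c(deg_to_metre = deg_to_metre, size_deg = size_deg), 3)
```

At a mean latitude of 27 degrees the adjusted value is roughly 11 percent smaller than the default, so the same `size` in metres corresponds to a larger block in degrees.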
The offset
can be used to change the spatial position of the blocks. It can also be used to
assess the sensitivity of analysis results to shifting in the blocking arrangements.
These options are available when size
is defined. By default the region is
located in the middle of the blocks and by setting the offsets, the blocks will shift.
Roberts et al. (2017) suggest that blocks should be substantially bigger than the range of spatial
autocorrelation (in model residuals) to obtain realistic error estimates, while a buffer with the size of
the spatial autocorrelation range would result in a good estimation of error. This is because of the so-called
edge effect (O'Sullivan & Unwin, 2010), whereby points located on the edges of the blocks of opposite sets are
not separated spatially. Blocking with a buffering strategy overcomes this issue (see cv_buffer
).
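The basic idea of blocking by distance can be sketched in base R with square blocks. This is a simplified illustration, not the function's implementation (which builds polygon blocks, hexagonal by default, and can balance records across folds): the coordinates are simulated and assumed to be in metres, and `block_id`, `block_fold` and `folds_ids` are hypothetical names.

```r
set.seed(7)
# Simulated projected coordinates in metres
xy <- cbind(x = runif(200, 0, 1e6), y = runif(200, 0, 1e6))
size <- 2.5e5  # block size in metres
k <- 5

# Assign every point to a square block by integer division of its coordinates
col_id <- floor((xy[, "x"] - min(xy[, "x"])) / size)
row_id <- floor((xy[, "y"] - min(xy[, "y"])) / size)
block_id <- paste(row_id, col_id, sep = "_")

# Randomly assign whole blocks to folds, keeping block counts per fold even
blocks <- unique(block_id)
block_fold <- sample(rep_len(seq_len(k), length(blocks)))
names(block_fold) <- blocks

# Every point inherits the fold of its block
folds_ids <- unname(block_fold[block_id])
table(folds_ids)
```

Points on either side of a shared block edge can still be very close to each other, which is the edge effect mentioned above and the motivation for buffering.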
An object of class S3. A list of objects including:
folds_list - a list containing the folds. Each fold has two vectors with the training (first) and testing (second) indices
folds_ids - a vector of values indicating the number of the fold for each observation (each number corresponds to the same point in species data)
biomod_table - a matrix with the folds to be used in biomod2 package
k - number of the folds
size - input size, if not null
column - the name of the column if provided
blocks - spatial polygon of the blocks
records - a table with the number of points in each category of training and testing
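A `folds_list` of the shape described above (training indices first, testing indices second) can be consumed in a simple evaluation loop. The sketch below builds a mock `folds_list` by hand instead of calling the package, and the data, model and error measure are hypothetical choices for illustration only.

```r
set.seed(5)
# Simulated data: one predictor and a continuous response
dat <- data.frame(x1 = rnorm(60))
dat$y <- 2 * dat$x1 + rnorm(60, sd = 0.5)

# Mock folds_list with three folds, mimicking the documented structure
fold_of <- rep(1:3, each = 20)
folds_list <- lapply(1:3, function(f) {
  list(which(fold_of != f), which(fold_of == f))
})

# Fit on the training indices and evaluate on the testing indices of each fold
rmse <- sapply(folds_list, function(fold) {
  train <- dat[fold[[1]], ]
  test <- dat[fold[[2]], ]
  fit <- lm(y ~ x1, data = train)
  sqrt(mean((test$y - predict(fit, newdata = test))^2))
})
rmse
```

The same loop applies to the folds produced by any of the blocking functions, since they share this list structure.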
Bahn, V., & McGill, B. J. (2012). Testing the predictive performance of distribution models. Oikos, 122(3), 321-331.
O'Sullivan, D., Unwin, D.J., (2010). Geographic Information Analysis, 2nd ed. John Wiley & Sons.
Roberts et al., (2017). Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. Ecography. 40: 913-929.
Wenger, S.J., Olden, J.D., (2012). Assessing transferability of ecological models: an underappreciated aspect of statistical validation. Methods Ecol. Evol. 3, 260-267.
cv_buffer
and cv_cluster
; cv_spatial_autocor
and cv_block_size
for selecting block size
For CV.user.table see BIOMOD_Modeling
in biomod2 package
library(blockCV)

# import presence-absence species data
points <- read.csv(system.file("extdata/", "species.csv", package = "blockCV"))
# make an sf object from data.frame
pa_data <- sf::st_as_sf(points, coords = c("x", "y"), crs = 7845)

# hexagonal spatial blocking by specified size and random assignment
sb1 <- cv_spatial(x = pa_data,
                  column = "occ",
                  size = 450000,
                  k = 5,
                  selection = "random",
                  iteration = 50)

# spatial blocking by row/column and systematic fold assignment
sb2 <- cv_spatial(x = pa_data,
                  column = "occ",
                  rows_cols = c(8, 10),
                  k = 5,
                  hexagon = FALSE,
                  selection = "systematic")
This function provides a quantitative basis for choosing block size. The spatial autocorrelation in either the
spatial sample points or all continuous predictor variables available as raster layers is assessed and reported.
The response (as defined by column) in spatial sample points can be binary, such as species distribution data,
or continuous, such as soil organic carbon. The function estimates spatial autocorrelation ranges of all input
raster layers or the response data. This is the distance beyond which observations are effectively independent and is determined by
constructing the empirical variogram, a fundamental geostatistical tool for measuring spatial autocorrelation.
The empirical variogram models the structure of spatial autocorrelation by measuring variability between all possible
pairs of points (O'Sullivan and Unwin, 2010). Results are plotted. See the details section for further information.
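To make the idea of an empirical variogram concrete, here is a minimal base R sketch on simulated data. It computes half the mean squared difference between values for all point pairs, grouped into distance bins. It is an illustration of the concept only and is not how the function itself fits variograms (see autofitVariogram from automap); all names and values are hypothetical.

```r
set.seed(42)
n <- 200

# hypothetical sample locations in a unit square and a spatially structured value
xy <- cbind(runif(n), runif(n))
z <- sin(4 * xy[, 1]) + rnorm(n, sd = 0.2)

d  <- as.matrix(dist(xy))                        # pairwise distances
sq <- outer(z, z, function(a, b) (a - b)^2)      # pairwise squared differences

# use each pair once and group pairs into distance bins of width 0.1
pairs <- upper.tri(d)
bins  <- cut(d[pairs], breaks = seq(0, 1, by = 0.1))

# empirical semivariance: half the mean squared difference within each bin
gamma <- tapply(sq[pairs], bins, mean) / 2
gamma
```

Semivariance typically rises with distance and then levels off; the distance at which it levels off is the range.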
cv_spatial_autocor(
  r,
  x,
  column = NULL,
  num_sample = 5000L,
  deg_to_metre = 111325,
  plot = TRUE,
  progress = TRUE,
  ...
)
r |
a terra SpatRaster object. If provided (and |
x |
a simple features (sf) or SpatialPoints object of spatial sample data (e.g., species binary or continuous data). |
column |
character; indicating the name of the column in which response variable (e.g. species data as a binary response i.e. 0s and 1s) is stored for calculating spatial autocorrelation range. This supports multiple column names. |
num_sample |
integer; the number of sample points of each raster layer to fit variogram models. It is 5000 by default; however, it can be increased by the user to represent their region well (relevant to the extent and resolution of rasters). |
deg_to_metre |
integer. The conversion rate of degrees to metres. |
plot |
logical; whether to plot the results. |
progress |
logical; whether to show a progress bar. |
... |
additional option for |
The input raster layers should be continuous for computing the variograms and estimating the range of spatial autocorrelation. The input rasters should also have a specified coordinate reference system. However, if the reference system is not specified, the function attempts to guess it based on the extent of the map. It assumes an un-projected reference system for layers with extent lying between -180 and 180.
Variograms are calculated based on the distances between pairs of points, so un-projected rasters (in degrees) will
not give an accurate result (especially over large latitudinal extents). For un-projected rasters, the great circle distance
(rather than Euclidean distance) is used to calculate the spatial distances between pairs of points. To
enable a more accurate estimate, it is recommended to transform un-projected maps (geographic coordinate
system / latitude-longitude) to a projected metric reference system (e.g. UTM or Lambert) where possible.
See autofitVariogram from automap and variogram from gstat packages for further information.
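The arithmetic behind the deg_to_metre conversion can be sketched in base R. The default of 111325 is the approximate length of one degree in metres at the Equator; the latitude adjustment shown below is an added illustration of why degree-based distances become less accurate away from the Equator, and is not something the function is documented to apply.

```r
deg_to_metre <- 111325   # approximate metres per degree at the Equator

# a range expressed in metres converted to map units of an un-projected layer
range_m   <- 450000
range_deg <- range_m / deg_to_metre
range_deg   # about 4.04 degrees

# illustration only: one degree of longitude shrinks with the cosine of latitude
lat <- 40
lon_deg_m <- deg_to_metre * cos(lat * pi / 180)
lon_deg_m   # noticeably shorter than at the Equator
```

This is one reason a projected metric reference system is preferable when the study area spans a large latitudinal extent.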
An object of class S3. A list object including:
range - the suggested range (i.e. size), which is the median of all calculated ranges in case of 'r'.
range_table - a table of input covariates names and their autocorrelation range
plots - the output plot (the plot is shown by default)
num_sample - number of samples from 'r' used for analysis
variograms - fitted variograms for all layers
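A small base R sketch of how the suggested range relates to the per-layer ranges. The table below is hypothetical (the column names and values are illustrative, not package output); it shows why the median is a reasonable summary when one layer has a much longer range than the others.

```r
# hypothetical per-layer autocorrelation ranges in metres
range_table <- data.frame(
  layers = c("bio_1", "bio_2", "bio_3", "bio_4", "bio_5"),
  range  = c(310000, 420000, 450000, 520000, 2900000)
)

suggested <- median(range_table$range)
suggested                 # 450000
mean(range_table$range)   # 920000, pulled up by the single long-range layer
```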
O'Sullivan, D., Unwin, D.J., (2010). Geographic Information Analysis, 2nd ed. John Wiley & Sons.
Roberts et al., (2017). Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. Ecography. 40: 913-929.
library(blockCV)

# import presence-absence species data
points <- read.csv(system.file("extdata/", "species.csv", package = "blockCV"))
# make an sf object from data.frame
pa_data <- sf::st_as_sf(points, coords = c("x", "y"), crs = 7845)

# load raster data
path <- system.file("extdata/au/", package = "blockCV")
files <- list.files(path, full.names = TRUE)
covars <- terra::rast(files)

# spatial autocorrelation of a binary/continuous response
sac1 <- cv_spatial_autocor(x = pa_data,
                           column = "occ", # binary or continuous data
                           plot = TRUE)

# spatial autocorrelation of continuous raster files
sac2 <- cv_spatial_autocor(r = covars,
                           num_sample = 5000,
                           plot = TRUE)

# show the result
summary(sac2)
This function is deprecated and will be removed in future updates! Please use cv_cluster instead!
envBlock(
  rasterLayer,
  speciesData,
  species = NULL,
  k = 5,
  standardization = "normal",
  rasterBlock = TRUE,
  sampleNumber = 10000,
  biomod2Format = TRUE,
  numLimit = 0,
  verbose = TRUE
)
rasterLayer |
A raster object of covariates to identify environmental groups. |
speciesData |
A simple features (sf) or SpatialPoints object containing species data (response variable). |
species |
Character. Indicating the name of the field in which species data (binary response i.e. 0 and 1) is stored. If |
k |
Integer value. The number of desired folds for cross-validation. The default is |
standardization |
Standardize input raster layers. Three possible inputs are "normal" (the default), "standard" and "none". See details for more information. |
rasterBlock |
Logical. If TRUE, the clustering is done in the raster layer rather than species data. See details for more information. |
sampleNumber |
Integer. The number of samples from raster layers to build the clusters. |
biomod2Format |
Logical. Creates a matrix of folds that can be directly used in the biomod2 package as a DataSplitTable for cross-validation. |
numLimit |
Integer value. The minimum number of points in each category of data (train_0, train_1, test_0 and test_1). Shows a message if the number of points in any of the folds happens to be less than this number. |
verbose |
Logical. To print the report of the records per fold. |
This function is deprecated! Please use cv_plot function for plotting the folds.
foldExplorer(blocks, rasterLayer, speciesData)
blocks |
deprecated! |
rasterLayer |
deprecated! |
speciesData |
deprecated! |
This function is deprecated and will be removed in future updates! Please use cv_block_size instead!
rangeExplorer(
  rasterLayer,
  speciesData = NULL,
  species = NULL,
  rangeTable = NULL,
  minRange = NULL,
  maxRange = NULL
)
rasterLayer |
raster layer used for making the plot |
speciesData |
a simple features (sf) or SpatialPoints object containing species data (response variable). If provided, the species data will be shown on the map. |
species |
character value indicating the name of the field in which the species data (response variable e.g. 0s and 1s) are stored. If provided, species presence and absence data will be shown in different colours. |
rangeTable |
deprecated option! |
minRange |
a numeric value to set the minimum possible range for creating spatial blocks. It is used to limit the searching domain of spatial block size. |
maxRange |
a numeric value to set the maximum possible range for creating spatial blocks. It is used to limit the searching domain of spatial block size. |
This function is deprecated and will be removed in future updates! Please use cv_spatial_autocor instead!
spatialAutoRange(
  rasterLayer,
  sampleNumber = 5000L,
  border = NULL,
  speciesData = NULL,
  doParallel = NULL,
  nCores = NULL,
  showPlots = TRUE,
  degMetre = 111325,
  maxpixels = 1e+05,
  plotVariograms = FALSE,
  progress = TRUE
)
rasterLayer |
A raster object of covariates to find spatial autocorrelation range. |
sampleNumber |
Integer. The number of sample points of each raster layer to fit variogram models. It is 5000 by default; however, it can be increased by the user to represent their region well (relevant to the extent and resolution of rasters). |
border |
deprecated option! |
speciesData |
A spatial or sf object (optional). If provided, the |
doParallel |
deprecated option! |
nCores |
deprecated option! |
showPlots |
Logical. Show final plot of spatial blocks and autocorrelation ranges. |
degMetre |
Numeric. The conversion rate of metres to degree. This is for constructing spatial blocks for visualisation. When the input map is in geographic coordinate system (decimal degrees), the block size is calculated by dividing the calculated range by this value to convert to the input map's unit (by default 111325; the standard distance of a degree in metres, on the Equator). |
maxpixels |
Number of random pixels to select the blocks over the study area. |
plotVariograms |
deprecated option! |
progress |
Logical. Shows progress bar. It works only when |
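A hedged migration sketch from spatialAutoRange to cv_spatial_autocor. The argument correspondence (rasterLayer to r, sampleNumber to num_sample, degMetre to deg_to_metre, showPlots to plot) is inferred from the two usage signatures and should be treated as an assumption rather than an official mapping.

```r
library(blockCV)

# raster covariates shipped with the package
path <- system.file("extdata/au/", package = "blockCV")
covars <- terra::rast(list.files(path, full.names = TRUE))

# deprecated call, shown for comparison only:
# spatialAutoRange(rasterLayer = covars, sampleNumber = 5000,
#                  degMetre = 111325, showPlots = FALSE)

# assumed equivalent with the current function
sac <- cv_spatial_autocor(r = covars,
                          num_sample = 5000,
                          deg_to_metre = 111325,
                          plot = FALSE,
                          progress = FALSE)

# the suggested range (median of the per-layer ranges)
sac$range
```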
This function is deprecated and will be removed in future updates! Please use cv_spatial instead!
spatialBlock(
  speciesData,
  species = NULL,
  rasterLayer = NULL,
  theRange = NULL,
  rows = NULL,
  cols = NULL,
  k = 5L,
  selection = "random",
  iteration = 100L,
  blocks = NULL,
  foldsCol = NULL,
  numLimit = 0L,
  maskBySpecies = TRUE,
  degMetre = 111325,
  border = NULL,
  showBlocks = TRUE,
  biomod2Format = TRUE,
  xOffset = 0,
  yOffset = 0,
  extend = 0,
  seed = 42,
  progress = TRUE,
  verbose = TRUE
)
speciesData |
A simple features (sf) or SpatialPoints object containing species data (response variable). |
species |
Character (optional). Indicating the name of the column in which species data (response variable e.g. 0s and 1s) is stored.
This argument is used to make folds with evenly distributed records. This option only works by random fold selection and with binary or
multi-class responses e.g. species presence-absence/background or land cover classes for remote sensing image classification.
If |
rasterLayer |
A raster object for visualisation (optional). If provided, this will be used to specify the blocks covering the area. |
theRange |
Numeric value of the specified range by which blocks are created and training/testing data are separated.
This distance should be in metres. The range could be explored by |
rows |
Integer value by which the area is divided into latitudinal bins. |
cols |
Integer value by which the area is divided into longitudinal bins. |
k |
Integer value. The number of desired folds for cross-validation. The default is |
selection |
Type of assignment of blocks into folds. Can be random (default), systematic, checkerboard, or predefined. The checkerboard does not work with user-defined spatial blocks. If the selection = 'predefined', user-defined blocks and foldsCol must be supplied. |
iteration |
Integer value. The number of attempts to create folds that fulfil the set requirement for minimum number
of points in each training and testing fold (for each response class e.g. train_0, train_1, test_0
and test_1), as specified by |
blocks |
A sf or SpatialPolygons object to be used as the blocks (optional). This can be a user defined polygon and it must cover all the species (response) points. If the selection = 'predefined', this argument (and foldsCol) must be supplied. |
foldsCol |
Character. Indicating the name of the column (in user-defined blocks) in which the associated folds are stored. This argument is necessary if you choose the 'predefined' selection. |
numLimit |
deprecated option! |
maskBySpecies |
Since version 1.1, this option is always set to |
degMetre |
Integer. The conversion rate of metres to degree. See the details section for more information. |
border |
deprecated option! |
showBlocks |
Logical. If TRUE the final blocks with fold numbers will be created with ggplot and plotted. A raster layer could be specified
in |
biomod2Format |
Logical. Creates a matrix of folds that can be directly used in the biomod2 package as a DataSplitTable for cross-validation. |
xOffset |
Numeric value between 0 and 1 for shifting the blocks horizontally. The value is the proportion of block size. |
yOffset |
Numeric value between 0 and 1 for shifting the blocks vertically. The value is the proportion of block size. |
extend |
numeric; This parameter specifies the percentage by which the map's extent is expanded to increase the size of the square spatial blocks, ensuring that all points fall within a block. The value should be a numeric between 0 and 5. |
seed |
Integer. A random seed for reproducibility. |
progress |
Logical. If TRUE shows a progress bar when |
verbose |
Logical. To print the report of the records per fold. |
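A hedged migration sketch from spatialBlock to cv_spatial. The argument correspondence used here (speciesData to x, species to column, theRange to size) is an assumption inferred from the argument descriptions, and hexagon = FALSE is used on the assumption that square blocks are the closest match to the deprecated function's behaviour.

```r
library(blockCV)

# species data shipped with the package
points <- read.csv(system.file("extdata/", "species.csv", package = "blockCV"))
pa_data <- sf::st_as_sf(points, coords = c("x", "y"), crs = 7845)

# deprecated call, shown for comparison only:
# sb_old <- spatialBlock(speciesData = pa_data, species = "occ",
#                        theRange = 450000, k = 5,
#                        selection = "random", iteration = 50)

# assumed equivalent with the current function
sb_new <- cv_spatial(x = pa_data,
                     column = "occ",
                     size = 450000,
                     k = 5,
                     hexagon = FALSE,
                     selection = "random",
                     iteration = 50)

# number of folds stored in the returned object
sb_new$k
```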