---
title: "Introduction to zentracloud"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to zentracloud}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Purpose

The package is designed to act as a direct access point to the ZENTRA Cloud API.
With a valid token, data for a chosen period can be directly loaded into R.
Further, the data is saved into a cache, so that repeated queries for the same
time period are performed much faster.

**IMPORTANT** 

The package is currently not well suited to downloading large amounts
of data. The ZENTRA Cloud API is limited to 2000 readings per minute, so
requests made via the package functions are throttled. For long time periods we
therefore recommend continuing to use the ZENTRA Cloud web interface.

## Usage

### Installation

```{r install, eval=FALSE}
# install from GitLab
url = "https://gitlab.com/meter-group-inc/pubpackages/zentracloud"
remotes::install_git(url = url)
```


```{r load}
# load package
library(zentracloud)
```

There might be some start-up messages, which will be explained later on.

### Token

A valid API token is a prerequisite for using the package. Tokens
can be generated in the ZENTRA Cloud web interface: open the menu item
API in the sidebar. If a valid token exists, it is shown there and can be copied.
If not, there is an option to add a new key, which generates a token.

![](imgs/token.png){width="100%"}

The functions read the token from an option that has to be set for the duration
of the R session. To have it set in every session,
the option can be written, for example, into the `.Rprofile`.

To set the token for the session use the function `setZentracloudOptions()`. Its
arguments are the `token`, the corresponding `domain` and any of the
other three options that control cache management (details below).

The domain determines which server the API queries. To find
out which domains exist, see the help page of `setZentracloudOptions()`. If you
are unsure which domain your token is valid for, check the URL of your
ZENTRA Cloud web interface.

If the URL starts with `zentracloud.com` use `default`,
if it starts with `aroya.zentracloud.com` use `aroya`, and so on.


```{r token, eval = FALSE}
# set token as option
setZentracloudOptions(token = "<your_token>", domain = "<corresponding_domain>")

# set token in .Rprofile
# open profile
usethis::edit_r_profile()

# add token and domain
options("ZENTRACLOUD_TOKEN" = "<your_token>")
options("ZENTRACLOUD_DOMAIN" = "<corresponding_domain>")

```
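Because every API call depends on these two options, it can help to fail early when one of them is missing. The following is a minimal sketch and not part of the package: the helper name `check_zentracloud_options()` is hypothetical, and it only assumes the option names `ZENTRACLOUD_TOKEN` and `ZENTRACLOUD_DOMAIN` shown above.

```r
# Hypothetical helper (not exported by zentracloud): stop with a clear message
# if the token or domain option is unset, NA or an empty string.
check_zentracloud_options = function() {
  needed = c("ZENTRACLOUD_TOKEN", "ZENTRACLOUD_DOMAIN")
  is_set = vapply(
    needed
    , function(nm) {
      val = getOption(nm)
      is.character(val) && length(val) == 1L && !is.na(val) && nzchar(val)
    }
    , logical(1)
  )
  if (!all(is_set)) {
    stop("Missing option(s): ", paste(needed[!is_set], collapse = ", "))
  }
  invisible(TRUE)
}
```

Calling `check_zentracloud_options()` at the top of a script would then stop with a readable message before any request is sent.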

### Cache

The other options that can be set for this package all concern the cache: most
importantly the cache directory, but also the maximum allowed cache
size and file age.

Default values are set when the package is loaded, unless the options have
already been defined. The defaults are:

-   ZENTRACLOUD_CACHE_MAX_SIZE: 500 kB
-   ZENTRACLOUD_CACHE_MAX_AGE: 7 days

For the directory the default changes depending on the operating system. For
instance, the default for Linux is:

-   ZENTRACLOUD_CACHE_DIR: `~/.cache/R/zentracloud`

The path is determined using this function:

```{r cache_path, eval = FALSE}
tools::R_user_dir("zentracloud", which = "cache")
```

As with the token, these options can also be changed using
`setZentracloudOptions()`, or set more permanently in the `.Rprofile`.

To see all currently set options use `getZentracloudOptions()`:

```{r get_options, eval = FALSE}
getZentracloudOptions()
#> <zentracloudOptions>
#>   ZENTRACLOUD_CACHE_DIR     : /home/<user>/.cache/R/zentracloud
#>   ZENTRACLOUD_CACHE_MAX_AGE : 7
#>   ZENTRACLOUD_CACHE_MAX_SIZE: 500
#>   ZENTRACLOUD_DOMAIN        : zentracloud.com
#>   ZENTRACLOUD_TOKEN         : <-- hidden -->
```

If the cache directory already contains files when the package is loaded, some
checks run automatically:

- Files older than the maximum allowed age are deleted. In that case, a message
  reports how many files were removed.

- If the size of the cache directory still exceeds the maximum
  allowed size afterwards, a warning is printed. It is then up to the user to
  delete or move further files.
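
To get a feeling for what these checks look at, the same information can be gathered by hand with base R. This is only a sketch of the idea, not the package's internal code: the function name `inspect_cache()` is hypothetical, it uses the file modification time as the age, and it reports the total size in bytes (how the package itself measures size may differ).

```r
# Hypothetical helper: report files older than `max_age_days` and the total
# size (in bytes) of everything below `cache_dir`. Nothing is deleted.
inspect_cache = function(cache_dir, max_age_days = 7) {
  files = list.files(cache_dir, recursive = TRUE, full.names = TRUE)
  info = file.info(files)
  age_days = as.numeric(difftime(Sys.time(), info$mtime, units = "days"))
  list(
    outdated = files[age_days > max_age_days]
    , total_bytes = sum(info$size)
  )
}
```

For example, `inspect_cache(tools::R_user_dir("zentracloud", which = "cache"))` would list outdated files in the default cache location without changing anything.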

To manually clear the cache of files older than a certain age, use the function
`clearCache()`. If the argument `cache_dir` is not provided,
the function reads the directory from the options.
Any path can be given, as long as the cached
files follow the structure that is created automatically when running
`getReadings()` (described below).
The argument `file_age` takes an integer, so the value has to be written with
the `L` suffix (e.g. `5L`). If it is not provided,
the function uses the value stored in the options.

```{r cache, eval = FALSE}
clearCache(file_age = 5L)
```

To load everything that is currently in your cache use `readCache()`. This will
return a nested list with the data sorted by device and sensor. If argument
`cache_dir` is not provided, the function will use the cache directory set in
the options.

```{r readCache, eval = FALSE}
cached_data = readCache()
```

### Data

To access the API and request data, use the function `getReadings()`. Some notes
on the arguments of the function:

- The required arguments are the device serial number
  and the start and end datetime of the period of interest.

- Start and end time need to be provided in the format *"YYYY-MM-DD hh:mm:ss"*
  and have to be given in the *logger time zone*!

- If `force_api = TRUE`, the cache is bypassed and the query goes straight
  to the API. The results are still written to the cache.
  
- If `ignore_cache = TRUE`, the function internally uses a temporary directory as
  cache during processing. No data is written to the cache directory set
  in the options. Be aware, though, that no data is read from the cache either,
  so the run time may increase.
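
If the period of interest is available as a `POSIXct` value in another time zone, it has to be converted to the logger time zone and formatted before it is passed on. A minimal sketch with base R follows; the time zone `"Europe/Berlin"` is only an assumed example and has to be replaced by the actual time zone of the logger.

```r
# A UTC timestamp, e.g. from another workflow
t_utc = as.POSIXct("2022-06-01 10:30:00", tz = "UTC")

# Express it in the (assumed) logger time zone and format it as
# "YYYY-MM-DD hh:mm:ss"
start_time = format(t_utc, tz = "Europe/Berlin", format = "%Y-%m-%d %H:%M:%S")
start_time
#> [1] "2022-06-01 12:30:00"
```

The resulting character string has the shape expected for `start_time` and `end_time`.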

When the function runs, it first checks whether the queried data (or
parts of it) are already in the cache. If so, it loads them from there; if
not, it accesses the ZENTRA Cloud API and requests the data.

At most 2000 entries can be downloaded at once. For periods longer
than around 20 days (with a measurement interval of 15 minutes),
the response is therefore paginated, meaning that the data has to be downloaded
in chunks.
Between chunks, a pause of 60 seconds has to be observed,
so requesting larger amounts of data takes a while.
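
The arithmetic behind this can be sketched as follows. The helper `estimate_pages()` is hypothetical and only restates the numbers given above (2000 entries per chunk, 60 seconds between chunks); it treats one timestamp as one entry and says nothing about how the API counts readings across several sensors.

```r
# Rough estimate of the number of chunks and the minimum waiting time,
# assuming 2000 entries per chunk and a 60 second pause between chunks.
estimate_pages = function(n_days, interval_min = 15, page_size = 2000, pause_sec = 60) {
  n_readings = n_days * 24 * 60 / interval_min
  n_pages = ceiling(n_readings / page_size)
  list(
    readings = n_readings
    , pages = n_pages
    , min_wait_sec = max(n_pages - 1, 0) * pause_sec
  )
}

estimate_pages(n_days = 90)
```

For 90 days at a 15 minute interval this gives 8640 entries, 5 chunks and at least 240 seconds of waiting, not counting the time of the requests themselves.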

The chunks are written to the cache separately to avoid memory shortages. The
data is written as *.parquet* files, a highly efficient format, both with
regard to the storage space it uses and to reading and writing speed. (More
info on the format can be found [in a short blog
post](https://www.r-bloggers.com/2021/09/understanding-the-parquet-file-format/)
and [on the arrow GitHub page](https://github.com/apache/arrow).)

Within the cache, a directory is created for the queried device, inside which
the data is written partitioned by sensor, year and month. For the example query
below, this creates the following directory tree:

![](imgs/dir_tree.png)

This is the structure that `clearCache()` needs to work reliably.
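
To see which files are currently stored, the cache directory can be listed with base R. This is a small sketch that makes no assumption about the exact folder names; it falls back to the default location from `tools::R_user_dir()` if the option `ZENTRACLOUD_CACHE_DIR` is not set.

```r
# Cache directory from the options, or the default location if unset
cache_dir = getOption(
  "ZENTRACLOUD_CACHE_DIR"
  , default = tools::R_user_dir("zentracloud", which = "cache")
)

# All cached parquet files, with paths relative to the cache directory
parquet_files = list.files(
  cache_dir
  , pattern = "\\.parquet$"
  , recursive = TRUE
)
head(parquet_files)
```

A single file from this list could then be inspected with `arrow::read_parquet()`, assuming the arrow package is installed.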

```{r readings, eval=FALSE}
setZentracloudOptions(
  token = Sys.getenv("ZENTRACLOUD_TOKEN")
  , domain = "default"
)

zentra_data = getReadings(
  device_sn = "06-01185"
  , start_time = "2022-06-01 00:00:00"
  , end_time = "2022-06-14 23:59:00"
  , force_api = FALSE
  , ignore_cache = FALSE
)
```

```{r show-readings}
str(zentra_data, max.level = 3, give.attr = FALSE)
```

The data that is returned in `zentra_data` is a list with entries for all the 
sensors connected to the queried device. 
Each entry in turn contains a `data.frame` with columns for the date and time
specifications and for each variable measured, as well as the corresponding
error flags and descriptions.

Attached as attributes to the value columns are the corresponding unit and 
measurement precision. To access the attributes do the following:

```{r attributes}

# gives back the names of the list items, i.e. the sensor names
attributes(zentra_data)

# querying the attributes of single columns gives back the measurement unit 
# and precision
vars = c("saturation_extract_ec.value", "soil_temperature.value", "water_content.value")
vars_attr = sapply(
  vars
  , \(v) {attributes(zentra_data$`T12-0000248_port5`[[v]])}
  , simplify = FALSE
  )
str(vars_attr)
```

> **NOTE:**
> 
> Variable names in the returned `zentra_data` are taken from the API response.
> This should ensure compatibility with data downloaded via other methods
> (e.g. as *csv* from ZENTRA Cloud).
> We suggest making the names syntactically valid before continuing data analysis
> (e.g. with `base::make.names()` or `janitor::clean_names()`).
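
As a short illustration of what such a clean-up does, the sketch below applies `make.names()` to a few made-up column names. The names are hypothetical and chosen only to show the effect; they are not taken from an actual API response.

```r
# Hypothetical column names as they might appear in an export
raw_names = c("soil temperature", "water content (m3/m3)", "2nd reading")

# Replace invalid characters with "." and prefix names that start with a digit
clean_names = make.names(raw_names, unique = TRUE)
clean_names
```

For a `data.frame` the result can be assigned back with `names(df) = make.names(names(df), unique = TRUE)`.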


### Settings

`getReadings()` queries the device settings internally,
which is necessary to handle the timestamps accurately.
The function that does this, `queryDeviceSettings()`, can also be called on its own.

As this function also accesses the API, a token and a device serial number are
necessary.

From the settings, information such as measurement intervals, location and 
time settings can be read. The returned object is a nested list.

```{r settings, eval=FALSE}
set = queryDeviceSettings(
  device_sn = "06-01185"
)

# this function allows a quick view into the data structure of the list:
listviewer::jsonedit(set)
```

![](imgs/README-settings-new.png){width="100%"}