Architecture: API Pipeline, XML Engine & Caching
Source: vignettes/architecture.Rmd
Overview
This document describes three interconnected internal systems that
power every data-retrieval function in entsoeapi:
- API Request Pipeline — validates user input, builds the ENTSO-E query URL, sends the HTTP request, handles errors, and transparently paginates oversized queries.
- XML-to-Table Engine — converts raw XML (or lists of XML documents from paginated responses) into clean, typed, enriched R tibbles.
- Caching System — stores EIC code tables and lookup tables in memory for one hour to avoid redundant downloads during a session.
The three systems form a strict linear pipeline. A user-facing
function (e.g., load_actual_total()) calls the API
pipeline, which produces one or more XML documents. Those documents are
handed to the XML-to-table engine, which calls into the caching system
to enrich results with human-readable names and definitions before
returning a tibble to the user.
Architecture Diagram
User function
     |
     v
[Validate params / url_posixct_format]
     |
     v
[Build query string]
     |
     v
[api_req_safe] <---------------------------------+
     |                                           |
     v                                           |
[api_req --> GET request, 60s timeout]           |
     |                                           |
     |                                           |
     +-- HTTP 200 / zip --> [read_zipped_xml] ---+--> [extract_response]
     |                                           |            |
     +-- HTTP 200 / xml --> [resp_body_xml] -----+            v
     |                                                 [xml_to_table,
     |                                                  per document]
     +-- error 503 -------> [req_retry, 3x / 10s]             |
     |                                |                       v
     |                                v          [extract_leaf/twig/branch]
     |                     [cli_abort if all fail]            |
     |                                                        v
     +-- error HTML ------> [cli_abort]              [type conversions]
     |                                                        |
     +-- exceeds max -----> [calc_offset_urls,                v
                             pagination]                [my_snakecase
                                    |                    column rename]
                                    v                         |
                            [api_req (loop)]                  v
                                                        [tidy_or_not:
                                                         A01 / A03 curve]
                                                              |
                                                              v
                                                      [add_type_names /
                                                       add_eic_names /
                                                       add_definitions]
                                                              |
                                                   +----------+----------+
                                                   |                     |
                                             type/eic/def          type/eic/def
                                              cache hit?            cache miss?
                                                   |                     |
                                                   v                     v
                                            [return cached      [download CSV/XML,
                                             lookup table]       cache result]
                                                   |                     |
                                                   +--------->+<---------+
                                                              |
                                                              v
                                                     [Whitelist columns,
                                                      sort rows]
                                                              |
                                                              v
                                                   Tibble returned to user
1. API Request Pipeline
1.1 User-facing functions
Location: R/en_load.R,
R/en_generation.R, R/en_transmission.R,
R/en_market.R, R/en_outages.R,
R/en_balancing.R
Every public function follows the same four-step structure:
Step 1 — Argument validation. Each parameter is
checked with checkmate before any network call is made.
Common checks:
- EIC codes: two-stage check. checkmate::assert_string() first enforces exactly 16 characters matching ^[A-Z0-9-]*$, then assert_eic() verifies that the 16th character is the correct weighted-modulo-37 checksum of the first 15 characters (see below).
- Security token: non-empty string (sourced from Sys.getenv("ENTSOE_PAT"))
- Time ranges: difference between period_end and period_start must not exceed 365 days
- Categorical parameters (e.g., business_type, process_type): validated against allowed values via checkmate::assert_choice()
there_is_provider()
(R/utils.R, exported): A lightweight connectivity check
that sends a dummy request to the ENTSO-E API and returns
TRUE when the server responds with HTTP 401 (meaning the
endpoint is reachable but the token was rejected as expected). Returns
FALSE when no internet connection is available or the
server is unreachable. Its primary role is as an
@examplesIf guard in package documentation, ensuring
examples are only executed when the API is accessible.
EIC checksum validation — assert_eic() and
possible_eic_chars (R/utils.R): The
ENTSO-E EIC standard defines a check character at position 16. The named
integer vector possible_eic_chars maps the 37 allowed
characters (0–9, A–Z, -) to
integers 0–36. assert_eic() computes a weighted sum of the
first 15 characters (weights 16 down to 2), derives the expected check
character via (36 - (sum - 1) %% 37) + 1, and aborts with
an informative message if it does not match the actual 16th character.
An optional null_ok = TRUE argument allows
NULL to pass validation (used by functions that accept an
optional EIC parameter). This function is not exported; it is called
internally immediately after each
checkmate::assert_string() EIC check.
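The check-character rule described above can be sketched in a few lines of base R. This is an illustrative re-implementation under the stated weights and formula, not the package's assert_eic(); the names eic_chars, eic_values and eic_check_char() are hypothetical.

```r
# Hypothetical re-implementation of the EIC check-character rule described
# above; it is not the package's assert_eic().
eic_chars  <- c(0:9, LETTERS, "-")        # the 37 allowed characters
eic_values <- setNames(0:36, eic_chars)   # character -> integer 0..36

eic_check_char <- function(eic15) {
  stopifnot(is.character(eic15), length(eic15) == 1L, nchar(eic15) == 15L)
  chars  <- strsplit(eic15, "", fixed = TRUE)[[1]]
  values <- eic_values[chars]
  stopifnot(!anyNA(values))               # reject characters outside the set
  weighted_sum <- sum(values * 16:2)      # weights 16 down to 2
  index <- (36 - (weighted_sum - 1) %% 37) + 1   # 1-based position in eic_chars
  eic_chars[index]
}

eic_check_char("10YDE-VE-------")   # "2", so the full code is 10YDE-VE-------2
```

A validator along these lines would compare the returned character with the 16th character of the supplied code and abort on a mismatch.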
Step 2 — Timestamp conversion.
url_posixct_format() (R/utils.R) converts the
user-supplied period_start / period_end to the
format required by the API: YYYYMMDDHHMM in UTC. Accepts
POSIXct objects or character strings in nine common formats. Aborts with
a clear message if the input cannot be parsed. Warns when character
input is interpreted as UTC.
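The target format can be illustrated with base R alone. The helper below is a minimal sketch, not url_posixct_format() itself: the name to_api_timestamp() is hypothetical, and it assumes character input in a single "YYYY-MM-DD HH:MM:SS" layout rather than the nine formats the package accepts.

```r
# Hypothetical helper: format one timestamp as YYYYMMDDHHMM in UTC.
to_api_timestamp <- function(x) {
  if (is.character(x)) {
    # Assumption: character input uses "YYYY-MM-DD HH:MM:SS" and is read as UTC.
    x <- as.POSIXct(x, tz = "UTC", format = "%Y-%m-%d %H:%M:%S")
  }
  stopifnot(inherits(x, "POSIXct"), length(x) == 1L, !is.na(x))
  format(x, format = "%Y%m%d%H%M", tz = "UTC")
}

to_api_timestamp("2024-01-01 00:00:00")   # "202401010000"
# A POSIXct value in another time zone is shifted to UTC before formatting.
to_api_timestamp(as.POSIXct("2024-01-01 01:00:00", tz = "Europe/Berlin"))
```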
Step 3 — Query string assembly. Each function
hard-codes the ENTSO-E document type and process type codes for its
endpoint, then appends the user-supplied EIC code(s) and converted
timestamps. Optional parameters (e.g., business_type) are
appended conditionally with if (!is.null(...)).
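A minimal sketch of this assembly step follows. The function name build_query_sketch() is hypothetical, and the periodStart, periodEnd and businessType parameter names are assumptions made for illustration; only the documentType, processType and outBiddingZone_Domain fragments appear elsewhere in this document.

```r
# Hypothetical query builder; parameter names other than documentType,
# processType and outBiddingZone_Domain are assumptions.
build_query_sketch <- function(eic, period_start, period_end,
                               business_type = NULL) {
  query <- paste0(
    "documentType=A65",                 # hard-coded per endpoint
    "&processType=A16",                 # hard-coded per endpoint
    "&outBiddingZone_Domain=", eic,
    "&periodStart=", period_start,
    "&periodEnd=", period_end
  )
  # Optional parameters are appended only when supplied.
  if (!is.null(business_type)) {
    query <- paste0(query, "&businessType=", business_type)
  }
  query
}

build_query_sketch("10Y1001A1001A83F", "202401010000", "202401020000")
```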
Step 4 — Pipeline invocation.
en_cont_list <- api_req_safe(query_string, security_token)
extract_response(content = en_cont_list, tidy_output = tidy_output)
1.2 api_req_safe()
Location: R/utils.R
api_req_safe <- safely(api_req)
A one-liner that wraps api_req() with the package-local
safely() helper (a lightweight tryCatch
wrapper). All R-level exceptions are caught and returned as
list(result = NULL, error = <condition>) rather than
halting execution. This standardised return shape is what
extract_response() expects.
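The package-local helper is not reproduced in this document. A wrapper with the described behaviour could look roughly like the sketch below, where safely_sketch() is a hypothetical stand-in built on tryCatch().

```r
# Hypothetical stand-in for the package-local safely() helper.
safely_sketch <- function(f) {
  function(...) {
    tryCatch(
      list(result = f(...), error = NULL),
      error = function(e) list(result = NULL, error = e)
    )
  }
}

safe_log <- safely_sketch(log)
safe_log(1)$result     # 0
safe_log("a")$error    # a condition object instead of a thrown error
```

Because both outcomes share the list(result, error) shape, a caller can branch on is.null(x$error) without its own tryCatch().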
1.3 api_req()
Location: R/utils.R
The core HTTP function. Steps:
- URL construction. Assembles the full URL from package-level constants defined in R/constants.R:
  - Scheme: .api_scheme ("https://")
  - Domain: .api_domain ("web-api.tp.entsoe.eu/")
  - Path: .api_name ("api?")
  - Appends query_string and &securityToken={token}
  - Logs the URL to the console with the token replaced by <...> to prevent credential leakage.
- Request configuration. Uses httr2::request() with:
  - Method: GET
  - Verbose: response headers only (req_verbose(header_req=FALSE, header_resp=TRUE))
  - Timeout: .req_timeout seconds (60, defined in R/constants.R)
  - Retry: up to 3 attempts with a 10-second backoff, triggered only by HTTP 503 (Service Unavailable) responses. Other HTTP errors are not retried. This guards against transient server-side overload on the ENTSO-E platform.
- Execution. Sent via safely(httr2::req_perform) (the same package-local wrapper) so network errors are captured, not thrown.
- HTTP 200: response routing.
  - application/zip or application/octet-stream: body saved to a temp file, then decompressed by read_zipped_xml().
  - text/xml or application/xml: parsed directly with httr2::resp_body_xml(encoding = "UTF-8").
  - Unknown content-type: aborts with an informative message.
- HTTP errors: error handling. See section 1.4.
Returns either a single xml_document, a list of xml_document objects (paginated or zipped responses), or calls cli::cli_abort().
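The request set-up listed above maps onto a handful of httr2 calls. The sketch below is an assumed reconstruction, not the package's api_req(): mask_token() and build_request_sketch() are hypothetical names, the constants are inlined as literals, and the request is only built, never sent.

```r
# Assumed reconstruction of the request set-up; requires the httr2 package
# when build_request_sketch() is called.
mask_token <- function(url) {
  # Replace the token value so it never reaches the console log.
  sub("securityToken=[^&]*", "securityToken=<...>", url)
}

build_request_sketch <- function(query_string, security_token) {
  url <- paste0("https://", "web-api.tp.entsoe.eu/", "api?",
                query_string, "&securityToken=", security_token)
  message(mask_token(url))
  req <- httr2::request(url)
  req <- httr2::req_method(req, "GET")
  req <- httr2::req_verbose(req, header_req = FALSE, header_resp = TRUE)
  req <- httr2::req_timeout(req, seconds = 60)
  httr2::req_retry(
    req,
    max_tries = 3,
    # Only HTTP 503 counts as transient; other errors are not retried.
    is_transient = function(resp) httr2::resp_status(resp) == 503,
    backoff = function(attempt) 10   # constant 10-second wait
  )
}

mask_token("https://web-api.tp.entsoe.eu/api?documentType=A65&securityToken=abc123")
```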
1.4 Error handling
| Error type | Condition | Action |
|---|---|---|
| Network / R exception | req_perform_safe() returns $error | Propagated via api_req_safe() |
| 503 Service Unavailable | HTTP status 503 | Retried up to 3 times (10 s backoff) via req_retry(); cli_abort() if all attempts fail |
| HTML error page | Response body is HTML | Extract status + body, cli_abort() |
| XML error (code 999, exceeds max) | Body is XML, reason code 999, message contains "exceeds the allowed maximum" | Trigger pagination (see 1.5) |
| XML error (code 999, forbidden) | Same as above but query matches a forbidden pattern | cli_abort() with reason text |
| XML error (other codes) | Body is XML, other reason codes | cli_abort() with reason text |
| JSON error | Body is JSON | Extract uuAppErrorMap.URI_FORMAT_ERROR, cli_abort() |
1.5 Automatic pagination
Location: calc_offset_urls() in
R/utils.R
When the ENTSO-E API returns an XML error with reason code 999 and a
message indicating the result set exceeds the allowed maximum,
api_req() automatically splits the request into smaller
chunks:
- The error message is parsed with a regular expression to extract both the requested and the allowed document counts.
- The number of offset requests needed is calculated: ceiling(docs_requested / docs_allowed).
- Each offset query is built by stripping any existing &offset= from the original query string and appending &offset=0, &offset=N, &offset=2N, …
- api_req() calls itself recursively for each offset query string.
- All responses are collected and returned as a list, which the XML-to-table engine processes element by element.
Pagination is suppressed (and the request is aborted instead) for endpoints known not to support offsets, identified by six hard-coded regular expression patterns covering document types A63, A65, B09, A91, A92, and A94 with specific business or storage types.
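The offset arithmetic can be sketched as follows. This is not calc_offset_urls(): the function name offset_queries_sketch() is hypothetical, the two counts are passed in directly instead of being parsed from the error message, and the step N is assumed to equal the allowed document count.

```r
# Hypothetical sketch of the offset calculation; assumes the step between
# offsets equals the allowed document count.
offset_queries_sketch <- function(query_string, docs_requested, docs_allowed) {
  n_requests <- ceiling(docs_requested / docs_allowed)
  # Strip any offset already present in the original query.
  base_query <- sub("&offset=[0-9]+", "", query_string)
  # Integer arithmetic avoids scientific notation in the pasted values.
  offsets <- (seq_len(n_requests) - 1L) * as.integer(docs_allowed)
  paste0(base_query, "&offset=", offsets)
}

offset_queries_sketch("documentType=A65&offset=50", 250, 100)
# three queries, ending in &offset=0, &offset=100 and &offset=200
```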
1.6 read_zipped_xml()
Location: R/utils.R
Called when the API returns a zip archive. Decompresses the temp file
with safely(utils::unzip) (using the package-local
wrapper), then reads each extracted XML file with
xml2::read_xml(). Returns a list of
xml_document objects — the same shape as a paginated
response, so extract_response() handles both
identically.
2. XML-to-Table Engine
2.1 extract_response()
Location: R/utils.R
Entry point called by every user-facing function. Accepts the
list(result, error) from api_req_safe().
- If $error is not NULL: re-throws the error with cli::cli_abort().
- If $result is a list (paginated or zipped): iterates with lapply(), calling xml_to_table() on each element, showing a progress bar, then combines all results with dplyr::bind_rows() and converts to a tibble.
- If $result is a single xml_document: calls xml_to_table() directly.
- Returns a tibble, or NULL if the API returned no data.
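The dispatch logic above can be sketched in base R. This is an illustration only: extract_response_sketch() is a hypothetical name, the converter is passed in as an argument, stop() and rbind() stand in for cli::cli_abort() and dplyr::bind_rows(), and the progress bar is omitted.

```r
# Hypothetical sketch of the dispatch; base R stand-ins replace cli and dplyr.
extract_response_sketch <- function(content, convert) {
  if (!is.null(content$error)) {
    stop(conditionMessage(content$error), call. = FALSE)
  }
  result <- content$result
  if (is.null(result)) return(NULL)
  # Test the class before treating the result as a list of documents,
  # since a classed object can itself be list-based.
  if (inherits(result, "xml_document")) return(convert(result))
  tables <- lapply(result, convert)
  do.call(rbind, tables)
}

# Toy converter and input, for illustration only.
fake_convert <- function(x) data.frame(value = x)
extract_response_sketch(list(result = list(1, 2), error = NULL), fake_convert)
```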
2.2 xml_to_table()
Location: R/utils.R
Core orchestrator. Receives a single xml_document and
returns a tibble by running a fixed transformation sequence:
- XML parsing → raw wide data frame
- Date/time column merging
- Type conversions (DateTime, numeric)
- Column name normalization
- Time series restructuring
- Metadata enrichment (type names, EIC names, definitions)
- Column whitelist filtering
- Row ordering
2.3 XML parsing
Location: extract_leaf_twig_branch(),
extract_nodesets() in R/utils.R
The ENTSO-E XML schema uses three nesting levels, which the engine labels:
| Level | Definition | Example element |
|---|---|---|
| Leaf | No children | <quantity>100</quantity> |
| Twig | Has direct children only | <Period><resolution>…</resolution></Period> |
| Branch | Has grandchild nodes | <TimeSeries><Period><Point>…</Point></Period></TimeSeries> |
extract_nodesets() converts XML nodesets to
data.table objects using xml2::as_list(),
constructing dotted column names from the element hierarchy (e.g.,
TimeSeries.mRID). NULL values become
NA_character_.
2.4 Column name normalization — my_snakecase()
Location: R/utils.R
Two-pass renaming:
Pass 1 — domain-specific substitutions (applied before snakecase conversion):
| Pattern | Replacement |
|---|---|
| mRID | mrid |
| TimeSeries | ts |
| ^process | (removed) |
| unavailability_Time_Period | unavailability |
| XML namespace / attribute artifacts | (removed) |
Pass 2 — standard snakecase via
snakecase::to_snake_case(), followed by cleanup passes that
collapse redundant fragments (e.g., psr_type_psr_type →
psr_type).
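The two-pass idea can be approximated in base R. The sketch below is not my_snakecase(): snake_sketch() is a hypothetical name, a pair of regular expressions stands in for snakecase::to_snake_case() (and will differ from it on acronym-heavy names), and only one cleanup rule is shown.

```r
# Hypothetical two-pass renamer; base-R regexes replace snakecase here.
snake_sketch <- function(x) {
  # Pass 1: domain-specific substitutions.
  x <- gsub("mRID", "mrid", x, fixed = TRUE)
  x <- gsub("TimeSeries", "ts", x, fixed = TRUE)
  x <- sub("^process", "", x)
  x <- gsub("unavailability_Time_Period", "unavailability", x, fixed = TRUE)
  # Pass 2: rough snake_case conversion.
  x <- gsub("([a-z0-9])([A-Z])", "\\1_\\2", x)   # split camelCase boundaries
  x <- tolower(gsub("[^A-Za-z0-9]+", "_", x))    # dots etc. become underscores
  x <- gsub("^_+|_+$", "", x)                    # trim stray underscores
  # Cleanup: collapse one known redundant fragment.
  gsub("psr_type_psr_type", "psr_type", x, fixed = TRUE)
}

snake_sketch(c("TimeSeries.mRID", "TimeSeries.businessType"))
# "ts_mrid" "ts_business_type"
```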
2.5 Time series handling — tidy_or_not()
Location: R/utils.R
The ENTSO-E API encodes time series as a start timestamp plus a
resolution and a sequence of positional data points — there are no
per-point timestamps in the XML. tidy_or_not() reconstructs
absolute timestamps and offers two output shapes:
Curve type A01 (regular intervals): Points are
evenly spaced; timestamps are calculated as
time_interval_start + (position - 1) × resolution.
Curve type A03 (irregular / broken intervals): Some positions may be missing. The engine builds a complete positional frame and performs a full join to fill gaps, carrying forward the last observed value.
Supported resolutions: PT4S, PT1M, PT15M, PT30M, PT60M, P1D, P7D, P1M, P1Y.
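Both curve types can be illustrated with one small base-R function. This is a sketch, not tidy_or_not(): the name tidy_points_sketch() and its arguments are hypothetical, the resolution is given in seconds, and a left join from the complete positional frame stands in for the full join described above.

```r
# Hypothetical sketch of timestamp reconstruction with gap filling.
tidy_points_sketch <- function(start, resolution_secs, position, quantity,
                               n_positions) {
  frame    <- data.frame(position = seq_len(n_positions))
  observed <- data.frame(position = position, quantity = quantity)
  out <- merge(frame, observed, by = "position", all.x = TRUE)
  # Carry the last observed value forward into missing positions.
  last_seen <- cummax(ifelse(is.na(out$quantity), 0L, seq_len(nrow(out))))
  last_seen[last_seen == 0L] <- NA_integer_   # nothing observed yet
  out$quantity <- out$quantity[last_seen]
  # time_interval_start + (position - 1) * resolution
  out$dt_start <- start + (out$position - 1) * resolution_secs
  out
}

start <- as.POSIXct("2024-01-01 00:00:00", tz = "UTC")
# PT15M resolution; position 3 is absent and inherits the value of position 2.
tidy_points_sketch(start, 15 * 60, position = c(1, 2, 4),
                   quantity = c(10, 12, 9), n_positions = 4)
```

For an A01 curve no positions are missing, so the join and fill steps leave the data unchanged and only the timestamp formula matters.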
tidy_output = TRUE (default): One row
per data point. The ts_point_dt_start column contains the
reconstructed timestamp. Internal bookkeeping columns
(ts_point_position, ts_resolution_*,
by) are removed.
tidy_output = FALSE: One row per time
period. All data points are nested into a ts_point
list-column. The reconstructed ts_point_dt_start column is
removed.
2.6 Metadata enrichment
Three functions run in sequence after time series restructuring, each joining additional columns:
add_type_names()
(R/utils.R) — joins human-readable definitions from
built-in package data tables (e.g., business_types,
asset_types, process_types) using
lookup_merge(). Produces _def suffix columns
alongside each code column (e.g., ts_business_type →
ts_business_type_def).
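The effect of such a join can be sketched with match(). The function below is not lookup_merge(): its name and arguments are hypothetical, and the two-row lookup table is toy data standing in for the built-in business_types table.

```r
# Hypothetical sketch of a code-to-definition join that keeps row order.
lookup_merge_sketch <- function(data, lookup, code_col, suffix = "_def") {
  hit <- match(data[[code_col]], lookup$code)
  data[[paste0(code_col, suffix)]] <- lookup$definition[hit]
  data
}

# Toy lookup table for illustration only.
toy_types <- data.frame(code = c("A01", "A04"),
                        definition = c("Production", "Consumption"))
df <- data.frame(ts_business_type = c("A04", "A01", "Z99"))
lookup_merge_sketch(df, toy_types, "ts_business_type")
# adds ts_business_type_def; the unknown code Z99 yields NA
```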
add_eic_names()
(R/utils.R) — joins EIC code long names from
area_eic() and resource_object_eic() (both
cached; see section 3) using lookup_merge(). Produces
_name suffix columns alongside each _mrid
column (e.g., ts_in_domain_mrid →
ts_in_domain_name).
add_definitions()
(R/utils.R) — joins further definitions: auction
categories, flow directions, reason codes (with multi-code merging via
" - " separator), and object aggregation types.
2.7 Column whitelist and row ordering
After enrichment, xml_to_table() applies a hard-coded
whitelist of ~140 allowed column names. Any column not on the list is
silently dropped. This prevents internal XML artefacts from leaking into
user-visible output and keeps the API stable across ENTSO-E schema
changes.
Rows are then sorted by: created_date_time,
ts_mrid, ts_business_type,
ts_mkt_psr_type, ts_time_interval_start,
ts_point_dt_start (when present).
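A minimal base-R sketch of the filter-then-sort step is shown below. finalise_sketch() is a hypothetical name, and the whitelist and sort keys in the usage lines are a small illustrative subset, not the package's actual lists.

```r
# Hypothetical sketch: keep whitelisted columns, then sort by available keys.
finalise_sketch <- function(df, whitelist, sort_cols) {
  df <- df[, intersect(names(df), whitelist), drop = FALSE]
  keys <- intersect(sort_cols, names(df))   # skip keys that are not present
  if (length(keys) > 0L) {
    ord <- do.call(order, unname(as.list(df[keys])))
    df <- df[ord, , drop = FALSE]
  }
  df
}

df <- data.frame(ts_mrid = c(2, 1), xml_junk = c("a", "b"),
                 ts_point_dt_start = c(5, 6))
finalise_sketch(df,
                whitelist = c("ts_mrid", "ts_point_dt_start"),
                sort_cols = c("created_date_time", "ts_mrid"))
```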
3. Caching System
3.1 Two cache objects
The package maintains two independent in-memory caches, both with a 1-hour maximum age:
| Object | Initialised in | Caches |
|---|---|---|
| m | R/utils.R (top of file) | EIC name lookup tables used during XML enrichment |
| mh | R/en_helpers.R (top of file) | Full EIC code tibbles downloaded by *_eic() functions |
Both are cachem::cache_mem(max_age = .max_age) objects,
where .max_age is the package-level constant
3600 (defined in R/constants.R). The max age
is not user-configurable.
3.2 What gets cached
Via mh (one key per EIC function):
| Cache key | Source | Function |
|---|---|---|
| party_eic_df_key | CSV download | party_eic() |
| area_eic_df_key | CSV download | area_eic() |
| accounting_point_eic_df_key | CSV download | accounting_point_eic() |
| tie_line_eic_df_key | CSV download | tie_line_eic() |
| location_eic_df_key | CSV download | location_eic() |
| resource_object_eic_df_key | CSV download | resource_object_eic() |
| substation_eic_df_key | CSV download | substation_eic() |
| all_allocated_eic_df_key | XML download + parse | all_allocated_eic() |
Via m (used inside the XML-to-table
engine):
| Cache key | Content |
|---|---|
| area_eic_name_key | Subset of area_eic(): EicCode + EicLongName columns |
| resource_object_eic_name_key | Subset of resource_object_eic(): EicCode + EicLongName columns |
Not cached: API responses. Every call to
load_actual_total(), gen_per_prod_type(), etc.
makes a fresh HTTP request. Only slow, stable reference data (EIC
registries, type definitions) is cached.
3.3 Cache hit/miss pattern
All EIC functions use the same template:
cache_key <- "unique_key_name"
if (mh$exists(key = cache_key)) {
  res_df <- mh$get(key = cache_key, missing = fallback_expr)
  cli_alert_info("pulling {f} file from cache")
} else {
  cli_alert_info("downloading {f} file ...")
  res_df <- download_and_transform()
  mh$set(key = cache_key, value = res_df)
}
The cli::cli_alert_info() calls make the cache source
visible to the user in the console.
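The template can be turned into a small runnable function. The sketch below is illustrative: cached_fetch_sketch() is a hypothetical name, message() stands in for cli::cli_alert_info(), and the download step is passed in as a function. The usage lines run only when the cachem package is installed.

```r
# Hypothetical generalisation of the hit/miss template; works with any cache
# object that offers exists(), get() and set().
cached_fetch_sketch <- function(cache, key, download) {
  if (cache$exists(key)) {
    message("pulling ", key, " from cache")
    cache$get(key)
  } else {
    message("downloading ", key, " ...")
    value <- download()
    cache$set(key, value)
    value
  }
}

if (requireNamespace("cachem", quietly = TRUE)) {
  demo_cache <- cachem::cache_mem(max_age = 3600)   # one-hour expiry
  slow_download <- function() data.frame(EicCode = "10YDE-VE-------2")
  first  <- cached_fetch_sketch(demo_cache, "demokey", slow_download)  # miss
  second <- cached_fetch_sketch(demo_cache, "demokey", slow_download)  # hit
}
```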
3.4 Double-caching during EIC name enrichment
add_eic_names() calls
get_resource_object_eic() (cache m), and
fetches area EIC names inline (cache m, falling back to
area_eic() on cache miss, which uses cache
mh). This means the same underlying data may be stored at
two levels simultaneously:
- mh holds the full EIC tibble (all columns).
- m holds a narrowed subset (EicCode + EicLongName only) ready for joining.
After both caches are warm, subsequent API calls within the same session perform zero downloads for EIC enrichment.
4. End-to-End Data Flow
The following traces a call to load_actual_total():
load_actual_total(eic, period_start, period_end, tidy_output = TRUE)
│
├─ checkmate: assert EIC format, token presence, ≤365-day range
├─ url_posixct_format(period_start / period_end) → "YYYYMMDDHHMM" UTC
├─ Build query string: "documentType=A65&processType=A16&outBiddingZone_Domain=…"
│
└─ api_req_safe(query_string, security_token)
   └─ api_req()
      ├─ Build URL: https://web-api.tp.entsoe.eu/api?{query}&securityToken=<...>
      ├─ GET, 60s timeout, retry 3× on 503 (10s backoff), log masked URL
      ├─ HTTP 200 / text/xml → resp_body_xml() [or zip → read_zipped_xml()]
      └─ HTTP error → calc_offset_urls() + recurse [or cli_abort()]
         │
         └─ extract_response(list(result, error), tidy_output)
            ├─ error? → cli_abort()
            ├─ list of XML? → imap + progress bar
            │
            └─ xml_to_table(xml_doc, tidy_output)
               ├─ extract_leaf_twig_branch() → raw wide data frame
               ├─ Merge date+time columns
               ├─ Convert DateTime → POSIXct(UTC), numeric columns → numeric
               ├─ my_snakecase() → normalised column names
               ├─ tidy_or_not() → one row per data point (A01/A03 handled)
               ├─ add_type_names() → join built-in type tables (no network)
               ├─ add_eic_names() → get_resource_object_eic() [cache m]
               ├─ add_definitions() → join built-in definition tables (no network)
               ├─ Filter to whitelist columns
               └─ Sort rows
                  │
                  └─ dplyr::bind_rows() + as_tbl() if multiple XML docs
                     │
                     └─ tibble returned to user (or NULL)
5. Configuration Reference
| Setting | Value | Location |
|---|---|---|
| API base URL | https://web-api.tp.entsoe.eu/api? | .api_scheme, .api_domain, .api_name in R/constants.R |
| HTTP method | GET | api_req() in R/utils.R |
| HTTP timeout | 60 seconds (.req_timeout) | R/constants.R, applied in api_req() |
| Retry on 503 | Up to 3 attempts, 10-second backoff | req_retry() in api_req() |
| Security token env var | ENTSOE_PAT | All user-facing functions |
| Verbose logging | Response headers only | api_req() in R/utils.R |
| Cache max age | 3600 seconds / 1 hour (.max_age) | R/constants.R, applied in R/utils.R and R/en_helpers.R |
| Pagination trigger phrase | "exceeds the allowed maximum" | api_req() in R/utils.R |
| Forbidden offset doc types | A63+A46/A85, A65+A85, B09+archive, A91, A92, A94+A02 | api_req() in R/utils.R |
| XML encoding | UTF-8 | api_req() and xml_to_table() |
| ZIP content types | application/zip, application/octet-stream | api_req() in R/utils.R |
6. Code References
| Component | File | Key Symbols |
|---|---|---|
| Package constants | R/constants.R | .api_scheme, .api_domain, .api_name, .req_timeout, .max_age |
| EIC checksum validation | R/utils.R | assert_eic(), possible_eic_chars |
| Provider check | R/utils.R | there_is_provider() |
| Cache (general) | R/utils.R | m |
| Cache (EIC helpers) | R/en_helpers.R | mh |
| HTTP request | R/utils.R | api_req(), api_req_safe() |
| Timestamp formatting | R/utils.R | url_posixct_format() |
| Zip decompression | R/utils.R | read_zipped_xml() |
| Pagination | R/utils.R | calc_offset_urls() |
| XML engine entry | R/utils.R | extract_response() |
| XML engine core | R/utils.R | xml_to_table() |
| XML parsing | R/utils.R | extract_leaf_twig_branch(), extract_nodesets() |
| Column naming | R/utils.R | my_snakecase() |
| Time series | R/utils.R | tidy_or_not() |
| Type enrichment | R/utils.R | add_type_names(), lookup_merge() |
| EIC enrichment | R/utils.R | add_eic_names(), lookup_merge(), get_resource_object_eic() |
| Definition enrichment | R/utils.R | add_definitions() |
| EIC download functions | R/en_helpers.R | party_eic(), area_eic(), resource_object_eic(), all_allocated_eic(), et al. |
| Built-in type tables | R/data.R | asset_types, business_types, process_types, message_types, et al. |
7. Glossary
| Term | Definition |
|---|---|
| EIC | Energy Identification Code: a 16-character alphanumeric code (digits, uppercase letters, -) identifying market participants, bidding zones, transmission lines, etc. on the ENTSO-E platform; the 16th character is a weighted-modulo-37 checksum of the first 15 |
| Document type | A 3-character ENTSO-E code (e.g., A65) identifying the category of data being requested |
| Process type | A 3-character ENTSO-E code (e.g., A16) qualifying the sub-type of a document type |
| Curve type A01 | Regular time series: data points are evenly spaced at the given resolution |
| Curve type A03 | Broken / irregular time series: some positional slots may be absent; gaps are filled during tidy conversion |
| Tidy output | One row per data point, with an explicit ts_point_dt_start timestamp column (tidy_output = TRUE) |
| Nested output | One row per time period, with all data points collected into a ts_point list-column (tidy_output = FALSE) |
| Offset pagination | Mechanism by which api_req() splits an oversized query into multiple requests using &offset=N parameters, transparent to the caller |
| ENTSOE_PAT | R environment variable holding the user's ENTSO-E security token |
| there_is_provider() | Exported helper that returns TRUE when the ENTSO-E API endpoint is reachable; used as an @examplesIf guard throughout the package |
| cachem | R package providing in-memory and disk caches with automatic expiry, used by both m and mh cache objects |