Abstract
This report presents the outcome of the Proof-of-Concept (PoC) experiments in the EU RODEO project E-SOH work package (WP3).
Revision history
Version | Date | Comment | Responsible |
---|---|---|---|
1.0 | 2023-04-28 | First version | Morten W. Hansen |
1.1 | 2023-10-13 | Updates from build work | Morten W. Hansen |
1. Introduction
In this report we present the outcome of PoC (Proof-of-Concept) experiments in the EU RODEO project (E-SOH work package, WP3) in 2023.
In addition, we summarize the results of some relevant earlier technical experiments and production architectures set up and operated by some of the participating EUMETNET member institutes. This report is therefore divided into two main sections: PoC experiments and Previous experiences.
Github repository: https://github.com/EURODEO/e-soh-poc-report
2. PoC experiments during the RODEO/E-SOH project
In this chapter we document some experiments/studies done during the design phase of the E-SOH project (2023); each of them is covered in its own subsection.
2.1. ECMWF EWC-Infrastructure
The ECMWF EWC is a cloud-based infrastructure that makes it easier to work with big data. It is currently in a pilot phase to test various use cases.
2.1.1. Running things on EWC
We have tested a couple of VMs and everything has worked well so far. Logging in to the VMs remotely with ssh keys (only RSA and OpenSSH key types) also works. The operating systems tested were mainly CentOS 7.9 and, later, Rocky 8.6-3. Both are already old and have only a limited time left for security patches, so future OS upgrades may be required, and it remains to be seen which versions will be available in the future. There is no final confirmation yet of the OS images that will be provided in the new EWC operational infrastructure, but the expectation is Rocky Linux 8.x and the latest Ubuntu LTS. We have also tested whether the requirements for WIS 2.0 are met.
2.1.1.1. Requirements for WIS 2.0
- Python3 Version 3.8 or higher
- Docker Engine Version 20.10.14 or higher
- Docker Compose Version 1.29.2
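A minimal sketch of how these checks could be scripted on a fresh VM; the version thresholds are taken from the list above, and it is assumed that the `docker` CLI and the Compose plugin (`docker compose`) are on the PATH:

```python
# check_wis2_reqs.py - sketch: verify Python/Docker/Compose versions on a VM
import re
import subprocess
import sys

REQUIRED = {"python": (3, 8), "docker": (20, 10, 14), "compose": (1, 29, 2)}

def version_of(cmd):
    """Run a CLI command and extract the first dotted version number from its output."""
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", out)
    return tuple(int(x) for x in match.groups()) if match else (0,)

checks = {
    "python": tuple(sys.version_info[:3]),
    "docker": version_of(["docker", "--version"]),
    "compose": version_of(["docker", "compose", "version"]),
}

for name, found in checks.items():
    ok = found >= REQUIRED[name]
    print(f"{name}: found {'.'.join(map(str, found))}, "
          f"required >= {'.'.join(map(str, REQUIRED[name]))} -> {'OK' if ok else 'TOO OLD'}")
```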
2.1.1.2. CentOS 7.9
- Python 3.10.8
- Docker 23.0.1 via https://docs.docker.com/engine/install/centos/
- Docker-Compose v2.16.0 via https://docs.docker.com/compose/install/linux/#install-using-the-repository
All requirements for WIS 2.0 should be met in CentOS 7.9.
2.1.1.3. Rocky 8.6-3
- Python 3.9.13 via https://kifarunix.com/install-python-on-rocky-linux-8/
- Docker 23.0.1 via https://www.digitalocean.com/community/tutorials/how-to-install-and-use-docker-on-rocky-linux-8#step-1-installing-docker
- Docker-Compose v2.16.0

All requirements for WIS 2.0 should also be met in Rocky 8.6-3.
2.2. NetCDF versus Postgres
To compare the performance of different backends for storing recent observations, a special Python program has been written.
The storage is a rolling buffer keeping the latest observations (typically 24 hours) for a set of time series. Currently, three backends have been implemented/compared: two variants based only on a (Postgres-based) relational database and one variant that keeps the observations in netCDF files.
An observation is assumed to consist of two components:
- a time represented as a UNIX timestamp (seconds since 1970-01-01T00:00:00Z)
- a value represented as a floating point number
A time series is assumed to be identified by the combination of a station and a parameter (like air temperature).
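As a minimal sketch, this data model could be expressed in Python roughly as follows; the class and field names are illustrative, not necessarily those used in the actual test program:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TimeSeriesKey:
    """A time series is identified by the combination of a station and a parameter."""
    station_id: str
    param_id: str          # e.g. "air_temperature"

@dataclass
class Observation:
    """A single observation in a time series."""
    obs_time: int          # UNIX timestamp (seconds since 1970-01-01T00:00:00Z)
    value: float           # observed value as a floating point number
```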
2.2.1. Storage backends
The following backends are currently implemented:
Name | Description |
---|---|
TimescaleDBSBE | Keeps all data in a TimescaleDB database extended with PostGIS. |
PostGISSBE | Keeps all data in a Postgres database extended with PostGIS. |
NetCDFSBE_TSMDataInPostGIS | Keeps all data in netCDF files on the local file system, one file per time series. Per time series metadata (i.e. not actual observations) will also be kept in a Postgres database extended with PostGIS to speed up searching for target files to retrieve observations from. |
Note: the program is designed to make it easy to add more backends.
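To illustrate what "easy to add more backends" could look like, here is a hedged sketch of a common backend interface; the class and method names are hypothetical and not necessarily those used in the actual tstester code:

```python
from abc import ABC, abstractmethod

class StorageBackend(ABC):
    """Common interface that each storage backend variant implements (illustrative sketch)."""

    @abstractmethod
    def set_obs(self, ts_key, observations):
        """Replace the observations of a time series (used when filling the storage)."""

    @abstractmethod
    def add_obs(self, ts_key, observations, max_age):
        """Merge in new observations and drop those older than max_age."""

    @abstractmethod
    def get_obs(self, ts_keys, from_time, to_time):
        """Retrieve observations for the given time series and time range."""
```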
2.2.2. Use cases
The following use cases are currently implemented:
Name | Description |
---|---|
Fill | Fill storage with observations. |
AddNew | Add new observations to the storage while deleting any old observations and overwriting any existing ones. |
GetAll | Retrieve all observations. |
Note: the program is designed to make it easy to add more use cases.
2.2.3. Testing
Testing is done essentially according to this algorithm:
```
for uc in use cases:
    for sbe in storage backends:
        apply uc to sbe and aggregate stats
```
The synthetic time series generated by the program "sample" observations from an underlying sine wave.
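A minimal sketch of how such synthetic observations can be generated; the amplitude and period used here are arbitrary and not necessarily those of the test program:

```python
import math

def generate_obs(t0: int, t1: int, resolution: int, period: int = 86400, amplitude: float = 10.0):
    """Yield (timestamp, value) pairs sampled from a sine wave between t0 and t1 (UNIX seconds)."""
    for t in range(t0, t1, resolution):
        yield t, amplitude * math.sin(2 * math.pi * (t % period) / period)

# example: 24 hours of observations at 10-minute resolution
obs = list(generate_obs(1_680_000_000, 1_680_000_000 + 86400, 600))
```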
2.2.3.1. Test environment
The tests have been run in the following environment:
- HW: HP ZBook, Intel Core i7-6820HQ CPU @ 2.70GHz × 8, 16GB RAM, 250GB disk
- OS: Ubuntu 18.04 Bionic
- Python version: 3.9
The TimescaleDBSBE backend uses a separate TimescaleDB server running in a local docker container.
The PostGISSBE and NetCDFSBE_TSMDataInPostGIS backends both use a PostGIS server that runs in a local docker container (they use separate databases, though: esoh_postgis for PostGISSBE and esoh_netcdf for NetCDFSBE_TSMDataInPostGIS).
2.2.3.2. Test configuration
The currently relevant configuration settings are these:
Name | Description |
---|---|
 | Maximum age (in seconds) of an observation in the storage. This effectively limits the capacity of the storage along the time dimension. (Note: what is considered the 'current time' is manipulated by the program depending on the particular test!) |
 | Number of stations to simulate. |
 | Number of parameters to simulate. (The number of time series will thus be the number of stations times the number of parameters.) |
 | Time series resolution in seconds. |
 | Applicable to the AddNew use case, this is the number of seconds' worth of new observations to add/merge into the storage. |
2.2.3.3. Metadata
In addition to the station location, each time series is accompanied by a small amount of dummy metadata (such as sensor quality), which currently is not used for anything except taking up space in the storage. Per-observation metadata (such as observation quality) is currently not used at all.
2.2.3.4. Test results
Warning: The results below were produced on 2023-04-26 with the version of the test program as of that date. Two aspects should be kept in mind when reading the results:
- Only a small number of test runs were done for each combination. The results should thus be considered indicative only.
- Although we have made some efforts to optimize the performance of all backends, there may be further optimization potential for some of them, which would improve the fairness of the comparison.
To conclude the testing conducted in this PoC so far, write performance seems to be better when using files directly, while databases seem to outperform direct file access in data query operations. The comparison will become even more interesting once Elastic is included as well. A time-search demo is also being considered. Geospatial searches can also be benchmarked, but this is not relevant before Elastic is added to this test bench.
2.3. TimescaleDB research/experiments
TimescaleDB is an open-source database designed to make SQL scalable for time-series data. It extends PostgreSQL with features such as native time-series partitioning, compression, and continuous aggregates to improve the performance and scalability of time-series data management.
Hypertables are PostgreSQL tables with special features that improve performance and user experience for time-series data. Hypertables are divided into chunks that hold data only from a specific time range. When a query for a particular time range is executed, the planner can ignore chunks that hold data outside that time range.
TimescaleDB's native compression works by converting uncompressed rows of data into compressed columns. When new data is added to the database, it arrives as uncompressed rows. TimescaleDB uses a built-in job scheduler to convert this data into compressed columns. Compression can be enabled on individual hypertables by declaring which columns to segment by. Once compression is enabled, the data can then be compressed either by an automatic policy or by manually compressing chunks. By reducing the amount of data that needs to be read, compression can lead to performance gains. However, the benefits of compression depend on the data and the queries that are run against it.
One useful TimescaleDB feature is continuous aggregates, which can be used to pre-aggregate data into useful summaries. Materialized views are used to continuously and incrementally refresh a query in the background.
TimescaleDB has built-in support for horizontal scaling through its multi-node feature. One of the databases resides on an access node and stores metadata about the other databases. The other databases are located on data nodes and hold the actual data. A multi-node setup improves write scalability and query performance, but it can be more complex to set up and manage.
TimescaleDB also supports a master-replica setup, as it is built on top of PostgreSQL. This setup brings high availability and read scalability, but write scalability might be a bottleneck.
How to test:
- Add the Timescale extension to the current PostGIS database
- Add a new TimescaleDB backend to https://github.com/EURODEO/e-soh-datastore-poc/tree/main/tstester (the main thing is to convert regular tables to hypertables; see the sketch after this list)
- Experiment with different setups (changing chunk_time_interval, maybe test how compression affects ingestion performance, and so on)
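A minimal sketch of the hypertable conversion and compression setup mentioned above, assuming a psycopg2 connection and a narrow observation table with columns ts_id, obstime and obsvalue; the table, column names and intervals are illustrative, not a prescribed configuration:

```python
import psycopg2

conn = psycopg2.connect("dbname=esoh user=postgres")  # connection details are illustrative
conn.autocommit = True

with conn.cursor() as cur:
    # Convert the regular observation table into a hypertable partitioned on obstime,
    # with one chunk per day (chunk_time_interval is one of the knobs to experiment with).
    cur.execute("""
        SELECT create_hypertable('observation', 'obstime',
                                 chunk_time_interval => INTERVAL '1 day',
                                 migrate_data => TRUE);
    """)

    # Enable native compression, segmenting the compressed columns by time series id.
    cur.execute("""
        ALTER TABLE observation SET (
            timescaledb.compress,
            timescaledb.compress_segmentby = 'ts_id'
        );
    """)

    # Let the built-in job scheduler compress chunks older than one day.
    cur.execute("SELECT add_compression_policy('observation', INTERVAL '1 day');")
```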
2.4. WIS 2.0 in a box (wis2box) evaluation
From the wis2box site:
WIS2 in a box (wis2box) is a Free and Open Source (FOSS) Reference Implementation of a WMO WIS2 node. The project provides a plug and play toolset to ingest, process, and publish weather/climate/water data using standards-based approaches in alignment with the WIS2 principles. wis2box also provides access to all data in the WIS2 network. wis2box is designed to have a low barrier to entry for data providers, providing enabling infrastructure and services for data discovery, access, and visualization.
In E-SOH, we looked at wis2box to evaluate if it could be used as a basis for E-SOH development.
2.4.1. Design
An interesting design choice in wis2box is to use Elasticsearch as a storage backend for the API. By storing a GeoJSON document for each of the observation parameters directly into Elastic, it can be used to do the selection and filtering required for implementing the OGC Feature API (which returns a GeoJSON collection). This works because the API reply is just a filtered list of all the GeoJSON objects.
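As an illustration of this design, here is a hedged sketch of the kind of Elasticsearch query involved, using the elasticsearch Python client (8.x-style API); the index and field names are assumptions rather than the actual wis2box schema, and "location" is assumed to be mapped as a geo_point:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # address is illustrative

# Filter the stored GeoJSON documents on time and bounding box; the hits can be
# returned more or less as-is as an OGC API - Features (GeoJSON) collection.
resp = es.search(
    index="observations",  # assumed index name
    query={"bool": {"filter": [
        {"range": {"properties.resultTime": {
            "gte": "2023-01-01T00:00:00Z", "lte": "2023-01-02T00:00:00Z"}}},
        {"geo_bounding_box": {"location": {
            "top_left": {"lat": 55.0, "lon": 2.0},
            "bottom_right": {"lat": 50.0, "lon": 8.0}}}},
    ]}},
    size=1000,
)
collection = {
    "type": "FeatureCollection",
    "features": [hit["_source"] for hit in resp["hits"]["hits"]],
}
```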
Elastic also provides a straightforward approach to implement the "Replay" functionality,
by storing all MQTT notifications in another Elastic index.
For an API that returns CoverageJSON (i.e. the proposed EDR API for E-SOH), using Elastic for storage seems less straightforward. This is because in CoverageJSON the different measurements for each parameter are returned as JSON arrays. This leads to a very compact result, but it means the response for different measurements can not just be concatenated together.
wis2box uses the pygeoapi library to provide the OGC Feature API. The pygeoapi library also supports OGC EDR, but the support for different backends is currently limited to NetCDF and Zarr.
2.4.2. Performance
The following tests were done on a MacBook Pro with Apple M1 Pro processor. Four cores (out of a total of 10) and 16 GB of memory were assigned to Docker, unless otherwise noted.
2.4.2.1. Loading data
We load one month of KNMI data for about 60 stations (detailed setup instructions can be found here). This consists of 672 (= 28 × 24) BUFR files, one for each observation time. Each BUFR file contains multiple messages for the different stations.
The data load takes more than an hour to complete. 240 MB of space is used by Elastic after loading 11 MB of BUFR files.
If the wis2box-management container (which does the bulk of the work during data load) is killed during the load, the loading does not continue when the container is restarted.
2.4.2.2. API load test
We set up a load test of the Feature API of the wis2box using locust. Each request asks for a month of data for a single parameter (one of four) for a single station (one of seven). The full Feature API response is by default very verbose (740 KB for a month of data), so we also test a light response with most metadata left out (167 KB for a month of data). gzip compression is used for all responses.
Each "user" does a new request as soon as the previous request finished. Multiple users do parallel requests.
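A hedged sketch of the locust setup for this test; the host, collection path, query parameters and station/parameter identifiers are placeholders, not the actual KNMI/wis2box values:

```python
# locustfile.py - sketch of the Feature API load test
import random
from locust import HttpUser, task

STATIONS = [f"station-{i}" for i in range(1, 8)]     # placeholders (one of seven)
PARAMETERS = [f"param-{i}" for i in range(1, 5)]     # placeholders (one of four)

class FeatureApiUser(HttpUser):
    host = "http://localhost:8999"  # placeholder for the wis2box API endpoint

    @task
    def month_of_data(self):
        # One month of data for a single station and a single parameter;
        # no wait time, so each user fires a new request as soon as the previous one finishes.
        self.client.get(
            "/oapi/collections/observations/items",        # assumed collection path
            params={
                "wigos_station_identifier": random.choice(STATIONS),  # assumed filter name
                "name": random.choice(PARAMETERS),                    # assumed filter name
                "datetime": "2023-01-01T00:00:00Z/2023-01-31T23:59:59Z",
                "limit": 10000,
            },
        )
```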
We get the following results:
Cores | Users | Request Type | req/s |
---|---|---|---|
4 | 1 | full | 25 |
4 | 20 | full | 90 |
4 | 1 | light | 30 |
4 | 20 | light | 115 |
8 | 1 | full | 24 |
8 | 20 | full | 140 |
8 | 1 | light | 30 |
8 | 20 | light | 170 |
2.4.3. Comments/Conclusions
Based on our very limited testing of wis2box, we have the following 'conclusions':
- It is designed in a way to leverage existing technology like Elastic.
- The datastore supporting the API is optimised for OGC Feature API/GeoJSON.
- Ingestion of 40K station observations (each with multiple parameters) takes one hour on the test setup. The relative slowness is probably caused by all processing being done serially in Python.
- Ingestion is not fault-tolerant with default settings (killing and restarting the wis2box-management container leads to ingestion data loss; WIS2BOX_BROKER_QUEUE_MAX needs to be increased to avoid data loss).
- The internal Elastic storage is verbose, due to the Feature JSON format and storing all the messages (needed for the Replay functionality).
- Query performance is good for a Python API (due to offloading of the heavy lifting to Elastic?).
2.5. Metadata output
By using the py-mmd-tools package developed at MET Norway, one can create a MET Norway Metadata Format (MMD) file from a properly documented dataset (here NetCDF4).
By using the xslt-transformations defined in the MMD-repo, one can then transform to different metadata standards.
See the /metadata-output folder for a simple Python script that goes through all the xslt translations from the xslt folder (you need to have lxml installed and the xslt folder copied to where you want to run the script).
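A minimal sketch of such an xslt translation with lxml; the file names are placeholders, and the actual script in /metadata-output loops over the stylesheets in a similar way:

```python
from pathlib import Path
from lxml import etree

mmd_doc = etree.parse("dataset_mmd.xml")  # placeholder: MMD metadata file for a dataset

# Apply every stylesheet in the copied xslt folder and write one output file per target standard
for xslt_path in Path("xslt").glob("*.xsl*"):
    transform = etree.XSLT(etree.parse(str(xslt_path)))
    result = transform(mmd_doc)
    out_file = Path(f"{xslt_path.stem}_output.xml")
    out_file.write_bytes(etree.tostring(result, pretty_print=True,
                                        xml_declaration=True, encoding="utf-8"))
```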
Deciding on one internal metadata format will then facilitate increased interoperability when translations between the different metadata-formats are in place.
2.6. MQTT broker evaluation
Technologies investigated:
- Mosquitto
- RabbitMQ
- nats jetstream + mqtt enabled
- HiveMQ
- EMQX
- VerneMQ
Requirements:
- req 1
- req 2
- …
Mosquitto, HiveMQ and EMQX were disregarded due to lack of .. or .. being behind a paywall.
2.6.1. RabbitMQ
- TLS support: Yes
- Message persistence support: Yes
- Authentication: Yes
- Clustering support: Yes
- MQTT protocol support: v3.1.1 (as a plugin). RabbitMQ 3.13, the next release series, is expected to be released end of 2023 and will support MQTT protocol version 5.0.
- Free/Open source: Yes
- Pros:
  - Wide adoption
  - Supported by VMware
  - Supports all needed features
  - Also has support for other message protocols
- Cons:
  - Does not support MQTT v5 as of the latest release
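To illustrate the features being compared (TLS, authentication, MQTT over a broker such as RabbitMQ's MQTT plugin), here is a hedged subscriber sketch using the paho-mqtt client (1.x API); the broker address, port, credentials and topic filter are placeholders:

```python
import paho.mqtt.client as mqtt

def on_message(client, userdata, msg):
    # Print each received notification message
    print(f"{msg.topic}: {msg.payload.decode()}")

client = mqtt.Client(protocol=mqtt.MQTTv311)   # RabbitMQ's MQTT plugin currently speaks v3.1.1
client.username_pw_set("esoh-user", "secret")  # placeholder credentials (authentication)
client.tls_set()                               # use the system CA bundle for TLS
client.on_message = on_message

client.connect("broker.example.org", 8883)     # placeholder broker, MQTT-over-TLS port
client.subscribe("origin/a/wis2/#", qos=1)     # placeholder topic filter
client.loop_forever()
```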
2.6.2. VerneMQ
- TLS support: Yes
- Message persistence support: No
- Authentication: Yes
- Clustering support: Yes
- MQTT protocol support: v3.1.1 and v5
- Free/Open source: Yes
- Pros:
  - Supports all needed features
  - Only supports MQTT, less complex
- Cons:
  - No possibility for message persistence, but this is OK according to the design decisions in E-SOH. The API endpoints will provide the possibility to retrieve data 24 hours back in time.
2.6.3. Needs resolving before decision
- Do we need assurance of MQTT v5 before deciding? RabbitMQ seems to be launching v5 before the end of the year.
- Is MET Norway going to use the broker, and does MET Norway need persistence?
- Clients must have a way to retrieve old messages if a solution without persistence is selected, but the EDR API should be able to handle this.
3. Previous experiences (prior to RODEO/E-SOH project)
3.1. Experiences on PostGIS-based database for 3rdparty observations
FMI has been using PostgreSQL and PostGIS as a database solution for many years. For 3rd party observations, FMI has in recent years been using PostGIS in AWS (RDS, Relational Database Service, i.e. a relational database as a service). The database schema is designed to be flexible enough to contain different kinds of datasets, and adding a new dataset is designed to be easy. Here we explain the concept with a simplified diagram.
The database is based on the narrow data table paradigm; the data table is shown in the figure as "3rdparty_obsdata" (not all fields are shown, just the essential ones to explain the concept; there are, e.g., quality-control-related fields in the table that are not shown here).
Every measured parameter per timestamp (and per station) is inserted as a new row in the database. There is flexibility to ingest either numerical values or textual values (or even both), depending on each parameter's needs. Before ingesting data, a producer must be defined for that dataset. Each producer has its own parameter namespace and station namespace. The aim is that a data producer can use its own station identifiers, and the database handles a separate station id namespace for each data producer. The data ingestion logic (written in pg/SQL) handles creation of new stations and parameters automatically based on the data input.
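A hedged sketch of what such a narrow data table could look like; the table and column names are illustrative and heavily simplified compared to the real FMI schema:

```python
import psycopg2

# Illustrative, simplified narrow-table schema: one row per
# (producer, station, parameter, timestamp), with either a numeric or a textual value.
DDL = """
CREATE TABLE IF NOT EXISTS obsdata (
    producer_id   integer     NOT NULL,
    station_id    integer     NOT NULL,   -- id within the producer's own station namespace
    parameter_id  integer     NOT NULL,   -- id within the producer's own parameter namespace
    obstime       timestamptz NOT NULL,
    value_num     double precision,
    value_text    text,
    PRIMARY KEY (producer_id, station_id, parameter_id, obstime)
);
"""

INSERT = """
INSERT INTO obsdata (producer_id, station_id, parameter_id, obstime, value_num)
VALUES (%s, %s, %s, %s, %s);
"""

with psycopg2.connect("dbname=obs user=postgres") as conn:  # connection details illustrative
    with conn.cursor() as cur:
        cur.execute(DDL)
        # every measured parameter per timestamp (per station) becomes its own row
        cur.execute(INSERT, (1, 1001, 42, "2023-04-26T12:00:00Z", 3.7))
```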
Producers can be separated in the data server API, i.e. one API key (user) may have access to certain producer(s) and another API key to others. This comes in handy, e.g., when there are different kinds of data licenses for the datasets.
This database is integrated with, and used via, the SmartMet Server API, typically using OGC WFS or SmartMet Server's proprietary "timeseries plugin" queries. The integration benefits from the PostGIS database's ability to execute spatial queries, so we did not need to implement that ourselves on the data server side.
3.1.1. Pros and cons
Pros:

- flexibility to ingest different kinds of datasets with their own station codes and measurement parameters
- no need to manually add parameters and/or stations (once the producer is created)
- a relational database is easy to expand with growing metadata needs
- well-thought-out partitioning has kept queries efficient enough
- PostGIS provides the ability to execute geospatial data queries easily

Cons:

- it may not directly scale to big amounts of data (lots of rows)
- it is relatively expensive to store (and operate) large amounts of data in an AWS RDS database
- every producer has its own measurand namespace, so the question of how to have a unified parameter catalogue remains unresolved
3.2. Experiences on using Geomesa and S3 storage for 3rdparty observations (FMI, Finland)
Development of a new 3rd party handling system based on S3 file storage was motivated by the need to cut the operating costs of the PostGIS-based solution (see the previous section). There was also a need to enhance the earlier TITAN deployment (which used AWS Lambda). The development project's main target was to find a data handling solution on AWS (Amazon Web Services) whose price curve is not as steep when the amount of data is scaled up by a factor of 10, 100 or even 1000.
Development of the system took place in the second half of 2021 as a project supported by external consultants, who did the majority of the technical planning, evaluation and implementation. The scope of the implementation was data ingestion from a 3rd party data provider (fetching data from there) into data storage, then running TITAN QC on that data and serving it out through an OGC EDR API, which was an integral part of the reference/PoC implementation. The project resulted in an MIT-licensed "proof-of-concept" stage system called TIUHA; the code is shared on Github, see https://github.com/fmidev/tiuha. After the development phase, FMI has worked on the QC part and transferred the system to run in an OpenShift environment (fork to be merged).
Here we present the fundamental concept of the system and list some pros and cons.
Initially, the planning hypothesis was that NoSQL could cut operation costs, so the initial survey concentrated on the availability of NoSQL databases, especially Cassandra, as a managed service on AWS. The conclusion of this study was that the Cassandra-compatible NoSQL database AWS Keyspaces would indeed bring lower operation costs compared to the reference (PostGIS), but taking into account proper indexing (to serve geospatial queries etc.) it was not a game changer for coping with the exponential data growth. AWS S3 storage is far cheaper than, e.g., database storage in managed services, so we changed course quite rapidly in the project and started to examine S3. This led us to seek possibilities to operate on georeferenced files, in this case with GeoMesa (http://www.geomesa.org). GeoMesa is also able to operate on top of Accumulo, HBase, Google Bigtable and Cassandra databases, but for operation cost optimization reasons we looked at simple file storage.
Against that background, the system was based on S3 storage and container-based computation in AWS. The system overview is presented in the following figure.
The system consists of a continuously running container that is responsible for the API and scheduling, and a QC task container that is spun up by the continuously running container.
GeoMesa is used to operate on .parquet files (see https://parquet.apache.org) in the S3 bucket, both writing them and reading them (by the API). All GeoMesa-dependent handling, including the integrated OGC EDR API, was written in Kotlin.
Pros

- reduced operation cost by using S3 object storage
- relatively simple architecture
- optimized for this single purpose, and for certain queries, which it serves well

Cons

- the reduction in operation cost did not come without a price: a dependency on the GeoMesa library (Java) [and its dependencies] and the need to prepare files for certain queries (which consumes more disk space compared to the reference PostGIS implementation)
- the PostGIS-based reference setup was more flexible for different usage scenarios
As the implementation is in an early phase (more a PoC than a production system), we do not dive into a deeper analysis of what could be enhanced technically (e.g. splitting the API and scheduler into separate containers).
3.3. Experiences on OGC EDR API technology at FMI (Finland)
3.3.1. Introduction
OGC API EDR at FMI was first implemented in the context of the GeoE3 project as a plugin to FMI’s existing SmartMet Server data server. There it has already been utilized for a number of use cases including Intelligent Traffic and Energy Efficiency for buildings.
In addition to the GeoE3 project, the main drivers for using EDR are the Geoweb weather visualization project, developed in collaboration with MET Norway and KNMI, and the goal of providing a more user-friendly open data interface to replace the legacy OGC WFS. Geoweb will act as an EDR client to visualize timeseries data for users. FMI's current operational OGC WFS interface provides complex XML-based output, whereas OGC API EDR by default recommends CoverageJSON, which is more suitable for web development purposes. It also converts more nicely to HTML via the OpenAPI specification, which enables data to be indexed by search engines more easily.
As a side note, OGC API Features was first considered as a replacement for the legacy OGC WFS, but EDR has better and more versatile support for spatio-temporal data, so focus was switched to EDR instead. Most probably OGC API Features will also be implemented at some stage, as it currently has better client support than OGC API EDR.
3.3.2. Supported query types
The FMI implementation currently supports the following EDR query types:

- Position
- Area
- Trajectory
- Corridor
- Radius
- Location
- Instance
Of these, Location and Trajectory queries are already used and proven efficient in the GeoE3 project. The Trajectory query is used to get temperatures along a road and the Location query for outside temperature for a location (related to building heating).
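For illustration, a hedged sketch of what an EDR Position query from a client could look like; the base URL and the collection and parameter names are placeholders, not FMI's actual endpoint:

```python
import requests

BASE = "https://example.org/edr"  # placeholder EDR endpoint

resp = requests.get(
    f"{BASE}/collections/observations/position",
    params={
        "coords": "POINT(24.96 60.20)",   # WKT point (lon lat) for the location of interest
        "parameter-name": "temperature",  # placeholder parameter name
        "datetime": "2023-06-01T00:00:00Z/2023-06-02T00:00:00Z",
        "f": "CoverageJSON",
    },
    timeout=30,
)
resp.raise_for_status()
coverage = resp.json()  # CoverageJSON document with the requested time series
```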
3.3.3. Implementation experiences and issues
The OGC API EDR specification really only specifies what the API should look like; internal implementation details are left to the implementation. FMI built EDR on top of its existing data server, which meant we could utilize existing data engines and data structures. Thus it only took a limited amount of time to provide a new interface on top of that.
Most issues arose in aligning the data models with our peer institutes and others implementing EDR. There is ongoing work, for example between the Nordic countries, to align EDR implementations in the aviation sector. In this special case EDR is used to serve the aviation data models in XML (according to SWIM/IWXXM specifications). This work is still ongoing.
3.3.4. Publish-Subscribe support
The upcoming WMO WIS2, EUMETNET FEMDI and SWIM (ICAO's aviation framework) will all require some kind of PubSub support mechanism. The upcoming version (1.2) of EDR is adding support for the Publish-Subscribe pattern via AsyncAPI. With that, it will be nicely aligned to support the different requirements from the mentioned undertakings.
The challenge is that SWIM and WIS2/FEMDI require different PubSub protocols: SWIM requires AMQP 1.0 and WIS2/FEMDI requires MQTT. This means a bridge has to be built between them, or there have to be separate implementations for the two use cases.
3.4. Experiences on OGC EDR API technology at KNMI (Netherlands)
3.4.1. Introduction
In 2022 the KNMI Data Platform (KDP) team set up an OGC EDR API for observation data (AWS, airport and offshore platforms). The resolution of the observation data is 10 minutes, and new observations are added every 10 minutes. Historical data is available from 2003. This EDR API has been running in production since Q4 2022 as an alpha release.
In addition to the observation data, other data collections are available through the KDP EDR, but they are not the focus of this discussion, as the technology used is less relevant for E-SOH.
3.4.2. Data storage
The data is stored in PostgreSQL/PostGIS (Amazon Aurora Serverless v1 to be precise) using a wide table approach. Each row consists of parameter values for a single timestamp for a single station. The dataset has roughly 1 million time points (6 * 24 * 365 * 20). This gives about 60 million rows in the data table. The data table contains only the station id, the station locations are stored in a separate table.
3.4.3. API
The API is implemented in Python using FastAPI, Pydantic and a package that provides a Pydantic model for CoverageJSON. The Python code does a SELECT statement on the database, and the output is then translated to the CoverageJSON Pydantic model. Any filtering (including geo) is done in the database.
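A heavily simplified sketch of this pattern (a FastAPI endpoint that queries a wide observation table and maps the rows into a CoverageJSON-like structure); the table, column and route names are illustrative and not the actual KDP code, which uses a full Pydantic CoverageJSON model:

```python
from fastapi import FastAPI, Query
import psycopg2

app = FastAPI()

@app.get("/collections/observations/position")
def position_query(station_id: str,
                   parameter: str = Query(..., alias="parameter-name"),
                   datetime: str = "2023-01-01T00:00:00Z/2023-01-02T00:00:00Z"):
    start, end = datetime.split("/")
    # All filtering happens in the database; only the requested rows come back.
    # A real implementation would validate `parameter` against a whitelist of columns.
    with psycopg2.connect("dbname=kdp") as conn, conn.cursor() as cur:  # DSN illustrative
        cur.execute(
            f"SELECT obstime, {parameter} FROM observations "  # wide table: one column per parameter
            "WHERE station_id = %s AND obstime BETWEEN %s AND %s ORDER BY obstime",
            (station_id, start, end),
        )
        rows = cur.fetchall()
    # Translate the rows into a (simplified) CoverageJSON-like response
    return {
        "type": "Coverage",
        "domain": {"type": "Domain",
                   "axes": {"t": {"values": [r[0].isoformat() for r in rows]}}},
        "ranges": {parameter: {"type": "NdArray", "values": [r[1] for r in rows]}},
    }
```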
3.4.4. Benchmark
Load testing was performed for the following database backends:
- PostgreSQL production setup as described above (2 capacity units, 8 vCPU for the API service)
- PostgreSQL on a laptop: similar to described above, but running locally in a docker compose stack
- Timescale on a laptop: similar to PostgreSQL on a laptop, but with the data table replaced by a Timescale hypertable with chunks of 4 weeks, and all data older than 8 weeks compressed, giving significant storage savings (21 GB versus 2.4 GB compressed).
We used Locust to run the following benchmark:
- Full dataset (20 years of data)
- Each query consists of randomly selecting a station (from 5 options), a variable (from 4 options), a query length (from 1 hour to 1 month), and a random start time in the 20-year period
- Each locust user does their queries in series
- The load test was done for 1, 10 or 25 parallel users.
Results:
DB | Users | Requests per second | Mean request time (ms) |
---|---|---|---|
PostgreSQL (production) | 1 | 16 | 42 |
PostgreSQL (production) | 10 | 103 | 77 |
PostgreSQL (production) | 25 | 154 | 130 |
PostgreSQL (laptop) | 1 | 23 | 32 |
PostgreSQL (laptop) | 10 | 78 | 100 |
PostgreSQL (laptop) | 25 | 80 | 250 |
Timescale (laptop) | 1 | 8 | 110 |
Timescale (laptop) | 10 | 27 | 320 |
Timescale (laptop) | 25 | 28 | 750 |
Notes:

- The "laptop" tests were performed on a Macbook with an arm64 processor, but amd64 containers were used for PostgreSQL and Timescale. So while the laptop results are comparable with each other, they should not be compared directly with the production results.
On average over all these tests, Timescale is a factor of 3 slower. Looking at the detailed results, Timescale is especially slower for small requests, while for requests that retrieve a month of data the difference is much smaller. This could be caused by Timescale always having to decompress a month of data (the chunk size), even for the smallest request size.
3.5. NMHS observation pipelines today
This section contains information about the methods Members currently use to share observations and their plans to deliver observations in the future, as provided during the design phase. The focus is on data from Automatic Weather Stations (AWS), not data from all observation networks.
3.5.1. MET Norway
Transfer of observations to GTS:
- Data are collected from stations and sent through a Quality Control (QC) system.
- Quality-controlled observations from QC are sent on a Kafka queue in an internal XML format.
- A BUFR generation job listens on the Kafka queue and creates BUFR files for each message on the queue. The BUFR files are then sent to an in-house built GTS server. This GTS server then creates GTS messages based on these BUFR files.
Sharing of near real-time and archive data through APIs:
- Internal pub/sub solution based on Kafka
- Custom-built REST interface (https://frost.met.no) for querying observations data, but it is not OGC API EDR, and it does not provide FAIR-compliant data, e.g.:
  - Unique and persistent identifiers are missing
  - It is a non-standard custom API
  - Vocabularies are not standard
  - Provenance is missing
- MET Norway has during the last 4 years worked on unifying its general data management to meet the FAIR principles through a metadata-driven approach. To enable FAIR observation data timeseries, we do the following:
  - Observations are retrieved from frost.met.no and written to NetCDF-CF files that are made available through an OPeNDAP API. The NetCDF-CF files contain data per instrument per station, and both raw and corrected observations. They are filled with new observations at regular intervals. The data is documented with discovery metadata following the Attribute Convention for Data Discovery (ACDD), plus extensions to support extended interoperability, and use metadata following the CF convention.
  - The discovery and use metadata are used to create XML files with discovery metadata following an internal specification (MMD; https://github.com/metno/mmd). This metadata also contains information about data access, e.g., file paths, OPeNDAP URLs and OGC WMS URLs.
  - The MMD XML files are ingested in a SOLR database (NoSQL).
  - SOLR is connected to the outside world through APIs (e.g., OAI-PMH, CSW, opensearch) and web solutions in Drupal to provide both machine-to-machine and human-machine interfaces. The Drupal websites also contain persistent dataset landing pages.
  - Translation (xslt) of the discovery metadata between different standards is done on the fly when requested by the user, in order to enable interoperability.
One example implementation of the system is the Svalbard Integrated Arctic Earth Observing System. AWS datasets from Svalbard can be found there via a human search interface. Some functionality is still under development, but the metadata-driven approach to data management should be clear, e.g., demonstrating discovery metadata interoperability through the "Export metadata" function. See the SIOS Data Management System web page for more information.
Plans to become compliant with WIS 2.0
- We aim to reuse as much as possible. We have ongoing collaboration with the geopython group, and have started using pygeoapi.
- We do not outsource any development work.
- RODEO/E-SOH is used.
3.5.2. KNMI
Current production chain for collecting observations data and sending them to users:
Observations data is transferred to the GTS server through an in-house message format that is translated to BUFR messages and put onto GTS using MSS. There is also an SQL observation database (KMDS), but the BUFR message generation is not based on data in the database (parallel track).
Sharing of near real-time and archive data through APIs:
In the KNMI Data Platform there is both a file-based API and an (alpha) EDR API that provide observation data at 10-minute intervals. The EDR API can currently be queried by position (lat/lon) or station ID. The return format is CoverageJSON.
Plans to become compliant with WIS 2.0
KNMI does plan to become compliant with WIS 2.0, but there is no definite plan yet, just some ingredients for a solution, including RODEO WP3 components.
- Building a solution yourself? No, preferably not. Currently we use MSS; we might replace this with a newer version, which would hopefully be WIS 2.0 compliant.
- Outsourcing the development work to a 3rd party? Or,
- Using the deliverables of RODEO WP3 (E-SOH)? Yes, as much as possible.
Number of stations and the temporal resolution of the data that might be made available
How many AWS do you operate and will make available? → Official AWS: 26. If we also count airfields and offshore rigs, there are about 60 stations that provide observation data.
What temporal resolution do you expect to be made available from these stations? → 10 minutes
Can you provide information about the format(s) you plan to share data with users? → Sub-hourly BUFR is being worked on (but not in production). CoverageJSON for the API.
3.5.3. FMI
What is your current production chain for collecting observations data and sending them to Users? For example, can you explain…
How do you transfer observations data to the GTS server?
- A process regularly collects observations from stations. Observations are stored in a database.
- QC is triggered by storage to the database, and quality flags are set on the data.
- Data are fetched regularly from the database, converted to BUFR and copied to the GTS server.
Comment: Input data file (one or more stations) → converted to a BUFR data file (same number of stations as the input file) → the BUFR file is copied to MSS as it is.
Do you already share near real-time and archive data through APIs? Yes via SmartMet Server OGC WFS and timeseries plugin functionality.
Is this a pub/sub type solution to files hosted on a web server? Neither.
Do you have a facility for users to query a ‘database’ to access observations using a EDR compliant API? Yes, OGC API EDR in progress.
Do you have definite plans to become compliant with WIS 2.0? If so, does it involve… A and C.
WIGOS Station Identifiers (WSI) are already implemented in the rain observation data and converted to BUFR data, but they are not yet sent out to Sweden or Estonia. We are planning to implement WSI in other observation data as well (AWS sub-hourly BUFR, ocean data).
Could you give us a bit more information about the number of stations and the temporal resolution of the data that might be made available? AWS stations (175) + airports (19) = 194 stations, all of which have a WMO number.
How many AWS do you operate and will make available? 175
3.5.4. Met Office (UK)
Current Production Chain and GTS distribution:
- Collect data from stations.
- Store data in database.
  - Basic QC as data are stored,
  - Station metadata added,
  - Data averaged to minute temporal resolution.
- EDR interface to database; near real-time data and archived data through the same interface.
- Product generation system creates BUFR files.
  - Every 10 minutes, gets data from the DB and creates BUFR files for each station according to the WMO standard for sub-hourly surface observations.
  - Every hour, gets data from the DB and creates BUFR files for each station according to the WMO standard for hourly surface observations.
- Hourly BUFR files get copied to GTS via Met Office Moving Weather systems. (Sub-hourly BUFR messages are not shared on the GTS but are available for internal Met Office use.)
- API Access
  - We don't have a pub/sub solution yet for the 10 min BUFR files, but it is planned.
  - Yes, we have EDR access to AWS station data.

WIS 2.0 plans
We plan to build a solution ourselves but will adapt our solution to match that being developed by RODEO WP3 and to be compliant with FEMDI as well as WIS 2.0.
Current AWS Network
- Approximately 300 stations. Most in the UK but also in some overseas territories.
- 1-minute observations are available through the interactive EDR API. 10-minute BUFR files for each station will also be produced.
- As well as BUFR, the interactive EDR API returns data in? To be confirmed. JSON?
4. Conclusions
In this document we presented both the experiences gathered from Proof-of-Concept (PoC) experiments during the E-SOH project and some selected prior experience from the participating institutes.
Architecture decisions are covered by the architecture document, so this PoC report does not comment on the suitability/feasibility of the evaluated techniques or solutions.
As the conclusion from the PoCs, the key findings are:

- it will be possible to run the E-SOH system on the EWC (European Weather Cloud) (infrastructure PoC)
- the wis2box reference implementation works, but the ingestion is too slow for our use case; if used, it needs to be further developed (wis2box PoC)
- Elastic (Elasticsearch) is one promising option for the data store (document DB for JSON files, geo-query support) (Elastic & wis2box PoCs)
- a GIS-capable relational database, namely PostGIS (with or without the TimescaleDB extension), is another potential option for the data store (netCDF vs. DB PoC)
- there is earlier encouraging experience with PostGIS-based solutions for similar kinds of needs, and expertise on that, as well as expertise on EDR API implementation, within the project consortium (previous experiences chapter and its subsections)