December 11, 2025
Cloud Optimized DMAC (CODMAC)
Jonathan Joyce and Shane St Savage, Tetra Tech
Run time: 34:29
00:00:02:00 - 00:00:36:03
Micah Wengren
All right. Thanks very much, and great to see everyone on today. Today we are featuring a presentation on the Cloud-Optimized DMAC project with IOOS and Tetra Tech. Just to give some brief background before handing it off to our presenters: the CODMAC, or Cloud-Optimized DMAC project, is funded by the Bipartisan Infrastructure Law. It's something we started during this past fiscal year, so it's been ongoing for about six months,
00:00:36:19 - 00:01:09:01
Micah Wengren
and it's really geared towards an investment in improving the national DMAC infrastructure, and specifically aligning it more closely with commercial cloud, which is no longer so much an emerging platform as the predominant data management and dissemination platform in general. It's also building off of the outcomes of the Reaching for the Clouds project, which was executed during the past several fiscal years.
00:01:10:13 - 00:01:36:03
Micah Wengren
So we're really trying to build off of a lot of the learnings that emerged from Reaching for the Clouds. And while that project was pretty heavily focused on handling large ocean model data, this project, in addition to working with ocean model data, is also going to focus pretty closely on observing and time series data types as well,
00:01:36:03 - 00:02:07:21
Micah Wengren
because of course that's a critical thing that we do here at IOOS. So what we're really hoping to get from this effort is to improve aggregated, national-scale data management within the DMAC system, and provide a single or more centralized access point for users to access IOOS data at network-wide scale. And additionally, investigate opportunities for efficiencies to be gained across the DMAC system as a whole,
00:02:08:11 - 00:02:49:15
Micah Wengren / Jonathan Joyce
whether that's just leveraging the cloud in general, or potentially other enhancements like providing this national-scale aggregation in an interoperable way with existing RA DMAC systems. So we're really looking at ways to move ahead with our national DMAC system. I'm gonna go ahead and pass it off to Jonathan from Tetra Tech to dive into some of the details. Our presenters today are Jonathan Joyce and Shane St. Savage from Tetra Tech. So, over to you, Jonathan. Alright, thanks a lot for that intro, Micah.
00:02:49:17 - 00:03:22:16
Jonathan Joyce
So yeah, I'm Jonathan Joyce, and I've got Shane St. Savage with me. Today we're gonna be talking about the Cloud-Optimized DMAC project, or CODMAC. I'm gonna start with an intro and a little bit of the background. Some of you might have seen this; we did another tech webinar for the prior project about two years ago. So, a brief summary of what we did during that project, and then we'll talk about all the new projects that we're
00:03:23:05 - 00:03:47:20
Jonathan Joyce
bringing this research forward to. So the original Reaching for the Clouds project was a three-year research project where we were specifically looking at the emerging or established technologies, both in the cloud and in the scientific data management and/or data access space,
00:03:48:16 - 00:04:08:08
Jonathan Joyce
and developing a number of prototypes demonstrating those technologies and capabilities. Another big part of this was outreach with the IOOS community, with all of you, and with some of the downstream customers of IOOS data, to really see where the pain points were,
00:04:08:08 - 00:04:35:08
Jonathan Joyce
where we could add and enhance things, and where the opportunities were. At the end of that project we wrote up a technical architecture recommendation and a roadmap, which is up on the IOOS GitHub now, as well as documentation of the prototypes and the pros and cons of different approaches.
00:04:35:08 - 00:05:21:03
Jonathan Joyce
And so now, taking some of those ideas and moving forward with them will be the focus of this project. As Micah mentioned, we primarily focused on the forecast models during that project in order to build a full end-to-end capability, from the data side all the way to being able to display images from a webpage. And why focus on this stuff? Stepping back again, I'd say the main driving factor is just the data volume and data complexity that continue to grow
00:05:21:03 - 00:05:47:00
Jonathan Joyce
and that our systems need to be able to handle. The new forecast models are both larger and finer resolution. And the expectation of customers right now is not "I'm going to wait minutes to download something." It's more "I want to be able to explore a gigabyte- or terabyte-scale data set through a web browser
00:05:47:00 - 00:06:12:09
Jonathan Joyce
and just have the data there for me." Unfortunately, those legacy systems right now are not able to meet those demands. At Tetra Tech, we operate several THREDDS and ERDDAP instances, so we fully understand the maintenance challenges: they're difficult to run in the cloud,
00:06:12:09 - 00:06:41:02
Jonathan Joyce
and then there's the crashing from high-load periods and big data requests. And finally, we're not doing this in a silo. Greater NOAA and IOOS are leaning toward cloud adoption, and especially with AI, we wanna stay on top of those technologies so that we can stay
00:06:41:05 - 00:07:09:24
Jonathan Joyce
innovative, fast, and reliable. I'd also add that by moving into this cloud infrastructure, we have more ability to, say, play with new computers that just came out, or try out new AI-related tools. Basically, it's not limiting; you don't have to go out and buy things.
00:07:09:24 - 00:07:56:16
Jonathan Joyce
You can play with the latest technology as it rolls out. And next slide. Okay. So a lot of this was focused on the Pangeo tool set, and we took away a lot of lessons learned on how to use those tools. We've developed a strong understanding of various ways of operating cloud infrastructure: what works, what doesn't, and also how that fits with the nature of IOOS and how people work, and what the best trade-offs are in terms of the technology.
00:07:58:11 - 00:08:18:05
Jonathan Joyce
We've also gotten to know all of you a lot better, and what the requirements are. Ultimately, I'd say we feel more comfortable with these technologies. Three or four years ago, everything seemed kind of brand new.
00:08:18:05 - 00:08:40:23
Jonathan Joyce
We weren't sure. Now we have a grasp, I would say a solid grasp, and not just the CODMAC team; I've seen it in the entire IOOS community that people really understand the cloud a lot better and have more projects using it.
00:08:44:08 - 00:09:32:10
Jonathan Joyce
Next page is - there we go. And so, ultimately, this is about bringing the science information, at its high quality, directly to the end user. It no longer makes sense to say "well, we're technically limited," because we don't need to be. By moving to the cloud, we're able to parse through more data and deliver it in higher resolution to our clients. So this is, again, a reminder: through that project, and we're continuing to update these forecasts, we have these publicly available, cloud-optimized data sets
00:09:33:12 - 00:10:02:12
Jonathan Joyce
in Zarr 2. So if you haven't gone and investigated or played around with any of these, I'd encourage you to do so, because now is a good time. These are the prototypes that are out there, but it's good for us to know now: what else do we need to do? How can we better organize? How can we make this more usable? That's gonna be a big focus going forward.
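For anyone who wants to poke at one of these stores, here is a minimal sketch of that kind of exploration using xarray and s3fs; the bucket path and selection below are illustrative placeholders, not a real CODMAC URL:

```python
import xarray as xr  # requires s3fs installed for s3:// access

# Open a public Zarr store anonymously; everything is read lazily,
# so nothing is downloaded until you actually compute or plot.
ds = xr.open_zarr(
    "s3://example-ioos-bucket/models/example-forecast.zarr",  # hypothetical path
    storage_options={"anon": True},
)
print(ds)  # lists variables, coordinates, and chunking

# Select the most recent time step of one variable without pulling the whole array.
latest = ds[list(ds.data_vars)[0]].isel(time=-1)
```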
00:10:06:19 - 00:10:38:24
Jonathan Joyce
So, enough of the recap. Going forward, I'm going to talk about a few of the core technologies, which you've heard before, maybe in a different way, and then start to show a few use cases and how we're planning to apply those. The key things that we're thinking about in this next phase of development are, first of all, what some people call "operationalizing."
00:10:39:01 - 00:11:03:14
Jonathan Joyce
What does that mean? It's really about moving these systems from a prototype level to more reliable, understandable, and maintainable versions. That includes education: ensuring that all of you understand what these capabilities are. Not necessarily that everyone needs to be an expert
00:11:03:21 - 00:11:43:12
Jonathan Joyce
on Kubernetes and Dagster and Icechunk and all these things, but more that these things are out there, and if you need scalable infrastructure, we have a recommendation for that. If you have a data workflow, we have a recommendation for that. And with very minimal effort, we can start to use these common components to solve a variety of problems that you might come up with in your day-to-day. So a big part of this is also sharing what we have available.
00:11:44:11 - 00:12:10:11
Jonathan Joyce
And then, still on the data side, not everything is a technical problem. Across all the data sets, we still have a lot of challenges collecting standard metadata, and does everything mean the same thing across data sets? Not usually, and so that's a big challenge that we're undertaking as well.
00:12:12:08 - 00:13:30:24
Jonathan Joyce
And then, in addition to the gridded data types, we're also looking at how we handle these observation data sets, usually time-based collections, in a cloud-optimized way. And then, of course, AI is the elephant in the room: using this infrastructure, how can we serve AI-native workflows? So, a quick overview. One of the technologies that we're consolidating on is the Kubernetes platform. This doesn't mean that everything has to be on Kubernetes, but it makes sense when you have a distributed application, meaning several containers that need to talk to each other, or maybe stateless containers that you want to be able to scale up. Our team and other teams have found that it's a pretty good way of managing infrastructure. For people who don't really manage infrastructure: essentially, with Kubernetes you can have a collection of virtual machines,
00:13:34:11 - 00:14:46:19
Jonathan Joyce
and it gives you the tools to be able to manage all of those machines like it's one platform. You can add machines, you can subtract them, and you can run workloads on the right hardware: say you have a GPU requirement, you can run a GPU workload on the appropriate machine. And because we have a lot of small teams here, it makes the IT operational requirements, such as provisioning new machines, patching, and monitoring, more manageable by a small team or a small group of individuals.
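As a rough illustration of what "managing all of those machines like one platform" can look like, here is a hypothetical sketch using the official Kubernetes Python client; the pod name, container image, and GPU request are made up:

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running inside the cluster
v1 = client.CoreV1Api()

# Kubernetes tracks every machine (node) in the cluster for you.
for node in v1.list_node().items:
    gpus = node.status.allocatable.get("nvidia.com/gpu", "0")
    print(node.metadata.name, "allocatable GPUs:", gpus)

# Request a GPU and let the scheduler pick an appropriate machine.
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="example-gpu-job"),  # hypothetical
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="worker",
                image="example-registry/model-postprocess:latest",  # hypothetical
                resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": "1"}),
            )
        ],
    ),
)
v1.create_namespaced_pod(namespace="default", body=pod)
```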
00:14:46:19 - 00:16:05:19
Jonathan Joyce
Um, how do we go back? So our next tool that we're consolidating around is Dagster, for workflow management. We've understood since the start that workflow management is an important piece. There are a lot of different tools out there that do this; Dagster in particular fits with this community well because it can all be managed in Python. And when it is deployed, it can be deployed on Kubernetes, but that's not a requirement; you could also deploy it on a virtual machine or some other hardware. GMRI/NERACOOS has actually pioneered the use of this for their internal infrastructure and for managing data. Essentially, it gives you a strong user interface, and it supports a number of common data patterns, for example for understanding when your upstream data is out of date. The other nice feature is that these workflows can be independently authored and deployed. Meaning, we don't all have to work on the same Dagster workflow: if someone else has a data requirement, they could write it, test it on their local Dagster, and then send it to whoever's running the operational Dagster. And that would all be in Python.
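A minimal sketch of what such an independently authored Dagster workflow might look like, all in Python; the asset names, bucket paths, and schedule below are hypothetical, not the project's actual pipeline:

```python
from dagster import Definitions, ScheduleDefinition, asset, define_asset_job

@asset
def latest_forecast_files() -> list[str]:
    """Discover the newest upstream forecast files (placeholder logic)."""
    # In practice this might list objects in a NODD S3 bucket.
    return ["s3://example-nodd-bucket/forecast/2025-12-11.nc"]

@asset
def cloud_optimized_store(latest_forecast_files: list[str]) -> str:
    """Convert upstream files into a cloud-optimized (e.g., Zarr) store."""
    # Conversion logic (e.g., via VirtualiZarr) would go here.
    return "s3://example-bucket/forecast.zarr"

# Materialize both assets on a daily schedule; Dagster's UI then shows
# per-asset status, so you can see which data set is breaking.
daily_job = define_asset_job("daily_conversion", selection="*")

defs = Definitions(
    assets=[latest_forecast_files, cloud_optimized_store],
    schedules=[ScheduleDefinition(job=daily_job, cron_schedule="0 6 * * *")],
)
```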
00:16:08:18 - 00:17:47:22
Jonathan Joyce
Okay, last set of new tools: VirtualiZarr and Icechunk. In our original Reaching for the Clouds prototypes, we were using Kerchunk and the Zarr 2 protocol. Since then, there's been work on, call it the upgrade of Kerchunk, which is VirtualiZarr, another Python package. It's a little bit simpler to work with than Kerchunk and it addresses a lot of issues, but it's still actively under development. We have been able to convert some of the data sets to Zarr 3, but some, like FVCOM, are taking a little bit more work. So you could think of VirtualiZarr as like Kerchunk. And then Icechunk you could consider like a database layer for Kerchunk-like things, or really for array data. Rather than treating each data set as an individual file, it recognizes that our netCDF data sets are composed of a lot of different arrays, and that they may change over time. So it's a fast layer for improving access to array data types.
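A rough sketch of the VirtualiZarr-to-Icechunk pattern described here; both libraries are still evolving quickly, so exact call signatures vary by version, and the file path and commit message are made up:

```python
import icechunk
from virtualizarr import open_virtual_dataset

# Scan a netCDF file and build lightweight "virtual" Zarr references
# that point at the original bytes instead of copying the data.
vds = open_virtual_dataset("example_forecast_2025-12-11.nc")  # hypothetical file

# Icechunk behaves like a versioned, transactional store for array data.
storage = icechunk.local_filesystem_storage("/tmp/example-forecast-repo")
repo = icechunk.Repository.create(storage)
session = repo.writable_session("main")

# Persist the references and commit, much like a database (or git) commit.
vds.virtualize.to_icechunk(session.store)
session.commit("add 2025-12-11 forecast references")
```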
00:17:54:17 - 00:18:43:04
Jonathan Joyce
Okay, and so we're taking those new innovations, like Icechunk and VirtualiZarr, and rolling them into an upgrade of this cloud optimization workflow, which produces all those data sets that I showed earlier. First of all, we're looking at monitoring this by running it in Dagster. Before, we were just running Lambda jobs, but once you have a large number of data sets, it becomes difficult to understand which data set's breaking, which one isn't, things like that. So we're moving this workflow to Dagster so that we'll have better monitoring.
00:18:43:16 - 00:20:07:04
Jonathan Joyce
We can reduce the operating costs and move to our common Kubernetes infrastructure, again to help with our cloud costs. And then on what you could call the front-end side of that data conversion, we want to be able to display the data online through a variety of interfaces. That's where we've been able to test a lot of the performance improvements from Icechunk. The CORA data set, which I've shown before, is a very large, multi-terabyte reanalysis, and we tested it with Icechunk just to see how much faster we could get things. We did see some pretty good improvements in performance, and that's one of the main reasons why we're continuing to convert other data sets over. We're also working on vector tile rendering to be able to display barbs, and adding that into the common xpublish-wms.
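On the read side, a hedged sketch of how a front-end service might open an Icechunk-backed store with xarray; the bucket, prefix, and variable name are hypothetical, and Icechunk's Python API is still changing:

```python
import icechunk
import xarray as xr

# Connect to a (hypothetical) Icechunk repository living in S3.
storage = icechunk.s3_storage(
    bucket="example-ioos-bucket",
    prefix="cora-reanalysis.icechunk",
    region="us-east-1",
)
repo = icechunk.Repository.open(storage)
session = repo.readonly_session("main")

# Icechunk sessions plug straight into the Zarr/xarray stack.
ds = xr.open_zarr(session.store, consolidated=False)
water_level = ds["water_level"].sel(time="2018-09-14")  # lazy selection
```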
00:20:10:11 - 00:21:48:08
Jonathan Joyce
Another related project is the CO-OPS and OceansMap integration. CO-OPS is trying to reduce the number of copies of data that they have to manage, and they want to use the NODD data sets that are already there and available. So we're working on making that connection, like I showed previously: adding that monitoring, allowing them to query native grids so that they have a more accurate shoreline display, and eventually that will roll into their operational system for viewing the latest data. HFRNet is another project, which we just recently started this year. We transitioned the HFRNet website from a traditional server-based system to our cloud, so sort of a lift and shift. We're now receiving the data in S3 from NESDIS, and then we're responsible for displaying it on the website. So, not a lot of enhancement yet, but we do plan on using the services we have and the data aggregation tools to continue to provide a better experience for HFRNet data.
00:21:49:12 - 00:23:10:23
Jonathan Joyce
And then, the last project I'm showing is MetOcean Data Link, where we're trying to make it easier for other data collectors, for example offshore operators, to submit their data so that we can ingest it as observations for IOOS. They have a number of sensor streams that we are working on building infrastructure for, and essentially this infrastructure looks very similar to our original Reaching for the Clouds ideas: we have a data ingest, and we have some sort of orchestration, which is gonna be Dagster again. However, because it's time series data, we're going to need to look at different ways of storing it. Maybe TimescaleDB, maybe Parquet. And then finally, how do we deliver that data? Can we use xpublish? How can we integrate ERDDAP with some of these new data types?
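As one illustration of the Parquet option for time series storage, a small sketch with pandas/pyarrow; the schema, station ID, and bucket are invented for the example, not the MetOcean Data Link's actual design:

```python
import pandas as pd  # writing to s3:// additionally requires s3fs

obs = pd.DataFrame(
    {
        "time": pd.to_datetime(["2025-12-11T00:00Z", "2025-12-11T00:06Z"]),
        "station_id": ["platform-a", "platform-a"],  # hypothetical station
        "wind_speed": [12.4, 13.1],  # m/s
    }
)
obs["date"] = obs["time"].dt.date

# Partitioning by station and day keeps each file small enough
# for efficient range-request reads later.
obs.to_parquet(
    "s3://example-metocean-bucket/obs/",  # hypothetical bucket
    partition_cols=["station_id", "date"],
)
```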
00:23:12:01 - 00:24:21:01
Jonathan Joyce / Shane St. Savage
And that is all for me. I'm gonna let Shane present the next few projects. Yeah, you can just keep presenting; I'll tell you "next slide, please." Keep it simple. Okay, a similar project to the recently renamed MetOcean Data Link is the Water Level DAC. If you squint at the architecture diagrams, they look pretty similar, and we'll see that next. But the Water Level - oh, can you go back? Yeah. The Water Level DAC is, as the name implies, focused on assembling water level data. The main differentiator from other observational data sets is the water level datums. There's a lot of different metadata that needs to get captured correctly to make sure that the water level is in reference to a known ellipsoid or other indicator of water level height. There are a bunch of datums out there, and users are going to want to be able to convert between the different datums to get the output data in the datum that they want. So there's a fair amount of extra complexity on both the input and output of the water level data,
00:24:21:17 - 00:25:28:08
Shane St. Savage
and the community has been great about putting our heads together and starting to think about all these complexities, starting with the water level metadata team. I could call out a couple of people by name who have been leading that effort; they've been doing great work on surveying the community and deciding all the different fields that need to be captured, so we have really robust metadata to power the system. One of the main reasons that we're thinking about tackling water level as a community at the DAC level, with a centralized DAC, is that there are a couple of different high-volume producers of this data, including Hohonu and Green Stream, and a couple of these other large APIs. In our opinion, with pricing and political concerns still to be figured out, it would make more sense to grab and process the data from those high-volume providers all at once on a nationwide scale, and then also be able to pick up and ingest smaller startup, more specialized data providers at the regional level
00:25:28:10 - 00:26:29:07
Shane St. Savage
through the system that we all know and love, where regional ERDDAPs ingest, curate, and serve out that data, and then at the IOOS national level we pull data from those ERDDAPs into a central system. So the idea is a sort of best of both worlds: we tackle the high-volume APIs at the DAC level, and then pull in smaller data sets from regional ERDDAPs. Having all the data in one place, we can ensure that QA/QC using QARTOD is evenly applied in cases where we're not getting it from the data provider, and then provide these high-level, one-stop-shop data access APIs. That includes ERDDAP, but it can include some other, more cloud-native approaches, as I'll talk about in a couple of seconds. And this also ensures that we facilitate those user on-demand conversions of the data into the different datums that I mentioned; we have the toolset all in one place to make sure that we're providing that flexibility on the data requests.
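To make the datum-conversion idea concrete, here is a toy sketch of re-referencing a water level between vertical datums using per-station offsets; real conversions rely on surveyed offsets from authoritative datum tables, and all names and numbers below are made up:

```python
# Offsets of each datum above an arbitrary station zero, in meters (hypothetical).
STATION_DATUM_OFFSETS_M = {
    "example-station": {"NAVD88": 1.203, "MLLW": 0.000, "MSL": 0.847},
}

def convert_datum(value_m: float, station: str, src: str, dst: str) -> float:
    """Re-reference a water level from one vertical datum to another."""
    offsets = STATION_DATUM_OFFSETS_M[station]
    # Height above station zero = offset of source datum + observed value;
    # subtracting the destination datum's offset re-references the value.
    return value_m + offsets[src] - offsets[dst]

# 0.35 m above MLLW expressed relative to NAVD88:
print(convert_datum(0.35, "example-station", src="MLLW", dst="NAVD88"))
```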
00:26:29:07 - 00:27:33:13
Shane St. Savage
Next slide, please. So yeah, that's the architecture diagram that I was talking about. This is different than the MetOcean Data Link, but a similar concept. On the left, we start by collecting the asset metadata that I was talking about, all those crucial fields including the datum info, and then we pull from the large vendor APIs, and also from the RA ERDDAPs, to get all the data into one place. That flows into a data collection system, gets put in a central observation store, and then gets postprocessed and QA/QC'd. From there, we can serve out the data through various APIs, including ERDDAP and a simple REST API; produce data products like netCDF, potentially Parquet, and then a SHEF encoding if that's needed for NWS; and that can get pushed to NCEI through this project called Gitify that produces the archive formats served out through S3. Then clients can, depending on the use case, interact with those different APIs or static cloud-hosted S3 files.
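For the QA/QC step, a minimal sketch of running one QARTOD test with the ioos_qc package; the thresholds and values are illustrative, not operational configuration:

```python
import numpy as np
from ioos_qc.config import QcConfig

# Configure a single QARTOD gross range test with made-up water level bounds (m).
config = QcConfig({
    "qartod": {
        "gross_range_test": {
            "suspect_span": [0.0, 3.0],
            "fail_span": [-1.0, 5.0],
        }
    }
})

water_level = np.array([0.4, 0.5, 9.9, 0.6])  # the 9.9 value should fail
results = config.run(inp=water_level)
print(results["qartod"]["gross_range_test"])  # QARTOD flags: 1=pass, 3=suspect, 4=fail
```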
00:27:33:18 - 00:28:43:17
Shane St. Savage
Next. Okay, so the main reason that we're talking about this project in the context of CODMAC, as Micah mentioned at the start of the meeting, is that it's a good opportunity to start exploring some cloud-optimized, cloud-native data formats for observational data. A lot of what we've focused on, like Jonathan said, is the grids, the rasters. They're big data; they're the main pain point when you're trying to serve out the data. They take up a ton of RAM and disk space if you're trying to serve them out, and if you're running in the cloud, that stuff is expensive. So initially CODMAC focused on putting the data in Zarr, putting it into formats that can be efficiently queried. Now we're starting to look at some of the same approaches for observational data. Obviously, observational data is much smaller, so it's not quite the same pain points. But the industry-standard APIs can be quite memory intensive, and servers take care and feeding: you have to upgrade things, you have to keep an eye on them, you have to make sure they're not being hacked.
00:28:43:17 - 00:29:57:15
Shane St. Savage
So the interesting thing that we get out of taking a cloud-native approach with Parquet data is that you're just producing files. You're ingesting the data, you're running your QA/QC post-processing, and then you're writing a file that gets served out from an S3 bucket, and you're done with the responsibility of serving. Clients can then directly query that file using HTTP range requests, so it efficiently asks for just the data that's needed for the specific queries you're running. There's no service to maintain. It's massively scalable because you're just serving out files, so you have all of the S3 serving infrastructure of AWS or GCP or Backblaze, or on-premises S3 hosting if you want to do that, and it's all a standard, simple API. So there are lots of different places you can put this data, and it can deal with the huge fire hose of agentic AI requests. We're starting to see this uptick of huge bursts of requests from agentic AI that will quickly cripple any sort of legacy approaches
00:29:57:15 - 00:31:04:05
Shane St. Savage
to serving out this data, because they don't care. So on the one hand, we'll have to start autobanning overly aggressive requests, but on the other hand, if we can offer data in a precomputed format that can keep up with this, we're getting ahead of the game. And then there are some cool things you can do: you can directly query these Parquet files using a number of clients, including DuckDB. There's a version of DuckDB that runs in your browser called DuckDB-Wasm, so without even installing any software on your computer, you just pull up a webpage and can start querying these S3-hosted Parquet files. There's a simple example here where we're asking: what's the average temperature in August at Scripps Pier in 2025? We have the raw data in the Parquet file, and it computes this in about one second. It's very fast. In addition to running in the browser, I've seen approaches where you don't have a database running, but you have an API where a user can submit a request, and then, in a serverless fashion,
00:31:05:06 - 00:32:02:05
Shane St. Savage
a DuckDB instance will spin up, query the Parquet, and then return the JSON or whatever payload. So you don't actually have a database running that you have to keep healthy, and maybe more importantly, that you have to pay for in the cloud constantly; within a couple of milliseconds, a database instance will spin up and down, because it's so lightweight, and do whatever data processing you need. So there's a lot of pretty promising capability in this approach. And one other thing you can do, and we didn't mention it previously, but we've done some experimenting with grids here, is precompute your statistics. So maybe you have one Parquet file that serves the raw data, but you have another very small Parquet file that has the weekly averages for every parameter in your region, and it's very, very quick to do high-level, zoomed-out exploration of these time series if you precompute those statistics. Next slide please.
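A hedged sketch of that query pattern using DuckDB's Python client, analogous to the Scripps Pier example; the bucket URL and column names are hypothetical:

```python
import duckdb

con = duckdb.connect()          # in-memory database, nothing to host
con.execute("INSTALL httpfs")   # extension for reading remote files
con.execute("LOAD httpfs")      # fetches only the byte ranges the query needs

# Average August 2025 temperature, computed directly against the remote file.
avg_temp = con.execute("""
    SELECT avg(sea_water_temperature)
    FROM read_parquet('https://example-bucket.s3.amazonaws.com/obs/scripps_pier.parquet')
    WHERE time >= TIMESTAMP '2025-08-01'
      AND time <  TIMESTAMP '2025-09-01'
""").fetchone()[0]
print(avg_temp)
```

The same SQL runs unchanged in DuckDB-Wasm in the browser, and the precomputed-statistics idea is just a second, much smaller Parquet file queried the same way.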
00:32:02:05 - 00:33:06:12
Shane St. Savage / Jonathan Joyce
And I thought this was cool. It's probably pretty tough to read, but this is that in-browser DuckDB web shell. On the left, I'm asking it to describe the table structure of the Parquet file that's sitting out on AWS, and it's listing out all the different parameters in there. On the upper right, I'm saying: show me the average temperature at this station throughout every month of 2025, and almost instantly you can get that data out of it. No server involved, no API involved, just hitting the Parquet file. Next slide. Okay, I'm gonna hand it back off to Jonathan now. Yeah, just briefly: Shane and I have been hinting at it, but AI is really the next step, and at the forefront of our minds as we're transforming these systems.
00:33:07:11 - 00:34:27:20
Jonathan Joyce
Some of the DACs we operate, we're seeing now, can be taken down pretty easily with just a few AI agents trying to look for data. So we're trying to get ahead of that by looking at this scalable infrastructure, and we're also proactively looking at ways that we can use AI to assist with our data collection, data generation, or anything in between. I'm not necessarily gonna go through all of these, but I just want that at the top of your mind. And then I think we're ready to break for discussion and questions. Primarily, we wanna hear from you: what else do we need to look at in this CODMAC cloud space, what kind of challenges are you having, and how can we help keep making these connections?
