AptPlot Demultiplexers

I spent the previous post talking about developing AptPlot, a plotting tool written in Java. There's a lot that I glossed over to stick to some self-determined ineffable blog post maximum length, but what I glossed over the most were the actual analysis code plot files that it was extended to support, and why I kept mentioning demultiplexers. So what is a plot file?

Plot file basics

If you've never worked with a thermal hydraulic code (which I'm guessing is pretty likely), here's the gist: these simulators output their data as they're crunching it, one time slice after another, after another.

Here's a hypothetical view of a typical plot file structure for a super simplified simulation that has a few pipes, where each pipe is simulated as a number of nodes along its length and the solver calculates temperature and pressure at every simulated moment (in reality, solvers simulate a lot more than temperature and pressure across much bigger and more varied components). The hypothetical pipe1 is composed of two nodes:

Simplified view of a multiplexed plot file

At the top of the file is a header with some metadata and a list of channel headers. Each channel basically maps to one piece of data in every "time slice". The first header says that the data at time slice index 0 is time (how far into the simulated event we are), the second piece of data is pressure in pipe1's first node, and so on. These channel headers generally carry some additional info, like the channel's units (generally seconds for time, possibly kelvins for temperature, etc.) and maybe a few other details that aren't super relevant to this post. The file header might also have an offset to the first time slice for quick seeking, shown by the arrow in the diagram. The body is simple multiplexed data: a sequence of binary-coded values written one after the other, in the order defined in the header, to the end of the time slice. Then the next bytes of the file start the next time slice. Depending on the format, each time slice might be wrapped in some kind of block structure with additional metadata, but this general structure gives you the gist of it.
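
To make that layout a little more concrete, here's a rough sketch of it as Java types (AptPlot's language). The field names are hypothetical stand-ins for illustration, not any particular code's actual format.

```java
import java.util.List;

// Hypothetical sketch of the multiplexed layout described above; the field
// names are illustrative, not any specific code's real format.
record ChannelHeader(
        String name,       // e.g. "pipe1 node 1 pressure"
        String units) {}   // e.g. "Pa", "K", "s"

record PlotFileHeader(
        String metadata,               // info about the code and the run
        long firstSliceOffset,         // byte offset of the first time slice
        List<ChannelHeader> channels) {}

// The body is just those channels' values repeated, one full set per time slice:
//   slice 0: [time, pipe1 node 1 pressure, pipe1 node 1 temperature, ...]
//   slice 1: [time, pipe1 node 1 pressure, pipe1 node 1 temperature, ...]
//   ...and so on, until the end of the file.
```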

This is a perfectly reasonable way to output data for an analysis code. These codes have to output data as they go because they absolutely cannot store it all internally until the run completes: sims can be very long running (on a scale of days or even weeks), and especially at the time, the size of the generated plot file could often exceed system memory. Codes also need to be able to recover from a crash or a manual TERM signal (more on that in a bit), so writing time slices as they go is the only sensible option. It's also a perfectly reasonable output format for post-processing applications like visualization and animation tools that step through the plot file one time slice at a time.

It's not a great format for graphing data though.

Multiplexed data was too slow

When plotting data, an analyst wants to do something like graph three pipe nodes' pressure values against each other, or compare pressure over time to temperature drops, or whatever other area they're trying to gain insight into. They don't need the contents of the whole file at any given time. But with multiplexed data, plotting three channels meant iterating over the entire file, one slice at a time, until all the data was retrieved. How else could you do it? There's no way of knowing where each time slice begins without iterating over the file at least once.
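
To sketch why that hurts (still using the made-up format from above), pulling a handful of channels out of a multiplexed file looks roughly like this: every value of every time slice gets read, and most of them are immediately thrown away.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class MultiplexedScan {
    // Extract a few channels (by their slice indices) from a multiplexed
    // stream. Every value of every time slice has to be read, because nothing
    // in the file says where the next slice begins without reading this one.
    static List<double[]> extract(DataInputStream in, int channelCount, int[] wanted)
            throws IOException {
        List<double[]> points = new ArrayList<>();
        double[] slice = new double[channelCount];
        while (true) {
            try {
                for (int i = 0; i < channelCount; i++) {
                    slice[i] = in.readDouble();   // most of these are discarded
                }
            } catch (EOFException end) {
                break;                            // no more time slices
            }
            double[] point = new double[wanted.length];
            for (int j = 0; j < wanted.length; j++) {
                point[j] = slice[wanted[j]];
            }
            points.add(point);
        }
        return points;
    }
}
```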

This was entirely too slow. We're talking about binary files that were often in the several-gigabyte range, at a time before flash storage was a common workstation option. For the hard disks of the era, reading the whole file every time you wanted to plot some data was a huge burden. You could do it, and our plugins supported it, but it bogged things down.

To speed things up, we offered a pre-processing step in the form of independent demultiplexer utilities that converted the plot file into a format more suitable for plotting:


Demultiplexing the plot file

The demultiplexers created a plot file that restructured the data to better support plotting. Now every channel was laid out in one contiguous region of the file: all of pipe1 node 1's temperature values were written, start to finish, before the file moved on to the next channel. Each channel header contained a file offset to where its data started for quick seeking, identified which time channel it depended on (which had its own file offset), and listed the number of values in the channel.

This made for orders-of-magnitude faster plotting. Now the plotting tool could jump to the time channel, grab everything in one contiguous read, then seek to the requested channel and read what it needed. On the hard drives analysts were using, this was a massive boost. Demultiplexing the file took a few minutes up front, but afterwards reading a given channel went from 1-2 minutes with the multiplexed file (or longer, in some cases) to a second or two with the demultiplexed file. That up-front cost was a price analysts were always willing to pay.
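
Sketching the read path under the same hypothetical format, plotting one channel boils down to a couple of seeks and two contiguous reads; the offsets and count would come straight from the channel headers described above. (In practice you'd pull each block into a buffer in one read rather than value by value, but the shape is the same.)

```java
import java.io.IOException;
import java.io.RandomAccessFile;

public class DemuxedRead {
    // With the demultiplexed layout, a channel header (hypothetically) carries
    // the byte offset of its data, the offset of its time channel, and the
    // number of values, so one channel is two seeks and two contiguous reads.
    static double[][] readChannel(RandomAccessFile file,
                                  long timeOffset, long dataOffset, int count)
            throws IOException {
        double[] time = readBlock(file, timeOffset, count);
        double[] data = readBlock(file, dataOffset, count);
        return new double[][] { time, data };
    }

    static double[] readBlock(RandomAccessFile file, long offset, int count)
            throws IOException {
        double[] values = new double[count];
        file.seek(offset);                 // jump straight to the channel's data
        for (int i = 0; i < count; i++) {
            values[i] = file.readDouble(); // one contiguous run of values
        }
        return values;
    }
}
```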

Development complications

Developing the demultiplexers for different codes came with some tricky complications.

First is something I alluded to earlier: restarts. Like I said, these codes run for a long time. If something happened, like the user needing to stop the calculation, or the solver unexpectedly crashing, or the analyst being visited three days into a sim by the dreaded automatic Windows update, they couldn't afford to start over from scratch. They needed to restart the calculation from where it left off. Most of these solvers have some kind of restart capability built in.

But wow did that ever complicate demultiplexing. Because they didn't just allow restarting the calculation from where it left off. They allowed changing the model: changing the time basis (maybe from simulating second by second to every other second, or every half second), or changing some pipe's internal roughness, or changing how many nodes are in the pipe. Handling all these cases, and working out how to interpolate a channel so it stayed coherent from simulation start to end, meant covering a lot of tricky edge cases.
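
To give a flavor of the simplest possible case, here's a hypothetical sketch: the restarted run picks up at some time the original run had already passed, so the demultiplexer has to drop the superseded tail of the original channel before appending the new values. The method and its arguments are invented for illustration; the real cases (changed time steps, renumbered or vanished channels) were much messier.

```java
import java.util.ArrayList;
import java.util.List;

public class RestartSplice {
    // Hypothetical simplest restart case: drop the part of the original
    // channel that the restarted run supersedes, then append the new data.
    static List<Double> splice(List<Double> originalTimes, List<Double> originalValues,
                               double restartTime, List<Double> restartValues) {
        List<Double> merged = new ArrayList<>();
        for (int i = 0; i < originalValues.size(); i++) {
            if (originalTimes.get(i) >= restartTime) {
                break;                      // everything past here is superseded
            }
            merged.add(originalValues.get(i));
        }
        merged.addAll(restartValues);       // then the continued run's values
        return merged;
    }
}
```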

Second is that many of these codes are developed under government grant, and they didn't always have what you would consider an exhaustive file spec. A few did! They were often wrong. I will never forget poring over a particular poorly documented solver's plot file in a hex editor, trying to reverse engineer its format, and discovering that what little was in the spec was not only wrong, but that the channel and time slice totals were written as double-precision floating-point values instead of integers.
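
For illustration only, handling a quirk like that ends up looking something like this; the field layout is invented, not the actual solver's.

```java
import java.io.DataInputStream;
import java.io.IOException;

public class OddCounts {
    // Hypothetical illustration: a count field the spec calls an integer,
    // but that the solver actually writes as an 8-byte floating-point value.
    static int readCount(DataInputStream in) throws IOException {
        double raw = in.readDouble();        // e.g. 42.0 where you'd expect 42
        return (int) Math.round(raw);        // round rather than truncate
    }
}
```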

It's always healthy to remember that sometimes an innocuous coding mishap might very well outlive you and complicate the lives of future generations for many years to come. Now that's what I call legacy.
