TDMS Fragmentation: Why Your TDMS Files Use Too Much Memory
It’s been pretty hectic around here, but nothing would have stopped me getting to the Central South LabVIEW User Group (CSLUG) this week. Since I started working on my own, these events are even more valuable.
This time around I presented on TDMS files. Not the sexiest subject going, but by understanding how they work you can avoid some key pitfalls.
What is TDMS?
In summary, TDMS is a structured file format by National Instruments. It is used heavily in LabVIEW and also DIAdem (which I’m a big fan of) because it allows you to save files with:
- A similar footprint and precision to a binary file.
- A self-describing structure (it can be loaded by LabVIEW, DIAdem, Excel or any other application with a TDMS library without knowing how it was produced).
- The ability to be efficiently data mined through DIAdem or the LabVIEW DataFinder Toolkit.
If this is brand new to you I recommend you read this article first as we are going to jump in a little deeper.
Sounds Great, What are the Pitfalls?
In applications with a simple writing structure it is a fantastic format that lives up to what it promises. However, with more complex writing patterns you can end up with a problem called TDMS fragmentation. This can occur if you:
- Write separate timestamp and data channels (or any pattern where you are writing multiple data types).
- Write different channels to the same file alternately.
- Write to multiple groups simultaneously.
To understand why, we must look at the structure.
The TDMS Structure
The first thing to understand is that TDMS files allow streaming by using a segmented file structure (their predecessor, TDM files, had to be written in one go). In essence, every time you call a TDMS write function, a new segment is added to the file.
Each segment contains:
- Header or Lead In data – This describes what the segment contains and offset information to allow random access of the file.
- Meta Data – This states what channels are included in the segment, any new properties and a description of the raw data format.
- Raw Data – Binary data for the channels described.
So how does this impact our disk or memory footprint?
TDMS has a number of optimisations built in to try and bring the footprint as close to binary as possible.
When you write two consecutive segments which have the same channel list and meta data, the TDMS format will skip the meta data (and even the lead in) for the second segment. The space used is then only that of the raw data, giving an effective “compression ratio” of 100% (no overhead on top of the raw data).
Taking the scenario where we write exactly the same channels repeatedly to the file, we only get one copy of meta data and all the rest is raw data, exactly what we want.
But consider this scenario:
A common case is one where we want to write to the file twice, using two different TDMS writes. Each TDMS write is going to add a segment to the file. Because the writes alternate between the two, the meta data changes on every write and has to be written every time. This leads to a fragmented file.
This will happen in any scenario where we are using multiple TDMS write nodes to a single file.
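The rule above can be sketched in a few lines of Python. This is a toy model only, not the real TDMS library: the function name and the idea of describing each write as a tuple of channel names are my own assumptions for illustration. It simply counts how many writes would need fresh meta data because the channel list differs from the previous write.

```python
def count_metadata_blocks(writes):
    """Count the writes that need new meta data in this toy model.

    Each write is a tuple of channel names. Meta data is only counted
    when the channel list differs from the previous write.
    """
    blocks = 0
    previous = None
    for channels in writes:
        if channels != previous:
            blocks += 1
            previous = channels
    return blocks


# One write node, same channel every time: meta data is written once.
steady = [("voltage",)] * 1000
print(count_metadata_blocks(steady))       # 1

# Two write nodes taking turns: every write changes the channel list.
alternating = [("timestamp",), ("voltage",)] * 500
print(count_metadata_blocks(alternating))  # 1000
```

Both cases make 1000 writes, but the alternating pattern ends up with a thousand times as many meta data blocks.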
You can also see the level of fragmentation is going to depend on how much raw data is included in each write.
If we write 10,000 points each time, the meta data will still be much smaller than the raw data and, although the file is fragmented, it is probably acceptable.
If, however, we write 1 sample each time, the raw data areas (the green areas) are going to shrink a lot and you could end up with more meta data than real data!
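To put rough numbers on this, here is a back-of-the-envelope calculation in Python. The byte sizes are assumptions picked for illustration (real meta data size depends on channel names, properties and data types), so treat the percentages as indicative only.

```python
# ASSUMPTIONS: illustrative sizes, not measured from a real file.
LEAD_IN_BYTES = 28      # assumed lead-in size per segment
META_BYTES = 100        # assumed meta data size per segment
BYTES_PER_SAMPLE = 8    # one double-precision sample


def overhead_fraction(samples_per_write):
    """Fraction of a fully fragmented file taken up by lead-in and meta data."""
    overhead = LEAD_IN_BYTES + META_BYTES
    raw = samples_per_write * BYTES_PER_SAMPLE
    return overhead / (overhead + raw)


for n in (10_000, 100, 1):
    print(f"{n:>6} samples per write: {overhead_fraction(n):.2%} overhead")
# 10000 samples per write: 0.16% overhead
#    100 samples per write: 13.79% overhead
#      1 samples per write: 94.12% overhead
```

With these assumed sizes, the file is mostly meta data well before you get down to one sample per write.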
We can measure the impact of fragmentation by looking at the size of the tdms_index files that are generated when the files are used. This is essentially all of the meta data that has been extracted from the file.
Here we can see file 2.tdms is exactly what we want: 1 kB of meta data for a 15 MB file. 0.tdms, however, is heavily fragmented: 12 MB of the 36 MB file is used by meta data. (In this case 0.tdms and 1.tdms actually contain exactly the same data; the difference comes from using some of the techniques mentioned later and demonstrated in an example mentioned at the end.)
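If you want to run the same check on your own files without opening each one, a small Python script can compare each .tdms file with its index file. This is a minimal sketch: the function name is mine, and it assumes the index file sits next to the data file with "_index" appended to the name (0.tdms and 0.tdms_index).

```python
from pathlib import Path


def index_ratio(tdms_path):
    """Return index size divided by data file size, or None if it cannot be computed."""
    data = Path(tdms_path)
    index = data.with_name(data.name + "_index")  # 0.tdms -> 0.tdms_index
    if not index.exists() or data.stat().st_size == 0:
        return None
    return index.stat().st_size / data.stat().st_size


if __name__ == "__main__":
    # Report every TDMS file in the current directory.
    for path in sorted(Path(".").glob("*.tdms")):
        ratio = index_ratio(path)
        if ratio is None:
            print(f"{path.name}: no index file found")
        else:
            print(f"{path.name}: index is {ratio:.1%} of the data file")
```

A ratio close to zero is what you are hoping for; a large ratio is a sign the file is fragmented.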
When working with fragmented files you will also see the memory usage of the library increase over time. This is because the TDMS library keeps a model of the file in memory, collating the meta data so that it can do things like perform random access. The more meta data, the more memory required.
(Contrary to some reports, this is not a “memory leak” in the strict sense of being unexpected. It’s entirely predictable, not that it makes you feel much better about it!)
To reduce the memory usage, you need to either reduce fragmentation or periodically close the file and open a new one.
So how do we avoid TDMS fragmentation?
- Write as a single datatype (i.e. a single write node). If we have timestamp data, this means converting it to seconds first, or writing an offset time.
- Write separate files and combine them later.
- Write larger chunks of data. This will still give a fragmented file but the meta data is spread across much more raw data and the effect is not as pronounced.
- Use TDMS buffering. There is a special property you can set in the TDMS file called “NI_MinimumBufferSize”.
When you write a numeric to this property, the library will buffer all data for a segment in memory until it has that many samples. This is the easiest solution but does mean:
a) Additional RAM usage
b) In the event of a crash/power loss you will lose the most recent data.
- If disk space is the main concern, defragment the files before storage. There is a defragment function in the TDMS palettes that can be used once the file is complete to reduce its size.
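To get a feel for the buffering trade-off, here is a toy Python model. It is not how the NI library is implemented; the class and its behaviour are assumptions that follow the description above: samples for each channel are held in memory until a minimum count is reached, then written out as one segment.

```python
class BufferedChannelWriter:
    """Toy model: hold samples per channel until a minimum count is reached."""

    def __init__(self, min_buffer_size):
        self.min_buffer_size = min_buffer_size
        self.pending = {}    # channel name -> samples still only in RAM
        self.segments = []   # (channel, sample count) for each segment "on disk"

    def write(self, channel, samples):
        buffer = self.pending.setdefault(channel, [])
        buffer.extend(samples)
        if len(buffer) >= self.min_buffer_size:
            # Flush the whole buffer as a single segment.
            self.segments.append((channel, len(buffer)))
            buffer.clear()

    def at_risk(self):
        """Samples that would be lost if the application crashed right now."""
        return sum(len(buffer) for buffer in self.pending.values())


def simulate(min_buffer_size, iterations=550):
    """Alternate single-sample writes to two channels, as in the fragmented case."""
    writer = BufferedChannelWriter(min_buffer_size)
    for i in range(iterations):
        writer.write("timestamp", [float(i)])
        writer.write("voltage", [0.0])
    return len(writer.segments), writer.at_risk()


print(simulate(1))    # (1100, 0)   every write becomes its own segment
print(simulate(100))  # (10, 100)   far fewer segments, 100 samples still in RAM
```

The second result shows both sides of the deal: the segment count drops from 1100 to 10, but 100 samples are sitting in memory and would be lost in a crash.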
Your homework is to investigate this using an example I posted on the community, which:
- demonstrates the effect of fragmentation in the cases shown earlier,
- demonstrates the effectiveness of the memory buffer in solving the problem, and
- creates the 0.tdms, 1.tdms and 2.tdms files I showed earlier.
Go take a look and keep an eye on the size of those index files!