Managing Large Data Sets in LabVIEW
Have you ever run out of memory in LabVIEW?
I gave a presentation at the CLD Summit last week covering some of the design considerations and a few techniques that can help in high-throughput applications. If you click the cog you can also open the speaker notes, which elaborate on some of the points.
The problem with long term waveform data storage
This was a key item that I didn’t manage to get to. Currently there are two prevalent techniques I would look at.
Relational databases
These are your SQL-based databases, whether MySQL, MS SQL Server or similar.
The idea in these is that all of the data is stored in tables. The columns available are fixed in the design of the database and you add data by filling rows. The relational element comes from the fact that each row has a unique identifier which can be referenced in other tables.
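As a rough sketch of that idea (using Python’s built-in sqlite3 in place of a full MySQL or MS SQL Server setup, with table and column names invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database for the sketch

# The columns are fixed in the design; every row gets a unique id.
conn.execute("""CREATE TABLE test_run (
    id       INTEGER PRIMARY KEY,
    operator TEXT,
    started  TEXT)""")

# The relational element: measurement rows reference a test_run row by its id.
conn.execute("""CREATE TABLE measurement (
    id          INTEGER PRIMARY KEY,
    test_run_id INTEGER REFERENCES test_run(id),
    name        TEXT,
    value       REAL)""")

run_id = conn.execute("INSERT INTO test_run (operator, started) VALUES (?, ?)",
                      ("sam", "2014-05-01T09:00:00")).lastrowid
conn.execute("INSERT INTO measurement (test_run_id, name, value) VALUES (?, ?, ?)",
             (run_id, "supply_voltage", 11.98))
conn.commit()
```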
The challenge with waveform data is understanding how this translates to a table.
You could store the entire waveform in a single field as a binary blob, but this limits the searchability (which I think is a word!).
Alternatively you can create a new row for every data point, with a timestamp on each row, which seriously increases the storage required and reduces the performance of the searches. This is before you get into working out the correct design to get optimum performance. Both designs are sketched below.
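To make the trade-off concrete, here is a minimal sketch of both designs (sqlite3 again, with hypothetical names): a blob column that is compact but opaque to queries, against a row per sample that is searchable but multiplies the row count.

```python
import sqlite3
import struct

conn = sqlite3.connect(":memory:")
samples = [0.0, 0.5, 1.0, 0.5]

# Design 1: the whole waveform in one blob field. Compact, but SQL cannot
# see inside the blob, so you cannot query against the sample values.
conn.execute("""CREATE TABLE waveform_blob (
    id      INTEGER PRIMARY KEY,
    t0      TEXT,
    dt      REAL,
    samples BLOB)""")
conn.execute("INSERT INTO waveform_blob (t0, dt, samples) VALUES (?, ?, ?)",
             ("2014-05-01T09:00:00", 0.001,
              struct.pack("%dd" % len(samples), *samples)))

# Design 2: one row per data point, each with its own timestamp. Fully
# searchable, but at acquisition rates the row count and storage explode.
conn.execute("""CREATE TABLE waveform_points (
    id    INTEGER PRIMARY KEY,
    t     REAL,
    value REAL)""")
conn.executemany("INSERT INTO waveform_points (t, value) VALUES (?, ?)",
                 [(i * 0.001, v) for i, v in enumerate(samples)])
conn.commit()

# Only the row-per-point design can answer questions like this directly:
print(conn.execute("SELECT COUNT(*) FROM waveform_points WHERE value > 0.4").fetchone())
```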
DataFinder
DataFinder is National Instruments’ solution to this. It is a file indexer: you store all of the files that you want to make searchable in a common place which DataFinder can index.
Through DataPlugins, multiple file types can be supported, but they all get translated to the TDMS-style structure so that the properties become searchable. You then use the DataFinder Toolkit for LabVIEW, or DIAdem, to mine through the data.
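Those searchable properties live at the file, group and channel level of the TDMS structure. A minimal sketch of writing them from Python, assuming the open-source npTDMS package (the property names here are invented; in LabVIEW you would set the equivalent properties through the TDMS API):

```python
import numpy as np
from nptdms import TdmsWriter, RootObject, GroupObject, ChannelObject

# Properties at file, group and channel level are what the indexer can search.
root = RootObject(properties={"operator": "sam", "test_rig": "rig-03"})
group = GroupObject("acquisition", properties={"sample_rate": 1000.0})
channel = ChannelObject("acquisition", "voltage",
                        np.linspace(0.0, 1.0, 1000),
                        properties={"unit_string": "V"})

with TdmsWriter("run_0001.tdms") as writer:
    writer.write_segment([root, group, channel])
```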
This has some appealing characteristics. It is ridiculously simple to set up compared to a database: just put your files in the right place. This also makes it quite flexible, being able to take data from different sources and still keep it easily searchable.
The main issue is that it is file based: if the data is continuous and the section you’re interested in spans several files, you have to code around this and load from multiple files yourself, as in the sketch below.
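That workaround tends to look something like this pure-Python sketch, where load_samples is a hypothetical callback standing in for the actual file read: work out which files overlap the requested range, load each one, then trim and join the pieces.

```python
def read_span(file_index, t_start, t_stop, load_samples):
    """Stitch a continuous span of data out of sequential files.

    file_index:   list of (path, first_time, last_time) for each file
    load_samples: hypothetical callback returning (timestamps, values) for a path
    """
    times, values = [], []
    for path, first, last in sorted(file_index, key=lambda f: f[1]):
        if last < t_start or first > t_stop:
            continue  # this file does not overlap the requested span
        t, v = load_samples(path)
        for ti, vi in zip(t, v):
            if t_start <= ti <= t_stop:
                times.append(ti)
                values.append(vi)
    return times, values
```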
However, I am wondering whether there may be a new kid on the block that could overcome some of these issues:
NoSQL and MongoDB
So I’m pretty sure every credible writer starts with a Wikipedia quote:
A NoSQL or Not Only SQL database provides a mechanism for storage and retrieval of data that is modelled in means other than the tabular relations used in relational databases. Motivations for this approach include simplicity of design, horizontal scaling and finer control over availability.
In short, NoSQL is one of those wonderful buzzwords that doesn’t mean anything specific, just different!
I was quite intrigued, though, by some of the different data models. The one that stands out is the document model used by MongoDB, among others.
This means that instead of defining tables you add data as documents. These documents contain fields which can be indexed, but the model is quite flexible: different documents don’t have to match each other exactly in structure, as the sketch below shows. At the very least this will map onto the structure of DataFinder very nicely and could be a viable alternative where the file-based management is unappealing.
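A minimal sketch with the pymongo driver (assuming a MongoDB server on localhost; the database and field names are mine): two differently shaped documents sit happily in one collection, and an index on a shared field keeps them searchable, much like DataFinder properties.

```python
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")
runs = client.test_data.runs

# Two documents with different structures in the same collection.
runs.insert_one({"operator": "sam", "test_rig": "rig-03",
                 "results": {"supply_voltage": 11.98}})
runs.insert_one({"operator": "sam", "firmware": "1.4.2"})  # no results field at all

# Index a shared field so queries on it stay fast.
runs.create_index([("operator", ASCENDING)])
for doc in runs.find({"operator": "sam"}):
    print(doc)
```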
My next step, though, is to investigate the options for spreading data across documents. Most databases allow you to define server-side functions for how data is retrieved, which typically perform well and can be used from any language that has a database driver. I’m planning to investigate whether this will allow for a structure that can retrieve continuous data without worrying about file boundaries and make some of our “big data” challenges go away.
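One shape that investigation might take (pymongo again, with a schema I have invented for the sketch): store the waveform as fixed-size chunk documents carrying their start and end times, and let the server select and order the chunks covering a requested span. A plain indexed range query already does this; MongoDB’s aggregation pipeline would be the next step for pushing more of the work server-side.

```python
from pymongo import MongoClient, ASCENDING

chunks = MongoClient("mongodb://localhost:27017").test_data.waveform_chunks
chunks.create_index([("t_start", ASCENDING)])

# Write side: break the acquisition into chunk documents as it streams in.
dt = 0.001
acquired_blocks = [[float(i) for i in range(1000)] for _ in range(3)]  # stand-in data
for n, block in enumerate(acquired_blocks):
    t0 = n * len(block) * dt
    chunks.insert_one({"t_start": t0,
                       "t_end": t0 + (len(block) - 1) * dt,
                       "dt": dt,
                       "samples": block})

# Read side: the server finds and orders every chunk overlapping the
# requested span; the client only trims and joins -- no file boundaries.
def read_span(t_start, t_stop):
    out = []
    query = {"t_end": {"$gte": t_start}, "t_start": {"$lte": t_stop}}
    for doc in chunks.find(query).sort("t_start", ASCENDING):
        for i, value in enumerate(doc["samples"]):
            t = doc["t_start"] + i * doc["dt"]
            if t_start <= t <= t_stop:
                out.append((t, value))
    return out

print(len(read_span(0.5, 1.5)))  # pulls samples spanning the first two chunks
```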