If it is not real data, then what is its value? Virtual data is something that, when you need it, has no substitute.
To explain why, let us consider the trade-offs in some basic simulation and testing challenges and what they mean.
Challenge: access large amounts of bulk data
Actual Data
- depends primarily on the speed and efficiency of the storage system
- limited to the size of the storage system

Virtual Data
- depends on the speed of generation functions and the CPU
- not effectively limited by the size of storage
In this case, virtual data is strongly favored for testing systems that need to be faster than the systems they target. This is an essential requirement for any system that must provide high-fidelity results for timing, loading levels, and so on. Essentially, load testing is a primary reason for having procedurally generated data. You cannot get high-fidelity results if the testing apparatus saturates under load as easily as the test target does.
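To make the throughput argument concrete, here is a minimal sketch of purely procedural generation, assuming nothing about the Virtual DataSet API itself: a SplitMix64-style bit mixer maps each cycle number directly to a record, so generation speed is bounded by the CPU rather than by storage. The record layout and names are illustrative only.

```java
import java.util.concurrent.TimeUnit;

public class GenerationSpeedSketch {

    // SplitMix64-style bit mixer: maps a cycle number to a well-scrambled long.
    static long mix(long x) {
        x += 0x9E3779B97F4A7C15L;
        x = (x ^ (x >>> 30)) * 0xBF58476D1CE4E5B9L;
        x = (x ^ (x >>> 27)) * 0x94D049BB133111EBL;
        return x ^ (x >>> 31);
    }

    // Derive one synthetic "user" record purely from its cycle number: no I/O.
    static String userRecord(long cycle) {
        long h = mix(cycle);
        long userId = Math.floorMod(h, 1_000_000_000L);
        long age = 18 + Math.floorMod(h >>> 32, 60L);
        return "user-" + userId + ",age=" + age;
    }

    public static void main(String[] args) {
        long records = 10_000_000L;
        long checksum = 0;
        long start = System.nanoTime();
        for (long i = 0; i < records; i++) {
            checksum ^= userRecord(i).hashCode();
        }
        long ms = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        System.out.printf("generated %d records in %d ms (checksum %d)%n",
                records, ms, checksum);
    }
}
```

Nothing here touches a disk or a network; the only limits are the mixing function and the CPU, which is exactly what lets the generator outrun the system under test.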
Challenge: modify basic properties of test data
Actual Data
- requires recapture of the test data, regardless of origin

Virtual Data
- has no moving mass; changes instantly according to a recipe
The ability to change your data set instantly is also a boon for testing systems. Not only can you modify the conditions of a test by swapping out your virtual data set, you can create multiple variations that elicit interesting contrasts and outcomes, ad infinitum. The stored footprint of a virtual data set is paltry; its recipe is often smaller than the testing logic itself.
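As a sketch of how instant modification might look, the hypothetical Recipe type below bundles the few parameters that define a data set; swapping recipes swaps the entire data set with no recapture step. None of these names come from the Virtual DataSet API.

```java
import java.util.function.LongFunction;

public class RecipeSwapSketch {

    // SplitMix64-style bit mixer, as before.
    static long mix(long x) {
        x += 0x9E3779B97F4A7C15L;
        x = (x ^ (x >>> 30)) * 0xBF58476D1CE4E5B9L;
        x = (x ^ (x >>> 27)) * 0x94D049BB133111EBL;
        return x ^ (x >>> 31);
    }

    // A hypothetical recipe: a handful of numbers fully defines the data set.
    record Recipe(long seed, int minAge, int maxAge) {
        LongFunction<String> generator() {
            return cycle -> {
                long h = mix(cycle ^ seed);
                long age = minAge + Math.floorMod(h, (long) (maxAge - minAge + 1));
                return "user-" + Math.floorMod(h >>> 16, 1_000_000L) + ",age=" + age;
            };
        }
    }

    public static void main(String[] args) {
        // Two variations of a data set; only the recipe differs, nothing is recaptured.
        Recipe adults = new Recipe(42L, 18, 64);
        Recipe seniors = new Recipe(42L, 65, 99);
        for (long i = 0; i < 3; i++) {
            System.out.println("adults : " + adults.generator().apply(i));
            System.out.println("seniors: " + seniors.generator().apply(i));
        }
    }
}
```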
Challenge: create nuanced interplay between data
Actual Data
- carries the same design burden as virtual data, but still with moving mass
- significant effort is spent managing the bulk of the data

Virtual Data
- can create statistically shaped data without bulk transfer
- can provide determinism in addition to statistical shaping
- can simulate superset/subset relationships
Because virtual data depends on its generation methods, and those methods can themselves be chosen to have interesting mathematical relationships, many things become possible that we would not even consider when raw data is simply handed to us.
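For illustration, here is one way such relationships can fall out of the generation functions themselves, using hypothetical names throughout: because users and orders derive from the same deterministic userId mapping, every order refers to a user that the user generator also produces, and a simple transform skews which users receive the orders.

```java
public class RelatedDataSketch {

    // SplitMix64-style bit mixer, as before.
    static long mix(long x) {
        x += 0x9E3779B97F4A7C15L;
        x = (x ^ (x >>> 30)) * 0xBF58476D1CE4E5B9L;
        x = (x ^ (x >>> 27)) * 0x94D049BB133111EBL;
        return x ^ (x >>> 31);
    }

    static final long USERS = 1_000L;

    // The canonical user-id function, shared by both generators below.
    static long userId(long cycle) {
        return Math.floorMod(mix(cycle), 10_000_000L);
    }

    static String user(long cycle) {
        return "user-" + userId(cycle);
    }

    // Each order deterministically picks an owner from the first USERS user
    // cycles, so every referenced user is also produced by the user generator:
    // a subset relationship with no shared storage. Squaring the uniform
    // variate skews orders toward low-numbered users, a crude stand-in for
    // statistical shaping.
    static String order(long cycle) {
        double u = (mix(cycle) >>> 11) / (double) (1L << 53); // uniform in [0,1)
        long ownerCycle = (long) (u * u * USERS);
        return "order-" + cycle + " -> user-" + userId(ownerCycle);
    }

    public static void main(String[] args) {
        for (long i = 0; i < 5; i++) {
            System.out.println(user(i));
            System.out.println(order(i));
        }
    }
}
```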
Challenge: access real data
Real data wins this one, obviously. Virtual data is not strictly real data, but it can be; read on for why.
Challenge: access realistic data
Virtual data can be as realistic as you need it to be, up to and including being based on samples of real data. If you want to capture a real data set and access it as such through a virtual interface, then the data is, in fact, real data.
Virtual data methods can be used as a sliding scale between accessing real data and making it all up according to statistical recipes. With a small amount of real data, you can create a very high volume of simulated data. The small sample is almost always easy to fit in memory, which keeps the data generators fast, yet the generated data can be extrapolated to sizes useful for testing even the largest systems.
For accessing realistic data, why not both? If you have an example data set, nothing prevents you from using it as raw data through the virtual data interface. Doing so gives you useful choices for tackling some of the challenges above.
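A minimal sketch of that sliding scale, with placeholder sample values: a small captured sample is consulted through the same cycle-to-value interface used for purely synthetic data, so a handful of real values can back an arbitrarily large synthetic data set.

```java
import java.util.List;

public class SampledDataSketch {

    // SplitMix64-style bit mixer, as before.
    static long mix(long x) {
        x += 0x9E3779B97F4A7C15L;
        x = (x ^ (x >>> 30)) * 0xBF58476D1CE4E5B9L;
        x = (x ^ (x >>> 27)) * 0x94D049BB133111EBL;
        return x ^ (x >>> 31);
    }

    // A small captured sample; in practice this might be loaded from a file.
    static final List<String> CITY_SAMPLE =
            List.of("Austin", "Berlin", "Osaka", "Nairobi", "Lima");

    // The same cycle->value interface as purely synthetic data: a 5-element
    // sample can back billions of synthetic rows without leaving memory.
    static String city(long cycle) {
        int idx = (int) Math.floorMod(mix(cycle), (long) CITY_SAMPLE.size());
        return CITY_SAMPLE.get(idx);
    }

    public static void main(String[] args) {
        for (long i = 0; i < 8; i++) {
            System.out.println("row " + i + ": " + city(i));
        }
    }
}
```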
Challenge: partition a dataset for distributed testing
Actual Data
- requires a bulk processing step for any reconfiguration or (re)partitioning

Virtual Data
- subset groupings are easily described by common recipes
When you need to identify, control, or vary how data is distributed in a testing scenario, there is no comparison.
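As an illustration of recipe-described partitioning (the parameter names here are assumptions, not the Virtual DataSet API), each worker can derive its own slice of the cycle space from three numbers, so repartitioning means changing those numbers rather than reshuffling stored data.

```java
public class PartitionSketch {

    record Partition(long start, long endExclusive) {}

    // Derive a worker's slice of the cycle space from the recipe parameters.
    static Partition partition(int workerIndex, int workerCount, long totalCycles) {
        long base = totalCycles / workerCount;
        long remainder = totalCycles % workerCount;
        // Spread the remainder over the first workers so sizes differ by at most 1.
        long start = workerIndex * base + Math.min(workerIndex, remainder);
        long size = base + (workerIndex < remainder ? 1 : 0);
        return new Partition(start, start + size);
    }

    public static void main(String[] args) {
        long total = 1_000_000_000L;
        int workers = 4;
        for (int w = 0; w < workers; w++) {
            Partition p = partition(w, workers, total);
            System.out.printf("worker %d generates cycles [%d, %d)%n",
                    w, p.start(), p.endExclusive());
        }
    }
}
```

Because each worker generates its slice independently and deterministically, no coordinator ever has to move data between nodes; the partitioning exists only in the recipe.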
Challenge: canonically identify detailed test data parameters
Actual Data
- canonically capturing and representing the testing parameters of a bulk data set requires having, and retaining, the whole data set

Virtual Data
- can be described concisely and completely in a recipe
- whether you need book-end scenarios that isolate corner cases or more typical data, the recipes remain compact
- for data based on real samples, you get to pick how much is too much
It is possible to have realistic samples, statistical shaping, high throughput, immediate adaptivity, and repeatability all at the same time.
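A minimal sketch of the keystone of that combination, repeatability, again with hypothetical names: the (seed, function) pair is the canonical identity of the data set, so any run, on any machine, can regenerate identical values without retaining any bulk data.

```java
import java.util.function.LongFunction;

public class RepeatabilitySketch {

    // SplitMix64-style bit mixer, as before.
    static long mix(long x) {
        x += 0x9E3779B97F4A7C15L;
        x = (x ^ (x >>> 30)) * 0xBF58476D1CE4E5B9L;
        x = (x ^ (x >>> 27)) * 0x94D049BB133111EBL;
        return x ^ (x >>> 31);
    }

    // The (seed, function) pair is the whole canonical description.
    static LongFunction<Long> fromSeed(long seed) {
        return cycle -> mix(cycle ^ seed);
    }

    public static void main(String[] args) {
        LongFunction<Long> firstRun = fromSeed(42L);
        LongFunction<Long> secondRun = fromSeed(42L); // e.g. rebuilt on another machine
        for (long i = 0; i < 5; i++) {
            long a = firstRun.apply(i);
            long b = secondRun.apply(i);
            System.out.println(i + ": " + a + " == " + b + " -> " + (a == b));
        }
    }
}
```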
The examples above are only the tip of the iceberg in terms of what is possible. Virtual DataSet aims to make it easier to explore and use new dataset simulation methods.