VirtData provides more than just a set of libraries and APIs. The concepts are the starting point and foundation, and the software tools are built directly from them. A strong understanding of these concepts will enable you to get the most out of VirtData.
Procedural generation is a method of generating data by feeding a stream of random or pseudo-random data into an algorithm. Usually, procedural generation aims to produce content which appears original, not generated by an algorithm – believably authentic by some standard.
In VirtData, this takes the form of applying a layer of data mapping functions to a set of inputs called coordinates.
Data Mapping Functions
The primary tool in the VirtData toolbox is the data mapping function. It is simply a function that knows how to take an input and produce a result that is meaningful to the user. VirtData makes it easy to reuse data mapping functions in pre-assembled ways, thus acting as a “recipe” for a virtual dataset.
An input coordinate is simply the logical address of a set of data that a mapping function can render. Input coordinates are often single scalar values. However, for some functions, multiple values may be supported. At the most basic level, input coordinates can be thought of as the independent variables that determine all of the dependent output variables.
A single-valued input coordinate can be used, for example, to describe a point on a time line. A coordinate pair may be used to represent latitude and longitude. A 4-tuple could describe a place and time with X, Y, Z, and T components.
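To make the idea of coordinates and mapping functions concrete, here is a minimal sketch in Python. It does not use the VirtData API; the names `EPOCH`, `coordinate_to_timestamp`, `coordinate_to_user_id`, and `grid_cell` are hypothetical and chosen only for illustration. Each function takes a coordinate (a scalar or a pair) and renders a value that is meaningful to a user.

```python
import math
from datetime import datetime, timedelta, timezone
from typing import Tuple

# Hypothetical base instant for the time line; not taken from VirtData.
EPOCH = datetime(2020, 1, 1, tzinfo=timezone.utc)


def coordinate_to_timestamp(coordinate: int) -> datetime:
    """Map a scalar coordinate to a point on a time line, one second per step."""
    return EPOCH + timedelta(seconds=coordinate)


def coordinate_to_user_id(coordinate: int) -> str:
    """Map a scalar coordinate to one of 1000 illustrative user identifiers."""
    return f"user-{coordinate % 1000:04d}"


def grid_cell(coordinate: Tuple[float, float]) -> str:
    """Map a (latitude, longitude) pair to the one-degree grid cell containing it."""
    lat, lon = coordinate
    return f"cell({math.floor(lat)},{math.floor(lon)})"
```

The same coordinate always renders the same value, so a coordinate works like an address into the virtual dataset.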
Random Number Generators
Sequences of values produced by random number generators (more properly called Pseudo-RNGs) are not actually random, even though they may pass certain tests for randomness. Useful (P)RNGs produce apparently random data that is completely deterministic. In practice, the combination of these two properties is quite valuable for testing and data synthesis. Apparent randomness and determinism (AKA repeatability in this context) are not mutually exclusive.
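The combination of apparent randomness and determinism can be illustrated with Python's standard `random` module, which is used here only as a familiar example of a PRNG. The helper name `first_values` is hypothetical.

```python
import random
from typing import List


def first_values(seed: int, count: int) -> List[float]:
    """Return the first `count` values of a PRNG stream started from `seed`."""
    rng = random.Random(seed)  # private generator, independent of global state
    return [rng.random() for _ in range(count)]


# Two streams built from the same seed are identical, value for value.
a = first_values(42, 5)
b = first_values(42, 5)

# A different seed yields a different, equally repeatable stream.
c = first_values(43, 5)
```

Here `a == b` holds on every run, which is the repeatability that makes such generators useful for testing.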
There are ways of collecting data that is truly random, although such methods are not generally useful for testing at speed. Assume that the term RNG in VirtData always refers to the common Pseudo-RNG.
While RNGs do provide a repeatable stream upon which to build a virtual data set, this requires that you always iterate the RNG state in the same order. That is, you must step through every cycle in exactly the same order each time you want to create a rendering of the data. Ideally, we want to be able to observe a part of the virtual data set without having to iteratively advance the RNG state to the interesting part.
For this, we use hashing instead of RNGs. By using a hash with relatively good apparent randomness and high dispersion, we can achieve much the same as with an RNG with the added benefit of random access to the data stream.
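The difference between sequential access and random access can be sketched as follows. This is an illustration under stated assumptions, not VirtData's implementation: SHA-256 from Python's standard library stands in for whatever hash function is actually used, and the names `hashed_value` and `rng_value_at` are hypothetical.

```python
import hashlib
import random


def hashed_value(coordinate: int) -> int:
    """Map a coordinate directly to a well-dispersed 64-bit value.

    No state is carried between calls, so any coordinate can be
    evaluated in any order. SHA-256 is only a stand-in hash here.
    """
    digest = hashlib.sha256(coordinate.to_bytes(8, "big", signed=True)).digest()
    return int.from_bytes(digest[:8], "big")


def rng_value_at(seed: int, position: int) -> float:
    """Reach the value at `position` in a PRNG stream by stepping through it.

    The cost grows with `position`, because every earlier value
    has to be produced and discarded first.
    """
    rng = random.Random(seed)
    for _ in range(position):
        rng.random()
    return rng.random()
```

With the hash, reading the value at coordinate 1,000,000 costs the same as reading the value at coordinate 0. With the RNG, it costs a million discarded steps.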
Just as RNGs can appear random when they are not truly random, statistical distributions that take them as inputs behave the same way. By feeding a uniform RNG over the unit interval [0, 1.0] into a sampling function derived from a density function (for example, the inverse of the corresponding cumulative distribution function), we can simulate random sampling (with replacement) over an imagined population of entities, events, times, etc. This is a common building block of realistic simulations, from video games to database tests and everything in between.
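One common way to turn a uniform value into a sample from a chosen distribution is inverse transform sampling. The sketch below assumes an exponential distribution purely as an example, and again uses SHA-256 as a stand-in hash; `unit_interval` and `exponential_sample` are hypothetical names, not VirtData functions.

```python
import hashlib
import math


def unit_interval(coordinate: int) -> float:
    """Map a coordinate to an apparently random float in [0, 1)."""
    digest = hashlib.sha256(coordinate.to_bytes(8, "big", signed=True)).digest()
    bits = int.from_bytes(digest[:8], "big") >> 11  # keep the top 53 bits
    return bits / 2**53  # exactly representable, strictly below 1.0


def exponential_sample(coordinate: int, rate: float = 1.0) -> float:
    """Sample an exponential distribution by inverting its CDF.

    For F(x) = 1 - exp(-rate * x), the inverse is -ln(1 - u) / rate.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    u = unit_interval(coordinate)
    return -math.log(1.0 - u) / rate
```

Because the uniform value comes from the coordinate rather than from a stepping generator, the sample for any coordinate is repeatable and can be computed on its own.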
Mapping vs Generation
VirtData emphasizes the idea of “data mapping” over “data generation”, but allows users to break this rule when necessary. Data mapping implies that pure functions are being used. Data generation implies that deterministic output is not expected. The choice between the two is simply a matter of whether you use mutable state in your mapping.
A function that depends on mutable state in addition to the input value is not guaranteed to yield the same result for a given input. Such functions may produce the same sequence of outputs given the same sequence of inputs, but this is not sufficient for simulating sampling from a population with stable properties.
A mapping function that does not depend on changing state is effectively a pure function. This includes functions that depend on data, as long as that data itself doesn’t change.
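The distinction can be shown with a small sketch. All names here (`pure_mapping`, `NAMES`, `name_for`, `StatefulGenerator`) are hypothetical and the arithmetic is arbitrary; only the presence or absence of mutable state matters.

```python
def pure_mapping(coordinate: int) -> int:
    """Pure: the result depends only on the input coordinate."""
    return (coordinate * 31 + 7) % 100


# Data that never changes is safe to depend on; the mapping stays pure.
NAMES = ("alice", "bob", "carol")


def name_for(coordinate: int) -> str:
    """Pure: looks up a value in an immutable table."""
    return NAMES[coordinate % len(NAMES)]


class StatefulGenerator:
    """Not pure: the result also depends on how many calls came before."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, coordinate: int) -> int:
        self.calls += 1  # mutable state changes on every call
        return (coordinate * 31 + 7 + self.calls) % 100
```

Calling `pure_mapping(5)` twice gives the same value both times. Calling one `StatefulGenerator` instance twice with `5` gives two different values, even though two fresh instances fed the same input sequence would agree with each other.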
Parameters to a function that can initialize it are simply another form of immutable state – so long as these parameters do not change for the life of the function.
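A function factory is one way to picture initialization parameters as immutable state. In this sketch, `modulo_range` and `to_age` are hypothetical names: the parameters are captured once when the mapping function is built and never change afterwards, so the resulting function remains pure.

```python
from typing import Callable


def modulo_range(minimum: int, maximum: int) -> Callable[[int], int]:
    """Build a mapping function that folds coordinates into [minimum, maximum]."""
    if maximum < minimum:
        raise ValueError("maximum must not be less than minimum")
    width = maximum - minimum + 1  # fixed for the life of the returned function

    def mapper(coordinate: int) -> int:
        return minimum + (coordinate % width)

    return mapper


# Initialize once, then use as an ordinary pure mapping function.
to_age = modulo_range(18, 65)
```

Every call to `to_age` with the same coordinate returns the same value, because nothing the function depends on can change after construction.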