Deterministic random generation#

stablebear’s random generation system produces reproducible results regardless of thread count or execution order. This is achieved by deriving a unique, deterministic seed for each tensor element from a master seed and the element’s position, so that parallel threads never share or depend on each other’s random state.

This mechanism underpins all random operations in stablebear – generating noisy PCFs, sampling Poisson point processes, and any future operation that needs per-element randomness.

Design goal#

Several operations populate tensors element-wise with random data: sample_poisson fills a point cloud tensor with Poisson-distributed points, noisy_sin fills a PCF tensor with noisy discretizations, and so on. These operations are parallelized across threads via Taskflow [1].

In a parallel setting, the order in which elements are visited depends on the thread scheduler – this is non-deterministic. A shared RNG would produce different results on every run. The requirement is: same seed, same results, always – independent of the number of threads, the order of execution, or the platform.

Architecture#

The system has three layers:

Engine (xoroshiro128pp.hpp) – the random engine that produces pseudorandom output from a given state.
RandomGenerator (random_generator.hpp) – holds a master seed and derives per-element engines via make_engine.
Walk with random (walk.hpp) – tensor traversal that pairs each element with its own deterministically-seeded engine.

Any function that needs per-element randomness simply calls walk() or parallel_walk() with a RandomGenerator, receiving an independent engine at each element. The function does not need to know anything about the seeding strategy.

Engine: xoroshiro128++#

The default engine is Xoroshiro128pp, an implementation of xoroshiro128++ [2]. It satisfies the C++ UniformRandomBitGenerator concept, so it works with all standard distribution types (std::normal_distribution, std::poisson_distribution, etc.).

The engine takes two 64-bit state words (s0, s1) and produces 64-bit output. The core algorithm lives in detail/xoroshiro128pp_impl.hpp (public domain reference code, unmodified), while the engine wrapper is in xoroshiro128pp.hpp.

Key properties:

128-bit state – two uint64_t words, vs 2.5 KB for mt19937_64
Period – \(2^{128} - 1\), far more than needed for per-element streams
Quality – passes BigCrush [3]
Speed – significantly faster than both mt19937_64 and counter-based alternatives (Philox) for the init-then-draw pattern used here

Seeding via make_engine#

The RandomGenerator class is templated on the engine type. It does not construct engines directly – instead, it delegates to detail::make_engine<EngineT>(seed), which encapsulates the engine-specific seeding strategy.

For the default engine, the seeding chain is:

m_seed + m_offset + flatIndex
  -> make_engine<Xoroshiro128pp>
    -> s0 = splitmix64(seed)
    -> s1 = splitmix64(s0)
    -> Xoroshiro128pp(s0, s1)

(m_offset is the per-generator stream counter described in Advancing between draws below; it is 0 for the first draw after seeding, so the chain reduces to m_seed + flatIndex there.)

This follows the recommended seeding approach from [2]: chain splitmix64 outputs to fill the state words, feeding each output as the next input. Since splitmix64 is a bijection, this guarantees two distinct, well-distributed state words and avoids the all-zeros state (which xoroshiro cannot be in).

The default make_engine for other engine types applies a single splitmix64 and forwards to the engine’s constructor:

template <typename EngineT>
EngineT make_engine(uint64_t seed)
{
  return EngineT(splitmix64(seed));
}

Adding support for a new engine type requires only a make_engine specialization – the rest of the system (RandomGenerator, walk, consumer functions) is unaffected.

splitmix64#

Raw addition (seed + index) would produce correlated seeds for nearby indices. The splitmix64 hash function [4] transforms the sum into a well-distributed 64-bit value:

uint64_t splitmix64(uint64_t x)
{
  x += 0x9e3779b97f4a7c15ULL;
  x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
  x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
  return x ^ (x >> 31);
}

This ensures that adjacent indices produce statistically independent seed values.

RandomGenerator#

RandomGenerator<EngineT> stores a uint64_t master seed and a uint64_t offset – a counter of how many seed slots have been handed out so far. It holds no RNG engine state. A draw does not pull engines from the generator directly; instead it reserves a contiguous block of seed slots via reserve(n) and derives one engine per element from that block:

template <typename EngineT = Xoroshiro128pp>
class RandomGenerator
{
public:
  explicit RandomGenerator(uint64_t seed);

  // A reserved, contiguous range of seed slots. Captured by value, so it
  // stays valid even after the generator advances or the draw runs async.
  class Block
  {
  public:
    EngineT sub_generator(size_t flatIndex) const
    {
      return detail::make_engine<EngineT>(m_base + flatIndex);
    }
  private:
    uint64_t m_base;
  };

  // Reserve the next n slots [m_offset, m_offset + n) and advance past them.
  Block reserve(size_t n)
  {
    Block block(m_seed + m_offset);
    m_offset += n;
    return block;
  }
};

Within a single draw, element i is seeded from m_base + i where m_base = m_seed + m_offset. Since flatIndex is the row-major index into the tensor, the engine for each element is a pure function of position – not of execution order. Across draws, the offset moves forward (see below), so successive draws occupy disjoint ranges of seed slots.

Advancing between draws#

A generator must advance its state between sampling calls. Otherwise two consecutive draws from the same generator would reuse the same seed slots and produce byte-for-byte identical tensors – a silent footgun for any repeated-sampling loop (bootstrap, Monte Carlo), and contrary to the numpy.random.Generator / torch.Generator convention:

g = sb.random.Generator(seed=5)
a = sb.random.noisy_sin((3,), generator=g)
b = sb.random.noisy_sin((3,), generator=g)
# a and b differ; a fresh Generator(5) reproduces both, in order.

Advancing is the job of reserve(n): a draw over an N-element tensor reserves the block [m_offset, m_offset + N) and bumps m_offset by N immediately, before any element is filled. The next draw therefore starts at m_offset = N and cannot overlap the previous one. Re-seeding (seed()) resets the offset to 0.

Reserving up front – rather than advancing after the draw – is what makes this correct under parallelism. parallel_walk dispatches its per-element callback asynchronously via Taskflow, so the callbacks run after the call returns. If the generator were advanced only once the draw “finished”, there would be no well-defined moment to do so, and reading the live generator’s offset from inside the async callbacks would race. Instead, reserve computes the block base and advances the counter synchronously, then the immutable Block (a single uint64_t) is captured by value into the callback. Workers read only their copy; the generator is free to advance for the next draw.

Backward compatibility. The first draw after seeding uses m_offset == 0, so its element i is seeded from m_seed + i – exactly the pre-advancement behaviour. Existing seeds reproduce their historical first draw unchanged; only the second and later draws (previously duplicates) now differ.

Non-overlap by construction. Because each draw reserves a disjoint, contiguous slot range, independence between draws is structural – it does not rely on a hash scattering nearby seeds apart. (splitmix64 still decorrelates adjacent slots within the engine seeding chain, as described above.)

Walk integration#

The walk() and parallel_walk() functions in walk.hpp provide the bridge between the random generator and tensor traversal. Both sequential and parallel variants exist:

// Sequential: visits every element in row-major order
walk(tensor, generator, [](const std::vector<size_t>& idx, EngineT& engine) {
  // engine is seeded deterministically from idx's flat position
});

// Parallel: distributes elements across threads via taskflow
parallel_walk(tensor, generator, [](const std::vector<size_t>& idx, EngineT& engine) {
  // same engine for same idx, regardless of which thread runs this
}, executor);

Internally, both variants first call gen.reserve(tensor.size()) once to claim a block for the whole draw (advancing the generator), then compute the flat (row-major) index for each element and call block.sub_generator(flatIndex) to create that element’s engine. The Block is captured by value, so the parallel variant is safe even though its callbacks run asynchronously. The callback receives the engine by reference and can draw as many samples as needed.

This is the only entry point for deterministic randomness. A consumer function does not manage seeds or engines directly – it receives a ready-to-use engine from the walk and draws from it:

// Example: sample_poisson uses parallel_walk with a generator
sb::parallel_walk(out, gen,
  [&](const std::vector<size_t>& idx, auto& engine) {
    std::poisson_distribution<size_t> countDist(lambda);
    auto nPoints = countDist(engine);
    // ... fill point cloud using engine ...
  }, exec);

Adding a new random operation follows the same pattern: accept a RandomGenerator, pass it to walk or parallel_walk, and use the provided engine.

Why flat indexing works#

The flat index is computed from the multi-dimensional index and the tensor shape:

For a tensor with shape (s0, s1, ..., sN):
  flat(i0, i1, ..., iN) = i0 * (s1 * s2 * ... * sN) + i1 * (s2 * ... * sN) + ... + iN

This mapping is bijective – every element has a unique flat index, and the index depends only on position and shape, not on traversal order. The parallel walk computes the same flat index from any thread:

// In parallel_walk_impl:
size_t rem = flat;
for (ptrdiff_t i = ndim - 1; i >= 0; --i)
{
  idx[i] = rem % shape[i];
  rem /= shape[i];
}

This reverse computation reconstructs the multi-index from the flat index, ensuring both sequential and parallel walks agree on the mapping.

Concrete example#

Consider a (3, 4) tensor populated with seed 42:

Element (0, 0): flat = 0  -> make_engine(42 + 0)  -> splitmix64 chain -> engine
Element (0, 1): flat = 1  -> make_engine(42 + 1)  -> splitmix64 chain -> engine
Element (0, 2): flat = 2  -> make_engine(42 + 2)  -> splitmix64 chain -> engine
...
Element (2, 3): flat = 11 -> make_engine(42 + 11) -> splitmix64 chain -> engine

Each element gets its own Xoroshiro128pp engine with a unique state derived from its position. Whether the walk visits these 12 elements on 1 thread or 12 threads, each element always receives the same engine, producing identical results.

The draw reserves slots [0, 12) and advances the offset to 12, so a second draw from the same generator shifts every base by 12:

Second draw, element (0, 0): flat = 0  -> make_engine(42 + 12 + 0)  -> engine
Second draw, element (2, 3): flat = 11 -> make_engine(42 + 12 + 11) -> engine

The two draws never share a seed slot, so they differ – yet replaying with a fresh Generator(42) reproduces both draws in the same order.

Global and explicit generators#

The system supports two usage patterns:

Global generator – sb::seed(42) sets the seed on a process-wide DefaultRandomGenerator (default_generator()). Some functions use this by default when no generator is passed explicitly.
Explicit generator – RandomGenerator gen(42) creates an independent generator that can be passed to functions, allowing multiple independent random streams.

Switching engines#

The engine is a template parameter on RandomGenerator. To use a different engine:

Implement the engine class satisfying UniformRandomBitGenerator (result_type, operator(), min(), max()).
Add a detail::make_engine specialization if the engine needs a non-standard seeding strategy.
Instantiate RandomGenerator<YourEngine>.

The walk infrastructure, consumer functions, and Python bindings do not change.