Path-traced sequences at one sample per pixel (1 spp) are attractive for interactive previews but remain severely noisy, particularly under caustics, indirect lighting, and volumetric media. We present SampleMono, a novel approach that performs multi-frame spatiotemporal extrapolation of low-resolution, low-sample Monte Carlo sequences without requiring auxiliary buffers or scene-specific information. We transfer and prune a pre-trained video generation backbone and fine-tune it on SampleMono GYM, a synthetic Monte Carlo dataset, to generate four clean high-resolution frames from a longer window of noisy inputs, thereby decoupling the render and presentation timelines. Our experiments demonstrate that, by combining a frozen VAE encoder-decoder with a video generation model pruned to two transformer layers, our pipeline jointly performs spatial upsampling and temporal extrapolation: from a sequence of 16 severely noisy RGB frames at 256×144 resolution, spaced 50 ms apart, it generates the subsequent four RGB frames at 1280×720 resolution, spaced 12.5 ms apart, with substantially reduced noise (at varying quality), all within a 5 GB VRAM budget. We plan to publish the code for the data GYM, model pruning, pipeline training, and rendering.
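The input/output contract stated above (16 noisy 256×144 frames at 50 ms spacing in, four clean 1280×720 frames at 12.5 ms spacing out) can be sketched as a shape-checked stub; the function name and the nearest-neighbor placeholder are hypothetical and stand in for the actual VAE-plus-pruned-transformer pipeline, which is not shown here.

```python
import numpy as np

# Illustrative I/O contract only; `samplemono_io_stub` is a hypothetical
# placeholder, not the real SampleMono pipeline.
IN_FRAMES, IN_H, IN_W = 16, 144, 256     # 1-spp noisy inputs, 50 ms apart
OUT_FRAMES, OUT_H, OUT_W = 4, 720, 1280  # clean outputs, 12.5 ms apart

def samplemono_io_stub(noisy: np.ndarray) -> np.ndarray:
    """Map (16, 144, 256, 3) noisy frames to (4, 720, 1280, 3) frames.

    The real pipeline would encode with a frozen VAE, denoise and
    temporally extrapolate with a two-layer video transformer, and
    decode; here we only 5x nearest-neighbor upsample the last input
    frame and repeat it, to make the tensor shapes concrete.
    """
    assert noisy.shape == (IN_FRAMES, IN_H, IN_W, 3)
    last = noisy[-1]
    up = last.repeat(5, axis=0).repeat(5, axis=1)  # 144*5=720, 256*5=1280
    return np.stack([up] * OUT_FRAMES)

frames = np.random.rand(IN_FRAMES, IN_H, IN_W, 3).astype(np.float32)
out = samplemono_io_stub(frames)
print(out.shape)  # (4, 720, 1280, 3)
```

Note that the 4:1 ratio of input to output frame spacing (50 ms vs. 12.5 ms) is what lets the render loop run at a quarter of the presentation rate.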