Accelerating Video-to-Vector Ingest to 8,000 Vectors per Second per Node

By Madison Bratina

Video now makes up the majority of the world's data, and making it searchable means turning every frame into embeddings at enormous scale. Running that work on CPUs leaves expensive GPUs idle while decode and preprocessing stages fall behind, and the per-frame cost swings wildly with content: a busy street scene produces many times the vectors of an empty hallway. We've built a GPU-accelerated ingest pipeline that decodes video, extracts embeddings, and streams the resulting vectors into an encrypted vector store, end to end on NVIDIA GPUs. This post shows how full-stack co-design across decode, embedding, batching, and transport sustains roughly 8,000 vectors per second on a single four-GPU node, and how throughput scales linearly as nodes are added.

End-to-end pipeline architecture

A deployment has three parts: a coordinator, an ingest fleet, and a storage fleet. The coordinator owns a run, placing one work item per video on an Amazon SQS queue and scaling an autoscaling group of GPU nodes to drain it. The ingest fleet, the subject of the rest of this post, pulls videos off the queue and turns them into vectors. Those vectors land in the storage fleet, which holds them in CyborgDB DiskIVF shards, an encrypted, disk-resident vector index, alongside a partitioned PostgreSQL table for frame metadata; at query time the coordinator fans a search across the storage nodes and merges the results.

Inside a single GPU node

Each ingest node is an Amazon EC2 g6.12xlarge instance: four NVIDIA L4 GPUs with 24 GB each, 48 vCPUs, and 192 GiB of memory. We run 16 worker processes per node, four pinned to each GPU, as Figure 1 shows. Each worker is a systemd template instance that loads its models once and keeps them resident in GPU memory, so it never pays model warmup again as it works through job after job.

A single g6.12xlarge ingest node: 4 NVIDIA L4 GPUs, 16 worker processes pinned 4 per GPU, SQS in, segment transport out to storage

Figure 1. A single ingest node. SQS feeds 16 worker processes, four pinned to each NVIDIA L4 GPU, which stream fp16 vectors to storage over the segment transport.

The worker count matters. A single worker per GPU leaves the L4 at around 10% utilization, because between inferences it spends most of its time on CPU work: decoding video, cropping objects, and serializing output. Running four workers per GPU overlaps those stalls and keeps the silicon working. The pinning is mechanical: with workers numbered from 1, each worker computes GPU_INDEX = (worker - 1) / 4 using integer division and exports the result as CUDA_VISIBLE_DEVICES. The workers start a few seconds apart so that 16 processes do not load model weights into GPU memory at the same instant.
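The pinning rule can be sketched in a few lines of Python. This is a minimal illustration, not the production launcher: the function names and the 3-second stagger are assumptions (the post only says "a few seconds"), while the four-workers-per-GPU mapping and the CUDA_VISIBLE_DEVICES variable come from the text above.

```python
import os

WORKERS_PER_GPU = 4  # from the post: four workers pinned to each GPU


def gpu_index_for_worker(worker: int, workers_per_gpu: int = WORKERS_PER_GPU) -> int:
    """Map a 1-based worker number to a 0-based GPU index using integer division."""
    if worker < 1:
        raise ValueError("worker numbers are assumed to start at 1")
    return (worker - 1) // workers_per_gpu


def start_delay_seconds(worker: int, stagger_seconds: float = 3.0) -> float:
    """Hypothetical stagger: each worker waits a little longer than the previous one."""
    return (worker - 1) * stagger_seconds


def pin_worker(worker: int) -> int:
    """Restrict this process to one GPU; call before any CUDA library initializes."""
    gpu = gpu_index_for_worker(worker)
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu)
    return gpu


if __name__ == "__main__":
    for w in range(1, 17):
        print(f"worker {w:2d} -> GPU {gpu_index_for_worker(w)}, "
              f"start after {start_delay_seconds(w):.0f}s")
```

Workers 1 through 4 land on GPU 0, 5 through 8 on GPU 1, and so on. Setting the variable before a CUDA library loads matters, because the visible device list is read when the runtime initializes.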

The transport matters too. In an A/B test with identical server-side work, the binary segment transport delivered 1.9x the throughput of a gRPC implementation, because it sends fp16 values on the wire where the gRPC path sent fp32 inside protobuf messages, and because it frames data directly over raw TCP with no protobuf parsing. An earlier HTTP path topped out near 44,000 vectors/sec and failed outright past one billion vectors. The data plane is therefore binary, fp16, and raw TCP.
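To make the fp16-on-the-wire idea concrete, here is a small sketch of a length-prefixed binary frame using only the Python standard library. The header layout (vector count, then dimension) and the function names are hypothetical; the post does not describe the actual segment format, so read this as an illustration of the technique rather than the real protocol.

```python
import struct

HEADER = struct.Struct("<II")  # hypothetical header: vector count, dimension


def encode_segment(vectors: list[list[float]]) -> bytes:
    """Pack vectors as a small header followed by little-endian fp16 values."""
    dim = len(vectors[0]) if vectors else 0
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must share one dimension")
    flat = [x for v in vectors for x in v]
    # "e" is the struct format character for IEEE 754 half precision.
    return HEADER.pack(len(vectors), dim) + struct.pack(f"<{len(flat)}e", *flat)


def decode_segment(buf: bytes) -> list[list[float]]:
    """Read the header, then slice the fp16 payload back into vectors."""
    count, dim = HEADER.unpack_from(buf, 0)
    flat = struct.unpack_from(f"<{count * dim}e", buf, HEADER.size)
    return [list(flat[i * dim:(i + 1) * dim]) for i in range(count)]
```

At two bytes per value, a 128-vector batch of 256-d embeddings carries 65,536 bytes of payload, against 131,072 bytes for the same batch in fp32, and the receiver can slice it without a schema parser.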

From a frame to a set of vectors

For each sampled frame, a worker runs two operations at the same time, as Figure 2 shows. It embeds the whole frame with Nomic Embed Vision to produce one vector for the scene, and it runs YOLOE, an open-vocabulary detector whose target classes are a list of text prompts set per run, capped at 30 objects per frame. Each detected object is cropped and embedded by the same Nomic Vision model. One frame therefore yields one scene vector plus one vector per object, all in a single vector space, which is what makes per-object search work rather than whole-frame matching alone.

Per-frame flow: the whole frame is embedded by Nomic Vision and run through YOLOE in parallel; crops go back through Nomic Vision; Nomic Text, a MobileCLIP2 concept index, and ArcFace faces ride alongside

Figure 2. The per-frame pipeline. The whole frame is embedded and detected in parallel; YOLOE crops return through the same vision model, and text, concept, and face vectors ride alongside into three vector spaces.
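The "one scene vector plus one vector per object" rule can be sketched as a function that lists the regions a frame sends to the vision embedder. The Detection type and the choice to keep the highest-scoring objects when the cap is hit are assumptions for illustration; the post states only the cap of 30.

```python
from dataclasses import dataclass

MAX_OBJECTS_PER_FRAME = 30  # cap stated in the post


@dataclass(frozen=True)
class Detection:
    """Hypothetical detector output: a pixel box plus a confidence score."""
    box: tuple[int, int, int, int]  # (left, top, right, bottom)
    score: float


def vision_inputs(frame_box, detections, max_objects=MAX_OBJECTS_PER_FRAME):
    """Return the regions one frame sends to the vision embedder.

    The first entry is the whole frame (scene vector); the rest are object
    crops, highest score first, capped at max_objects (an assumed policy).
    """
    kept = sorted(detections, key=lambda d: d.score, reverse=True)[:max_objects]
    return [frame_box] + [d.box for d in kept]
```

A frame with 40 detections therefore contributes 31 vision inputs, and an empty frame still contributes one.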

Several lighter outputs ride alongside the vision vectors. Nomic Embed Text shares a vector space with Nomic Vision, so a typed query retrieves image vectors directly; it embeds the class list, OCR reads from signs and plates, spatial phrases derived from object geometry, and short scene summaries. A separate MobileCLIP2 concept index embeds the full frame again, a 4x4 grid of 16 patches, and a bank of natural-language prompts. The grid is the detail worth calling out: where YOLOE is object-centric and fires only on its prompt classes, the grid is uniform, 16 equal cells on every frame, so it covers what the detector misses and keeps a small object in one corner from washing out of the whole-frame average. Faces go through ArcFace into a separate 512-d identity space. Every vector except faces is stored at 256-d; the Nomic vectors are Matryoshka-trained, so truncating them is nearly free, and the MobileCLIP vectors are truncated to match.
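The uniform grid is simple to express. The sketch below computes the 16 cell boxes for a frame of any size; the function name and the (left, top, right, bottom) box convention are assumptions, and how the real pipeline handles frame sizes that do not divide evenly is not stated in the post.

```python
def grid_cells(width: int, height: int, rows: int = 4, cols: int = 4):
    """Split a frame into rows x cols boxes (left, top, right, bottom) that tile it exactly."""
    if width < cols or height < rows:
        raise ValueError("frame is smaller than the grid")
    # Shared boundaries guarantee no gaps or overlaps, even when the
    # dimensions are not multiples of the grid size.
    xs = [width * c // cols for c in range(cols + 1)]
    ys = [height * r // rows for r in range(rows + 1)]
    return [(xs[c], ys[r], xs[c + 1], ys[r + 1])
            for r in range(rows) for c in range(cols)]
```

For a 1920x1080 frame each cell is 480x270 pixels. Together with the full frame, that gives 17 image inputs per frame for the concept index.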

Keeping the GPUs saturated with batching

A single 256-d embedding is a small matmul, so embedding one frame at a time would leave the GPUs idle while the surrounding CPU stages run. The pipeline batches at every model call and pools work across frames. As Figure 3 shows, each frame's outputs flow into three independent queues that flush as large batches: 64 images for Nomic Vision (full frames and crops pooled together), 32 for MobileCLIP2, and 512 strings for Nomic Text, with finished vectors buffered and upserted 128 at a time. Inference runs in fp16, and four workers share each GPU so that while one worker is on a CPU stage, another is running a batch on the device.

Batching: each frame's outputs go into three independent queues (Nomic Vision 64, MobileCLIP2 32, Nomic Text 512) that flush as big batches into a 128-vector upsert buffer; four workers overlap per GPU to fill idle gaps

Figure 3. Batched ingest. Each frame's outputs accumulate in independent queues that flush as full batches, and four workers overlap on each GPU so the device stays busy.
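A cross-frame batching queue of the kind described above can be sketched as follows. The BatchQueue class and its method names are hypothetical, and the lambda stands in for a real model call; the batch size of 64 for the vision queue is taken from the post, while the 12 crops per frame are made up for the demonstration.

```python
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


class BatchQueue(Generic[T]):
    """Accumulate items across frames and hand them to `flush_fn` in full batches."""

    def __init__(self, batch_size: int, flush_fn: Callable[[list[T]], None]) -> None:
        if batch_size < 1:
            raise ValueError("batch_size must be positive")
        self.batch_size = batch_size
        self.flush_fn = flush_fn
        self._items: list[T] = []

    def put(self, item: T) -> None:
        self._items.append(item)
        if len(self._items) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        """Send whatever is queued; call once at end of job for the final partial batch."""
        if self._items:
            batch, self._items = self._items, []
            self.flush_fn(batch)


# Record the size of every batch the "model" receives.
batches_seen: list[int] = []
vision_queue: BatchQueue[str] = BatchQueue(64, lambda batch: batches_seen.append(len(batch)))

for frame in range(10):
    vision_queue.put(f"frame-{frame}")                 # whole frame
    for obj in range(12):                              # hypothetical 12 crops per frame
        vision_queue.put(f"frame-{frame}-crop-{obj}")
vision_queue.flush()
print(batches_seen)  # [64, 64, 2]
```

Ten frames produce 130 items, which leave as two full batches and one small remainder at the end of the job. The model only ever sees a partial batch at the final flush, which is why batch efficiency does not depend on how many objects any single frame contains.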

This design addresses one of the harder properties of video embedding: load is content-dependent and bursty. Detection density, and so the vector count, swings with the footage, which would normally make GPU load swing with it and starve the device on the quiet stretches. Because frames and crops pool into full batches across many frames, the GPU runs the same efficient batch whether the current clip is a crowded lobby or an empty lot. Busy versus quiet content changes how fast vectors come out, not how well the hardware is used.

Throughput: vectors per second per node

Per-node throughput decomposes cleanly: vectors per second equals frames per second times vectors per frame. A node processes frames at a near-constant rate set by the hardware, about 61 sampled frames per second across all 16 workers, as Figure 4 illustrates. What changes is vectors per frame, a property of the footage rather than the configuration. Busy daytime footage runs about 131 vectors per frame and quiet overnight footage about 48, a 2.7x swing driven entirely by object count.

vectors/sec = frames/sec × vectors/frame. Busy footage: 60.9 fps × 131 vectors/frame = 7,981 vectors/sec. Quiet footage: about 61 fps × 48 vectors/frame ≈ 2,900 vectors/sec.

Figure 4. The throughput identity. The node holds frames per second roughly constant; the footage sets vectors per frame, so throughput tracks content while GPU utilization does not.
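The identity is easy to check with the rounded figures quoted above. The function name is ours; the inputs are the post's numbers. Because the published frame rate is rounded to one decimal place, the busy-footage product comes out a few vectors per second below the measured 7,981.

```python
def vectors_per_second(frames_per_second: float, vectors_per_frame: float) -> float:
    """Per-node throughput identity: frames/sec times vectors/frame."""
    return frames_per_second * vectors_per_frame


busy = vectors_per_second(60.9, 131)   # busy daytime footage
quiet = vectors_per_second(61, 48)     # quiet overnight footage

print(f"busy:  {busy:,.0f} vectors/sec")
print(f"quiet: {quiet:,.0f} vectors/sec")
print(f"swing: {busy / quiet:.1f}x")
```

The frame rate barely moves between the two cases, so the ratio of the two results is essentially the ratio of vectors per frame.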

The same node with the same configuration sustains 7,981 vectors/sec on busy footage and about 2,900 vectors/sec on quiet footage. Table 1 summarizes the measured per-node and multi-node throughput. Because the batched queues keep the GPUs saturated either way, the busy-versus-quiet swing appears only as the output rate, never as idle hardware.

Configuration	Sustained throughput
One node, busy footage	7,981 vectors/sec (60.9 fps x 131 vectors/frame)
One node, quiet footage	~2,900 vectors/sec (~61 fps x 48 vectors/frame)
Three nodes, 25 minutes	18,175 vectors/sec

Table 1. Measured ingest throughput, through the full production path to encrypted storage.

Scaling across a fleet

Because workers share nothing, aggregate throughput is the sum of the nodes. Three nodes sustained 18,175 vectors/sec for 25 minutes, with the second half of the run faster than the first, which is the signature of a system that does not decay as it fills. Figure 5 plots the measured points against the ideal-linear line.

Horizontal scaling: three nodes measured at 18,175 vectors/sec on the ideal-linear line, projected out to more nodes

Figure 5. Ingest throughput scales linearly with node count, because workers never coordinate.

To increase ingest throughput, add GPU nodes; to store more vectors, add storage nodes for them to land on. Nothing in the pipeline coordinates across nodes, so there is no coordination tax as the fleet grows, and the same shape carries from a single node up to a fleet ingesting into the trillions of vectors. The one rule to respect is to size the storage tier to absorb the ingest rate. As Figure 6 shows, a storage node absorbs roughly 18,000 to 20,000 vectors/sec, so the storage node count is the larger of two numbers: the count needed to hold the data, and the fleet ingest rate divided by the per-node absorb rate, rounded up.

Two fleets: add ingest nodes at about 8,000 vectors/sec each, feeding storage nodes that absorb about 18,000 to 20,000 vectors/sec each, to scale to trillions of vectors

Figure 6. The two-fleet model. Ingest scales by adding GPU nodes and storage scales to match, with no cross-node coordination on either side.
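The sizing rule for the storage tier can be written as a small helper. This is a sketch under stated assumptions: the function and parameter names are ours, the per-node capacity is a placeholder the post does not specify, and the default absorb rate uses the conservative end of the 18,000 to 20,000 vectors/sec range.

```python
import math


def storage_nodes_needed(
    total_vectors: int,
    vectors_per_storage_node: int,      # hypothetical capacity; not given in the post
    fleet_ingest_rate: float,           # vectors/sec across all ingest nodes
    absorb_rate_per_node: float = 18_000.0,
) -> int:
    """Return the larger of the capacity-driven and ingest-driven node counts."""
    by_capacity = math.ceil(total_vectors / vectors_per_storage_node)
    by_ingest = math.ceil(fleet_ingest_rate / absorb_rate_per_node)
    return max(by_capacity, by_ingest, 1)


# Ten ingest nodes at about 8,000 vectors/sec each, with made-up capacity figures.
print(storage_nodes_needed(10_000_000_000, 5_000_000_000, 10 * 8_000))
```

With these illustrative numbers capacity alone would need two storage nodes, but absorbing 80,000 vectors/sec needs five, so ingest rate is the binding constraint.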

Conclusion

Our ingest pipeline sustains roughly 8,000 vectors per second per four-GPU node by keeping NVIDIA L4 GPUs saturated across the full ingest path: frame and object embedding run in parallel, cross-frame batched queues stay full regardless of scene complexity, inference runs in fp16, and a binary fp16 transport moves vectors 1.9x faster than gRPC. Throughput then scales linearly by adding nodes, because nothing in the pipeline coordinates across them. For teams building searchable video, this turns otherwise idle GPU cycles into indexed vectors and lowers the cost of indexing video at scale.
