Privacy-preserving face matching requires comparing biometric templates without exposing them. Fully homomorphic encryption (FHE) supports this directly. It runs computation on ciphertext and returns an encrypted result, so the plaintext templates are never available to the server. The cost is performance. A single homomorphic multiply runs several orders of magnitude slower than its plaintext equivalent, and one face-matching query stacks thousands of these operations together.
This post describes the work we did to make encrypted face search practical at scale. We fixed a correctness bug in the GPU comparator, reduced the cost of a single query by 3.4x, modified the underlying GPU library to allow multiple processes to share a GPU, and scaled the search horizontally across GPUs. The result searches 10 million encrypted face embeddings in about one second, validated on real data against an exact brute-force baseline.
Background#
This work began as a reproduction of HyDia, an FHE-based facial matching protocol published at PETS 2025. HyDia provides both a CPU and a GPU implementation. Our goal was to confirm its reported results, then reduce query latency and scale the search to larger databases. Working from its published implementation is how we encountered the GPU bug described in the next section.
The workload is encrypted nearest-neighbor search over face embeddings. Each face is a 512-dimension ArcFace vector, drawn here from the public Glint360K dataset. A match is a cosine similarity above a fixed threshold. Under CKKS, the FHE scheme for approximate arithmetic on real numbers, the query has two stages.
- An encrypted matrix-vector product between the query and a packed database of vectors. We use the baby-step/giant-step (BSGS) algorithm to reduce the number of ciphertext rotations, which are among the most expensive FHE operations.
- An encrypted comparison. A polynomial approximates the sign of (similarity minus threshold), producing a value near 1 for a match and near 0 otherwise, without revealing the similarity score.
We packed 16,384 vectors into a single ciphertext, which we call a shard, and ran the pipeline on NVIDIA L40S GPUs. The implementation builds on two open FHE stacks: OpenFHE for the CPU path, and FIDESlib, an open GPU implementation of CKKS, for the accelerated path.
A correctness bug in the GPU comparator#
While reproducing HyDia's results on real faces, we found that the GPU search path returned incorrect results. It recovered only about half of the genuine matches that the CPU path found, a recall of 50 percent, and reported no error. HyDia's published accuracy is reported for the CPU path; the GPU path is where we observed the discrepancy. Synthetic benchmarks did not expose the problem, because genuine matches in synthetic data sit well above the threshold. Real faces produce genuine matches close to the threshold, which is where the bug took effect.
The cause was in the comparison polynomial. The evaluator computed coefficients in the Chebyshev basis but reassembled the polynomial with monomial-basis arithmetic. That reassembly is correct for ordinary powers of x and invalid for Chebyshev polynomials, where T_k(x) times T_k(x) does not equal T_2k(x). The evaluator therefore computed a smoother, lower-degree function than intended, and the decision boundary shifted from a cosine of 0.44 to about 0.52. Genuine matches with similarity between 0.44 and 0.52 fell below the shifted boundary and were dropped.
We confirmed the mechanism with a plaintext model that reproduced the decrypted GPU output to three decimal places, then corrected the evaluation to use the Chebyshev basis throughout. Recall returned to 100 percent. We also added a validation step that checks recall against an exact brute-force search, so the encrypted path cannot diverge from ground truth without detection.
Reducing the cost of a single query#
We reduced per-query latency through four changes.
Removing redundant synchronization. The pipeline contained device-wide barriers that forced the GPU to drain and idle between steps. Most were unnecessary and were removed.
Deriving membership from the index pass. The system ran the comparator twice, once to identify which entries match and once to report whether any entry matches. The second result follows from the first, so we removed the redundant pass.
Per-thread scratch buffers. Key-switching used shared global scratch, which prevented multiple ciphertext-matrices from being processed concurrently. Giving each worker its own scratch buffer allowed independent matrices to overlap on the GPU.
Rotation hoisting. In the baby-step stage, one key-switch is shared across multiple rotations rather than repeated for each rotation.
Together these changes reduced the warm query time for a single shard by 3.4x, to about 0.64 seconds. That is the time to search 16,384 encrypted faces on one GPU.
Enabling multiple processes to share a GPU#
Profiling showed that each GPU was idle roughly 85 percent of the time during a query. The workload is latency-bound. It is a long sequence of small, dependent kernels, and the GPU spends most of its time waiting for the next launch rather than computing. A GPU with that much idle time should be able to search two shards at once.
The standard mechanism for concurrent GPU sharing on NVIDIA hardware is the Multi-Process Service (MPS). MPS would not start with our binary. The cause was the library's Number-Theoretic Transform (NTT), the core operation behind every homomorphic multiply, which was implemented with CUDA Dynamic Parallelism, where a kernel launches other kernels from the device. MPS does not support Dynamic Parallelism.
We replaced the device-side launches with equivalent host-launched batched kernels. Throughput was unchanged, and the binary ran under MPS. With each process limited to half of the GPU, two shards run concurrently on one GPU and both finish in about one second. This halves the number of GPUs required for a given latency target.
Scaling horizontally#
The per-query work does not change the structure of the problem. A database of N vectors is N divided by 16,384 shards, and shards are independent. Scaling is therefore data-parallel. The system broadcasts the encrypted query to every shard, runs the comparator on each GPU, and merges the matches.
Horizontal sharding distributes a single encrypted query across GPUs.
Ten million vectors is 640 shards. With one shard per GPU, the full search completes in the time of a single shard, about 0.6 seconds, on 640 GPUs. With two shards per GPU through MPS, it runs on 320 GPUs in about 1.2 seconds. With four shards per GPU, 160 GPUs complete it in about 4 seconds. The query is identical in each case. The variable is how much hardware is applied.
Results#
Figure 2 shows query latency as a function of fleet size for a 10-million-vector database.
Query latency for a 10M-vector database as a function of fleet size.
The two operating points of interest are:
- 10 million vectors in about 1.2 seconds on 320 L40S GPUs, at two shards per GPU.
- 10 million vectors in about 4 seconds on about 160 L40S GPUs, at four shards per GPU.
Sub-second search is reachable at roughly 640 GPUs, where it becomes limited more by GPU availability than by computation.
We validated these results on real data rather than synthetic vectors. We generated 10,485,760 real face embeddings, searched all of them under FHE, and compared the result against an exact plaintext brute-force search over the same set. The encrypted search returned 328 matches out of 328 true matches, a recall of 100 percent, with zero false positives across all 10 million embeddings.
Scaling limits#
Real-time FHE search has one property that determines how it scales. Encrypted vectors must reside in GPU memory to be searched, because the homomorphic operations run on the GPU and cannot access host memory directly. FHE ciphertexts are also large. A 512-dimension embedding that occupies about 2 KB in plaintext expands under CKKS so that a 16,384-vector shard is about 9 GB. A 48 GB L40S holds approximately 80,000 encrypted vectors.
This produces a linear scaling law. Searchable real-time database size grows with total GPU memory, at roughly one GPU per 80,000 vectors.
Searchable database size scales linearly with total GPU memory.
Three limits follow from this property.
The minimum fleet size is set by memory, not latency. Searching 10 million vectors in real time requires at least 128 GPUs, because fewer cannot hold 10 million encrypted vectors in memory. Below that point the database must be streamed from storage, and query time increases from seconds to minutes.
The minimum latency is set by the kernel, not the hardware. With unlimited GPUs, the search bottoms out at the 0.64 seconds required for one shard, because a shard is the unit of work on a single GPU. Reducing it further requires faster kernels, not more of them.
The cost of real-time search grows linearly with database size, at approximately one GPU per 80,000 vectors.
Next steps#
Two directions follow from these limits. The first is kernel fusion: combining the long sequence of small, launch-bound kernels into fewer larger ones, which addresses the per-shard latency floor that additional hardware cannot.
The second is a learned, encrypted index. The current system performs brute-force search, where every query scans every vector, which is why cost grows linearly with database size. An index that directs a query to only the relevant partitions, while remaining encrypted and free of access-pattern leakage, would remove the need to scan and hold the entire database. This is the path to sublinear encrypted search and to scaling beyond what brute force allows.
Summary#
We fixed a recall bug in the GPU comparator, reduced single-query latency by 3.4x, modified the GPU FHE library to support multi-process sharing, and scaled the search across GPUs. The system searches 10 million real encrypted face embeddings in about 1.2 seconds on 320 GPUs, or about 4 seconds on 160 GPUs, with 100 percent recall and zero false positives measured against an exact brute-force baseline.
