
Why Homomorphic Encryption Is Becoming Practical for Privacy-Preserving AI Inference in 2025

Jun 7·7 min read·AI-assisted · human-reviewed

For years, homomorphic encryption (HE) has been the holy grail that remained stubbornly out of reach for AI inference. The promise is seductive: run a neural network on encrypted data without ever decrypting it, so the cloud provider never sees your medical records, financial transactions, or proprietary text. The reality has been crippling latency: minutes per inference when a plaintext model takes milliseconds. But 2025 marks a turning point. A confluence of hardware acceleration for ciphertext arithmetic, smarter encoding schemes, and hybrid approaches that combine HE with trusted execution environments (TEEs) is quietly making privacy-preserving inference viable for latency-sensitive production pipelines. This article unpacks what changed, where the trade-offs still bite, and how to evaluate whether HE fits your deployment.

Why Previous Generations of HE Were Too Slow for Neural Networks

The fundamental obstacle has always been noise growth. Every ciphertext carries a small amount of noise, and every homomorphic operation increases it: addition grows the noise roughly additively, while multiplication grows it far faster, roughly multiplicatively. After a few chained multiplications, the noise drowns the signal and decryption yields garbage. Early schemes like BGV and BFV needed impractically large polynomial rings, and correspondingly large ciphertexts (on the order of 10–50 MB), just to accommodate a handful of multiplications. For a transformer model with dozens of layers, the noise budget was exhausted after only two or three layers. Researchers resorted to bootstrapping, a noise-refresh operation that itself consumed minutes per layer. A 2018 benchmark attributed to Google reported over 250 seconds per image on a ResNet-20 model. That is not a product; it is a prototype.
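The difference between additive and multiplicative noise growth can be seen in a toy integer scheme. The sketch below follows the textbook "encryption over the integers" idea (ciphertext = multiple of a secret p, plus even noise, plus the message bit); it is not BGV, BFV, or CKKS, and the parameter sizes (`KEY_BITS`, `NOISE_BITS`, `Q_BITS`) are illustrative assumptions that offer no security.

```python
import random

KEY_BITS = 64     # size of the secret integer p (toy value, NOT secure)
NOISE_BITS = 8    # size of the fresh noise r in each ciphertext (toy value)
Q_BITS = 128      # size of the random multiple of p that hides the message


def keygen(rng):
    """Return a random odd KEY_BITS-bit secret integer p."""
    return rng.getrandbits(KEY_BITS) | (1 << (KEY_BITS - 1)) | 1


def encrypt(p, bit, rng):
    """Encrypt one bit as p*q + 2*r + bit, where r is fresh noise."""
    r = rng.randrange(1 << (NOISE_BITS - 1), 1 << NOISE_BITS)
    q = rng.getrandbits(Q_BITS)
    return p * q + 2 * r + bit


def noise_term(p, c):
    """Centered remainder of c modulo p: the noise-plus-message part."""
    n = c % p
    return n - p if n > p // 2 else n


def decrypt(p, c):
    """Correct only while the accumulated noise stays below p/2."""
    return noise_term(p, c) % 2


def sum_of_ones(p, count, rng):
    """Homomorphic XOR of `count` encrypted 1s: noise terms add up."""
    c = encrypt(p, 1, rng)
    for _ in range(count - 1):
        c += encrypt(p, 1, rng)
    return c


def product_of_ones(p, count, rng):
    """Homomorphic AND of `count` encrypted 1s: noise terms multiply."""
    c = encrypt(p, 1, rng)
    for _ in range(count - 1):
        c *= encrypt(p, 1, rng)
    return c


if __name__ == "__main__":
    rng = random.Random(0)
    p = keygen(rng)
    # 100 additions barely move the noise.
    print("noise bits after 100 additions:",
          noise_term(p, sum_of_ones(p, 100, rng)).bit_length())
    # Each multiplication adds roughly NOISE_BITS + 1 bits of noise.
    for count in range(1, 7):
        c = product_of_ones(p, count, rng)
        print(count, "factors -> noise bits:", noise_term(p, c).bit_length(),
              "decrypts to", decrypt(p, c))
```

With these toy sizes a product of six ciphertexts still decrypts correctly, but from eight factors onward the noise is guaranteed to exceed p/2 and the decrypted bit is no longer reliable. Real schemes differ in the details, yet the same asymmetry between additions and multiplications is what drives their parameter sizes.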

The CKKS Breakthrough and Its Limitations for LLMs

The CKKS (Cheon-Kim-Kim-Song) scheme, formalized in 2017 and refined through the early 2020s, changed the conversation by supporting approximate arithmetic. Instead of aiming for exact integer results, CKKS accepts small rounding errors, which is perfectly acceptable for neural network inference where floating-point accuracy is already bounded. CKKS also supports packed ciphertexts, where a single ciphertext encrypts a vector of values (often 4096 to 16384 slots), allowing SIMD-style parallel operations. This brought encrypted inference on small CNNs down from minutes to seconds. But for large language models (LLMs) with billions of parameters and deeply stacked linear layers, even CKKS struggles. In CKKS the noise budget is spent as multiplicative levels: each multiplication and rescale consumes one, and the parameters fix how many are available. The non-linear operations (ReLU activations, softmax, layer normalization) require polynomial approximations, and those approximations consume additional levels and computation. A single transformer block might exhaust half the budget. After six blocks, you cannot continue without bootstrapping, which adds 5–10 seconds per invocation.
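To make the slot-packing idea concrete, here is a plaintext stand-in for a packed ciphertext: a Python list plays the role of the slot vector, and the only operations allowed are the ones a packed scheme offers (slot-wise add, slot-wise multiply, cyclic rotation). No encryption happens here, and the function names are hypothetical; the point is the rotate-and-sum pattern that turns a dot product into one multiplication plus log2(n) rotations.

```python
def rotate(slots, k):
    """Cyclic left rotation by k slots (stand-in for the HE rotate primitive)."""
    k %= len(slots)
    return slots[k:] + slots[:k]


def add(a, b):
    """Slot-wise addition of two packed vectors."""
    return [x + y for x, y in zip(a, b)]


def mul(a, b):
    """Slot-wise multiplication of two packed vectors."""
    return [x * y for x, y in zip(a, b)]


def packed_dot(a, b):
    """Dot product using only slot-wise ops and rotations.

    Requires a power-of-two slot count. Afterwards every slot holds the
    full dot product, which is how packed schemes usually leave the result.
    """
    n = len(a)
    if n != len(b) or n == 0 or (n & (n - 1)) != 0:
        raise ValueError("need two vectors of equal power-of-two length")
    acc = mul(a, b)              # one multiplication for all n slots at once
    shift = 1
    while shift < n:             # log2(n) rotate-and-add steps
        acc = add(acc, rotate(acc, shift))
        shift *= 2
    return acc


if __name__ == "__main__":
    print(packed_dot([1, 2, 3, 4], [5, 6, 7, 8]))
```

A matrix-vector product in a linear layer is built from many such dot products, which is why rotation count, not just multiplication count, tends to dominate the cost of encrypted linear layers.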

Hardware Acceleration Is Closing the Gap in 2025

The biggest driver of HE's renewed viability is hardware acceleration. Intel's HEXL library, which uses AVX-512 vector instructions for the Number Theoretic Transform (NTT) and polynomial multiplication, has become standard. But the real leap comes from GPUs. NVIDIA's H100 and the newer B200 have no HE-specific units, but their parallelism and memory bandwidth suit NTT and residue number system (RNS) decomposition, which GPU HE libraries run as ordinary kernels on the GPU stream. Early benchmarks from a joint Intel–MIT study in January 2025 report a 4.7× throughput improvement on CKKS-based ResNet-50 inference compared to pure software on a comparable CPU. More importantly, GPU-based HE implementations now support layer-by-layer data pipelining: one ciphertext block undergoes matrix multiplication while the next block's NTT is pre-computed. This hides memory latency and keeps the arithmetic units saturated. The reported result is that a 12-layer transformer (roughly 350M parameters) can now process a single encrypted sequence of 128 tokens in 1.8 seconds, down from 38 seconds in 2022.
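Since the NTT is the kernel that both the CPU and GPU paths accelerate, a small reference version may help show what is being sped up. The sketch below multiplies two polynomials in Z_q[x]/(x^N + 1) with a recursive radix-2 NTT and checks the idea against schoolbook multiplication. The parameters (q = 17, N = 8, ψ = 3) are toy values chosen so the arithmetic can be followed by hand; production libraries use far larger primes and degrees, iterative in-place transforms, and RNS limbs. `pow(x, -1, q)` needs Python 3.8 or later.

```python
Q = 17                 # toy prime with Q ≡ 1 (mod 2N)
N = 8                  # ring degree: polynomials live in Z_Q[x] / (x^N + 1)
PSI = 3                # primitive 2N-th root of unity mod Q (3^8 ≡ -1 mod 17)
OMEGA = PSI * PSI % Q  # primitive N-th root of unity


def ntt(a, omega):
    """Recursive radix-2 NTT: evaluate a at the powers of omega, mod Q."""
    n = len(a)
    if n == 1:
        return a[:]
    even = ntt(a[0::2], omega * omega % Q)
    odd = ntt(a[1::2], omega * omega % Q)
    out = [0] * n
    w = 1
    for k in range(n // 2):
        t = w * odd[k] % Q
        out[k] = (even[k] + t) % Q
        out[k + n // 2] = (even[k] - t) % Q   # omega^(n/2) = -1
        w = w * omega % Q
    return out


def negacyclic_mul(a, b):
    """Multiply a and b modulo (x^N + 1, Q) using the NTT."""
    # Twist by powers of psi so a cyclic convolution becomes negacyclic.
    ta = [x * pow(PSI, i, Q) % Q for i, x in enumerate(a)]
    tb = [x * pow(PSI, i, Q) % Q for i, x in enumerate(b)]
    fc = [x * y % Q for x, y in zip(ntt(ta, OMEGA), ntt(tb, OMEGA))]
    tc = ntt(fc, pow(OMEGA, -1, Q))           # inverse transform, unscaled
    n_inv, psi_inv = pow(N, -1, Q), pow(PSI, -1, Q)
    return [x * n_inv * pow(psi_inv, i, Q) % Q for i, x in enumerate(tc)]


def schoolbook_mul(a, b):
    """O(N^2) reference: wrap-around terms pick up a minus sign."""
    c = [0] * N
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            k = i + j
            if k < N:
                c[k] = (c[k] + x * y) % Q
            else:
                c[k - N] = (c[k - N] - x * y) % Q
    return c


if __name__ == "__main__":
    a = [1, 2, 3, 4, 5, 6, 7, 8]
    b = [8, 7, 6, 5, 4, 3, 2, 1]
    print(negacyclic_mul(a, b) == schoolbook_mul(a, b))
```

The pointwise product in the transformed domain is what makes the operation cheap and embarrassingly parallel, which is the property vector units and GPU kernels exploit.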

Hybrid HE-TEE Architecture: The Pragmatic Middle Ground

Pure HE still does not cover the entire inference graph efficiently. Non-linear functions like softmax and Top-K sampling require iterative polynomial approximations that consume multiplicative depth and kill throughput. The emerging practical pattern is a hybrid split: encrypt the input and run all linear and convolutional layers in HE, then decrypt inside a hardware-backed TEE (Intel SGX, AMD SEV-SNP, or NVIDIA Confidential Computing) only for the non-linear segments. This means the enclave must hold the decryption key, typically provisioned by the client after remote attestation, so the TEE becomes part of the trust boundary. The TEE keeps the decrypted intermediate values invisible to the host OS and cloud provider. This hybrid approach reduces the HE workload by roughly 40–60%: the linear layers, which map well onto packed ciphertext arithmetic, stay in HE, while the non-linear pieces, which would otherwise need deep polynomial approximations, run in plaintext inside the enclave. Startups like Enveil and Duality Technologies shipped production SDKs based on this pattern in Q1 2025. One financial-services client reported 340 ms per inference on a 7B-parameter LLM used for fraud triage, a latency that fits within their 500 ms SLA.
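The split itself is easy to prototype before touching any cryptography. The sketch below groups a model's layers into consecutive HE and TEE segments and counts how many times data crosses the boundary, since each crossing implies a decryption or re-encryption inside the enclave. The layer names, the `LINEAR_KINDS` set, and the `Segment` type are hypothetical and only illustrate the pattern; they are not taken from any vendor SDK.

```python
from dataclasses import dataclass, field

# Layer kinds assumed to stay under HE (an illustrative choice).
LINEAR_KINDS = {"linear", "conv", "matmul"}


@dataclass
class Segment:
    backend: str                        # "HE" or "TEE"
    layers: list = field(default_factory=list)


def partition(layers):
    """Group consecutive (name, kind) layers into HE and TEE segments."""
    segments = []
    for name, kind in layers:
        backend = "HE" if kind in LINEAR_KINDS else "TEE"
        if segments and segments[-1].backend == backend:
            segments[-1].layers.append(name)
        else:
            segments.append(Segment(backend, [name]))
    return segments


def boundary_crossings(segments):
    """Each HE<->TEE transition costs one decrypt or re-encrypt in the enclave."""
    return max(len(segments) - 1, 0)


if __name__ == "__main__":
    # A hypothetical single transformer block.
    block = [
        ("qkv_proj", "linear"),
        ("attn_scores", "matmul"),
        ("softmax", "softmax"),
        ("attn_out", "linear"),
        ("layernorm", "layernorm"),
        ("ffn_up", "linear"),
        ("gelu", "gelu"),
        ("ffn_down", "linear"),
    ]
    segs = partition(block)
    for s in segs:
        print(s.backend, s.layers)
    print("crossings per block:", boundary_crossings(segs))
```

Counting crossings per block is a useful early signal: every crossing adds enclave transitions and encryption work, so a model with many short alternating segments may benefit less from the hybrid split than the headline workload reduction suggests.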

Where the Hybrid Approach Falls Short

Encoding Trade-Offs: Polynomial Approximations vs. Exact Computation

Every non-linear operation in an HE pipeline must be replaced by a polynomial approximation. For ReLU, the most common approach uses either a Chebyshev approximation (degree 4–6) or the more recent masked ReLU technique that uses a comparator circuit to compute the sign without branching. The choice affects both accuracy and noise. A degree-4 Chebyshev ReLU is fast and uses little noise budget, but introduces up to 1.5% relative error on the positive side. For classification models, this error may be invisible; for regression tasks like credit scoring, it can shift decisions. Conversely, a degree-8 approximation consumes about 30% more noise budget but keeps error below 0.1%. For softmax, the Softmax in HE (SHE) protocol, published in March 2024, combines an exponential approximated by polynomial iteration with an approximate division circuit. The total noise cost for a 512-class softmax is about 20% of the noise budget per layer. If your model uses attention with multi-head softmax (as in transformers), that cost multiplies by the number of heads. A practical tip: prune softmax activations where possible. Many attention heads in fine-tuned LLMs are near-uniform, and their softmax can be replaced with a fixed uniform weighting with little or no accuracy loss, saving noise budget.
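The accuracy side of this trade-off can be explored in plaintext before any encryption is involved. The sketch below fits Chebyshev approximations of ReLU on the interval [-1, 1] and reports the worst-case absolute error for two degrees. The interval, the node count, and the error metric are assumptions made for illustration, so the numbers it prints are not expected to match the percentages quoted above, which depend on the input range and fitting method used.

```python
import math


def relu(x):
    return max(0.0, x)


def chebyshev_coeffs(f, degree, nodes=256):
    """Chebyshev series coefficients of f on [-1, 1], via Chebyshev nodes."""
    xs = [math.cos(math.pi * (k + 0.5) / nodes) for k in range(nodes)]
    fx = [f(x) for x in xs]
    return [
        2.0 / nodes * sum(
            fx[k] * math.cos(j * math.pi * (k + 0.5) / nodes)
            for k in range(nodes)
        )
        for j in range(degree + 1)
    ]


def chebyshev_eval(coeffs, x):
    """Evaluate c0/2 + sum_j c_j T_j(x) using T_j(x) = cos(j * acos(x)).

    Under HE the same polynomial would be evaluated with additions and
    multiplications only; the trigonometric form is a plaintext shortcut.
    """
    theta = math.acos(max(-1.0, min(1.0, x)))
    return coeffs[0] / 2 + sum(
        c * math.cos(j * theta) for j, c in enumerate(coeffs) if j > 0
    )


def max_abs_error(f, degree, samples=2001):
    """Worst-case |f(x) - p(x)| on an even grid over [-1, 1]."""
    coeffs = chebyshev_coeffs(f, degree)
    grid = [-1 + 2 * i / (samples - 1) for i in range(samples)]
    return max(abs(f(x) - chebyshev_eval(coeffs, x)) for x in grid)


if __name__ == "__main__":
    for degree in (4, 8):
        print("degree", degree, "max abs error", max_abs_error(relu, degree))
```

The error concentrates at the kink at zero and shrinks only slowly as the degree rises, while each extra degree costs more multiplicative depth. Rescaling activations so they fall inside the fitted interval matters as much as the degree itself, because the approximation is not valid outside it.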

Real-World Deployment Patterns and Latency Budgets

In 2025, three deployment patterns have crystallized for HE-based AI inference:

When HE Is Still the Wrong Choice

Despite these advances, HE is not a universal replacement for plaintext inference. If your threat model does not include a malicious cloud provider or compromised infrastructure, the performance overhead (even at 300–500 ms per query) is unnecessary. Additionally, models that require frequent retraining (daily or hourly) add operational load. Evaluation keys are derived from the client's secret key rather than from the model weights, but whenever a retrained model changes depth, encoding, or rotation pattern, the encryption parameters and evaluation keys must be regenerated and redistributed securely. For multi-tenant SaaS platforms with rapid model iteration, pure TEE-based confidential computing (NVIDIA Confidential Computing with H100, for example) offers lower latency and simpler key rotation, with the trade-off that TEEs rely on hardware trust rather than mathematical guarantees. If your compliance mandate specifically prohibits any decryption of data at any point (as in certain healthcare regulations in the EU), the TEE and hybrid paths are ruled out and pure HE is the approach that remains. Otherwise, evaluate TEE first, then hybrid HE-TEE, and finally pure HE only if the threat model demands it.

A Practical Step to Evaluate HE for Your Pipeline in 2025

Instead of committing to a full integration, run a targeted benchmark using the open-source SEAL library (Microsoft Research) or a current OpenFHE release. Extract one or two representative layers from your model, typically the first embedding layer and one attention block, and measure encrypted inference time on the hardware you intend to use (a GPU running an HE library or a high-end CPU with AVX-512). Compare the noise budget consumed per layer against your model's total depth. If, for example, the noise budget runs out after 40% of the layers, bootstrapping will be required, and you must factor its cost into your latency budget. If you can tolerate the latency for your use case (batch offline or non-real-time API), proceed to a full pipeline prototype. If not, the hybrid HE-TEE path or a pure TEE approach is the pragmatic next alternative. The key is to measure early, with your actual weights and input sizes, because theoretical benchmarks published in papers rarely match production parameter distributions.
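Once the per-layer measurements exist, the arithmetic for the latency budget is simple enough to script. The sketch below walks a list of layers, inserts a bootstrap whenever the remaining budget cannot cover the next layer, and totals the latency. It assumes the budget is counted in multiplicative levels, that a bootstrap restores a fixed number of levels (usually fewer than a fresh ciphertext has), and that costs are additive; all figures in the example are placeholders, not measurements.

```python
def plan_bootstraps(layers, fresh_levels, levels_after_bootstrap, bootstrap_ms):
    """Estimate bootstrap count and total latency for a stack of layers.

    layers: list of (levels_consumed, latency_ms) per layer, as measured.
    fresh_levels: levels available in a freshly encrypted ciphertext.
    levels_after_bootstrap: levels available right after a bootstrap.
    bootstrap_ms: measured latency of one bootstrap.
    """
    remaining = fresh_levels
    bootstraps = 0
    total_ms = 0.0
    for levels, ms in layers:
        if levels > remaining:
            if levels > levels_after_bootstrap:
                raise ValueError("a layer needs more levels than a bootstrap restores")
            bootstraps += 1
            total_ms += bootstrap_ms
            remaining = levels_after_bootstrap
        remaining -= levels
        total_ms += ms
    return bootstraps, total_ms


if __name__ == "__main__":
    # Placeholder numbers: 12 identical layers, 4 levels and 100 ms each.
    layers = [(4, 100.0)] * 12
    count, total = plan_bootstraps(
        layers, fresh_levels=20, levels_after_bootstrap=12, bootstrap_ms=5000.0
    )
    print("bootstraps:", count, "estimated latency (ms):", total)
```

Running this with measured values for a single attention block, repeated for the model's depth, gives a first estimate of whether the full pipeline can fit the SLA before any end-to-end prototype is built.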

About this article. This piece was drafted with the help of an AI writing assistant and reviewed by a human editor for accuracy and clarity before publication. It is general information only, not professional medical, financial, legal or engineering advice.
