Not All Invariants Are Equal: Curating Training Data to Accelerate Program Verification with SLMs

1Hebrew University of Jerusalem, Israel
2Amherst College, USA
3VMware Research by Broadcom, USA

Wonda Data Curation Pipeline

We introduce Wonda, a rigorous data curation pipeline that transforms raw verifier output into high-quality training invariants for loop invariant generation task through AST normalization, LLM-driven simplification, and formal quality grading.

Wonda pipeline diagram: verification query, direct verification, AST normalization, LLM simplification, syntax validation, and V1/V2 correctness and sufficiency checks produce qualifying invariants.

Abstract

The synthesis of inductive loop invariants remains a critical bottleneck in automated program verification. While Large Language Models (LLMs) show promise in mitigating this issue, they often fail on complex programs, producing invariants that are invalid or computationally ineffective. Although fine-tuning is a natural strategy to address these limitations, obtaining high-quality training data remains an open challenge. We first formalize the properties required for a high-quality training invariant, and then present Wonda, a rigorous data curation pipeline that extracts such invariants from raw verifier output via AST-based normalization followed by LLM-driven semantic rewriting and augmentation with provable quality guarantees. Fine-tuning Small Language Models (SLMs) on Wonda-curated data yields consistent gains across the Qwen3, Llama-3.1, and Mistral families: the 4B and 8B Qwen3 models nearly double invariant correctness and double speedup rates, while Llama-3.1-8B triples both. On the challenging InvBench suite, the same 4B model outperforms an off-the-shelf model 20× its size and matches the end-to-end verification time of GPT-OSS-120B, while a 14B Qwen3 model matches that of the frontier model GPT-5.2, all without test-time compute overhead.

Why Not Raw Verifier Output?

Example of a raw verbose invariant generated by UAutomizer: 3084 characters, 75 conjuncts, 11 disjuncts

UAutomizer-generated invariants are often correct but verbose, cluttered, and poor training targets.

Fine-tuning on invariants produced by symbolic verifiers such as UAutomizer is a natural way to specialize language models for invariant synthesis. In practice, however, this strategy often fails on hard benchmarks. Raw UAutomizer outputs tend to be syntactically noisy, verifier-specific, and pedagogically weak: they enumerate many disjuncts instead of exposing compact structure that a model can learn to reproduce.

We argue that data quality, rather than model scale alone, is the key bottleneck. Wonda addresses this by curating verifier output into training invariants that are correct, useful for verification, and compact enough to serve as effective supervision.

What Makes a Good Training Invariant?

We formalize four properties that a high-quality training invariant should satisfy:

  • Non-degeneracy: exclude trivial formulas (TRUE, FALSE).
  • Correctness: the invariant holds at the loop location on all relevant executions.
  • Usefulness: using the invariant speeds up verification relative to the baseline verifier.
  • Compactness: a succinct syntactic form that is easier for a model to learn and generalize from.

Wonda’s grading function G ∈ {0, 1, 2, 3} scores each candidate using formal V1 (correctness) and V2 (sufficiency) checks. Only candidates with G ≥ 2 enter the training set.

Example: Simplifying Verbose Invariants

The figures below show concrete V0 → V1 → V2 transformations produced by Wonda. AST normalization removes syntactic noise; LLM simplification rewrites verbose disjunctive forms into compact closed-form expressions that are easier to learn.

Main Results

Table 3: main results on the hard instances (n = 123)

Highlights

  • Wonda V2 fine-tuning roughly doubles correctness and speedup rates for Qwen3-4B/8B.
  • Llama3.1-8B-V2 triples both vs. base.
  • Qwen3-14B-V2 matches GPT-5.2 on end-to-end VBP.
  • Qwen3-4B/8B-V2 match GPT-OSS-120B on end-to-end VBP despite lower correctness.
  • At 4B scale, Wonda nearly matches a model 20× larger with no test-time overhead.

WONDA Pipeline Ablation

Table 4: Wonda ablation study on the hard split

Highlights

  • V0: raw UAutomizer output → V1: AST normalization → V2: full Wonda pipeline.
  • Each stage helps, but V2 is best per model family. V0 can sometimes hurt performance for Mistral-7B and Qwen3-0.6B.
  • Largest gains on correctness, speedup rate, and verification time across Qwen3, Llama3.1-8B, and Mistral-7B.

BibTeX

@article{pinto2026wonda,
  title={Not All Invariants Are Equal: Curating Training Data to Accelerate Program Verification with SLMs},
  author={Pinto, Ido and Elboher, Yizhak Yisrael and Wu, Haoze and Narodytska, Nina and Katz, Guy},
  journal={arXiv preprint arXiv:2603.15510},
  year={2026},
  note={Accepted to ICML 2026}
}