March 4, 2025

Silencing the Noise: Dismissing LLM Security Findings

The Almanax platform scans code repositories for potential issues, but not every flagged finding requires action. Some are false positives; others are known and intentional design choices. Our new "dismiss" feature lets users filter out these false positives and design choices, ensuring they don’t reappear in future scans.

This feature relies on a hybrid method: embeddings for speed and LLM-as-a-judge (G-Eval) for accuracy. We benchmarked different approaches to balance performance, cost, and latency.

How We Measure Performance

We created a benchmark with:

  • A list of dismissed findings. 
  • A list of output findings predicted in a subsequent scan. A subset of these are duplicates of the dismissed set, with a slight difference in phrasing.

One approach to measuring the precision of our dismissal pipeline would be to simply treat it as a binary classification task, where each finding is either retained or dismissed without further distinction. This method assumes a perfect one-to-one correspondence between dismissed and retained findings, which can be misleading. For instance, if a finding is incorrectly dismissed due to a similarity with a different but unrelated dismissed finding, a binary approach would still count it as a correct dismissal, despite the underlying mistake.

Instead, we take a more nuanced approach. For each dismissed finding, we check whether the corresponding output finding was filtered out. We also measure how many other output findings were incorrectly filtered out.

From these measurements, we calculate two key metrics:

  • Hit Rate: Probability a dismissed finding is correctly filtered out in future scans. A higher hit rate means fewer duplicate alerts.
  • Mis-Hit Rate or False Positive Rate (FPR): Probability an unintended finding is incorrectly removed. Lower FPR ensures critical issues aren't dismissed by mistake.

These metrics give us a clear picture of how accurate our dismissal pipeline is.
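
To make the definitions concrete, here is a minimal sketch of how the two metrics can be computed for one benchmark run, assuming we know which output findings are ground-truth duplicates of dismissed findings and which ones the pipeline actually filtered out (the function name and data layout are illustrative, not our exact implementation):

```python
# Minimal sketch: computing Hit Rate and FPR for one benchmark run.
# `true_duplicates` are the output findings that genuinely duplicate a
# dismissed finding; `filtered_out` are the findings the pipeline removed.
def hit_rate_and_fpr(true_duplicates: set[str],
                     filtered_out: set[str],
                     all_outputs: set[str]) -> tuple[float, float]:
    # Hit Rate: fraction of true duplicates that were correctly filtered out.
    hits = len(true_duplicates & filtered_out)
    hit_rate = hits / len(true_duplicates) if true_duplicates else 1.0

    # FPR: fraction of non-duplicate findings that were incorrectly removed.
    non_duplicates = all_outputs - true_duplicates
    mis_hits = len(non_duplicates & filtered_out)
    fpr = mis_hits / len(non_duplicates) if non_duplicates else 0.0
    return hit_rate, fpr
```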

Evaluated Methods

1. Embeddings (Cosine Similarity)

This approach leverages the semantic similarities captured by embeddings. We convert each finding and dismissed finding into a numerical representation (embedding) using a pre-trained model. Then, we calculate the cosine similarity between these embeddings. A high cosine similarity indicates the two findings are semantically close, suggesting the output finding should be dismissed. 

For example, if a dismissed finding has a similarity score of 0.7 with an output finding, it may be excluded from future scans.
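
As a rough illustration, the sketch below matches new findings against dismissed ones using cosine similarity. It assumes the embedding model is loaded through sentence-transformers; the Hugging Face model id and the 0.7 threshold are illustrative choices, not a definitive implementation.

```python
# Minimal sketch: cosine-similarity matching between dismissed and new findings.
import numpy as np
from sentence_transformers import SentenceTransformer

# Illustrative model id; any pre-trained embedding model could be used here.
model = SentenceTransformer("jinaai/jina-embeddings-v2-base-code",
                            trust_remote_code=True)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

dismissed = ["Lack of access control in unsponsor() function"]
outputs = ["Lack of access control in forceUnsponsor() function"]

dismissed_emb = model.encode(dismissed)
output_emb = model.encode(outputs)

THRESHOLD = 0.7  # illustrative cut-off
for finding, emb in zip(outputs, output_emb):
    best = max(cosine_similarity(emb, d) for d in dismissed_emb)
    decision = "dismiss" if best >= THRESHOLD else "retain"
    print(f"{finding!r}: best similarity {best:.2f} -> {decision}")
```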

Mindful of the particular characteristics of comparing embeddings, we considered multiple similarity measures. Although dot-product similarity is generally less information-destructive [1], we find cosine similarity to be a practical fit for this task.

However, simply using cosine similarity can lead to mis-hits like the one in the example below.

Example:

  • Dismissed finding: "Lack of access control in unsponsor() function"
  • Output finding: "Lack of access control in forceUnsponsor() function"
  • Cosine similarity: 0.95 (high, but function names differ)

Pros: Fast (~40s), cost-effective
Cons: Higher false positives due to semantic-only comparison

Best Model: jina-embeddings-v2-base-code

2. LLM-as-a-Judge (G-Eval)

As we just saw, embeddings struggle when syntax matters. Two findings might be semantically similar but refer to different functions. 

Our custom implementation of G-Eval [2] uses an LLM to judge whether two findings are similar enough to be treated as duplicates. It combines Chain-of-Thought prompting (a series of user-defined evaluation steps) with a weighted average of token log probabilities to produce a final similarity score. G-Eval experiments have demonstrated a high correlation with human judgement.
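
As a rough sketch of what such a judge can look like, the snippet below scores a pair of findings with the OpenAI Chat Completions API and its token log-probabilities. The prompt wording, evaluation steps, model name, and 1-to-5 scale are illustrative assumptions, not our exact implementation.

```python
# Minimal sketch of a G-Eval-style similarity judge (illustrative prompt).
# The score is the probability-weighted average of the digit tokens the model
# could emit, normalized to [0, 1].
import math
from openai import OpenAI

client = OpenAI()

EVAL_STEPS = (
    "1. Do both findings describe the same class of vulnerability?\n"
    "2. Do they refer to the same function or code location?\n"
    "3. Rate how likely they are duplicates, from 1 (different) to 5 (duplicate)."
)

def geval_similarity(dismissed: str, candidate: str) -> float:
    prompt = (
        f"Evaluation steps:\n{EVAL_STEPS}\n\n"
        f"Dismissed finding: {dismissed}\n"
        f"New finding: {candidate}\n\n"
        "Answer with a single digit from 1 to 5."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1,
        logprobs=True,
        top_logprobs=5,
    )
    top = resp.choices[0].logprobs.content[0].top_logprobs
    # Weighted average over the candidate score tokens (the G-Eval trick).
    probs = {t.token.strip(): math.exp(t.logprob)
             for t in top if t.token.strip() in {"1", "2", "3", "4", "5"}}
    if not probs:
        return 0.0
    expected = sum(int(tok) * p for tok, p in probs.items()) / sum(probs.values())
    return (expected - 1) / 4  # normalize 1-5 to 0-1
```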

Example:

  • Dismissed finding: "Lack of access control in unsponsor() function"
  • Output finding: "Lack of access control in forceUnsponsor() function"
  • G-Eval Score: 0.47
  • LLM evaluation: Different enough to retain as a valid finding.

Pros: Higher accuracy
Cons: Slow (~29min), costly

3. Hybrid (Embeddings + G-Eval)

We first apply cosine similarity to filter most findings, then use G-Eval on borderline cases. This improves latency while preserving accuracy, by significantly reducing the number of LLM comparisons needed.

Example:

  • Findings with cosine similarity below 0.7 are handled by the quick cosine filter alone and retained.
  • If the similarity is between 0.7 and 1.0, G-Eval runs additional checks before the finding is dismissed (see the sketch below).
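
Putting the two together, a minimal sketch of the hybrid decision might look like the following; it reuses the cosine_similarity and geval_similarity helpers from the earlier sketches, and the thresholds match the benchmark values reported below.

```python
# Minimal sketch of the hybrid pipeline: cheap cosine filter first,
# LLM judge only for borderline or high-similarity pairs.
COSINE_THRESHOLD = 0.7
GEVAL_THRESHOLD = 0.6

def should_dismiss(candidate: str, dismissed_findings: list[str]) -> bool:
    cand_emb = model.encode([candidate])[0]
    dismissed_embs = model.encode(dismissed_findings)
    for dismissed, d_emb in zip(dismissed_findings, dismissed_embs):
        sim = cosine_similarity(cand_emb, d_emb)
        if sim < COSINE_THRESHOLD:
            continue  # clearly different: no LLM call needed
        # Borderline or high similarity: escalate to the LLM judge.
        if geval_similarity(dismissed, candidate) >= GEVAL_THRESHOLD:
            return True
    return False
```

Because most pairs fall below the cosine threshold, only a small fraction of comparisons ever reach the LLM, which is where the latency and cost savings come from.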

Performance:

The optimal thresholds we landed on were a cosine similarity of 0.7 and a G-Eval score of 0.6, achieving a Hit Rate of 1.00 and an FPR of 0.004 on our benchmark.

Conclusion

The hybrid approach delivers the best trade-off between speed and accuracy, significantly reducing false positives while keeping costs manageable. By using embeddings for their efficiency and LLM evaluation for its precision, we ensure dismissed findings are correctly handled while minimizing the risk of missing true security threats. This enables security teams to focus on real issues rather than noise.

References

[1] Is Cosine-Similarity of Embeddings Really About Similarity? https://research.netflix.com/publication/is-cosine-similarity-of-embeddings-really-about-similarity

[2] G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. https://arxiv.org/abs/2303.16634