
The Almanax platform scans code repositories for potential issues, but not every flagged finding requires action: some are false positives, while others reflect known and intentional design choices. Our new "dismiss" feature lets users filter out these false positives and design choices so they don’t reappear in future scans.
This feature relies on a hybrid method: embeddings for speed and LLM-as-a-judge (G-Eval) for accuracy. We benchmarked different approaches to balance performance, cost, and latency.
We created a benchmark with:
Example
One approach to measuring the precision of our dismissal pipeline would be to treat it as a simple binary classification task, where each finding is either retained or dismissed without further distinction. The problem is that this counts a dismissal as correct whenever a flagged finding disappears, without checking that it was matched to the right dismissed finding, which can be misleading. For instance, if a finding is filtered out because it resembles a different, unrelated dismissed finding, a binary approach would still count it as a correct dismissal, despite the underlying mistake.
Instead, we take a more nuanced approach. For each dismissed finding, we check whether the corresponding output finding was filtered out. We also measure how many other output findings were incorrectly filtered out.
From these measurements, we calculate two key metrics:
Hit Rate: the fraction of dismissed findings whose corresponding output finding was correctly filtered out.
False Positive Rate (FPR): the fraction of the remaining output findings that were incorrectly filtered out.
These metrics give us a clear picture of how accurate our dismissal pipeline is.
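As a rough illustration, the sketch below computes these two metrics for a single scan. The function name, the finding IDs, and the numbers in the usage example are all hypothetical, not values from our benchmark.

```python
# Hypothetical sketch of the benchmark metrics, not the exact production implementation.
# expected_dismissals: output finding IDs that *should* be filtered (one per dismissed finding).
# filtered_ids: output finding IDs the pipeline actually filtered out.
# total_findings: total number of findings produced by the scan.

def dismissal_metrics(expected_dismissals: set[str],
                      filtered_ids: set[str],
                      total_findings: int) -> tuple[float, float]:
    # Hit Rate: fraction of expected dismissals that were actually filtered out.
    hits = len(expected_dismissals & filtered_ids)
    hit_rate = hits / len(expected_dismissals) if expected_dismissals else 1.0

    # False Positive Rate: fraction of the remaining findings that were
    # incorrectly filtered out.
    false_positives = len(filtered_ids - expected_dismissals)
    negatives = total_findings - len(expected_dismissals)
    fpr = false_positives / negatives if negatives else 0.0
    return hit_rate, fpr


# Made-up example: both expected dismissals filtered, plus one finding
# filtered by mistake out of 248 others -> hit rate 1.0, FPR ~0.004.
print(dismissal_metrics({"F-12", "F-87"}, {"F-12", "F-87", "F-301"}, 250))
```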
1. Embeddings (Cosine Similarity)
This approach leverages the semantic similarities captured by embeddings. We convert each finding and dismissed finding into a numerical representation (embedding) using a pre-trained model. Then, we calculate the cosine similarity between these embeddings. A high cosine similarity indicates the two findings are semantically close, suggesting the output finding should be dismissed.
For example, if an output finding has a similarity score of 0.7 or above with a dismissed finding, it may be excluded from future scans.
Embedding comparisons come with their own caveats, so we considered multiple similarity measures. While dot-product similarity is generally less information-destructive [1], we find cosine similarity to be a practical fit for this task.
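For concreteness, here is a minimal sketch of the embedding-based filter, assuming the model is loaded through sentence-transformers and using the 0.7 threshold from our benchmark. The loading details and the plain-string representation of findings are simplifications, not our exact pipeline.

```python
# Minimal sketch of the embedding-based dismissal filter (illustrative only).
import numpy as np
from sentence_transformers import SentenceTransformer

# Assumes the Hugging Face release of jina-embeddings-v2-base-code;
# exact loading options may differ in practice.
model = SentenceTransformer("jinaai/jina-embeddings-v2-base-code",
                            trust_remote_code=True)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def should_dismiss(finding: str, dismissed: list[str], threshold: float = 0.7) -> bool:
    """Filter the finding if it is semantically close to any dismissed finding."""
    finding_vec = model.encode(finding)
    dismissed_vecs = model.encode(dismissed)
    return any(cosine(finding_vec, d) >= threshold for d in dismissed_vecs)
```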
However, simply using cosine similarity can lead to mis-hits like the one in the example below.
Example:
Pros: Fast (~40s), cost-effective
Cons: Higher false positives due to semantic-only comparison
Best Model: jina-embeddings-v2-base-code
Other Models:
2. LLM-as-a-Judge (G-Eval)
As we just saw, embeddings struggle when syntax matters. Two findings might be semantically similar but refer to different functions.
Our custom implementation of G-Eval [2] uses an LLM to determine whether two findings are similar enough. It combines Chain-of-Thought prompting (a series of user-defined evaluation steps) with a probability-weighted average over the candidate score tokens, derived from their log probabilities, to produce a final similarity score. G-Eval has demonstrated a high correlation with human judgement.
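The sketch below shows the weighted-scoring idea, assuming an OpenAI-style chat completions API that returns per-token log probabilities. The prompt, evaluation steps, score scale, and model name are placeholders for illustration, not our production configuration.

```python
# Hypothetical G-Eval-style scorer: the model emits a single score token,
# and we average the candidate scores weighted by their probabilities.
import math
from openai import OpenAI

client = OpenAI()

EVAL_PROMPT = """You are comparing two security findings.
Evaluation steps:
1. Check whether both findings refer to the same file and function.
2. Check whether they describe the same class of vulnerability.
3. Check whether dismissing one would justify dismissing the other.
Finding A: {a}
Finding B: {b}
Answer with a single similarity score from 1 (unrelated) to 5 (same issue)."""

def geval_similarity(a: str, b: str) -> float:
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": EVAL_PROMPT.format(a=a, b=b)}],
        max_tokens=1,
        logprobs=True,
        top_logprobs=5,
    )
    # Probability-weighted average over the candidate score tokens (1-5),
    # normalised to a 0-1 similarity score.
    top = response.choices[0].logprobs.content[0].top_logprobs
    weighted, total = 0.0, 0.0
    for candidate in top:
        token = candidate.token.strip()
        if token in {"1", "2", "3", "4", "5"}:
            p = math.exp(candidate.logprob)
            weighted += p * int(token)
            total += p
    return ((weighted / total) - 1) / 4 if total else 0.0
```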
Example:
Pros: Higher accuracy
Cons: Slow (~29min), costly
3. Hybrid (Embeddings + G-Eval)
We first apply cosine similarity to filter most findings, then use G-Eval on borderline cases. This improves latency while preserving accuracy, by significantly reducing the number of LLM comparisons needed.
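One plausible way to wire the two stages together is sketched below, reusing the embedding model and the geval_similarity() helper from the earlier sketches and the thresholds from our benchmark. How exactly "borderline" is defined is simplified here: only pairs that clear the cosine threshold are escalated to the LLM judge.

```python
# Illustrative hybrid routing, not the exact production logic.
import numpy as np

def cosine_similarity(a: str, b: str) -> float:
    va, vb = model.encode(a), model.encode(b)  # model from the embedding sketch above
    return float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))

def hybrid_should_dismiss(finding: str, dismissed: list[str],
                          cos_threshold: float = 0.7,
                          geval_threshold: float = 0.6) -> bool:
    for d in dismissed:
        # Cheap embedding pass: most pairs fall below the cosine threshold
        # and never reach the LLM judge.
        if cosine_similarity(finding, d) >= cos_threshold:
            # Candidate match: confirm with G-Eval before dismissing.
            if geval_similarity(finding, d) >= geval_threshold:
                return True
    return False
```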
Example:
Performance:
The optimal thresholds we landed on were cosine similarity = 0.7 and G-Eval = 0.6, achieving a Hit Rate of 1.00 and an FPR of 0.004 on our benchmark.
The hybrid approach delivers the best trade-off between speed and accuracy, significantly reducing false positives while keeping costs manageable. By using embeddings for their efficiency and LLM evaluation for its precision, we ensure dismissed findings are correctly handled while minimizing the risk of missing true security threats. This enables security teams to focus on real issues rather than noise.
References
[1] Is Cosine-Similarity of Embeddings Really About Similarity? https://research.netflix.com/publication/is-cosine-similarity-of-embeddings-really-about-similarity
[2] G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. https://arxiv.org/abs/2303.16634