Guardrails Evaluation — AI Safety Research Tool
This tool is developed as part of the SURICATA-I project at the University of Deusto. It evaluates the refusal behaviour of large language models (LLMs) against harm benchmarks such as WildGuardMix and HarmBench.
The dashboard visualises metrics including true positive rate (TPR), false positive rate (FPR), F1 score, and attack success rate (ASR) across multiple models and datasets.
JavaScript is required to use this application.