Guardrails Evaluation — AI Safety Research Tool

This tool is developed as part of the SURICATA-I project at the University of Deusto. It evaluates the refusal behaviour of large language models (LLMs) against harm benchmarks such as WildGuardMix and HarmBench.

The dashboard visualises metrics including true positive rate (TPR), false positive rate (FPR), F1 score, and attack success rate (ASR) across multiple models and datasets.

JavaScript is required to use this application.

LLM Evaluation

Dataset Model Harm label Model refused? Classification Search prompt

Loading…

Model Emotion Search utterance

Loading…