Security questions answered by AI
Generative Artificial Intelligence, and in particular Large Language Models (LLMs) and LLM-based agents, has significantly changed the way we (humans) perform our daily activities. Features such as the ability to hold smart conversations with and without context, and the capability to answer questions of any kind and from any domain, are attracting increasing attention from the researcher and practitioner communities, because we need to understand and assess the weaknesses and strengths of the models. For instance, hallucination is a well-known issue in LLMs, as is the possibility of producing inappropriate answers when the models lack filters or are biased or poisoned. Previous work has been devoted to assessing LLMs in different contexts and scenarios, e.g., code generation. However, few studies have been conducted in the context of information security; to our knowledge, no previous work has analyzed the quality of answers provided by LLMs to cybersecurity-related questions. Therefore, we present a dataset of questions extracted from StackExchange, including their top-10 answers and the answers generated by three GPT models (3.5-Turbo, 4, and 4o) for 5K+ questions; the dataset also includes similarity metrics (e.g., ROUGE, SacreBLEU, BERTScore) of the LLM-based answers when compared to the human-accepted ones.
This section introduces SecLLM, a dataset comprising the most popular security-related questions on Stack Exchange from 2010 to 2023. The dataset includes a total of 5,023 questions, along with their corresponding answers: both the original, user-submitted answers and responses generated by several versions of the popular large language model GPT.
The dataset was constructed by extracting the questions and related data from Stack Exchange using an XML dump found on the Internet Archive, and then parsing it to obtain both the questions and their answers. Next, we used the GPT-3.5, GPT-4, and GPT-4o models to generate responses to the questions. Finally, we used the ROUGE, SacreBLEU, and BERTScore metrics to evaluate the quality of the generated responses, comparing them to the accepted human answer of each question.
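The comparison step can be illustrated with a minimal sketch, assuming the rouge_score, sacrebleu, and bert_score Python packages; the exact pipeline used to build the dataset is not shown here and may differ.

```python
# Sketch: comparing a GPT-generated answer against the accepted human answer.
# The actual scripts used to build SecLLM may differ from this example.
from rouge_score import rouge_scorer
import sacrebleu
import bert_score


def similarity_metrics(generated: str, accepted: str) -> dict:
    # ROUGE-L F-measure between the generated and the accepted answer
    rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l = rouge.score(accepted, generated)["rougeL"].fmeasure

    # SacreBLEU sentence-level score (0-100 scale)
    bleu = sacrebleu.sentence_bleu(generated, [accepted]).score

    # BERTScore F1 (semantic similarity based on contextual embeddings)
    _, _, f1 = bert_score.score([generated], [accepted], lang="en")

    return {"rougeL": rouge_l, "sacrebleu": bleu, "bertscore_f1": f1.item()}
```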
The resulting dataset is distributed through three storage mechanisms. The first is a relational database using PostgreSQL, whose structure is described below.
The second is a JSON file containing a list of questions; each question entry contains the question's information, its tags, the human answers, and the generated answers, as well as the metrics used to evaluate them. The third is an XML file with the same structure as the JSON file.
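As an illustration, a question entry in the JSON file could be read as follows; the file name and field names (`title`, `tags`, `answers`, `generated_answers`, `metrics`) are hypothetical and only reflect the structure described above, so consult the actual file for the exact keys.

```python
import json

# Load the JSON export of the dataset (file name is illustrative).
with open("secllm.json", encoding="utf-8") as fp:
    questions = json.load(fp)

# Field names below are hypothetical examples of the structure described
# above: question data, tags, human answers, generated answers, and metrics.
first = questions[0]
print(first["title"])
print(first["tags"])
print(len(first["answers"]))               # top human answers
print(first["generated_answers"].keys())   # e.g. GPT-3.5, GPT-4, GPT-4o
print(first["metrics"])                    # ROUGE / SacreBLEU / BERTScore
```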
As shown in the database schema above, the dataset consists of six tables.
Because generating answers with GPT has a monetary cost, the full set of Stack Exchange questions was reduced to 5,023 by selecting the most popular questions from each year. The histogram below shows the number of questions per year.
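A sketch of how such a per-year selection could be performed with pandas; the input file, the column names (`creation_date`, `view_count` as a popularity proxy), and the per-year cutoff are assumptions for illustration, not the exact values used to build the dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd

# One row per Stack Exchange question; column names are illustrative.
questions_df = pd.read_csv("security_questions.csv", parse_dates=["creation_date"])

TOP_PER_YEAR = 360  # roughly 5,023 questions spread over 2010-2023 (assumed cutoff)

questions_df["year"] = questions_df["creation_date"].dt.year
top_questions = (
    questions_df.sort_values("view_count", ascending=False)
    .groupby("year", group_keys=False)
    .head(TOP_PER_YEAR)
)

# Histogram of the number of selected questions per year
top_questions["year"].value_counts().sort_index().plot(kind="bar")
plt.xlabel("year")
plt.ylabel("questions")
plt.show()
```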
The prompt used to generate the answers explicitly requested responses of 250 words; the histogram below shows the length distribution of the generated answers.
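The sketch below only illustrates how such a request could be issued through the OpenAI chat completions API, with the 250-word constraint expressed in the prompt; the prompt wording and the model identifier shown here are assumptions, not the exact ones used for the dataset.

```python
from openai import OpenAI

client = OpenAI()  # requires the OPENAI_API_KEY environment variable


def generate_answer(question_title: str, question_body: str, model: str = "gpt-4o") -> str:
    # Illustrative prompt: the real prompt asked for 250-word answers,
    # but its exact wording may differ from this sketch.
    prompt = (
        "Answer the following information-security question in 250 words.\n\n"
        f"Title: {question_title}\n\n{question_body}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```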
The following plots show the distribution of each metric for each model, allowing a comparison of how the models behave relative to one another.
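A minimal plotting sketch, assuming the metric values have been collected into a pandas DataFrame with hypothetical columns `model`, `metric`, and `value` (one row per question, model, and metric); the actual plots in the dataset documentation may have been produced differently.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Columns are illustrative: one row per (question, model, metric).
metrics_df = pd.read_csv("secllm_metrics.csv")

for metric_name, group in metrics_df.groupby("metric"):
    # One boxplot per model for this metric, to compare their distributions.
    group.boxplot(column="value", by="model")
    plt.suptitle("")
    plt.title(metric_name)
    plt.ylabel("score")
plt.show()
```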