Shallow or Deep?

View the Project on GitHub TheSoftwareDesignLab/Shallow_or_deep

Table of contents

  1. Empirical Study Design
  2. Research Question and Main Context
  3. Data Collection
  4. Code Representation
  5. Data Cleaning
  6. Classifiers/Code

Shallow or Deep? An Empirical Study on Detecting Vulnerabilities using Deep Learning

Deep learning (DL) techniques are on the rise in the software engineering research community. More and more approaches have been developed on top of DL models, also due to the unprecedented amount of software-related data that can be used to train these models. One of the recent applications of DL in the software engineering domain concerns the automatic detection of software vulnerabilities. While several DL models have been developed to approach this problem, there is still limited empirical evidence concerning their actual effectiveness, especially when compared with shallow machine learning techniques. In this paper, we partially fill this gap by presenting a large-scale empirical study using three vulnerability datasets and five different source code representations (i.e., the format in which the code is provided to the classifiers to assess whether it is vulnerable or not) to compare the effectiveness of two widely used DL-based models and of one shallow machine learning model in (i) classifying code functions as vulnerable or non-vulnerable (i.e., binary classification), and (ii) classifying code functions based on the specific type of vulnerability they contain (or as "clean", if no vulnerability is present). As a baseline, we include in our study the AutoML utility provided by the Google Cloud Platform. Our results show that the experimented models are still far from ensuring reliable vulnerability detection, and that a shallow learning classifier represents a competitive baseline for the newest DL-based models.


In this online appendix you will find links to the data and to the scripts that we used in the different steps of the process (i.e., training and tuning the models, and retrieving the results).

Empirical Study Design

I. The goal of this study is to empirically analyze the effectiveness of deep/shallow learning techniques for detecting software vulnerabilities in source code at function-level granularity when using different models and source code abstractions. We conduct experiments with three different models (two deep and one shallow). In particular, we experiment with: (i) Random Forest (RF), (ii) a Convolutional Neural Network (CNN), and (iii) a Recurrent Neural Network (RNN), with the first being representative of shallow classifiers and the last two of deep learning models. We chose RF due to its popularity in the software engineering domain. Concerning the two DL models, they have been used, with different variations, in previous studies on the automatic detection of software vulnerabilities.
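As an illustration, below is a minimal sketch of how the three classifiers could be instantiated with scikit-learn and Keras, assuming token sequences as input for the deep models; the architectures, hyper-parameters, and the `VOCAB_SIZE` constant are assumptions for this sketch, not the exact configurations used in the study (those are available in the linked scripts).

```python
# Illustrative sketch of the three classifiers (RF, CNN, RNN); architectures and
# hyper-parameters are assumptions, not the ones used in the study.
from sklearn.ensemble import RandomForestClassifier
from tensorflow.keras import Sequential, layers

VOCAB_SIZE = 10_000  # assumed vocabulary size of the code tokens

# Shallow model: Random Forest over fixed-size feature vectors (e.g., bag-of-tokens).
rf = RandomForestClassifier(n_estimators=100, random_state=42)

# Deep model 1: a 1D CNN over embedded token sequences (binary classification head).
cnn = Sequential([
    layers.Embedding(input_dim=VOCAB_SIZE, output_dim=128),
    layers.Conv1D(filters=128, kernel_size=5, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dense(1, activation="sigmoid"),
])
cnn.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Deep model 2: an LSTM-based RNN over the same sequences (binary classification head).
rnn = Sequential([
    layers.Embedding(input_dim=VOCAB_SIZE, output_dim=128),
    layers.LSTM(128),
    layers.Dense(1, activation="sigmoid"),
])
rnn.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```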
On top of the three experimented models, we also exploit as a baseline for our experiments an automated machine learning (AutoML) approach. AutoML is a solution to build DL systems without human intervention and without relying on human expertise. It has been widely used in Natural Language Processing (NLP) and it is provided by the Google Cloud Platform (GCP). AutoML eases hyper-parameter tuning and feature selection by using Neural Architecture Search (NAS) and transfer learning. Google released this solution in 2018 for Computer Vision and in 2019 for NLP.
II. The context of the study is represented by three datasets of C/C++ code reporting software vulnerabilities at the function granularity level, for a total of 1,841,323 functions, of which 390,558 are vulnerable.


Research Question and Main Context

With this goal and context in mind, our study addresses the following research question (RQ):

What is the effectiveness of different combinations of classifiers and code representations in identifying functions affected by software vulnerabilities?

We answer this RQ in two main steps. First, we create binary classifiers able to discriminate between vulnerable and non-vulnerable functions, without reporting the specific type of vulnerability affecting the code. This scenario is relevant for practitioners/researchers who are only interested in identifying potentially vulnerable code for inspection/investigation. Second, we experiment with the same models in the more challenging scenario of classifying functions as clean (i.e., not affected by any vulnerability) or as affected by specific types of vulnerabilities.


Data Collection

We relied on three datasets composed of C/C++ source code functions and information about the vulnerabilities affecting them.

  1. GitHub Archive Dataset (GH-DS)

  2. SATE IV Juliet Test Suite Dataset (J-DS)

  3. Russell et al. Dataset (R-DS)


Author validation

We are aware that commit messages might imprecisely identify bug-fixing commits and, as a consequence, vulnerability-fixing commits. For this reason, two authors independently analyzed a statistically significant sample (95% confidence level, ±5% confidence interval, for a total sample size of 384) of the identified commits to check whether they were actually vulnerability fixes. After solving 45 cases of disagreement, they concluded that 90.3% of the identified vulnerability-fixing commits were true positives.
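For reference, the sample size of 384 corresponds to the standard formula for estimating a proportion at a 95% confidence level with a ±5% margin of error; a minimal sketch of the computation:

```python
# Sample size for estimating a proportion: n = z^2 * p * (1 - p) / e^2
z = 1.96   # z-score for a 95% confidence level
p = 0.5    # most conservative assumption for the unknown proportion
e = 0.05   # +-5% margin of error (confidence interval)

n = (z ** 2) * p * (1 - p) / (e ** 2)
print(round(n))  # 384 (384.16 before rounding)
```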


Code Representation

From each dataset we extracted two sets of tuples. The first one, in the form (function_code, is_vulnerable), allows us to experiment with the models in the scenario in which we want to identify vulnerable functions but are not interested in the specific type of vulnerability. In the second, the tuples are instead in the form (function_code, vulnerability_type), to experiment with the models in the scenario in which we want to classify the vulnerability type exhibited by a given function. We use the non_vulnerable label to identify functions not affected by any vulnerability. Starting from these two datasets, we built five versions of each one by representing the code in different ways, to study how the code representation affects the performance of the experimented models.
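As an illustration, a minimal sketch of how the two sets of tuples can be derived from a collection of labeled functions; the `functions` list, its field names, and the CWE label below are hypothetical, not taken from the datasets.

```python
# Illustrative sketch of building the two sets of tuples; the example data,
# field names, and CWE label are hypothetical.
functions = [
    {"code": "int get(int *a, int i) { return a[i]; }", "vulnerability": "CWE-125"},
    {"code": "int add(int a, int b) { return a + b; }", "vulnerability": None},
]

# (function_code, is_vulnerable): binary labels for the first scenario.
binary_tuples = [(f["code"], f["vulnerability"] is not None) for f in functions]

# (function_code, vulnerability_type): multiclass labels for the second scenario,
# with the non_vulnerable label for functions not affected by any vulnerability.
multiclass_tuples = [(f["code"], f["vulnerability"] or "non_vulnerable") for f in functions]
```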

The abstract representations can be found below:

  1. GitHub Archive Dataset (GH-DS) (Abstract representations)

  2. SATE IV Juliet Test Suite Dataset (J-DS) (Abstract representations)

  3. Russell et al. Dataset (R-DS) (Abstract representations)
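As a generic illustration of what an abstract representation may look like (this is only a sketch; the actual representations used in the study are available through the links above), one way to abstract code is to map each distinct identifier to a positional token:

```python
# Illustrative identifier-abstraction pass; NOT the exact representations
# used in the study (see the links above for those).
import re

C_KEYWORDS = {"int", "char", "return", "if", "else", "for", "while", "void"}

def abstract_identifiers(code: str) -> str:
    """Replace each distinct non-keyword identifier with a positional token (ID_1, ID_2, ...)."""
    mapping = {}
    def repl(match):
        token = match.group(0)
        if token in C_KEYWORDS:
            return token
        if token not in mapping:
            mapping[token] = f"ID_{len(mapping) + 1}"
        return mapping[token]
    return re.sub(r"[A-Za-z_][A-Za-z_0-9]*", repl, code)

print(abstract_identifiers("int add(int a, int b) { return a + b; }"))
# -> int ID_1(int ID_2, int ID_3) { return ID_2 + ID_3; }
```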


Data Cleaning

Before using the three datasets to train and evaluate the experimented models, we performed a transformation and cleaning process on each of them to (i) make the data suitable for the DL/shallow models, and (ii) avoid possible pitfalls that are common in studies of machine learning on code (e.g., duplicated functions due to forked projects).
Starting from the five different datasets of function representations built for each dataset, we addressed conflicting representations (i.e., two samples with the same code representation but different labels) and duplicates. In case of conflicting representations, all instances were removed. As for the duplicates, we removed all duplicates having the same raw source code representation and the same label (i.e., the type of vulnerability affecting them, if any), keeping only the first occurrence. This means that it is possible to have in our datasets two snippets having the same abstract representation, but not the same raw source code. Such a design choice is justified by the fact that the translation from raw source code to abstract representation is part of the classification pipelines used in ML implementations, and it is performed after the removal of duplicates.
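A minimal sketch of this cleaning step, assuming a pandas DataFrame with hypothetical `code` (raw source code) and `label` columns; in the study, conflicts are detected on each code representation, while here a single column is used for brevity.

```python
# Illustrative cleaning step over hypothetical `code`/`label` columns.
import pandas as pd

df = pd.DataFrame({
    "code":  ["int f(){return 0;}", "int f(){return 0;}",
              "int g(int x){return x;}", "int g(int x){return x;}"],
    "label": ["non_vulnerable", "CWE-787",
              "non_vulnerable", "non_vulnerable"],
})

# 1) Conflicting representations: same code, different labels -> remove all instances.
conflicting = df.groupby("code")["label"].nunique() > 1
df = df[~df["code"].isin(conflicting[conflicting].index)]

# 2) Duplicates: same raw code and same label -> keep only the first occurrence.
df = df.drop_duplicates(subset=["code", "label"], keep="first")
```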


Classifiers/Code

Given the variables involved in our study, namely four approaches (i.e., GCP-AutoML, RF, CNN, and RNN), five representations, three datasets, and two types of classification (i.e., binary and multiclass), we built a total of 120 different models. We publicly release the source code of the base models and instructions on how to retrieve the results after training and tuning.
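For reference, the 120 configurations are simply the cross-product of the study variables; in the sketch below the representation names are placeholders, while the other names come from the study.

```python
# 120 configurations = 4 approaches x 5 representations x 3 datasets x 2 tasks.
from itertools import product

approaches = ["GCP-AutoML", "RF", "CNN", "RNN"]
representations = ["R1", "R2", "R3", "R4", "R5"]  # placeholder names for the five code representations
datasets = ["GH-DS", "J-DS", "R-DS"]
tasks = ["binary", "multiclass"]

configurations = list(product(approaches, representations, datasets, tasks))
print(len(configurations))  # 120
```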