VerifAI: A Toolkit for the Formal Design and Analysis of Artificial Intelligence-Based Systems
We present VerifAI, a software toolkit for the formal design and analysis of systems that include artificial intelligence (AI) and machine learning (ML) components. VerifAI particularly addresses challenges in applying formal methods to ML components, such as perception systems based on deep neural networks, and to systems containing them, as well as in modeling and analyzing system behavior in the presence of environment uncertainty. We describe the initial version of VerifAI, which centers on simulation-based verification and synthesis guided by formal models and specifications. We give examples of several use cases, including temporal-logic falsification, model-based systematic fuzz testing, parameter synthesis, counterexample analysis, and data set augmentation.
Keywords: Formal methods · Falsification · Simulation · Cyber-physical systems · Machine learning · Artificial intelligence · Autonomous vehicles
1 Introduction
The increasing use of artificial intelligence (AI) and machine learning (ML) in systems, including safety-critical systems, has brought with it a pressing need for formal methods and tools for their design and verification. However, AI/ML-based systems, such as autonomous vehicles, have certain characteristics that make the application of formal methods very challenging. We mention three key challenges here; see Seshia et al. for an in-depth discussion. First, several uses of AI/ML are for perception, the use of computational systems to mimic human perceptual tasks such as object recognition and classification, conversing in natural language, etc. For such perception components, writing a formal specification is extremely difficult, if not impossible. Additionally, the signals processed by such components can be very high-dimensional, such as streams of images or LiDAR data. Second, machine learning being a dominant paradigm in AI, formal tools must be compatible with the data-driven design flow for ML and also be able to handle the complex, high-dimensional structures in ML components such as deep neural networks. Third, the environments in which AI/ML-based systems operate can be very complex, with considerable uncertainty even about how many (which) agents are in the environment (both human and robotic), let alone about their intentions and behaviors. As an example, consider the difficulty in modeling urban traffic environments in which an autonomous car must operate. Indeed, AI/ML is often introduced into these systems precisely to deal with such complexity and uncertainty! From a formal methods perspective, this makes it very hard to create realistic environment models with respect to which one can perform verification or synthesis.
Perception: A perception component maps a concrete feature space (e.g. pixels) to an output such as a classification, prediction, or state estimate. To deal with the lack of specification for perception components, VerifAI analyzes them in the context of a closed-loop system using a system-level specification. Moreover, to scale to complex high-dimensional feature spaces, VerifAI operates on an abstract feature space (or semantic feature space) that describes semantic aspects of the environment being perceived, not the raw features such as pixels.
Learning: VerifAI aims not only to analyze the behavior of ML components but also to use formal methods for their (re-)design. To this end, it provides features to (i) design the data set for training and testing, (ii) analyze counterexamples to gain insight into mistakes by the ML model, and (iii) synthesize parameters, including hyper-parameters for training algorithms and ML model parameters.
Environment Modeling: Since it can be difficult, if not impossible, to exhaustively model the environments of AI-based systems, VerifAI aims to provide ways to capture a designer’s assumptions about the environment, including distribution assumptions made by ML components, and to describe the abstract feature space in an intuitive, declarative manner. To this end, VerifAI provides users with Scenic [12, 13], a probabilistic programming language for modeling environments. Scenic, combined with a renderer or simulator for generating sensor data, can produce semantically-consistent input for perception components.
2 VerifAI Structure and Operation
Inputs and Outputs: Using VerifAI requires setting up a simulator for the domain of interest. As we explain in Sect. 3, we have experimented with multiple robotics simulators and provide an easy interface for connecting a new simulator. The user then constructs the inputs to VerifAI, including (i) a simulatable model of the system, including code for one or more controllers and perception components, and a dynamical model of the system being controlled; (ii) a probabilistic model of the environment, specifying constraints on the workspace, the locations of agents and objects, and the dynamical behavior of agents; and (iii) a property over the composition of the system and its environment. VerifAI is implemented in Python for interoperability with ML/AI libraries and simulators across platforms. The code for the controller and perception component can be arbitrary executable code, invoked by the simulator. The environment model typically comprises a definition in the simulator of the different types of agents, plus a description of their initial conditions and other parameters using the Scenic probabilistic programming language. Finally, the property to be checked can be expressed using Metric Temporal Logic (MTL) [2, 24], objective functions, or arbitrary code monitoring the property. The output of VerifAI depends on the feature being invoked. For falsification, VerifAI returns one or more counterexamples: simulation traces violating the property. For fuzz testing, VerifAI produces traces sampled from the distribution of behaviors induced by the probabilistic environment model. Error table analysis collects counterexamples generated by the falsifier into a table, which is then analyzed to identify features correlated with property failures. Data augmentation uses falsification and error table analysis to generate additional data for training and testing an ML component.
Finally, property-driven synthesis of model parameters or hyper-parameters outputs a parameter evaluation that satisfies the specified property.
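Schematically, the falsification use case ties these inputs and outputs together in a simple loop; the stand-in functions below are hypothetical sketches, not VerifAI's actual API:

```python
def falsify(sample_config, simulate, monitor, budget=100):
    """Top-level simulation-based falsification loop (schematic sketch).

    Each iteration samples a point of the abstract feature space,
    runs a closed-loop simulation, and scores the resulting trace;
    non-positive scores are property violations (counterexamples).
    """
    counterexamples = []
    for _ in range(budget):
        config = sample_config()   # pick a point of the abstract feature space
        trace = simulate(config)   # run the simulator on that configuration
        if monitor(trace) <= 0:    # quantitative score: <= 0 means violation
            counterexamples.append((config, trace))
    return counterexamples

# Toy instantiation: the "system" ends up 10 - speed meters from an obstacle,
# and the property requires staying at least 3 m away (all values hypothetical).
speeds = iter([2.0, 6.0, 9.0])
result = falsify(sample_config=lambda: next(speeds),
                 simulate=lambda v: [10.0 - v],
                 monitor=lambda trace: min(trace) - 3.0,
                 budget=3)
```

Here only the third configuration (speed 9.0) violates the property, so it is returned as the sole counterexample.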
Abstract Feature Space and Scenic Modeling Language: The abstract feature space is a compact representation of the possible configurations of the simulation. Abstract features can represent parameters of the environment, controllers, or of ML components. For example, when analyzing a visual perception system for an autonomous car, an abstract feature space could consist of the initial poses and types of all vehicles on the road. Note that this abstract space, compared to the concrete feature space of pixels used as input to the controller, is better suited to the analysis of the overall closed-loop system (e.g. finding conditions under which the car might crash).
VerifAI provides two ways to construct abstract feature spaces. They can be constructed hierarchically, combining basic domains such as hyperboxes and finite sets into structures and arrays. For example, we could define a space for a car as a structure combining a 2D box for position with a 1D box for heading, and then create an array of these to get a space for several cars. Alternatively, VerifAI allows a feature space to be defined using a program in the Scenic language . Scenic provides convenient syntax for describing geometric configurations and agent parameters, and, as a probabilistic programming language, allows placing a distribution over the feature space which can be conditioned by declarative constraints.
Searching the Feature Space: Once the abstract feature space is defined, the next step is to search that space to find simulations that violate the property or produce other interesting behaviors. Currently, VerifAI uses a suite of sampling methods (both active and passive) for this purpose, but in the future we expect to also integrate directed or exhaustive search methods, including those from the adversarial machine learning literature. Passive samplers, which do not use any feedback from the simulation, include uniform random sampling, simulated annealing, and Halton sequences (quasi-random deterministic sequences with low-discrepancy guarantees, which we have found effective for falsification). Distributions defined using Scenic are also passive in this sense. Active samplers, whose selection of samples is informed by feedback from previous simulations, include cross-entropy sampling and Bayesian optimization. The former selects samples and updates the prior distribution by minimizing cross-entropy; the latter updates the prior from the posterior over a user-provided objective function, e.g. the satisfaction level of a specification or the loss of an analyzed model.
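As an example of a passive sampler, the Halton sequence is simple to generate; a minimal sketch:

```python
def halton(index: int, base: int) -> float:
    """Return element `index` (1-based) of the Halton sequence in `base`.

    The sequence fills (0, 1) with low discrepancy: successive points land
    in the largest remaining gaps, which makes it a useful passive sampler
    when no simulation feedback is available.
    """
    result, f = 0.0, 1.0
    while index > 0:
        f /= base
        result += f * (index % base)
        index //= base
    return result

# A multi-dimensional quasi-random point uses a coprime base per dimension
# (here bases 2 and 3 for a 2D feature space).
points = [(halton(i, 2), halton(i, 3)) for i in range(1, 6)]
```

In base 2 the sequence begins 1/2, 1/4, 3/4, 1/8, …, repeatedly bisecting the unexplored gaps of the unit interval.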
Property Monitor: Trajectories generated by the simulator are evaluated by the monitor, which produces a score for a given property or objective function. VerifAI supports monitoring MTL properties using the py-metric-temporal-logic package, including both the Boolean and quantitative semantics of MTL. As mentioned above, the user can also specify a custom monitor as a Python function. The result of the monitor can be used to output falsifying traces and also as feedback to the search procedure to direct the sampling (search) towards falsifying scenarios.
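For instance, a custom monitor for a requirement of the form "always stay at least d_min meters from an obstacle" can be a plain Python function returning the quantitative satisfaction level; the trace format and the 5 m threshold below are hypothetical:

```python
def always_min_distance(trace, d_min=5.0):
    """Quantitative score for G(distance >= d_min) over a finished trace.

    `trace` is a sequence of (time, distance) pairs from the simulator;
    the score is the worst-case margin, so a negative value both witnesses
    a violation and tells an active sampler how badly the run failed.
    """
    return min(distance - d_min for _, distance in trace)

safe_trace = [(0.0, 12.0), (0.5, 9.0), (1.0, 7.5)]
bad_trace = [(0.0, 12.0), (0.5, 4.0), (1.0, 7.5)]

assert always_min_distance(safe_trace) > 0   # property satisfied
assert always_min_distance(bad_trace) < 0    # counterexample found
```

Returning a real-valued margin rather than a Boolean verdict is what lets samplers such as cross-entropy or Bayesian optimization follow the score downhill towards violations.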
Error Table Analysis: Counterexamples are stored in a data structure called the error table, whose rows are counterexamples and whose columns are abstract features. The error table can be used offline to debug (explain) the generated counterexamples, or online to drive the sampler towards particular areas of the abstract feature space. VerifAI provides different techniques for error table analysis depending on the end use (e.g., counterexample analysis or data set augmentation), including principal component analysis (PCA) for ordered feature domains and subsets of the most recurrent values for unordered domains.
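As a sketch of the ordered-domain case, the leading principal direction of a two-feature error table can be computed in closed form (illustrative code, not VerifAI's implementation):

```python
import math

def principal_direction(error_table):
    """Leading principal direction of a two-feature error table
    (rows = counterexamples, columns = abstract feature values).

    Uses the closed-form orientation of the 2x2 covariance matrix; the
    direction along which counterexamples vary most hints at what the
    failures have in common.
    """
    n = len(error_table)
    mx = sum(x for x, _ in error_table) / n
    my = sum(y for _, y in error_table) / n
    a = sum((x - mx) ** 2 for x, _ in error_table) / n
    c = sum((y - my) ** 2 for _, y in error_table) / n
    b = sum((x - mx) * (y - my) for x, y in error_table) / n
    # Major-axis angle of the covariance ellipse [[a, b], [b, c]].
    theta = 0.5 * math.atan2(2 * b, a - c)
    return math.cos(theta), math.sin(theta)

# Toy error table: failures cluster along the line x ≈ y, so the leading
# direction should be roughly (1, 1) / sqrt(2).
table = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.1), (3.0, 3.0)]
dx, dy = principal_direction(table)
```

In higher dimensions the same analysis is done with a full PCA (e.g. via SVD of the centered table), but the two-feature case already illustrates the idea.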
The communication between VerifAI and the simulator is implemented in a client-server fashion using IPv4 sockets: VerifAI sends configurations to the simulator, which then returns trajectories (traces). This architecture allows easy interfacing with a simulator, and even with multiple simulators at the same time.
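The exchange can be sketched as follows; for brevity this uses a local socket pair rather than IPv4 sockets, and the newline-delimited JSON framing plus the toy constant-speed "simulator" are assumptions of the sketch, not VerifAI's actual protocol:

```python
import json
import socket
import threading

def serve_simulator(conn: socket.socket) -> None:
    """Toy 'simulator' side: receive a configuration, return a trace.

    The dynamics (constant-speed motion from x0) stand in for a real
    simulator such as Webots.
    """
    with conn, conn.makefile('rw') as f:
        config = json.loads(f.readline())
        trace = [{'t': t, 'x': config['x0'] + config['speed'] * t}
                 for t in range(config['steps'])]
        f.write(json.dumps(trace) + '\n')
        f.flush()

# Client ("VerifAI") side: send a configuration, read back the trajectory.
client, server = socket.socketpair()
worker = threading.Thread(target=serve_simulator, args=(server,))
worker.start()
with client, client.makefile('rw') as f:
    f.write(json.dumps({'x0': 0.0, 'speed': 2.0, 'steps': 3}) + '\n')
    f.flush()
    trace = json.loads(f.readline())
worker.join()
```

Because the protocol only passes configurations one way and traces the other, swapping in a different simulator (or running several at once) only requires implementing the server side.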
3 Features and Case Studies
This section illustrates the main features of VerifAI through case studies demonstrating its various use cases and simulator interfaces. Specifically, we demonstrate model falsification and fuzz testing of an autonomous vehicle (AV) controller, data augmentation and error table analysis for a convolutional neural network, and model and hyperparameter tuning for a reinforcement learning-based controller.
3.1 Falsification and Fuzz Testing
VerifAI offers a convenient way to debug systems through systematic testing. Given a model and a specification, the tool can use active sampling to automatically search for inputs driving the model towards a violation of the specification. VerifAI can also perform model-based fuzz testing, exploring random variations of a scenario guided by formal constraints. To demonstrate falsification and fuzz testing, we consider two scenarios involving AVs simulated with the robotics simulator Webots. For the experiments reported here, we used Webots 2018, which is commercial software.
In the first example, we falsify the controller of an AV responsible for safely maneuvering around a disabled car and traffic cones that are blocking the road. We implemented a hybrid controller which relies on perception modules for state estimation. Initially, the car follows its lane using standard computer vision (non-ML) techniques for line detection. At the same time, a neural network (based on squeezeDet) estimates the distance to the cones. When the distance drops below 15 m, the car performs a lane change, afterward switching back to lane-following.
The correctness of the AV is characterized by an MTL formula requiring the vehicle to maintain a minimum distance from the traffic cones and avoid overshoot while changing lanes. The task of the falsifier is to find small perturbations of the initial scene (generated by Scenic) which cause the vehicle to violate this specification. We allowed perturbations of the initial positions and orientations of all objects, the color of the disabled car, and the cruising speed and reaction time of the ego car.
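The mode-switching logic of this hybrid controller can be sketched as follows (the perception input is stubbed out; `lane_change_done` is a hypothetical flag from the low-level maneuver controller, while the 15 m threshold comes from the scenario):

```python
from enum import Enum, auto

class Mode(Enum):
    LANE_FOLLOW = auto()
    LANE_CHANGE = auto()

def step_mode(mode: Mode, cone_distance: float, lane_change_done: bool) -> Mode:
    """One step of the hybrid controller's mode logic (a sketch).

    `cone_distance` stands in for the neural network's distance estimate.
    The controller lane-follows until the cones are estimated to be
    within 15 m, performs a lane change, then resumes lane-following.
    """
    if mode is Mode.LANE_FOLLOW and cone_distance < 15.0:
        return Mode.LANE_CHANGE
    if mode is Mode.LANE_CHANGE and lane_change_done:
        return Mode.LANE_FOLLOW
    return mode
```

Note that a perception error (an overestimated `cone_distance`) delays the mode switch, which is exactly the kind of system-level failure the falsifier searches for.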
3.2 Data Augmentation and Error Table Analysis
Data augmentation is the process of supplementing training sets with the goal of improving the performance of ML models. Typically, data sets are augmented with transformed versions of preexisting training examples. In prior work, we showed that augmentation with counterexamples is also an effective method for model improvement.
VerifAI implements a counterexample-guided augmentation scheme, where a falsifier (see Sect. 3.1) generates misclassified data points that are then used to augment the original training set. The user can choose among different sampling methods; passive samplers are suited to generating diverse sets of data points, while active samplers can efficiently generate many similar counterexamples. In addition to the counterexamples themselves, VerifAI also returns an error table aggregating information on the misclassifications, which can be used to drive the retraining process. Figure 4 shows the rendering of a misclassified sample generated by our falsifier.
For our experiments, we implemented a renderer that generates images of road scenarios and tested the quality of our augmentation scheme on the squeezeDet convolutional neural network, trained for classification. We adopted three techniques to select augmentation images: (1) randomly sampling from the error table, (2) selecting the top k closest (most similar) samples from the error table, and (3) using PCA to generate new samples. For details on the renderer and the results of counterexample-driven augmentation, see our prior work; incorporating the generated counterexamples during re-training improves the accuracy of the network.
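The counterexample-guided augmentation scheme can be sketched as a simple loop (all components below are toy stand-ins: a 1D threshold "classifier", a grid-probing falsifier, and a hypothetical ground-truth boundary at 0.7):

```python
def augment_with_counterexamples(train, falsify, retrain, model, rounds=3):
    """Counterexample-guided augmentation loop (schematic sketch).

    `falsify(model)` plays the role of the falsifier: it returns inputs
    the current model misclassifies. These counterexamples are appended
    to the training set before retraining, as in the scheme above.
    """
    for _ in range(rounds):
        counterexamples = falsify(model)
        if not counterexamples:
            break  # no misclassifications found; stop augmenting
        train.extend(counterexamples)
        model = retrain(train)
    return model, train

# Toy instantiation: a 1D threshold "classifier" that should learn the
# (hypothetical) ground-truth decision boundary at 0.7.
TRUE_BOUNDARY = 0.7

def falsify(threshold):
    # Probe a grid of the feature space; keep points the model gets wrong.
    grid = [i / 50 for i in range(50)]
    return [x for x in grid if (x >= threshold) != (x >= TRUE_BOUNDARY)]

def retrain(train):
    # Refit the threshold: midpoint of the gap between the labeled classes
    # (labels come from the ground-truth boundary, i.e. the data's labels).
    neg = [x for x in train if x < TRUE_BOUNDARY]
    pos = [x for x in train if x >= TRUE_BOUNDARY]
    return (max(neg) + min(pos)) / 2

model, train = augment_with_counterexamples([0.2, 0.9], falsify, retrain, model=0.5)
```

Starting from a poor threshold of 0.5, each round's counterexamples pull the retrained boundary towards 0.7, mirroring how retraining on falsifier-generated misclassifications improves the network in the case study.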
3.3 Model Robustness and Hyperparameter Tuning
In this final subsection, we demonstrate how VerifAI can be used to tune test parameters and hyperparameters of AI systems. For the following case study, we use OpenAI Gym, a framework for experimenting with reinforcement learning algorithms.
We used VerifAI to study the effects of hyperparameters when training a neural network controller for a mountain car. In this case, the controller must learn to exploit momentum in order to climb a steep hill. Here, rather than searching for counterexamples, we look for a set of hyperparameters under which the network correctly learns to control the car. Specifically, we explored the effects of using different training algorithms (from a discrete set of choices) and of varying the size of the training set. We used the VerifAI falsifier to search the hyperparameter space, guided by the reward function provided by OpenAI Gym (here the distance from the goal position), but negated so that falsification entails finding a controller which successfully climbs the hill. In this way, VerifAI built a table of safe hyperparameters. PCA then revealed which hyperparameters the training process is most sensitive or robust to.
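Schematically, this search amounts to evaluating combinations of a discrete algorithm choice and a training-set size, and tabulating those that meet the objective; the code below is an illustrative stand-in (the algorithm names and the success rule are hypothetical), not VerifAI's implementation:

```python
import itertools

def hyperparameter_table(train_and_score, algorithms, train_sizes, goal):
    """Evaluate hyperparameter combinations and tabulate the 'safe' ones,
    i.e. those whose trained controller meets the reward `goal`.

    `train_and_score` is a user-supplied stand-in: it trains a controller
    with the given hyperparameters and returns its reward (here, negated
    distance from the goal position, so higher is better).
    """
    safe = []
    for algo, n in itertools.product(algorithms, train_sizes):
        reward = train_and_score(algo, n)
        if reward >= goal:
            safe.append({'algorithm': algo, 'train_size': n, 'reward': reward})
    return safe

# Toy stand-in: pretend only 'ppo' with at least 500 samples reaches the goal.
def toy_train_and_score(algo, n):
    return 0.0 if algo == 'ppo' and n >= 500 else -1.0

table = hyperparameter_table(toy_train_and_score,
                             algorithms=['ppo', 'ddpg'],
                             train_sizes=[100, 500, 1000],
                             goal=0.0)
```

Running PCA (or simply inspecting which columns vary) over such a table of safe configurations is what reveals the sensitive and robust hyperparameters.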
4 Conclusion
We presented VerifAI, a toolkit for the formal design and analysis of AI/ML-based systems. Our implementation, along with the examples described in Sect. 3, is available in the tool distribution, including detailed instructions and expected output.
In future work, we plan to explore additional applications of VerifAI, and to expand its functionality with new algorithms. Towards the former, we have already interfaced VerifAI to the CARLA driving simulator, for more sophisticated experiments with autonomous cars, as well as to the X-Plane flight simulator, for testing an ML-based aircraft navigation system. More broadly, although our focus has been on CPS, we note that VerifAI's architecture is applicable to other types of systems. Finally, for extending VerifAI itself, we plan to move beyond directed simulation by incorporating symbolic methods, such as those used in finding adversarial examples.
Our work is complementary to the work on industrial-grade simulators for AI/ML-based CPS. In particular, VerifAI enhances such simulators by providing formal methods for modeling (via the Scenic language), analysis (via temporal logic falsification), and parameter synthesis (via property-directed hyper/model-parameter synthesis).
- 1.VerifAI: a toolkit for the design and analysis of artificial intelligence-based systems. https://github.com/BerkeleyLearnVerify/VerifAI
- 3.Annpureddy, Y., Liu, C., Fainekos, G.E., Sankaranarayanan, S.: S-TaLiRo: a tool for temporal logic falsification for hybrid systems. In: Tools and Algorithms for the Construction and Analysis of Systems, TACAS (2011)
- 4.Brockman, G., et al.: OpenAI Gym. arXiv:1606.01540 (2016)
- 5.Butler, M.: Uber’s tempe accident raises questions of self-driving safety. East Valley Tribune (2017). http://www.eastvalleytribune.com/local/tempe/uber-s-tempe-accident-raises-questions-of-self-driving-safety/article_30b99e74-189d-11e7-bc1d-07f943301a72.html
- 6.Dosovitskiy, A., Ros, G., Codevilla, F., Lopez, A., Koltun, V.: CARLA: an open urban driving simulator. In: Conference on Robot Learning, CoRL, pp. 1–16 (2017)
- 7.Dreossi, T., Donzé, A., Seshia, S.A.: Compositional falsification of cyber-physical systems with machine learning components. In: Barrett, C., Davies, M., Kahsai, T. (eds.) NFM 2017. LNCS, vol. 10227, pp. 357–372. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-57288-8_26
- 8.Dreossi, T., Donzé, A., Seshia, S.A.: Compositional falsification of cyber-physical systems with machine learning components. J. Autom. Reasoning (JAR) (2019)
- 9.Dreossi, T., Ghosh, S., Yue, X., Keutzer, K., Sangiovanni-Vincentelli, A., Seshia, S.A.: Counterexample-guided data augmentation. In: 27th International Joint Conference on Artificial Intelligence (IJCAI) (2018)
- 12.Fremont, D.J., Dreossi, T., Ghosh, S., Yue, X., Sangiovanni-Vincentelli, A.L., Seshia, S.A.: Scenic: a language for scenario specification and scene generation. In: 40th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI) (2019, to appear)
- 13.Fremont, D.J., Yue, X., Dreossi, T., Ghosh, S., Sangiovanni-Vincentelli, A.L., Seshia, S.A.: Scenic: language-based scene generation. CoRR (2018). arXiv:1809.09310
- 14.Gehr, T., Mirman, M., Drachsler-Cohen, D., Tsankov, P., Chaudhuri, S., Vechev, M.: AI2: safety and robustness certification of neural networks with abstract interpretation. In: 2018 IEEE Symposium on Security and Privacy (SP) (2018)
- 15.Ghosh, S., Berkenkamp, F., Ranade, G., Qadeer, S., Kapoor, A.: Verifying controllers against adversarial examples with Bayesian optimization. In: 2018 IEEE International Conference on Robotics and Automation (ICRA) (2018)
- 16.Godefroid, P., Kiezun, A., Levin, M.Y.: Grammar-based whitebox fuzzing. In: ACM SIGPLAN Notices. ACM (2008)
- 17.Grembek, O., Kurzhanskiy, A.A., Medury, A., Varaiya, P., Yu, M.: Making intersections safer with I2V communication (2019). arXiv:1803.00471, to appear in Transportation Research, Part C
- 19.Laminar Research: X-Plane 11 (2019). https://www.x-plane.com/
- 20.Palazzi, A.: Finding lane lines on the road (2018). https://github.com/ndrplz/self-driving-car/tree/master/project_1_lane_finding_basic
- 21.Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. CoRR (2017). arXiv:1707.06347
- 23.Seshia, S.A., Sadigh, D., Sastry, S.S.: Towards Verified Artificial Intelligence. CoRR (2016). arXiv:1606.08514
- 24.Vazquez-Chanlatte, M.: mvcisback/py-metric-temporal-logic: v0.1.1 (2019). https://doi.org/10.5281/zenodo.2548862
- 25.Webots: Commercial mobile robot simulation software. http://www.cyberbotics.com
- 27.Wu, B., Iandola, F., Jin, P.H., Keutzer, K.: SqueezeDet: Unified, small, low power fully convolutional neural networks for real-time object detection for autonomous driving. In: CVPR 2017 (2016). https://doi.org/10.1109/CVPRW.2017.60
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.