There is a significant body of work devoted to testing, verifying, and certifying the correctness of optimizing compilers. The focus of such work is to determine if source code and optimized code have the same functional semantics. In this paper, we introduce the correctness-security gap, which arises when a compiler optimization preserves the functionality of but violates a security guarantee made by source code. We show with concrete code examples that several standard optimizations, which have been formally proved correct, in- habit this correctness-security gap. We analyze this gap and conclude that it arises due to techniques that model the state of the program but not the state of the underlying machine. We propose a broad research programme whose goal is to identify, understand, and mitigate the impact of security errors introduced by compiler optimizations. Our proposal includes research in testing, program analysis, theorem proving, and the development of new, accurate machine models for reasoning about the impact of compiler optimizations on security.
Formal Language Theory for Security (LANGSEC) has proposed that formal language theory and grammars be used to define and secure protocols and parsers. The assumption is that by restricting languages to lower levels of the Chomsky hierarchy, it is easier to control and verify parser code. In this paper, we investigate an alternative approach to inferring grammars via pattern languages and elementary formal system frameworks. We summarize inferability results for subclasses of both frameworks and discuss how they map to the Chomsky hierarchy. Finally, we present initial results of pattern language learning on logged HTTP sessions and suggest future areas of research.
Data decoding, format, or language ambiguities have been long known for amusement purposes. Only recently, it came to attention that they also pose a security risk. In this paper, we present decoder manipulations based on deliberately caused ambiguities facilitating the error correction mechanisms used in several popular applications. This can be used to encode data in multiple formats or even the same format with different content. Implementation details of the decoder or environmental differences decide which data the decoder locks on. This leads to different users receiving different content based on a language decoding ambiguity. In general, ambiguity is not desired, however in special cases it can be particularly harmful. Format dissectors can make wrong decisions: e.g. a firewall scans based on one format but the user decodes different harmful content. We demonstrate this behavior with popular barcodes and argue that it can be used to deliver exploits based on the software installed, or use probabilistic effects to divert a small percentage of users to fraudulent sites.
We discuss the potential for significant reduction in the size and complexity of verification tasks for input-handling software when such software is constructed according to LangSec principles, i.e., is designed as a recognizer for a particular language of valid inputs and is compiled for a suitably limited computational model no stronger than needed for the recognition task. We introduce Crema, a programming language and restricted execution environment of sub-Turing power, and conduct a case study to estimate and compare the respective sizes of verification tasks for the Qmail SMTP parsing code fragments when executed natively vs in Crema---using LLVM and KLEE. We also study the application of the same principles to the verification of reference monitors.
The Ethos operating system provides a number of features which aid programmers as they craft robust computer programs. One such feature of Ethos is its distributed mandatory type system: Etypes. Etypes provides three key properties: (1) every Ethos object (e.g., a file or network connection) has a declared type, (2) Ethos forbids programs from writing ill-formed data to an object, and (3) Ethos forbids programs from reading ill-formed data from an object. In each case, programmers declare ahead of time the permitted data types, and the application of operating-system-level recognition simplifies their programs. This paper first investigates the generality of Etypes. Toward this end, we describe how to convert a grammar in Chomsky normal form into an Ethos type capable of expressing exactly the set of syntax trees which are valid vis-a-vis the grammar. Next, the paper addresses the convenience of Etypes. If Etypes does not make it easier to craft programs, then programmers will avoid the facilities it provides, for example by declaring string types which in fact serve to encode other types (here Etypes would check the string but not the encoded type). Finally, we present a sample distributed program for Ethos which makes use of the techniques we describe.
Input languages, which describe the set of valid inputs an application has to handle, play a central role in language-theoretic security, in recognition of the fact that overly complex, sloppily specified, or incorrectly implemented input languages are the root cause of many security vulnerabilities.
Often an input language not only involves a language of individual messages, but also some protocol with a notion of a session, i.e. a sequence of messages that makes up a dialogue between two parties. This paper takes a closer look at languages for such sessions, when it comes to specification, implementation, and testing - and as a source of insecurity.
We show that these 'session' languages are often poorly specified and that errors in implementing them can cause security problems. As a way to improve this situation, we discuss the possibility to automatically infer formal specifications of such languages, in the form of protocol state machines, from implementations by black box testing.
To exchange complex data structures in distributed systems, documents written in context-free languages are exchanged among communicating parties. Unparsing these documents correctly is as important as parsing them correctly because errors during unparsing result in injection vulnerabilities such as cross-site scripting (XSS) and SQL injection.
Injection attacks are not limited to the web world. Every program that uses input to produce documents in a context-free language may be vulnerable to this class of attack. Even for widely used languages such as HTML and JavaScript, there are few approaches that prevent injection attacks by context-sensitive encoding, and those approaches are tied to the language. Therefore, the aim of this paper is to derive context-sensitive encoder from context-free grammars to provide correct unparsing of maliciously crafted input data for all context-free languages.
The presented solution integrates encoder definition into context-free grammars and provides a generator for context-sensitive encoders and decoders that are used during (un)parsing. This unparsing process results in documents where the input data does neither influence the structure of the document nor change their intended semantics. By defining encoding during language definition, developers who use the language are provided with a clean interface for writing and reading documents written in that language, without the need to care about security-relevant encoding.
The recently created language Rust has been presented as a safer way to write low level code, even capable of replacing C. Is it able to produce safe and efficient parsers? We show that Rust’s features, like slicing, allow for powerful memory management, and that its type safety helps in writing correct parsers. We then study briefly how it can make streaming parsers, and how to provide better usability in a parsing library.
Formal deductive reasoning for programs with dynamic memory allocations has received tremendous amount of attention from the verification and programming communities in the past years. While many interesting properties are known to be undecidable, the advent of Separation Logic and decidable fragments of first order logic for reasoning on linked data structures allow verification of families of heap invariants useful in program verification with states. In contrast, relatively few formal research has been accomplished to capture arcane knowledge used in the development of exploits and other program artifacts used to demonstrate the absence of safety invariants in programs. Exploitation of security vulnerabilities can there be understood as a proof by construction of the negation of a safety invariant of the program under exploitation. In this article, we focus on a formalizing exploits as abstract machines called "exploit systems", where specific heap configurations must be reached in order to successfully take advantage of a security vulnerability. We present several constructions to reconcile the physical heap properties (adjacency of memory chunks) with logical ones (such as whether a chunk of allocated or available) and express exploits as a reachability problem of specific heap configurations commonly sought to perform successful attacks. Our constructs are parameterized by fitness algorithms, helper functions, and data types so that our model can apply to a variety of heap allocators used in major operating systems.
We present Unambiguous Encapsulation, a project that has successfully demonstrated that forward error correcting codes, if chosen with specific properties in mind, can be used to signify data encapsulation while still providing resilience to errors.
Three classes of codes are defined and shown to provide encapsulation while providing different capabilities to communication technologies. We use cloud computing resources to perform an exhaustive search for codes in all three classes, yielding a significant amount of data for future communication technology and data storage design.
Additionally, we discuss future expansion of this work including additional classes of codes, alternative methods for generating error correcting codes with encapsulation properties, and related technologies that could benefit from data encapsulation.
slides [report to be posted]
DNP3 (IEEE-1815) is the predominant SCADA protocol used in the electric power industry in the North America and Australia. In 2013-2014, more than thirty vulnerabilities were reported in various implementations of the as part of a single "bug campaign" using a custom smart fuzzer. As a whole, the results point to two gaps in the design and implementation of standards in critical infrastructure: scattered approach to parsing and validation, and a lack of reference implementations in the design phase. This talk will give a high level overview of the protocol structure, present the resulting trove of CVEs, and conclude with a complexity analysis of the optional security addendum to the specification known as "DNP3 Secure Authentication". If you are not sufficiently scared about the robustness of critical infrastructure by the end of the talk you are a fearless individual.
This talk will present advances in the Upstanding Hackers' Hammer parser construction kit, highlighting many community contributions over the last year. Hammer can be found at https://github.com/UpstandingHackers/hammer , a primer for learning Hammer at https://github.com/sergeybratus/HammerPrimer.