Semantic Patches and Static Analysis, Featuring Yoann Padioleau
This week on #PLTalk, @jeanqasaur and @hongyihu interviewed Yoann Padioleau about his work on Coccinelle and Semgrep. A recording is available here.
Coccinelle is an open-source utility used for transforming C code. It introduces the idea of a semantic patch, which matches and transforms code while abstracting away differences in spacing, variable names, and other non-functional variations such as the use of equivalent constructs or idioms. This makes it an effective tool for applying modifications across entire projects, such as the Linux kernel, where it's been used to fix security vulnerabilities and other bugs at scale.
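To make the idea a bit more concrete, here is a minimal sketch of a "match on structure, not on text" rewrite. It is not Coccinelle and does not use its patch language; it uses Python's standard `ast` module on Python source purely as an illustration, and the names `old_api` and `new_api` are hypothetical. It assumes Python 3.9 or later for `ast.unparse`.

```python
import ast


class RenameCall(ast.NodeTransformer):
    """Rewrite every call to old_api(...) into new_api(...).

    Both function names are hypothetical, chosen only for this sketch.
    """

    def visit_Call(self, node):
        # Visit children first so nested calls are rewritten too.
        self.generic_visit(node)
        if isinstance(node.func, ast.Name) and node.func.id == "old_api":
            node.func = ast.Name(id="new_api", ctx=ast.Load())
        return node


def apply_patch(source: str) -> str:
    """Parse the source, rewrite matching calls, and print it back out."""
    tree = RenameCall().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


# Two call sites with different spacing and different variable names.
before = "result = old_api( buf ,size )\ny=old_api(other_buf, n)\n"
after = apply_patch(before)
print(after)
```

Because the match happens on the syntax tree, both call sites are rewritten even though a single textual search pattern would have to account for the differing whitespace and argument names. One limitation of this sketch is that `ast.unparse` discards comments and the original formatting, so it is only a demonstration of the matching idea, not a usable patching tool.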
Semgrep is a natural evolution of Coccinelle, and extends the core idea to a wider variety of programming languages. It currently does this using the pfff tools, but the Semgrep team is in the process of switching to manipulating ASTs generated by Tree-sitter, in order to support any language that has a Tree-sitter grammar. Tree-sitter is the same parsing library used by GitHub's Semantic project for analyzing and comparing source code across languages.
While most static-analysis tooling provides built-in suites of checkers for different classes of bugs (security or otherwise), Semgrep is designed to let users define their own rules to run against their codebase. During the stream, Yoann demonstrated using the tool to detect instances of hardcoded API keys being passed to an AWS client, but it's also intended to be used as a more intelligent replacement for grep, by enabling semantic code search over a codebase.
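As a rough illustration of what a check like the hardcoded-key one involves, here is a small sketch written against Python's standard `ast` module. It is not Semgrep's rule syntax or implementation, and it is not the rule shown on the stream; the function name `find_hardcoded_keys` and the sample snippet are made up for this example. It flags any call that passes a string literal as `aws_access_key_id` or `aws_secret_access_key`, which are the credential keyword arguments accepted by `boto3.client`.

```python
import ast

# Keyword arguments that should never receive a literal string.
SECRET_KWARGS = {"aws_access_key_id", "aws_secret_access_key"}


def find_hardcoded_keys(source: str):
    """Return (line, keyword) pairs where a credential is a string literal."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        for kw in node.keywords:
            is_literal = isinstance(kw.value, ast.Constant) and isinstance(
                kw.value.value, str
            )
            if kw.arg in SECRET_KWARGS and is_literal:
                findings.append((node.lineno, kw.arg))
    return sorted(findings)


# Hypothetical code under analysis; it is only parsed, never executed.
sample = '''
import boto3
s3 = boto3.client("s3", aws_access_key_id="AKIAEXAMPLE",
                  aws_secret_access_key = "not-a-real-secret")
ok = boto3.client("s3", aws_secret_access_key=load_secret())
'''

for line, keyword in find_hardcoded_keys(sample):
    print(f"line {line}: hardcoded value passed as {keyword}")
```

The first call is reported twice, once per credential, while the second call is not reported because its value comes from a function call rather than a literal. A plain grep for `aws_secret_access_key=` would miss the first call's second argument (because of the spaces around `=`) and would wrongly flag the second call, which is the kind of gap semantic search is meant to close.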
Semgrep can be downloaded for offline use, and can also be tried through an online editor. There's also a public Slack.
This week's stream was very focused on these two tools in particular, so I don't have a ton of resources to share, but a few other helpful tools were mentioned:
- Sgrep is another structured search tool, worked on by Yoann during his time at Facebook
- Sobelow provides lightweight static analysis for Elixir, with a focus on security vulnerabilities
- Brakeman is an equivalent tool for Ruby
- Datalog is a declarative query language whose syntax is a subset of Prolog
Lastly, we didn't spend much time on it, but the idea of soundiness in static analysis came up briefly near the end. The general idea is that nearly all static analysis tools are deliberately unsound in some way, so we, as tool authors, should be more upfront about which soundness tradeoffs we're making.
Thanks for reading, and I hope you join us next week @ 3:00PM PDT! We'll be joined by Edwin Brady, who will talk about some of his work on the Idris programming language.
- Quinn