Announcing scip-clang: a new SCIP indexer for C and C++
scip-clang is our new indexer for C and C++ code, written from the ground up to natively emit SCIP and to support the wide range of language features present in C++.
For teams using C and C++, indexing your code with scip-clang can provide a significantly improved code navigation experience in Sourcegraph, similar to editors like VS Code (based on clangd) and CLion. scip-clang's precise code navigation is aware of build configurations, macros, and type information. Some examples of when this is particularly useful:
- Navigating class hierarchies, such as those using virtual functions with overriding, or the Curiously Recurring Template Pattern. In such cases, identically named methods appear in multiple class definitions, leading to false positives with search-based code navigation.
- Different C++ container types often use identical method names like `find` and `contains`, which may have multiple overloads. Precise code navigation enables accurately navigating to the correct method overload in the correct type without false positives.
- Understanding whether a definition comes from inside a macro expansion. In such cases, since the definition is not explicitly available in the source, it is not accessible to search-based code navigation. However, precise code navigation can accurately point to the macro expansion.
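To make these three cases concrete, here is a small illustrative C++ snippet. All names in it (`Shape`, `Described`, `has_key`, `DEFINE_GETTER`, and so on) are invented for this example and do not come from any of the indexed projects.

```cpp
#include <map>
#include <set>
#include <string>

// Case 1a: virtual functions. A text search for "name" matches both
// definitions; only type information tells them apart at a call site.
struct Shape {
    virtual ~Shape() = default;
    virtual std::string name() const { return "shape"; }
};
struct Circle : Shape {
    std::string name() const override { return "circle"; }
};

// Case 1b: the Curiously Recurring Template Pattern. describe() calls
// label() on whichever Derived type instantiates the template.
template <typename Derived>
struct Described {
    std::string describe() const {
        return "a " + static_cast<const Derived &>(*this).label();
    }
};
struct Square : Described<Square> {
    std::string label() const { return "square"; }
};

// Case 2: overloads that call identically named container methods.
// Each find() below resolves to a different container's member function.
inline bool has_key(const std::set<int> &s, int key) {
    return s.find(key) != s.end();
}
inline bool has_key(const std::map<int, std::string> &m, int key) {
    return m.find(key) != m.end();
}

// Case 3: definitions produced by a macro expansion. The text "get_x"
// never appears at the definition site, so a text search cannot find it.
#define DEFINE_GETTER(field) \
    int get_##field() const { return field; }

struct Point {
    int x = 1;
    int y = 2;
    DEFINE_GETTER(x)
    DEFINE_GETTER(y)
};
```

In each case, a search by name alone either returns several candidates (`name`, `find`, `has_key`) or none at all (`get_x`), which is the gap that compiler-backed navigation is meant to close.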
You can explore precise code navigation powered by scip-clang in the following repositories: Chromium (C++), LLVM (C++), Postgres (C).[1]
Here's a quick demo showcasing some features in action:
scip-clang supports a superset of the functionality of lsif-clang. The main additions are:
- scip-clang is more fault-tolerant: indexing failures, such as crashes, while processing a single translation unit do not affect indexing of other translation units.
- scip-clang natively supports code navigation for `#include` directives and macros.
- scip-clang is based on Clang 16 instead of Clang 11. It consumes Clang as a library rather than as a fork, making it easy to update the version of Clang used in the future.
Additional quality-of-life improvements include:
- scip-clang infers paths to standard headers for GCC and Clang from the compilation database without requiring extra command-line flags.
- scip-clang binaries are available for both Linux and macOS.
- Despite supporting more features, thanks to SCIP, index sizes (both compressed and uncompressed) are about 10%-20% of the size of the corresponding LSIF indexes. This translates to faster uploads, a lower likelihood of upload errors, and a reduced risk of out-of-memory errors when the Sourcegraph backend processes an index.
- scip-clang uses incremental parsing for compilation databases, reducing the risk of out-of-memory errors on ingestion.
scip-clang is now available in beta. Please try it out, and let us know if you run into issues, or if you have feedback for improvement. As with our other indexers, open source maintainers are welcome to use scip-clang to index their projects and upload indexes to Sourcegraph.com to benefit from precise code navigation for C and C++.
A word about performance
scip-clang requires a traversal of the abstract syntax tree after type-checking the code, so that type information is available.
Two natural baselines for comparison are purely type-checking all translation units in parallel, and a fast from-scratch build (no debug information and no optimizations).
The current performance numbers for two different configurations are shown below:
- Project 1 with 480K SLOC (26M SLOC after preprocessing), tested on a 22-core machine
- Project 2 with 2.75M SLOC (460M SLOC after preprocessing), tested on an 88-core machine
| Operation | Normalized time (config 1) | Normalized time (config 2) |
|---|---|---|
| Type-checking only | 1.00 | 1.00 |
| From-scratch fast build | 1.70 | 2.30 |
| scip-clang indexing | 1.29 | 1.52 |
| lsif-clang indexing | 0.84 | 1.02 |
Compared to a baseline of type-checking all translation units in parallel, scip-clang takes about 30%-50% more time.
In the future, we will be able to reduce this overhead in two ways:
- Optimizing the release build by using well-known techniques such as profile-guided optimization, and changing the default allocator.
- Parallelizing the index merging step, which combines information about forward declarations and project-external symbols across translation units.
With these optimizations, indexing should take less than 10% extra time compared to type-checking.
Perhaps more surprising is the difference between lsif-clang and scip-clang, where scip-clang takes about 50% more time. The reason for this is that lsif-clang avoids type-checking many declarations, such as compiler-synthesized ones, since they are not indexed. We're interested in surfacing information about synthesized declarations in the future, so it may not make sense to perform this same optimization in scip-clang, only to remove it at a later stage.
We hope that the improved robustness and higher quality code navigation with scip-clang make up for the loss in performance for current lsif-clang adopters.
The road ahead
In the coming months, we'll be adding support for cross-repo navigation, as well as proper handling for complex language features like template specializations. This will bring scip-clang up to parity with our other indexers in terms of language support.
With respect to performance, the bigger elephant in the room is incrementality. Many large codebases utilize caching and incrementality via build systems like Bazel to keep up with high commit velocity.
In an ideal world, the indexer would be able to leverage the incrementality from the build system to only reindex changed code. For large codebases, this would bring indexing time from 1-2 hours on a high core count machine to within a few minutes for most changes, making it possible to index every commit. Supporting incremental indexing is also on our roadmap.
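To give a rough idea of what "only reindex changed code" could mean, here is a hypothetical sketch: if the build system can provide a digest of all inputs for each translation unit, the indexer only needs to revisit units whose digest changed since the last run. This is not scip-clang's design (incremental indexing is not implemented yet), and `DigestMap` and `units_to_reindex` are invented names.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical: maps a translation unit path to a digest covering its
// source file, included headers, and compiler flags.
using DigestMap = std::map<std::string, std::string>;

// Returns the translation units that are new or whose digest differs from
// the previous run, in sorted path order. Units that were removed since
// the previous run are ignored in this sketch.
inline std::vector<std::string> units_to_reindex(const DigestMap &previous,
                                                 const DigestMap &current) {
    std::vector<std::string> stale;
    for (const auto &entry : current) {
        auto it = previous.find(entry.first);
        if (it == previous.end() || it->second != entry.second) {
            stale.push_back(entry.first);
        }
    }
    return stale;
}
```

The hard parts are left out here: computing trustworthy digests from the build graph, and merging fresh per-unit results with cached ones into a single index.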
Footnotes
1. The code was indexed using default build settings on Linux, so it may lack precise code graph data for platform-specific code like Android, macOS, Windows, etc.