Mid-2016 Tor bug retrospective, with lessons for future coding
Programs have bugs because developers make mistakes. Generally, when we discover a serious bug, we try to fix it as soon as we can and move on. But many groups have found it helpful to pause periodically and look for trends in the bugs they have discovered or fixed over the course of their projects. By finding trends, we can try to identify ways to develop our software better.
I recently did an informal review of our major bugs from the last few years. (I'm calling it "informal" rather than "formal" mainly because I made up the process as I went along.)
My goals were to see if we're right in our understanding of what causes bugs in Tor, and what approaches to avoid bugs and limit their impact would be most effective.
By reviewing all the bugs and looking for patterns, I'm hoping that we can test some of our operating hypotheses about what allows severe bugs to happen, and what practices would prevent them. If this information is reasonably accurate, it should help us use our time and resources more effectively to write our code more safely over the coming years.
I took an inventory of "severe bugs" from three sources:
- Tickets in the Tor bugtracker with high priority, closed in 0.2.5.x or later.
- Tickets in the Tor bugtracker with high severity, closed in 0.2.5.x or later.
- An assessment of bugs listed as changelogs for 0.2.5.x and later.
For each of these cases, I assessed "is this severe" and "is this really a bug" more or less ad hoc, erring on the side of inclusion. I wound up with 70 tickets.
At this point, I did a hand-examination of each ticket, asking these questions:
- What was the bug?
- What was the impact?
- Why did the bug happen? What might have prevented it or lowered its impact? What might have detected it earlier?
I then used a set of keywords to group tickets by similar causes or potential prevention methods.
Finally, I grouped tickets by keywords, looking for the keywords that had the largest number of tickets associated.
Limitations of this methodology.
Consider this an exploratory exercise, not a scientific finding. We should look into formalizing the methodology more and giving it more process the next time we do it, for these reasons:
- It's entirely dependent on my judgment of "is this severe" and "is this a bug."
- The categories were made up as I went along.
- Many of the hypotheses it tests are post-hoc hypotheses.
- I haven't done very much checking on the input data yet; I wouldn't consider this scientific till somebody else has looked for bugs I missed and analyses I got wrong. There's no reason to think that I got these particularly right.
- The only objective measure I'm using is "how many bugs did I tag with a given keyword?," with the assumption that any keyword covering a lot of bugs is particularly important. But that's based on a semi-subjective assessment (tags), applied to a semi-subjective population ("bugs" I judged "severe"), and ignores bug impact.
III. Results and recommendations
1. Testing is helpful (big surprise).
We've believed for a while that we can reduce the number of bugs that make it into the wild by using more tests on our codebase. This seems broadly true, but incomplete.
First, it seems that only about half of our severe bugs appeared to be the kind of thing that better tests would have caught. The other half involved logic errors and design oversights that would probably have made it through testing.
Second, it seems that in some cases, our existing tests were adequate to the job, if only we had automated them better, or had run them more consistently, more rigorously, or under more conditions.
In all cases, of course, automation isn't quite enough. We must also have the automated tests run regularly (daily?), and make sure that the results are available to developers in a convenient way.
Recommendation 1.1 Run our automated unit tests under more code-hardening methodologies.
This includes --enable-expensive-hardening under GCC and clang, valgrind with leak checking turned on, and anything else we can find.
Recommendation 1.2: Also run test-network and test-stem in an automated environment.
These checks can detect a lot of problems, but right now we only try the stem tests in automated builds, and we don't try them with hardening.
Cases where a suitably extended (or completely vanilla) stem or chutney test case might have helped include: #8746, #9296, #10465, #10849, #11200, #13698,
#15245, #15801, #16247, #16248, #17668, #17702, #17772, #18116, #18318, and
Recommendation 1.3: Automate use of static analysis tools with Tor.
There were some cases where we found a bug using a static analysis tool later than we might have, because the static analysis tool had to be hand-launched. We can get faster bug resolution by automatically running all the static analysis tools we use. (We've already done this.)
Recommendation 1.4: Continue requiring unit tests for new code, and writing unit tests for old code.
Untested code had bugs at a higher rate than tested code.
Recommendation 1.5: Get more users to try out our nightly builds.
Having more users of our nightly builds would help us notice more bugs on the git master branch before those bugs appear in stable or alpha releases.
Having users for our nightly builds would have prevented #11200 entirely.
Recommendation 1.6: Whenever possible, write integration tests for new features.
Features that lack integration tests via Chutney or some other mechanism tend to have bugs that last longer than other bugs before anybody notices them.
(See stem/chutney list above.)
Recommendation 1.7: We need more tests about shutting down busy clients and relays.
Our code tends to have a fair number of corner cases concerning shutting down at the wrong time, and crashing or asserting rather than exiting cleanly.
2. Which C difficulties are a problem in practice?
C is notoriously tricky, and we've put a fair amount of effort into avoiding its nastier failure modes. The C-related errors that did cause problems for us were not the ones I would have expected: Buffer overflows generally got caught very quickly. There are other C warts we've been less successful at avoiding.
Recommendation 2.1: Pay particular attention to integer overflow.
We had a few cases where signed integer overflow (or unsigned overflow) could cause bad bugs in our code, some resulting in heap corruption.
Perhaps we should prefer using unsigned integers everywhere we don't actually need signed integers? But although unsigned overflow isn't undefined behavior, it's still usually a bug when it's not intentional. So maybe preferring unsigned values wouldn't be so great.
Perhaps smartlists and similar data structures should use size_t internally instead of int for size and capacity. (Their use of int in their APIs isn't easy to change because of the rest of the codebase.)
Our work on using -ftrapv throughout our code by default (in #17983) should help turn subtle signed overflow errors into crashes.
Recommendation 2.2: Avoid void-pointer punning in API design; add more type-specialized APIs.
We have two styles of container: Those that are specialized for a given type, and those that store void*. In nearly all cases, the void* ones have involved programmers making mistakes about what type they actually contained, in some case causing hard-to-debug issues.
Bug #18454 is one example of a void-punning problem.
Recommendation 2.3: Continue to forbid new hand-written parsing code in C.
This caused fewer issues than you'd think (only a few ones for binary encoding and parsing, and only one for text parsing), but they were particularly troublesome.
Outside of hand-written parsing code, memory violations are less frequent than you'd think.
Recommendation 2.4: In the long-term, using a higher-level language than C would be wise.
This is a longer term project, however, and would have to happen after we get more module separation.
Bugs that would be more difficult (or impossible) to cause in a safe language include: #9296, #9602, #11743, #12694, #13104, #13477, #15202, #15823,
#15901, #17041, #17401, #17404, #18162, and #18454.
3. State machines, object lifetimes, and uncommon paths.
Many of our more subtle errors were caused by objects being in states that we didn't think they could actually be in at the same time, usually on error cases, shutdown paths, or other parts of the codebase not directly tied to the dominant path.
Recommendation 3.1: For every object type, we should have a very clear understanding of its lifetime, who creates it, who destroys it, and how long it is guaranteed to last.
We should document this for every type, and try to make sure that the documentation is simple and easy to verify.
We could also consider more reference-counting or handles (qv) to avoid lifetime problems.
Recommendation 3.2: State machines, particularly in error handling, need to be documented and/or simplified.
We have numerous implicit state machines scattered throughout our codebase, but few of them are explicitly expressed or documented as state machines. We should have a standard way to express and create new state machines to ensure their correctness, and to better analyze them.
Recommendation 3.3: Design with error handling in mind.
This might be as simple as documenting state machines and noticing cases where transitions aren't considered, or might be more complicated.
4. Assertion problems (too many, too few).
Recommendation 4.1: Assert less; BUG more.
A great many of our crash bugs were caused by assertions that did not actually need to be assertions. Some part of the code was violating an invariant, but rather than exiting the program, we could have simply had the function that noticed the problem exit with an error, fix up the invariant, or recover in some other way.
We've recently added a family of nonfatal assertion (#18613) functions; we should use them wherever reasonable.
5. Keeping backward compatibility with broken things for too long.
Recommendation 5.1: all backward compatibility code should have a timeout date.
On several occasions we added backward compatibility code to keep an old version of Tor working, but left it enabled for longer than we needed to. This code has tended not to get the same regular attention it deserves, and has also tended to hold surprising deviations from the specification. We should audit the code that's there today and see what we can remove, and we should never add new code of this kind without adding a ticket and a comment planning to remove it.
Recommendation 5.2: Examine all XXX, TODO, and FIXME code.
In several cases, the original author of a piece of buggy code scheduled it for removal with an "XXX02..." comment, or noted some potential problem with it. If we had been more vigilant about searching for and categorizing these cases, we could have removed or fixed this obsolete code before it had caused severe bugs.
There are 407 instances of XXX, TODO, ???, FFF, or FIXME in src/common and src/or right now; we should get on auditing those and recategorizing them as "fix now, open a ticket", "fix later, open a ticket", or something else.
6. Several errors are related to unrelated spec mismatches.
I don't have a firm set of conclusions here, other than to maybe make sure that our tests specifically correspond to what the spec says?
7. Too many options.
Recommendation 7.1: Cull the options; remove ones we don't need/use/advise.
Several severe bugs could only occur when the user specifically set one or more options which, on reflection, nobody should really set. In most cases, these options were added for fairly good reasons (such as protocol migration or bug workarounds), but they no longer serve a good purpose.
We should go through all of our settings and see what we can disable or deprecate. This may also allow us to get rid of more code.
Recommendation 7.2:Possibly, deprecate using a single Tor for many roles at a time.
Many bugs were related to running a HS+relay or HS+client or client+relay configuration. Maybe we should recommend separate processes for these deployment scenarios.
#9819 is one bug of this kind.
8. Release cycle issues.
Recommendation 8.1: Tighten end-of-cycle freezes issues.
Large features merged at the end of release, predictably, caused big bugs down the line.
Recommendation 9.1: We should favor a single convention for return values, and not accept code that doesn't follow it.
We have had a few bugs caused by differing return-value conventions on similar functions. Our most common convention has been to have a negative value indicate failure and zero to indicate success. When 0 indicates failure and positive values indicate success, we usually can expect to have a bug in the calling code someplace.
Bug #16360 was caused by a misunderstanding of this kind, and could have been much worse than it really was.
Recommendation 10.1: Callgraph complexity hits us a lot -- particular in code where a calling function assumes that a called function will not make certain changes in other structures.
We should, whenever possible, simplify our callgraph to remove cycles, and to limit maximum call depth.
Recommendation 10.2: Prefer identify-loop-then-modify-loop to all-operations-at-once-loop.
Several hard-to-diagnose bugs were called by code where we identified targets for some operation and simultaneously performed that operation. In general, we should probably have our default approach involve identifying the items to operate on first, and operating on them afterwards. We might want to operate on them immediately afterwards, or schedule the operation for higher in the mainloop.
Recommendation 10.3: Perhaps we should have a container-freezer.
We have code that supports removing a member from a smartlist or hashtable while iterating over it.... but adding or removing members through other means at the same time won't work. What's more, debugging the results is annoyingly difficult. Perhaps we should have our code catch such attempts and give an assertion failure.
Recommendation 10.4: Duplicated code is trouble; different functions that do the same thing differently are trouble.
This is well-known, but nonetheless, we have a few cases where we grew two functions to do similar things, patched one to solve a problem or add a feature, but forgot to patch the other.
11. Target areas.
Recommendation 11.1: Our path-selection/node-selection code is very complex, and needs more testing and rewrites.
More than in most other areas, we found bugs in the code that selects paths and nodes. This code is hard to test in part because it's randomized, and in part because it looks at several global structures including the nodelist, the microdescriptor set, the networkstatus, and state related to guard nodes. We should look at ways to simplify our logic here as much as possible.
IV. Next steps
Some time over the next month or so, we should re-scan this document for actionable items to improve our future code practices, create bugtracker tickets for those items, and try to sort them by their benefit-to-effort ratio. We should also make a plan to try this again in a year or two.