
Personal efforts to improve the quality of Ruby interpreter

This article is a Japanese-to-English translation of the following post, with some additional notes:

Original post (in Japanese): techlife.cookpad.com

I’m a programmer at Cookpad Inc. and one of the committers of the Ruby interpreter (Homepage of Koichi Sasada). I hope you enjoy the recently released Ruby 3.2.

The Ruby interpreter is a complex program, so it naturally has bugs, and its developers take various countermeasures against them. For example, we write tests and run them in CI environments, which is by now standard practice (this is possible thanks to tools such as RubyCI, chkbuild, and ruby/spec: The Ruby Spec Suite, and to the daily maintenance of the machines that run them).

In addition, I personally maintain a group of machines that run the tests over and over. The purpose is to improve the quality of the Ruby interpreter by running the tests as many times as possible, in order to find bugs that occur only occasionally. In this article, I’d like to introduce this somewhat unusual test environment.

To prepare this test environment, I have received support from various people. In this article I would like to express my gratitude for their support.

The Need to “Reveal” Bugs

Environment to run tests on a regular basis

In a typical CI/CD context, you run tests on each commit (or PR), because if a test fails, you know the problem is in that commit. GitHub Actions and similar services are often used for such tests.

For this purpose, the Ruby interpreter core team prepares and uses the following test environments.

  1. Per-PR and per-push testing environment with GitHub Actions
  2. Periodic test environment with chkbuild (the results are available on rubyci).

Both 1 and 2 are basically mechanisms to check for problems with the last modification.

To get accurate test results, chkbuild builds a Ruby interpreter from scratch and runs the tests every time, under various operating systems and CPU architectures. However, since this takes time, it is executed only about once every two hours on each machine.

Currently, most of the compute resources are built on AWS with support from the Ruby Association and other organizations. GitHub Actions is provided by GitHub.

Speaking of which, Shopify has been kind enough to test their (presumably huge) application with a development version of Ruby. That’s very helpful.

Digression: Try head/debug versions of setup-ruby

If you have projects whose tests run on GitHub Actions with setup-ruby, please consider adding the head/debug versions.

  • head: nightly build
  • debug: nightly build with assertions

If you find any strange behavior, please feel free to report it at https://bugs.ruby-lang.org/issues/. This kind of contribution is very helpful for us.

Test environment to discover tests that sometimes fail

When you have software as big as the Ruby interpreter, you may find that a test sometimes fails even though nothing has changed. It is also possible that there was a recent change, but the test fails for a reason unrelated to that change. Such a test is sometimes called a flaky test. There are several possible reasons for this.

  1. Bad test.
  2. Tests affected by external factors such as “time” and “system status”.
  3. A bug is already in the code, but it only shows up when you’re unlucky (or lucky) enough to hit it.

The most common case is “1. Bad test”. For example, if a test depends on the order in which tests are run, it can cause problems at any moment. Timing-sensitive tests can be slightly off and sometimes fail (or start failing because of a change in machine spec).

Tests affected by external factors (#2) can also be considered bad tests in some cases. One example is described in “ruby/zlib tests started to fail even though I didn’t do anything” on @znz’s blog (written in Japanese): the timestamp at one specific moment produced specific data that made the tests fail (the tests were fixed and the problem was solved).

The above cases fall into the category of “bad tests”, so they are not directly related to the quality of the interpreter itself. However, if we leave them unfixed, checking the test results becomes hard, so we need to fix them as soon as possible. It’s the Broken Windows Theory.

The real problem is #3, bugs that are already in the code. Even a bug that appears only once in 10,000 runs by bad luck may be hit once a day in software with 10,000 users a day. Or rather, you will be the one who hits it. In the worst case, it may become a source of vulnerability.

This kind of bug is likely to appear in the following situations:

  • Automatic memory management (GC)
  • Algorithms that use caches
  • Parallel and concurrent executions
  • Networks and other external systems

All of these are areas (and areas I work on a lot) where it’s easy to introduce non-deterministic behavior, i.e., behavior that doesn’t produce the same result even if you run it twice. Many other things can also cause a “Hey, that’s not the same result as before?” situation, such as memory address randomization by the system.

There are many ways to tackle this, but we use the method of “just try many times”. It’s simple: if a bug appears once in every 10,000 runs, then running the tests 10,000 times can be expected to reproduce it about once.

In other words: “Bugs that don’t appear very often are already in the code → the more test trials we run, the higher the probability of stepping on such a bug (we can reveal it).”
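As a rough sanity check on the “just try many times” idea, the following sketch computes the chance of hitting a bug at least once in a given number of runs. It assumes that each run fails independently with the same probability, which is a simplification made here for illustration; `detection_probability` and `runs_needed` are hypothetical helper names.

```ruby
# Probability of seeing a failure at least once in `runs` runs, when each
# run fails with probability `p`. Independence between runs is an assumption.
def detection_probability(p, runs)
  1.0 - (1.0 - p)**runs
end

# Smallest number of runs needed to see the failure at least once
# with the given confidence (between 0 and 1).
def runs_needed(p, confidence)
  (Math.log(1.0 - confidence) / Math.log(1.0 - p)).ceil
end

p_fail = 1.0 / 10_000

puts format("10,000 runs: %.1f%%", detection_probability(p_fail, 10_000) * 100)
puts format("50,000 runs: %.1f%%", detection_probability(p_fail, 50_000) * 100)
puts "runs needed for 99%: #{runs_needed(p_fail, 0.99)}"
```

Under this assumption, 10,000 runs give roughly a 63% chance of hitting a 1-in-10,000 bug, so reliably revealing such a bug takes several times more runs than the naive estimate suggests.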

The chkbuild I mentioned earlier runs only about 12 times a day (multiplied by the number of environments), and GitHub Actions runs once per event, so neither can be said to “run a lot”. So I made my own test environment and have been running it for about 5 years.

It originally started when I got fed up with “occasional” bugs while debugging a newly designed GC, and began running tests all the time on a single machine.

When I first started, I used the command while make up all test-all; do date; done to run the tests indefinitely (stopping on failure). However, with this approach I had to look at the terminal to check the results, and I could not tell when the loop had stopped unexpectedly. It is also hard to scale, so I had to build my own test environment.
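To illustrate the first step beyond the shell loop, here is a minimal sketch of a runner that repeats a command until it fails and appends every result to a log file, so that nobody has to watch the terminal. It is not the author's actual tool; `run_until_failure`, the log format, and the file name `repeat.log` are assumptions made for this example.

```ruby
require "open3"
require "time"

# Run `command` (an array such as %w[make test-all]) repeatedly until it
# fails or `max_runs` is reached. One line per run is appended to
# `log_path`; the combined stdout/stderr is also saved for a failing run.
# Returns the number of the failing run, or nil if every run passed.
def run_until_failure(command, log_path:, max_runs: Float::INFINITY)
  run = 0
  while run < max_runs
    run += 1
    output, status = Open3.capture2e(*command)
    File.open(log_path, "a") do |log|
      log.puts "[#{Time.now.iso8601}] run #{run}: #{status.success? ? 'ok' : 'FAILED'}"
      log.write(output) unless status.success?
    end
    return run unless status.success?
  end
  nil
end

# Hypothetical usage, mirroring the shell loop:
#   failed_at = run_until_failure(%w[make test-all], log_path: "repeat.log")
#   puts "failed on run #{failed_at}" if failed_at
```

Because the result is returned as a value and written to a file, a wrapper like this can be extended to send a notification on failure, which a plain shell loop does not do.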

Techniques for running lots of tests

In order to run a lot of tests, we devised the following:

  • Use multiple machines (scale-out)
  • Use machines with better performance (scale-up)
  • Run multiple tests simultaneously on a single machine to use up hardware resources
  • Shorten the time of a single build and test run

Here is an introduction to each of them.

Preparing the machine to be used

If we had enough money, the most reliable way to scale out (and an easy one for those who know how) would be to prepare lots of machines in the cloud, but since this is a private activity, there is a limit to how much money I can spend. Also, workloads like this that use up computing resources are not well suited to cheap cloud services.

Since I have some space at home, I’m currently hosting physical machines there as a temporary arrangement (this activity will likely end when my children grow older and need their own rooms. Please remember that Japanese homes don’t have much space for multiple computers).

I’ve been staring at the price lists of AWS and other services, but using real machines is the cheapest... (perhaps it would be cheaper if I found a discount plan). I’m grateful that I can buy a nice little machine with 8 cores and 16 threads for just under 100,000 JPY (about 750 USD). Currently, I’m running 4 machines.


All the new machines are small. I used to have a mid-tower machine in my lineup, but it was getting in the way... I bought the HX90 last Black Friday because it was slightly discounted.

Test run times correlated nicely with CPU frequency. The faster the better.

For a single test suite run, 2GB of memory seems to be enough even when building and running the tests in parallel (which surprised me).

I have an electricity meter attached, and the reading goes up and down around 400 W in total. According to the Standard Plan rates of the Tokyo Electric Power Company, that comes to just under 10,000 JPY per month. I’m partially compensated by the earnings from GitHub Sponsors.

(By the way, this electricity bill includes three Mac minis used for rubyci/chkbuild, which I introduced earlier; the Mac minis were purchased with the support of Nihon Ruby-no-Kai).

It’s fine now because it is winter, but during the hot months (since I didn’t turn on the A/C) the fans made a lot of noise, and I was worried about a fire. So far, the current machines have been running continuously for more than a year. After 5 years, however, 2 of the mid-tower machines broke. The smaller machines seem likely to have a shorter life span.

The machine cost (leaving aside the old ones) is 220,000 JPY; depreciated over 3 years, that is about 70,000 JPY/year. Electricity costs roughly 120,000 JPY/year. In other words, the total is about 200,000 JPY/year (about 1,500 USD/year). It is cheap because there are no costs for the space or for maintenance personnel. And I don’t need an SLA, because I don’t have to worry if the system goes down. Well, arranging machines at home is a good hobby.
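The yearly cost estimate can be reproduced with a few lines of arithmetic. The sketch below uses only figures quoted in the text (220,000 JPY of machines, 3-year depreciation, about 10,000 JPY of electricity per month, and the exchange rate implied by “100,000 JPY (about 750 USD)”); the helper name `annual_cost` is hypothetical.

```ruby
# Yearly running cost in JPY: straight-line depreciation plus electricity.
def annual_cost(machine_cost:, depreciation_years:, electricity_per_month:)
  machines    = machine_cost / depreciation_years.to_f
  electricity = electricity_per_month * 12
  {
    machines: machines.round,
    electricity: electricity,
    total: (machines + electricity).round
  }
end

cost = annual_cost(machine_cost: 220_000,
                   depreciation_years: 3,
                   electricity_per_month: 10_000)

# Exchange rate implied by "100,000 JPY (about 750 USD)" in the text.
jpy_per_usd = 100_000 / 750.0

puts "machines:    #{cost[:machines]} JPY/year"
puts "electricity: #{cost[:electricity]} JPY/year"
puts "total:       #{cost[:total]} JPY/year (about #{(cost[:total] / jpy_per_usd).round} USD/year)"
```

The exact total comes out slightly below the rounded 200,000 JPY/year quoted above, because the text rounds each component.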

As a side note, one of the reasons I keep physical machines on hand is benchmarking. With a machine in the cloud, you may get inconsistent results depending on the instance, so I prefer to use physical machines as much as possible. For example, the machine behind https://rubybench.github.io/ is hosted at my home (this machine was also provided by Ruby-no-kai Japan, thank you very much). When I need to do serious benchmarking of new features, I stop running tests and use these machines for benchmarking (because sometimes I need more than one machine).

Parallel execution of build and test processes

One way to increase the number of test trials is to launch multiple processes running the test suite on a single machine.

When a test suite is executed, there are times when it consumes resources and times when it is idle, so the idea is to improve overall throughput by running another test process while one is not busy. However, if the number of concurrent test processes is too large, overall throughput may deteriorate because resources are wasted on contention.

Test processes that were simply run side by side sometimes interfered with each other (e.g. through the filesystem or network ports), so after some trial and error I found that running them in Docker containers with some tweaked settings works well. I now have 22 Docker containers running the test suite simultaneously on a single machine (build-ruby/run_sp2.rb at master ・ ko1/build-ruby). The machine has 32GB of memory, which is enough (but I gave up on the RAM disk, which is described later).

Figure: Running various tests concurrently in Docker containers (memory consumption)
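The general idea of running several test processes at once without letting them share files can be sketched in plain Ruby. This example deliberately uses ordinary child processes with separate working directories rather than Docker containers, so it is not the setup described above; `run_isolated` and the `WORKER_ID` environment variable are hypothetical names.

```ruby
require "tmpdir"

# Launch `workers` copies of `command` at the same time, each in its own
# scratch directory and with its own TMPDIR, so that they do not trip over
# each other's files. Returns an array of [worker_index, success?] pairs.
def run_isolated(command, workers:)
  Dir.mktmpdir("parallel-tests") do |root|
    pids = (0...workers).map do |i|
      dir = File.join(root, "worker-#{i}")
      Dir.mkdir(dir)
      env = { "TMPDIR" => dir, "WORKER_ID" => i.to_s } # WORKER_ID is hypothetical
      [i, Process.spawn(env, *command, chdir: dir)]
    end
    # Wait for every child and collect its exit status.
    pids.map do |i, pid|
      _, status = Process.wait2(pid)
      [i, status.success?]
    end
  end
end

# Hypothetical usage:
#   run_isolated(%w[make test-all], workers: 4)
```

Separate directories only address filesystem conflicts. Conflicts over network ports and similar global resources are not handled here, which is one reason containers are a better fit for the real workload.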

Reduce build and test time

In order to reduce the time it takes to build the latest version of Ruby and run through the entire test suite, we’ve devised the following:

  • Reuse compile results etc.
  • Using a RAM disk
  • Concurrent build and test processing

The test runs on rubyci.org do not reuse compile results or other artifacts, in order to guarantee accurate test results. In my environment, however, the goal is to run as many tests as possible, so I reuse compiled results aggressively. Reuse sometimes causes problems of its own, so if a run fails twice, all the compiled results are erased and the next build starts from scratch.
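The “reuse until it fails twice” rule can be expressed as a small piece of state. The sketch below is one possible reading of that rule (two consecutive failures trigger a clean rebuild); the class name `BuildCachePolicy` and its interface are assumptions, not code from build-ruby.

```ruby
# Tracks consecutive failures and decides when cached build results should
# be discarded. The default threshold of 2 follows the text; treating the
# failures as "consecutive" is an assumption of this sketch.
class BuildCachePolicy
  def initialize(clean_after: 2)
    @clean_after = clean_after
    @consecutive_failures = 0
  end

  # Record the outcome of one build-and-test cycle.
  # Returns :reuse to keep the compiled results, or :clean to rebuild
  # from scratch on the next cycle.
  def record(success)
    @consecutive_failures = success ? 0 : @consecutive_failures + 1
    return :reuse if @consecutive_failures < @clean_after

    @consecutive_failures = 0
    :clean
  end
end

policy = BuildCachePolicy.new
[true, false, false, true].each do |ok|
  puts "run #{ok ? 'passed' : 'failed'} -> #{policy.record(ok)}"
end
```

A caller would delete the build directory whenever `record` returns `:clean`, so a stale object file can cause at most two bad runs before it is wiped.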

In environments with memory to spare, I use a RAM disk (tmpfs) for all build results to speed up the build a bit. However, it’s not clear how much this helps, because the OS already caches file data in memory on its own. It may only feel faster.

Builds are parallelized with make -jN. Ten years ago this caused quite a few bugs, but now we can build in parallel with almost no problems.

Tests are also run in parallel, which means splitting up the test suite and running the parts concurrently. There are roughly three groups of Ruby tests to run in this environment, one of which has long supported parallel execution (make test-all). I rewrote one more group (make btest) to make it parallelizable, with the goal of increasing the number of runs.

With these efforts, on a fast machine dedicated to this work, one cycle of “build the latest version -> run tests” takes less than 2 minutes. In other words, because we always fetch the latest version of Ruby from the repository for testing, if you make a problematic commit that causes a test to fail, you’ll get a failure notification in as little as two minutes (the result is posted to a Slack channel).

Results of repeating tests

Technique to reveal bugs

More tests

Even if a bug is introduced, it cannot be detected if there is no code that steps on that bug. This is why extensive testing is needed. Ruby already has a large set of tests.

Also, the Ruby interpreter source code contains a lot of assertions (statements of what should happen at this point in the program). You can think of this as a kind of test. In the parts I code, I try to increase the number of such assertions so that we can detect strange states.

Many of these assertions are only enabled in debug builds. For this reason, some of the environments run the tests with debug builds.

For testing, ideally, it would be nice to bring a prominent app or library and run its tests on the latest development version of Ruby, but I haven’t gotten around to that.

More test patterns

The test runs are not all identical; we try to find bugs by running the tests in different patterns.

  • Run the test with a Ruby interpreter built with various parameters.
  • Run tests with different versions of the build environment (compiler).
  • Run the tests in a random order. For example, the method cache status changes depending on the order in which the tests are executed, so there may be bugs to be found there.
  • Repeatedly run the test. Similarly, repeating the same test may change the status of the method cache.
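Random ordering is most useful when a failing order can be replayed. The sketch below shows one common way to do that, by shuffling with an explicit seed and printing it; the test names are placeholders and `shuffled_order` is a hypothetical helper, not part of Ruby's test tooling.

```ruby
# Shuffle test names with an explicit seed so that a failing order can be
# reproduced later by passing the same seed again.
def shuffled_order(tests, seed)
  tests.shuffle(random: Random.new(seed))
end

tests = %w[test_array test_hash test_string test_gc test_thread] # placeholder names
seed  = Random.new_seed
order = shuffled_order(tests, seed)

# Log the seed together with the order; rerunning with this seed gives
# exactly the same order.
puts "seed=#{seed}"
puts order.join(" ")
```

Recording the seed in the execution log means that an order-dependent failure, such as one that depends on the state of the method cache, can be retried in the same order instead of waiting for it to happen again by chance.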

I wrote software to build Ruby and run tests according to a configuration (ko1/build-ruby: Build Ruby from source code). For example, here is a list of settings: https://github.com/ko1/build-ruby/blob/master/docker/ruby/targets.yaml

Dealing with Errors

We have devised several ways to deal with unforeseen problems.

  • Recording of all execution logs
  • Allow configurable timeouts to prevent infinite stoppages
    • When a timeout occurs, gdb dumps the backtrace of the related process
  • If a process exits abnormally and dumps core, the core file can be downloaded
  • If the failures continue, delete all the data, increase the execution interval, and so on.
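The timeout handling in the list above can be sketched as follows: run a command, and if it is still alive after a deadline, give a hook the chance to inspect the process before killing it. This is an illustration under assumptions, not the build-ruby implementation; `run_with_timeout` and the `on_timeout` hook are hypothetical names.

```ruby
# Run `command`; if it is still running after `timeout` seconds, call
# `on_timeout` with the child's pid (for example, to collect a backtrace),
# then kill the child. Returns :ok, :failed, or :timeout.
def run_with_timeout(command, timeout:, on_timeout: ->(pid) {})
  pid = Process.spawn(*command)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    # WNOHANG makes wait2 return nil while the child is still running.
    _, status = Process.wait2(pid, Process::WNOHANG)
    return status.success? ? :ok : :failed if status
    break if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline

    sleep 0.05
  end
  on_timeout.call(pid)
  Process.kill("KILL", pid)
  Process.wait(pid)
  :timeout
end

# Hypothetical hook that asks gdb for backtraces of all threads:
#   dump = ->(pid) { system("gdb", "-p", pid.to_s, "-batch", "-ex", "thread apply all bt") }
#   run_with_timeout(%w[make test-all], timeout: 3600, on_timeout: dump)
```

Calling the hook before the kill is the important design point, because the backtrace is only useful while the hung process still exists.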

However, when a test fails, we often still cannot determine the cause. We would like to improve this further.

Developing a system to check the results

I made a site, ci.rvm.jp, to aggregate the test results. The DB is SQLite3 because the number of viewers is limited (so it is slow). It’s a really slow server, so I don’t even link to it.

I make the output to stderr visible on the summary page of each run, so that it’s easy to understand what went wrong when you look at a failure (though other CI sites seem to do this routinely).

When a test fails, there is a Slack notification (for the Ruby committers to see) and an email notification (just to me). In the rare case where a commit breaks the tests, the flood of notifications is terrible.

Aside: Other possible innovations

There are many possible methods for testing non-deterministic behavior, especially in academic research.

For example, one approach is to make all external events deterministic (events such as input/output and thread scheduling). In other words, various techniques are used to ensure that the same program (with the same external input) always produces the same result. Once a problem is found, being able to reproduce it every time should make things much easier. I’ve heard a lot about this at the research level, but I wonder how practical it is.

We can also think of methods such as using formal methods to automatically generate exhaustive tests, or data that makes good coverage easier to achieve. It would be cool to be able to do this kind of thing.

Outcome

At first, fixing the problems was hard work, because with thousands of runs every day there is quite a flurry of failures. Most of the effort went into fixing tests, because many of the failures were caused by flaky tests.

I was also able to fix some bugs caused by timing. Here’s the patch I have in my notes.

Conclusion

In this article, I introduced my personal activity to find rare bugs by increasing the number of test runs, in order to improve the quality of the Ruby interpreter.

Bugs are always present in programs, and it is hard to find bugs in large, complex programs. In this article, I introduced some of the trial-and-error process. I wish I could take a more scientific approach. If you know a good method, please let me know.

As I mentioned in the article, this system is made possible by a lot of support. GitHub Sponsors in particular has been important for continuing this activity. Thank you again.

As for the machines, a few years ago, a certain company gave me three big rack-mount machines with 3-digit GB memory that it no longer needed, and I installed them at another company, N. I had been operating my test environment with these machines included (Mr. S of company N took care of the machine operation for a long time). The other day, these machines were removed because they had become old, so I wrote this article as a memorial to them and with gratitude. Thank you very much.

Well, I wish you a happy New Year, and I hope you enjoy the newly released Ruby 3.2!



