Developing Scientific Software


Part 1: Principles from Test-Driven Development


We live in an age of rapidly expanding possibilities in the world of computing. AI continues to make strides in solving old and new problems alike, often in completely unexpected ways. Huge datasets have now become ubiquitous in almost any area, and not just something that scientists in white lab coats at expensive facilities can obtain.

And yet many of the challenges that have been encountered in the last decades when developing software to process data remain — or are even exacerbated when handling these new, vast swathes of data.

The area of scientific computing, traditionally focused on developing fast and accurate methods to solve scientific problems, has recently become relevant far beyond its original, narrow scope. In this article, I will lay out some of the challenges that arise when developing high-quality scientific software, as well as some tactics for overcoming them. Our end goal is to put together a step-by-step guide for creating scientific software with an accurate and efficient development process. In a follow-up article, I will follow this step-by-step guide to solve a dummy problem in Python. Check it out after reading this article!

Test-driven development (TDD) redefined software engineering, enabling developers to write more durable, bug-free code. If you have ever used TDD, you are probably familiar with its power for writing quality software. If you have not, hopefully by the end of this article you will understand its importance. Regardless of your experience with TDD, anyone familiar with scientific computing knows that automated testing of such software can be tricky to implement reliably.

The TDD development cycle, which I recommend everyone read at least once, lays out some sensible instructions on how to develop software in a way that every piece of code written is checked for correctness by a test. Running the tests frequently then ensures that bugs are caught soon after they are introduced, before they can propagate.

But some of the tenets of TDD may seem completely at odds with the scientific software development process. In TDD, for example, tests are written before the code; the code is written to accommodate the tests.

But imagine you are implementing a completely new data processing method. How would you write a test before you even have the code? TDD relies on expected behavior: if there is no way to quantify behavior prior to implementing the new method, it is logically impossible to write the test first! I will argue that this case is rare, but even when it does happen, TDD can still help us. How?

Rilee and Clune observe (emphasis mine):

Effective testing of numerical software requires a comprehensive suite of oracles […] as well as robust estimates for the unavoidable numerical errors […] At first glance these concerns often seem exceedingly challenging or even insurmountable for real-world scientific applications. However, we argue that this common perception is incorrect and driven by (1) a conflation between model validation and software verification and (2) the general tendency in the scientific community to develop relatively coarse-grained, large procedures that compound numerous algorithmic steps.

Oracles are known input/output pairs that may or may not involve complex computations. Oracles are used for traditional TDD, but they are often very simple. They play a larger role in scientific software, and not just as a part of unit testing!

When we talk about using oracles to check for some expected behavior, we are talking about software verification. For the software, it doesn’t really matter what is being verified, only that input X leads to output Y. Validation, on the other hand, is the process of ensuring that the code’s output Y accurately matches what the scientist expects. This process necessarily leverages the scientist’s domain knowledge in the form of experiments, simulations, observations, literature surveys, mathematical models, etc.

This important distinction is not exclusive to the domain of scientific computing. Any practitioner of TDD either implicitly or explicitly develops tests which encompass both verification and validation.

Suppose you are writing code to seat a list of people in a given list of labeled chairs. A verification test may check that a list of N persons and M chairs yields a list of N 2-tuples, or that if either input list is empty, the output is also an empty list. Meanwhile, a validation test may check that if an input list contains duplicates, the function throws an error, or that for any output, no two persons are assigned to the same chair. These tests require domain knowledge of our problem.
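
To make the distinction concrete, here is a minimal sketch in Python (pytest style). The function `assign_seats` and its exact behavior are hypothetical, invented purely for illustration; the tests mirror the checks described above.

```python
import pytest

def assign_seats(persons, chairs):
    """Hypothetical seating routine: pairs each person with a distinct chair."""
    if len(set(persons)) != len(persons):
        raise ValueError("duplicate persons in input")
    if not persons or not chairs:
        return []
    return list(zip(persons, chairs))

# Verification: input X leads to an output with the expected structure.
def test_outputs_n_pairs():
    result = assign_seats(["Ana", "Bo"], ["c1", "c2", "c3"])
    assert len(result) == 2 and all(len(pair) == 2 for pair in result)

def test_empty_input_gives_empty_output():
    assert assign_seats([], ["c1"]) == []

# Validation: domain knowledge about what a correct seating means.
def test_duplicate_persons_raise():
    with pytest.raises(ValueError):
        assign_seats(["Ana", "Ana"], ["c1", "c2"])

def test_no_chair_assigned_twice():
    chairs_used = [chair for _, chair in assign_seats(["Ana", "Bo"], ["c1", "c2"])]
    assert len(chairs_used) == len(set(chairs_used))
```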

While TDD operates on both verification and validation, it is important not to conflate the two and to use each at the appropriate stage of software development. If you are engaged in writing scientific software — i.e., any non-trivial piece of numerical code, especially a performance-critical one — read on to understand how to appropriately leverage TDD for these purposes.

One important difference between standard software and scientific software is that in standard software, equality is something generally uncontroversial. When testing if two people are assigned the same chair, checking if labels (modeled as, say, integers) are the same for persons (or chairs) is straightforward. In scientific software, the ubiquitous use of floating point numbers complicates matters considerably. Equality cannot be generally checked via ==, and commonly requires a choice of numerical precision. In fact, the definition of precision can vary depending on the application (e.g., see relative vs. absolute tolerance). Here are some recommended practices for numerical accuracy testing:

  • Start with a tolerance as tight as the least precise floating-point type used in the computations allows. Your tests may fail. If they do, loosen the tolerance one decimal at a time until they pass. If you cannot reach a reasonable precision (e.g., you need a tolerance of 10^-2 for a test using float64 operations to pass), you might have a bug.
  • Numerical error generally grows with the number of operations. When possible, validate the precision from domain-specific knowledge (e.g., Taylor methods have explicit remainder terms that can be leveraged in tests, but these situations are rare).
  • Favor absolute tolerances when possible, and avoid relative tolerances (“accuracy”) when comparing values near zero; see the sketch after this list.
  • It is not uncommon for precision unit tests to fail when running tests thousands of times on different machines. If this happens consistently, either the precision is too stringent or a bug has been introduced. The latter has been much more common in my experience.
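
As a concrete illustration of the last two points, the sketch below uses `math.isclose` from the standard library, which accepts both relative and absolute tolerances; the specific values are arbitrary and chosen only for illustration.

```python
import math

a = 1e-12  # result of a computation that should be "zero"
b = 0.0    # expected value

# A relative tolerance alone fails near zero, because rel_tol * max(|a|, |b|) is tiny.
print(math.isclose(a, b, rel_tol=1e-9))                 # False
# An explicit absolute tolerance expresses what "close to zero" means here.
print(math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-10))  # True

# Start strict and loosen one decade at a time only when a test genuinely fails.
for abs_tol in (1e-15, 1e-14, 1e-13, 1e-12):
    if math.isclose(a, b, abs_tol=abs_tol):
        print(f"passes at abs_tol={abs_tol}")            # 1e-12 in this toy case
        break
```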

Testing new methods

When developing scientific software, one cannot rely on numerical accuracy alone. Often new methods can improve accuracy or change the solution altogether, providing a “better” solution from the scientist’s point of view. In the former case, the scientist may get away with using a previous oracle with decreased tolerance to ensure correctness. In the latter case, the scientist may need to create a new oracle entirely. It is paramount to create a curated suite of oracle examples, which may or may not be checked for numerical precision, but which the scientist can inspect.

  • Curate a set of representative examples that you can automatically or manually inspect.
  • Examples should be representative. This may involve running computationally intensive tasks, so it is important to decouple them from the unit testing suite; one way to do so is sketched below.
  • Run these examples as regularly as possible.
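
One way to achieve this decoupling, assuming a pytest-based suite, is to mark the expensive, representative examples so they are excluded from routine runs; the marker name `oracle` is a convention invented here for illustration and should be registered in `pytest.ini` to avoid warnings.

```python
import pytest

@pytest.mark.oracle  # expensive, representative example; not part of the fast unit suite
def test_full_method_against_curated_example():
    # Load a curated dataset, run the full method, and compare against the stored,
    # scientist-inspected result. Placeholder body for this sketch.
    pass
```

Everyday development then runs `pytest -m "not oracle"` to stay fast, while a scheduled job (nightly or weekly) runs `pytest -m oracle` on the full curated suite.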

Random testing

Scientific software may have to deal with nondeterministic behavior. There are many philosophies on how to handle this. My personal approach is to control randomness as much as possible via seed values. This has become the standard in machine learning experiments, which I believe is also “the right way” to do it for generic scientific computing.
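
Here is a minimal sketch of seed control with NumPy's random generator: all randomness flows through a single, explicitly seeded generator, so results are reproducible bit for bit.

```python
import numpy as np

def run_experiment(n_samples, seed=42):
    """Toy experiment whose randomness is fully controlled by one seed."""
    rng = np.random.default_rng(seed)
    data = rng.normal(size=n_samples)
    return data.mean()

# The same seed reproduces the same result exactly.
assert run_experiment(1_000, seed=123) == run_experiment(1_000, seed=123)
```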

I also believe that monkey testing (aka, fuzzing) — the practice of testing random values at each run — has an extremely valuable role in developing scientific software. Monkey testing, when used judiciously, can find obscure bugs and enhance your unit testing library. Done wrong, it can create a completely unpredictable testing suite. Good monkey tests have the following properties:

  • Tests must be reproducible. Log all seeds required to rerun the test.
  • Random inputs must range over all possible inputs, and only over these possible inputs.
  • Treat edge cases separately if you can predict them.
  • Tests should be able to catch errors and other bad behavior, in addition to testing accuracy. A test is useless if it cannot flag bad behavior.
  • Bad behavior should be studied and isolated into separate tests that cover the entire class of situations generating the error (e.g., if an input of -1 fails and, upon investigation, all negative numbers fail, create a test for all negative numbers). A minimal monkey-test sketch following these properties appears after this list.
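
In the sketch below, the seed is drawn fresh on every run but logged, so any failure can be replayed, and the random inputs range only over valid values. The function `harmonic_mean` is a toy stand-in for real scientific code.

```python
import logging
import numpy as np

def harmonic_mean(x):
    # Toy numerical routine standing in for the code under test.
    x = np.asarray(x)
    return len(x) / np.sum(1.0 / x)

def test_harmonic_mean_monkey():
    seed = np.random.SeedSequence().entropy          # fresh entropy on every run
    logging.warning("monkey test seed: %d", seed)    # logged so the run can be replayed
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.1, 1e3, size=1000)             # only valid (positive, finite) inputs
    result = harmonic_mean(x)
    assert np.isfinite(result)                       # flag bad behavior, not just accuracy
    assert x.min() <= result <= x.max()              # the harmonic mean is bounded by the data
```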

Apart from verification and validation, developers working on high-performance scientific software must be mindful about performance regressions. Profiling is therefore an integral part of the development process, ensuring that you get the best performance out of your code.

But profiling can be tricky. Here are some of the guiding principles I use to profile scientific software.

  • Profile units. Similarly to testing units, you should be profiling performance-critical units of code. NVIDIA’s CUDA best practice model is Assess, Parallelize, Optimize, Deploy (APOD). Profiling units puts you in a great position to Assess if you want to port your code to GPU.
  • Profile what matters first. Err on the side of caution, but do not profile pieces of code which won’t be run repeatedly, or whose optimization will not result in large gains.
  • Profile diversely. Profile CPU time, memory, and any other useful metrics for the application.
  • Ensure reproducible environments for profiling. Library versions, CPU workloads, etc.
  • Try to profile inside your unit tests. You need not fail tests that regress, but you should at least flag them; see the sketch after this list.
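
Here is a minimal sketch of flagging (rather than failing on) a runtime regression inside a test, using only the standard library and NumPy; the 0.5 s budget is an arbitrary placeholder, and dedicated tools such as pytest-benchmark can take this much further.

```python
import time
import warnings
import numpy as np

def expensive_unit(n=500):
    # Stand-in for a performance-critical unit of code.
    a = np.random.default_rng(0).normal(size=(n, n))
    return np.linalg.inv(a)

def test_expensive_unit_runtime():
    start = time.perf_counter()
    result = expensive_unit()
    elapsed = time.perf_counter() - start
    assert np.isfinite(result).all()  # correctness is still checked
    if elapsed > 0.5:                 # flag a possible regression, but do not fail the test
        warnings.warn(f"expensive_unit took {elapsed:.2f}s (budget: 0.5s)")
```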

In this section I will briefly describe the main stages of the development methodology I apply to scientific software. These steps have been informed by writing scientific software in academia, industry, and open-source projects, following the best practices described above. And while I can’t say I have always applied them, I can honestly say that I have always regretted not doing so!

Implementation cycle

  1. Gather requirements. What is the context in which you will use your method? Think about what functionality it must provide, how flexible it must be, inputs and outputs, standalone or part of some larger codebase. Consider what it must do now and what you may want it to do in the future. It is easy to prematurely optimize in this stage, so remember: “keep it simple, stupid” and “you aren’t gonna need it”.
  2. Sketch the design. Create a template, either code or diagrams establishing a design which satisfies the above requirements.
  3. Implement initial tests. You’re in step 3 and itching to start coding. Take a deep breath! You will start coding, but not your method/feature. At this step you write super simple tests. Like, really small ones. Start with simple verification tests and move on to basic validation tests. For the validation tests, my suggestion is to leverage analytical oracles as much as possible in the beginning. If that is not possible, skip them.
  4. Implement your alpha version. Now that you have your (verification) tests, you can start actually implementing the code to satisfy them without fear of being (very) wrong. This first implementation does not have to be the fastest, but it needs to be right (validation)! My advice: start with a simple implementation that leverages standard libraries. Relying on standard libraries considerably reduces the risk of incorrect implementations because you benefit from their test suites.
  5. Build an oracle library. I cannot stress enough how important this is! At this point you want to establish trustworthy oracles that you can always rely on for future implementations and/or changes to your methods. This part is usually missing from traditional TDD, but it is paramount in scientific software. It ensures that your results are not just numerically correct, and it protects new and possibly different future implementations from becoming scientifically inaccurate. It is normal to go back and forth between implementation and exploratory scripts to build your validation oracles, but avoid writing tests at the same time. A minimal file-based oracle store is sketched after this list.
  6. Revisit tests. Armed with your oracles which you have diligently stored, write some more validation unit tests. Again, avoid going back and forth between implementation and tests.
  7. Implement profiling. Set up profiling within and outside of your unit tests. You will come back to this once you have your first iteration going.
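
As a sketch of such an oracle store, assuming NumPy-compatible data, one simple option is a directory of .npz files holding trusted input/output pairs; the directory location and the cumulative-sum stand-in for the method under test are placeholders.

```python
from pathlib import Path
import numpy as np

ORACLE_DIR = Path("tests/oracles")  # placeholder location, versioned alongside the code

def my_method(x):
    # Toy stand-in for the actual method under test.
    return np.cumsum(x)

def save_oracle(name, inputs, output):
    """Store a trusted input/output pair that the scientist has inspected."""
    ORACLE_DIR.mkdir(parents=True, exist_ok=True)
    np.savez(ORACLE_DIR / f"{name}.npz", inputs=inputs, output=output)

def test_against_oracles():
    """Check the current implementation against every stored oracle."""
    for path in sorted(ORACLE_DIR.glob("*.npz")):
        case = np.load(path)
        result = my_method(case["inputs"])
        np.testing.assert_allclose(result, case["output"], rtol=1e-10, atol=1e-12)
```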

Optimization cycle

  1. Optimize. You now want to make this function as fast as necessary for your application. Armed with your tests and your profilers, you can unleash your scientific computing knowledge to make it fast.
  2. Reimplement. Here you consider new implementations, for example using hardware acceleration (e.g., GPUs), distributed computing, etc. I suggest NVIDIA’s APOD (Assess, Parallelize, Optimize, Deploy) as a good optimization methodology. You can go back to the implementation cycle, but now you always have a bunch of oracles and tests to lean on; a pattern for reusing them across implementations is sketched after this list. If you expect the functionality to change, see the new method cycle below.
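
When re-implementing, one convenient pattern (assuming pytest) is to parametrize the oracle tests over implementations, so the trusted baseline and the optimized version are held to the same standard; both functions below are toy placeholders with an analytically known oracle.

```python
import numpy as np
import pytest

def baseline_sum_of_squares(x):
    # Simple, trusted reference implementation.
    return sum(float(v) ** 2 for v in x)

def fast_sum_of_squares(x):
    # Optimized re-implementation (vectorized stand-in for GPU or distributed code).
    return float(np.dot(x, x))

@pytest.mark.parametrize("impl", [baseline_sum_of_squares, fast_sum_of_squares])
def test_sum_of_squares_against_oracle(impl):
    x = np.arange(1.0, 11.0)  # analytical oracle: 1^2 + 2^2 + ... + 10^2 = 385
    assert np.isclose(impl(x), 385.0, rtol=0.0, atol=1e-12)
```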

New method cycle

  1. Implement new method. Follow the implementation cycle up to and including step 6, as if you did not have any oracles.
  2. Validate against previously curated oracles. After the oracle-building step, you can leverage the oracle examples from your previous implementation to ensure that the new one is somehow “better” than the old one. This step is key in developing algorithms and methods that are robust across a variety of data, and it is used frequently in industry to ensure that new algorithms perform well in a variety of relevant cases.

Many of these principles may only really make sense afterwards, when looking at specific examples. Scientific computing spans a myriad of different types of software for many purposes, so one approach rarely fits all.

I encourage you to follow the next part of this series to see how to implement many of these steps in practice.


