Testing data-based algorithms


This article describes an approach to safely evolving complex algorithms that operate directly on a database, when we have no unit tests but do have access to real data. Of course, a properly written algorithm should not access the database during processing, or it should be built so that mocking the database is possible. In reality, however, we sometimes encounter such situations and have to deal with them. The solution described here has been implemented and has worked successfully in a large computer system, allowing the safe evolution and optimization of algorithms.

The idea

The approach included in this article is based on the following assumptions:

  • we maintain a properly working algorithm that requires business changes or optimization,
  • we can get access to the real data which the algorithm operates on,
  • the algorithm has built-in checks asserting proper processing.

We treat the algorithm as a black box that interacts with the database to perform its computation. The idea of test execution is well known: the algorithm is run on various sets of test data, and we check whether its execution completes successfully and whether the generated data fulfill the general criteria assumed at the very beginning. The interesting aspects involve the extraction, storage, and usage of real data as test fixtures.

In order to perform tests, we need:

  • an empty database,
  • various sets of test data - fixtures,
  • a test case that runs the algorithm and performs checks.
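The three ingredients above can be put together in a minimal harness. The following sketch is illustrative only: it uses Python with an in-memory SQLite database for brevity (the article's own code is PHP), and the schema, algorithm, and invariant are hypothetical stand-ins.

```python
import sqlite3

# A tiny fixture: one table slice extracted from "production".
FIXTURE = """
CREATE TABLE products (id INTEGER PRIMARY KEY, stock INTEGER);
INSERT INTO products VALUES (1, 10), (2, 0);
"""

def algorithm(conn):
    # Hypothetical black-box algorithm under test: restock every product.
    conn.execute("UPDATE products SET stock = stock + 1")

def run_test():
    conn = sqlite3.connect(":memory:")   # an empty database
    conn.executescript(FIXTURE)          # load one fixture set
    algorithm(conn)                      # run the algorithm as a black box
    # Validate a general criterion that must hold for any input:
    negative = conn.execute(
        "SELECT COUNT(*) FROM products WHERE stock < 0").fetchone()[0]
    conn.close()
    return negative == 0

result = run_test()
```

The test never inspects the algorithm's internals; it only loads a fixture, runs the black box, and checks the universal success criteria.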

Extracting test data from a production database

The real data collected in databases is a great basis for test cases. For each test case we should extract data from tables, narrowing it to the minimum required for the algorithm to run.

Be aware of the responsibility to efface sensitive data fields. Sensitive data includes business names, which could link the test data to real-world objects. Numerical values usually should not be regarded as sensitive: they mean little without the context given by names, and in most cases they are essential to the algorithm's processing.
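One way to efface such fields while keeping rows joinable is to replace sensitive values with deterministic pseudonyms. A small Python sketch (the column names in `SENSITIVE` are hypothetical, chosen per table):

```python
import hashlib

# Hypothetical set of sensitive columns; adjust per exported table.
SENSITIVE = {"name", "company"}

def efface(row):
    """Replace sensitive text fields with stable pseudonyms; keep numbers.

    The pseudonym is deterministic, so the same original value maps to the
    same pseudonym everywhere, and rows stay joinable across tables.
    """
    masked = {}
    for col, val in row.items():
        if col in SENSITIVE:
            masked[col] = "anon_" + hashlib.sha1(str(val).encode()).hexdigest()[:8]
        else:
            masked[col] = val   # numeric values pass through unchanged
    return masked

row = {"name": "Acme Ltd", "quantity": 42, "price": 9.99}
masked = efface(row)
```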

A single algorithm test may proceed differently depending on the variety of data provided in the fixtures. Therefore, the more unique test cases, the better. Fixtures that cover the same area should be eliminated, because they only add unnecessary cost to maintaining the tests.

If an algorithm is multistage but implemented as a monolith, then in addition to fixtures, the intermediate data can be dumped to a text file. This allows more detailed inspection and easier identification of problems. The testing framework should automatically collate a dumped text file with its counterpart generated during the test.
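The collation step can be as simple as a unified diff of the stored dump against the freshly generated one. A Python sketch with throwaway demo files (the file names and contents are illustrative):

```python
import difflib
import tempfile
from pathlib import Path

def collate_dump(reference, generated):
    """Compare a stored intermediate dump with the one produced in a test.

    Returns a unified diff as a list of lines; an empty list means the
    dumps match and the test stage passed.
    """
    ref = Path(reference).read_text().splitlines(keepends=True)
    gen = Path(generated).read_text().splitlines(keepends=True)
    return list(difflib.unified_diff(ref, gen, "reference", "generated"))

# Demo with throwaway files standing in for real intermediate dumps.
tmp = Path(tempfile.mkdtemp())
(tmp / "ref.txt").write_text("stage1: 10\nstage2: 20\n")
(tmp / "gen.txt").write_text("stage1: 10\nstage2: 21\n")
diff = collate_dump(tmp / "ref.txt", tmp / "gen.txt")
```

A non-empty diff pinpoints the first stage whose output diverged, which is exactly the "easier identification of problems" the dumps are for.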

In order to automate and standardize data exports, a generic exporting class should be implemented, from which a dedicated export class can be derived for each case. A dedicated exporter is useful for:

  • easily extracting a variety of data sets from different production databases,
  • embedding in the production configuration to quickly export fixtures when errors occur, so they can be dealt with in a development environment,
  • dumping sample data for new test cases from a developer's configuration when implementing new features in the algorithm.
     class Warehouse_Autotest_Exporter extends Autotest_ExportBase {

             protected function defineDumps() {
                     $product = $this->exportParams['product'];
                     $date = $this->exportParams['date'];

                     $this->addDumpDef('products', 'product = ?', array($product));
                     $this->addDumpDef('warehouse_doc_lines', 'date >= ? AND product = ?',
                                       array($date, $product));
             }
     }

Data loading

Data should be exported in a format that reduces loading time during tests to a minimum. For this purpose, it is advisable to use the SQL command COPY (PostgreSQL) instead of INSERT. It can also be a good idea to drop foreign keys and indexes on the tables while importing. This speeds up data loading significantly and relieves us of maintaining a proper import order.
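As a sketch of that loading sequence (PostgreSQL syntax; the constraint, index, and file names are hypothetical, loosely matching the exporter example above):

```sql
BEGIN;

-- drop constraints and indexes that would slow the bulk load down
ALTER TABLE warehouse_doc_lines DROP CONSTRAINT warehouse_doc_lines_product_fkey;
DROP INDEX warehouse_doc_lines_date_idx;

-- COPY loads rows in bulk, far faster than row-by-row INSERTs;
-- with the foreign key gone, the load order no longer matters
COPY products FROM '/fixtures/products.copy';
COPY warehouse_doc_lines FROM '/fixtures/warehouse_doc_lines.copy';

-- recreate what was dropped once the data is in place
ALTER TABLE warehouse_doc_lines
    ADD CONSTRAINT warehouse_doc_lines_product_fkey
    FOREIGN KEY (product) REFERENCES products (product);
CREATE INDEX warehouse_doc_lines_date_idx ON warehouse_doc_lines (date);

COMMIT;
```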


Test case

A TestCase groups methods that test one algorithm or share a common context. Each method is a single, independent test performed on various fixture sets.

The entire test is performed in one database transaction, which is rolled back at the end. Thanks to that, no special cleanup is needed after the test. It is convenient and fast.
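The rollback pattern can be demonstrated in a few lines; this Python/SQLite sketch is illustrative only (the real system would wrap the whole test method in one PostgreSQL transaction):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, stock INTEGER)")
conn.execute("INSERT INTO products VALUES (1, 5)")
conn.commit()                                   # state before the test

# the whole test runs inside one transaction ...
conn.execute("UPDATE products SET stock = 0")   # the algorithm mutates data
# ... which is rolled back at the end, so no cleanup code is needed
conn.rollback()

stock = conn.execute("SELECT stock FROM products WHERE id = 1").fetchone()[0]
# stock is back to the pre-test value
```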

We should distinguish two kinds of test failure:

  • a program error,
  • expected values of the test differ from the returned ones.

If the algorithm fails because of an inner exception or inner assertion, it is classified as an error. If there is no error, we check the generated data, validating it against universal conditions that must hold regardless of the given input. If these checks do not pass, it is recognized as a test failure. In the last step, the text files are compared with the intermediate data dumps. If the files do not match, that is also a test failure.
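The classification logic above can be sketched as follows; `algorithm` and `validate` are hypothetical callables standing in for the real black box and its universal checks:

```python
def classify_run(algorithm, validate):
    """Classify one test run as 'error', 'failure', or 'pass'."""
    try:
        result = algorithm()
    except Exception:
        return "error"      # inner exception or assertion: a program error
    if not validate(result):
        return "failure"    # generated data violates the universal conditions
    return "pass"

outcomes = [
    classify_run(lambda: 1 / 0, lambda r: True),   # raises, so: error
    classify_run(lambda: -1, lambda r: r >= 0),    # bad data, so: failure
    classify_run(lambda: 7, lambda r: r >= 0),     # ok, so: pass
]
```

Keeping the two outcomes distinct matters: an error means the algorithm itself broke, while a failure means it ran to completion but produced wrong data.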

We do not compare the state of entire tables at the beginning and at the end of a test. That is because the algorithm may contain various function calls that also modify side data. If those functions changed, many tests would report problems, even though this data lies outside the test's area of interest. In that case, maintaining the tests would become impossible over time.

Unfortunately, working with neglected code may make it necessary to inform the running code that it is under test. This happens in two cases:

  • we dump text file data during the algorithm run for the purpose of comparison,
  • some sections of code, or some checks, cannot be performed during the test (!). The right solution would require time-consuming refactoring that cannot be carried out at the moment. Note: this situation is error prone and should be eliminated (refactored) as soon as possible.

To keep the code informed of the test mode, it is good to create a dedicated static class that can easily be used inside the tested algorithms. A static class in this case is similar to a global variable; in fact, it merely wraps one, but it is a much better solution than a naked global, because we can monitor, log, and debug it at the moments when the variable is set or read.
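A minimal sketch of such a class, in Python for illustration (the article's own code is PHP; the class and method names are hypothetical):

```python
import logging

class TestMode:
    """Wraps the 'under test' flag so that every set and read goes through
    one place that can be monitored, logged, and debugged."""
    _enabled = False

    @classmethod
    def enable(cls):
        logging.debug("test mode enabled")
        cls._enabled = True

    @classmethod
    def disable(cls):
        logging.debug("test mode disabled")
        cls._enabled = False

    @classmethod
    def active(cls):
        return cls._enabled

# inside the tested algorithm:
TestMode.enable()
dump_intermediate = TestMode.active()   # e.g. guard an intermediate dump
TestMode.disable()
```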

Automatic test running

Every test framework should provide a mechanism for selecting and automatically executing tests. It is necessary both for the programmer while coding new functionality and for the continuous integration system, in order to run all or selected tests.

Test maintenance

Fixtures are small slices of table dumps, extracted directly from a database. Their format is optimized for fast loading into a database, not for manual editing. During software development, the structure of the data changes frequently, so it should be easy to propagate these changes into our fixtures to keep them up to date.

class Autotest_Fixture_Patcher {
        const DB_BEFORE = 'test_patch_before';
        const DB = 'test_patch_tmp';

        // ...

        private function patch($fixtureName) {
            Shell::db_clone(self::DB_BEFORE, self::DB);

            // ...
        }
}


Summary

The solution described in this article allows secure development of existing, complex algorithms. The next step is refactoring and complementing them with unit tests, which allow for even easier evolution.