Testing data-based algorithms

This article describes an approach to safely evolving complex algorithms that operate directly on a database, in a situation where we have no unit tests but do have access to the real data the algorithms work on. Of course, a properly written algorithm should not access the database during processing, or it should be built in a way that makes mocking the database possible. In reality, however, we sometimes encounter such situations and have to deal with them. The described solution has been implemented and used successfully in a large computer system, allowing the safe evolution and optimization of its algorithms.

The idea

The approach included in this article is based on the following assumptions:

We treat the algorithm as a black box that interacts with the database to perform its computation. The idea of test execution is well known: the algorithm is run on various sets of test data, and we check whether its execution completes successfully and whether the generated data fulfills the general criteria assumed at the outset. The interesting aspects involve the extraction, storage, and use of real data as test fixtures.

To perform tests, we need several elements, described in the sections below.

.. image:: ../../static/art/2-autotest/walking.png

Extracting test data from a production database

The real data collected in databases is a great basis for test cases. For each test case, we should extract data from tables, narrowing it to the minimum required for the algorithm to run.

Be aware of the responsibility to efface sensitive data fields. Sensitive data includes business names, which could link the test data to real objects. Numerical values usually need not be regarded as sensitive: without the context given by names they reveal little, and in most cases they are essential to the algorithm's processing.
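Effacement can be automated with a small helper. The sketch below is in Python for brevity, and the field names are hypothetical; a truncated hash keeps distinct names distinct, so joins inside the fixture still work after anonymization:

```python
import hashlib

def efface_row(row, sensitive_fields):
    """Replace sensitive name fields with stable, meaningless tokens.

    Numeric fields are left untouched: the algorithm needs them, and
    without the accompanying names they identify no real object.
    """
    cleaned = dict(row)
    for field in sensitive_fields:
        if cleaned.get(field) is not None:
            # A truncated hash of the original value is deterministic, so
            # the same name maps to the same token in every exported table.
            digest = hashlib.sha1(str(cleaned[field]).encode()).hexdigest()[:8]
            cleaned[field] = "anon_" + digest
    return cleaned

row = {"product": 1017, "name": "ACME Forklifts Ltd.", "quantity": 25.5}
print(efface_row(row, ["name"]))
```

The hash is used only to keep relationships between rows intact; any deterministic, irreversible mapping would do.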

A single algorithm test may proceed differently depending on the data provided in the fixtures, so the more unique test cases, the better. Fixtures that cover the same area, however, should be eliminated, because redundancy only adds unnecessary expense to maintaining the tests.

If an algorithm is multistage but implemented as a monolith, you can, in addition to fixtures, dump intermediate data to a text file. This allows more detailed inspection and easier identification of problems. The testing framework should automatically collate a recorded dump file with its counterpart generated during the test.
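The collation step can be as simple as a unified diff of the recorded dump against the freshly generated one. A minimal sketch in Python (the stage labels are made up):

```python
import difflib

def collate_dumps(recorded_text, generated_text):
    """Compare a recorded intermediate dump with one produced in a test.

    Returns an empty list when the dumps match; otherwise a unified diff
    that points at the first diverging stage of the algorithm.
    """
    return list(difflib.unified_diff(
        recorded_text.splitlines(),
        generated_text.splitlines(),
        fromfile="recorded", tofile="generated", lineterm=""))

same = collate_dumps("stage1: 10\nstage2: 30\n", "stage1: 10\nstage2: 30\n")
changed = collate_dumps("stage1: 10\nstage2: 30\n", "stage1: 10\nstage2: 31\n")
print(len(same), len(changed))
```

Presenting the result as a diff, rather than a bare pass/fail, is what makes problem identification easier: it names the exact stage whose output drifted.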

To automate and standardize data exports, a generic exporting class should be implemented, from which a dedicated export class can be derived for each case. For example, a warehouse exporter might define its dumps like this:

class Warehouse_Autotest_Exporter extends Autotest_ExportBase {
    protected function defineDumps() {
        $product = $this->exportParams['product'];
        $date = $this->exportParams['date'];
        // Dump the product row and all of its warehouse document
        // lines from $date onward - the minimum the algorithm needs.
        $this->addDumpDef('products', 'product = ?', array($product));
        $this->addDumpDef('warehouse_doc_lines', 'date >= ? AND product = ?', array($date, $product));
    }
}

Data loading

Data should be exported in a format that reduces loading time during tests to a minimum. For this purpose, it is advisable to use the SQL command COPY (PostgreSQL) instead of INSERT. It can also be a good idea to drop foreign keys and indexes on the tables while importing; this speeds up loading significantly and relieves us of maintaining a proper import order.
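A sketch of that idea, assuming PostgreSQL: the helper below only builds the SQL script (drop indexes, COPY, recreate indexes); the table, column, and index names are illustrative, not taken from the real system.

```python
def build_load_script(table, columns, fixture_path, indexes):
    """Build the SQL command sequence for fast fixture loading.

    `indexes` maps index names to their CREATE INDEX statements. With the
    indexes dropped and no foreign keys enforced, COPY runs at full speed
    and tables can be loaded in any order.
    """
    statements = ["DROP INDEX IF EXISTS %s;" % name for name in indexes]
    statements.append(
        "COPY %s (%s) FROM '%s';" % (table, ", ".join(columns), fixture_path))
    # Recreate the indexes once the data is in place.
    statements.extend(ddl + ";" for ddl in indexes.values())
    return statements

script = build_load_script(
    "warehouse_doc_lines",
    ["product", "date", "quantity"],
    "/tmp/fixture_42.copy",
    {"doc_lines_product_idx":
     "CREATE INDEX doc_lines_product_idx ON warehouse_doc_lines (product)"})
print("\n".join(script))
```

Note that server-side COPY reads the file on the database host; from a client, PostgreSQL's \copy or an equivalent driver facility plays the same role.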


Test execution

A TestCase groups methods that test one algorithm or share a common context. Each method is a single, independent test performed on various fixture sets.

The entire test is performed in one database transaction, which is rolled back at the end. Thanks to that, there is no need for special cleanup after the test; it is both convenient and fast.
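The pattern can be sketched as a small wrapper; here `execute` stands for whatever callable sends SQL to the database (a list-recording stub in the example):

```python
class RollbackTransaction:
    """Run a test inside one transaction that is always rolled back,
    so the database needs no cleanup afterwards."""

    def __init__(self, execute):
        self.execute = execute

    def __enter__(self):
        self.execute("BEGIN")
        return self

    def __exit__(self, exc_type, exc, tb):
        # Roll back even when the test raised, leaving the DB untouched.
        self.execute("ROLLBACK")
        return False  # do not swallow test errors

sent = []
with RollbackTransaction(sent.append):
    sent.append("UPDATE products SET quantity = 0")
print(sent)  # ['BEGIN', "UPDATE products SET quantity = 0", 'ROLLBACK']
```

Because the rollback happens in `__exit__`, it covers both passing and crashing tests, which is precisely what makes the scheme safe on real data.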

We should distinguish two kinds of test failures:

If the algorithm fails because of an inner exception or an inner assertion, it is reported as an error. If there is no error, we validate the generated data against universal conditions that must hold regardless of the input; if these checks do not pass, it is recognized as a test failure. In the last step, the recorded text files are compared with the intermediate data dumps; if the files do not match, that is also a test failure.
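With the three checks reduced to booleans, the decision logic above boils down to a few lines; a sketch:

```python
def classify_outcome(algorithm_raised, data_checks_pass, dumps_match):
    """Map the three checks onto the two kinds of results.

    An exception or inner assertion in the algorithm is an *error*;
    bad generated data or mismatched intermediate dumps are *failures*.
    """
    if algorithm_raised:
        return "error"
    if not data_checks_pass or not dumps_match:
        return "failure"
    return "pass"

print(classify_outcome(False, True, False))  # failure
```

Keeping the two categories separate matters in practice: an error points at the algorithm crashing, while a failure points at its output drifting from expectations.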

We do not compare the state of entire tables at the beginning and end of a test. That is because the algorithm may call various functions that also modify side data. If those functions were changed, a lot of tests would report problems even though the affected data lies outside the test's area of interest, and maintaining the tests would become impossible.

Unfortunately, working with neglected code may make it necessary to inform the running code that it is under test. This happens in two cases:

The right solution would require time-consuming refactoring, which at the moment is not feasible. Note! This situation is error-prone and should be eliminated (refactored) as soon as possible. To keep the code informed of the test mode, it is good to create a dedicated static class, which can easily be used inside the tested algorithms. A static class used this way resembles a global variable; it merely wraps one, but it is a much better solution than a naked global, because we can monitor, log, and debug at the exact moments the variable is set or read.
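A sketch of such a class (in Python; the name is illustrative). Every read and write goes through a method, which is exactly what makes the flag monitorable, unlike a naked global:

```python
import logging

class TestMode:
    """A monitored substitute for a naked global test flag."""
    _enabled = False

    @classmethod
    def enable(cls):
        logging.debug("TestMode enabled")  # hook point for log/breakpoint
        cls._enabled = True

    @classmethod
    def disable(cls):
        logging.debug("TestMode disabled")
        cls._enabled = False

    @classmethod
    def is_enabled(cls):
        logging.debug("TestMode read: %s", cls._enabled)
        return cls._enabled

TestMode.enable()
print(TestMode.is_enabled())  # True
TestMode.disable()
```

Production code calls `TestMode.is_enabled()` at the few unavoidable branch points; the test runner flips the flag around each test.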

Automatic test running

Every test framework should provide a mechanism for selecting and automatically executing tests. This is necessary both for a programmer coding new functionality and for a continuous integration system that runs all or selected tests.

Test maintenance

Fixtures are small slices of table dumps, extracted directly from a database. Their format is prepared for fast loading into a database, not for manual editing. During software development the structure of the data changes frequently, and it should be easy to apply those changes to our fixtures as well, to keep them up to date. A dedicated patching class can automate this:

class Autotest_Fixture_Patcher {

    const DB_BEFORE = 'test_patch_before';
    const DB = 'test_patch_tmp';

    private function patch($fixtureName) {
        // Patch against a disposable clone, so the pristine
        // pre-migration database is never modified.
        Shell::db_clone(self::DB_BEFORE, self::DB);
        // ... load the fixture, apply schema changes, dump it back ...
    }
}

The solution described in this article allows the secure development of existing, complex algorithms. The next step is refactoring them and complementing them with unit tests, which allow for even easier product evolution.