Data Testing

Published date: April 15, 2024, Version: 1.0

Overview

Data Testing is different from application testing because it requires a data-centric testing approach. Some of the key elements in Data Testing are:

  • Data testing involves comparing large volumes of data, typically millions of records.

  • Data that needs to be compared can be in heterogeneous data sources such as databases, flat files etc.

  • Data is often transformed, which might require complex SQL queries to compare the data.

  • Data Warehouse testing is very much dependent on the availability of test data with different test scenarios.

  • BI tools such as OBIEE, Cognos, Business Objects and Tableau generate reports on the fly based on a metadata model. Testing various combinations of attributes and measures can be a huge challenge.

  • The volume of the reports and the data can also make it very challenging to test these reports for regression, stress and functionality.

Data Testing Categories

  • ETL (Extract Transform Load) Testing

  • Flat File Testing

  • Business Intelligence (BI) Testing

  • Application Transactional Database Testing

ETL (Extract Transform Load) Testing

Metadata Testing

Metadata Testing aims to verify that the table definitions conform to the data model and application design specifications.

Data Type check

Verify that the table and column data type definitions are as per the data model design specifications.
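
For illustration, on databases that expose the ANSI information_schema, a query along the following lines can pull the actual column definitions for comparison against the data model. The table name CUSTOMER is a placeholder, and the information_schema column names may vary slightly by database.

  -- Sketch: list the actual column definitions for an assumed CUSTOMER table
  -- so they can be compared against the data model specification.
  SELECT column_name,
         data_type,
         character_maximum_length,
         is_nullable
  FROM   information_schema.columns
  WHERE  table_name = 'CUSTOMER'
  ORDER  BY ordinal_position;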

Data Length Check

Verify that the length of database columns is as per the data model design specifications.

Index / Constraint Check

Verify that proper constraints and indexes are defined on the database tables as per the design specifications.

Metadata Naming Standards Check

Verify that the names of the database metadata, such as tables, columns, and indexes, are as per the naming standards.

Metadata Check across Environments

Compare table and column metadata across environments to ensure that changes have been migrated appropriately.

Data Completeness Test

The purpose of Data Completeness tests is to verify that all the expected data is loaded into the target from the source. Some of the tests that can be run are: compare and validate counts, aggregates (min, max, sum, avg) and actual data between the source and target.

Record Count Validation

Compare the count of records of the primary source table and target table. Check for any rejected records.
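
A minimal sketch of such a count comparison, assuming source and target tables named SRC_ORDERS and TGT_ORDERS in the same database (the names are placeholders); when source and target live in different databases, run the counts separately and compare the results.

  -- Compare record counts between the assumed source and target tables.
  -- A non-zero difference points to missing or rejected records.
  -- (On Oracle, append FROM dual to the outer SELECT.)
  SELECT (SELECT COUNT(*) FROM src_orders) AS source_count,
         (SELECT COUNT(*) FROM tgt_orders) AS target_count,
         (SELECT COUNT(*) FROM src_orders) -
         (SELECT COUNT(*) FROM tgt_orders) AS count_difference;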

Column Data Profile Validation

Column or attribute-level data profiling is an effective tool for comparing source and target data without comparing the entire data. It is similar to comparing the checksum of your source and target data. These tests are essential when testing large amounts of data.
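
As a sketch, the same profile query can be run on both the source and the target and the outputs compared; the ORDERS table and ORDER_AMOUNT column are assumed names.

  -- Profile a numeric column; run against both source and target and compare the results.
  SELECT COUNT(*)                                              AS row_count,
         COUNT(DISTINCT order_amount)                          AS distinct_values,
         SUM(CASE WHEN order_amount IS NULL THEN 1 ELSE 0 END) AS null_count,
         MIN(order_amount)                                     AS min_amount,
         MAX(order_amount)                                     AS max_amount,
         SUM(order_amount)                                     AS total_amount,
         AVG(order_amount)                                     AS avg_amount
  FROM   orders;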

Compare Entire Source and Target Data

Compare data (values) between the source and target data, effectively validating 100% of the data. In regulated industries such as finance and pharmaceutical, 100% data validation might be a compliance requirement. It is also a key requirement for data migration projects. However, performing 100% data validation is a challenge when large volumes of data are involved. This is where ETL testing tools such as ETL Validator can be used because they have an inbuilt ELV engine (Extract, Load and Validate) capable of comparing large volumes of data.

Data Quality Test

The purpose of Data Quality tests is to verify the accuracy of the data. Data profiling is used to identify data quality issues, and the ETL is designed to fix or handle these issues. However, source data keeps changing, and new data quality issues may be discovered even after the ETL is in production. Automating the data quality checks in the source and target systems is an important aspect of ETL execution and testing.

Duplicate Data Checks

Look for duplicate rows with the same key column or a unique combination of columns per business requirements. Example: a business requirement says that the combination of First Name, Last Name, Middle Name and Date of Birth should be unique.
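
A sketch of such a duplicate check against an assumed CUSTOMER table with those columns:

  -- Report key combinations that occur more than once.
  SELECT first_name, last_name, middle_name, date_of_birth,
         COUNT(*) AS occurrence_count
  FROM   customer
  GROUP  BY first_name, last_name, middle_name, date_of_birth
  HAVING COUNT(*) > 1;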

Data Validation Rules

Many database fields can contain a range of values that cannot be enumerated. However, there are reasonable constraints or rules that can apply to detect situations where the data is clearly wrong. Instances of fields containing values violating the validation rules defined represent a quality gap that can affect ETL processing.
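
For example, a date of birth should not be in the future and an order quantity should not be negative. A sketch of such rule checks, with assumed table and column names:

  -- Count rows violating simple validation rules.
  SELECT COUNT(*) AS future_birth_dates
  FROM   customer
  WHERE  date_of_birth > CURRENT_DATE;

  SELECT COUNT(*) AS negative_quantities
  FROM   orders
  WHERE  quantity < 0;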

Data Integrity Checks

This measurement addresses “keyed” relationships of entities within a domain. These checks aim to identify orphan records in the child entity with a foreign key to the parent entity.

  • Count of records with null foreign key values in the child table.

  • Count of invalid foreign key values in the child table that do not have a corresponding primary key in the parent table.
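
A sketch of both checks, assuming an ORDERS child table with a CUSTOMER_ID foreign key to a CUSTOMER parent table:

  -- 1. Child records with a null foreign key value.
  SELECT COUNT(*) AS null_fk_count
  FROM   orders
  WHERE  customer_id IS NULL;

  -- 2. Child records whose foreign key has no matching parent record (orphans).
  SELECT COUNT(*) AS orphan_count
  FROM   orders o
  WHERE  o.customer_id IS NOT NULL
    AND  NOT EXISTS (SELECT 1
                     FROM   customer c
                     WHERE  c.customer_id = o.customer_id);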

Data Transformation Tests

Data is transformed during the ETL process so that it can be consumed by applications on the target system. Because the target systems depend on the transformed data, it is important to test the transformations. Two approaches for testing transformations are white box testing and black box testing.

Transformation testing using the White Box approach

White box testing is a testing technique that examines the program structure and derives test data from the program logic/code. Transformation testing involves reviewing the transformation logic in the mapping design document and the ETL code to come up with test cases. The steps to follow are:

  • Review the source to target mapping design document to understand the transformation design.

  • Apply transformations on the data using SQL or a procedural language such as PL/SQL to reflect the ETL transformation logic.

  • Compare the results of the transformed test data with the data in the target table.

  • The advantage of this approach is that the test can be rerun easily on larger source data. The disadvantage of this approach is that the tester has to implement the transformation logic.
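
As an illustration, suppose the mapping document specifies that the target FULL_NAME column is the concatenation of the source first and last names. A sketch of the white box comparison, using assumed SRC_CUSTOMER and TGT_CUSTOMER tables (use MINUS instead of EXCEPT on Oracle):

  -- Re-apply the documented transformation to the source and compare with the target.
  -- Any rows returned indicate a mismatch between expected and loaded values.
  SELECT customer_id,
         first_name || ' ' || last_name AS expected_full_name
  FROM   src_customer
  EXCEPT
  SELECT customer_id,
         full_name
  FROM   tgt_customer;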

Transformation testing using the Black Box approach

Black-box testing is a software testing method that examines an application's functionality without peering into its internal structures or workings. For transformations, testing involves reviewing the transformation logic in the mapping design document and setting up the test data appropriately. The steps to follow are listed below:

  • Review the requirements document to understand the transformation requirements.

  • Prepare test data in the source systems to reflect different transformation scenarios.

  • Come up with the transformed data values or the expected values for the test data from the previous step.

  • Compare the results of the transformed test data in the target table with the expected values.

  • The advantage of this approach is that the transformation logic does not need to be implemented during the testing. The disadvantage is that the tester needs to set up test data for each transformation scenario and come up with the expected values for the transformed data manually.

Incremental Data Test

An ETL process is designed to run in either Full mode or Incremental mode. When running in Full mode, the ETL process truncates the target tables and reloads all (or most) of the data from the source systems. An incremental ETL only loads the data that changed in the source system, using some kind of change capture mechanism to identify the changes. Incremental ETL is essential for reducing ETL run times, and it is an often-used method for updating data on a regular basis. Incremental ETL testing aims to verify that updates on the sources are getting loaded into the target system properly.

While most of the data completeness and data transformation tests are also relevant for incremental ETL testing, a few additional tests apply. First, setting up test data for updates and inserts is key for testing incremental ETL.

Duplicate Data Checks

When a source record is updated, the incremental ETL should be able to look up the existing record in the target table and update it. If not, this can result in duplicates in the target table.

Compare Data Values

Verify that the changed data values in the source are reflected correctly in the target data. Typically the records updated by an ETL process are stamped by a run ID or a date of the ETL run. This data can be used to identify the newly updated or inserted records in the target system. Alternatively, all the records that got updated in the last few days in the source and target can be compared based on the incremental ETL run frequency.
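
A sketch of such a comparison, assuming both source and target tables carry a LAST_UPDATED_DATE column and that only the last day of changes needs to be compared (use MINUS instead of EXCEPT on Oracle; date arithmetic syntax varies by database):

  -- Recently changed source rows that are not reflected identically in the target.
  SELECT customer_id, customer_name, status
  FROM   src_customer
  WHERE  last_updated_date >= CURRENT_DATE - 1
  EXCEPT
  SELECT customer_id, customer_name, status
  FROM   tgt_customer;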

Data Denormalization Checks

Denormalization of data is quite common in a data warehouse environment. Source data is denormalized in the ETL to improve the report performance. However, the denormalized values can get stale if the ETL process is not designed to update them based on changes in the source data.
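
For instance, if the customer name is denormalized onto an order fact table for reporting, a staleness check might look like the following sketch (table and column names are assumed):

  -- Fact rows where the denormalized customer name no longer matches
  -- the current value in the customer table.
  SELECT f.order_id,
         f.customer_name AS denormalized_name,
         c.customer_name AS current_name
  FROM   order_fact f
  JOIN   customer   c ON c.customer_id = f.customer_id
  WHERE  f.customer_name <> c.customer_name;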

Slowly Changing Dimension Checks

While there are different types of slowly changing dimensions (SCD), testing an SCD Type 2 dimension presents a unique challenge since there can be multiple records with the same natural key. A Type 2 SCD is designed to create a new record whenever there is a change to a set of tracked columns. The latest record is tagged with a flag, and start date and end date columns indicate the period of relevance for each record. Some of the tests specific to a Type 2 SCD are listed below:

  1. Is a new record created every time there is a change to the SCD key columns, as expected?

  2. Is the latest record tagged as the latest record by a flag?

  3. Are the old records end dated appropriately?
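
A sketch of checks 2 and 3, assuming a CUSTOMER_DIM table with a CUSTOMER_KEY natural key, a CURRENT_FLAG column and EFFECTIVE_START_DATE / EFFECTIVE_END_DATE columns:

  -- Natural keys with more than one record flagged as current.
  SELECT customer_key, COUNT(*) AS current_record_count
  FROM   customer_dim
  WHERE  current_flag = 'Y'
  GROUP  BY customer_key
  HAVING COUNT(*) > 1;

  -- Historical records that were not end dated.
  SELECT customer_key, effective_start_date
  FROM   customer_dim
  WHERE  current_flag = 'N'
    AND  effective_end_date IS NULL;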

ETL Regression testing

ETL Regression testing aims to verify that the ETL produces the same output for a given input before and after a change. Any differences need to be validated to determine whether they are expected as per the changes.

Changes to Metadata

Track changes to table metadata in the Source and Target environments. Often, changes to source and target system metadata are not communicated to the QE and Development teams, resulting in ETL and application failures. This check is important from a regression testing standpoint.

End-to-End Integration Test

Once the data is transformed and loaded into the target by the ETL process, it is consumed by another application or process in the target system. For data warehouse projects, the consuming application is a BI tool such as OBIEE, Business Objects, Cognos or SSRS. For a data migration project, data is extracted from a legacy application and loaded into a new application. In a data integration project, data is shared between two different applications, usually on a regular basis. ETL integration testing aims to perform end-to-end testing of the data in the ETL process and the consuming application.

End-to-End Data Testing

Integration testing of the ETL process and the related applications involves the following steps:

  1. Set up test data in the source system.

  2. Execute the ETL process to load the test data into the target.

  3. View or process the data in the target system.

  4. Validate the data and application functionality that uses the data.

ETL Performance Testing

The performance of the ETL process is one of the key issues in any ETL project. Often development environments do not have enough source data for performance testing of the ETL process. This could be because the project has just started, the source system only has a small amount of test data, or production data has PII information, which cannot be loaded into the test database without scrubbing. The ETL process can behave differently with different volumes of data.

Flat File Testing

File Ingestion Testing

When data is moved using flat files between enterprises or organizations within an enterprise, it is important to perform a set of file ingestion validations on the inbound flat files before consuming the data in those files.

File Name Validation

Files are FTP’ed or copied over to a specific folder for processing. These files usually follow a specific naming convention so that the consuming process can identify the contents and the date of the file. From a testing standpoint, the file name pattern needs to be validated to verify that it meets the requirements.

Size and Format of the Flat Files

Although flat files are generally delimited or fixed in width, having a header and footer is common in these files. Sometimes, these headers have a row count that can be used to verify that the file contains the entire data as expected.

Some of the relevant checks are:

  1. Verify that the size of the file is within the expected range where applicable.

  2. Verify that the header, footer and column heading rows have the expected format and have the expected location within the flat file.

  3. Perform any row count checks to crosscheck the data in the header with the values in the delimited data.

File Arrival, Processing and Deletion Times

Files arrive periodically in a specific network folder or an FTP location before getting consumed by a process. Usually, specific requirements need to be met regarding the file arrival time, the order of arrival and how long the files are retained.

Data Type Testing

Data Type testing aims to verify that the type and length of the data in the flat file are as expected.

Data Type Check

Verify that the type and format of the data in the inbound flat file match the expected data type for the file. For date, timestamp and time data types, the values are expected to be in a specific format so that they can be parsed by the consuming process.

Data Length Check

Verify that the length of string and numeric data values in the flat file does not exceed the maximum allowed length for the corresponding columns.

Not Null Check

Verify that any required data elements in the flat file have data for all the rows.

Data Quality Testing

The purpose of Data Quality tests is to verify the accuracy of the data in the inbound flat files.    

Duplicate Data Checks

Check for duplicate rows in the inbound flat file with the same unique key column or a unique combination of columns as per business requirements.  

Reference Data Checks

Flat file standards may dictate that the values in certain columns should adhere to values in a domain. Verify that the inbound flat file values conform to reference data standards.  

Data Validation Rules

Many data fields can contain a range of values that cannot be enumerated. However, there are reasonable constraints or rules that can be applied to detect situations where the data is clearly wrong. Instances of fields containing values violating the validation rules defined represent a quality gap that can impact inbound flat file processing.

Data Integrity Checks

This check addresses “keyed” relationships of entities within a domain. The goal is to identify orphan records in the child entity with a foreign key to the parent entity.

  1. Count of records with null foreign key values in the flat file.

  2. Count of invalid foreign key values in the flat file that do not have a corresponding primary key in the parent flat file or database table.

Data Completeness Testing

Data in the inbound flat files is generally processed and loaded into a database. In some cases, the output may also be another flat file. Data Completeness tests verify that all the expected data is loaded into the target from the inbound flat file. Some of the tests that can be run are: compare and validate counts, aggregates (min, max, sum, avg) and actual data between the flat file and target.

Record Count Validation

Compare the count of records of the flat file and database table. Check for any rejected records.

Column Data Profile Validation

Column or attribute-level data profiling is an effective tool for comparing source and target data without comparing the entire data. It is similar to comparing the checksum of your source and target data. These tests are essential when testing large amounts of data.

Some of the common data profile comparisons that can be done between the flat file and target are:

  1. Compare unique values in a column between the flat file and the target.

  2. Compare max, min, avg, max length, and min length values for columns depending on the data type.

  3. Compare null values in a column between the flat file and target.

  4. For important columns, compare data distribution (frequency) in a column between the flat file and the target.

Compare Entire Flat File and Target Data

Compare data (values) between the flat file and target data, effectively validating 100% of the data. In regulated industries such as finance and pharma, 100% data validation might be a compliance requirement. It is also a key requirement for data migration projects. However, performing 100% data validation is a challenge when large volumes of data are involved. This is where ETL testing tools such as ETL Validator can be used because they have an inbuilt ELV engine (Extract, Load, Validate) capable of comparing large volumes of data.

Application Transactional Database Testing

Databases are key components of most software applications. Application testing is well documented and important for the success of the project, but often little or no testing is done for the development work in the database. Database testing is the process of validating that the metadata (structure) and data stored in the database meet the requirements and design. Database Testing is important because it helps identify data quality and application performance issues that might otherwise be detected only after the application has been live for some time.

What do we test when testing an application transactional database?

Metadata Testing

Metadata Testing aims to verify that the table definitions conform to the data model and application design specifications.

Data Type check

Verify that the table and column data type definitions are as per the data model design specifications.

Data Length Check

Verify that the length of database columns is as per the data model design specifications.

Index / Constraint Check

Verify that proper constraints and indexes are defined on the database tables as per the design specifications.

Metadata Naming Standards Check

Verify that the names of the database metadata, such as tables, columns, and indexes, are as per the naming standards.

Metadata Check across Environments

Compare table and column metadata across environments to ensure that changes have been migrated appropriately. 

Data Quality Testing

The purpose of Data Quality tests is to verify the accuracy and quality of the data. Data profiling is used to identify data quality issues in production systems once the application has been live for some time. However, database testing aims to automate the data quality checks in the testing phase. 

Duplicate Data Checks

Look for duplicate rows with the same key column or a unique combination of columns per business requirements.

Data Validation Rules

Many database fields can contain a range of values that cannot be enumerated. However, there are reasonable constraints or rules that can be applied to detect situations where the data is clearly wrong. Instances of fields containing values violating the defined validation rules represent a quality gap that can affect application processing.

Data Integrity Checks

This measurement addresses “keyed” relationships of entities within a domain. These checks aim to identify orphan records in the child entity with a foreign key to the parent entity. 

Reference Data Testing

Many database fields can only contain a limited set of enumerated values. Instances of fields containing values not found in the valid set represent a quality gap that can affect processing.

Verify that Data Conforms to Reference Data Standards

Data model standards dictate that the values in certain columns should adhere to the values in a domain. Verify that the data conforms to these reference data standards.
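
A sketch of such a domain check, assuming a CUSTOMER table with a COUNTRY_CODE column and a REF_COUNTRY reference table holding the valid codes:

  -- Values present in the data but missing from the reference table.
  SELECT DISTINCT c.country_code
  FROM   customer c
  WHERE  c.country_code IS NOT NULL
    AND  NOT EXISTS (SELECT 1
                     FROM   ref_country r
                     WHERE  r.country_code = c.country_code);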

Compare Domain Values across Environments

One of the challenges in maintaining reference data is to verify that all the reference data values from the development environments have been migrated properly to the test and production environments.

Track Reference Data Changes

Baseline reference data and compare it with the latest reference data to validate the changes.
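
A sketch of a baseline comparison, assuming the reference table REF_STATUS was copied to a REF_STATUS_BASELINE table when the baseline was taken (use MINUS instead of EXCEPT on Oracle):

  -- Rows added or changed since the baseline.
  SELECT * FROM ref_status
  EXCEPT
  SELECT * FROM ref_status_baseline;

  -- Rows removed or changed since the baseline.
  SELECT * FROM ref_status_baseline
  EXCEPT
  SELECT * FROM ref_status;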

Database Regression Testing

Database Regression testing aims to identify any issues that might occur due to changes in the database metadata, procedures or system upgrades. The key for regression testing of the Databases, particularly in an agile development environment, is to organize test cases into test plans (or test suites) and execute them as and when needed based on the changes in the application or product. 

Database Integration Testing

When an application is tested from the UI, the database procedures and tables are accessed and tested as well. Integration testing aims to validate that the database tables are being populated with the expected data based on input provided in the application. 

Integration testing of the database and the related applications involves the following steps:

  • Review the application UI to database attribute mapping document. Prepare one if it is missing.

  • Run tests in the application UI tracking the test data input.

  • Verify that the data loaded into the database tables matches the input test data. 

Database Performance Testing

Database performance is an important aspect of application performance testing. Often, development environments do not have enough data for performance testing of the database. This could be because the project has just started and the database only has a small amount of test data, or because production data has PII information, which cannot be loaded into the test database without scrubbing. The application can behave differently with different volumes of data. Setting up realistic test data using one of the following strategies is key to performance testing.

  • Mask and refresh test data in the test environments from the production environment.

  • Generate larger volumes of test data using data generation tools.

Data Transformation Testing

Data in the inbound Flat File is transformed by the consuming process and loaded into the target (table or file). It is important to test the transformed data. Two approaches for testing transformations are white box testing and black box testing.

Step 1

  • Review the transformation design document.

Step 2

  • Prepare test data in the flat file to reflect different transformation scenarios.

Step 3

  • Apply transformations on the flat file data using SQL or a procedural language such as PL/SQL to reflect the ETL transformation logic.

Step 4

  • Compare the results of the transformed data with the data in the target table or target flat file.

Flat File Ingestion Performance Testing

Performance testing of the inbound flat file ingestion process involves the following steps:

Step 1

  • Estimate expected data volumes in each source flat file for the consuming process for the next 1-3 years.

Step 2

  • Set up test data for performance testing either by generating sample flat files or getting sample flat files.

Step 3

  • Execute the flat file ingestion process to load the test data into the target.

Step 4

  • Execute the flat file ingestion process again with a large volume of data already in the target tables to identify bottlenecks.

Business Intelligence (BI) Testing

When a new report or dashboard is developed for consumption by other users, performing a few checks to validate the data and design of the included reports is important.

BI Functional Testing

Report or Dashboard Design Check

Verify that the new report or dashboard conforms to the report requirements/design specifications. Some of the items to check are:

  1. Verify that the report or dashboard page title corresponds to the content of the reports.

  2. For reports with charts, the axis should be labelled appropriately.

  3. The data aggregation level in the reports should be as per the report requirements.

  4. Verify that the report or dashboard page design conforms to the design standards and best practices.

  5. Validate the presence and functionality of the report download and print options.

  6. Where applicable, verify that the report help text exists and is appropriate for the report content.

  7. Verify the existence of any required static display text in the report.

Prompts Check

Prompts are used to filter the data in the reports as needed. They can be of different types, but the most common type of prompt is a select list or dropdown with a list of values. Some of the key tests for prompts are:

  1. Verify that all the prompts are available as per requirements. Also, check if the type of prompt matches the design specification.

  2. For each prompt, verify the label and list of values displayed (where applicable).

  3. Apply each prompt and verify that the data in the report is getting filtered appropriately.

  4. Verify the default prompt selection satisfies the report or dashboard page design specification.

Report Data Accuracy Check

Verify that the data shown in the report is accurate. This check is a vital aspect of the report’s functional testing.

  1. Cross-check the report with data shown in a transactional system application that is trusted by the users as the source of truth for the data shown in the report.

  2. Come up with an equivalent database query on the target and source databases for the report. Compare the results from the queries with the data in the report.

  3. Review the database query generated by the report for any issues.

  4. Apply report prompts and validate the database query generated by the report as well as the query output.

Drilldown Report Checks

It is common for a report to have links to drill-down reports so the user can navigate to those reports for further details. Links to these reports can be at the column or heading level. For each link to a drill-down report, verify the following items:

  1. Verify that the counts are matching between the summary and detail report where appropriate.

  2. Verify that all the summary report prompts are applied to the detail report.

  3. Check if the links to the detail report from the summary report are working from charts, tables, and table headings.

  4. Verify the database SQL query for the drill-down report is as expected.

Report Performance Checks

Verify that the reports and dashboard page rendering times are meeting SLA requirements. Test the performance for different prompt selections. Perform the same checks for the drill-down reports. 

Browser Checks

Browser compatibility of the reports is often dictated by the browsers supported by the BI tool being used for the BI project. Any custom JavaScript additions to the report or dashboard page can also result in browser-specific report issues.

BI Security Testing

Like any other web application, BI applications also have authentication and authorization security requirements. BI applications are often integrated with single sign-on or embedded with other transactional applications. It is important to test the security aspects of the BI application just like other web applications.

Report Access Security

This test aims to validate that the BI user's access to the BI reports, subject areas and dashboards is limited according to their access levels. Role-based security in the BI tool generally controls access to the reports.

  1. Understand how access to reports is different for different roles.

  2. Identify users with those roles for testing.

  3. Log in as these users and validate access to the reports and subject areas.

Data Security

In this form of security, different users have access to a report, but the data shown in the report is different based on the person running the report.

Single Sign-on Security

Single sign-on is often used as the authentication mechanism for BI applications in large enterprises. This testing aims to ensure that users can access BI applications using their single sign-on access (or Windows authentication).

Integrated Security

BI Applications are sometimes embedded as part of other transaction system applications using a common authentication mechanism. This integration needs to be tested for different users.

BI Regression Testing

BI tools make it easy to create new reports by automatically generating the database query dynamically based on a predefined BI Model. This presents a challenge from a regression testing standpoint because any change to the BI Model can potentially impact existing reports. Hence it is important to do a complete regression test of the existing reports and dashboards whenever there is an upgrade or change in the BI Model.

BI Stress Testing

BI Stress testing is similar to stress testing of any other web-based application. The objective is to simulate concurrent users accessing reports with different prompts and to understand the bottlenecks in the system.

Simulating User Behavior

A typical BI user will log into the system, navigate to reports (or dashboards), apply prompts, and drill down to other reports. After the report is rendered, the BI User reviews the data for a certain time called think time. The BI user eventually logs out of the system. The first step in conducting a stress test is to identify a list of the most commonly used reports or dashboards for the testing.

Simulating Concurrent User Load

Conducting a stress test requires concurrently simulating the above BI user behavior for different user loads, so it is important to have a list of different user logins for the stress test. When executing the reports, each user can pick a different set of prompt values for the same report. The stress test should be able to randomize each user's prompt selection to generate more realistic behavior.

BI Cache

Most of the BI tools support a caching mechanism to improve the performance of the reports. Database queries and the data blocks used to generate the results are also cached in databases. The question that frequently comes up is whether the cache should be turned off for stress testing. While turning off caching is a good way to understand system bottlenecks, we recommend performing the stress test with the caching turned on because it is closer to what a production system would look like. The following points should be kept in mind to reduce the impact of caching:

  1. Include a large set of reports and dashboard pages.

  2. Randomize prompt values for a given report or dashboard page to request the same report with different filter conditions.

  3. Use a large pool of user logins so that different user-based security is applied during the execution.

Stress testing is an iterative process. When the first round of stress tests is conducted, a particular component of the system (e.g. database cache) might max out and bring down the response times. Once this bottleneck has been addressed, another round of stress tests might find another bottleneck. This process must continue until the target concurrent user load and report performance SLAs are met.

Performance Measurements

There are several performance data points that need to be captured while running a stress test:

  1. Initial and final response time for each of the report and dashboard pages.

  2. Time spent for each request in the database and in the BI tool.

  3. CPU and memory utilization in the server machines (BI Tool and database).

  4. Load balancing across all the nodes in a cluster.

  5. The number of active sessions and the number of report requests.

  6. Busy connection count in the database connection pool and the number of currently queued requests.

  7. Network load.

  8. Database cache and disk reads.