There are several areas in the workflow of a big data project where testing is required. Testing in big data projects typically spans database testing, infrastructure and performance testing, and functional testing. Having a clear test strategy contributes significantly to the success of the project.
Database testing will be a key component of testing in big data applications. It can be classified into 3 major groups:
- Data Staging Validation: Here we validate the data taken from various sources such as sensors, scanners, and logs. We also validate the data as it is pushed into Hadoop (or a similar framework).
- Process Validation: In this step the tester validates that the data obtained after processing through the big data application is accurate. This includes testing the accuracy of the data generated by MapReduce or similar processes.
- Output Validation: In this step the tester validates that the output from the big data application is correctly stored in the data warehouse. They also verify that the data is accurately represented in the business intelligence system or any other target UI.
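The process validation step above can be sketched as a test that runs a toy MapReduce-style computation and compares its result against an independent reference implementation. This is a minimal illustration only; the function names and sample data are made up for the example, and a real test would run against the actual cluster output.

```python
from collections import Counter, defaultdict

def map_phase(lines):
    # map step: emit (word, 1) for every word in the input
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # reduce step: sum the counts per word
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

lines = ["big data testing", "big data", "testing big data systems"]
result = reduce_phase(map_phase(lines))

# independent reference implementation used as the test oracle
expected = dict(Counter(w for line in lines for w in line.split()))
assert result == expected
print(result["big"], result["data"])  # 3 3
```

The key idea is that the oracle is computed by a separate, simpler path than the system under test, so a bug in the processing logic cannot silently cancel out in the comparison.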
Performance testing of the system is required to catch bottlenecks and resource problems before they surface in production. Here we measure metrics like throughput, memory utilization, CPU utilization, and the time taken to complete a task.
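As a minimal sketch of the timing and throughput measurements described above, the snippet below times a stand-in batch task locally. In a real project these figures would come from cluster monitoring tools rather than a single-machine loop; `process_record` here is a hypothetical placeholder for the real work.

```python
import time

def process_record(record):
    # stand-in for real per-record processing work
    return record * 2

records = list(range(100_000))

start = time.perf_counter()
results = [process_record(r) for r in records]
elapsed = time.perf_counter() - start

# throughput = work completed per unit of wall-clock time
throughput = len(records) / elapsed
print(f"processed {len(results)} records in {elapsed:.3f}s "
      f"({throughput:.0f} records/s)")
```

`time.perf_counter()` is used rather than `time.time()` because it is a monotonic, high-resolution clock intended for interval measurement.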
It is also recommended to run failover tests to validate the fault tolerance of the system and ensure that if some nodes fail, the remaining nodes take over their processing.
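The failover property can be illustrated with a toy simulation: tasks are spread across nodes, one node is marked as failed, and the test asserts that the surviving nodes pick up the orphaned work so no task is lost. The node and task names are invented for the example; a real failover test would kill an actual node in a test cluster and observe the scheduler.

```python
def assign(tasks, nodes):
    # round-robin assignment of tasks to nodes
    return {t: nodes[i % len(nodes)] for i, t in enumerate(tasks)}

tasks = [f"task-{i}" for i in range(6)]
nodes = ["node-a", "node-b", "node-c"]
assignment = assign(tasks, nodes)

# simulate a node failure and redistribute its tasks
failed = "node-b"
survivors = [n for n in nodes if n != failed]
orphaned = [t for t, n in assignment.items() if n == failed]
assignment.update(assign(orphaned, survivors))

# every task must end up on a healthy node, and none may be dropped
assert failed not in assignment.values()
assert len(assignment) == len(tasks)
print(sorted(set(assignment.values())))  # ['node-a', 'node-c']
```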
Functional testing of big data applications is performed by testing the front end application against user requirements. The front end can be a web-based application that interfaces with Hadoop (or a similar framework) on the back end.
Results produced by the front end application must be compared with the expected results in order to validate the application.
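This comparison can be sketched as follows, assuming the front end exposes its results as JSON and the expected figures have been computed independently from the back-end data. The `fetch_report` function is a hypothetical stub standing in for an HTTP call to the web front end.

```python
import json

def fetch_report():
    # stand-in for an HTTP request to the front end application
    return json.dumps({"total_orders": 120, "revenue": 4500.0})

# expected values, computed independently from the back-end data
expected = {"total_orders": 120, "revenue": 4500.0}

actual = json.loads(fetch_report())
mismatches = {k: (expected[k], actual.get(k))
              for k in expected if actual.get(k) != expected[k]}
print("PASS" if not mismatches else f"FAIL: {mismatches}")  # PASS
```

Reporting the full set of mismatched fields, rather than failing on the first one, makes it easier to diagnose whether a discrepancy originates in processing, storage, or presentation.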