Bug bites Elephant? Test-driven Quality Assurance in Big Data Application Development. Dr. Dominik Benz, inovex GmbH. 2013/06/03, Berlin Buzzwords
Who speaks the Elephant language? "class A extends Mapper…" / "ROI, $$, …" / "apt-get install…" / "TDD! Write/execute tests, specify acceptance criteria, …"
The road to Big Data QA: the FitNesse approach; our Big Data QA problem; result inspection; test data definition / selection; job & workflow control
QA problem: Web Intelligence @ 1&1. BI reporting, web analytics, …; ~1 billion log events / day, ~1 TB (thrift) logfiles. DWH and Hadoop cluster: chains of MR jobs, running on 20 nodes / 8 cores / 96 GB RAM (CDH)
QA problem: An exemplary workflow. Log files (thrift) → MR job 1 → intermediate result (avro) → MR job 2 → … → DWH (RDBMS). Open questions: how to create (sample) input data, inspect (binary) formats, and control workflows?
QA problem: Existing approaches
| method | tests what? | issues for our use case |
| JUnit | isolated functions | no integration, Java syntax |
| MRUnit | 1 mapper + 1 reducer | "little" integration, Java syntax |
| iTest | Hadoop jobs/workflows | Java / Groovy syntax |
| Scripts/CLI | (manual) scripting/inspection | "script chaos", syntax |
FitNesse as suitable addition / solution!
The road to Big Data QA: the FitNesse approach; Big Data QA is different!; result inspection; test data definition / selection; job & workflow control
FitNesse in a nutshell: "fully integrated standalone wiki and acceptance testing framework"; "executable" wiki pages (returning test results); (almost) natural-language test specification; connection to the SUT via (Java) "fixtures"
FitNesse architecture overview: Browser → FitNesse server → fixtures → system under test. A wiki table such as | script | and | check | num results | 3 | calls the fixture method public int numResults() {...} and compares its return value with the expected one. → "calling Java methods from wiki", compare return values → integrates with REST, Jenkins…
FitNesse: An exemplary test
FitNesse: Exemplary test source
!path /home/inovex/lib/*.jar
| script | Hadoop |
| upload | viewLog.csv | to hdfs | /testdata/ |
| hadoop job from jar | viewLog.jar | [...] |
| show | job output |
| check | number of output files | 3 |
FitNesse: Hadoop fixture Java code
public class Hadoop {
    public boolean uploadToHdfs(String localFile, String remoteFile) {...}
    public boolean hadoopJobFromJar(String jar, String input, String output) {...}
    public String jobOutput() {...}
    public String numberOfOutputFiles() {...}
}
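How a wiki row ends up as a Java call can be sketched in a few lines. In a script table, the cells at even positions (0, 2, 4, ...) are joined into the method name and the cells in between are passed as arguments, so the upload row from the test source resolves to uploadToHdfs with two arguments. The class below only illustrates that naming convention; ScriptRowSketch and its methods are hypothetical names, not FitNesse code, and the handling of keywords such as check or show is left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: splits a script-table row into a method name and its arguments. */
final class ScriptRowSketch {

    /** Joins the name cells (positions 0, 2, 4, ...) and camel-cases their words. */
    static String methodName(List<String> cells) {
        StringBuilder name = new StringBuilder();
        for (int i = 0; i < cells.size(); i += 2) {
            for (String word : cells.get(i).trim().split("\\s+")) {
                if (word.isEmpty()) {
                    continue;
                }
                // The first word starts lower-case, every later word upper-case.
                char first = word.charAt(0);
                name.append(name.length() == 0
                        ? Character.toLowerCase(first)
                        : Character.toUpperCase(first));
                name.append(word.substring(1));
            }
        }
        return name.toString();
    }

    /** Collects the argument cells (positions 1, 3, 5, ...). */
    static List<String> arguments(List<String> cells) {
        List<String> args = new ArrayList<>();
        for (int i = 1; i < cells.size(); i += 2) {
            args.add(cells.get(i).trim());
        }
        return args;
    }
}
```

For the row | upload | viewLog.csv | to hdfs | /testdata/ | this yields the name uploadToHdfs and the arguments viewLog.csv and /testdata/, which matches the fixture signature on the slide.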
The road to Big Data QA: FitNesse wiki test execution!; Big Data QA is different!; result inspection; test data definition / selection; job & workflow control
Test data: CSV
Test data: Thrift ‣ Big Data: efficient data transfer among heterogeneous sources ‣ Define the interface via IDL, compiler for many languages
Test data: Real-world data ‣ Dev/test Hadoop cluster: hardware identical to prod, but fewer nodes ‣ (Random/biased) sampling, e.g. on a daily basis ‣ Feedback loop: identify "special cases" from real data and include them in the (manual) data definition ‣ Gradually increase test coverage / artefact quality
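The slide does not say how the daily sample is drawn. One common way to take a uniform random sample from a log stream of unknown length in a single pass is reservoir sampling, sketched below. LogSamplerSketch is a hypothetical name, and the choice of algorithm is an assumption made for illustration, not the approach used in the talk.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

/** Illustration only: uniform one-pass sampling of log records (reservoir sampling). */
final class LogSamplerSketch {

    /** Returns at most k records, each input record being kept with equal probability. */
    static <T> List<T> sample(Iterator<T> records, int k, Random random) {
        List<T> reservoir = new ArrayList<>(k);
        long seen = 0;
        while (records.hasNext()) {
            T record = records.next();
            seen++;
            if (reservoir.size() < k) {
                // Fill phase: the first k records always enter the reservoir.
                reservoir.add(record);
            } else {
                // Pick a slot in [0, seen); the record is kept only if the slot is below k.
                long slot = (long) (random.nextDouble() * seen);
                if (slot < k) {
                    reservoir.set((int) slot, record);
                }
            }
        }
        return reservoir;
    }
}
```

Passing a seeded Random makes the sample reproducible, which helps when a sampled "special case" should later be copied into the manually defined test data.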
The road to Big Data QA: FitNesse wiki test execution!; Big Data QA is different!; result inspection; Define CSV / thrift / real-world test data!; job & workflow control
Job control: Swiss army knife: Shell ‣ Execute arbitrary (shell) commands ‣ Mainly a wrapper around org.apache.commons.exec.CommandLine
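A shell fixture of this kind can be quite small. The sketch below uses the JDK's ProcessBuilder instead of Commons Exec so that it needs no extra library; that swap, the class name ShellSketch, its method names, and the use of "sh -c" (which assumes a Unix-like host) are all assumptions made for illustration, not the fixture from the talk. It is declared package-private to keep the snippet in one file; a fixture loaded by FitNesse would be a public class.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

/** Illustration only: runs a shell command and keeps its output and exit code. */
final class ShellSketch {
    private String lastOutput = "";
    private int lastExitCode = -1;

    /** Runs the command via "sh -c"; returns true when it exits with code 0. */
    public boolean execute(String command) {
        try {
            Process process = new ProcessBuilder("sh", "-c", command)
                    .redirectErrorStream(true) // merge stderr into stdout
                    .start();
            // Read all output before waiting so a full pipe cannot block the child.
            lastOutput = new String(process.getInputStream().readAllBytes(),
                    StandardCharsets.UTF_8).trim();
            lastExitCode = process.waitFor();
        } catch (IOException e) {
            lastOutput = String.valueOf(e.getMessage());
            lastExitCode = -1;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            lastOutput = "interrupted";
            lastExitCode = -1;
        }
        return lastExitCode == 0;
    }

    /** Output (stdout and stderr) of the most recent command. */
    public String output() {
        return lastOutput;
    }

    /** Exit code of the most recent command, or -1 if it could not be run. */
    public int exitCode() {
        return lastExitCode;
    }
}
```

Returning a boolean from execute lets a wiki row turn green or red directly, while output and exitCode remain available for show and check rows.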
Job control: Hadoop fixture ‣ Hide complexity from test authors ‣ "Define" an appropriate test language via (Java) method names ‣ Re-use other fixtures (Shell, …) internally
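The idea of hiding command-line details behind readable method names, while delegating execution to another fixture, might look roughly like this. HadoopFixtureSketch and the injected runner are hypothetical; the method names follow the slide, but their bodies are an assumption, since the slide elides them. The commands built are the standard "hadoop fs -put" and "hadoop jar" invocations.

```java
import java.util.List;
import java.util.function.ToIntFunction;

/** Illustration only: test-author-friendly methods that delegate to a command runner. */
final class HadoopFixtureSketch {
    // Executes a command line and returns its exit code, e.g. backed by a shell fixture.
    private final ToIntFunction<List<String>> runner;

    HadoopFixtureSketch(ToIntFunction<List<String>> runner) {
        this.runner = runner;
    }

    /** Copies a local file into HDFS; true when the command exits with code 0. */
    public boolean uploadToHdfs(String localFile, String remoteFile) {
        return runner.applyAsInt(List.of("hadoop", "fs", "-put", localFile, remoteFile)) == 0;
    }

    /** Submits the job contained in the jar; true when the command exits with code 0. */
    public boolean hadoopJobFromJar(String jar, String input, String output) {
        return runner.applyAsInt(List.of("hadoop", "jar", jar, input, output)) == 0;
    }
}
```

Injecting the runner keeps the fixture testable without a cluster: a test can pass a lambda that records the command instead of executing it.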
Job control: Workflows & suites ‣ FitNesse allows grouping tests into suites ‣ Can be used to simulate MR processing chains (e.g. MR job 1, MR job 2) ‣ SetupSuite / TearDownSuite for creating / destroying test conditions ‣ Tests can still be executed individually
The road to Big Data QA: FitNesse wiki test execution!; Big Data QA is different!; result inspection; Define CSV / thrift / real-world data!; Use suites & fixtures for jobs/workflows!
Results: Data warehouse / Hive ‣ Validate RDBMS contents (via JDBC), e.g. for checking the final result ‣ Or use Hive + Hive server to query raw data
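A JDBC-based check of the final result could be as simple as counting rows in a target table. The sketch below uses only java.sql from the JDK; WarehouseCheckSketch, its method names, and the identifier check are assumptions for illustration, and the JDBC URL and driver depend on the actual warehouse.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.regex.Pattern;

/** Illustration only: validates warehouse contents by counting rows over JDBC. */
final class WarehouseCheckSketch {
    // Table names come from wiki pages, so only plain identifiers are accepted.
    private static final Pattern IDENTIFIER = Pattern.compile("[A-Za-z_][A-Za-z0-9_.]*");

    private final String jdbcUrl;

    WarehouseCheckSketch(String jdbcUrl) {
        this.jdbcUrl = jdbcUrl;
    }

    /** Builds the count query, rejecting anything that is not a plain table name. */
    static String countQuery(String table) {
        if (!IDENTIFIER.matcher(table).matches()) {
            throw new IllegalArgumentException("not a table name: " + table);
        }
        return "SELECT COUNT(*) FROM " + table;
    }

    /** Opens a connection, runs the count query and returns the number of rows. */
    public long numberOfRowsIn(String table) throws SQLException {
        try (Connection connection = DriverManager.getConnection(jdbcUrl);
             Statement statement = connection.createStatement();
             ResultSet result = statement.executeQuery(countQuery(table))) {
            result.next(); // COUNT(*) always yields exactly one row
            return result.getLong(1);
        }
    }
}
```

From a wiki page, a hypothetical row such as | check | number of rows in | some_table | 3 | would then compare the count with the expected value.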
Results: Pig ‣ Execute arbitrary Pig commands from a wiki page ‣ Inspect e.g. binary intermediate results (avro, …)
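Results: Pig fixture extends PigServer
import java.io.IOException;
import org.apache.pig.PigServer;

public class PigConsole extends PigServer {
    // The constructor and the AVRO_STORAGE_LOADER constant are not shown on the slide.

    // Registers "<alias> = LOAD '<filename>' USING <loader>;" with the Pig server.
    public void loadAvroFileUsingAlias(String filename, String alias) throws IOException {
        this.registerQuery(alias + " = LOAD '" + filename + "' USING " + AVRO_STORAGE_LOADER + ";");
    }
}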
Results: Server infrastructure. A FitNesse master holds TestEnvironments and TestConfigurations per project (ProjA, ProjB) and stage (dev, qs, live). Config and tests are imported / edited remotely. FitNesse slaves run in the Dev, QS and Live environments (e.g. ProjA and ProjB on dev, ProjA on qs, ProjA on live).
Thank you! dominik.benz@inovex.de. FitNesse wiki test execution!; Big Data QA is different!; Inspect results via Pig/Hive; Define CSV / thrift / real-world data!; Use suites & fixtures for jobs/workflows!