Software Performance Testing Based on Workload Characterization
Elaine Weyuker, Alberto Avritzer, Joe Kondek, Danielle Liu
AT&T Labs
Workload Characterization
A probability distribution associated with the input domain that describes how the system is used when it is operational in the field. Also called an operational profile or operational distribution. It is derived by monitoring field usage. We have used it to select (correctness) test cases, predict risk, assess software reliability, predict scalability, and select performance test cases, among other things.
Steps in Characterizing the Workload
• Model the software system.
  – Identify key parameters that characterize the system's behavior.
  – Get the granularity right.
• Collect data while the system is operational, or from a related operational system.
• Analyze the data and determine the probability distribution.
System Description
An automated customer care system, built by another company, that can be accessed by both customer care agents and customers. It contains a large database with a web browser front-end and a cache facility. For this system, data was collected for 2½ months and page hits were analyzed at 15-minute intervals.
Implementation Information
The system was implemented as an extension to the http web server daemon, with a mutex semaphore used to implement database locking. The system is single threaded. Queued processes executed spin lock operations until the semaphore was free.
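The talk does not include code, but the locking pattern it describes can be illustrated. The minimal Python sketch below (our illustration, not the vendor's implementation; run_db_query is a stand-in) contrasts a handler that spins on a shared database lock with one that blocks: the spinning handler consumes CPU for the entire wait, which is the behavior measured later with the SE Toolkit.

```python
import time
import multiprocessing as mp

db_lock = mp.Lock()          # plays the role of the database mutex semaphore

def run_db_query(query):
    # Stand-in for the real database access (about 5 s on a cache miss,
    # per the measurements later in the talk).
    time.sleep(0.1)
    return f"result for {query}"

def handle_request_spinning(query):
    # Spin lock: repeatedly try the semaphore without blocking.
    # The waiting process burns a full CPU while doing no useful work.
    while not db_lock.acquire(False):
        pass
    try:
        return run_db_query(query)
    finally:
        db_lock.release()

def handle_request_blocking(query):
    # Blocking alternative: the waiting process sleeps until the lock is
    # free, so the wait costs essentially no CPU.
    with db_lock:
        return run_db_query(query)
```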
Before Performance Testing
Prior to doing performance testing, users were complaining about poor performance, and the database was "hanging" several times a day. The hypothesis was that these problems were capacity-related; the vendor was contacted but was unable to solve the problems.
Performance Testing Goals
Help them determine:
• Which resources were overloaded.
• Effects of the database size on performance.
• Effects of single vs. multiple transactions.
• Effects of the cache hit rate.
• Effects of the number of http servers.
System Information
• The web-server database was modeled as an M/D/1 queue.
• The arrival process was assumed to be Poisson.
• The cache hit rate was determined to be central to the system's performance. It ranged between 80% and 87%, with the average being 85%.
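The slides name the model but give no formulas. A minimal sketch of the M/D/1 mean response time (Pollaczek-Khinchine with deterministic service) is shown below; it blends the roughly 100 ms cache-hit and 5 s database delays reported later in the talk into a single service time at the 85% hit rate. That blend is a simplification of our own, since a hit/miss mix is really closer to M/G/1.

```python
def md1_response_time(arrival_rate, service_time):
    """Mean response time of an M/D/1 queue (Poisson arrivals,
    deterministic service time S):  T = S + rho*S / (2*(1 - rho)),
    where rho = arrival_rate * S is the utilization."""
    rho = arrival_rate * service_time
    if rho >= 1.0:
        return float("inf")                 # unstable past 100% utilization
    return service_time + rho * service_time / (2.0 * (1.0 - rho))

# Illustrative parameters (assumed blend, not from the authors' tooling):
hit_rate = 0.85
service = hit_rate * 0.1 + (1 - hit_rate) * 5.0      # ~0.835 s per request
for load in (0.5, 1.0, 2.0):                          # hits/sec
    print(f"{load} hits/sec -> {md1_response_time(load, service):.2f} s")
```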
Distribution of User Requests
Page Type       Agent Requests   Customer Requests
Static Page     50%              23%
Error Code      10%              23%
Search Form     7%               30%
Search Result   8%               16%
Other Pages     25%              8%
Computing the Cache Hit Probability
Page Type     Frequency   Prob Occur   Cache Prob   Weighted Prob
Home          2707        0.2236       0.9996       0.2235
Static        2515        0.2077       0.9407       0.1954
Error Code    1316        0.1087       0.6915       0.0752
Screen Shot   1076        0.0889       0.6078       0.0540
Search Res.   1035        0.0855       0.0463       0.0040
Search Form   832         0.0687       0.9218       0.0633
Index         494         0.0408       0.9797       0.0400
Other         2132        0.1761       0.9484       0.1670
Total         12,106      1.0000                    0.8224
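The overall hit probability in the table is the occurrence-weighted sum of the per-page cache probabilities. A short sketch reproducing it from the figures above:

```python
# (frequency, cache hit probability) per page type, taken from the table
page_stats = {
    "Home":        (2707, 0.9996),
    "Static":      (2515, 0.9407),
    "Error Code":  (1316, 0.6915),
    "Screen Shot": (1076, 0.6078),
    "Search Res.": (1035, 0.0463),
    "Search Form": ( 832, 0.9218),
    "Index":       ( 494, 0.9797),
    "Other":       (2132, 0.9484),
}

total = sum(freq for freq, _ in page_stats.values())
overall = sum((freq / total) * hit for freq, hit in page_stats.values())
print(round(overall, 4))     # ~0.8224, the table's total weighted probability
```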
System Observations
Heavier workload for agents on weekdays, with peak hours in the afternoon and little day-to-day variation. Customer peak hours occurred during the evening. There was little change in workload as users became familiar with the system. (Agents are already expert users and execute a well-defined process, while individual customers tend to use the system rarely and therefore also maintain the same usage pattern over time.)
What We Achieved
• Characterized the agent and customer workload, and used it as a basis for performance testing.
• Identified performance limits and, as a result, detected a software bottleneck.
• Provided recommendations for performance improvement.
• Increased the understanding of the use of data collection for performance issues.
What We Learned
The system was running at only about 20% utilization. The Cisco routers were not properly load balanced. The spin lock operations consumed CPU time, which led to a steep increase in the response time. (We used the Sun SE Toolkit to record the number of spin locks.)
No Caching [chart: average CPU cost (sec) per request vs. load (hits/sec), Customer and Agent]
No Caching [chart: average response time (ms) vs. load (hits/sec), Customer and Agent]
All Requests Retrieved From Cache [chart: average response time (ms) vs. load (hits/sec), Customer and Agent]
Simulated 85% Cache Hit Rate [chart: average response time (ms) vs. load (hits/sec), Customer and Agent]
In Particular
Delay strongly depends on caching:
  Found in cache: ~100 ms
  Retrieved from database: ~5 sec
Current available capacity:
  Customer: 2 hits/sec
  Agent: 2.5 hits/sec
Average demand:
  Customer: 10,000 hits/day = 0.12 hits/sec
  Agent: 25,000 hits/day = 0.29 hits/sec
Busy-hour demand:
  Customer: 784 hits/hour = 0.22 hits/sec
  Agent: 2,228 hits/hour = 0.62 hits/sec
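A quick arithmetic check of these figures (our illustration only): at the 85% average hit rate, the expected per-request delay is about 0.85 × 0.1 s + 0.15 × 5 s ≈ 0.84 s, and the busy-hour demand sits well below the measured capacity.

```python
cache_delay, db_delay, hit_rate = 0.1, 5.0, 0.85   # seconds, from this slide
expected = hit_rate * cache_delay + (1 - hit_rate) * db_delay
print(f"expected per-request delay ~ {expected:.2f} s")      # ~0.84 s

# Busy-hour demand as a fraction of the measured available capacity
print(f"customer: {0.22 / 2.0:.0%} of capacity")             # ~11%
print(f"agent:    {0.62 / 2.5:.0%} of capacity")             # ~25%
```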
Effect of Database Size: Customer – Cache Off [chart: average response time (ms) vs. load (hits/sec) for small (200 MB), medium (400 MB), and large (600 MB) databases]
Effect of Database Size: Agent – Cache Off [chart: average response time (ms) vs. load (hits/sec) for small (200 MB), medium (400 MB), and large (600 MB) databases]
Adding Servers
For this system, n servers meant n service queues, each operating independently, and hence less lock contention. This led to a significant increase in the workload that could be handled. However, since each server maintains its own caching mechanism, there was a distinct decrease in the cache hit probability and an associated increase in the response time. The response time is dominated by the cache hit probability when the load is low; as the load increases, the queuing for the database also increases.
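One rough way to explore that trade-off numerically (our sketch: the slides report the hit-rate drop but give no formula, so the per-server hit rate here is an assumed input):

```python
def md1_rt(load, service):
    # Mean M/D/1 response time, as in the earlier sketch.
    rho = load * service
    return float("inf") if rho >= 1 else service * (1 + rho / (2 * (1 - rho)))

def n_server_rt(total_load, n_servers, per_server_hit_rate):
    # n independent server queues, each seeing 1/n of the traffic but a
    # lower cache hit rate because the cache is split across servers.
    service = per_server_hit_rate * 0.1 + (1 - per_server_hit_rate) * 5.0
    return md1_rt(total_load / n_servers, service)

print(n_server_rt(2.0, 1, 0.85))   # one server: overloaded at 2 hits/sec
print(n_server_rt(2.0, 3, 0.75))   # three servers, assumed lower hit rate
```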
Multiple Servers [chart: average response time (ms) vs. load (hits/sec) for 1 through 6 web servers]
Recommendations
• Projects should collect traffic data on a daily basis.
• Performance measurements should be made while the software is being tested in the lab, both for new systems and when changes are being made.
• Workload-based testing is a very cost-effective way to do performance testing.
THE END
No Caching [chart: average response time (ms) vs. load (hits/sec) by page type: Error, Result, Form, Static]
Cache Off [chart: average CPU cost (sec) per request vs. load (hits/sec) by page type: Error, Result, Form, Static]