COS 368 Graphical User Interface Design Usability Testing
(Sources: Rubin; Shneiderman; Lewis & Rieman; Stone, Jarrett, Woodruffe, and Minocha)

Goals of usability testing
• Create a historical record of usability benchmarks for future releases, to help ensure that future versions work at least as well as previous ones. (Have you ever gotten a new version of a product and found it more difficult to do the things you used to be able to do easily?)
• Minimize the cost of service and hotline calls.
• Acquire a competitive edge, since usability has become a market differentiator for products. Thus, the advertising and marketing people can distort your results to really gain a competitive edge (Welty).
• Minimize risk. It is risky to leave usability testing until after product release: if people hate the product you may never be able to get into the market. Usability testing can make you more confident that the product has a market.
Limitations of testing
• Testing is always an artificial situation. Even when testing in the field, the user knows that the test is occurring, and the very act of conducting the test can affect the results.
• Test results do not prove that the product is usable.
• Participants are rarely fully representative of the target population.
• Testing is not always the best technique to use. Expert evaluation may be faster and better at very early stages of development.
True experimental design (which is not what we will be doing)
• Formulate a hypothesis.
• Randomly choose participants (using a systematic and reproducible method).
• Use tight controls on participant background, the test conditions, etc.
• Use control groups.
• Use enough participants to get statistically significant differences between groups.
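As an illustration of the last point, here is a minimal stdlib-only sketch (with made-up data; the notes give none) of the kind of analysis a true experimental design implies: comparing task times for two groups with Welch's two-sample t statistic. A real study would also compute a p-value (e.g. with scipy.stats.ttest_ind); here we just form t and compare it informally against roughly 2.0, an approximate 5%-level critical value for samples of this size.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    va, vb = variance(a), variance(b)    # sample variances (n-1 divisor)
    se = sqrt(va / len(a) + vb / len(b)) # standard error of the mean difference
    return (mean(a) - mean(b)) / se

# Hypothetical task-completion times (seconds) for two interface designs.
design_a = [52, 48, 55, 50, 47, 53]
design_b = [61, 58, 64, 59, 57, 63]

t = welch_t(design_a, design_b)
print(f"t = {t:.2f}")   # |t| well above ~2.0 suggests a real group difference
```

With too few participants, |t| would typically fall below the critical value and no significant difference could be claimed, which is why the sample-size bullet matters.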
Basic elements of usability testing
• Development of problem statements or test objectives rather than hypotheses.
• Use of a representative sample of end users, which may or may not be randomly chosen.
• An attempt to represent the actual work environment.
• Observation of end users using the product.
• Collection of quantitative and qualitative performance and preference measures.
• Recommendation of improvements to the design of the product.
Evaluating design without users
• Expert reviewers. Hire experts and get their opinions. You must get the right experts; this may be expensive but can quickly give useful feedback.
• Cognitive walkthrough. Make up scenarios and follow them through with the prototype. Everyone on the design team participates (similar to code walkthroughs in software engineering). Expert participation can be very useful.
• Heuristic analysis. Review conformity to established guidelines, and be sure your interface consistently follows your own guidelines. (Note: here, heuristics and guidelines are the same thing.)
Getting Participants
• Participants must be as representative as possible of the target population. Actual members of the target group are best.
• Be sure that participants show some of the same variation in background and ability that the actual users have. Managers may, depending on circumstances, give you their best or worst employees to test; you should try to get a cross-section. Commonly the top 25% of users perform twice as well as the bottom 25%. (In software, the best programmers generate 10 times as much final code as the worst.)
• Participants may have to be hired if actual users are not available. Students and temporary workers are often used; retirees are another pool of possible participants. Again, be careful to get representative participants. Some age groups may be difficult to obtain, and a mix of novice and expert users may also be needed.
Participant training
• Often some level of training is necessary before a participant is ready for testing. For example, if participants have never used a mouse, starting them directly on a mouse-oriented interface will skew the results. You might have them play a mouse-oriented game (e.g. solitaire) before the test.
• Usually some training is needed; walk-up, kiosk-style interfaces would be an exception.
Ethical Concerns in Working with Human Participants
• People can easily be embarrassed and stressed by an experiment. This can happen inadvertently through tasks that are too difficult, too little time, etc. Participants must never be demeaned by being made to feel incompetent or inadequate. Keep humor and silliness to a minimum; they can be misinterpreted. Emphasize that the software is being tested, not the participant: any errors are the fault of the software, not the user.
• It is not a good idea to have the implementers/designers on the team running the experiment. They may be overly invested in the system, and they may want to defend their choices and see the participant as wrong.
• Participants must be informed volunteers. Do not force or coerce people, especially employees and friends, into participating.
• Participants must feel comfortable about stopping their participation. They must be informed before the experiment starts that they can stop (not "quit") at any time, and they must not be made to feel any qualms about stopping. Treat them with respect. They know that evaluation is taking place; they must never feel that they are the ones failing to measure up. (E.g. never say anything like, "Everyone else has been able to do these simple problems.")
Ethical Concerns in Working with Human Participants (continued)
• Using new software is difficult for anyone; being observed and measured while doing it is far worse.
• Privacy of the collected data is very important. Data should be kept private, and the participants should be told that it will be. Assurance that no names will be associated with the data in presentations, papers, etc. is important.
• If using videotape, it may be useful to position the camera so the participant's face is not shown. Assuring the participant that only the experimenters will view the tape can also be reassuring.
• You must keep your word about privacy and the rest of these ethical considerations.
• When finished, thank the participant.
Test tasks
• Use real tasks. They may be slightly modified due to user background, prototype limitations, and time constraints. Previous task analysis, observations, surveys, and ongoing input from real users should help identify such tasks.
• E.g. a word processor could be tested by having a person type in a document and then make a given set of changes to it. Requiring the use of unusual symbols in the text could force deeper use of the word processor. This test would suit clerical workers; professionals who compose their own documents would have to be tested differently.
• The participants must be well informed about the test tasks. They must understand the scenario.
The experiment (an overview)
• Preparation is all-important. A pilot test is very useful for a user test of any size and can significantly reduce the unexpected problems that often occur. The main test is usually difficult and/or expensive to set up; the pilot helps it run as smoothly as possible.
• If using a computer, the computer system must be adequate for the test. A very knowledgeable systems person must be on hand to deal with hardware and software problems. (Having the system go down in the middle of an experiment can destroy the experiment.)
• The experimental space should, in most cases, be quiet, private, and comfortable, with interruptions minimized. Everyone should leave their cell phones at the door: if the experimenter takes a call, it signals that the call is more important than the participant and the test.
• It is helpful to isolate the computer system being used so outside problems cannot influence the experiment.
• The experimenter must be as invisible as possible during the experiment. Body language and facial expressions can send messages to the participants; be careful.
Experimental Results
There are two types of results: quantitative and qualitative. Psychological and human-factors testing usually emphasizes analyzable, quantitative data, which implies that multiple, large groups of participants are used.

Quantitative results
• Time to perform a task.
• Number of errors made in performing the task.
• Time required to learn the interface to a certain level of expertise (as determined by a test that must be passed).
• Number of tasks done correctly (usually in a specified amount of time).
• Number of screens visited.
• Number of keystrokes and mouse moves needed to complete a task.
• Etc.
• Remember, especially with quantitative results, that the measures are for a specific task or tasks meant to mimic the real world. It is unlikely that the task mix used in the experiment will accurately capture the real-world task mix. These quantities can therefore be used for internal statistical analysis, but may not be appropriate for comparison with systems not explicitly studied in the experiment.
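The measures above can be captured with very little machinery. This is a minimal sketch (the class names and sample numbers are assumptions, not from the notes) of a per-participant session log that records task time, errors, and keystrokes, then summarizes them:

```python
from dataclasses import dataclass, field
from statistics import mean, median

@dataclass
class TaskRecord:
    task: str
    seconds: float      # time to perform the task
    errors: int         # errors made during the task
    keystrokes: int     # keystrokes/mouse actions used

@dataclass
class Session:
    participant: str
    records: list = field(default_factory=list)

    def log(self, task, seconds, errors, keystrokes):
        self.records.append(TaskRecord(task, seconds, errors, keystrokes))

    def summary(self):
        times = [r.seconds for r in self.records]
        return {
            "tasks": len(self.records),
            "mean_time_s": mean(times),
            "median_time_s": median(times),
            "total_errors": sum(r.errors for r in self.records),
        }

s = Session("P01")
s.log("open document", 42.0, 1, 30)
s.log("apply edits", 95.5, 3, 120)
s.log("insert symbol", 61.0, 2, 18)
print(s.summary())
```

Keeping the raw per-task records (not just the summary) is what makes the internal statistical analysis mentioned above possible later.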
Qualitative Results
• Usability studies are usually more interested in qualitative results. Concentrating on measurements while paying little attention to process can make the results hard to interpret: does it take a long time for participants to do a specific task because of the instructions, the terminology, unfamiliar hardware, or the interface itself?
• Remember: these results are only proxies for real-world use. It is unlikely that the task mix in the experiment will be the same as in real use. The results are like the miles-per-gallon ratings posted on new cars: a basis for comparison, but not necessarily a prediction of the mileage you will get.
Usability testing (source: Handbook of Usability Testing: How to Plan, Design, and Conduct Effective Tests. Jeffrey Rubin. Wiley, 1994)
This text gives detailed descriptions of user testing.

Four types of tests

1. Exploratory Test
• When - Early in the development cycle, after the user profile and task analysis have been done and the preliminary design stage has started, but long before the functional specification is complete. (The functional specification is the blueprint for the functionality of the product: it describes what the product does and what tasks the user performs with it, and often gives a detailed specification of the targeted user population.) The prototype itself can serve as a low-level functional specification. (In COS 368 this corresponds to the testing that might be done on the in-class prototype.)
• Objective - Explore the effectiveness of the preliminary design concepts. See whether the designers' proposed conceptual model matches the user's, and whether the overall layout and organization of the product are usable. Can the user appropriately manipulate objects? Can the user navigate from a screen to the desired goal? Does the product do what the user wants it to do? Does the required user background match the background the developers assumed? This test works on the skeleton of the product.
• Methodology - Use a horizontal (shallow) prototype of just the first layer with some functionality, perhaps just one layer of menus below the interface. The participant attempts to perform at least the first portions of some tasks; if more functionality is simulated, the participant should try it out on a real task.
• Notes - Do not wait too long to perform this sort of test with a real or paper prototype. If the design has progressed too far before testing occurs, it can be difficult or impossible to bring it back to the table.
2. Assessment Test
• When - Early or midway through the product development cycle, after the fundamental (high-level, organizational) design has been established. This is the most common type of testing. In COS 368 this would be the level of the assigned usability test.
• Objective - Evaluate much of the actual functionality of the product using the prototype. This test works on the meat of the product, not just the skeleton: you are testing how well a user can perform a reasonable set of actual tasks. Problems are noted and deficiencies identified.
• Methodology - The user performs tasks. The test monitor has minimal interaction with the participant. Some quantitative measures may be collected. Tests are those described in previous lectures.
3. Validation Test (aka verification test)
• When - Late in the life cycle, near product release time, to verify that the product is usable. Often uses a computer prototype. The form and functionality have been fairly well determined, but the implementers need to see whether they have missed any important points.
• Objective - Test that the product meets the usability standards defined earlier within the company. These standards often include performance criteria such as response time and the accuracy with which the user can operate the product. The results should be recorded so that subsequent versions can be tested against them and shown to perform at least as well. This test can also check how all the components work together: the functionality, the interface, the help system, and the manuals. In the case of a disastrous validation test, the product release could be delayed. Sadly, many companies might prefer not to know about major problems in a released product, but they could benefit after release by having their help desk ready to deal with the problem(s).
• Methodology - This probably needs to be a more rigorous test, because you are doing a variety of performance tests that are quantitative in nature. You must decide, for example, what level of accuracy the user must attain. If 90% of the user tasks are completed satisfactorily, is this enough? What about 80%? At what point do the results dictate a delay in release?
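The pass/fail mechanics of a validation criterion are simple once the threshold is chosen. A hypothetical sketch (the 90% figure echoes the example above; the threshold itself is a policy decision made before the test, not something the test computes):

```python
def completion_rate(outcomes):
    """outcomes: one boolean per attempted task (True = completed satisfactorily)."""
    return sum(outcomes) / len(outcomes)

def meets_criterion(outcomes, threshold=0.90):
    """Compare the observed rate against the pre-set release criterion."""
    return completion_rate(outcomes) >= threshold

results = [True] * 17 + [False] * 3    # assumed data: 17 of 20 tasks completed
rate = completion_rate(results)
print(f"completion rate {rate:.0%}, release criterion met: {meets_criterion(results)}")
```

Here 85% falls short of the 90% criterion, which is exactly the situation where the team must decide whether the results dictate a delay in release.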
4. Comparison Test
• When - At any time. It can be used early to test different prototypes resulting from parallel design, in the middle of development to compare the effectiveness of single elements (such as whether butcons or plain text buttons are better for a specific use), and at the end to compare against a competitive product. We may try to compare student prototypes with each other.
• Objective - Determine the comparative advantages of different designs and features. Can also be used to compare your product with those of competitors.
• Methodology - The experiments can be very formal if quantitative measurements are being taken and statistically valid results are to be obtained, as might be the case if the results were also to be used for advertising. Less formal experiments can determine the better of several design alternatives or the difference between specific features. Rubin notes that when basic designs are compared, the resulting design is often a hybrid of the alternatives; for this reason he suggests starting with a set of "wildly differing alternatives", since cross-pollination among them may result in a vastly superior product.
Setting up a test environment
• Many big companies have high-powered, well-appointed, well-equipped, expensive labs. (Welty has visited those at IBM in San Jose, CA, and Binghamton, NY. In fact these were so expensive to run that IBM started giving grants to faculty members to run experiments outside the testing labs; Welty got such a grant.)
• This is not necessary. In fact it may be best to start small and let the necessity and success of usability testing dictate how big and fancy the unit grows. Often a single room, even a non-dedicated one, is sufficient for testing. Equipment need not be extensive either: paper prototypes require little equipment, and actual prototypes need only a computer. Recording setups can vary from a simple audio recorder or video camera to multiple fixed cameras, one-way mirrors, etc.
The testing team
A large enterprise may have a large group of people involved in testing its products. Others may have just one or two people who must play all the roles.

The test roles
• Test Monitor/Administrator - Has primary responsibility for all that occurs. Will probably at least interact initially with test participants, and more than likely will conduct the actual test.
• Data Logger - Records critical activities and events, e.g. which screens are accessed. May record the times at which interesting things happen for later review on videotape. Often assigns codes to screens, activities, etc. so that recording is easier than the extensive writing that might otherwise be needed.
• Timers - Keep track of the time it takes participants to perform specific activities. This can be done from the videotape of the test.
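The Data Logger's coding scheme amounts to a small lookup table plus a timestamped event stream. A hypothetical sketch (the codes and timings are invented for illustration):

```python
import time

CODES = {                      # assumed coding scheme, not from the notes
    "S1": "main menu screen",
    "S2": "search screen",
    "E1": "wrong menu choice",
    "H1": "asked for help",
}

log = []

def record(code, t=None):
    """Append (elapsed seconds, code, meaning) to the session log."""
    log.append((t if t is not None else time.monotonic(), code, CODES[code]))

# During a live session the logger just types codes as events happen;
# timestamps are fixed here so the example is deterministic.
for t, code in [(0.0, "S1"), (12.4, "E1"), (15.0, "S2"), (40.2, "H1")]:
    record(code, t)

errors = sum(1 for _, c, _ in log if c.startswith("E"))
print(f"{len(log)} events logged, {errors} error event(s)")
```

Because each entry carries a time, the same log also serves the Timers' role and can be lined up against the videotape afterward.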
• Video Recording Operator - Must try to keep the pertinent items in focus, usually the product and the interaction between the user and the product. Takes care of labeling and storing tapes. (A well-placed tripod can handle some of this, and another team member the rest.)
• Product/Technical Expert(s) - If testing actual software, someone has to be there in the event of a system crash.
• Computer Simulator(s) - Simulate the action of the prototype by changing screens, putting up dialog boxes, calculating totals, etc. This can be quite a challenge and requires a fair amount of planning so that all the items that will appear on the board are immediately available when needed. Note: they must be available when the participant needs them, not just when the experimenters think the user will need them.
• Additional Testing Roles - You may need people to simulate a hotline operator, a pizza delivery person, etc.
• Test Observers - Usually people on the design team. It is best if many design team members are there to see directly what the issues with the product are. It can also change the attitude of team members who doubt the utility of user testing (though it may affect attitudes negatively if the test is not well run).