by Deborah Lynn Stirling
Teachers attempting to create a technology-rich learning environment need to know how to evaluate software and how to use a given program for instructional purposes. Software, within the context of information technology, is one of several types of media; others include CD-ROM, interactive video, and sound. Many teachers integrating new technologies such as software programs into their teaching practice rely on evaluations to help them make instructional decisions. The purpose of this paper is to present key features of various methods of evaluating software that yield useful information for instructional decisions.
The most important component of evaluation is the evaluation of outcomes (Castellan, 1993). Knowing what students learned from the program and how well they learned it provides convincing evidence about the software's effectiveness. Although conducting an evaluation that measures learners' achievement seems appropriate and necessary, the time and cost of field testing are prohibitive factors.
Experimental studies are one of three data sources for software evaluation; the other two are expert opinion and user surveys (Anderson & Chen, 1997). All three types of data sources will be presented, and the process of evaluation discussed in terms of what is questioned, how evidence is obtained, and how judgments are made. The final factor used to judge each method is whether the evaluation yields useful information for instructional use.
Experimental Study
Experimental or quasi-experimental studies usually use a pre- and posttest design that includes a control group and/or another treatment group. The flaws inherent in comparative analysis of different media types have been cogently articulated (Salomon & Clark, 1977; Hagler & Knowlton, 1987; Clark, 1983; Niemiec & Walberg, 1987; Belmore, 1983; Welsh & Null, 1991). These researchers concluded that observed effects were attributable more to the teacher's instructional method than to the software itself. In a review of a current list of 56 models for evaluating computer-based learning technologies, 10 used an experimental approach and only four used comparative analysis.
The main focus of this approach is to determine whether the software works; the straightforward question asked is "Does it work?" To answer this question, comparative or descriptive statistical analysis is used to determine effectiveness, which is operationally defined in terms of student achievement. This style of inquiry is referred to as a "black box" study: how students learned, or how the software should be used, is not addressed.
The evidence is obtained by using subjects from the target audience and controlling as many variables as possible. This "controlling" method raises the issue of ecological validity, and some researchers assert that experimental design cannot be applied to evaluation contexts (Borich & Jemelka, 1981). Ransdell (1993) investigated the potential threats to validity, including internal, external, and ecological threats, when conducting software evaluations. The greatest threat to internal validity is the operational definition of the medium as an independent variable. Threats to external and ecological validity revolve around the instructor's style of teaching and how the instructor facilitated the use of the software program. Like Castellan, Ransdell believes that how an instructor uses the software affects learning as much as the software itself. Setting aside these validity threats, the approach does yield evidence of effectiveness, but it does not yield information on facilitated use.
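To make the "Does it work?" analysis concrete, the sketch below shows one way a pre/posttest comparison between a software group and a control group might be computed in Python. The scores, group sizes, and the choice of gain scores with an independent-samples t-test are illustrative assumptions, not a prescribed procedure.

# Hedged sketch: comparing achievement gains for a software group and a control group.
# All scores are hypothetical; a real field test would use actual pre/posttest data.
from scipy import stats

software_pre  = [55, 60, 48, 70, 62, 58]
software_post = [72, 78, 65, 85, 74, 70]
control_pre   = [57, 61, 50, 68, 60, 59]
control_post  = [63, 66, 55, 74, 64, 62]

# Gain scores: how much each student improved from pretest to posttest
software_gains = [post - pre for pre, post in zip(software_pre, software_post)]
control_gains  = [post - pre for pre, post in zip(control_pre, control_post)]

# An independent-samples t-test on the gains answers the black-box question statistically
t_stat, p_value = stats.ttest_ind(software_gains, control_gains)
print(f"mean gain (software): {sum(software_gains) / len(software_gains):.1f}")
print(f"mean gain (control):  {sum(control_gains) / len(control_gains):.1f}")
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")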
Expert Opinion
Experts use a criteria set, or heuristic, based on their professional experience and academic training to develop a checklist. The resulting checklist may remain an internal heuristic or become a checklist instrument. These lists represent the experts' mental model of what effective instructional software should be, and judgment is then based on the unique set of values represented by the questions on their checklists. Checklist or expert review is subjective when no data are collected from the intended learners. Clearinghouses and evaluation organizations make this type of information readily available on the Web, on CD, or in publications. The reliability of an expert review based on a subjective criteria set is lower than that of a review based on a more objective set of criteria (Micceri, Pritchard, & Barrett, 1989), for instance, a field test using learners.
In 1994, Reiser and Kegelmann published an article entitled Evaluating Instructional Software: A Review and Critique of Current Methods. Their review focused on the methods used by evaluation organizations; nine of the 18 organizations reviewed included learners. The authors indicate that the majority of evaluators base their judgment on a holistic view of the software program rather than an atomistic one. Although the evaluators' judgment process was concluded to be holistic, most evaluation methods required evaluators to review program features. Computer program features include content, technical characteristics, documentation, instructional design and learning considerations, the objectives of the software, and the handling of social issues. The questions asked focus on the quality of the standard features, ease of use, and perceived utility.
The use of a standard or common feature list does not resolve the subjectivity issue if learners are excluded. Several investigators have raised the issue that these features are not valid indicators of the instructional effectiveness of the software program (Caldwell, 1983; Jolicoeur & Berger, 1986, 1988a, 1988b; Owston & Dudley-Marling, 1988; Reiser & Dick, 1990; Tucker, 1989). A study conducted by Jolicoeur and Berger (1988b) reported that even subjective ratings by teachers were not valid indicators. To address the apparent flaws of expert opinion or checklist review, several researchers strongly suggest involving learners (Kosmoski, 1984; Dudley-Marling & Owston, 1987; Johnston, 1987; Schueckler & Shuell, 1989; Callison & Haycock, 1988; Reiser & Dick, 1990). These researchers recommended that evaluation organizations include a method of data collection that addresses the question "How does the software affect student learning?" The effectiveness of the existing evaluation organizations is thereby seen as contingent on their ability to assist instructors in selecting effective instructional software. Instructors want to know the bottom line: does the software positively impact learning?
A recent ERIC digest entitled Microcomputer Courseware Evaluation Sources warned that not all evaluation organizations are reliable. Robin Taylor, the author, identified the most reliable sources. Evaluations conducted by the MicroSIFT project were ranked the best: the organization used at least three expert opinions, tested the program with the intended learners, and included guidelines for usage in its evaluations. Another identified publication, by the EPIE Institute and Teachers College Press, Columbia University, was TESS: The Educational Software Selector. Although only one expert evaluator is used, the evaluator is required to test the program with learners, and the strengths and weaknesses of the program are reported as well as its potential use.
Evaluating software before it is used with learners, as in an expert review or an evaluation organization's checklist approach, may be classified as predictive evaluation (Squires & McDougall, 1996). Squires and McDougall (1995) investigated the checklist approach and identified seven problem areas: the weighting of question values; evaluations of programs within the same category (for instance, simulation programs) focus on similarities rather than differences; checklists focus too heavily on technical rather than educational issues; they do not apply to the evaluation of innovative software; they do not factor in varying teaching styles; they do not address follow-up or extension activities by the instructor; and different content areas require unique criteria sets.
The first problem was addressed in 1982 by the MicroSIFT checklist, which required evaluators to rank items relative to perceived importance. Squires and McDougall concluded that software evaluation using checklists and frameworks fails to generate information about use that helps an instructor imagine the program in the intended learning environment. Expert review, however, has the potential to yield useful information if learners are included in the checklist approach; this is especially true if an expert reviewer uses Castellan's (1993) strategic evaluation approach (discussed in the next section).
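As an illustration of the weighting problem, the sketch below shows how checklist items ranked by perceived importance might be combined into a single rating. The items, ratings, and weights are invented for illustration and are not drawn from the MicroSIFT instrument or any published checklist.

# Hedged sketch: combining checklist ratings weighted by perceived importance.
# Items, 1-5 ratings, and weights are hypothetical.
checklist = {
    "content accuracy":     (4, 0.30),   # (rating, weight)
    "instructional design": (3, 0.25),
    "ease of use":          (5, 0.20),
    "documentation":        (4, 0.15),
    "technical quality":    (5, 0.10),
}
weighted_sum = sum(rating * weight for rating, weight in checklist.values())
total_weight = sum(weight for _, weight in checklist.values())
print(f"overall weighted rating: {weighted_sum / total_weight:.2f} out of 5")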
User Surveys
Evaluation approaches that observe the program's use by learners are termed interpretative evaluation. Reiser and Dick (1990) addressed the need for higher evaluation reliability by developing an alternative approach involving the learner; their model offered what they believed to be the first objective method. Questions at the time of their field testing concerned the time and cost involved in implementing the procedure, but they felt that since the field-test results differed from subjective results, the extra time and money were well spent. Two years later, they published a modified version (Zahner, Reiser, Dick, & Gill, 1992). The question their model attempts to answer is "Do students learn the skills that the program is designed to teach?" Data are obtained from pretests, immediate and delayed posttests, and learner attitude surveys.
The original model included conducting a one-on-one evaluation as well as a small-group evaluation. The revised model replaces these two evaluations with a three-on-one evaluation, in which the three chosen students represent high, medium, and low ability levels, respectively. The software is judged effective based on the delayed posttest of achievement.
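A minimal sketch of how the data from such a three-on-one field test might be recorded and summarized is shown below; the student records, scores, and the simple delayed-gain calculation are hypothetical and are not part of the published model.

# Hedged sketch: recording three-on-one field-test results (hypothetical data).
# The revised model bases the judgment of effectiveness on the delayed posttest.
field_test = [
    {"ability": "high",   "pretest": 70, "post_immediate": 92, "post_delayed": 88, "attitude": 5},
    {"ability": "medium", "pretest": 55, "post_immediate": 78, "post_delayed": 70, "attitude": 4},
    {"ability": "low",    "pretest": 40, "post_immediate": 60, "post_delayed": 52, "attitude": 4},
]
for student in field_test:
    delayed_gain = student["post_delayed"] - student["pretest"]
    print(f"{student['ability']:>6} ability: delayed gain = {delayed_gain} points, "
          f"attitude = {student['attitude']}/5")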
In their discussion, they acknowledge that the small sample size of the revised model is problematic. They also note that the teachers involved in this field test had judged the software program valuable, based on an expert review approach, before implementing the revised model; even though the students' performance was modest, the teachers retained their highly favorable subjective opinion. A comparative analysis of the two models found no significant differences.
Their findings show that teachers do in fact judge software on evidence other than student achievement. They conclude by stating that teachers, although pressed for time, can benefit from field testing software within their own classrooms. By taking the time to evaluate software, teachers will discover more uses for the software and better instructional designs for using it.
Anderson and Chen (1997) have developed an econometric model of software evaluation based on data obtained from users. Their method addresses the question of determining the relative importance users ascribe to software attributes. Software users were asked to evaluate each software feature using scaled satisfaction responses ranging from 1 (poor) to 10 (excellent). Their method of equation development, based on principal component analysis, reveals that software satisfaction relies on three factors: function/feature, service/support, and implementation/friendliness. Additionally, their results indicate that ease of use and documentation are two of the three most relevant explanations of software satisfaction.
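The sketch below illustrates the kind of data reduction Anderson and Chen describe: a matrix of 1-10 satisfaction ratings reduced to a few underlying components with principal component analysis. The feature names and ratings are fabricated, and the sketch omits the regression step of their econometric model.

# Hedged sketch: principal component analysis of scaled satisfaction ratings.
# Rows are users, columns are software features; every value is a fabricated 1-10 rating.
import numpy as np
from sklearn.decomposition import PCA

features = ["functions", "service", "support", "implementation", "ease of use", "documentation"]
ratings = np.array([
    [8, 5, 6, 7, 9, 8],
    [6, 4, 5, 6, 7, 7],
    [9, 7, 7, 8, 9, 9],
    [5, 3, 4, 5, 6, 5],
    [7, 6, 6, 7, 8, 8],
])
pca = PCA(n_components=3)
pca.fit(ratings)
print("explained variance ratios:", [f"{v:.0%}" for v in pca.explained_variance_ratio_])
for name, loading in zip(features, pca.components_[0]):
    print(f"{name:>14}: {loading:+.2f} on the first component")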
The theme of Castellan's Evaluating information technology in teaching and learning is that "the effective evaluation of instructional software and technology is a continuing and dynamic process that is never complete and may change in character over time" (1993). He believes that knowing how and when to evaluate student learning is very important and presents problems for teachers. His method of strategic evaluation attempts to answer questions about the technology and its use. His criteria set asks questions about technical accuracy, pedagogical soundness, substantive fidelity, integrative flexibility, and cyclic improvement. The subsets of questions within each criterion are very extensive; although the data yielded would obviously be valuable, especially from a reflective teaching perspective, this questioning approach would not be easily adopted by teachers. His method has theoretical value but not practical application.
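One way a reviewer might make such a criteria set more manageable is to organize it as a small question bank, as in the sketch below; the criterion names come from the text above, but the sample questions under each are invented for illustration.

# Hedged sketch: a question bank organized around Castellan-style strategic criteria.
# Criterion names are from the text; the sample questions are invented.
strategic_criteria = {
    "technical accuracy":      ["Does the program run reliably on the classroom machines?"],
    "pedagogical soundness":   ["Do the activities match the intended learning objectives?"],
    "substantive fidelity":    ["Is the subject-matter content correct and current?"],
    "integrative flexibility": ["Can the program fit varied units and teaching styles?"],
    "cyclic improvement":      ["What should change before the program is used again?"],
}
for criterion, questions in strategic_criteria.items():
    print(criterion)
    for question in questions:
        print(f"  - {question}")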
[Insert the Squires and McDougall situated approach to interpretative evaluation.]
Conclusion
The discussed approaches have their individual strengths and weaknesses.
Weckworth wrote in 1969:
First, there is no one way to do evaluation; second, there is no generic logical structure which will assure a unique 'right' method of choice. Third, evaluation ultimately becomes judgment and will remain so, so long as there is no ultimate criterion for monotonic ordering of priorities; and fourth, the critical element in evaluation is simply: who has the right, i.e., the power, the influence, or the authority to decide. (p. 48)
Software evaluation should be conducted by the instructor. The evaluation method used should yield information about quality, effectiveness, and instructional use.
References
Anderson, E. E., & Chen, Y-M. (1997). Microcomputer software evaluation: An econometric model. Decision Support Systems, 19 (2), 75-92.
Belmore, S. M. (1983). Release from pi: Comparison of traditional and computer modules in an experimental psychology laboratory. Behavior Research Methods & Instrumentation, 15, 191-194.
Borich, G., & Jemelka, R. (1981). Evaluation. In Computer-based instruction: A state-of-the-art assessment (pp. 161-209). New York: Academic Press.
Callison, D., & Haycock, G.(1988). A methodology for student evaluation of educational microcomputer software. Educational Technology, 28 (1), 25-32.
Castellan, N. J. (1993). Evaluating information technology in teaching and learning. Behavior Research Methods, Instruments, & Computers, 25 (2), 233-237.
Clark, R. E. (1983). Reconsidering research on learning from media. Review of Educational Research, 53, 445-459.
Dudley-Marling, C., & Owston, R. D. (1987). The state of educational software: A criterion-based evaluation. Educational Technology, 27 (3), 39-44.
Hagler, P., & Knowlton, J. (1987). Invalid implicit assumption in CBI comparison research. Journal of Computer-Based Instruction, 14, 84-88.
Johnston, V. M. (1987). The evaluation of microcomputer programs: An area of debate. Journal of Computer-Assisted Learning, 3, 40-50.
Kosmoski, P. K. (1984). Educational computing: The burden of insuring quality. Phi Delta Kappan, 66, 244-248.
MicroSIFT (1982). Evaluator's guide for microcomputer-based instructional packages. Eugene, OR: International Council for Computers in Education.
Niemiec, R., & Walberg, H. (1987). Comparative effects of computer-based instruction: A synthesis of reviews. Journal of Educational Computing Research, 3, 19-37.
Ransdell, S. (1993). Educational software evaluation research: Balancing internal, external, and ecological validity. Behavior Research Methods, Instruments, & Computers, 25 (2), 228-232.
Reiser, R. A., & Dick, W. (1990). Evaluating instructional software. Educational Technology Research and Development, 38 (3), 43-50.
Reiser, R. A., & Kegelmann, H. W. (1994). Evaluating instructional software: A review and critique of current methods. Educational Technology Research and Development, 42 (3), 63-69.
Salomon, G., & Clark, R. (1977). Reexamining the methodology of research on media and technology in education. Review of Educational Research, 47, 99-120.
Schueckler, L. M., & Shuell, T. J. (1989). A comparison of software evaluation forms and reviews. Journal of Educational Computing Research, 5(1), 17-33.
Squires, D., & McDougall, A. (1995). A critical examination of the checklist approach in software selection. Journal of Educational Computing Research, 12 (3), 263-274.
Squires, D., & McDougall, A. (1996). Software evaluation: A situated approach. Journal of Computer Assisted Learning, 12 (3), 146-161.
Weckworth, V. E. (1969). A conceptual model versus the real world of health care service delivery. In Hopkins, C.E. (Ed.) Outcomes Conference: Methodology of identifying, measuring and evaluating outcomes of Health Service Programs, Systems and Subsystems. Rockville, MD: Office of Scientific and Technical Information, NCH SRD, 39-62.
Welsh, J. A., & Null, C. H. (1991). The effects of computer-based instruction on college students' comprehension of classic research. Behavior Research Methods, Instruments, & Computers, 23, 301-305.
Zahner, J. E., Reiser, R. A., Dick, W. & Gill, B. (1992). Evaluating instructional software: A simplified model. Educational Technology Research and Development, 40 (3), 55-62.