You can evaluate the performance of assessments, questions, and distractors by using the available ExamSoft reports. You can see how your questions are performing and make improvements in your questions and/or your curriculum.
Psychometrics are mathematical statistics whose interpretation can vary widely depending on the purpose of the question. No single statistic can give you the entire picture of how an item should be interpreted. Likewise, there are no ideal psychometrics that hold for every item. The best practice is always to evaluate all of the available information about an item while accounting for the intention of the question.
A question that a faculty member created to be "easy" may serve that specific purpose, but it will have very different psychometric results than a question created to be discriminating. Additionally, take into account any outside factors that could be influencing the statistics (including the content delivery method, conflicting information given to the exam-takers, the testing environment, etc.).
Lastly, always keep in mind how many exam-takers took the exam. If the number is very low, the statistics are significantly less reliable than they would be for a large group of exam-takers.
Exam Statistics
Exam statistics are determined from the performance of all exam-takers on all questions of the exam. This data can be found on a number of reports (including the Item Analysis and Summary Reports).
- Mean: The mean is the average score of all exam-takers who took the exam. It is found by dividing the sum of the scores by the total number of exam-takers who took the exam.
- Median: The median is the score that marks the midpoint of all exam-takers' scores. When the scores are ordered from lowest to highest, half of the scores fall at or below the median and half fall at or above it.
- Standard Deviation: The standard deviation indicates the variation of exam scores. A low standard deviation indicates that exam-takers' scores were all close to the average, while a high standard deviation indicates a large variation in scores.
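As an illustration, these three statistics can be computed with a short Python sketch (the scores below are hypothetical, not from any real exam):

```python
# Hypothetical list of exam scores, used only to illustrate the
# mean, median, and standard deviation described above.
scores = [72, 85, 90, 64, 78, 88, 95, 70]

# Mean: sum of the scores divided by the number of exam-takers.
mean = sum(scores) / len(scores)

# Median: the middle value of the ordered scores (the average of the
# two middle values when the count is even).
ordered = sorted(scores)
n = len(ordered)
if n % 2 == 1:
    median = ordered[n // 2]
else:
    median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2

# Population standard deviation: spread of the scores around the mean.
variance = sum((s - mean) ** 2 for s in scores) / n
std_dev = variance ** 0.5

print(mean, median, round(std_dev, 2))  # 80.25 81.5 10.23
```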
- Reliability KR-20 (Kuder-Richardson Formula) (0.00 to 1.00): The KR-20 measures internal consistency reliability. It takes into account all dichotomous questions and how many exam-takers answered each question correctly. A high KR-20 indicates that if the same exam-takers took the same assessment, there is a higher chance that the results would be the same. A low KR-20 means that the results would be more likely to differ.
Note: If your institution is participating in the Insights beta, the KR20 metric will be replaced with Cronbach's Alpha.
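As a minimal sketch, the standard KR-20 formula, k/(k−1) × (1 − Σpq/σ²), can be computed on a hypothetical 0/1 response matrix like the one below (ExamSoft's exact implementation may differ):

```python
# Hypothetical 0/1 (incorrect/correct) response matrix: each row is one
# exam-taker, each column one dichotomous question. Values are made up.
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 1, 0],
]

n = len(responses)                 # number of exam-takers
k = len(responses[0])              # number of questions

# Population variance of the exam-takers' total scores.
totals = [sum(row) for row in responses]
mean_total = sum(totals) / n
var_total = sum((t - mean_total) ** 2 for t in totals) / n

# Sum of p*(1-p) over items, where p is the proportion answering correctly.
pq = 0.0
for j in range(k):
    p = sum(row[j] for row in responses) / n
    pq += p * (1 - p)

kr20 = (k / (k - 1)) * (1 - pq / var_total)
print(round(kr20, 3))
```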

- Cronbach's Alpha: Cronbach's Alpha can be used with continuous and non-dichotomous data. In particular, it can be used for tests with partial credit, tests where questions carry different weights, and questionnaires using a Likert scale. Like the KR-20, Cronbach's Alpha will generally increase as the correlation between the test items increases. For this reason, the coefficient measures the internal consistency of the test.
Cronbach's Alpha can range from 0.00 to 1.00. A commonly accepted rule of thumb is that an alpha of 0.70 indicates acceptable reliability and 0.80 or higher indicates good reliability. Very high reliability (over 0.95) is not necessarily desirable, as this indicates that the items may actually be redundant. The goal in creating a reliable test is for scores on similar items to be related (internally consistent) but for each to contribute some unique information as well.
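A similar sketch applies to Cronbach's Alpha, using the standard formula k/(k−1) × (1 − Σ item variances / total variance) on hypothetical partial-credit scores (again, ExamSoft's exact implementation may differ):

```python
# Hypothetical item scores with partial credit (rows are exam-takers,
# columns are items); the values are made up for illustration.
scores = [
    [2.0, 1.0, 3.0],
    [1.0, 0.5, 2.0],
    [2.0, 1.0, 1.0],
    [0.0, 0.0, 1.0],
]

k = len(scores[0])                 # number of items

def variance(values):
    """Population variance of a list of numbers."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

# Variance of each item's scores, and of the exam-takers' total scores.
item_vars = [variance([row[j] for row in scores]) for j in range(k)]
total_var = variance([sum(row) for row in scores])

# Cronbach's Alpha: k/(k-1) * (1 - sum of item variances / total variance).
alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
print(round(alpha, 3))
```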
Question Statistics
Question statistics assess a single question. They can be found on the Item Analysis report as well as in the question history for each question, and they can be calculated based on question performance on a single assessment or across the life of the question.
- Difficulty Index (p) (0.00 to 1.00): The difficulty index measures the proportion of exam-takers who answered an item correctly. A higher value indicates that a greater proportion of exam-takers responded to the item correctly; a lower value indicates that fewer exam-takers got the question correct.
Note: If your institution is participating in the Insights beta, the Difficulty Index will be calculated as the number of exam-takers who got the question correct divided by the total number of exam-takers. The Enterprise Portal calculates assessment difficulty as the number of exam-takers who got the question correct divided by the number of exam-takers who attempted the question.
In addition to the overall difficulty index, there is an Upper Difficulty Index and a Lower Difficulty Index. These follow the same format as above but only take into account the top 27% and bottom 27% of scorers, respectively. Thus the Upper/Lower Difficulty Index reflects what percentage of the top/bottom 27% of scorers on an exam answered the question correctly. Using 27% to define the upper and lower groups is an industry standard in item analysis.
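As a sketch of the overall, upper, and lower difficulty calculations on hypothetical results for one question:

```python
# Hypothetical results for one question: (total exam score, whether the
# exam-taker answered this question correctly).
results = [
    (95, True), (90, True), (88, True), (84, True), (80, False),
    (76, True), (72, False), (70, True), (65, False), (60, False),
    (55, False),
]

n = len(results)

# Overall difficulty index: proportion of exam-takers answering correctly.
difficulty = sum(1 for _, correct in results if correct) / n

# Upper/lower groups: the top and bottom 27% of scorers.
group_size = max(1, round(n * 0.27))
ranked = sorted(results, key=lambda r: r[0], reverse=True)
upper = ranked[:group_size]
lower = ranked[-group_size:]

upper_difficulty = sum(1 for _, c in upper if c) / group_size
lower_difficulty = sum(1 for _, c in lower if c) / group_size
print(round(difficulty, 2), upper_difficulty, lower_difficulty)
```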
- Discrimination Index (-1.00 to 1.00): The discrimination index of a question shows the difference in performance between the upper 27% and the lower 27%. It is determined by subtracting the difficulty index of the lower 27% from the difficulty index of the upper 27%. A score close to 0 indicates that the upper and lower exam-takers performed similarly on the question. As the discrimination index becomes more negative, more of the lower performers than upper performers got the question correct; as it becomes more positive, more of the upper performers got the question correct.
Determining an acceptable item discrimination score depends on the intention of the item. For example, if it is intended to be a mastery-level item, then a score as low as 0 to 0.2 is acceptable. If it is intended to be a highly discriminating item, target a score of 0.25 to 0.5.
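For example, with some hypothetical upper- and lower-group difficulty values:

```python
# Hypothetical upper- and lower-group difficulty indexes for three items,
# used to compute discrimination = upper difficulty - lower difficulty.
items = {
    "review item":         (0.95, 0.90),  # nearly everyone correct
    "discriminating item": (0.85, 0.40),  # separates upper from lower
    "flawed item":         (0.50, 0.70),  # lower group outperformed upper
}

discrimination = {
    name: round(upper - lower, 2) for name, (upper, lower) in items.items()
}
print(discrimination)
```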
- Point Bi-Serial (-1.00 to 1.00): The point bi-serial measures the correlation between an exam-taker's response on a given item and how the exam-taker performed on the overall exam.
Note: If your institution is participating in the Insights beta, the Point Bi-Serial calculation will use the sample standard deviation √(∑(xᵢ − μ)² / (n − 1)) instead of the population standard deviation √(∑(xᵢ − μ)² / n).
Insights does not calculate the Point Bi-Serial correlation value for questions in a posting where partial credit was awarded to at least one student.
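A sketch of one standard form of the point bi-serial calculation, on hypothetical data and using the sample standard deviation as in the note above (the exact Insights computation may differ in other details):

```python
import math

# Hypothetical data for one question: 0/1 item responses and the matching
# total exam scores for eight exam-takers.
item = [1, 1, 1, 0, 1, 0, 0, 0]            # 1 = answered this question correctly
totals = [92, 88, 85, 80, 78, 70, 65, 60]  # overall exam scores

n = len(totals)
mean_all = sum(totals) / n
sd = math.sqrt(sum((t - mean_all) ** 2 for t in totals) / (n - 1))

correct = [t for t, i in zip(totals, item) if i == 1]
p = len(correct) / n                       # proportion who got the item right
mean_correct = sum(correct) / len(correct)

# One standard form of the point bi-serial correlation coefficient.
point_biserial = (mean_correct - mean_all) / sd * math.sqrt(p / (1 - p))
print(round(point_biserial, 3))
```

Here the item response tracks the total score closely, so the coefficient comes out strongly positive.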
  - Greater than 0: Indicates a positive correlation between performance on the question and performance on the exam. Exam-takers who succeeded on the exam also succeeded on this question, while exam-takers who had trouble on the exam also had trouble on this question. A point bi-serial closer to 1 indicates a very strong correlation; success or failure on this question is a strong predictor of success or failure on the exam as a whole.
  - Near 0: There was little correlation between performance on this item and performance on the test as a whole. Perhaps the question covered material outside the other learning outcomes, so that all or most exam-takers struggled with it; or perhaps it was a review item that all or most exam-takers were able to answer correctly.
  - Less than 0: Indicates a negative correlation between performance on the question and performance on the exam. Exam-takers who succeeded on this question had trouble with the exam as a whole, while exam-takers who had trouble with this question did well on the exam as a whole.
- Response Frequencies: This details the percentage of exam-takers who selected each answer choice. If an incorrect distractor is receiving a very large share of the responses, assess whether that was the intention for the question or whether something in that answer choice is causing confusion. Additionally, an answer choice with a very low proportion of responses may need to be reviewed as well.
When reviewing response frequencies, you may also wish to examine the distribution of responses from your top 27% and bottom 27%. If a large portion of your top 27% picked the same incorrect answer choice, it could indicate the need for further review.
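For example, tallying hypothetical answer selections for one question:

```python
from collections import Counter

# Hypothetical answer selections for one question, where "A" is the keyed
# correct answer; the distribution is made up for illustration.
responses = ["A", "B", "A", "A", "C", "B", "A", "D", "B", "A"]

counts = Counter(responses)
n = len(responses)

# Proportion of exam-takers selecting each answer choice. A distractor
# drawing a large share (here, "B") may warrant review.
frequencies = {choice: counts[choice] / n for choice in sorted(counts)}
print(frequencies)  # {'A': 0.5, 'B': 0.3, 'C': 0.1, 'D': 0.1}
```

The same tally could be repeated separately for the top 27% and bottom 27% of scorers to see which group each distractor is drawing from.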