You can evaluate the performance of assessments, questions, and distractors by using the available ExamSoft reports. You can see how your questions are performing and make improvements in your questions and/or your curriculum.
Psychometrics are mathematical statistics whose interpretation can vary widely depending on the purpose of the question. No single statistic can give you the entire picture of how an item should be interpreted. Likewise, there are no ideal psychometrics that hold for every item. The best practice is always to evaluate all of the available information about an item while accounting for the intention of the question.
A question that a faculty member created to be "easy" may serve that specific purpose, but it will have very different psychometric results than a question created to be discriminating. Additionally, take into account any outside factors that could be influencing the statistics (including the content delivery method, conflicting information given to the exam-takers, the testing environment, etc.).
Lastly, always keep in mind how many exam-takers took the exam. If the number is very low, the statistics are significantly less reliable than they would be for a large group of exam-takers.
Exam Statistics
Exam statistics are determined from the performance of all exam-takers on all questions of the exam. This data can be found on a number of reports (including the Item Analysis and Summary Reports).
- Mean: The mean is the average score of all exam-takers who took the exam. It is found by dividing the sum of the scores by the total number of exam-takers who took the exam.
- Median: The median is the score that marks the midpoint of all exam-takers' scores. When the scores are ordered from lowest to highest, half of the scores fall at or below the median and half fall at or above it.
- Standard Deviation: The standard deviation indicates the variation of exam scores. A low standard deviation indicates that exam-takers' scores were all close to the average, while a high standard deviation indicates a large variation in scores.
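As an illustration, these three statistics can be computed with a short Python sketch (the scores below are hypothetical, not from any real exam):

```python
# Hypothetical list of exam scores, used only to illustrate the
# mean, median, and standard deviation described above.
scores = [72, 85, 90, 64, 78, 88, 95, 70]

# Mean: sum of the scores divided by the number of exam-takers.
mean = sum(scores) / len(scores)

# Median: the middle value of the ordered scores (the average of the
# two middle values when the count is even).
ordered = sorted(scores)
n = len(ordered)
if n % 2 == 1:
    median = ordered[n // 2]
else:
    median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2

# Population standard deviation: spread of the scores around the mean.
variance = sum((s - mean) ** 2 for s in scores) / n
std_dev = variance ** 0.5

print(mean, median, round(std_dev, 2))  # 80.25 81.5 10.23
```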
- Reliability KR-20 (Kuder-Richardson Formula) (0.00 to 1.00): The KR-20 measures internal consistency reliability. It takes into account all dichotomous questions and how many exam-takers answered each question correctly. A high KR-20 indicates that if the same exam-takers took the same assessment, there is a higher chance that the results would be the same. A low KR-20 means that the results would be more likely to differ.
Note: If your institution is participating in the Insights beta, the KR20 metric will be replaced with Cronbach's Alpha.
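As a minimal sketch, the standard KR-20 formula, k/(k−1) × (1 − Σpq/σ²), can be computed on a hypothetical 0/1 response matrix like the one below (ExamSoft's exact implementation may differ):

```python
# Hypothetical 0/1 (incorrect/correct) response matrix: each row is one
# exam-taker, each column one dichotomous question. Values are made up.
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 1, 0],
]

n = len(responses)                 # number of exam-takers
k = len(responses[0])              # number of questions

# Population variance of the exam-takers' total scores.
totals = [sum(row) for row in responses]
mean_total = sum(totals) / n
var_total = sum((t - mean_total) ** 2 for t in totals) / n

# Sum of p*(1-p) over items, where p is the proportion answering correctly.
pq = 0.0
for j in range(k):
    p = sum(row[j] for row in responses) / n
    pq += p * (1 - p)

kr20 = (k / (k - 1)) * (1 - pq / var_total)
print(round(kr20, 3))
```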

- Cronbach's Alpha: Cronbach's Alpha can be used with continuous and non-dichotomous data. In particular, it can be used for tests with partial credit, tests where questions carry different weights, and questionnaires using a Likert scale. Like the KR-20, Cronbach's Alpha will generally increase as the correlation between the test items increases. For this reason, the coefficient measures the internal consistency of the test.
Cronbach's Alpha can range from 0.00 to 1.00. A commonly accepted rule of thumb is that an alpha of 0.70 indicates acceptable reliability and 0.80 or higher indicates good reliability. Very high reliability (over 0.95) is not necessarily desirable, as this indicates that the items may actually be redundant. The goal in creating a reliable test is for scores on similar items to be related (internally consistent) but for each to contribute some unique information as well.
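A similar sketch applies to Cronbach's Alpha, using the standard formula k/(k−1) × (1 − Σ item variances / total variance) on hypothetical partial-credit scores (again, ExamSoft's exact implementation may differ):

```python
# Hypothetical item scores with partial credit (rows are exam-takers,
# columns are items); the values are made up for illustration.
scores = [
    [2.0, 1.0, 3.0],
    [1.0, 0.5, 2.0],
    [2.0, 1.0, 1.0],
    [0.0, 0.0, 1.0],
]

k = len(scores[0])                 # number of items

def variance(values):
    """Population variance of a list of numbers."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

# Variance of each item's scores, and of the exam-takers' total scores.
item_vars = [variance([row[j] for row in scores]) for j in range(k)]
total_var = variance([sum(row) for row in scores])

# Cronbach's Alpha: k/(k-1) * (1 - sum of item variances / total variance).
alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
print(round(alpha, 3))
```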
Question Statistics
Question statistics assess a single question. They can be found on the Item Analysis report as well as in the question history for each question, and they can be calculated based on question performance on a single assessment or across the life of the question.
- Difficulty Index (p) (0.00 to 1.00): The difficulty index measures the proportion of exam-takers who answered an item correctly. A higher value indicates that a greater proportion of exam-takers responded to the item correctly; a lower value indicates that fewer exam-takers got the question correct.
Note: If your institution is participating in the Insights beta, the Difficulty Index will be calculated as the number of exam-takers who got the question correct divided by the total number of exam-takers. The Enterprise Portal calculates assessment difficulty as the number of exam-takers who got the question correct divided by the number of exam-takers who attempted the question.
In addition to the overall difficulty index, there is an Upper Difficulty Index and a Lower Difficulty Index. These follow the same format as above but only take into account the top 27% and bottom 27% of scorers, respectively. Thus the Upper/Lower Difficulty Index reflects what percentage of the top/bottom 27% of scorers on an exam answered the question correctly. Using 27% to define the upper and lower groups is an industry standard in item analysis.
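As a sketch of the overall, upper, and lower difficulty calculations on hypothetical results for one question:

```python
# Hypothetical results for one question: (total exam score, whether the
# exam-taker answered this question correctly).
results = [
    (95, True), (90, True), (88, True), (84, True), (80, False),
    (76, True), (72, False), (70, True), (65, False), (60, False),
    (55, False),
]

n = len(results)

# Overall difficulty index: proportion of exam-takers answering correctly.
difficulty = sum(1 for _, correct in results if correct) / n

# Upper/lower groups: the top and bottom 27% of scorers.
group_size = max(1, round(n * 0.27))
ranked = sorted(results, key=lambda r: r[0], reverse=True)
upper = ranked[:group_size]
lower = ranked[-group_size:]

upper_difficulty = sum(1 for _, c in upper if c) / group_size
lower_difficulty = sum(1 for _, c in lower if c) / group_size
print(round(difficulty, 2), upper_difficulty, lower_difficulty)
```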
- Discrimination Index (-1.00 to 1.00): The discrimination index of a question shows the difference in performance between the upper 27% and the lower 27%. It is determined by subtracting the difficulty index of the lower 27% from the difficulty index of the upper 27%. A score close to 0 indicates that the upper and lower exam-takers performed similarly on the question. As the discrimination index becomes more negative, more of the lower performers than upper performers got the question correct; as it becomes more positive, more of the upper performers got the question correct.
Determining an acceptable item discrimination score depends on the intention of the item. For example, if it is intended to be a mastery-level item, then a score as low as 0 to 0.2 is acceptable. If it is intended to be a highly discriminating item, target a score of 0.25 to 0.5.
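For example, with some hypothetical upper- and lower-group difficulty values:

```python
# Hypothetical upper- and lower-group difficulty indexes for three items,
# used to compute discrimination = upper difficulty - lower difficulty.
items = {
    "review item":         (0.95, 0.90),  # nearly everyone correct
    "discriminating item": (0.85, 0.40),  # separates upper from lower
    "flawed item":         (0.50, 0.70),  # lower group outperformed upper
}

discrimination = {
    name: round(upper - lower, 2) for name, (upper, lower) in items.items()
}
print(discrimination)
```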
- Point Bi-Serial (-1.00 to 1.00): The point bi-serial measures the correlation between an exam-taker's response on a given item and how the exam-taker performed on the overall exam.
Note: If your institution is participating in the Insights beta, the Point Bi-Serial calculation will use the sample standard deviation √(∑(xᵢ − μ)² / (n − 1)) instead of the population standard deviation √(∑(xᵢ − μ)² / n).
Insights does not calculate the Point Bi-Serial correlation value for questions in a posting where partial credit was awarded to at least one student.
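A sketch of one standard form of the point bi-serial calculation, on hypothetical data and using the sample standard deviation as in the note above (the exact Insights computation may differ in other details):

```python
import math

# Hypothetical data for one question: 0/1 item responses and the matching
# total exam scores for eight exam-takers.
item = [1, 1, 1, 0, 1, 0, 0, 0]            # 1 = answered this question correctly
totals = [92, 88, 85, 80, 78, 70, 65, 60]  # overall exam scores

n = len(totals)
mean_all = sum(totals) / n
sd = math.sqrt(sum((t - mean_all) ** 2 for t in totals) / (n - 1))

correct = [t for t, i in zip(totals, item) if i == 1]
p = len(correct) / n                       # proportion who got the item right
mean_correct = sum(correct) / len(correct)

# One standard form of the point bi-serial correlation coefficient.
point_biserial = (mean_correct - mean_all) / sd * math.sqrt(p / (1 - p))
print(round(point_biserial, 3))
```

Here the item response tracks the total score closely, so the coefficient comes out strongly positive.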
  - Greater than 0: Indicates a positive correlation between performance on the question and performance on the exam. Exam-takers who succeeded on the exam also succeeded on this question, while exam-takers who had trouble on the exam also had trouble on this question. A point bi-serial closer to 1 indicates a very strong correlation; success or failure on this question is a strong predictor of success or failure on the exam as a whole.
  - Near 0: There was little correlation between performance on this item and performance on the test as a whole. Perhaps the question covered material outside the other learning outcomes, so that all or most exam-takers struggled with it; or perhaps it was a review item that all or most exam-takers were able to answer correctly.
  - Less than 0: Indicates a negative correlation between performance on the question and performance on the exam. Exam-takers who succeeded on this question had trouble with the exam as a whole, while exam-takers who had trouble with this question did well on the exam as a whole.
- Response Frequencies: This details the percentage of exam-takers who selected each answer choice. If an incorrect distractor is receiving a very large share of the responses, assess whether that was the intention for the question or whether something in that answer choice is causing confusion. Additionally, an answer choice with a very low proportion of responses may need to be reviewed as well.
When reviewing response frequencies, you may also wish to examine the distribution of responses from your top 27% and bottom 27%. If a large portion of your top 27% picked the same incorrect answer choice, it could indicate the need for further review.
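For example, tallying hypothetical answer selections for one question:

```python
from collections import Counter

# Hypothetical answer selections for one question, where "A" is the keyed
# correct answer; the distribution is made up for illustration.
responses = ["A", "B", "A", "A", "C", "B", "A", "D", "B", "A"]

counts = Counter(responses)
n = len(responses)

# Proportion of exam-takers selecting each answer choice. A distractor
# drawing a large share (here, "B") may warrant review.
frequencies = {choice: counts[choice] / n for choice in sorted(counts)}
print(frequencies)  # {'A': 0.5, 'B': 0.3, 'C': 0.1, 'D': 0.1}
```

The same tally could be repeated separately for the top 27% and bottom 27% of scorers to see which group each distractor is drawing from.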