I want to determine the reliability between two raters. They make "yes/no" decisions about a large number of variables for a large number of participants. My data file has one row per participant; each pair of columns holds the coding decisions of Rater A and Rater B for a given variable. Solution: modeling agreement (e.g., via log-linear or other models) is usually an informative approach.
Inter-rater reliability indices are used in many fields, such as computational linguistics, psychology, and medicine; however, the interpretation of the resulting values and the choice of appropriate thresholds lack context and are often guided only by arbitrary "rules of thumb", or are not addressed at all. Our objective for this work was to develop a method for determining the relationship between inter-rater agreement and error, so that values, thresholds, and reliability can be interpreted meaningfully.
To define perfect disagreement, the film raters in this case would have to hold opposite views, ideally at the extremes. In a 2 x 2 table it is possible to define perfect disagreement, because each positive rating can be paired with a specific negative rating (e.g., Love vs. Hate), but what about a 3 x 3 or larger square table? In these cases there are more ways to disagree, so complete disagreement quickly becomes harder to define. Total disagreement would require a situation that minimizes agreement in every combination, and in larger tables this would probably mean zero counts in certain cells, because it is impossible to have perfect disagreement on all combinations at the same time.
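The point about perfect disagreement can be checked numerically. The following is a minimal sketch, assuming Cohen's kappa as the agreement measure; the function name and the three example tables are made up for illustration and do not come from the original discussion.

```python
def kappa_from_table(table):
    """Cohen's kappa from a square table of counts.

    Rows are Rater A's categories, columns are Rater B's categories.
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    # Observed agreement: share of items on the diagonal.
    p_o = sum(table[i][i] for i in range(k)) / n
    # Chance agreement: sum of products of matching marginal proportions.
    p_e = sum(row_totals[i] * col_totals[i] for i in range(k)) / n**2
    return (p_o - p_e) / (1 - p_e)


# Hypothetical tables, all with an empty diagonal (the raters never agree).
two_by_two = [[0, 50],
              [50, 0]]
three_spread = [[0, 10, 10],
                [10, 0, 10],
                [10, 10, 0]]
three_one_category_unused = [[0, 30, 0],
                             [30, 0, 0],
                             [0, 0, 0]]

for name, table in [("2 x 2", two_by_two),
                    ("3 x 3, disagreement spread evenly", three_spread),
                    ("3 x 3, third category unused", three_one_category_unused)]:
    print(name, round(kappa_from_table(table), 3))
```

In the 2 x 2 table kappa reaches its minimum of -1. In the 3 x 3 table with disagreement spread over all off-diagonal cells it only reaches about -0.5, and it gets back to -1 only when one category is left empty, which is the "zero counts in certain cells" situation described above.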
In assessing agreement between pairs of evaluators, agreement was poor between evaluators Cph and Aph (κ = 0.162, RTOG scale) and between Cph and Bph (0.197 and 0.183 on the WHO and RTOG scales, respectively). However, with both the WHO and RTOG scales there was good, statistically significant agreement between A and Aph (P < .001) (Table 44). If the observed agreement is due only to chance, i.e. if the ratings are completely independent, then each diagonal element is the product of the two corresponding marginals. Pairwise agreement among evaluators was fair to moderate (RTOG scale: 0.408, 95% confidence interval [CI] 0.370-0.431; WHO scale: 0.559, 95% CI 0.529-0.590). In addition, the overall agreement rates were 10.2% and 29.2%, respectively. As for absolute overall agreement between evaluators across phototypes and types of surgery, there was fair agreement on the RTOG scale for patients with phototype V or VI and for patients with a mastectomy (3.7% and 8.8%, respectively).
This quantitative example is provided to illustrate the application of the method, and because it was a practical challenge that we encountered in our own research; its solution served as the basis for this general method.
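The statement above about chance agreement (each diagonal element being the product of the two marginals) can be written out explicitly. The notation below is an assumption introduced for illustration: p_{ij} is the proportion of items that the first rater places in category i and the second rater in category j.

```latex
\begin{align}
  p_{i+} &= \sum_j p_{ij}, \qquad p_{+i} = \sum_j p_{ji}
    && \text{(marginal proportions of the two raters)} \\
  p_{ii}^{\text{chance}} &= p_{i+}\, p_{+i}
    && \text{(diagonal cell under independence)} \\
  p_o &= \sum_i p_{ii}, \qquad p_e = \sum_i p_{i+}\, p_{+i}
    && \text{(observed and chance agreement)} \\
  \kappa &= \frac{p_o - p_e}{1 - p_e}
    && \text{(Cohen's kappa)}
\end{align}
```

With this definition, κ is 0 when observed agreement equals chance agreement and 1 when the raters agree on every item.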
Here again, it is important to note that the specifics of how the method was applied (the agreement measure, the error measure, and so on) are not constraints inherent in the method; our reliability index, error measure, error magnitude, and so on were chosen as best suited to our specific evaluation task.
Bassam, you can use Cohen's kappa to determine the agreement between two raters A and B, where A is the gold standard. If you have another rater C, you can also use Cohen's kappa to compare A with C. I'm not sure how to use Cohen's kappa in your case with 100 subjects and 30,000 epochs. If the epochs are nested within subjects, your data may be the measurements for the 30,000 epochs.
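For the data layout described earlier (one row per participant, one pair of columns per variable), Cohen's kappa can be computed separately for each variable. The sketch below is only an illustration: the column naming scheme (`v1_A`, `v1_B`, ...) and the four example rows are hypothetical, and it uses plain Python rather than any particular statistics package.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two equally long sequences of category labels."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(ratings_a)
    # Observed agreement: share of items both raters coded identically.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement from each rater's marginal category frequencies.
    freq_a = Counter(ratings_a)
    freq_b = Counter(ratings_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / n**2
    if p_e == 1:
        # Both raters used one and the same category throughout:
        # kappa is undefined (0 / 0).
        return float("nan")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical data: one dict per participant, columns "<variable>_A" and
# "<variable>_B" holding the decisions of Rater A and Rater B.
rows = [
    {"v1_A": "yes", "v1_B": "yes", "v2_A": "yes", "v2_B": "yes"},
    {"v1_A": "yes", "v1_B": "yes", "v2_A": "no", "v2_B": "no"},
    {"v1_A": "no", "v1_B": "no", "v2_A": "yes", "v2_B": "no"},
    {"v1_A": "no", "v1_B": "no", "v2_A": "no", "v2_B": "yes"},
]
variables = ["v1", "v2"]

kappas = {
    v: cohen_kappa([r[f"{v}_A"] for r in rows], [r[f"{v}_B"] for r in rows])
    for v in variables
}
print(kappas)
```

In this toy data the raters agree on every participant for `v1` (kappa of 1) and agree on `v2` exactly as often as chance would predict (kappa of 0). Returning NaN when both raters use a single category is a design choice of this sketch; other implementations may handle that case differently.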