# Measuring agreement between healthcare survey instruments using mutual information


## Abstract

### Background

Healthcare researchers often use multiple healthcare survey instruments to examine a particular patient symptom. The use of multiple instruments can pose some interesting research questions, such as whether the outcomes produced by the different instruments are in agreement. We tackle this problem using information theory, focusing on mutual information to compare outcomes from multiple healthcare survey instruments.

### Methods

We review existing methods of measuring agreement/disagreement between the instruments and suggest a procedure that utilizes mutual information to quantitatively measure the amount of information shared by outcomes from multiple healthcare survey instruments. We also include worked examples to explain the approach.

### Results

As a case study, we employ the suggested procedure to analyze multiple healthcare survey instruments used for detecting delirium superimposed on dementia (DSD) in community-dwelling older adults. In addition, several examples are used to assess the mutual information technique in comparison with other measures, such as odds ratio and Cohen’s kappa.

### Conclusions

Analysis of mutual information can be useful in explaining agreement/disagreement between multiple instruments. The suggested approach provides new insights into and potential improvements for the application of healthcare survey instruments.

## Keywords

Healthcare survey instrument; Agreement; Mutual information; Delirium superimposed on dementia

## Background

Numerous healthcare survey instruments exist to identify or evaluate individual health status. Many are used to diagnose a particular symptom, serving as a diagnostic survey instrument. Since such diagnostic survey instruments are noninvasive, several instruments can be used on the same patient. Multiple healthcare survey instruments may be used to corroborate a preliminary conclusion when two or more instruments target a particular phenomenon. Techniques for interpreting diagnostic results collectively can be found in several studies [1, 2, 3]. For example, by utilizing the area under a receiver operating characteristic (ROC) curve, one can combine multiple results to increase diagnostic accuracy [2]. It is also possible to examine the potential benefits of adding another diagnostic result to the original results [1]. Generally, a healthcare researcher uses the reference standard instrument (gold standard when error-free) to adequately explain a patient’s true health status or to validate newly developed instruments. In such instances, diagnostic results can be combined to test the level of agreement with a reference standard and validate a newly developed diagnostic instrument [4]. To summarize, the use of multiple instruments suggests some interesting research questions, such as how outcomes examined by different instruments can be interpreted collectively.

In this paper we present a procedure that quantitatively compares and evaluates outcomes from multiple healthcare survey instruments using information theoretic measures, in particular mutual information. Conventionally, the odds ratio or Cohen’s kappa has been used to determine agreement/disagreement among assessments. The present procedure differs from existing techniques in that it uses mutual information, which has previously been used to reveal association among results. Specifically, the core of the procedure consists of measuring mutual information and checking its significance among multiple healthcare survey instruments. The advantage of this procedure over existing methods is that it provides well-bounded, responsive and easy-to-use measures. More importantly, when it is combined with other information theoretic measures, the approach has the potential to allow a broader, richer and more precise identification of meaningful information in the data collected from multiple survey instruments. With the suggested approach, we compare multiple healthcare survey instruments used for detecting delirium.

Delirium is common and deadly in persons with dementia. It results in a more rapid downward trajectory in functional outcomes that can lead to institutionalization. Therefore, delirium must be detected quickly in order to prevent further functional decline and the high healthcare costs associated with institutionalization. Family caregivers, because of their close relationship to the person with dementia, are in an excellent position to detect delirium in the home environment. In light of this, we aim to use the suggested approach to validate an assessment that was developed for family members to administer at home. In addition, we apply the suggested approach to other external studies for comparison purposes.

The layout of this paper is as follows. In the ‘Methods’ section, we review existing measures that quantify agreement/disagreement in outcomes from multiple instruments, and suggest mutual information as a possible measure for this purpose. To this end, we suggest a procedure that uses mutual information to compare multiple instruments. In the ‘Results’ section we use the suggested framework to compare and analyze a number of instruments that were used in a pilot study for detecting delirium. In addition, we apply the suggested approach to other studies that compare multiple healthcare survey instruments. Finally, through several illustrations, we compare mutual information with other competing measures that have been used in conventional studies. The ‘Discussion’ section considers the validity of the FAM-CAM used in the pilot study and discusses some of the benefits of mutual information, exploring the applicability and limitations of the suggested approach.

## Methods

Before we review existing measures, we first distinguish ‘association’ from ‘agreement/disagreement.’ An association among data can be interpreted as dependency among the data. Dependency exists when one element of the data changes and one or more other elements of data also change. Even if the change is opposite, we still say that dependency exists among the data, since one change affects the other changes (this is sometimes referred to as ‘negative association’). In this paper, we restrict the use of the term agreement (disagreement) to positive association (negative association) and vice versa.

### Existing measures

Cohen’s kappa is defined as

\[ \kappa =\frac{P(a)-P(e)}{1-P(e)} \]

where *P(a)* refers to the observed probability of agreement while *P(e)* is the expected probability of agreement. Although Cohen’s kappa is widely used, it is only valid in the case of two independent raters [6]. For multiple raters, Fleiss’ kappa can be considered for evaluating agreement, but it provides relatively weak evidence on the significance level [7]. The results from both Cohen’s kappa and Fleiss’ kappa can be difficult to apply across studies without an error generation model [8].
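As a minimal sketch of how *P(a)* and *P(e)* are obtained from a contingency table, the following Python function computes Cohen’s kappa from a square table of counts. The function name `cohens_kappa` and the example counts are illustrative assumptions, not part of the studies discussed here.

```python
def cohens_kappa(table):
    """Cohen's kappa for a square table of counts (rows: rater 1, columns: rater 2)."""
    size = len(table)
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    # P(a): observed share of cases on which both raters give the same category.
    p_a = sum(table[i][i] for i in range(size)) / n
    # P(e): agreement expected by chance, from the products of the marginal shares.
    p_e = sum(row_totals[i] * col_totals[i] for i in range(size)) / (n * n)
    return (p_a - p_e) / (1 - p_e)


# Illustrative counts: 30 of the 40 cases lie on the diagonal.
print(round(cohens_kappa([[10, 5], [5, 20]]), 3))
```

Because *P(e)* depends only on the marginal totals, two tables with the same diagonal sum can still give different kappa values when their margins differ.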

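For a 2 × 2 contingency table, the odds ratio is defined as

\[ OR=\frac{P\left({x}_{11}\right)P\left({x}_{22}\right)}{P\left({x}_{12}\right)P\left({x}_{21}\right)} \]

where \( P\left({x}_{ij}\right) \) is the probability of the cell in row *i* and column *j* of a contingency table. Since the odds ratio is easy to calculate, it is widely used to explain the level of association among the data, especially when validating alternative medical treatments. An odds ratio above 1 indicates positive association among the data, and an odds ratio below 1 indicates negative association.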
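The following sketch shows the odds ratio for a 2 × 2 table in Python. The function name `odds_ratio` and the handling of empty cells (returning infinity or NaN) are assumptions made for illustration; the source only notes that the ratio can become infinite when a cell is 0.

```python
import math


def odds_ratio(table):
    """Odds ratio of a 2x2 table [[x11, x12], [x21, x22]] (counts or probabilities)."""
    (x11, x12), (x21, x22) = table
    numerator = x11 * x22
    denominator = x12 * x21
    if denominator == 0:
        # Assumed convention: unbounded when a disagreement cell is empty,
        # undefined (NaN) when numerator and denominator are both zero.
        return math.inf if numerator > 0 else math.nan
    return numerator / denominator


print(odds_ratio([[10, 5], [5, 20]]))   # above 1: positive association
print(odds_ratio([[5, 10], [20, 5]]))   # below 1: negative association
```

Since the cell probabilities share the same denominator, the ratio can be computed directly from counts.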

### Mutual information and local mutual information

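The mutual information between the outcomes of two instruments is defined as

\[ I=\sum_{i}\sum_{j}P\left({x}_{ij}\right){ \log}_2\frac{P\left({x}_{ij}\right)}{P\left({x}_{i\cdot}\right)P\left({x}_{\cdot j}\right)} \]

where *i* and *j* index the outcomes of Instrument 1 and Instrument 2, respectively, and \( P\left({x}_{i\cdot}\right) \) and \( P\left({x}_{\cdot j}\right) \) are the corresponding marginal probabilities. Conventionally, mutual information has been used to measure the level of association among data. Some association measures can explain the direction of association, whether positive (agreement) or negative (disagreement). Mutual information itself, however, does not show the direction of association, and it has been used to explain only the level of association, regardless of direction. Interestingly, recent studies suggest that local mutual information can be used to explain the level not only of association, but also of agreement among the data [8, 11].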

Theoretically, mutual information is the sum of local mutual information, which is \( P\left({x}_{ij}\right)\kern0.5em { \log}_2\frac{P\left({x}_{ij}\right)}{P\left({x}_{i\cdot}\right)P\left({x}_{\cdot j}\right)} \). Such local mutual information is often regarded as an important measure and is used for information retrieval [12]. More importantly, some sets of local mutual information offer a quantitative measure of the level of agreement, beyond explaining association among the data. From a theoretical perspective, the use of mutual information as a measure of agreement can be beneficial as compared to other similar measures used for inter-rater reliability. First, it can be easily calculated without an error generation model [8], so it does not require any additional assumptions when applied across studies. Moreover, it can be approximated to a chi-square statistic, which enables us to statistically measure the level of significance of the mutual information obtained [13, 14]. Also, since mutual information is an information theoretical measure, it can be used along with other useful information theoretical measures, such as relative entropy or conditional entropy, thereby enhancing the potential for interpreting the data. A recent study utilizes mutual information and conditional entropy to extract “novel” information from medical data [15].
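To make the relation between local and total mutual information concrete, here is a small Python sketch that computes the local term for every cell of a contingency table and sums them. The function names are hypothetical, and empty cells are given a contribution of 0, which is one common convention.

```python
import math


def local_mutual_information(table):
    """Matrix of local terms P(x_ij) * log2(P(x_ij) / (P(x_i.) * P(x_.j)))."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    local = []
    for i, row in enumerate(table):
        local_row = []
        for j, count in enumerate(row):
            if count == 0:
                local_row.append(0.0)  # an empty cell contributes nothing
            else:
                p_ij = count / n
                expected = (row_totals[i] / n) * (col_totals[j] / n)
                local_row.append(p_ij * math.log2(p_ij / expected))
        local.append(local_row)
    return local


def mutual_information(table):
    """Mutual information in bits: the sum of all local terms."""
    return sum(sum(row) for row in local_mutual_information(table))


cells = local_mutual_information([[10, 5], [5, 20]])
print([[round(v, 3) for v in row] for row in cells])
print(round(mutual_information([[10, 5], [5, 20]]), 3))
```

Individual local terms can be negative (a cell observed less often than independence would predict), while their sum is never negative.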

### Proposed procedure: comparison of two diagnostic instruments

A procedure for comparing two instruments

- Step 1. Design a contingency table and collect data using the table
- Step 2. Measure mutual information
- Step 3. Determine significance of the mutual information
- Step 4. Check the sum of the local mutual information on agreement section

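The sums of local mutual information over the agreement and disagreement sections are

\[ {I}_{agreement}=\sum_{\left(i,j\right)\in A}P\left({x}_{ij}\right){ \log}_2\frac{P\left({x}_{ij}\right)}{P\left({x}_{i\cdot}\right)P\left({x}_{\cdot j}\right)} \]

\[ {I}_{disagreement}=\sum_{\left(i,j\right)\in D}P\left({x}_{ij}\right){ \log}_2\frac{P\left({x}_{ij}\right)}{P\left({x}_{i\cdot}\right)P\left({x}_{\cdot j}\right)} \]

where *A* stands for the set of agreement sections and *D* stands for the set of disagreement sections between the two instruments (Inst.1 and Inst.2 denote instrument 1 and instrument 2, respectively). By convention, \( 0{ \log}_2\frac{0}{p}=0 \) and \( p{ \log}_2\frac{p}{0}=\infty \), where *p* ≠ 0.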

To determine the significance of the mutual information obtained, we use the fact that mutual information can be approximated by a \( {\chi}^2 \) distribution, as shown in Eq. 6 [14].

\[ 2n\sum_{i}\sum_{j}P\left({x}_{ij}\right) \ln \frac{P\left({x}_{ij}\right)}{P\left({x}_{i\cdot}\right)P\left({x}_{\cdot j}\right)}\sim {\chi}_{\upsilon}^2\qquad (6) \]

where *n* is the size of the sample. If the base of the logarithm in Eq. 6 is changed from *e* to 2, the statistic can be expressed as \( 2n\cdot \ln 2\cdot \sum_{i}\sum_{j}P\left({x}_{ij}\right){ \log}_2\frac{P\left({x}_{ij}\right)}{P\left({x}_{i\cdot}\right)P\left({x}_{\cdot j}\right)} \). This quantity is known to follow a \( {\chi}^2 \) distribution with *υ* degrees of freedom, where *υ* = (the number of rows of the contingency table − 1) × (the number of columns of the contingency table − 1) [13]. From this conversion, the corresponding \( {\chi}^2 \) statistic can be calculated to determine the significance of the obtained mutual information within a statistical testing framework. We also use the p-value as a measure of the strength of evidence, that is, of the significance of the mutual information.
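The conversion from mutual information to a test statistic can be sketched in a few lines of Python. This is only an illustration for 2 × 2 tables, where there is one degree of freedom and the upper tail of the chi-square distribution can be written with `math.erfc`; the function names are hypothetical, and larger tables would need a general chi-square survival function such as `scipy.stats.chi2.sf`.

```python
import math


def mutual_information_bits(table):
    """Mutual information (base-2 logarithm) of a contingency table of counts."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            if count > 0:
                # P(x_ij) / (P(x_i.) * P(x_.j)) simplifies to count * n / (row * col).
                ratio = count * n / (row_totals[i] * col_totals[j])
                total += (count / n) * math.log2(ratio)
    return total


def chi_square_from_mi(table):
    """Statistic 2 * n * ln(2) * I, with I measured in bits."""
    n = sum(sum(row) for row in table)
    return 2 * n * math.log(2) * mutual_information_bits(table)


def p_value_one_df(chi_square):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(max(chi_square, 0.0) / 2))


statistic = chi_square_from_mi([[10, 5], [5, 20]])
print(round(statistic, 2), round(p_value_one_df(statistic), 4))
```

The `max(..., 0.0)` guard only protects against tiny negative values caused by floating-point rounding when the table is (nearly) independent.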

To interpret the mutual information obtained, when a significant amount of mutual information is observed and the sum of local mutual information in the agreement sections is larger than that in the disagreement sections, we say that the two instruments are in agreement with each other. Similarly, if we observe a significant amount of mutual information and the sum of the local mutual information in the disagreement area is larger than that in the agreement area, we say the two instruments are in disagreement. Meanwhile, if we observe only a low level of mutual information from the outcomes, we cannot determine the level of agreement/disagreement of the outcomes. This situation falls under the inconclusive category, and is caused either by a lack of sufficient data to draw out associations or the existence of a truly independent relationship among the data.

### Example

Illustrative example of comparing two diagnostic instruments (Note: in Step 1, numbers in agreement sections in each table are expressed in boldface)

Procedure | Scenario 1 | Scenario 2 | Scenario 3
---|---|---|---
Data (\( {x}_{11} \), \( {x}_{12} \), \( {x}_{21} \), \( {x}_{22} \)) | (10, 5, 5, 20) | (5, 10, 20, 5) | (5, 10, 5, 20)
Step 1: Contingency table | \( \begin{array}{cc}\mathbf{10} & 5\\ 5 & \mathbf{20}\end{array} \) | \( \begin{array}{cc}\mathbf{5} & 10\\ 20 & \mathbf{5}\end{array} \) | \( \begin{array}{cc}\mathbf{5} & 10\\ 5 & \mathbf{20}\end{array} \)
Step 2: Mutual information, \( {I}_{agreement} \) | \( \frac{10}{40}{ \log}_2\frac{10/40}{(15/40)(15/40)}+\frac{20}{40}{ \log}_2\frac{20/40}{(25/40)(25/40)} \) | \( \frac{5}{40}{ \log}_2\frac{5/40}{(15/40)(25/40)}+\frac{5}{40}{ \log}_2\frac{5/40}{(15/40)(25/40)} \) | \( \frac{5}{40}{ \log}_2\frac{5/40}{(15/40)(10/40)}+\frac{20}{40}{ \log}_2\frac{20/40}{(30/40)(25/40)} \)
Step 2: Mutual information, \( {I}_{disagreement} \) | \( \frac{5}{40}{ \log}_2\frac{5/40}{(15/40)(25/40)}+\frac{5}{40}{ \log}_2\frac{5/40}{(15/40)(25/40)} \) | \( \frac{10}{40}{ \log}_2\frac{10/40}{(15/40)(15/40)}+\frac{20}{40}{ \log}_2\frac{20/40}{(25/40)(25/40)} \) | \( \frac{10}{40}{ \log}_2\frac{10/40}{(15/40)(30/40)}+\frac{5}{40}{ \log}_2\frac{5/40}{(10/40)(25/40)} \)
Step 3: Significance | Mutual information = 0.386 − 0.227 = 0.159. From Eq. 6, 2 × 40 × ln 2 × 0.159 = 8.809. Highly significant (with …) | Mutual information = −0.227 + 0.386 = 0.159. From Eq. 6, 2 × 40 × ln 2 × 0.159 = 8.809. Highly significant (with …) | Mutual information = 0.098 − 0.083 = 0.015. From Eq. 6, 2 × 40 × ln 2 × 0.015 = 0.871. Less significant (with …)
Step 4: Local mutual information | \( {I}_{agreement}>{I}_{disagreement} \) and highly significant mutual information. Thus, agreement | \( {I}_{agreement}<{I}_{disagreement} \) and highly significant mutual information. Thus, disagreement | \( {I}_{agreement}>{I}_{disagreement} \) but very low mutual information observed. Thus, inconclusive
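The three scenarios can be reproduced with a short Python sketch of Steps 2 to 4. It assumes a 2 × 2 table whose diagonal cells are the agreement sections, and it uses a significance cutoff `alpha = 0.05`; the function name `compare` and the cutoff are assumptions introduced here for illustration, since the procedure itself does not fix a threshold.

```python
import math


def compare(x11, x12, x21, x22, alpha=0.05):
    """Steps 2-4 for a 2x2 table; alpha is an assumed significance cutoff."""
    table = [[x11, x12], [x21, x22]]
    n = x11 + x12 + x21 + x22
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]

    def local(i, j):
        # Local mutual information of cell (i, j); empty cells contribute 0.
        if table[i][j] == 0:
            return 0.0
        p = table[i][j] / n
        return p * math.log2(p / ((rows[i] / n) * (cols[j] / n)))

    i_agree = local(0, 0) + local(1, 1)        # Step 2: agreement sections
    i_disagree = local(0, 1) + local(1, 0)     # Step 2: disagreement sections
    chi2 = 2 * n * math.log(2) * (i_agree + i_disagree)   # Step 3: Eq. 6 in base 2
    p_value = math.erfc(math.sqrt(max(chi2, 0.0) / 2))    # 1 degree of freedom
    if p_value >= alpha:
        verdict = "inconclusive"
    elif i_agree > i_disagree:                 # Step 4
        verdict = "agreement"
    else:
        verdict = "disagreement"
    return round(i_agree, 3), round(i_disagree, 3), round(chi2, 3), verdict


scenarios = {
    "Scenario 1": (10, 5, 5, 20),
    "Scenario 2": (5, 10, 20, 5),
    "Scenario 3": (5, 10, 5, 20),
}
for name, cells in scenarios.items():
    print(name, compare(*cells))
```

Scenarios 1 and 2 share the same mutual information and differ only in which sections carry the positive local terms, which is why Step 4 is needed to tell agreement from disagreement.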

## Results: analysis of assessments to detect delirium

As a case study, we analyzed the outcomes of a pilot study on the feasibility of enlisting family caregivers to electronically report delirium symptoms in patients with dementia [17]. The purpose of this pilot study was to prospectively explore the feasibility of engaging family caregivers to electronically report observations of delirium symptoms in community-dwelling older adults with dementia. The study also sought to describe agreement between family observations of delirium (Family Confusion Assessment Method [FAM-CAM]) and researcher assessments (Confusion Assessment Method [CAM]). Family caregivers accessed an electronic delirium assessment instrument daily, via their personal computer or a study-supplied smart phone, to transmit data. There were 13 patient participants in this pilot study. All were Caucasian; the mean age was 80, 69 % were female, and the mean years of education was 11. Caregivers were adult children (*N* = 8), spouses (*N* = 4) and siblings (*N* = 1). Eight caregivers used their own personal computers and five used study-supplied smart phones. The pilot study and consents were approved by the Penn State and University of Pennsylvania Institutional Review Boards (IRBs).

Delirium was operationally defined according to the validated CAM criteria and the Delirium Rating Scale (DRS-R-98). The CAM features are 1) acute onset and fluctuating course, 2) inattention, and either 3) disorganized thinking, or 4) altered level of consciousness [18]. The CAM is a standardized screening tool allowing persons without formal psychiatric training to quickly and accurately identify delirium. The FAM-CAM was developed as part of a larger cohort study as a means to detect delirium in elders; it relies on caregiver information to screen for the CAM features. While the FAM-CAM is based on the original CAM, there are differences between the two instruments. The healthcare professional administering the CAM assesses the four main features of delirium directly, employing observational skills. In contrast, the FAM-CAM includes questions directed to the family member to help identify the cardinal signs of delirium as well as signs that are sensitive for detecting delirium (i.e., inattention, disorganized thinking, lethargy, disorientation, perceptual disturbances and inappropriate behavior/agitation). According to the diagnostic algorithm, delirium is identified if the patient shows the presence of acute onset, fluctuating course, inattention, and either the presence of disorganized thinking or an altered level of consciousness.

Theoretically, CAM and FAM-CAM share the same features for detecting delirium, so they are expected to produce identical outcomes. Practically, however, they are not identical in terms of content; the assessments have different structure and content, as appropriate to their original purposes. Compared to the CAM, the FAM-CAM paraphrases all contents of the CAM so that the assessment can be conducted easily by family caregivers. For this reason, some of the contents of FAM-CAM might be conveyed inaccurately to caregivers and fail to meet the original intentions of the CAM. Consequently, the validity of FAM-CAM should be checked in a rigorous way. Along with the CAM and FAM-CAM, other instruments designed to detect different aspects of a patient’s DSD were used in the pilot. Trained professionals administered instruments such as the Mini-Mental State Examination (MMSE), Clinical Dementia Rating (CDR), and the Delirium Rating Scale (DRS). In reality, CAM was used as the tool to detect delirium, while the other instruments were used to detect the severity of either dementia or delirium.

Following the procedure, the mutual information between the outcomes of CAM and FAM-CAM (0.492) was found to be significant at *p* < 0.001. For comparison purposes, we note that the odds ratio was 248 and Cohen’s kappa was 0.858, both significant. For the next step, we tested agreement between the methods by comparing the sign of the local mutual information of the agreement sections with that of the disagreement sections. The local mutual information from the agreement sections (0.629) was found to be positive, while the local mutual information from the disagreement sections (−0.137) was found to be negative. Thus, we can conclude that the outcomes from CAM and FAM-CAM share a high level of information and are in agreement with each other.

A contingency table for CAM and FAM-CAM (numbers in agreement sections expressed in boldface)

FAM-CAM | CAM: Positive | CAM: Negative | Total
---|---|---|---
Positive | **8** (Agreement) | 1 (Disagreement) | 9
Negative | 1 (Disagreement) | **31** (Agreement) | 32
Total | 9 | 32 | 41
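The figures reported for CAM and FAM-CAM can be checked with a short Python sketch. It assumes the agreement counts quoted in the Discussion (8 positive-positive and 31 negative-negative) together with one observation in each disagreement cell; the variable and function names are illustrative only.

```python
import math

# Rows: FAM-CAM positive/negative; columns: CAM positive/negative.
# Agreement counts (8 and 31) are taken from the Discussion; 41 observations in total.
table = [[8, 1], [1, 31]]

n = sum(sum(row) for row in table)
rows = [sum(row) for row in table]
cols = [sum(col) for col in zip(*table)]


def local_mi(i, j):
    """Local mutual information (bits) of cell (i, j); all cells here are non-empty."""
    p = table[i][j] / n
    return p * math.log2(p / ((rows[i] / n) * (cols[j] / n)))


i_agreement = local_mi(0, 0) + local_mi(1, 1)
i_disagreement = local_mi(0, 1) + local_mi(1, 0)
mutual_information = i_agreement + i_disagreement
chi_square = 2 * n * math.log(2) * mutual_information
p_value = math.erfc(math.sqrt(chi_square / 2))   # 1 degree of freedom

odds_ratio = (table[0][0] * table[1][1]) / (table[0][1] * table[1][0])
p_a = (table[0][0] + table[1][1]) / n
p_e = (rows[0] * cols[0] + rows[1] * cols[1]) / n ** 2
kappa = (p_a - p_e) / (1 - p_e)

print(round(i_agreement, 3), round(i_disagreement, 3), round(mutual_information, 3))
print(odds_ratio, round(kappa, 3), p_value < 0.001)
```

Under these counts the sketch gives values that round to the reported 0.629, −0.137, 0.492, 248 and 0.858.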

Similarly, we checked the level of agreement of each primary feature between CAM and FAM-CAM. In terms of the significance of the mutual information, we observed a high level of significance between CAM and FAM-CAM in the case of feature 1 (acute onset and fluctuating course) and feature 3 (disorganized thinking). In addition, we observed a positive amount of local mutual information in the agreement sections for these features. Therefore, we can conclude that the two instruments are in agreement in terms of features 1 and 3. For feature 2, although we observed that the outcomes for the feature agreed, we also observed that its significance level is relatively low compared to features 1 and 3. Meanwhile, we measured quite a low level of mutual information on feature 4 (altered level of consciousness); this situation falls into the inconclusive category. In this case, the odds ratio and Cohen’s kappa were also measured as insignificant. This can be caused by either independent outcomes or a shortage of data on the corresponding feature, so further investigation with a domain expert is necessary.

Comparison of FAM-CAM with other instruments, using local mutual information, odds ratio and kappa

Pair no. | Comparisons of FAM-CAM with | \( {I}_{agreement} \) | \( {I}_{disagreement} \) | Mutual information | Result | Odds ratio | Kappa
---|---|---|---|---|---|---|---
1 | CAM | 0.629 | -0.137 | 0.492 | Agreement (…) | 248 | 0.858
2 | Feature of CAM: Acute onset & fluctuating courses | 0.356 | -0.177 | 0.179 | Agreement (…) | 17.6 | 0.424
3 | Feature of CAM: Inattention | 0.196 | -0.148 | 0.048 | Agreement (…) | 3.125* | 0.237
4 | Feature of CAM: Disorganized thinking | 0.320 | -0.116 | 0.204 | Agreement (…) | ∞** | 0.346
5 | Feature of CAM: Altered level of consciousness | -0.022 | 0.025 | 0.003 | Inconclusive | 0.65* | -0.051*
6 | FAM-CAM on a group using smart phones | 0.519 | -0.245 | 0.274 | Agreement (…) | 18*** | 0.607
7 | FAM-CAM from other study | 0.551 | -0.099 | 0.452 | Agreement (…) | ∞** | 0.805

In addition to the CAM, trained professionals of the pilot team also implemented a series of instruments, including the Delirium Rating Scale (DRS) and the Mini-Mental State Examination (MMSE). Of these, DRS is an assessment for measuring level of delirium and was used repeatedly on individual patients, along with the CAM. Theoretically, CAM, DRS, and FAM-CAM share a similar diagnostic purpose, since they were all designed to assess delirium symptoms. Both CAM and FAM-CAM provide binary results on the presence of delirium, while DRS uses numerical scale numbers indicating the severity of delirium. Thus, we can conjecture that the three different instruments will provide similar results for detecting delirium.

Comparison result of each pair of instruments, using local mutual information, odds ratio and kappa

Pair no. | Pair | \( {I}_{agreement} \) | \( {I}_{disagreement} \) | Mutual information | Result | Odds ratio | Kappa
---|---|---|---|---|---|---|---
8 | (CAM, FAM-CAM) | 0.397 | -0.173 | 0.224 | Agreement (…) | 25.333 | 0.575
9 | (FAM-CAM, DRS) | -0.090 | 0.115 | 0.025 | Inconclusive (…) | 0.381* | -0.128*
10 | (DRS, CAM) | 0.275 | -0.176 | 0.099 | Agreement (…) | 5.833 | 0.349

## Results: analysis of assessments from other case studies

Agreement/disagreement between two instruments using three measures

Data set | Odds ratio | Cohen’s kappa | Local mutual information
---|---|---|---
(Table 2 in Shulman et al. 1986) Clock exam and MMSE | 15.6 | 0.493 | Agreement (0.422 in …)
(Figure 1 in Russell et al. 2012) BDI | ∞** | 0.693 | Agreement (0.18 in …)
(Figure 1 in Russell et al. 2012) CDRS_R and ICD-10 | 0.311* | -0.015* | Inconclusive (-0.018 in …)
(Table 4 in Seago 2002) Score scheme comparison | 4 | 0.265 | Agreement (0.152 in …)

## Results: illustrative comparisons among measures

We increase \( {x}_{11} \) from 1 to 1000 and decrease \( {x}_{22} \) from 1000 to 1 in steps of 1, set the other cells, \( {x}_{12} \) and \( {x}_{21} \), to k > 0, and measure both mutual information and Cohen’s kappa for each combination of \( {x}_{11} \), \( {x}_{12} \), \( {x}_{21} \) and \( {x}_{22} \). That is, we change the ratio between positive-positive agreement and negative-negative agreement while the total number of agreements is maintained. Intuitively, we can conjecture that the level of agreement increases as \( {x}_{11} \) approaches \( {x}_{22} \), and that it is highest when \( {x}_{11} \) equals \( {x}_{22} \). In Fig. 3(a), both mutual information and Cohen’s kappa show the highest agreement at \( {x}_{11}={x}_{22}=499 \) when k is set to 1 (Cohen’s kappa = 0.996 and mutual information = 0.979), but the two measures approach the peak differently: Cohen’s kappa quickly approaches its highest value and then converges slowly before hitting the peak, while mutual information increases smoothly to its highest value as \( {x}_{11} \) and \( {x}_{22} \) approach equality. In other words, Cohen’s kappa tends to maintain a high value regardless of the ratio between the agreement sections, as long as the total amount of agreement overwhelms the disagreement. Mutual information, on the other hand, emphasizes the situation where the two amounts of agreement, positive-positive and negative-negative, approach each other.

We then increase k, the amount of disagreement, from 1 to 100 to see whether the trend seen in Fig. 3(a) persists. We plot the changes of mutual information and Cohen’s kappa with the changes of k in Fig. 3(b) and (c). Both figures show that the amount of agreement is maximized when \( {x}_{11}={x}_{22} \) for every k, but decreases as the amount of disagreement (k) increases. This confirms the trend shown in Fig. 3(a). We combine Fig. 3(b) and (c) into Fig. 3(d) for comparison.
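The sweep can be sketched in Python as follows. This is an illustration rather than the code behind Fig. 3: it assumes the total number of agreements is fixed at 998, so that the balanced point is \( {x}_{11}={x}_{22}=499 \) as quoted for k = 1, and the names `kappa_and_mi`, `TOTAL_AGREEMENT` and `K` are hypothetical.

```python
import math


def kappa_and_mi(x11, x12, x21, x22):
    """Cohen's kappa and mutual information (bits) of a 2x2 table of counts."""
    n = x11 + x12 + x21 + x22
    rows = (x11 + x12, x21 + x22)
    cols = (x11 + x21, x12 + x22)
    p_a = (x11 + x22) / n
    p_e = (rows[0] * cols[0] + rows[1] * cols[1]) / (n * n)
    kappa = (p_a - p_e) / (1 - p_e)
    mi = 0.0
    for count, r, c in ((x11, rows[0], cols[0]), (x12, rows[0], cols[1]),
                        (x21, rows[1], cols[0]), (x22, rows[1], cols[1])):
        if count > 0:
            # P(x_ij) / (P(x_i.) * P(x_.j)) simplifies to count * n / (r * c).
            mi += (count / n) * math.log2(count * n / (r * c))
    return kappa, mi


TOTAL_AGREEMENT = 998   # assumed, so that the midpoint is x11 = x22 = 499
K = 1                   # count placed in each disagreement cell

sweep = {x11: kappa_and_mi(x11, K, K, TOTAL_AGREEMENT - x11)
         for x11 in range(1, TOTAL_AGREEMENT)}

best_for_kappa = max(sweep, key=lambda x: sweep[x][0])
best_for_mi = max(sweep, key=lambda x: sweep[x][1])
print(best_for_kappa, best_for_mi)
for x11 in (100, 300, 499):
    print(x11, [round(v, 3) for v in sweep[x11]])
```

Printing a few intermediate points shows the contrast described above: kappa is already close to its maximum far from the balanced point, whereas mutual information is still well below its peak there.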

## Discussion

The results from CAM and FAM-CAM show moderate or high levels of agreement in terms of overall results, individual features, and platform/study. In most cases, mutual information and the other measures considered here lead to the same conclusions regarding agreement/disagreement. Therefore, we conclude that FAM-CAM, an adapted version of CAM, is a valid instrument for detecting delirium when used by family members.

During the comparisons, we examined the amount of information from each agreement and disagreement section separately, using the suggested approach. In some cases, we observed that there exists a weak level of agreement between the instruments compared. ‘Weak’ agreement can occur when there is a low amount of local mutual information from either agreement sections or disagreement sections. CAM and FAM-CAM (pair 1), for example, showed a high level of local mutual information on the agreement sections, compared to that from the disagreement sections (0.629 from agreement sections vs. -0.137 from disagreement sections). In the comparison of CAM and FAM-CAM in terms of the ‘Inattention’ feature (pair 3), however, the amount of local mutual information from the disagreement sections is measured at −0.148, while that of agreement sections is measured at only 0.196, resulting in 0.048 as mutual information, which represents a low level of agreement compared to pair 1. Thus, we conjecture that the level of agreement of pair 3 results in a relatively weak agreement, due to the low level of local mutual information from its agreement sections. In other words, FAM-CAM did not secure enough information to explain agreement with CAM in terms of the ‘Inattention’ feature, thereby leading to the need to further clarify FAM-CAM questions related to this feature. Meanwhile, although pair 8 (comparison with CAM and FAM-CAM) and pair 10 (comparison with CAM and DRS) show similar levels of local mutual information from disagreement sections (−0.173 and −0.176), pair 8 shows greater agreement due to the greater amount of local mutual information (0.397) from agreement sections compared to that of pair 10 (0.275). Thus, we conjecture that FAM-CAM shows better performance in terms of explaining agreement compared to DRS.

As another example, although the outcomes of CAM and FAM-CAM for the smartphone group are in agreement, the mutual information measured from this group was lower than that of the overall group; compared with pair 1 (overall group), pair 6 (smartphone group) shows weaker levels of local mutual information in both the agreement sections and the disagreement sections. In this case, we can conclude that different access to the system affects the overall outcomes in terms of both agreement and disagreement, which suggests a need to improve the overall usability of the smartphone environment (e.g. the user interface). In sum, with the suggested approach, we were able to explain why a comparison results in a weak level of agreement by referring to the local mutual information observed from both the agreement sections and the disagreement sections. Odds ratio and Cohen’s kappa, by contrast, cannot explain why such weak agreement between the instruments occurs.

Through a series of illustrations, we also show that mutual information offers various benefits over other competing measures. First, the odds ratio sometimes scales poorly. When we analyze the data from the case study for detecting delirium, we see that the odds ratio is highly likely to exaggerate the level of agreement (248 from the comparison of CAM and FAM-CAM) and sometimes cannot be measured properly (infinity due to 0 in the data). Meanwhile, both Cohen’s kappa and mutual information fall within a reasonable bound for most of the studies. In addition, we see that Cohen’s kappa is affected mainly by the total amount of agreement between the instruments. Mutual information, on the other hand, is affected not only by the total amount of agreement, but also by the amount of each agreement type. In our case study, for example, Cohen’s kappa for pair 1 was measured at 0.858, which is fairly high, while mutual information was measured at 0.492. These measures were computed from the data in Table 3; the 39 observations in the agreement sections comprise 8 positive-positive agreements and 31 negative-negative agreements. As a hypothetical situation, if there were an almost even number of observations of each agreement type (20 positive-positive and 19 negative-negative), the mutual information would increase to 0.718 (an increase of 45.9 %) while Cohen’s kappa would increase only slightly, to 0.902 (an increase of 5.1 %). In other words, mutual information places more weight on the evenness of the agreement evidence than Cohen’s kappa does. Consequently, we conjecture that the suggested approach is more capable of providing separate information on both agreement and disagreement sections, a well-scaled measure as compared to the odds ratio, and a measure that responds adequately to the agreement type.
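A short worked calculation may help show why Cohen’s kappa moves so little between the two situations. It assumes one observation in each disagreement cell in both cases (41 observations in total), so that the marginal totals are 9 and 32 for the observed data and 21 and 20 for the hypothetical data; the subscripts are labels introduced here for illustration.

\[ {\kappa}_{observed}=\frac{\frac{39}{41}-\frac{9^2+{32}^2}{{41}^2}}{1-\frac{9^2+{32}^2}{{41}^2}}=\frac{1599-1105}{1681-1105}=\frac{494}{576}\approx 0.858 \]

\[ {\kappa}_{hypothetical}=\frac{\frac{39}{41}-\frac{{21}^2+{20}^2}{{41}^2}}{1-\frac{{21}^2+{20}^2}{{41}^2}}=\frac{1599-841}{1681-841}=\frac{758}{840}\approx 0.902 \]

The observed agreement *P(a)* = 39/41 is identical in both cases; only the expected agreement *P(e)* changes, from about 0.657 to about 0.500, which accounts for the modest rise in kappa.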

Some limitations of using mutual information in comparing instruments are as follows. First, applying mutual information could be demanding in cases of three or more instruments, since the concept of mutual information as used here applies only to the comparison of two instruments. In this situation, Fleiss’ kappa, an extended form of Cohen’s kappa, could be used to measure the level of agreement. Another limitation is that the use of mutual information assumes that the instruments to be compared have a similar diagnostic purpose. In our case study, the team also administered the Mini-Mental State Examination (MMSE), which is used to measure the level of a patient’s cognitive impairment. A severe level of cognitive impairment may be due not only to some level of delirium, but also to the severity of dementia. In this sense, the diagnostic purpose of the MMSE is somewhat different from those of the CAM, FAM-CAM, and DRS, which are dedicated to examining a patient’s delirium. Since mutual information can be applied only to instruments with a similar diagnostic purpose, its usage could be limited in comparing the MMSE with other instruments for delirium. Our future research will primarily focus on how to compare three or more diagnostic instruments, as well as instruments that do not share the same diagnostic purpose. For this, we need to extend the suggested approach and explore other information theoretic approaches.

## Conclusion

In this paper, we suggest a procedure for comparing multiple healthcare survey instruments using mutual information, an information theoretic approach. Mutual information is used to measure the amount of information shared among the outcomes from multiple instruments. With the suggested procedure, we explain agreement/disagreement between the instruments used in several studies and compare the results with other competing measures to show the benefits of mutual information. Compared to other competing measures, the suggested approach is more capable of providing separate information on the agreement and disagreement existing in the data, a well-scaled measure, and a measure that responds adequately to the agreement type. We also noted that an instrument can be further improved by referring to the information measured from the agreement and disagreement sections. We believe this approach will provide a reliable way to evaluate agreement/disagreement of outcomes from multiple instruments and may also offer clues to improving healthcare survey instruments.

## Abbreviations

CAM, Confusion Assessment Method; CDR, Clinical Dementia Rating; DRS, Delirium Rating Scale; DSD, Delirium superimposed on dementia; FAM-CAM, a family version of CAM; MMSE, Mini-Mental State Examination

## Notes

### Acknowledgements

Not applicable.

### Funding

Funded in part by Children, Youth, and Families Consortium, The Pennsylvania State University. Verizon supplied the smart phones.

### Availability of data and material

In this study, we aggregated individual participants’ data to examine the suggested approach. Those interested in the data can contact the corresponding author, Yuncheol Kang, Ph.D., at yckang@hongik.ac.kr.

### Authors’ contributions

YK proposed the framework, analyzed the data, and interpreted the results. YK and MRS wrote part of the manuscript. MRS, AMK and DF provided application expertise and critical comments to improve the manuscript. VVP provided the initial idea that enabled this research and validated the framework. All authors reviewed, proofread, and approved the final manuscript.

### Competing interests

The authors declare that they have no competing interests.

### Consent for publication

Not applicable.

### Ethics approval and consent to participate

This study and consents were approved by the Susquehanna Health System, The Pennsylvania State University and The University of Pennsylvania IRBs.

## References

- 1. Macaskill P, Walter SD, Irwig L, Franco EL. Assessing the gain in diagnostic performance when combining two diagnostic tests. Stat Med. 2002;21(17):2527–46.
- 2. Pepe MS, Thompson ML. Combining diagnostic test results to increase accuracy. Biostatistics. 2000;1(2):123–40.
- 3. Thompson ML. Assessing the diagnostic accuracy of a sequence of tests. Biostatistics. 2003;4(3):341–51.
- 4. Rutjes AW, Reitsma J, Coomarasamy A, Khan K, Bossuyt P. Evaluation of diagnostic tests when there is no gold standard: a review of methods. Health Technol Assess. 2007;11(50):iii, ix–51.
- 5. Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Meas. 1960;20(1):37–46.
- 6. Uebersax JS. Diversity of decision-making models and the measurement of interrater agreement. Psychol Bull. 1987;101(1):140.
- 7. Gwet KL. Handbook of inter-rater reliability. Advanced Analytics, LLC; 2001.
- 8. Klemens B. Mutual information as a measure of intercoder agreement. J Off Stat. 2012;28(3):395–412.
- 9. Tan PN, Kumar V, Srivastava J. Selecting the right objective measure for association analysis. Inf Syst. 2004;29(4):293–313.
- 10. Cover TM, Thomas JA. Elements of Information Theory, 2nd edn. Wiley; 2006.
- 11. Kang Y, Prabhu VV, Steis MR, Kolanowski AM, Fick DM, Bowles KH. Integrating information from family caregivers for eldercare. In: Proceedings of the 2010 Industrial Engineering Research Conference; Cancun, Mexico. 2010.
- 12. Bouma G. Normalized (pointwise) mutual information in collocation extraction. In: Proceedings of the Biennial GSCL Conference; Potsdam, Germany. 2009. p. 31–40.
- 13. Miller GA, Madow WG. On the Maximum Likelihood Estimate of the Shannon-Weiner Measure of Information. Operational Applications Laboratory, Air Force Cambridge Research Center, Air Research and Development Command, Bolling Air Force Base; 1954.
- 14. Gokhale D, Kullback S. The information in contingency tables. Marcel Dekker; 1978.
- 15. Lee J, Maslove DM. Using information theory to identify redundancy in common laboratory tests in the intensive care unit. BMC Med Inform Decis Mak. 2015;15(1):59.
- 16. van der Wulp I, van Stel HF. Adjusting weighted kappa for severity of mistriage decreases reported reliability of emergency department triage systems: a comparative study. J Clin Epidemiol. 2009;62(11):1196–201.
- 17. Steis MR, Prabhu VV, Kolanowski AM, Kang Y, Bowles KH, Fick DM, Evans L. eCare for eldercare: detection of delirium in community-dwelling persons with dementia. Online J Nurs Inform. 2012;16:1.
- 18. Inouye S, van Dyck C, Alessi C, Balkin S, Siegal A, Horwitz R. Clarifying confusion: the confusion assessment method: a new method for detection of delirium. Ann Intern Med. 1990;113(12):941.
- 19. Naylor MD. Hospital to Home: Cognitively Impaired Elders/Caregivers. In: Marian S, editor. Ware Alzheimer Program at the University of Pennsylvania: NIH/NIA - R01AG023116-05. 2007.
- 20. Trzepacz PT, Mittal D, Torres R, Kanary K, Norton J, Jimerson N. Validation of the delirium rating scale-revised-98 comparison with the delirium rating scale and the cognitive test for delirium. J Neuropsychiatry Clin Neurosci. 2001;13(2):229–42.
- 21. Russell P, Basker M, Russell S, Moses P, Nair M, Minju K. Comparison of a self-rated and a clinician-rated measure for identifying depression among adolescents in a primary-care setting. Indian J Pediatr. 2012;79(1):45–51.
- 22. Seago JA. A comparison of two patient classification instruments in an acute care hospital. J Nurs Adm. 2002;32(5):243–9.
- 23. Shulman KI, Shedletsky R, Silver IL. The challenge of time: clock-drawing and cognitive function in the elderly. Int J Geriatr Psychiatry. 1986;1(2):135–40.

## Copyright information

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.