PyCM is a multi-class confusion matrix library written in Python that supports both input data vectors and direct matrix input, and it is a proper tool for post-classification model evaluation that supports most class and overall statistics parameters. PyCM is the Swiss-army knife of confusion matrices, targeted mainly at data scientists who need a broad array of metrics for predictive models and an accurate evaluation of a large variety of classifiers.
Fig1. ConfusionMatrix Block Diagram
⚠️ PyCM 2.4 is the last version to support Python 2.7 & Python 3.4
⚠️ Plotting capability requires Matplotlib (>= 3.0.0) or Seaborn (>= 0.9.1)
Source: run pip install -r requirements.txt or pip3 install -r requirements.txt (Need root access), then python3 setup.py install or python setup.py install (Need root access)
PyPI: pip install pycm==3.0 or pip3 install pycm==3.0 (Need root access)
Conda: conda install -c sepandhaghighi pycm (Need root access)
Easy install: easy_install --upgrade pycm (Need root access)
Docker: docker pull sepandhaghighi/pycm (Need root access)
MATLAB: select the Add to PATH and Install pip options, run pip install pycm or pip3 install pycm (Need root access), then run in MATLAB >> pyversion PYTHON_EXECUTABLE_FULL_PATH
from pycm import *
y_actu = [2, 0, 2, 2, 0, 1, 1, 2, 2, 0, 1, 2]
y_pred = [0, 0, 2, 1, 0, 2, 1, 0, 2, 0, 2, 2]
cm = ConfusionMatrix(y_actu, y_pred,digit=5)
cm
cm.actual_vector
cm.predict_vector
cm.classes
cm.class_stat
cm.overall_stat
cm.table
cm.matrix
cm.normalized_matrix
cm.normalized_table
import numpy
y_actu = numpy.array([2, 0, 2, 2, 0, 1, 1, 2, 2, 0, 1, 2])
y_pred = numpy.array([0, 0, 2, 1, 0, 2, 1, 0, 2, 0, 2, 2])
cm = ConfusionMatrix(y_actu, y_pred,digit=5)
cm
cm2 = ConfusionMatrix(matrix={0: {0: 3, 1: 0, 2: 0}, 1: {0: 0, 1: 1, 2: 2}, 2: {0: 2, 1: 1, 2: 3}},digit=5)
cm2
cm2.actual_vector
cm2.predict_vector
cm2.classes
cm2.class_stat
cm2.overall_stat
threshold is added in version 0.9 for real value prediction. For more information visit Example 3.
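A minimal sketch (hypothetical data) of how a threshold function can map real-valued predictions to class labels; the lambda and the 0.5 cut-off are illustrative assumptions, not fixed by the library:
from pycm import ConfusionMatrix
y_true = [1, 0, 1, 1, 0, 1]                  # hypothetical ground-truth labels
y_score = [0.9, 0.1, 0.35, 0.8, 0.6, 0.7]    # hypothetical real-valued predictions
# the threshold function receives each predicted value and must return a class label
cm_t = ConfusionMatrix(y_true, y_score, threshold=lambda x: 1 if x >= 0.5 else 0)
print(cm_t.classes)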
file is added in version 0.9.5 in order to load a saved confusion matrix in .obj format generated by the save_obj method. For more information visit Example 4.
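A minimal sketch of reloading a matrix from disk; it assumes a file cm1.obj has already been created by save_obj in the working directory:
from pycm import ConfusionMatrix
# reload a confusion matrix previously written by save_obj("cm1")
cm_loaded = ConfusionMatrix(file=open("cm1.obj", "r"))
print(cm_loaded.classes)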
sample_weight is added in version 1.2. For more information visit Example 5.
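A minimal sketch (hypothetical weights) showing how each observation contributes its weight, rather than 1, to the corresponding matrix cell:
from pycm import ConfusionMatrix
y_true_w = [2, 0, 2, 2, 0, 1]
y_pred_w = [0, 0, 2, 1, 0, 2]
sample_weights = [2, 2, 2, 1, 1, 1]   # hypothetical per-sample weights
cm_w = ConfusionMatrix(y_true_w, y_pred_w, sample_weight=sample_weights)
print(cm_w.table)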
transpose is added in version 1.2 in order to transpose the input matrix (only in Direct CM mode).
cm = ConfusionMatrix(matrix={0: {0: 3, 1: 0, 2: 0}, 1: {0: 0, 1: 1, 2: 2}, 2: {0: 2, 1: 1, 2: 3}},digit=5,transpose=True)
cm.print_matrix()
relabel method is added in version 1.5 in order to change ConfusionMatrix class names.
cm.relabel(mapping={0:"L1",1:"L2",2:"L3"})
cm
mapping : mapping dictionary (type : dict)
position method is added in version 2.8 in order to find the indexes of observations in predict_vector which made TP, TN, FP, and FN.
cm3 = ConfusionMatrix(y_actu, y_pred,digit=5)
cm3.position()
to_array method is added in version 2.9 in order to return the confusion matrix in the form of a NumPy array. This can be helpful for applying different operations over the confusion matrix for different purposes such as aggregation, normalization, and combination.
cm.to_array()
cm.to_array(normalized=True)
cm.to_array(normalized=True,one_vs_all=True, class_name="L1")
normalized : a flag for getting normalized confusion matrix (type : bool, default : False)
one_vs_all : One-Vs-All mode flag (type : bool, default : False)
class_name : target class name for One-Vs-All mode (type : any valid type, default : None)
Confusion Matrix in NumPy array format
combine method is added in version 3.0 in order to merge two confusion matrices. This option will be useful in mini-batch learning.
cm_combined = cm2.combine(cm3)
cm_combined.print_matrix()
other : the other matrix that is going to be combined (type : ConfusionMatrix)
New ConfusionMatrix
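A minimal sketch (hypothetical batches) of the mini-batch pattern mentioned above: build one matrix per batch and fold them together with combine:
from pycm import ConfusionMatrix
batches = [
    ([0, 1, 1, 0], [0, 1, 0, 0]),   # hypothetical (actual, predicted) pairs per batch
    ([1, 1, 0, 1], [1, 0, 0, 1]),
]
total_cm = None
for batch_actual, batch_predict in batches:
    batch_cm = ConfusionMatrix(batch_actual, batch_predict)
    # combine returns a new ConfusionMatrix containing the summed counts
    total_cm = batch_cm if total_cm is None else total_cm.combine(batch_cm)
total_cm.print_matrix()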
plot method is added in version 3.0 in order to plot a confusion matrix using Matplotlib or Seaborn.
import sys
!{sys.executable} -m pip -q -q install matplotlib;
!{sys.executable} -m pip -q -q install seaborn;
import matplotlib.pyplot as plt
cm.plot()
cm.plot(cmap=plt.cm.Greens,number_label=True,normalized=True)
cm.plot(plot_lib = "seaborn",number_label=True)
cm.plot(cmap=plt.cm.Blues,number_label=True,one_vs_all=True,class_name="L1")
cm.plot(cmap=plt.cm.Reds,number_label=True,normalized=True,one_vs_all=True,class_name="L3")
normalized : normalized flag for matrix (type : bool, default : False)
one_vs_all : one_vs_all flag for matrix (type : bool, default : False)
class_name : class name of one_vs_all action (type : any valid type, default : None)
title : plot title (type : str, default : Confusion Matrix)
number_label : number label flag (type : bool, default : False)
cmap : color map (type : matplotlib.colors.ListedColormap, default : None)
plot_lib : plotting library (type : str, default : matplotlib)
Plot axes
online_help function is added in version 1.1 in order to open each statistics definition in a web browser.
>>> from pycm import online_help
>>> online_help("J")
>>> online_help("J", alt_link=True)
>>> online_help("SOA1(Landis & Koch)")
>>> online_help(2)
Calling online_help() without an argument lists the available items.
Setting alt_link = True uses the alternative document link.
param : input parameter (type : int or str, default : None)
alt_link : alternative link for document flag (type : bool, default : False)
This option has been added in version 1.9 to recommend the most related parameters considering the characteristics of the input dataset. The suggested parameters are selected according to some characteristics of the input, such as whether it is balanced or imbalanced and binary or multi-class. All suggestions can be categorized into three main groups: imbalanced dataset, binary classification for a balanced dataset, and multi-class classification for a balanced dataset. The recommendation lists have been gathered according to the respective paper of each parameter and the capabilities claimed by that paper.
Fig2. Parameter Recommender Block Diagram
cm.imbalance
cm.binary
cm.recommended_list
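A minimal sketch that feeds the recommendations back into stat(), so only the suggested statistics are printed for the current cm:
# split the recommended names into overall and class-level statistics
overall_rec = [p for p in cm.recommended_list if p in cm.overall_stat]
class_rec = [p for p in cm.recommended_list if p in cm.class_stat]
cm.stat(overall_param=overall_rec, class_param=class_rec)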
In version 2.0, a method for comparing several confusion matrices was introduced. This option is a combination of several overall and class-based benchmarks. Each of the benchmarks evaluates the performance of the classification algorithm from good to poor and gives it a numeric score. The scores of good and poor performance are 1 and 0, respectively.
After that, two scores are calculated for each confusion matrix: overall and class-based. The overall score is the average of the scores of six overall benchmarks: Landis & Koch, Fleiss, Altman, Cicchetti, Cramer, and Matthews. In the same manner, the class-based score is the average of the scores of six class-based benchmarks: Positive Likelihood Ratio Interpretation, Negative Likelihood Ratio Interpretation, Discriminant Power Interpretation, AUC value Interpretation, Matthews Correlation Coefficient Interpretation, and Yule's Q Interpretation. It should be noted that if one of the benchmarks returns None for one of the classes, that benchmark is eliminated from the averaging. If the user sets weights for the classes, the averaging over the class-based benchmark scores becomes a weighted average.
If the user sets the by_class boolean input to True, the best confusion matrix is the one with the maximum class-based score. Otherwise, if a confusion matrix obtains the maximum of both the overall and class-based scores, it is reported as the best confusion matrix; in any other case, the Compare object does not select a best confusion matrix.
Fig3. Compare Block Diagram
cm2 = ConfusionMatrix(matrix={0:{0:2,1:50,2:6},1:{0:5,1:50,2:3},2:{0:1,1:7,2:50}})
cm3 = ConfusionMatrix(matrix={0:{0:50,1:2,2:6},1:{0:50,1:5,2:3},2:{0:1,1:55,2:2}})
cp = Compare({"cm2":cm2,"cm3":cm3})
print(cp)
cp.scores
cp.sorted
cp.best
cp.best_name
cp2 = Compare({"cm2":cm2,"cm3":cm3},by_class=True,weight={0:5,1:1,2:1})
print(cp2)
actual_vector : python list or numpy array of any stringable objects
predict_vector : python list or numpy array of any stringable objects
matrix : dict
digit : int
threshold : FunctionType (function or lambda)
file : File object
sample_weight : python list or numpy array of numbers
transpose : bool
help(ConfusionMatrix) for more information
cm_dict : python dict of ConfusionMatrix objects (str : ConfusionMatrix)
by_class : bool
weight : python dict of class weights (class_name : float)
digit : int
help(Compare) for more information
A true positive test result is one that detects the condition when the condition is present (correctly identified) [3].
cm.TP
A true negative test result is one that does not detect the condition when the condition is absent (correctly rejected) [3].
cm.TN
A false positive test result is one that detects the condition when the condition is absent (incorrectly identified) [3].
cm.FP
A false negative test result is one that does not detect the condition when the condition is present (incorrectly rejected) [3].
cm.FN
Number of positive samples. Also known as support (the number of occurrences of each class in y_true) [3].
$$P=TP+FN$$
cm.P
Number of negative samples [3].
$$N=TN+FP$$
cm.N
Number of positive outcomes [3].
$$TOP=TP+FP$$
cm.TOP
Number of negative outcomes [3].
$$TON=TN+FN$$
cm.TON
Total sample size [3].
$$POP=TP+TN+FN+FP$$
cm.POP
Sensitivity (also called the true positive rate, the recall, or probability of detection in some fields) measures the proportion of positives that are correctly identified as such (e.g. the percentage of sick people who are correctly identified as having the condition) [3].
$$TPR=\frac{TP}{P}=\frac{TP}{TP+FN}$$
cm.TPR
Specificity (also called the true negative rate) measures the proportion of negatives that are correctly identified as such (e.g. the percentage of healthy people who are correctly identified as not having the condition) [3].
$$TNR=\frac{TN}{N}=\frac{TN}{TN+FP}$$
cm.TNR
Positive predictive value (PPV) is the proportion of positives that correspond to the presence of the condition [3].
$$PPV=\frac{TP}{TP+FP}$$
cm.PPV
Negative predictive value (NPV) is the proportion of negatives that correspond to the absence of the condition [3].
$$NPV=\frac{TN}{TN+FN}$$
cm.NPV
The false negative rate is the proportion of positives which yield negative test outcomes with the test, i.e., the conditional probability of a negative test result given that the condition being looked for is present [3].
$$FNR=\frac{FN}{P}=\frac{FN}{FN+TP}=1-TPR$$
cm.FNR
The false positive rate is the proportion of all negatives that still yield positive test outcomes, i.e., the conditional probability of a positive test result given an event that was not present [3].
The false positive rate is equal to the significance level. The specificity of the test is equal to $ 1 $ minus the false positive rate.
$$FPR=\frac{FP}{N}=\frac{FP}{FP+TN}=1-TNR$$
cm.FPR
The false discovery rate (FDR) is a method of conceptualizing the rate of type I errors in null hypothesis testing when conducting multiple comparisons. FDR-controlling procedures are designed to control the expected proportion of "discoveries" (rejected null hypotheses) that are false (incorrect rejections) [3].
$$FDR=\frac{FP}{FP+TP}=1-PPV$$
cm.FDR
False omission rate (FOR) is a statistical method used in multiple hypothesis testing to correct for multiple comparisons and it is the complement of the negative predictive value. It measures the proportion of false negatives which are incorrectly rejected [3].
$$FOR=\frac{FN}{FN+TN}=1-NPV$$
cm.FOR
The accuracy is the number of correct predictions from all predictions made [3].
$$ACC=\frac{TP+TN}{P+N}=\frac{TP+TN}{TP+TN+FP+FN}$$
cm.ACC
The error rate is the number of incorrect predictions from all predictions made [3].
$$ERR=\frac{FP+FN}{P+N}=\frac{FP+FN}{TP+TN+FP+FN}=1-ACC$$
cm.ERR
In statistical analysis of classification, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision $ p $ and the recall $ r $ of the test to compute the score. The F1 score is the harmonic average of the precision and recall, where F1 score reaches its best value at $ 1 $ (perfect precision and recall) and worst at $ 0 $ [3].
$$F_{\beta}=(1+\beta^2)\times \frac{PPV\times TPR}{(\beta^2 \times PPV)+TPR}=\frac{(1+\beta^2) \times TP}{(1+\beta^2)\times TP+FP+\beta^2 \times FN}$$
cm.F1
cm.F05
cm.F2
cm.F_beta(beta=4)
beta : beta parameter (type : float)
{class1: FBeta-Score1, class2: FBeta-Score2, ...}
The Matthews correlation coefficient is used in machine learning as a measure of the quality of binary (two-class) classifications, introduced by biochemist Brian W. Matthews in 1975. It takes into account true and false positives and negatives and is generally regarded as a balanced measure that can be used even if the classes are of very different sizes. The MCC is, in essence, a correlation coefficient between the observed and predicted binary classifications; it returns a value between $ −1 $ and $ +1 $. A coefficient of $ +1 $ represents a perfect prediction, $ 0 $ no better than random prediction and $ −1 $ indicates total disagreement between prediction and observation [27].
$$MCC=\frac{TP \times TN-FP \times FN}{\sqrt{(TP+FP)\times (TP+FN)\times (TN+FP)\times (TN+FN)}}$$
cm.MCC
The informedness of a prediction method as captured by a contingency matrix is defined as the probability that the prediction method will make a correct decision as opposed to guessing and is calculated using the bookmaker algorithm [2].
Equal to Youden Index
$$BM=TPR+TNR-1$$
cm.BM
In statistics and psychology, the social science concept of markedness is quantified as a measure of how much one variable is marked as a predictor or possible cause of another and is also known as $ \triangle P $ in simple two-choice cases [2].
$$MK=PPV+NPV-1$$
cm.MK
Likelihood ratios are used for assessing the value of performing a diagnostic test. They use the sensitivity and specificity of the test to determine whether a test result usefully changes the probability that a condition (such as a disease state) exists. The first description of the use of likelihood ratios for decision rules was made at a symposium on information theory in 1954 [28].
$$LR_+=PLR=\frac{TPR}{FPR}$$
cm.PLR
Likelihood ratios are used for assessing the value of performing a diagnostic test. They use the sensitivity and specificity of the test to determine whether a test result usefully changes the probability that a condition (such as a disease state) exists. The first description of the use of likelihood ratios for decision rules was made at a symposium on information theory in 1954 [28].
$$LR_-=NLR=\frac{FNR}{TNR}$$
cm.NLR
The diagnostic odds ratio is a measure of the effectiveness of a diagnostic test. It is defined as the ratio of the odds of the test being positive if the subject has a disease relative to the odds of the test being positive if the subject does not have the disease [28].
$$DOR=\frac{LR_+}{LR_-}$$
cm.DOR
Prevalence is a statistical concept referring to the number of cases of a disease that are present in a particular population at a given time (Reference Likelihood) [14].
$$Prevalence=\frac{P}{POP}$$
cm.PRE
The geometric mean of precision and sensitivity, also known as Fowlkes–Mallows index [3].
$$G=\sqrt{PPV\times TPR}$$
cm.G
The expected accuracy from a strategy of randomly guessing categories according to reference and response distributions [24].
$$RACC=\frac{TOP \times P}{POP^2}$$
cm.RACC
The expected accuracy from a strategy of randomly guessing categories according to the average of the reference and response distributions [25].
$$RACCU=(\frac{TOP+P}{2 \times POP})^2$$
cm.RACCU
The Jaccard index, also known as Intersection over Union and the Jaccard similarity coefficient (originally coined coefficient de communauté by Paul Jaccard), is a statistic used for comparing the similarity and diversity of sample sets [29].
$$J=\frac{TP}{TOP+P-TP}$$
cm.J
$$IS=-log_2(\frac{TP+FN}{POP})+log_2(\frac{TP}{TP+FP})$$
cm.IS
CEN is based upon the concept of entropy for evaluating classifier performance. By exploiting the misclassification information of confusion matrices, the measure evaluates the confusion level of the class distribution of misclassified samples. Both theoretical analysis and statistical results show that the proposed measure is more discriminating than accuracy and RCI, while it remains relatively consistent with the two measures. Moreover, it is more capable of measuring how the samples of different classes have been separated from each other. Hence the proposed measure is more precise than the two measures and can substitute for them to evaluate classifiers in classification applications [17].
$$P_{i,j}^{j}=\frac{Matrix(i,j)}{\sum_{k=1}^{|C|}\Big(Matrix(j,k)+Matrix(k,j)\Big)}$$
$$P_{i,j}^{i}=\frac{Matrix(i,j)}{\sum_{k=1}^{|C|}\Big(Matrix(i,k)+Matrix(k,i)\Big)}$$
$$CEN_j=-\sum_{k=1,k\neq j}^{|C|}\Bigg(P_{j,k}^jlog_{2(|C|-1)}\Big(P_{j,k}^j\Big)+P_{k,j}^jlog_{2(|C|-1)}\Big(P_{k,j}^j\Big)\Bigg)$$
cm.CEN
Modified version of CEN [19].
$$P_{i,j}^{j}=\frac{Matrix(i,j)}{\sum_{k=1}^{|C|}\Big(Matrix(j,k)+Matrix(k,j)\Big)-Matrix(j,j)}$$
$$P_{i,j}^{i}=\frac{Matrix(i,j)}{\sum_{k=1}^{|C|}\Big(Matrix(i,k)+Matrix(k,i)\Big)-Matrix(i,i)}$$
$$MCEN_j=-\sum_{k=1,k\neq j}^{|C|}\Bigg(P_{j,k}^jlog_{2(|C|-1)}\Big(P_{j,k}^j\Big)+P_{k,j}^jlog_{2(|C|-1)}\Big(P_{k,j}^j\Big)\Bigg)$$
cm.MCEN
The area under the curve (often referred to as simply the AUC) is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one (assuming 'positive' ranks higher than 'negative'). Thus, AUC corresponds to the arithmetic mean of sensitivity and specificity values of each class [23].
$$AUC=\frac{TNR+TPR}{2}$$
cm.AUC
Euclidean distance of a ROC point from the top left corner of the ROC space, which can take values between 0 (perfect classification) and $ \sqrt{2} $ [23].
$$dInd=\sqrt{(1-TNR)^2+(1-TPR)^2}$$
cm.dInd
sInd ranges between $ 0 $ (no correct classifications) and $ 1 $ (perfect classification) [23].
$$sInd = 1 - \sqrt{\frac{(1-TNR)^2+(1-TPR)^2}{2}}$$
cm.sInd
Discriminant power (DP) is a measure that summarizes sensitivity and specificity. The DP has been used mainly in feature selection over imbalanced data [33].
$$X=\frac{TPR}{1-TPR}$$
$$Y=\frac{TNR}{1-TNR}$$
$$DP=\frac{\sqrt{3}}{\pi}(log_{10}X+log_{10}Y)$$
cm.DP
Youden’s index evaluates the algorithm’s ability to avoid failure; it is derived from sensitivity and specificity and denotes a linear correspondence with balanced accuracy. As Youden’s index is a linear transformation of the mean sensitivity and specificity, its values are difficult to interpret; we retain that a higher value of Y indicates a better ability to avoid failure. Youden’s index has conventionally been used to evaluate diagnostic tests and to improve the efficiency of telemedical prevention [33] [34].
Equal to Bookmaker Informedness
$$Y=BM=TPR+TNR-1$$
cm.Y
For more information visit [33].
PLR | Model contribution |
< 1 | Negligible |
1 - 5 | Poor |
5 - 10 | Fair |
> 10 | Good |
cm.PLRI
For more information visit [48].
NLR | Model contribution |
0.5 - 1 | Negligible |
0.2 - 0.5 | Poor |
0.1 - 0.2 | Fair |
< 0.1 | Good |
cm.NLRI
For more information visit [33].
DP | Model contribution |
< 1 | Poor |
1 - 2 | Limited |
2 - 3 | Fair |
> 3 | Good |
cm.DPI
For more information visit [33].
AUC | Model performance |
0.5 - 0.6 | Poor |
0.6 - 0.7 | Fair |
0.7 - 0.8 | Good |
0.8 - 0.9 | Very Good |
0.9 - 1.0 | Excellent |
cm.AUCI
MCC | Interpretation |
< 0.3 | Negligible |
0.3 - 0.5 | Weak |
0.5 - 0.7 | Moderate |
0.7 - 0.9 | Strong |
0.9 - 1.0 | Very Strong |
cm.MCCI
For more information visit [67].
Q | Interpretation |
< 0.25 | Negligible |
0.25 - 0.5 | Weak |
0.5 - 0.75 | Moderate |
> 0.75 | Strong |
cm.QI
A chance-standardized variant of the AUC is given by the Gini coefficient, taking values between $ 0 $ (no difference between the score distributions of the two classes) and $ 1 $ (complete separation between the two distributions). The Gini coefficient is a widely used metric in imbalanced data learning [33].
$$GI=2\times AUC-1$$
cm.GI
$$LS=\frac{PPV}{PRE}$$
cm.LS
Difference between automatic and manual classification, i.e., the difference between the number of positive outcomes and the number of positive samples.
$$AM=TOP-P=(TP+FP)-(TP+FN)$$
cm.AM
In ecology and biology, the Bray–Curtis dissimilarity, named after J. Roger Bray and John T. Curtis, is a statistic used to quantify the compositional dissimilarity between two different sites, based on counts at each site [37].
$$BCD=\frac{|AM|}{\sum_{i=1}^{|C|}\Big(TOP_i+P_i\Big)}$$
cm.BCD
Optimized precision is a type of hybrid threshold metric and has been proposed as a discriminator for building an optimized heuristic classifier. This metric is a combination of the accuracy, sensitivity, and specificity metrics. Sensitivity and specificity are used to stabilize and optimize the accuracy when dealing with imbalanced two-class problems [40] [42].
$$OP = ACC - \frac{|TNR-TPR|}{|TNR+TPR|}$$
cm.OP
$$IBA_{\alpha}=(1+\alpha \times(TPR-TNR))\times TNR \times TPR$$
cm.IBA
cm.IBA_alpha(0.5)
cm.IBA_alpha(0.1)
alpha : alpha parameter (type : float)
{class1: IBA1, class2: IBA2, ...}
$$GM=\sqrt{TPR \times TNR}$$
cm.GM
In statistics, Yule's Q, also known as the coefficient of colligation, is a measure of association between two binary variables [45].
$$OR = \frac{TP\times TN}{FP\times FN}$$
$$Q = \frac{OR-1}{OR+1}$$
cm.Q
An adjusted version of the geometric mean of specificity and sensitivity [46].
$$N_n=\frac{N}{POP}$$
$$AGM=\frac{GM+TNR\times N_n}{1+N_n};TPR>0$$
$$AGM=0;TPR=0$$
cm.AGM
The F-measure uses only three of the four elements of the confusion matrix, and hence two classifiers with different TNR values may have the same F-score. Therefore, the AGF metric was introduced to use all elements of the confusion matrix and to give more weight to samples which are correctly classified in the minority class [50].
$$AGF=\sqrt{F_2 \times InvF_{0.5}}$$
$$F_{2}=5\times \frac{PPV\times TPR}{(4 \times PPV)+TPR}$$
$$InvF_{0.5}=(1+0.5^2)\times \frac{NPV\times TNR}{(0.5^2 \times NPV)+TNR}$$
cm.AGF
The overlap coefficient, or Szymkiewicz–Simpson coefficient, is a similarity measure that measures the overlap between two finite sets. It is defined as the size of the intersection divided by the smaller of the size of the two sets [52].
$$OC=\frac{TP}{min(TOP,P)}=max(PPV,TPR)$$
cm.OC
In biology, there is a similarity index, known as the Otsuka-Ochiai coefficient named after Yanosuke Otsuka and Akira Ochiai, also known as the Ochiai-Barkman or Ochiai coefficient. If sets are represented as bit vectors, the Otsuka-Ochiai coefficient can be seen to be the same as the cosine similarity [53].
$$OOC=\frac{TP}{\sqrt{TOP\times P}}$$
cm.OOC
The Tversky index, named after Amos Tversky, is an asymmetric similarity measure on sets that compares a variant to a prototype. The Tversky index can be seen as a generalization of Dice's coefficient and Tanimoto coefficient [54].
$$TI(\alpha,\beta)=\frac{TP}{TP+\alpha FN+\beta FP}$$
cm.TI(2,3)
alpha : alpha coefficient (type : float)
beta : beta coefficient (type : float)
{class1: TI1, class2: TI2, ...}
$$AUPR=\frac{TPR+PPV}{2}$$
cm.AUPR
The Individual Classification Success Index (ICSI), is a class-specific symmetric measure defined for classification assessment purpose. ICSI is hence $ 1 $ minus the sum of type I and type II errors. It ranges from $ -1 $ (both errors are maximal, i.e. $ 1 $) to $ 1 $ (both errors are minimal, i.e. $ 0 $), but the value $ 0 $ does not have any clear meaning. The measure is symmetric, and linearly related to the arithmetic mean of TPR and PPV [58].
$$ICSI=PPV+TPR-1$$
cm.ICSI
In statistics, a confidence interval (CI) is a type of interval estimate (of a population parameter) that is computed from the observed data. The confidence level is the frequency (i.e., the proportion) of possible confidence intervals that contain the true value of their corresponding parameter. In other words, if confidence intervals are constructed using a given confidence level in an infinite number of independent experiments, the proportion of those intervals that contain the true value of the parameter will match the confidence level [31].
Supported statistics : ACC, AUC, PRE, Overall ACC, Kappa, TPR, TNR, PPV, NPV, PLR, NLR
Supported alpha values (two-sided) : 0.001, 0.002, 0.01, 0.02, 0.05, 0.1, 0.2
Supported alpha values (one-sided) : 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1
Confidence intervals for TPR, TNR, PPV, NPV, ACC, PRE, and Overall ACC are calculated using the normal approximation to the binomial distribution [59], the Wilson score [62], and the Agresti-Coull method [63]:
$$SE=\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
$$CI=\hat{p}\pm z\times SE$$
$$n=\begin{cases}P & \hat{p} == TPR/FNR\\N & \hat{p} == TNR/FPR\\TOP & \hat{p} == PPV\\TON & \hat{p} ==NPV \\POP& \hat{p} == ACC/ACC_{Overall}\end{cases}$$
$$CI=\frac{\hat{p}+\frac{z^2}{2n}}{1+\frac{z^2}{n}}\pm\frac{z}{1+\frac{z^2}{n}}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}+\frac{z^2}{4n^2}}$$
$$\hat{p}=\frac{x}{n}$$
$$\tilde{p}=\frac{x+\frac{z^2}{2}}{n+z^2}$$
$$CI =\tilde{p}\pm z\sqrt{\frac{\tilde{p}(1-\tilde{p})}{n+z^2}}$$
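As a worked example, the normal-approximation interval above can be reproduced by hand and checked against cm.CI; this is a minimal sketch assuming the relabeled matrix from earlier (classes L1, L2, L3) and a two-sided alpha of 0.05 (z = 1.96):
import math
z = 1.96
cls = "L1"
p_hat = cm.TPR[cls]
n = cm.P[cls]                        # for TPR the denominator is the condition-positive count
se = math.sqrt(p_hat * (1 - p_hat) / n)
print((p_hat - z * se, p_hat + z * se))
print(cm.CI("TPR")[cls])             # [SE, (lower bound, upper bound)] from PyCM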
Confidence intervals for NLR and PLR are calculated using the log method [60]:
$$SE_{LR}=\sqrt{\frac{1}{a}-\frac{1}{b}+\frac{1}{c}-\frac{1}{d}}$$
$$CI_{LR}=e^{ln(LR)\pm z\times SE_{LR}}$$
$$PLR:\begin{cases}a=TP\\b=P\\c=FP\\d=N\end{cases}$$
$$NLR:\begin{cases}a=FN\\b=P\\c=TN\\d=N\end{cases}$$
The confidence interval for AUC is calculated using the Hanley and McNeil formula [61]:
$$SE_{AUC}=\sqrt{\frac{q_0+(N-1)q_1+(P-1)q_2}{N\times P}}$$
$$q_0=AUC(1-AUC)$$
$$q_1=\frac{AUC}{2-AUC}-AUC^2$$
$$q_2=\frac{2AUC^2}{1+AUC}-AUC^2$$
$$CI_{AUC}=AUC\pm z\times SE_{AUC}$$
cm.CI("TPR")
cm.CI("FNR",alpha=0.001,one_sided=True)
cm.CI("PRE",alpha=0.05,binom_method="wilson")
cm.CI("Overall ACC",alpha=0.02,binom_method="agresti-coull")
cm.CI("Overall ACC",alpha=0.05)
param : input parameter (type : str)
alpha : type I error (type : float, default : 0.05)
one_sided : one-sided mode (type : bool, default : False)
binom_method : binomial confidence intervals method (type : str, default : normal-approx)
{class1: [SE1, (Lower CI, Upper CI)], ...}
{class1: [SE1, (Lower one-sided CI, Upper one-sided CI)], ...}
$$NB=\frac{TP-w\times FP}{POP}$$
Vickers and Elkin (2006) suggested considering a range of thresholds and calculating the NB across these thresholds. The results can be plotted in a decision curve [66].
$$p_t=threshold$$ $$w=\frac{p_t}{1-p_t}$$
cm.NB(w=0.059)
w : weight
{class1: NB1, class2: NB2, ...}
Here "average" refers to the arithmetic mean, the sum of the numbers divided by how many numbers are being averaged.
cm.average("PPV")
cm.average("F1")
cm.average("DOR",none_omit=True)
param : input parameter (type : str)
none_omit : none items omitting flag (type : bool, default : False)
Average
The weighted average is similar to an ordinary average, except that instead of each of the data points contributing equally to the final average, some data points contribute more than others.
Default weight is condition positive (number of positive samples).
cm.weighted_average("PPV")
cm.weighted_average("F1")
cm.weighted_average("DOR",none_omit=True)
cm.weighted_average("F1",weight={"L1":23,"L2":2,"L3":1})
param : input parameter (type : str)
weight : explicitly passed weights (type : dict, default : None)
none_omit : none items omitting flag (type : bool, default : False)
Weighted average
Kappa is a statistic that measures inter-rater agreement for qualitative (categorical) items. It is generally thought to be a more robust measure than simple percent agreement calculation, as kappa takes into account the possibility of the agreement occurring by chance [24].
$$Kappa=\frac{ACC_{Overall}-RACC_{Overall}}{1-RACC_{Overall}}$$
cm.Kappa
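A quick check of the formula: kappa can be recomputed from the overall accuracy and overall random accuracy already exposed by the object:
acc = cm.Overall_ACC
racc = cm.Overall_RACC
kappa_manual = (acc - racc) / (1 - racc)
print(kappa_manual, cm.Kappa)   # the two values should match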
The unbiased kappa value is defined in terms of total accuracy and a slightly different computation of expected likelihood that averages the reference and response probabilities [25].
Equal to Scott's Pi
$$Kappa_{Unbiased}=\frac{ACC_{Overall}-RACCU_{Overall}}{1-RACCU_{Overall}}$$
cm.KappaUnbiased
The kappa statistic adjusted for prevalence [14].
$$Kappa_{NoPrevalence}=2 \times ACC_{Overall}-1$$
cm.KappaNoPrevalence
$$v_{ij}=1-\frac{w_{ij}}{max(w)}$$
$$P_e=\sum_{i,j=1}^{|C|}\frac{TOP_i \times P_j}{POP^2}\times v_{ij}$$
$$P_a=\sum_{i,j=1}^{|C|}\frac{Matrix(i,j)}{POP}\times v_{ij}$$
$$Kappa_{Weighted}=\frac{P_a-P_e}{1-P_e}$$
cm.weighted_kappa(weight={"L1":{"L1":0,"L2":1,"L3":2},"L2":{"L1":1,"L2":0,"L3":1},"L3":{"L1":2,"L2":1,"L3":0}})
cm.weighted_kappa()
weight : weight matrix (type : dict, default : None)
Weighted kappa
$$SE_{Kappa}=\sqrt{\frac{ACC_{Overall}\times (1-ACC_{Overall})}{POP\times(1-RACC_{Overall})^2}}$$
cm.Kappa_SE
$$CI_{Kappa}=Kappa \pm 1.96\times SE_{Kappa}$$
cm.Kappa_CI
Pearson's chi-squared test is a statistical test applied to sets of categorical data to evaluate how likely it is that any observed difference between the sets arose by chance. It is suitable for unpaired data from large samples [10].
$$\chi^2=\sum_{i=1}^{|C|}\sum_{j=1}^{|C|}\frac{\Big(Matrix(i,j)-E(i,j)\Big)^2}{E(i,j)}$$
$$E(i,j)=\frac{TOP_j\times P_i}{POP}$$
cm.Chi_Squared
Number of degrees of freedom of this confusion matrix for the chi-squared statistic [10].
$$DF=(|C|-1)^2$$
cm.DF
In statistics, the phi coefficient (or mean square contingency coefficient) is a measure of association for two binary variables. Introduced by Karl Pearson, this measure is similar to the Pearson correlation coefficient in its interpretation. In fact, a Pearson correlation coefficient estimated for two binary variables will return the phi coefficient [10].
$$\phi^2=\frac{\chi^2}{POP}$$
cm.Phi_Squared
In statistics, Cramér's V (sometimes referred to as Cramér's phi) is a measure of association between two nominal variables, giving a value between $ 0 $ and $ +1 $ (inclusive). It is based on Pearson's chi-squared statistic and was published by Harald Cramér in 1946 [26].
$$V=\sqrt{\frac{\phi^2}{|C|-1}}$$
cm.V
The standard error (SE) of a statistic (usually an estimate of a parameter) is the standard deviation of its sampling distribution or an estimate of that standard deviation [31].
$$SE_{ACC}=\sqrt{\frac{ACC\times (1-ACC)}{POP}}$$
cm.SE
In statistics, a confidence interval (CI) is a type of interval estimate (of a population parameter) that is computed from the observed data. The confidence level is the frequency (i.e., the proportion) of possible confidence intervals that contain the true value of their corresponding parameter. In other words, if confidence intervals are constructed using a given confidence level in an infinite number of independent experiments, the proportion of those intervals that contain the true value of the parameter will match the confidence level [31].
$$CI=ACC \pm 1.96\times SE_{ACC}$$
cm.CI95
Bennett, Alpert & Goldstein’s S is a statistical measure of inter-rater agreement. It was created by Bennett et al. in 1954. Bennett et al. suggested that adjusting inter-rater reliability to accommodate the percentage of rater agreement that might be expected by chance was a better measure than simple agreement between raters [8].
$$p_c=\frac{1}{|C|}$$
$$S=\frac{ACC_{Overall}-p_c}{1-p_c}$$
cm.S
Scott's pi (named after William A. Scott) is a statistic for measuring inter-rater reliability for nominal data in communication studies. Textual entities are annotated with categories by different annotators, and various measures are used to assess the extent of agreement between the annotators, one of which is Scott's pi. Since automatically annotating text is a popular problem in natural language processing, and the goal is to get the computer program that is being developed to agree with the humans in the annotations it creates, assessing the extent to which humans agree with each other is important for establishing a reasonable upper limit on computer performance [7].
Equal to Kappa Unbiased
$$p_c=\sum_{i=1}^{|C|}(\frac{TOP_i + P_i}{2\times POP})^2$$
$$\pi=\frac{ACC_{Overall}-p_c}{1-p_c}$$
cm.PI
AC1 was originally introduced by Gwet in 2001 (Gwet, 2001). The interpretation of AC1 is similar to generalized kappa (Fleiss, 1971), which is used to assess inter-rater reliability when there are multiple raters. Gwet (2002) demonstrated that AC1 can overcome the limitations of kappa, which is sensitive to trait prevalence and raters' classification probabilities (i.e., marginal probabilities), whereas AC1 provides a more robust measure of inter-rater reliability [6].
$$\pi_i=\frac{TOP_i + P_i}{2\times POP}$$
$$p_c=\frac{1}{|C|-1}\sum_{i=1}^{|C|}\Big(\pi_i\times (1-\pi_i)\Big)$$
$$AC_1=\frac{ACC_{Overall}-p_c}{1-p_c}$$
cm.AC1
The entropy of the decision problem itself as defined by the counts for the reference. The entropy of a distribution is the average negative log probability of outcomes [30].
$$Likelihood_{Reference}=\frac{P_i}{POP}$$
$$Entropy_{Reference}=-\sum_{i=1}^{|C|}Likelihood_{Reference}(i)\times\log_{2}{Likelihood_{Reference}(i)}$$
$$0\times\log_{2}{0}\equiv0$$
cm.ReferenceEntropy
The entropy of the response distribution. The entropy of a distribution is the average negative log probability of outcomes [30].
$$Likelihood_{Response}=\frac{TOP_i}{POP}$$
$$Entropy_{Response}=-\sum_{i=1}^{|C|}Likelihood_{Response}(i)\times\log_{2}{Likelihood_{Response}(i)}$$
$$0\times\log_{2}{0}\equiv0$$
cm.ResponseEntropy
The cross-entropy of the response distribution against the reference distribution. The cross-entropy is defined by the negative log probabilities of the response distribution weighted by the reference distribution [30].
$$Likelihood_{Reference}=\frac{P_i}{POP}$$
$$Likelihood_{Response}=\frac{TOP_i}{POP}$$
$$Entropy_{Cross}=-\sum_{i=1}^{|C|}Likelihood_{Reference}(i)\times\log_{2}{Likelihood_{Response}(i)}$$
$$0\times\log_{2}{0}\equiv0$$
cm.CrossEntropy
The entropy of the joint reference and response distribution as defined by the underlying matrix [30].
$$P^{'}(i,j)=\frac{Matrix(i,j)}{POP}$$
$$Entropy_{Joint}=-\sum_{i=1}^{|C|}\sum_{j=1}^{|C|}P^{'}(i,j)\times\log_{2}{P^{'}(i,j)}$$
$$0\times\log_{2}{0}\equiv0$$
cm.JointEntropy
The entropy of the distribution of categories in the response given that the reference category was as specified [30].
$$P^{'}(j|i)=\frac{Matrix(j,i)}{P_i}$$
$$Entropy_{Conditional}=\sum_{i=1}^{|C|}\Bigg(Likelihood_{Reference}(i)\times\Big(-\sum_{j=1}^{|C|}P^{'}(j|i)\times\log_{2}{P^{'}(j|i)}\Big)\Bigg)$$
$$0\times\log_{2}{0}\equiv0$$
cm.ConditionalEntropy
$$Likelihood_{Response}=\frac{TOP_i}{POP}$$
$$Likelihood_{Reference}=\frac{P_i}{POP}$$
$$Divergence=\sum_{i=1}^{|C|}Likelihood_{Reference}(i)\times\log_{2}{\frac{Likelihood_{Reference}(i)}{Likelihood_{Response}(i)}}$$
cm.KL
Mutual information is defined as the Kullback-Leibler divergence between the joint distribution and the product of the individual distributions. Mutual information is symmetric. We could also subtract the conditional entropy of the reference given the response from the reference entropy to get the same result [11] [30].
$$P^{'}(i,j)=\frac{Matrix(i,j)}{POP}$$
$$Likelihood_{Reference}=\frac{P_i}{POP}$$
$$Likelihood_{Response}=\frac{TOP_i}{POP}$$
$$MI=\sum_{i=1}^{|C|}\sum_{j=1}^{|C|}P^{'}(i,j)\times\log_{2}\Big(\frac{P^{'}(i,j)}{Likelihood_{Reference}(i)\times Likelihood_{Response}(j)}\Big)$$
$$MI=Entropy_{Response}-Entropy_{Conditional}$$
cm.MutualInformation
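A quick check of the entropy identity above, using the entropies already computed by the object:
mi_from_entropies = cm.ResponseEntropy - cm.ConditionalEntropy
print(mi_from_entropies, cm.MutualInformation)   # the two values should match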
In probability theory and statistics, Goodman & Kruskal's lambda is a measure of proportional reduction in error in cross tabulation analysis [12].
$$\lambda_A=\frac{\sum_{j=1}^{|C|}Max\Big(Matrix(-,j)\Big)-Max(P)}{POP-Max(P)}$$
cm.LambdaA
In probability theory and statistics, Goodman & Kruskal's lambda is a measure of proportional reduction in error in cross tabulation analysis [13].
$$\lambda_B=\frac{\sum_{i=1}^{|C|}Max\Big(Matrix(i,-)\Big)-Max(TOP)}{POP-Max(TOP)}$$
cm.LambdaB
For more information visit [1].
Kappa | Strength of Agreement |
< 0 | Poor |
0 - 0.2 | Slight |
0.2 – 0.4 | Fair |
0.4 – 0.6 | Moderate |
0.6 – 0.8 | Substantial |
0.8 – 1.0 | Almost perfect |
cm.SOA1
For more information visit [4].
Kappa | Strength of Agreement |
< 0.40 | Poor |
0.40 - 0.75 | Intermediate to Good |
> 0.75 | Excellent |
cm.SOA2
For more information visit [5].
Kappa | Strength of Agreement |
< 0.2 | Poor |
0.2 – 0.4 | Fair |
0.4 – 0.6 | Moderate |
0.6 – 0.8 | Good |
0.8 – 1.0 | Very Good |
cm.SOA3
For more information visit [9].
Kappa | Strength of Agreement |
< 0.40 | Poor |
0.40 – 0.59 | Fair |
0.59 – 0.74 | Good |
0.74 – 1.00 | Excellent |
cm.SOA4
For more information visit [47].
Cramer's V | Strength of Association |
< 0.1 | Negligible |
0.1 – 0.2 | Weak |
0.2 – 0.4 | Moderate |
0.4 – 0.6 | Relatively Strong |
0.6 – 0.8 | Strong |
0.8 – 1.0 | Very Strong |
cm.SOA5
Overall MCC | Strength of Association |
< 0.3 | Negligible |
0.3 - 0.5 | Weak |
0.5 - 0.7 | Moderate |
0.7 - 0.9 | Strong |
0.9 - 1.0 | Very Strong |
cm.SOA6
For more information visit [3].
$$ACC_{Overall}=\frac{\sum_{i=1}^{|C|}TP_i}{POP}$$
cm.Overall_ACC
For more information visit [24].
$$RACC_{Overall}=\sum_{i=1}^{|C|}RACC_i$$
cm.Overall_RACC
For more information visit [25].
$$RACCU_{Overall}=\sum_{i=1}^{|C|}RACCU_i$$
cm.Overall_RACCU
For more information visit [3].
$$PPV_{Micro}=\frac{\sum_{i=1}^{|C|}TP_i}{\sum_{i=1}^{|C|}TP_i+FP_i}$$
cm.PPV_Micro
For more information visit [3].
$$TPR_{Micro}=\frac{\sum_{i=1}^{|C|}TP_i}{\sum_{i=1}^{|C|}TP_i+FN_i}$$
cm.TPR_Micro
For more information visit [3].
$$TNR_{Micro}=\frac{\sum_{i=1}^{|C|}TN_i}{\sum_{i=1}^{|C|}TN_i+FP_i}$$
cm.TNR_Micro
For more information visit [3].
$$FPR_{Micro}=\frac{\sum_{i=1}^{|C|}FP_i}{\sum_{i=1}^{|C|}TN_i+FP_i}$$
cm.FPR_Micro
For more information visit [3].
$$FNR_{Micro}=\frac{\sum_{i=1}^{|C|}FN_i}{\sum_{i=1}^{|C|}TP_i+FN_i}$$
cm.FNR_Micro
For more information visit [3].
$$F_{1_{Micro}}=2\frac{\sum_{i=1}^{|C|}TPR_i\times PPV_i}{\sum_{i=1}^{|C|}TPR_i+PPV_i}$$
cm.F1_Micro
For more information visit [3].
$$PPV_{Macro}=\frac{1}{|C|}\sum_{i=1}^{|C|}\frac{TP_i}{TP_i+FP_i}$$
cm.PPV_Macro
For more information visit [3].
$$TPR_{Macro}=\frac{1}{|C|}\sum_{i=1}^{|C|}\frac{TP_i}{TP_i+FN_i}$$
cm.TPR_Macro
For more information visit [3].
$$TNR_{Macro}=\frac{1}{|C|}\sum_{i=1}^{|C|}\frac{TN_i}{TN_i+FP_i}$$
cm.TNR_Macro
For more information visit [3].
$$FPR_{Macro}=\frac{1}{|C|}\sum_{i=1}^{|C|}\frac{FP_i}{TN_i+FP_i}$$
cm.FPR_Macro
For more information visit [3].
$$FNR_{Macro}=\frac{1}{|C|}\sum_{i=1}^{|C|}\frac{FN_i}{TP_i+FN_i}$$
cm.FNR_Macro
For more information visit [3].
$$F_{1_{Macro}}=\frac{2}{|C|}\sum_{i=1}^{|C|}\frac{TPR_i\times PPV_i}{TPR_i+PPV_i}$$
cm.F1_Macro
For more information visit [3].
$$ACC_{Macro}=\frac{1}{|C|}\sum_{i=1}^{|C|}{ACC_i}$$
cm.ACC_Macro
For more information visit [29].
$$J_{Mean}=\frac{1}{|C|}\sum_{i=1}^{|C|}J_i$$
$$J_{Sum}=\sum_{i=1}^{|C|}J_i$$
$$J_{Overall}=(J_{Sum},J_{Mean})$$
cm.Overall_J
The average Hamming loss or Hamming distance between two sets of samples [31].
$$L_{Hamming}=\frac{1}{POP}\sum_{i=1}^{POP}1(y_i \neq \widehat{y}_i)$$
cm.HammingLoss
Zero-one loss is a common loss function used with classification learning. It assigns $ 0 $ to loss for a correct classification and $ 1 $ for an incorrect classification [31].
$$L_{0-1}=\sum_{i=1}^{POP}1(y_i \neq \widehat{y}_i)$$
cm.ZeroOneLoss
Largest class percentage in the data [57].
$$NIR=\frac{1}{POP}Max(P)$$
cm.NIR
In statistical hypothesis testing, the p-value or probability value is, for a given statistical model, the probability that, when the null hypothesis is true, the statistical summary (such as the absolute value of the sample mean difference between two compared groups) would be greater than or equal to the actual observed results [31] .
Here, a one-sided binomial test is used to see if the accuracy is better than the no information rate (NIR) [57].
$$x=\sum_{i=1}^{|C|}TP_{i}$$
$$p=NIR$$
$$n=POP$$
$$P-Value_{(ACC > NIR)}=1-\sum_{i=1}^{x}\left(\begin{array}{c}n\\ i\end{array}\right)p^{i}(1-p)^{n-i}$$
cm.PValue
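A minimal sketch of the same one-sided test written out with SciPy (an assumed extra dependency, not required by PyCM); it computes P(X >= x) under the no-information rate and can be compared with cm.PValue:
from scipy.stats import binom
x = sum(cm.TP.values())              # total number of correct predictions
n = list(cm.POP.values())[0]         # population size (identical for every class)
p = cm.NIR                           # no-information rate
print(1 - binom.cdf(x - 1, n, p))    # P(X >= x)
print(cm.PValue)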
For more information visit [17].
$$P_j=\frac{\sum_{k=1}^{|C|}\Big(Matrix(j,k)+Matrix(k,j)\Big)}{2\sum_{k,l=1}^{|C|}Matrix(k,l)}$$
$$CEN_{Overall}=\sum_{j=1}^{|C|}P_jCEN_j$$
cm.Overall_CEN
For more information visit [19].
$$\alpha=\begin{cases}1 & |C| > 2\\0 & |C| = 2\end{cases}$$
$$P_j=\frac{\sum_{k=1}^{|C|}\Big(Matrix(j,k)+Matrix(k,j)\Big)-Matrix(j,j)}{2\sum_{k,l=1}^{|C|}Matrix(k,l)-\alpha \sum_{k=1}^{|C|}Matrix(k,k)}$$
$$MCEN_{Overall}=\sum_{j=1}^{|C|}P_jMCEN_j$$
cm.Overall_MCEN
$$MCC_{Overall}=\frac{cov(X,Y)}{\sqrt{cov(X,X)\times cov(Y,Y)}}$$
$$cov(X,Y)=\sum_{i,j,k=1}^{|C|}\Big(Matrix(i,i)Matrix(k,j)-Matrix(j,i)Matrix(i,k)\Big)$$
$$cov(X,X) = \sum_{i=1}^{|C|}\Bigg[\Big(\sum_{j=1}^{|C|}Matrix(j,i)\Big)\Big(\sum_{k,l=1,k\neq i}^{|C|}Matrix(l,k)\Big)\Bigg]$$
$$cov(Y,Y) = \sum_{i=1}^{|C|}\Bigg[\Big(\sum_{j=1}^{|C|}Matrix(i,j)\Big)\Big(\sum_{k,l=1,k\neq i}^{|C|}Matrix(k,l)\Big)\Bigg]$$
cm.Overall_MCC
For more information visit [21].
$$RR=\frac{1}{|C|}\sum_{i,j=1}^{|C|}Matrix(i,j)$$
cm.RR
As an evaluation tool, CBA creates an overall assessment of model predictive power by scrutinizing measures simultaneously across each class in a conservative manner that guarantees that a model’s ability to recall observations from each class and its ability to do so efficiently won’t fall below the bound [22] [51].
$$CBA=\frac{\sum_{i=1}^{|C|}\frac{Matrix(i,i)}{Max(TOP_i,P_i)}}{|C|}$$
cm.CBA
When dealing with multiclass problems, a global measure of classification performances based on the ROC approach (AUNU) has been proposed as the average of single-class measures [23].
$$AUNU=\frac{\sum_{i=1}^{|C|}AUC_i}{|C|}$$
cm.AUNU
Another option (AUNP) is that of averaging the $ AUC_i $ values with weights proportional to the number of samples experimentally belonging to each class, that is, the a priori class distribution [23].
$$AUNP=\sum_{i=1}^{|C|}\frac{P_i}{POP}AUC_i$$
cm.AUNP
$$H_d=-\sum_{i=1}^{|C|}\Big(\frac{\sum_{l=1}^{|C|}Matrix(i,l)}{\sum_{h,k=1}^{|C|}Matrix(h,k)}log_2\frac{\sum_{l=1}^{|C|}Matrix(i,l)}{\sum_{h,k=1}^{|C|}Matrix(h,k)}\Big)=Entropy_{Reference}$$
$$H_o=\sum_{j=1}^{|C|}\Big(\frac{\sum_{k=1}^{|C|}Matrix(k,j)}{\sum_{h,l=0}^{|C|}Matrix(h,l)}H_{oj}\Big)=Entropy_{Conditional}$$
$$H_{oj}=-\sum_{i=1}^{|C|}\Big(\frac{Matrix(i,j)}{\sum_{k=1}^{|C|}Matrix(k,j)}log_2\frac{Matrix(i,j)}{\sum_{k=1}^{|C|}Matrix(k,j)}\Big)$$
$$RCI=\frac{H_d-H_o}{H_d}=\frac{MI}{Entropy_{Reference}}$$
cm.RCI
$$C=\sqrt{\frac{\chi^2}{\chi^2+POP}}$$
cm.C
The Classification Success Index (CSI) is an overall measure defined by averaging ICSI over all classes [58].
$$CSI=\frac{1}{|C|}\sum_{i=1}^{|C|}{ICSI_i}$$
cm.CSI
The Rand index or Rand measure (named after William M. Rand) in statistics, and in particular in data clustering, is a measure of the similarity between two data clusterings. A form of the Rand index may be defined that is adjusted for the chance grouping of elements, this is the adjusted Rand index. From a mathematical standpoint, Rand index is related to the accuracy, but is applicable even when class labels are not used [68].
The Adjusted Rand Index (ARI) is frequently used in cluster validation since it is a measure of agreement between two partitions: one given by the clustering process and the other defined by external criteria, but it can also be used in supervised learning [69].
$$X=\frac{\sum_{i}C_{2}^{P_i}\times \sum_{j}C_{2}^{TOP_j}}{C_2^{POP}}$$
$$ARI=\frac{\sum_{i,j}C_{2}^{Matrix(i,j)}-X}{\frac{1}{2}[\sum_{i}C_{2}^{P_i} + \sum_{j}C_{2}^{TOP_j}]-X}$$
cm.ARI
$$B=\frac{\sum_{i=1}^{|C|}TP_i^2}{\sum_{i=1}^{|C|}TOP_i\times P_i}$$
cm.B
Krippendorff's alpha coefficient, named after academic Klaus Krippendorff, is a statistical measure of the agreement achieved when coding a set of units of analysis in terms of the values of a variable. Krippendorff's alpha generalizes several known statistics, often called measures of inter-coder agreement, inter-rater reliability, reliability of coding given sets of units (as distinct from unitizing) but it also distinguishes itself from statistics that are called reliability coefficients but are unsuitable to the particulars of coding data generated for subsequent analysis [74].
$$\epsilon = \frac{1}{2\times POP}$$
$$P_a=(1-\epsilon)\times ACC_{Overall}+\epsilon$$
$$P_e=RACCU_{Overall}$$
$$\alpha=\frac{P_a-P_e}{1-P_e}$$
cm.Alpha
Weighted Krippendorff's alpha coefficient [74].
$$\epsilon = \frac{1}{2\times POP}$$
$$v_{ij}=1-\frac{w_{ij}}{max(w)}$$
$$P_e=\sum_{i,j=1}^{|C|}(\frac{TOP_i \times P_j}{2 \times POP})^2\times v_{ij}$$
$$P_a^*=\sum_{i,j=1}^{|C|}\frac{Matrix(i,j)}{POP}\times v_{ij}$$
$$P_a=(1-\epsilon)\times P_a^*+\epsilon$$
$$\alpha_{Weighted}=\frac{P_a-P_e}{1-P_e}$$
cm.weighted_alpha(weight={"L1":{"L1":0,"L2":1,"L3":2},"L2":{"L1":1,"L2":0,"L3":1},"L3":{"L1":2,"L2":1,"L3":0}})
cm.weighted_alpha()
weight : weight matrix (type : dict, default : None)
Weighted alpha
Aickin's alpha coefficient [75].
$$\alpha^{(t+1)}=\frac{p_a-p_e^{(t)}}{1-p_e^{(t)}}$$
$$p_e^{(t)}=\sum_{k=1}^{|C|}p_{k|H}^{A(t)}\times p_{k|H}^{B(t)}$$
$$p_a=ACC_{Overall}$$
$$p_{k|H}^{A(t+1)}=\frac{TOP_k}{(1-\alpha^{(t)})+\alpha^{(t)}\times \frac{p_{k|H}^{B(t)}}{p_e^{(t)}}\times POP}$$
$$p_{k|H}^{B(t+1)}=\frac{P_k}{(1-\alpha^{(t)})+\alpha^{(t)}\times \frac{p_{k|H}^{A(t)}}{p_e^{(t)}}\times POP}$$
$$Stop:|\alpha^{(t+1)}-\alpha^{(t)}|<\epsilon$$
cm.aickin_alpha()
cm.aickin_alpha(max_iter=2000,epsilon=0.00003)
epsilon : difference threshold (type : float, default : 0.0001)
max_iter : maximum iteration (type : int, default : 200)
Aickin's alpha
print(cm)
cm.print_matrix()
cm.matrix
cm.print_matrix(one_vs_all=True,class_name = "L1")
sparse_cm = ConfusionMatrix(matrix={1:{1:0,2:2},2:{1:0,2:18}})
sparse_cm.print_matrix(sparse=True)
one_vs_all : One-Vs-All mode flag (type : bool, default : False)
class_name : target class name for One-Vs-All mode (type : any valid type, default : None)
sparse : sparse mode printing flag (type : bool, default : False)
cm.print_normalized_matrix()
cm.normalized_matrix
cm.print_normalized_matrix(one_vs_all=True,class_name = "L1")
sparse_cm.print_normalized_matrix(sparse=True)
one_vs_all : One-Vs-All mode flag (type : bool, default : False)
class_name : target class name for One-Vs-All mode (type : any valid type, default : None)
sparse : sparse mode printing flag (type : bool, default : False)
cm.stat()
cm.stat(overall_param=["Kappa"],class_param=["ACC","AUC","TPR"])
cm.stat(overall_param=["Kappa"],class_param=["ACC","AUC","TPR"],class_name=["L1","L3"])
cm.stat(summary=True)
overall_param : overall statistics names for print (type : list, default : None)
class_param : class statistics names for print (type : list, default : None)
class_name : class names for print (subset of classes) (type : list, default : None)
summary : summary mode flag (type : bool, default : False)
cp.print_report()
print(cp)
import os
if "Document_Files" not in os.listdir():
os.mkdir("Document_Files")
cm.save_stat(os.path.join("Document_Files","cm1"))
cm.save_stat(os.path.join("Document_Files","cm1_filtered"),overall_param=["Kappa"],class_param=["ACC","AUC","TPR"])
cm.save_stat(os.path.join("Document_Files","cm1_filtered2"),overall_param=["Kappa"],class_param=["ACC","AUC","TPR"],class_name=["L1"])
cm.save_stat(os.path.join("Document_Files","cm1_summary"),summary=True)
sparse_cm.save_stat(os.path.join("Document_Files","sparse_cm"),summary=True,sparse=True)
cm.save_stat("cm1asdasd/")
name : output file name (type : str)
address : flag for address return (type : bool, default : True)
overall_param : overall statistics names for save (type : list, default : None)
class_param : class statistics names for save (type : list, default : None)
class_name : class names for print (subset of classes) (type : list, default : None)
summary : summary mode flag (type : bool, default : False)
sparse : sparse mode printing flag (type : bool, default : False)
cm.save_html(os.path.join("Document_Files","cm1"))
cm.save_html(os.path.join("Document_Files","cm1_filtered"),overall_param=["Kappa"],class_param=["ACC","AUC","TPR"])
cm.save_html(os.path.join("Document_Files","cm1_filtered2"),overall_param=["Kappa"],class_param=["ACC","AUC","TPR"],class_name=["L1"])
cm.save_html(os.path.join("Document_Files","cm1_colored"),color=(255, 204, 255))
cm.save_html(os.path.join("Document_Files","cm1_colored2"),color="Crimson")
cm.save_html(os.path.join("Document_Files","cm1_normalized"),color="Crimson",normalize=True)
cm.save_html(os.path.join("Document_Files","cm1_summary"),summary=True,normalize=True)
cm.save_html("cm1asdasd/")
name : output file name (type : str)
address : flag for address return (type : bool, default : True)
overall_param : overall statistics names for save (type : list, default : None)
class_param : class statistics names for save (type : list, default : None)
class_name : class names for print (subset of classes) (type : list, default : None)
color : matrix color (R,G,B) (type : tuple/str, default : (0,0,0)), supports X11 color names
normalize : save normalized matrix flag (type : bool, default : False)
summary : summary mode flag (type : bool, default : False)
alt_link : alternative link for document flag (type : bool, default : False)
cm.save_csv(os.path.join("Document_Files","cm1"))
cm.save_csv(os.path.join("Document_Files","cm1_filtered"),class_param=["ACC","AUC","TPR"])
cm.save_csv(os.path.join("Document_Files","cm1_filtered2"),class_param=["ACC","AUC","TPR"],normalize=True)
cm.save_csv(os.path.join("Document_Files","cm1_filtered3"),class_param=["ACC","AUC","TPR"],class_name=["L1"])
cm.save_csv(os.path.join("Document_Files","cm1_header"),header=True)
cm.save_csv(os.path.join("Document_Files","cm1_summary"),summary=True,matrix_save=False)
cm.save_csv("cm1asdasd/")
name : output file name (type : str)
address : flag for address return (type : bool, default : True)
class_param : class statistics names for save (type : list, default : None)
class_name : class names for print (subset of classes) (type : list, default : None)
matrix_save : flag for saving matrix in separate CSV file (type : bool, default : True)
normalize : flag for saving normalized matrix instead of matrix (type : bool, default : False)
summary : summary mode flag (type : bool, default : False)
header : flag for adding header to matrix CSV file (type : bool, default : False)
cm.save_obj(os.path.join("Document_Files","cm1"))
cm.save_obj(os.path.join("Document_Files","cm1_stat"),save_stat=True)
cm.save_obj(os.path.join("Document_Files","cm1_no_vectors"),save_vector=False)
cm.save_obj("cm1asdasd/")
name : output file name (type : str)
address : flag for address return (type : bool, default : True)
save_stat : save statistics flag (type : bool, default : False)
save_vector : save vectors flag (type : bool, default : True)
cp.save_report(os.path.join("Document_Files","cp"))
cp.save_report("cm1asdasd/")
name : output file name (type : str)
address : flag for address return (type : bool, default : True)
try:
cm2=ConfusionMatrix(y_actu, 2)
except pycmVectorError as e:
print(str(e))
try:
cm3=ConfusionMatrix(y_actu, [1,2,3])
except pycmVectorError as e:
print(str(e))
try:
cm_4 = ConfusionMatrix([], [])
except pycmVectorError as e:
print(str(e))
try:
cm_5 = ConfusionMatrix([1,1,1,], [1,1,1,1])
except pycmVectorError as e:
print(str(e))
try:
cm3=ConfusionMatrix(matrix={})
except pycmMatrixError as e:
print(str(e))
try:
cm_4=ConfusionMatrix(matrix={1:{1:2,"1":2},"1":{1:2,"1":3}})
except pycmMatrixError as e:
print(str(e))
try:
cm_5=ConfusionMatrix(matrix={1:{1:2}})
except pycmMatrixError as e:
print(str(e))
try:
cp=Compare([cm2,cm3])
except pycmCompareError as e:
print(str(e))
try:
cp=Compare({"cm1":cm,"cm2":cm2})
except pycmCompareError as e:
print(str(e))
try:
cp=Compare({"cm1":[],"cm2":cm2})
except pycmCompareError as e:
print(str(e))
try:
cp=Compare({"cm2":cm2})
except pycmCompareError as e:
print(str(e))
try:
cp=Compare({"cm1":cm2,"cm2":cm3},by_class=True,weight={1:2,2:0})
except pycmCompareError as e:
print(str(e))
try:
cm.CI("MCC")
except pycmCIError as e:
print(str(e))
try:
cm.CI(2)
except pycmCIError as e:
print(str(e))
try:
cm.average("AXY")
except pycmAverageError as e:
print(str(e))
try:
cm.weighted_average("AXY")
except pycmAverageError as e:
print(str(e))
try:
cm.weighted_average("AUC",weight={1:22})
except pycmAverageError as e:
print(str(e))
try:
cm.position()
except pycmVectorError as e:
print(str(e))
try:
cm.combine(2)
except pycmMatrixError as e:
print(str(e))
If you use PyCM in your research, we would appreciate citations to the following paper :
Haghighi, S., Jasemi, M., Hessabi, S. and Zolanvari, A. (2018). PyCM: Multiclass confusion matrix library in Python.
Journal of Open Source Software, 3(25), p.729.
@article{Haghighi2018, doi = {10.21105/joss.00729}, url = {https://doi.org/10.21105/joss.00729}, year = {2018}, month = {may}, publisher = {The Open Journal}, volume = {3}, number = {25}, pages = {729}, author = {Sepand Haghighi and Masoomeh Jasemi and Shaahin Hessabi and Alireza Zolanvari}, title = {{PyCM}: Multiclass confusion matrix library in Python}, journal = {Journal of Open Source Software} }
Download PyCM.bib
1- J. R. Landis, G. G. Koch, “The measurement of observer agreement for categorical data. Biometrics,” in International Biometric Society, pp. 159–174, 1977.
2- D. M. W. Powers, “Evaluation: from precision, recall and f-measure to roc, informedness, markedness & correlation,” in Journal of Machine Learning Technologies, pp.37-63, 2011.
3- C. Sammut, G. Webb, “Encyclopedia of Machine Learning” in Springer, 2011.
4- J. L. Fleiss, “Measuring nominal scale agreement among many raters,” in Psychological Bulletin, pp. 378-382, 1971.
5- D.G. Altman, “Practical Statistics for Medical Research,” in Chapman and Hall, 1990.
6- K. L. Gwet, “Computing inter-rater reliability and its variance in the presence of high agreement,” in The British Journal of Mathematical and Statistical Psychology, pp. 29–48, 2008.
7- W. A. Scott, “Reliability of content analysis: The case of nominal scaling,” in Public Opinion Quarterly, pp. 321–325, 1955.
8- E. M. Bennett, R. Alpert, and A. C. Goldstein, “Communication through limited response questioning,” in The Public Opinion Quarterly, pp. 303–308, 1954.
9- D. V. Cicchetti, "Guidelines, criteria, and rules of thumb for evaluating normed and standardized assessment instruments in psychology," in Psychological Assessment, pp. 284–290, 1994.
10- R.B. Davies, "Algorithm AS155: The Distributions of a Linear Combination of χ2 Random Variables," in Journal of the Royal Statistical Society, pp. 323–333, 1980.
11- S. Kullback, R. A. Leibler "On information and sufficiency," in Annals of Mathematical Statistics, pp. 79–86, 1951.
12- L. A. Goodman, W. H. Kruskal, "Measures of Association for Cross Classifications, IV: Simplification of Asymptotic Variances," in Journal of the American Statistical Association, pp. 415–421, 1972.
13- L. A. Goodman, W. H. Kruskal, "Measures of Association for Cross Classifications III: Approximate Sampling Theory," in Journal of the American Statistical Association, pp. 310–364, 1963.
14- T. Byrt, J. Bishop and J. B. Carlin, “Bias, prevalence, and kappa,” in Journal of Clinical Epidemiology pp. 423-429, 1993.
15- M. Shepperd, D. Bowes, and T. Hall, “Researcher Bias: The Use of Machine Learning in Software Defect Prediction,” in IEEE Transactions on Software Engineering, pp. 603-616, 2014.
16- X. Deng, Q. Liu, Y. Deng, and S. Mahadevan, “An improved method to construct basic probability assignment based on the confusion matrix for classification problem, ” in Information Sciences, pp.250-261, 2016.
17- J.-M. Wei, X.-J. Yuan, Q.-H. Hu, and S.-Q. Wang, "A novel measure for evaluating classifiers," in Expert Systems with Applications, pp. 3799-3809, 2010.
18- I. Kononenko and I. Bratko, "Information-based evaluation criterion for classifier's performance," in Machine Learning, pp. 67-80, 1991.
19- R. Delgado and J. D. Núñez-González, "Enhancing Confusion Entropy as Measure for Evaluating Classifiers," in The 13th International Conference on Soft Computing Models in Industrial and Environmental Applications, pp. 79-89, 2018: Springer.
20- J. Gorodkin, "Comparing two K-category assignments by a K-category correlation coefficient," in Computational Biology and Chemistry, pp. 367-374, 2004.
21- C. O. Freitas, J. M. De Carvalho, J. Oliveira, S. B. Aires, and R. Sabourin, "Confusion matrix disagreement for multiple classifiers," in Iberoamerican Congress on Pattern Recognition, pp. 387-396, 2007.
22- P. Branco, L. Torgo, and R. P. Ribeiro, "Relevance-based evaluation metrics for multi-class imbalanced domains," in Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 698-710, 2017. Springer.
23- D. Ballabio, F. Grisoni, and R. Todeschini, "Multivariate comparison of classification performance measures," in Chemometrics and Intelligent Laboratory Systems, pp. 33-44, 2018.
24- J. Cohen, "A coefficient of agreement for nominal scales," in Educational and Psychological Measurement, pp. 37-46, 1960.
25- S. Siegel, "Nonparametric statistics for the behavioral sciences," in New York: McGraw-Hill, 1956.
26- H. Cramér, "Mathematical methods of statistics (PMS-9)," in Princeton University Press, 2016.
27- B. W. Matthews, "Comparison of the predicted and observed secondary structure of T4 phage lysozyme," in Biochimica et Biophysica Acta (BBA) - Protein Structure, pp. 442-451, 1975.
28- J. A. Swets, "The relative operating characteristic in psychology: a technique for isolating effects of response bias finds wide use in the study of perception and cognition," in Science, pp. 990-1000, 1973.
29- P. Jaccard, "Étude comparative de la distribution florale dans une portion des Alpes et des Jura," in Bulletin de la Société vaudoise des sciences naturelles, pp. 547-579, 1901.
30- T. M. Cover and J. A. Thomas, "Elements of information theory," in John Wiley & Sons, 2012.
31- E. S. Keeping, "Introduction to statistical inference," in Courier Corporation, 1995.
32- V. Sindhwani, P. Bhattacharya, and S. Rakshit, "Information theoretic feature crediting in multiclass support vector machines," in Proceedings of the 2001 SIAM International Conference on Data Mining, pp. 1-18, 2001.
33- M. Bekkar, H. K. Djemaa, and T. A. Alitouche, "Evaluation measures for models assessment over imbalanced data sets," in Journal of Information Engineering and Applications, 2013.
34- W. J. Youden, "Index for rating diagnostic tests," in Cancer, pp. 32-35, 1950.
35- S. Brin, R. Motwani, J. D. Ullman, and S. Tsur, "Dynamic itemset counting and implication rules for market basket data," in Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, pp. 255-264, 1997.
36- S. Raschka, "MLxtend: Providing machine learning and data science utilities and extensions to Python’s scientific computing stack," in Journal of Open Source Software, 2018.
37- J. R. Bray and J. T. Curtis, "An ordination of the upland forest communities of southern Wisconsin," in Ecological Monographs, 1957.
38- J. L. Fleiss, J. Cohen, and B. S. Everitt, "Large sample standard errors of kappa and weighted kappa," in Psychological Bulletin, p. 323, 1969.
39- M. Felkin, "Comparing classification results between n-ary and binary problems," in Quality Measures in Data Mining: Springer, pp. 277-301, 2007.
40- R. Ranawana and V. Palade, "Optimized Precision-A new measure for classifier performance evaluation," in 2006 IEEE International Conference on Evolutionary Computation, pp. 2254-2261, 2006.
41- V. García, R. A. Mollineda, and J. S. Sánchez, "Index of balanced accuracy: A performance measure for skewed class distributions," in Iberian Conference on Pattern Recognition and Image Analysis, pp. 441-448, 2009.
42- P. Branco, L. Torgo, and R. P. Ribeiro, "A survey of predictive modeling on imbalanced domains," in ACM Computing Surveys (CSUR), p. 31, 2016.
43- K. Pearson, "Notes on Regression and Inheritance in the Case of Two Parents," in Proceedings of the Royal Society of London, pp. 240-242, 1895.
44- W. J. Conover, "Practical Nonparametric Statistics," in John Wiley and Sons, New York, 1999.
45- G. U. Yule, "On the methods of measuring association between two attributes," in Journal of the Royal Statistical Society, pp. 579-652, 1912.
46- R. Batuwita and V. Palade, "A new performance measure for class imbalance learning. Application to bioinformatics problems," in Machine Learning and Applications, pp. 545-550, 2009.
47- D. K. Lee, "Alternatives to P value: confidence interval and effect size," Korean journal of anesthesiology, vol. 69, no. 6, p. 555, 2016.
48- M. A. Raslich, R. J. Markert, and S. A. Stutes, "Selecting and interpreting diagnostic tests," Biochemia Medica, vol. 17, no. 2, pp. 151-161, 2007.
49- D. E. Hinkle, W. Wiersma, and S. G. Jurs, "Applied statistics for the behavioral sciences," 1988.
50- A. Maratea, A. Petrosino, and M. Manzo, "Adjusted F-measure and kernel scaling for imbalanced data learning," Information Sciences, vol. 257, pp. 331-341, 2014.
51- L. Mosley, "A balanced approach to the multi-class imbalance problem," 2013.
52- M. Vijaymeena and K. Kavitha, "A survey on similarity measures in text mining," Machine Learning and Applications: An International Journal, vol. 3, no. 2, pp. 19-28, 2016.
53- Y. Otsuka, "The faunal character of the Japanese Pleistocene marine Mollusca, as evidence of climate having become colder during the Pleistocene in Japan," Biogeograph. Soc. Japan, vol. 6, pp. 165-170, 1936.
54- A. Tversky, "Features of similarity," Psychological review, vol. 84, no. 4, p. 327, 1977.
55- K. Boyd, K. H. Eng, and C. D. Page, "Area under the precision-recall curve: point estimates and confidence intervals," in Joint European conference on machine learning and knowledge discovery in databases, 2013, pp. 451-466: Springer.
56- J. Davis and M. Goadrich, "The relationship between Precision-Recall and ROC curves," in Proceedings of the 23rd international conference on Machine learning, 2006, pp. 233-240: ACM.
57- M. Kuhn, "Building predictive models in R using the caret package," Journal of statistical software, vol. 28, no. 5, pp. 1-26, 2008.
58- V. Labatut and H. Cherifi, "Accuracy measures for the comparison of classifiers," arXiv preprint, 2012.
59- S. Wallis, "Binomial confidence intervals and contingency tests: mathematical fundamentals and the evaluation of alternative methods," Journal of Quantitative Linguistics, vol. 20, no. 3, pp. 178-208, 2013.
60- D. Altman, D. Machin, T. Bryant, and M. Gardner, Statistics with confidence: confidence intervals and statistical guidelines. John Wiley & Sons, 2013.
61- J. A. Hanley and B. J. McNeil, "The meaning and use of the area under a receiver operating characteristic (ROC) curve," Radiology, vol. 143, no. 1, pp. 29-36, 1982.
62- E. B. Wilson, "Probable inference, the law of succession, and statistical inference," Journal of the American Statistical Association, vol. 22, no. 158, pp. 209-212, 1927.
63- A. Agresti and B. A. Coull, "Approximate is better than “exact” for interval estimation of binomial proportions," The American Statistician, vol. 52, no. 2, pp. 119-126, 1998.
64- C. S. Peirce, "The numerical measure of the success of predictions," Science, no. 93, pp. 453-454, 1884.
65- E. W. Steyerberg, B. Van Calster, and M. J. Pencina, "Performance measures for prediction models and markers: evaluation of predictions and classifications," Revista Española de Cardiología, vol. 64, no. 9, pp. 788-794, 2011.
66- A. J. Vickers and E. B. Elkin, "Decision curve analysis: a novel method for evaluating prediction models," Medical Decision Making, vol. 26, no. 6, pp. 565-574, 2006.
67- D. Knoke, G. W. Bohrnstedt, and A. P. Mee, Statistics for social data analysis. FE Peacock Publishers, Itasca, IL, 2002.
68- W. M. Rand, "Objective criteria for the evaluation of clustering methods," Journal of the American Statistical association, vol. 66, no. 336, pp. 846-850, 1971.
69- J. M. Santos and M. Embrechts, "On the use of the adjusted rand index as a metric for evaluating supervised classification," in International conference on artificial neural networks, 2009: Springer, pp. 175-184.
70- J. Cohen, "Weighted kappa: nominal scale agreement provision for scaled disagreement or partial credit," Psychological bulletin, vol. 70, no. 4, p. 213, 1968.
71- R. Bakeman and J. M. Gottman, Observing interaction: An introduction to sequential analysis. Cambridge university press, 1997.
72- S. Bangdiwala, "A graphical test for observer agreement," in 45th International Statistical Institute Meeting, 1985, vol. 1985, pp. 307-308.
73- K. Bangdiwala and H. Bryan, "Using SAS software graphical procedures for the observer agreement chart," in Proceedings of the SAS Users Group International Conference, 1987, vol. 12, pp. 1083-1088.
74- A. F. Hayes and K. Krippendorff, "Answering the call for a standard reliability measure for coding data," Communication methods and measures, vol. 1, no. 1, pp. 77-89, 2007.
75- M. Aickin, "Maximum likelihood estimation of agreement in the constant predictive probability model, and its relation to Cohen's kappa," Biometrics, pp. 293-302, 1990.