Semi-automated classification for multi-label open-ended questions
Section 3. Multi-label classification

Consider a set of possible output labels $\mathcal{L} = \{1, 2, \ldots, L\}$. In multi-label classification, each instance with a feature vector $\mathbf{x} \in \mathbb{R}^d$ is associated with a subset of these labels. Equivalently, the subset can be described as $Y = (y_1, y_2, \ldots, y_L)$, where $y_i = 1$ if label $i$ is associated with the instance, and $y_i = 0$ otherwise. A multi-label classifier $h$ learns from training data to predict $h(\mathbf{x}) = \hat{Y} = (\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_L)$ for a given $\mathbf{x}$.

Next, we review some common multi-label algorithms and their relationship to an evaluation criterion, subset accuracy.

3.1  Evaluating multi-label algorithms in semi-automated classification

Evaluating the classification of a text answer into a single label is straightforward: the label is either correct or not, and accuracy refers to the percentage of correctly classified answers; equivalently, error refers to the percentage of misclassified answers. For answers that are classified into multiple labels, there are several ways to combine the accuracy of each single label into an overall evaluation measure for the set of multiple labels. These evaluation measures include subset accuracy, Hamming loss, F-measure and log loss. For a predicted set of multiple labels, subset accuracy is 1 if all of the $L$ labels are correctly predicted and 0 otherwise. Hamming loss evaluates the fraction of misclassified labels. F-measure is the harmonic mean of precision and recall, and log loss evaluates the uncertainty of the prediction averaged over the labels when a probability score for each label is given.
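As a concrete illustration of the first two measures, the following minimal NumPy sketch computes subset accuracy and Hamming loss for a toy prediction matrix (the arrays and function names are illustrative, not from the paper):

```python
import numpy as np

def subset_accuracy(Y_true, Y_pred):
    """Fraction of instances whose entire label vector is predicted correctly."""
    return float(np.mean(np.all(Y_true == Y_pred, axis=1)))

def hamming_loss(Y_true, Y_pred):
    """Fraction of individual labels that are misclassified."""
    return float(np.mean(Y_true != Y_pred))

# Toy example: 3 instances, L = 3 labels.
Y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
Y_pred = np.array([[1, 0, 1],   # fully correct
                   [0, 1, 1],   # one label wrong
                   [1, 1, 0]])  # fully correct

print(subset_accuracy(Y_true, Y_pred))  # 2 of 3 rows fully correct -> 0.666...
print(hamming_loss(Y_true, Y_pred))     # 1 of 9 labels wrong -> 0.111...
```

Note how one wrong label in the second row drops that instance's subset-accuracy contribution to zero, while Hamming loss penalizes it by only $1/L$.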

In this paper we develop a methodology for subset accuracy (equivalently, in terms of loss, 0/1 loss). This is a strict metric because a zero score is given even if all labels but one are correctly classified. However, subset accuracy is appropriate for semi-automated classification because if an algorithm has difficulty classifying even a single label, the entire observation needs to be manually classified. That is, automated classification should be conducted only if the model is highly confident in the entire predicted label set.

Because subset accuracy requires that all labels are simultaneously correctly classified, we are interested in finding the label set $Y^*$ that maximizes the joint probability conditional on a text answer $\mathbf{x}$:

$$Y^* \;=\; \operatorname*{argmax}_{Y} P(Y \mid \mathbf{x}) \;=\; \operatorname*{argmax}_{Y} P(y_1, \ldots, y_L \mid \mathbf{x}).$$

In the next section we discuss common approaches to estimating the joint probability proposed in the machine learning community.

3.2  Multi-label approaches that optimize subset accuracy

Various approaches have been proposed for predicting multi-label outcomes. Since we use subset accuracy as the evaluation measure, we focus on methods that aim to maximize the joint conditional distribution.

The simplest approach, called Binary Relevance (BR), transforms a multi-label problem into separate binary problems. That is, BR constructs a binary classification model for each label independently. For an unseen observation, the predicted set of labels is obtained simply by combining the individual binary results. In other words, the predicted label set is the union of the results predicted from the $L$ binary models. If each of the binary models produces probability outcomes, BR can produce an estimate for $P(y_1 \mid \mathbf{x})\, P(y_2 \mid \mathbf{x}) \cdots P(y_L \mid \mathbf{x})$. Note that this coincides with the joint probability $P(y_1, \ldots, y_L \mid \mathbf{x})$ if the labels are independent (conditional on $\mathbf{x}$). This implies that the product of the probabilities obtained by BR will estimate $P(y_1, \ldots, y_L \mid \mathbf{x})$ accurately only if the labels are conditionally independent. The joint probability may be inaccurate if the labels are substantially correlated given $\mathbf{x}$.
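Under BR's conditional-independence assumption, the joint probability of any label set is just a product of marginals. A minimal sketch (the marginal probabilities below are made-up numbers standing in for the outputs of $L$ fitted binary models):

```python
import numpy as np

def br_joint_probability(p_marginal, y):
    """Joint probability of label set y under BR's conditional-independence
    assumption: the product over labels of P(y_i | x) or 1 - P(y_i | x)."""
    p = np.asarray(p_marginal, dtype=float)
    y = np.asarray(y)
    return float(np.prod(np.where(y == 1, p, 1.0 - p)))

# Hypothetical marginals P(y_i = 1 | x) from three independent binary models.
p = [0.9, 0.2, 0.7]

# Probability BR assigns to the label set (1, 0, 1):
print(br_joint_probability(p, [1, 0, 1]))  # 0.9 * 0.8 * 0.7 = 0.504
```

If the labels are in fact correlated given $\mathbf{x}$, this product can be far from the true joint probability, which is the weakness discussed above.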

Another approach tailored for subset accuracy is Label Powerset learning (LP). This approach transforms a multi-label classification into a multi-class (i.e., multinomial) problem by treating each unique label set $Y$ that exists in the training data as a single class. For example, when $L = 3$ there could be up to $2^3$ classes $c_i$ $(i = 1, \ldots, 8)$ observed in the training data. Then any algorithm for multi-class problems can be applied using the transformed $c_i$ classes. Training a multi-class classifier takes dependencies between labels into consideration. For a new observation, LP predicts the most probable class (i.e., the most probable label set). If an algorithm for multi-class data gives probabilistic outputs (some algorithms classify without computing probabilities), LP directly estimates the class probabilities (i.e., the joint probability $P(Y \mid \mathbf{x})$). However, this approach cannot estimate the joint probability for any label set unseen in the training data. As a consequence, if the true label set of a new observation did not appear in the training data, the prediction cannot be correct. Another drawback of LP is that the number of classes in the transformed problem can increase exponentially (up to $2^L$ classes). This can be problematic when $L$ is large, since each combination of labels may be present in just one or a few observations in the training data, which makes the learning process difficult.
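The LP transformation itself is a simple dictionary mapping from observed label sets to class ids; a minimal sketch (the training data is a toy example):

```python
def label_powerset_transform(Y):
    """Map each distinct label set observed in the training data to a
    single class id, as in Label Powerset learning."""
    classes = {}      # label set (tuple) -> class id
    transformed = []  # per-instance class ids
    for y in Y:
        key = tuple(y)
        if key not in classes:
            classes[key] = len(classes)
        transformed.append(classes[key])
    return transformed, classes

Y_train = [(1, 0, 1), (0, 1, 0), (1, 0, 1), (1, 1, 1)]
c, classes = label_powerset_transform(Y_train)
print(c)             # class ids: [0, 1, 0, 2]
print(len(classes))  # only 3 of the 2^3 possible label sets were observed
```

Any multi-class learner can then be trained on `c`; note that a label set absent from `classes`, such as `(0, 0, 1)`, can never be predicted.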

A third approach to multi-label learning is Classifier Chains (CC) (Read, Pfahringer, Holmes and Frank, 2009, 2011). As in binary relevance, CC fits a binary model for each label. However, CC fits the binary models sequentially and uses the binary label results obtained from previous models as additional predictors in subsequent models. That is, the model for the $i^{\text{th}}$ label $y_i$ uses $\mathbf{x}$ and $y_1, \ldots, y_{i-1}$ as features. (For example, the model for $y_1$ uses $\mathbf{x}$ as features, the model for $y_2$ uses $\mathbf{x}$ and $y_1$ as features, and so on.) Passing label information between binary classifiers allows CC to take label dependencies into account. In the prediction stage, CC successively predicts the labels one at a time. The prediction results of the previous labels are used for predicting the next label in the chain.
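The prediction stage of the chain can be sketched as follows. The per-label models here are stand-ins with fixed, made-up weights (any binary classifier with a `predict_proba` would do); the point is how each model's feature vector is augmented with the labels already predicted:

```python
import numpy as np

class ChainLink:
    """Toy per-label model: a logistic score from a fixed weight vector.
    Stands in for any fitted binary classifier (weights are illustrative)."""
    def __init__(self, w):
        self.w = np.asarray(w, dtype=float)

    def predict_proba(self, features):
        z = float(np.dot(self.w, features))
        return 1.0 / (1.0 + np.exp(-z))  # P(y_j = 1 | features)

def cc_predict(models, x):
    """Classifier Chains prediction: the model for label j sees x plus the
    hard predictions already made for labels 1..j-1."""
    x = list(x)
    y_hat = []
    for model in models:
        features = np.array(x + y_hat)       # augment x with previous labels
        p = model.predict_proba(features)
        y_hat.append(1 if p >= 0.5 else 0)   # hard prediction passed down chain
    return y_hat

# Chain of 3 models over a 2-dimensional x; weight lengths grow by one per link.
models = [ChainLink([2.0, -1.0]),
          ChainLink([0.5, 0.5, 3.0]),        # last weight acts on y1
          ChainLink([-1.0, 0.0, 1.0, 1.0])]  # last two act on y1, y2
print(cc_predict(models, [1.0, 0.2]))  # -> [1, 1, 1]
```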

This idea is extended to Probabilistic Classifier Chains (PCC) (Dembczyński et al., 2010), which gives CC an explicit probabilistic interpretation. Specifically, the conditional joint distribution can be written as

$$P(y_1, \ldots, y_L \mid \mathbf{x}) \;=\; P(y_1 \mid \mathbf{x}) \prod_{j=2}^{L} P(y_j \mid y_1, \ldots, y_{j-1}, \mathbf{x}) \qquad (3.1)$$

and PCC estimates the probabilities $P(y_1 \mid \mathbf{x}),\, P(y_2 \mid \mathbf{x}, y_1),\, \ldots,\, P(y_L \mid \mathbf{x}, y_1, y_2, \ldots, y_{L-1})$.

PCC finds the label set that maximizes the right-hand side of equation (3.1). However, there is no closed-form solution for finding this label set. A few different solutions have been suggested. Dembczyński et al. (2010) used an exhaustive search (ES) that considers all possible combinations. However, an exhaustive search may not be practical when $L$ is large, because the number of possible combinations $(2^L)$ increases exponentially. To overcome this problem, optimization strategies based on the uniform cost search (UCS) (Dembczyński, Waegeman and Hüllermeier, 2012) and the $A^*$ algorithm (Mena, Montañés, Quevedo and Del Coz, 2015) have been proposed. First, the estimated joint conditional probability may be represented by a binary probability tree. Then a search algorithm finds the optimal path (in our case, the path that gives the highest joint probability) from the root to a terminal node. Compared with ES, UCS substantially reduces the computational cost for PCC to reach the label set with the highest joint probability (Dembczyński et al., 2012).
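The exhaustive search over equation (3.1) can be sketched directly: score every $Y \in \{0,1\}^L$ by the chain product and keep the maximizer. The conditional probabilities below are hypothetical numbers standing in for the fitted chain models:

```python
from itertools import product

def pcc_exhaustive_search(cond_prob, L):
    """Exhaustive search (ES) for PCC: score every label set Y in {0,1}^L by
    P(y1|x) * prod_j P(y_j | y_1..y_{j-1}, x) and return the best set.
    cond_prob(j, prefix) returns P(y_j = 1 | y_1..y_{j-1}, x)."""
    best_Y, best_p = None, -1.0
    for Y in product([0, 1], repeat=L):
        p = 1.0
        for j, y in enumerate(Y):
            p1 = cond_prob(j, Y[:j])
            p *= p1 if y == 1 else 1.0 - p1
        if p > best_p:
            best_Y, best_p = Y, p
    return best_Y, best_p

# Hypothetical conditionals for L = 2 where y2 depends strongly on y1.
def cond_prob(j, prefix):
    if j == 0:
        return 0.6                            # P(y1 = 1 | x)
    return 0.9 if prefix[0] == 1 else 0.1     # P(y2 = 1 | y1, x)

print(pcc_exhaustive_search(cond_prob, 2))  # -> ((1, 1), 0.54)
```

The loop visits all $2^L$ combinations, which is exactly why ES becomes impractical for large $L$ and motivates UCS and $A^*$.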

In theory, when applying the product rule, the order of the categories $y_1, \ldots, y_L$ does not matter. For example, both $P(y_1 \mid \mathbf{x})\, P(y_2 \mid y_1, \mathbf{x})$ and $P(y_2 \mid \mathbf{x})\, P(y_1 \mid y_2, \mathbf{x})$ equal $P(y_1, y_2 \mid \mathbf{x})$. In practice, however, the two chains may lead to different estimates. This means the performance of PCC may be affected by the order of the labels in the chain.

To alleviate the influence of the category order, an ensembling approach (EPCC) (Dembczyński et al., 2010) that combines multiple probabilistic chains has been proposed. First, $m$ PCC models are trained, where each PCC model is based on a randomized order of the labels. In the prediction stage, the average conditional joint probability over the $m$ PCC models is computed for each possible label set. Then the predicted label set is the label set with the highest average predicted probability. Let $\hat{P}_j(Y \mid \mathbf{x})$ be the conditional joint probability estimated by the $j^{\text{th}}$ PCC model. The ensemble strategy predicts the label set $\hat{Y}$ such that

$$\hat{Y} \;=\; \operatorname*{argmax}_{Y} \frac{\sum_{j=1}^{m} \hat{P}_j(Y \mid \mathbf{x})}{m}.$$

Note that EPCC does not combine the predicted label sets but the conditional joint probabilities. To find the highest average probability from $m$ PCC models, all individual probabilities are required, and this forces us to use ES to compute the conditional joint probability for all $2^L$ label combinations from all $m$ PCC models. Hence, although EPCC reduces the problem of influence of label order, the method will not be useful if the problem deals with a large number of labels or when $m$ is large. To reduce the computational cost of combining multiple PCC models, we propose a new approach to ensembling the PCC models in the next section.
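The EPCC averaging step can be sketched as follows; the two per-model probability tables are made-up numbers standing in for the joint probabilities that two PCC models (with different chain orders) would assign to every label set:

```python
from itertools import product

def epcc_predict(joint_probs, L):
    """EPCC prediction: average the estimated joint probability of each of the
    2^L label sets over m PCC models, then return the argmax label set.
    joint_probs is a list of m dicts mapping label set (tuple) -> probability."""
    m = len(joint_probs)
    best_Y, best_avg = None, -1.0
    for Y in product([0, 1], repeat=L):
        avg = sum(P.get(Y, 0.0) for P in joint_probs) / m
        if avg > best_avg:
            best_Y, best_avg = Y, avg
    return best_Y, best_avg

# Two hypothetical PCC models over L = 2 labels.
P1 = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}
P2 = {(0, 0): 0.2, (0, 1): 0.4, (1, 0): 0.1, (1, 1): 0.3}
print(epcc_predict([P1, P2], 2))  # (1, 1) wins with average 0.35
```

Every entry of every table is touched, which illustrates why the cost grows with both $2^L$ and $m$.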

