Section 4. The majority-vote-based ensemble of PCC for semi-automated classification
The proposed method aims to ensemble multiple PCC models at a much lower computational cost. As mentioned in Section 3.2, the best label set (the one with the highest joint probability) for a single PCC can be found by a fast search strategy. In this paper, we use UCS, since the implementation is simple and the algorithm always finds the optimal solution. Using UCS, the proposed method obtains from each of the $B$ PCC models $h^{(1)}, \ldots, h^{(B)}$ the predicted label set $\hat{\mathbf{y}}^{(b)}$ and the estimated probability $\hat{p}^{(b)} = \hat{P}^{(b)}(\hat{\mathbf{y}}^{(b)} \mid \mathbf{x})$ that it is the true label set. Among the predicted label sets, the proposed method chooses the most frequent one as the final prediction, that is, $\hat{\mathbf{y}} = \operatorname{mode}(\hat{\mathbf{y}}^{(1)}, \ldots, \hat{\mathbf{y}}^{(B)})$. In case there are ties in the mode, we choose the label set whose averaged probability estimate is the highest.
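To make the UCS step concrete, the following is a minimal sketch (not the paper's implementation) of finding a single chain's most probable label set. The function `cond_prob`, a stand-in for the chain's fitted per-label classifiers, and all other names are illustrative assumptions. UCS expands partial label prefixes in order of increasing cost $-\log \hat{P}(\text{prefix} \mid \mathbf{x})$, so the first complete label set popped from the queue maximizes the joint probability.

```python
# Minimal UCS sketch for one PCC; `cond_prob` is a hypothetical stand-in
# for the chain's fitted per-label classifiers (not from the paper).
import heapq
from math import log, exp

def ucs_best_label_set(cond_prob, x, n_labels):
    """Return (best label tuple, its estimated joint probability).

    cond_prob(x, prefix, j) must return P(y_j = 1 | x, y_1..y_{j-1} = prefix).
    Costs -log p are non-negative, so the first complete prefix popped
    from the priority queue has maximal joint probability.
    """
    frontier = [(0.0, ())]                    # (cost, partial label set)
    while frontier:
        cost, prefix = heapq.heappop(frontier)
        if len(prefix) == n_labels:           # complete label set: optimal
            return prefix, exp(-cost)
        j = len(prefix)
        p1 = cond_prob(x, prefix, j)          # P(y_j = 1 | x, prefix)
        for bit, p in ((1, p1), (0, 1.0 - p1)):
            if p > 0.0:                       # skip zero-probability branches
                heapq.heappush(frontier, (cost - log(p), prefix + (bit,)))
    raise ValueError("no complete label set found")
```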
Semi-automated classification requires a score that measures how easy or hard the prediction is; whether a text answer is classified automatically or manually is determined based on this score. We propose the following score. Let $M$ be the set that contains all indices $b$ for which $\hat{\mathbf{y}}^{(b)}$ is the most frequent label set, i.e., $M = \{\, b : \hat{\mathbf{y}}^{(b)} = \hat{\mathbf{y}} \,\}$. The proposed score for the prediction is

$$ s(\mathbf{x}) = \left( \frac{1}{|M|} \sum_{b \in M} \hat{p}^{(b)} \right) \cdot \frac{|M|}{B}. \tag{4.1} $$
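Since the $|M|$ factors cancel, (4.1) can equivalently be computed as a single sum over the voting chains:

$$ s(\mathbf{x}) = \frac{1}{|M|} \sum_{b \in M} \hat{p}^{(b)} \cdot \frac{|M|}{B} = \frac{1}{B} \sum_{b \in M} \hat{p}^{(b)}. $$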
The first factor of equation (4.1) is the average joint probability of the predicted label set. The second factor of equation (4.1) is the fraction of the $B$ PCC models that predict this label set. Multiplying the two factors makes sense: a prediction tends to be more accurate if the (average) probability of the chosen label set is high (the first factor) and more individual chain models vote for the same label set (the second factor). We call this approach Majority-vote-based Ensemble of Probabilistic Classifier Chains (MEPCC). We later show empirically that combining the two factors indeed improves performance over using either factor alone.

Table 4.1 illustrates an example for 5 labels and 7 PCC models. The MEPCC approach stores the probability of one label set from each PCC model. Because MEPCC combines the probabilities corresponding to the best label set from the different PCC models, it can take advantage of the UCS (or any other) search strategy. Note that a search strategy like UCS cannot be used for EPCC, where all individual probabilities for all label combinations are required. More succinctly, MEPCC combines the maximal probabilities of the individual PCCs, whereas EPCC maximizes over the averaged probabilities, which requires evaluating all individual probabilities. We summarize the procedure of MEPCC in Algorithm 1.
Table 4.1. Example of MEPCC for 5 labels and $B = 7$ PCC models: each chain's predicted label set $\hat{\mathbf{y}}^{(b)}$ and its estimated probability $\hat{p}^{(b)}$.

| PCC model $b$ | $\hat{y}_1$ | $\hat{y}_2$ | $\hat{y}_3$ | $\hat{y}_4$ | $\hat{y}_5$ | $\hat{p}^{(b)}$ |
|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 0 | 0 | 1 | 0.875 |
| 2 | 1 | 1 | 0 | 0 | 1 | 0.921 |
| 3 | 0 | 0 | 1 | 1 | 0 | 0.743 |
| 4 | 0 | 0 | 0 | 1 | 0 | 0.882 |
| 5 | 0 | 0 | 0 | 1 | 0 | 0.643 |
| 6 | 0 | 1 | 0 | 1 | 0 | 0.739 |
| 7 | 1 | 1 | 0 | 0 | 1 | 0.824 |
| final prediction | 1 | 1 | 0 | 0 | 1 | |
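As a quick check, the following sketch reproduces the combination on the Table 4.1 values (the label sets and probabilities are taken from the table; the variable names are ours). Three of the seven chains predict $(1,1,0,0,1)$, so the score is $\frac{0.875 + 0.921 + 0.824}{3} \cdot \frac{3}{7} \approx 0.374$.

```python
# Recomputing the MEPCC prediction and score from the Table 4.1 values.
from collections import Counter

preds = [
    ((1, 1, 0, 0, 1), 0.875), ((1, 1, 0, 0, 1), 0.921),
    ((0, 0, 1, 1, 0), 0.743), ((0, 0, 0, 1, 0), 0.882),
    ((0, 0, 0, 1, 0), 0.643), ((0, 1, 0, 1, 0), 0.739),
    ((1, 1, 0, 0, 1), 0.824),
]

# The mode is unique here, so the tie-breaking rule is not needed.
y_hat, n_votes = Counter(y for y, _ in preds).most_common(1)[0]
probs = [p for y, p in preds if y == y_hat]        # voters' probabilities
score = (sum(probs) / len(probs)) * (n_votes / len(preds))
print(y_hat, round(score, 3))                      # (1, 1, 0, 0, 1) 0.374
```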
Algorithm 1. The MEPCC algorithm

Input: number of models $B$, an instance vector $\mathbf{x}$, corresponding PCC models $h^{(1)}, \ldots, h^{(B)}$, the uniform cost search algorithm (UCS)

for $b = 1$ to $B$ do
  (a) Using $h^{(b)}$ and UCS, obtain $\hat{\mathbf{y}}^{(b)} = \arg\max_{\mathbf{y}} \hat{P}^{(b)}(\mathbf{y} \mid \mathbf{x})$
  (b) Store $\hat{p}^{(b)} = \hat{P}^{(b)}(\hat{\mathbf{y}}^{(b)} \mid \mathbf{x})$
end for
Obtain the label set $\hat{\mathbf{y}} = \operatorname{mode}(\hat{\mathbf{y}}^{(1)}, \ldots, \hat{\mathbf{y}}^{(B)})$
Obtain $M = \{\, b : \hat{\mathbf{y}}^{(b)} = \hat{\mathbf{y}} \,\}$
Obtain the score $s = \left( \frac{1}{|M|} \sum_{b \in M} \hat{p}^{(b)} \right) \cdot \frac{|M|}{B}$
Return $\hat{\mathbf{y}}$ and $s$
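Putting the pieces together, a sketch of Algorithm 1 might look as follows, reusing the hypothetical `ucs_best_label_set` from above. The tie-breaking rule (highest average probability among tied modes) follows the text; everything else mirrors the algorithm's steps.

```python
# Sketch of Algorithm 1; `models` is a list of B cond_prob functions, one
# per PCC, compatible with the hypothetical ucs_best_label_set above.
from collections import defaultdict

def mepcc_predict(models, x, n_labels):
    # Step (a)-(b): best label set and its probability from each chain.
    votes = defaultdict(list)            # label set -> voters' probabilities
    for cond_prob in models:
        y_b, p_b = ucs_best_label_set(cond_prob, x, n_labels)
        votes[y_b].append(p_b)

    # Mode of the B predictions; ties broken by average probability.
    top = max(len(ps) for ps in votes.values())
    tied = [y for y, ps in votes.items() if len(ps) == top]
    y_hat = max(tied, key=lambda y: sum(votes[y]) / len(votes[y]))

    # Score (4.1): average voter probability times the fraction of voters.
    probs = votes[y_hat]
    score = (sum(probs) / len(probs)) * (len(probs) / len(models))
    return y_hat, score
```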