Section 4. The majority-vote-based ensemble of PCC for semi-automated classification
The proposed method aims to ensemble multiple PCC models at a much lower computational cost. As mentioned in Section 3.2, the best label set (the one with the highest joint probability) for a single PCC can be found by a fast search strategy. In this paper, we use UCS, since the implementation is simple and the algorithm always finds the optimal solution. Using UCS, the proposed method obtains from each of the $B$ PCC models $h^{(1)}, \ldots, h^{(B)}$ the predicted label set $\hat{\mathbf{y}}^{(b)}$ and the estimated probability $\hat{p}^{(b)} = \hat{P}^{(b)}(\hat{\mathbf{y}}^{(b)} \mid \mathbf{x})$ that it is the true label set. Among the predicted label sets, the proposed method chooses the most frequent one as the final prediction, that is, $\hat{\mathbf{y}} = \operatorname{mode}(\hat{\mathbf{y}}^{(1)}, \ldots, \hat{\mathbf{y}}^{(B)})$. In case there are ties in the mode, we choose the label set whose averaged probability estimate is the highest.
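To make the UCS step concrete, the following is a minimal sketch (not the paper's implementation) of finding a single chain's most probable label set. The function `cond_prob`, a stand-in for the chain's fitted per-label classifiers, and all other names are illustrative assumptions. UCS expands partial label prefixes in order of increasing cost $-\log \hat{P}(\text{prefix} \mid \mathbf{x})$, so the first complete label set popped from the queue maximizes the joint probability.

```python
# Minimal UCS sketch for one PCC; `cond_prob` is a hypothetical stand-in
# for the chain's fitted per-label classifiers (not from the paper).
import heapq
from math import log, exp

def ucs_best_label_set(cond_prob, x, n_labels):
    """Return (best label tuple, its estimated joint probability).

    cond_prob(x, prefix, j) must return P(y_j = 1 | x, y_1..y_{j-1} = prefix).
    Costs -log p are non-negative, so the first complete prefix popped
    from the priority queue has maximal joint probability.
    """
    frontier = [(0.0, ())]                    # (cost, partial label set)
    while frontier:
        cost, prefix = heapq.heappop(frontier)
        if len(prefix) == n_labels:           # complete label set: optimal
            return prefix, exp(-cost)
        j = len(prefix)
        p1 = cond_prob(x, prefix, j)          # P(y_j = 1 | x, prefix)
        for bit, p in ((1, p1), (0, 1.0 - p1)):
            if p > 0.0:                       # skip zero-probability branches
                heapq.heappush(frontier, (cost - log(p), prefix + (bit,)))
    raise ValueError("no complete label set found")
```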
Semi-automated classification requires a score that measures how easy or hard the prediction is; whether a text answer is classified automatically or manually is determined based on this score. We propose the following score. Let $M$ be the set that contains all indices $b$ for which $\hat{\mathbf{y}}^{(b)}$ is the most frequent label set, i.e., $M = \{\, b : \hat{\mathbf{y}}^{(b)} = \hat{\mathbf{y}} \,\}$. The proposed score for the prediction is

$$ s(\mathbf{x}) = \left( \frac{1}{|M|} \sum_{b \in M} \hat{p}^{(b)} \right) \cdot \frac{|M|}{B}. \tag{4.1} $$
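Since the $|M|$ factors cancel, (4.1) can equivalently be computed as a single sum over the voting chains:

$$ s(\mathbf{x}) = \frac{1}{|M|} \sum_{b \in M} \hat{p}^{(b)} \cdot \frac{|M|}{B} = \frac{1}{B} \sum_{b \in M} \hat{p}^{(b)}. $$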
The first factor of equation (4.1) is the average joint probability of the predicted label set. The second factor of equation (4.1) is the fraction of the $B$ PCC models that predict this label set. Multiplying the two factors makes sense: a prediction tends to be more accurate if the (average) probability of the chosen label set is high (the first factor) and more individual chain models vote for the same label set (the second factor). We call this approach Majority-vote-based Ensemble of Probabilistic Classifier Chains (MEPCC). We later show empirically that combining the two factors indeed improves performance over using either factor alone.

Table 4.1 illustrates an example for 5 labels and 7 PCC models. The MEPCC approach stores the probability of one label set from each PCC model. Because MEPCC combines the probabilities corresponding to the best label set from the different PCC models, it can take advantage of the UCS (or any other) search strategy. Note that a search strategy like UCS cannot be used for EPCC, where all individual probabilities for all label combinations are required. More succinctly, MEPCC combines the maximal probabilities of the individual PCCs, whereas EPCC maximizes over the averaged probabilities, which requires evaluating all individual probabilities. We summarize the procedure of MEPCC in Algorithm 1.
Table 4.1. Example of MEPCC for 5 labels and $B = 7$ PCC models: each chain's predicted label set $\hat{\mathbf{y}}^{(b)}$ and its estimated probability $\hat{p}^{(b)}$.

| PCC model $b$ | $\hat{y}_1$ | $\hat{y}_2$ | $\hat{y}_3$ | $\hat{y}_4$ | $\hat{y}_5$ | $\hat{p}^{(b)}$ |
|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 0 | 0 | 1 | 0.875 |
| 2 | 1 | 1 | 0 | 0 | 1 | 0.921 |
| 3 | 0 | 0 | 1 | 1 | 0 | 0.743 |
| 4 | 0 | 0 | 0 | 1 | 0 | 0.882 |
| 5 | 0 | 0 | 0 | 1 | 0 | 0.643 |
| 6 | 0 | 1 | 0 | 1 | 0 | 0.739 |
| 7 | 1 | 1 | 0 | 0 | 1 | 0.824 |
| final prediction | 1 | 1 | 0 | 0 | 1 | |
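As a quick check, the following sketch reproduces the combination on the Table 4.1 values (the label sets and probabilities are taken from the table; the variable names are ours). Three of the seven chains predict $(1,1,0,0,1)$, so the score is $\frac{0.875 + 0.921 + 0.824}{3} \cdot \frac{3}{7} \approx 0.374$.

```python
# Recomputing the MEPCC prediction and score from the Table 4.1 values.
from collections import Counter

preds = [
    ((1, 1, 0, 0, 1), 0.875), ((1, 1, 0, 0, 1), 0.921),
    ((0, 0, 1, 1, 0), 0.743), ((0, 0, 0, 1, 0), 0.882),
    ((0, 0, 0, 1, 0), 0.643), ((0, 1, 0, 1, 0), 0.739),
    ((1, 1, 0, 0, 1), 0.824),
]

# The mode is unique here, so the tie-breaking rule is not needed.
y_hat, n_votes = Counter(y for y, _ in preds).most_common(1)[0]
probs = [p for y, p in preds if y == y_hat]        # voters' probabilities
score = (sum(probs) / len(probs)) * (n_votes / len(preds))
print(y_hat, round(score, 3))                      # (1, 1, 0, 0, 1) 0.374
```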
Algorithm 1. The MEPCC algorithm

Input: number of models $B$, an instance vector $\mathbf{x}$, corresponding PCC models $h^{(1)}, \ldots, h^{(B)}$, the uniform cost search algorithm (UCS)

for $b = 1$ to $B$ do
  (a) Using $h^{(b)}$ and UCS, obtain $\hat{\mathbf{y}}^{(b)} = \arg\max_{\mathbf{y}} \hat{P}^{(b)}(\mathbf{y} \mid \mathbf{x})$
  (b) Store $\hat{p}^{(b)} = \hat{P}^{(b)}(\hat{\mathbf{y}}^{(b)} \mid \mathbf{x})$
end for
Obtain the label set $\hat{\mathbf{y}} = \operatorname{mode}(\hat{\mathbf{y}}^{(1)}, \ldots, \hat{\mathbf{y}}^{(B)})$
Obtain $M = \{\, b : \hat{\mathbf{y}}^{(b)} = \hat{\mathbf{y}} \,\}$
Obtain the score $s = \left( \frac{1}{|M|} \sum_{b \in M} \hat{p}^{(b)} \right) \cdot \frac{|M|}{B}$
Return $\hat{\mathbf{y}}$ and $s$
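Putting the pieces together, a sketch of Algorithm 1 might look as follows, reusing the hypothetical `ucs_best_label_set` from above. The tie-breaking rule (highest average probability among tied modes) follows the text; everything else mirrors the algorithm's steps.

```python
# Sketch of Algorithm 1; `models` is a list of B cond_prob functions, one
# per PCC, compatible with the hypothetical ucs_best_label_set above.
from collections import defaultdict

def mepcc_predict(models, x, n_labels):
    # Step (a)-(b): best label set and its probability from each chain.
    votes = defaultdict(list)            # label set -> voters' probabilities
    for cond_prob in models:
        y_b, p_b = ucs_best_label_set(cond_prob, x, n_labels)
        votes[y_b].append(p_b)

    # Mode of the B predictions; ties broken by average probability.
    top = max(len(ps) for ps in votes.values())
    tied = [y for y, ps in votes.items() if len(ps) == top]
    y_hat = max(tied, key=lambda y: sum(votes[y]) / len(votes[y]))

    # Score (4.1): average voter probability times the fraction of voters.
    probs = votes[y_hat]
    score = (sum(probs) / len(probs)) * (len(probs) / len(models))
    return y_hat, score
```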