Semi-automated classification for multi-label open-ended questions
Section 6. Discussion

Using three examples, we have investigated several approaches to automated classification at any desired production rate when the data are multi-labeled. In terms of subset accuracy and Hamming loss, the proposed method, MEPCC, achieved the best performance at most production rates in all three data sets.

There were trade-offs between prediction performance and production rate for all methods. At low production rates, high subset accuracy and low Hamming loss were achieved for a small number of easy-to-classify answers. However, accuracy tended to decrease, and loss to increase, as more difficult answers were included (i.e., as the production rate increased).

Either subset accuracy or production rate can be set as a target, which then determines the other measure. For example, targeting a minimum subset accuracy of 80% for automated prediction, MEPCC categorizes 39.3% of the Civil data, 42.5% of the Immigrant data, and 27.6% of the Happy data automatically. Such a reduction is considerable: in an applied research environment with a data set of 5,000 observations, reducing the need for manual coding by 50% may save several weeks of coding time. If instead the production rate is fixed at 80%, MEPCC could achieve a subset accuracy of 70% (Civil), 75% (Immigrant), and 68% (Happy).
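To make the thresholding mechanics concrete, the following minimal Python sketch shows one way to pick a confidence cutoff for a target production rate and to estimate subset accuracy on the automatically coded portion. The variable names (`scores`, a per-answer confidence such as the predicted probability of the most likely label set; `y_pred`; `y_true`) are illustrative and do not come from this paper.

```python
import numpy as np

def cutoff_for_production_rate(scores, production_rate):
    """Confidence cutoff so that roughly `production_rate` of the answers
    (those with the highest scores) are coded automatically."""
    return np.quantile(scores, 1.0 - production_rate)

def subset_accuracy_at_rate(scores, y_pred, y_true, production_rate):
    """Subset accuracy on the automatically coded portion: a prediction
    counts as correct only if the entire label set matches."""
    cutoff = cutoff_for_production_rate(scores, production_rate)
    auto = scores >= cutoff                          # answers coded automatically
    exact = np.all(y_pred[auto] == y_true[auto], axis=1)
    return exact.mean(), auto.mean()

# Toy example: 6 answers, 3 labels.
scores = np.array([0.95, 0.90, 0.80, 0.60, 0.55, 0.40])
y_true = np.array([[1, 0, 0], [1, 1, 0], [0, 1, 0], [0, 0, 1], [1, 0, 1], [0, 1, 1]])
y_pred = np.array([[1, 0, 0], [1, 1, 0], [0, 1, 0], [0, 1, 1], [1, 0, 1], [0, 0, 1]])
acc, rate = subset_accuracy_at_rate(scores, y_pred, y_true, production_rate=0.5)
print(acc, rate)  # the 50% easiest answers are all coded correctly in this toy example
```

The answers falling below the cutoff would then be routed to human coders.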

The Hamming loss represents the fraction of misclassified labels. Figure 5.4 shows that the improvement of MEPCC over BR was substantial at lower production rates but relatively small at a production rate of 100%.
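Written out, with $N$ answers, $L$ candidate labels, true label indicators $y_{ij}$ and predicted indicators $\hat{y}_{ij}$,

\[
\text{Hamming loss} \;=\; \frac{1}{NL}\sum_{i=1}^{N}\sum_{j=1}^{L}\mathbf{1}\{\hat{y}_{ij}\neq y_{ij}\},
\]

i.e., the fraction of individual label decisions that are wrong, whereas subset accuracy only credits an answer when all $L$ of its labels are predicted correctly.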

MEPCC outperformed PCC at most production rates on all three data sets. This shows that combining multiple PCC models substantially improves performance. As can be seen from Figure 5.2, even combining 5 models resulted in a substantial improvement throughout the whole range of production rates. The difference tended to be greater at lower production rates. This means MEPCC is even more attractive for semi-automated classification, where high accuracy matters more than a high production rate. The performance of MEPCC converged as m, the number of combined PCC models, increased in all three data sets. The differences between the MEPCC models were negligibly small when m was larger than 10. This is a desirable result in practice because employing too many PCC models in the ensemble is unnecessary.
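As a rough illustration of combining chain models, the sketch below trains several classifier chains with different random label orders using scikit-learn and averages their per-label probabilities, in the spirit of the ensembles of classifier chains of Read et al. (2011). It is not the MEPCC implementation used in this paper: logistic regression stands in for the SVM base learner, and the simple probability-averaging rule is our illustrative choice.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain

def fit_chain_ensemble(X, Y, m=10, seed=0):
    """Fit m classifier chains, each with a different random label order."""
    return [ClassifierChain(LogisticRegression(max_iter=1000),
                            order="random", random_state=seed + k).fit(X, Y)
            for k in range(m)]

def ensemble_predict(chains, X, threshold=0.5):
    """Average the per-label probabilities across chains and threshold them."""
    probs = np.mean([chain.predict_proba(X) for chain in chains], axis=0)
    return (probs >= threshold).astype(int), probs
```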

For all three data sets we found that the proposed method was not sensitive to the choice of the search algorithm for each PCC model (results and figures not shown). That is, the classification results of MEPCC with the uniform cost search were similar to those with the greedy search. While the proposed method uses the uniform cost search, the greedy approach may also be considered, especially when fast prediction time matters.
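The difference between the two search strategies can be sketched as follows, assuming a function `cond_prob(j, prefix)` that returns the chain's estimate of P(y_j = 1 | x, y_1, ..., y_{j-1}) for a fixed answer x (this interface is hypothetical, for illustration only). Greedy search fixes one label at a time, whereas an exact search, shown here as brute-force enumeration, maximizes the joint probability; uniform cost or A* search (Mena et al., 2015) reaches the same maximizer without enumerating all 2^L label sets.

```python
from itertools import product

def greedy_inference(cond_prob, L):
    """Greedy search: fix each label in chain order using only the
    conditional probability given the labels already fixed."""
    y = []
    for j in range(L):
        p1 = cond_prob(j, tuple(y))      # P(y_j = 1 | x, labels fixed so far)
        y.append(1 if p1 >= 0.5 else 0)
    return tuple(y)

def exact_inference(cond_prob, L):
    """Exact mode of the joint label distribution by scoring all 2^L label
    sets; uniform cost (or A*) search finds the same maximizer faster."""
    best, best_p = None, -1.0
    for labels in product([0, 1], repeat=L):
        p = 1.0
        for j, yj in enumerate(labels):
            p1 = cond_prob(j, labels[:j])
            p *= p1 if yj == 1 else 1.0 - p1
        if p > best_p:
            best, best_p = labels, p
    return best
```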

Figure 5.3 shows that LP beats BR for the Civil and Immigrant data sets and BR beats LP for the Happy data set with respect to subset accuracy. We see two reasons: 1) LP performed well when the number of unique label sets was relatively small (Civil: 39, Immigrant: 59), but less well for the Happy data, where the number of unique label sets was large (346). 2) BR does not take into account correlations among the labels. BR beat LP where bivariate label correlations were low (Happy data) and LP beat BR where bivariate label correlations were larger (Civil and Immigrant data). Compared to BR and LP, MEPCC appears robust to both aspects (the number of unique label sets and the magnitude of label correlations).
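The mechanics behind the first point are simple: LP treats every distinct label combination observed in the training data as one class of an ordinary multi-class problem, as in the minimal sketch below (an illustration, not this paper's implementation), so its task grows harder as the number of unique label sets grows; BR instead fits one independent binary classifier per label and therefore ignores label correlations entirely.

```python
import numpy as np

def label_powerset_transform(Y):
    """Label powerset (LP): map every distinct row of the label matrix Y
    (one label combination) to a single class of a multi-class problem."""
    combos, y_multiclass = np.unique(Y, axis=0, return_inverse=True)
    return y_multiclass, combos  # class index per answer, plus the lookup table

# Binary relevance (BR) would instead fit one binary classifier per column of Y.
```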

The semi-automatic procedure introduced here works best for repeated survey questions where results from previous waves have been labeled, or for one-off questions where the sample size is large. How large should the training data be? We have used 5-fold cross-validation to evaluate the algorithm, but cross-validation is not appropriate in a production environment. If the question was asked in a previous wave, train the algorithm on all labeled data from all previous waves. If not, set a “sufficiently large” number of texts aside for labeling and training, and use the semi-automatic procedure on the remainder of the data. How large “sufficiently large” is depends on the task at hand. For single labeling tasks we have found that 500 training samples are often sufficient (Schonlau and Couper, 2016). There is a tradeoff: a larger training data set predicts more accurately but also reduces the scope for time savings because fewer unlabeled observations remain. Under reasonable assumptions, Schonlau and Couper (2016) suggested that human coding time savings for a single-label semi-automatic coding procedure attempting to code 1,000 (9,500) texts might be 14 (133) hours; 133 hours is equivalent to 16.6 eight-hour working days. Whether those time savings are large enough to warrant implementation of a semi-automatic procedure may be best decided with knowledge of the specific task and in the context of the specific production environment.

If some label combinations cannot occur in individual data sets, such constraints on label combinations may be added. For example, for the Happy data, if the label “nothing” is turned on, all other labels must be turned off. Knowing that “nothing” is incompatible with other labels requires some domain expertise. It would be straightforward to modify the algorithm to accommodate this constraint. Of course, all methods except BR already exploit dependencies between labels, so implementing this constraint may not affect performance very much. We did not implement such constraints in this article to avoid the impression that the algorithms rely heavily on them.
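One straightforward way to impose such a constraint is as a post-processing step on the predicted label matrix, as in the sketch below. The rule of resolving a conflict by comparing predicted probabilities is our illustrative choice (and assumes per-label probabilities are available); other rules, or building the constraint into the prediction step itself, are equally possible.

```python
import numpy as np

def enforce_exclusive_label(Y_pred, probs, exclusive_idx):
    """If the exclusive label (e.g. "nothing") is predicted together with other
    labels, keep whichever side has the higher predicted probability."""
    Y = Y_pred.copy()
    others = np.delete(np.arange(Y.shape[1]), exclusive_idx)
    for i in range(Y.shape[0]):
        on = others[Y[i, others] == 1]     # other labels predicted "on"
        if Y[i, exclusive_idx] == 1 and on.size > 0:
            if probs[i, exclusive_idx] >= probs[i, on].max():
                Y[i, on] = 0               # trust "nothing", clear the other labels
            else:
                Y[i, exclusive_idx] = 0    # trust the other labels, drop "nothing"
    return Y
```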

One limitation of this work is that the experimental study was conducted using only three text data sets. While there is no guarantee that performance will be equally good on other data sets, the data sets used in this paper cover different topics in different languages, which adds to the appeal of MEPCC. Also, all of the multi-label algorithms in this article used the same base learner (SVM) for classification. While SVM is one of the best performing approaches, other learning methods that produce probability outcomes could be chosen.
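To illustrate the last point, any classifier with probability output can, in principle, be dropped into the same multi-label wrappers. The scikit-learn sketch below shows this for a simple binary relevance wrapper; the experiments in this paper were run with the e1071 SVM in R, and the random forest here is merely one possible alternative.

```python
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

# An SVM with Platt-scaled probabilities (roughly analogous to this paper's base learner) ...
svm_br = MultiOutputClassifier(SVC(kernel="linear", probability=True))

# ... or any other learner that outputs class probabilities, e.g. a random forest.
rf_br = MultiOutputClassifier(RandomForestClassifier(n_estimators=200))

# Both are fitted the same way, e.g. svm_br.fit(X_train, Y_train)
```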

In conclusion, we investigated semi-automated classification for open-ended questions when the data are multi-labeled, using existing multi-label algorithms. We proposed a new algorithm for semi-automated classification that effectively combines multiple PCC models. The experimental results on three example data sets show that the proposed approach outperforms BR, LP and PCC in terms of subset accuracy and Hamming loss at most production rates. Although we focused on survey data from open-ended questions, the proposed approach can also be applied to other types of multi-label data when semi-automated classification is desired. A comprehensive analysis encompassing a variety of data in the context of semi-automated classification deserves further investigation.

References

Behr, D., Braun, M., Kaczmirek, L. and Bandilla, W. (2014). Item comparability in crossnational surveys: Results from asking probing questions in cross-national web surveys about attitudes towards civil disobedience. Quality & Quantity, 48(1), 127-148.

Braun, M., Behr, D. and Kaczmirek, L. (2013). Assessing cross-national equivalence of measures of xenophobia: Evidence from probing in web surveys. International Journal of Public Opinion Research, 25(3), 383-395.

Breiman, L. (2001). Random forests. Machine Learning, 45, 5-32.

Dembczyński, K., Cheng, W. and Hüllermeier, E. (2010). Bayes optimal multilabel classification via probabilistic classifier chains. Proceedings of the 27th International Conference on Machine Learning, 279-286.

Dembczyński, K., Waegeman, W. and Hüllermeier, E. (2012). An analysis of chaining in multi-label classification. In Frontiers in Artificial Intelligence and Applications, (Eds., L. De Raedt, C. Bessiere, D. Dubois, P. Doherty, P. Frasconi, F. Heintz and P. Lucas), 242, 294-299. IOS Press.

Guenther, N., and Schonlau, M. (2016). Support vector machines. The Stata Journal, 16(4), 917-937.

Gweon, H., Schonlau, M., Kaczmirek, L., Blohm, M. and Steiner, S. (2017). Three methods for occupation coding based on statistical learning. Journal of Official Statistics, 33(1), 101-122.

ISSP Research Group (2012). International social survey programme: Citizenship - ISSP 2004. GESIS data archive, Cologne. ZA3950 data file version 1.3.0, https://doi.org/10.4232/1.11372.

Manning, C., Raghavan, P. and Schütze, H. (2008). Introduction to Information Retrieval, Chapter 2.2. Cambridge, England: Cambridge University Press.

Matthews, P., Kyriakopoulos, G. and Holcekova, M. (2018). Machine learning and verbatim survey responses: Classification of criminal offences in the crime survey for England and Wales. Paper presented at BigSurv18, Barcelona, Spain.

Mena, D., Montañés, E., Quevedo, J.R. and Del Coz, J.J. (2015). Using A* for inference in probabilistic classifier chains. Proceedings of the 24th International Conference on Artificial Intelligence, 3707-3713. AAAI Press.

Meyer, D., Dimitriadou, E., Hornik, K., Weingessel, A. and Leisch, F. (2014). e1071: Misc Functions of The Department of Statistics, TU Wien. http://CRAN.R-project.org/package=e1071.

Niculescu-Mizil, A., and Caruana, R. (2005). Predicting good probabilities with supervised learning. Proceedings of the 22nd International Conference on Machine Learning, New York, NY, U.S.A., 625-632. ACM.

Platt, J. (2000). Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classifiers, (Eds., A. Smola, P. Bartlett, B. Schölkopf and D. Schuurmans), 61-74. MIT Press.

R Core Team (2014). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. http://www.R-project.org/.

Read, J., Pfahringer, B., Holmes, G. and Frank, E. (2009). Classifier chains for multi-label classification. In Machine Learning and Knowledge Discovery in Databases, (Eds., W. Buntine, M. Grobelnik, D. Mladenić and J. Shawe-Taylor), 254-269. Springer.

Read, J., Pfahringer, B., Holmes, G. and Frank, E. (2011). Classifier chains for multi-label classification. Machine Learning, 85(3), 333-359.

Schonlau, M., and Couper, M.P. (2016). Semi-automated categorization of open-ended questions. Survey Research Methods, 10(2), 143-152.

Schonlau, M., Guenther, N. and Sucholutsky, I. (2017). Text mining using ngram variables. The Stata Journal, 17(4), 866-881.

Schonlau, M., Gweon, H. and Wenemark, M. (2019). Automatic classification of open-ended questions: Check-all-that-apply questions. Social Science Computer Review. First published online August 20, 2019 (to appear in a future issue). https://journals.sagepub.com/doi/full/10.1177/0894439319869210.

Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 1-47.

Vapnik, V.N. (2000). The Nature of Statistical Learning Theory. 2nd Edition. Springer.

Wenemark, M., Borgstedt-Risberg, M., Garvin, P., Dahlin, S., Jusufbegovic, J., Gamme, C., Johansson, V. and Björn, E. (2018). Psykisk hälsa i sydöstra sjukvårdsregionen: En kartläggning av självskattad psykisk hälsa i Jönköping, Kalmar och Östergötlands län hösten 2015/16 [Mental health in the south-eastern healthcare region: A survey of self-rated mental health in Jönköping, Kalmar and Östergötland counties in autumn 2015/16]. Retrieved from https://vardgivarwebb.regionostergotland.se/pages/285382/Psykisk_halsa_syostra_sjukvarsregionen.pdf.

Ye, C., Medway, R. and Kelley, C. (2018). Natural language processing for open-ended survey questions. Paper presented at BigSurv18, Barcelona, Spain.
