Semi-automated classification for multi-label open-ended questions
Section 5. Experiments
5.1 Data
We evaluated the performance of the MEPCC algorithm on three different data sets: the Civil disobedience, Immigrant and Happy data. (The Happy data are available upon request from Marika Wenemark, marika.wenemark@liu.se. The Immigrant and Civil disobedience data are available from the GESIS Datorium, http://dx.doi.org/10.7802/1795.) For each data set, an open-ended question was asked of the respondents and their answers were coded manually with possibly multiple labels.
The Civil data set was collected to study cross-cultural equivalence with respect to civil disobedience. Behr, Braun, Kaczmirek and Bandilla (2014) first asked respondents a closed-ended question from the ISSP (ISSP Research Group, 2012): “How important is it that citizens may engage in acts of civil disobedience when they oppose government actions?” (Not at all important 1 − Very important 7). The respondents were then asked: “What ideas do you associate with the phrase ‘civil disobedience’? Please give examples.” Answers were classified into 12 labels: non-productive, violence, disturbances, peaceful, listing activities, breadth of actions, breaking law, breaking rules, government:dissatisfaction, government:deep rift, copy/paste from the Internet, other. The survey data were collected in different languages; we use a merged data set (Spanish, German and Danish) that contains 1,029 observations.
The Immigrant data set was collected to study cross-national equivalence of measures of xenophobia. In the 2003 International Social Survey Program (ISSP) on National Identity, the questionnaire contained four statements regarding beliefs about immigrants, such as “Immigrants take jobs from people who were born in Germany.” After rating each statement, respondents were asked an open-ended question: “Which type of immigrants were you thinking of when you answered the question? The previous statement was: [text of the corresponding item].” Braun, Behr and Kaczmirek (2013) classified answers into 14 labels: non-productive, positive, negative, neutral/work, general, Muslim countries, eastern European, Asia, ex-Yugoslavia, EU15, sub Sahara, Sinti/Roma, legal/illegal, other. In this article, we use 1,006 observations from the German survey.
The Happy data set was collected to study the relationship between positive factors and mental health and care needs. Wenemark, Borgstedt-Risberg, Garvin, Dahlin, Jusufbegovic, Gamme, Johansson and Björn (2018) asked respondents: “Name some positive things in your life that are uplifting or make you happy (you may write several things).” Answers were classified into 13 labels: nothing, relationships (family or romantic), working/studying, health, self-esteem, joy/happiness, well-being: drinking/eating/drugs/sex, spirituality, money, nature, hobbies, culture, and exercise. The data set contains 2,350 observations.
Table 5.1
contains summary statistics about the three data sets.
Table 5.1
Summary statistics of the data sets: total number of observations, number of features, number of labels (L), average number of relevant labels, and percentage of observations associated with more than one label

| Data | # observations | # features | L | av. # of labels | % with > 1 label |
|---|---|---|---|---|---|
| Civil | 1,029 | 305 | 12 | 1.15 | 13.80% |
| Immigrant | 1,006 | 273 | 14 | 1.19 | 13.72% |
| Happy | 2,350 | 492 | 13 | 2.77 | 87.40% |
5.2 Experimental setup
We compared the proposed MEPCC method against BR, LP and PCC. For PCC, we used uniform-cost search to reach a predicted label set and the estimated probability in equation (3.1) as the confidence score of the prediction. EPCC was not included in the comparison because its computational cost makes prediction infeasible for our data sets. (In our experiment on the Immigrant data with 14 labels, running the exhaustive search for PCC for a single prediction took over 30 minutes on a single computer (Intel Core i7 CPU with 8GB RAM). At that rate, predicting 200 observations using EPCC would take more than 100 hours.) Support
vector machines (SVM) (Vapnik, 2000) were used as the base classifier on unscaled variables with a linear kernel and tuning parameter C. For probabilistic output, the SVM scores were converted into probabilities using Platt’s method (Platt, 2000). The analysis was conducted in R (R Core Team, 2014) using the e1071 package (Meyer, Dimitriadou, Hornik, Weingessel and Leisch, 2014) for SVM.
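As an illustration, a minimal sketch (not the authors’ code) of this base-classifier setup with e1071 follows; the data frames train and test and the factor outcome label are placeholders:

```r
# Minimal sketch of the base classifier: a linear-kernel SVM with
# Platt-scaled probability output via e1071. 'train', 'test' and the
# factor outcome 'label' are placeholders.
library(e1071)

fit <- svm(label ~ ., data = train, kernel = "linear",
           scale = FALSE,        # unscaled variables, as in the text
           probability = TRUE)   # enables Platt-style calibration

pred  <- predict(fit, newdata = test, probability = TRUE)
probs <- attr(pred, "probabilities")  # per-class probability estimates
```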
For each data set, 5-fold cross-validation (CV) was performed. That is, we randomly divided the data into five equal-sized parts, trained on four parts and evaluated on the remaining part. Each of the five parts was used as test data once, performance was evaluated only on the test data, and the results were averaged over the five folds.
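For concreteness, the fold construction can be sketched as follows, with dat a placeholder for one of the three data sets:

```r
# Sketch of the 5-fold cross-validation split; 'dat' is a placeholder.
set.seed(1)
folds <- sample(rep(1:5, length.out = nrow(dat)))  # random fold labels

results <- sapply(1:5, function(k) {
  train <- dat[folds != k, ]  # four parts for training
  test  <- dat[folds == k, ]  # one part for evaluation
  # ... fit the classifier on 'train', evaluate on 'test' ...
  NA  # placeholder for the fold's performance measure
})
mean(results)  # average over the five folds
```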
5.3 Performance of the MEPCC approach
We first investigated the performance of the MEPCC. The score in equation (4.1) has two components. To demonstrate that both components are helpful, we evaluated the proposed score as well as two variants, MEPCC-1 and MEPCC-2, in each of which one of the two components is omitted. Prioritizing the text answers based on these scores results in many ties. The tied answers were randomly reordered to be able to calculate subset accuracy at each production rate. Figure 5.1 shows the subset accuracy of each approach as a function of the production rate. The text answers with higher scores were classified first. For example, a production rate of 0.2 means that only the 20% of the test data with the highest scores were classified automatically by the models. When the production rate equals 1, there was no difference between the MEPCC models because the predicted label sets are always the same; the models differ only in how they prioritize the text answers from the easiest-to-classify to the hardest-to-classify. When the production rate was less than 1, MEPCC outperformed MEPCC-1 and MEPCC-2 for all three data sets. The results show that both components in equation (4.1) were helpful for prioritizing the observations.
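These production-rate curves can be computed directly from the ranked scores. A minimal sketch in R, assuming placeholder vectors score (the confidence score of each test answer) and correct (an indicator of whether the predicted label set matched the true set exactly):

```r
# Subset accuracy among the top 'rate' fraction of test answers,
# ranked by confidence score; ties are broken randomly, as in the
# text. 'score' and 'correct' are placeholder vectors.
subset_accuracy_at <- function(score, correct, rate) {
  ord <- order(score, runif(length(score)), decreasing = TRUE)
  n   <- ceiling(rate * length(score))
  mean(correct[ord][seq_len(n)])
}

# e.g., subset accuracy when 20% of the answers are coded automatically:
# subset_accuracy_at(score, correct, rate = 0.2)
```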

Description for Figure 5.1
Figure presenting three graphs, one for each of the following data sets: Civil, Immigrant and Happy. The subset accuracy is on the y-axis, ranging from 0.6 to 1.0. The percentage of automated categorization is on the x-axis, ranging from 0.2 to 1.0. Each graph compares three approaches: MEPCC, MEPCC-1 and MEPCC-2. When the production rate is 1, there is no difference between the MEPCC models. When the production rate is lower than 1, MEPCC outperforms MEPCC-1 and MEPCC-2 for the three data sets.
5.4 Effect of the number of PCC models
We then investigated to what extent the number of PCC models, m, affects the predictive performance of MEPCC. Figure 5.2 shows the performance of MEPCC for different numbers of PCC models. When m was low, increasing m led to a substantial improvement in the subset accuracy of MEPCC. However, once there were enough PCC models, adding more PCC models did not improve the subset accuracy further. The empirical results show that MEPCC does not require many PCC models to perform well.
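Schematically, the ensemble is grown by training one PCC per random label order. The sketch below only illustrates this structure; fit_pcc is a hypothetical helper standing in for a single probabilistic classifier chain and is not part of any package:

```r
# Schematic only: vary the ensemble size m by training one PCC per
# random permutation of the L labels. fit_pcc() is a hypothetical
# helper (not from any package) representing a single probabilistic
# classifier chain trained with the given label order on 'train'.
L <- 12   # e.g., number of labels in the Civil data
m <- 10   # ensemble size

set.seed(1)
orders <- replicate(m, sample(L), simplify = FALSE)
models <- lapply(orders, function(o) fit_pcc(train, label_order = o))
# The m chains' predictions are then combined into the score of
# equation (4.1).
```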

Description for Figure 5.2
Figure presenting three graphs, one for each of the following data sets: Civil, Immigrant and Happy. They show the performance of MEPCC for different numbers of PCC models. The subset accuracy is on the y-axis, ranging from 0.6 to 1.0. The percentage of automated categorization is on the x-axis, ranging from 0.2 to 1.0. Each graph compares five numbers of PCC models: 1, 5, 10, 20 and 30. When the number of PCC models is low, increasing it leads to a substantial improvement in the subset accuracy of MEPCC. However, once there are enough PCC models, adding more does not improve the subset accuracy. The empirical results show that MEPCC does not require many PCC models to perform well.
5.5 Comparison with other methods
Finally, we investigated the performance of MEPCC compared to the established methods (BR, LP and PCC). For all methods, a production rate of x% refers to the x% of the data that have the highest score. MEPCC used the score in equation (4.1), while each of the other approaches used the probability of the predicted label set estimated by that method. Note that when the number of PCC models is m = 1, MEPCC and PCC are identical; the MEPCC score then coincides with the probability of the label set predicted by PCC.
Figures 5.3 and 5.4 show the subset accuracy and the Hamming loss, respectively, for the different methods as a function of the production rate on the Civil, Immigrant and Happy data. For the Immigrant and Happy data, the highest subset accuracy at most production rates was obtained by MEPCC. For the Civil data, MEPCC and LP performed best. In terms of Hamming loss, MEPCC achieved the lowest error at most production rates for all three data sets.
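For reference, both evaluation metrics are simple functions of the true and predicted label matrices. A sketch in R, where Y and Yhat are placeholder n × L matrices of 0/1 labels:

```r
# Evaluation metrics for multi-label prediction; 'Y' (true) and
# 'Yhat' (predicted) are placeholder 0/1 matrices of size n x L.
subset_accuracy <- function(Y, Yhat) {
  mean(apply(Y == Yhat, 1, all))  # exact match of the whole label set
}

hamming_loss <- function(Y, Yhat) {
  mean(Y != Yhat)                 # fraction of individual label errors
}
```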

Description for Figure 5.3
Figure presenting three graphs, one for each of the following data sets: Civil, Immigrant and Happy. The subset accuracy is on the y-axis, ranging from 0.5 to 1.0 for the first two data sets and from 0.4 to 1.0 for the last one. The percentage of automated categorization is on the x-axis, ranging from 0.2 to 1.0. Each graph compares four methods: BR, LP, PCC and MEPCC. For the Immigrant and Happy data sets, the highest subset accuracy at most production rates was obtained by MEPCC. For the Civil data, MEPCC and LP performed the best.

Description for Figure 5.4
Figure presenting three graphs, one
for each of the following data sets: Civil, Immigrant and Happy. The Hamming
loss is on the y-axis, ranging from 0.00 to 0.08 for the first two data sets
and from 0.00 to 0.10 for the last one. The percentage of automated
categorization is on the x-axis, ranging from 0.2 to 1.0. Each graph compares
four methods: BR, LP, PCC and MEPCC. MEPCC achieves the lowest error at most
production rates for the three data sets.
Next, we consider the performance of each method given target predicted accuracy values. To decide the fraction of automated categorization, a practitioner will typically set a threshold probability above which texts are coded automatically. For MEPCC, the relationship between true accuracy and the confidence score was estimated via cross-validation on the training data. We used Platt’s scaling to convert the confidence scores into probability outputs. Since Platt’s scaling can improve calibration (Niculescu-Mizil and Caruana, 2005), the same technique was also applied to BR, LP and PCC.
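This calibration step can be sketched as a logistic regression of cross-validated correctness on the confidence score, p(correct | s) = 1 / (1 + exp(A·s + B)). In the sketch below, cv_score and cv_correct are placeholder vectors obtained on the training folds, and new_score holds scores for new answers:

```r
# Platt scaling: logistic regression of cross-validated correctness
# on the confidence score. 'cv_score', 'cv_correct' and 'new_score'
# are placeholder vectors.
calib <- glm(cv_correct ~ cv_score, family = binomial)

# Calibrated accuracy estimates for new confidence scores:
p_hat <- predict(calib, newdata = data.frame(cv_score = new_score),
                 type = "response")
```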
Table 5.2 illustrates the trade-off between the percentage of automated predictions and the corresponding subset accuracy of each method as a function of the decision threshold. The threshold refers to the minimum predicted subset accuracy required for automated prediction and thus determines which text answers are classified automatically and which are classified manually. For example, if the client decides that at least 80% accuracy is required for automated classification, then approximately 39.3% of the Civil data, 42.5% of the Immigrant data, and 27.6% of the Happy data can be classified automatically by MEPCC, with subset accuracies of 0.891, 0.916 and 0.857, respectively. This is a large improvement over BR, which could only automatically classify 9.3% of the Civil data, 12.8% of the Immigrant data, and 8.7% of the Happy data, with lower subset accuracies. Table 5.3 shows the relationship between predicted and actual accuracy by aggregating predictions into ranges for each method and data set. For MEPCC the actual accuracy falls within the range of the predicted accuracy in most cases, much more often than for the other methods.
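The resulting decision rule is then a simple threshold on the calibrated probabilities; a sketch, where p_hat is a placeholder for the calibrated predicted subset accuracy of each test answer:

```r
# Semi-automated split at a chosen accuracy threshold; 'p_hat' is a
# placeholder for the calibrated probability of each test answer.
threshold <- 0.8
auto <- p_hat >= threshold   # these answers are coded automatically

mean(auto)                   # P: the percentage of automated predictions
# The remaining answers (!auto) are routed to manual coding.
```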
Table 5.2
Semi-automated results for the three data sets at different decision thresholds. P is the percentage of automated predictions and SA is the subset accuracy of the automated predictions

| Data | Threshold | BR P | BR SA | LP P | LP SA | PCC P | PCC SA | MEPCC P | MEPCC SA |
|---|---|---|---|---|---|---|---|---|---|
| Civil | 0.9 | 0.7% | 0.667 | 16.5% | 0.967 | 0.0% | NA | 13.0% | 0.978 |
| Civil | 0.8 | 9.3% | 0.893 | 34.3% | 0.898 | 15.1% | 0.787 | 39.3% | 0.891 |
| Civil | 0.7 | 18.4% | 0.846 | 46.6% | 0.852 | 36.4% | 0.817 | 45.8% | 0.860 |
| Civil | 0.6 | 25.4% | 0.768 | 50.6% | 0.831 | 52.1% | 0.771 | 52.9% | 0.820 |
| Immigrant | 0.9 | 3.7% | 0.858 | 11.1% | 0.959 | 1.3% | 0.558 | 31.5% | 0.947 |
| Immigrant | 0.8 | 12.8% | 0.779 | 30.4% | 0.890 | 27.7% | 0.859 | 42.5% | 0.916 |
| Immigrant | 0.7 | 26.6% | 0.743 | 38.6% | 0.863 | 42.4% | 0.829 | 55.1% | 0.862 |
| Immigrant | 0.6 | 41.7% | 0.715 | 53.6% | 0.806 | 50.5% | 0.795 | 62.7% | 0.839 |
| Happy | 0.9 | 1.3% | 0.592 | 8.9% | 0.850 | 0.1% | 0.750 | 1.0% | 0.830 |
| Happy | 0.8 | 8.7% | 0.734 | 14.3% | 0.802 | 7.2% | 0.726 | 27.6% | 0.857 |
| Happy | 0.7 | 32.8% | 0.776 | 17.7% | 0.793 | 29.9% | 0.767 | 43.7% | 0.817 |
| Happy | 0.6 | 53.2% | 0.745 | 22.2% | 0.761 | 49.2% | 0.744 | 52.0% | 0.790 |
Table 5.3
Semi-automated results for the three data sets at different ranges of predicted accuracy. P is the percentage of automated predictions and SA is the subset accuracy of the automated predictions

| Data | Predicted accuracy | BR P | BR SA | LP P | LP SA | PCC P | PCC SA | MEPCC P | MEPCC SA |
|---|---|---|---|---|---|---|---|---|---|
| Civil | [0.9, 1.0] | 0.7% | 0.667 | 16.5% | 0.967 | 0.0% | NA | 13.0% | 0.978 |
| Civil | [0.8, 0.9) | 8.7% | 0.896 | 17.8% | 0.834 | 15.1% | 0.787 | 26.2% | 0.846 |
| Civil | [0.7, 0.8) | 9.0% | 0.769 | 12.2% | 0.710 | 21.3% | 0.828 | 6.5% | 0.681 |
| Civil | [0.6, 0.7) | 7.0% | 0.566 | 4.1% | 0.584 | 15.7% | 0.655 | 7.1% | 0.563 |
| Immigrant | [0.9, 1.0] | 3.7% | 0.858 | 11.1% | 0.959 | 1.3% | 0.558 | 31.5% | 0.947 |
| Immigrant | [0.8, 0.9) | 9.1% | 0.750 | 19.3% | 0.843 | 26.4% | 0.869 | 11.0% | 0.829 |
| Immigrant | [0.7, 0.8) | 13.8% | 0.710 | 8.2% | 0.747 | 14.7% | 0.757 | 12.5% | 0.688 |
| Immigrant | [0.6, 0.7) | 15.1% | 0.602 | 15.0% | 0.659 | 8.1% | 0.623 | 7.7% | 0.670 |
| Happy | [0.9, 1.0] | 1.3% | 0.592 | 8.9% | 0.850 | 0.1% | 0.750 | 1.0% | 0.830 |
| Happy | [0.8, 0.9) | 7.4% | 0.755 | 5.4% | 0.717 | 7.1% | 0.730 | 26.5% | 0.858 |
| Happy | [0.7, 0.8) | 24.0% | 0.792 | 3.4% | 0.751 | 22.7% | 0.779 | 16.2% | 0.749 |
| Happy | [0.6, 0.7) | 20.4% | 0.693 | 4.6% | 0.615 | 19.3% | 0.703 | 8.3% | 0.647 |
Table 5.4 shows the runtime of each method for training the model and for predicting all instances in the test data (Intel Core i7 CPU with 8GB RAM). Unsurprisingly, the runtime of MEPCC with m = 10 PCC models is roughly 10 times that of PCC in both the training and prediction stages.
Table 5.4
Runtime (in seconds) of each method for the three data sets

| Data | Stage | BR | LP | PCC | MEPCC |
|---|---|---|---|---|---|
| Civil | Train | 1.688 | 0.641 | 1.128 | 11.787 |
| Civil | Prediction | 0.269 | 0.044 | 37.142 | 374.611 |
| Immigrant | Train | 1.363 | 0.510 | 0.894 | 8.724 |
| Immigrant | Prediction | 0.200 | 0.056 | 35.369 | 334.075 |
| Happy | Train | 11.160 | 16.164 | 7.371 | 78.293 |
| Happy | Prediction | 0.567 | 3.691 | 177.847 | 1,746.529 |