Semi-automated classification for multi-label open-ended questions
Section 5. Experiments
5.1 Data
We evaluated the performance of the MEPCC algorithm on three different data sets: the Civil disobedience, Immigrant and Happy data. (The Happy data are available upon request from Marika Wenemark, marika.wenemark@liu.se. The Immigrant and Civil disobedience data are available from the GESIS Datorium, http://dx.doi.org/10.7802/1795.) For each data set, an open-ended question was asked of the respondents and their answers were coded manually with possibly multiple labels.
The Civil data set was collected to study cross-cultural equivalence with respect to civil disobedience. Behr, Braun, Kaczmirek and Bandilla (2014) first asked respondents a closed-ended question from the ISSP (ISSP Research Group, 2012): “How important is it that citizens may engage in acts of civil disobedience when they oppose government actions?” (Not at all important 1 − Very important 7). The respondents were then asked: “What ideas do you associate with the phrase ‘civil disobedience’? Please give examples.” Answers were classified into 12 labels: non-productive, violence, disturbances, peaceful, listing activities, breadth of actions, breaking law, breaking rules, government:dissatisfaction, government:deep rift, copy/paste from the Internet, other. The survey data were collected in different languages; we use a merged data set (Spanish, German and Danish) that contains 1,029 observations.
The Immigrant data set was collected to study cross-national equivalence of measures of xenophobia. In the 2003 International Social Survey Program (ISSP) on National Identity, the questionnaire contained four statements regarding beliefs about immigrants, such as “Immigrants take jobs from people who were born in Germany.” After rating each statement, respondents were asked an open-ended question: “Which type of immigrants were you thinking of when you answered the question? The previous statement was: [text of the corresponding item].” Braun, Behr and Kaczmirek (2013) classified answers into 14 labels: non-productive, positive, negative, neutral/work, general, Muslim countries, eastern European, Asia, ex-Yugoslavia, EU15, sub Sahara, Sinti/Roma, legal/illegal, other. In this article, we use 1,006 observations from the German survey.
The Happy data set was collected to study the relationship between positive factors and mental health and care needs. Wenemark, Borgstedt-Risberg, Garvin, Dahlin, Jusufbegovic, Gamme, Johansson and Björn (2018) asked respondents: “Name some positive things in your life that are uplifting or make you happy (you may write several things).” Answers were classified into 13 labels: nothing, relationships (family or romantic), working/studying, health, self-esteem, joy/happiness, well-being: drinking/eating/drugs/sex, spirituality, money, nature, hobbies, culture, and exercise. The data set contains 2,350 observations.
Table 5.1
contains summary statistics about the three data sets.
Table 5.1
Summary statistics of the data sets: total number of observations, number of features, number of labels (L), average number of relevant labels, and percentage of observations associated with more than one label

| Data | # observations | # features | L | av. # of labels | % with > 1 label |
|---|---|---|---|---|---|
| Civil | 1,029 | 305 | 12 | 1.15 | 13.80% |
| Immigrant | 1,006 | 273 | 14 | 1.19 | 13.72% |
| Happy | 2,350 | 492 | 13 | 2.77 | 87.40% |
5.2 Experimental setup
We compared the proposed MEPCC method against BR, LP and PCC. For PCC, we used uniform-cost search to reach a predicted label set and the estimated probability in equation (3.1) as the confidence score of the prediction. EPCC was not included in the comparison because its computational cost makes prediction infeasible for our data sets. (In our experiment on the Immigrant data with 14 labels, running the exhaustive search for PCC for a single prediction took over 30 minutes on a single computer (Intel Core i7 CPU with 8GB RAM). At that rate, predicting 200 observations using EPCC would take more than 100 hours.) Support
vector machines (SVM) (Vapnik, 2000) were used as the base classifier on unscaled variables with a linear kernel and tuning parameter C. For probabilistic output, the SVM scores were converted into probabilities using Platt’s method (Platt, 2000). The analysis was conducted in R (R Core Team, 2014) using the e1071 package (Meyer, Dimitriadou, Hornik, Weingessel and Leisch, 2014) for SVM.
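As an illustration, a minimal sketch (not the authors’ code) of this base-classifier setup with e1071 follows; the data frames train and test and the factor outcome label are placeholders:

```r
# Minimal sketch of the base classifier: a linear-kernel SVM with
# Platt-scaled probability output via e1071. 'train', 'test' and the
# factor outcome 'label' are placeholders.
library(e1071)

fit <- svm(label ~ ., data = train, kernel = "linear",
           scale = FALSE,        # unscaled variables, as in the text
           probability = TRUE)   # enables Platt-style calibration

pred  <- predict(fit, newdata = test, probability = TRUE)
probs <- attr(pred, "probabilities")  # per-class probability estimates
```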
For each data set, 5-fold cross-validation (CV) was performed. That is, we randomly divided the data into five equal-sized parts, trained on four parts and evaluated on the remaining part. Each of the five parts was used as test data once, performance was evaluated only on the test data, and the results were averaged over the five folds.
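For concreteness, the fold construction can be sketched as follows, with dat a placeholder for one of the three data sets:

```r
# Sketch of the 5-fold cross-validation split; 'dat' is a placeholder.
set.seed(1)
folds <- sample(rep(1:5, length.out = nrow(dat)))  # random fold labels

results <- sapply(1:5, function(k) {
  train <- dat[folds != k, ]  # four parts for training
  test  <- dat[folds == k, ]  # one part for evaluation
  # ... fit the classifier on 'train', evaluate on 'test' ...
  NA  # placeholder for the fold's performance measure
})
mean(results)  # average over the five folds
```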
5.3 Performance of the MEPCC approach
We first investigated the performance of the MEPCC. The score in equation (4.1) has two components. To demonstrate that both components are helpful, we evaluated the proposed score as well as two variants, MEPCC-1 and MEPCC-2, in each of which one of the two components is omitted. Prioritizing the text answers based on these scores results in many ties. The tied answers were randomly reordered to be able to calculate subset accuracy at each production rate. Figure 5.1 shows the subset accuracy of each approach as a function of the production rate. The text answers with higher scores were classified first. For example, a production rate of 0.2 means that only the 20% of the test data with the highest scores were classified automatically by the models. When the production rate equals 1, there was no difference between the MEPCC models because the predicted label sets are always the same; the models differ only in how they prioritize the text answers from the easiest-to-classify to the hardest-to-classify. When the production rate was less than 1, MEPCC outperformed MEPCC-1 and MEPCC-2 for all three data sets. The results show that both components in equation (4.1) were helpful for prioritizing the observations.
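These production-rate curves can be computed directly from the ranked scores. A minimal sketch in R, assuming placeholder vectors score (the confidence score of each test answer) and correct (an indicator of whether the predicted label set matched the true set exactly):

```r
# Subset accuracy among the top 'rate' fraction of test answers,
# ranked by confidence score; ties are broken randomly, as in the
# text. 'score' and 'correct' are placeholder vectors.
subset_accuracy_at <- function(score, correct, rate) {
  ord <- order(score, runif(length(score)), decreasing = TRUE)
  n   <- ceiling(rate * length(score))
  mean(correct[ord][seq_len(n)])
}

# e.g., subset accuracy when 20% of the answers are coded automatically:
# subset_accuracy_at(score, correct, rate = 0.2)
```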

Description for Figure 5.1
Figure presenting three graphs, one for each of the following data sets: Civil, Immigrant and Happy. The subset accuracy is on the y-axis, ranging from 0.6 to 1.0. The percentage of automated categorization is on the x-axis, ranging from 0.2 to 1.0. Each graph compares three approaches: MEPCC, MEPCC-1 and MEPCC-2. When the production rate is 1, there is no difference between the MEPCC models. When the production rate is lower than 1, MEPCC outperforms MEPCC-1 and MEPCC-2 for the three data sets.
5.4 Effect of the number of PCC models
We then investigated to what extent the number of PCC models, m, affects the predictive performance of MEPCC. Figure 5.2 shows the performance of MEPCC for different numbers of PCC models. When m was low, increasing m led to a substantial improvement in the subset accuracy of MEPCC. However, once there were enough PCC models, adding more PCC models did not improve the subset accuracy further. The empirical results show that MEPCC does not require many PCC models to perform well.
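Schematically, the ensemble is grown by training one PCC per random label order. The sketch below only illustrates this structure; fit_pcc is a hypothetical helper standing in for a single probabilistic classifier chain and is not part of any package:

```r
# Schematic only: vary the ensemble size m by training one PCC per
# random permutation of the L labels. fit_pcc() is a hypothetical
# helper (not from any package) representing a single probabilistic
# classifier chain trained with the given label order on 'train'.
L <- 12   # e.g., number of labels in the Civil data
m <- 10   # ensemble size

set.seed(1)
orders <- replicate(m, sample(L), simplify = FALSE)
models <- lapply(orders, function(o) fit_pcc(train, label_order = o))
# The m chains' predictions are then combined into the score of
# equation (4.1).
```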

Description for Figure 5.2
Figure presenting three graphs, one for each of the following data sets: Civil, Immigrant and Happy. They show the performance of MEPCC for different numbers of PCC models. The subset accuracy is on the y-axis, ranging from 0.6 to 1.0. The percentage of automated categorization is on the x-axis, ranging from 0.2 to 1.0. Each graph compares five numbers of PCC models: 1, 5, 10, 20 and 30. When the number of PCC models is low, increasing it leads to a substantial improvement in the subset accuracy of MEPCC. However, once there are enough PCC models, adding more does not improve the subset accuracy. The empirical results show that MEPCC does not require many PCC models to perform well.
5.5 Comparison with other methods
Finally, we investigated the performance of MEPCC compared to the established methods (BR, LP and PCC). For all methods, a production rate of x% refers to the x% of the data that have the highest score. MEPCC used the score in equation (4.1), while each of the other approaches used the probability of the predicted label set estimated by that method. Note that when the number of PCC models is m = 1, MEPCC and PCC are identical; the MEPCC score then coincides with the probability of the label set predicted by PCC.
Figures 5.3 and 5.4 show the subset accuracy and the Hamming loss, respectively, for the different methods as a function of the production rate on the Civil, Immigrant and Happy data. For the Immigrant and Happy data, the highest subset accuracy at most production rates was obtained by MEPCC. For the Civil data, MEPCC and LP performed best. In terms of Hamming loss, MEPCC achieved the lowest error at most production rates for all three data sets.
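For reference, both evaluation metrics are simple functions of the true and predicted label matrices. A sketch in R, where Y and Yhat are placeholder n × L matrices of 0/1 labels:

```r
# Evaluation metrics for multi-label prediction; 'Y' (true) and
# 'Yhat' (predicted) are placeholder 0/1 matrices of size n x L.
subset_accuracy <- function(Y, Yhat) {
  mean(apply(Y == Yhat, 1, all))  # exact match of the whole label set
}

hamming_loss <- function(Y, Yhat) {
  mean(Y != Yhat)                 # fraction of individual label errors
}
```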

Description for Figure 5.3
Figure presenting three graphs, one for each of the following data sets: Civil, Immigrant and Happy. The subset accuracy is on the y-axis, ranging from 0.5 to 1.0 for the first two data sets and from 0.4 to 1.0 for the last one. The percentage of automated categorization is on the x-axis, ranging from 0.2 to 1.0. Each graph compares four methods: BR, LP, PCC and MEPCC. For the Immigrant and Happy data sets, the highest subset accuracy at most production rates was obtained by MEPCC. For the Civil data, MEPCC and LP performed the best.

Description for Figure 5.4
Figure presenting three graphs, one
for each of the following data sets: Civil, Immigrant and Happy. The Hamming
loss is on the y-axis, ranging from 0.00 to 0.08 for the first two data sets
and from 0.00 to 0.10 for the last one. The percentage of automated
categorization is on the x-axis, ranging from 0.2 to 1.0. Each graph compares
four methods: BR, LP, PCC and MEPCC. MEPCC achieves the lowest error at most
production rates for the three data sets.
Next, we consider the performance of each method given target predicted accuracy values. To decide the fraction of automated categorization, a practitioner will typically set a threshold probability above which texts are coded automatically. For MEPCC, the relationship between true accuracy and the confidence score was estimated via cross-validation on the training data. We used Platt’s scaling to convert the confidence scores into probability outputs. Since Platt’s scaling can improve calibration (Niculescu-Mizil and Caruana, 2005), the same technique was also applied to BR, LP and PCC.
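This calibration step can be sketched as a logistic regression of cross-validated correctness on the confidence score, p(correct | s) = 1 / (1 + exp(A·s + B)). In the sketch below, cv_score and cv_correct are placeholder vectors obtained on the training folds, and new_score holds scores for new answers:

```r
# Platt scaling: logistic regression of cross-validated correctness
# on the confidence score. 'cv_score', 'cv_correct' and 'new_score'
# are placeholder vectors.
calib <- glm(cv_correct ~ cv_score, family = binomial)

# Calibrated accuracy estimates for new confidence scores:
p_hat <- predict(calib, newdata = data.frame(cv_score = new_score),
                 type = "response")
```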
Table 5.2 illustrates the trade-off between the percentage of automated predictions and the corresponding subset accuracy of each method as a function of the decision threshold. The threshold refers to the minimum predicted subset accuracy required for automated prediction and thus determines which text answers are classified automatically and which are classified manually. For example, if the client decides that at least 80% accuracy is required for automated classification, then approximately 39.3% of the Civil data, 42.5% of the Immigrant data, and 27.6% of the Happy data can be classified automatically by MEPCC, with subset accuracies of 0.891, 0.916 and 0.857, respectively. This is a large improvement over BR, which could only automatically classify 9.3% of the Civil data, 12.8% of the Immigrant data, and 8.7% of the Happy data, with lower subset accuracies. Table 5.3 shows the relationship between predicted and actual accuracy by aggregating predictions into ranges for each method and data set. For MEPCC the actual accuracy falls within the range of the predicted accuracy in most cases, much more often than for the other methods.
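The resulting decision rule is then a simple threshold on the calibrated probabilities; a sketch, where p_hat is a placeholder for the calibrated predicted subset accuracy of each test answer:

```r
# Semi-automated split at a chosen accuracy threshold; 'p_hat' is a
# placeholder for the calibrated probability of each test answer.
threshold <- 0.8
auto <- p_hat >= threshold   # these answers are coded automatically

mean(auto)                   # P: the percentage of automated predictions
# The remaining answers (!auto) are routed to manual coding.
```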
Table 5.2
Semi-automated results for the three data sets at different decision thresholds. P is the percentage of automated predictions and SA is the subset accuracy of the automated predictions

| Data | Threshold | BR P | BR SA | LP P | LP SA | PCC P | PCC SA | MEPCC P | MEPCC SA |
|---|---|---|---|---|---|---|---|---|---|
| Civil | 0.9 | 0.7% | 0.667 | 16.5% | 0.967 | 0.0% | NA | 13.0% | 0.978 |
| Civil | 0.8 | 9.3% | 0.893 | 34.3% | 0.898 | 15.1% | 0.787 | 39.3% | 0.891 |
| Civil | 0.7 | 18.4% | 0.846 | 46.6% | 0.852 | 36.4% | 0.817 | 45.8% | 0.860 |
| Civil | 0.6 | 25.4% | 0.768 | 50.6% | 0.831 | 52.1% | 0.771 | 52.9% | 0.820 |
| Immigrant | 0.9 | 3.7% | 0.858 | 11.1% | 0.959 | 1.3% | 0.558 | 31.5% | 0.947 |
| Immigrant | 0.8 | 12.8% | 0.779 | 30.4% | 0.890 | 27.7% | 0.859 | 42.5% | 0.916 |
| Immigrant | 0.7 | 26.6% | 0.743 | 38.6% | 0.863 | 42.4% | 0.829 | 55.1% | 0.862 |
| Immigrant | 0.6 | 41.7% | 0.715 | 53.6% | 0.806 | 50.5% | 0.795 | 62.7% | 0.839 |
| Happy | 0.9 | 1.3% | 0.592 | 8.9% | 0.850 | 0.1% | 0.750 | 1.0% | 0.830 |
| Happy | 0.8 | 8.7% | 0.734 | 14.3% | 0.802 | 7.2% | 0.726 | 27.6% | 0.857 |
| Happy | 0.7 | 32.8% | 0.776 | 17.7% | 0.793 | 29.9% | 0.767 | 43.7% | 0.817 |
| Happy | 0.6 | 53.2% | 0.745 | 22.2% | 0.761 | 49.2% | 0.744 | 52.0% | 0.790 |
Table 5.3
Semi-automated results for the three data sets at different ranges of predicted accuracy. P is the percentage of automated predictions and SA is the subset accuracy of the automated predictions

| Data | Predicted accuracy | BR P | BR SA | LP P | LP SA | PCC P | PCC SA | MEPCC P | MEPCC SA |
|---|---|---|---|---|---|---|---|---|---|
| Civil | [0.9, 1.0] | 0.7% | 0.667 | 16.5% | 0.967 | 0.0% | NA | 13.0% | 0.978 |
| Civil | [0.8, 0.9) | 8.7% | 0.896 | 17.8% | 0.834 | 15.1% | 0.787 | 26.2% | 0.846 |
| Civil | [0.7, 0.8) | 9.0% | 0.769 | 12.2% | 0.710 | 21.3% | 0.828 | 6.5% | 0.681 |
| Civil | [0.6, 0.7) | 7.0% | 0.566 | 4.1% | 0.584 | 15.7% | 0.655 | 7.1% | 0.563 |
| Immigrant | [0.9, 1.0] | 3.7% | 0.858 | 11.1% | 0.959 | 1.3% | 0.558 | 31.5% | 0.947 |
| Immigrant | [0.8, 0.9) | 9.1% | 0.750 | 19.3% | 0.843 | 26.4% | 0.869 | 11.0% | 0.829 |
| Immigrant | [0.7, 0.8) | 13.8% | 0.710 | 8.2% | 0.747 | 14.7% | 0.757 | 12.5% | 0.688 |
| Immigrant | [0.6, 0.7) | 15.1% | 0.602 | 15.0% | 0.659 | 8.1% | 0.623 | 7.7% | 0.670 |
| Happy | [0.9, 1.0] | 1.3% | 0.592 | 8.9% | 0.850 | 0.1% | 0.750 | 1.0% | 0.830 |
| Happy | [0.8, 0.9) | 7.4% | 0.755 | 5.4% | 0.717 | 7.1% | 0.730 | 26.5% | 0.858 |
| Happy | [0.7, 0.8) | 24.0% | 0.792 | 3.4% | 0.751 | 22.7% | 0.779 | 16.2% | 0.749 |
| Happy | [0.6, 0.7) | 20.4% | 0.693 | 4.6% | 0.615 | 19.3% | 0.703 | 8.3% | 0.647 |
Table 5.4 shows the runtime of each method for training the model and for predicting all instances in the test data (Intel Core i7 CPU with 8GB RAM). Unsurprisingly, the runtime of MEPCC with m = 10 PCC models is roughly 10 times that of PCC in both the training and prediction stages.
Table 5.4
Runtime (in seconds) of each method for the three data sets

| Data | Stage | BR | LP | PCC | MEPCC |
|---|---|---|---|---|---|
| Civil | Train | 1.688 | 0.641 | 1.128 | 11.787 |
| Civil | Prediction | 0.269 | 0.044 | 37.142 | 374.611 |
| Immigrant | Train | 1.363 | 0.510 | 0.894 | 8.724 |
| Immigrant | Prediction | 0.200 | 0.056 | 35.369 | 334.075 |
| Happy | Train | 11.160 | 16.164 | 7.371 | 78.293 |
| Happy | Prediction | 0.567 | 3.691 | 177.847 | 1,746.529 |