Semi-automated classification for multi-label open-ended questions
Section 3. Multi-label classification
Consider a set of $L$ possible output labels $\{\lambda_1, \ldots, \lambda_L\}$. In multi-label classification, each instance with a feature vector $\mathbf{x}$ is associated with a subset of these labels. Equivalently, the subset can be described as a binary vector $\mathbf{y} = (y_1, \ldots, y_L)$, where $y_j = 1$ if label $\lambda_j$ is associated with the instance, and $y_j = 0$ otherwise. A multi-label classifier learns from training data to predict $\mathbf{y}$ for a given $\mathbf{x}$.
Next, we review some common multi-label algorithms and
their relationship to an evaluation criterion, subset accuracy.
3.1 Evaluating multi-label algorithms in semi-automated classification
Evaluating the classification of a text answer into a
single label is straightforward: the label is either correct or not and
accuracy refers to the percentage of correctly classified answers;
equivalently, error refers to the percentage of misclassified answers. For
answers that are classified into multiple labels, there are several ways to
combine the accuracy of each single label into an overall evaluation measure for
the set of multiple labels. These evaluation measures include subset accuracy,
Hamming loss, F-measure and log loss. For a predicted set of multiple labels,
subset accuracy is 1 if all of the
labels are correctly predicted and 0
otherwise. Hamming loss evaluates the fraction of misclassified labels.
F-measure is the harmonic mean of precision and recall, and log loss evaluates
the uncertainty of the prediction averaged over the labels when a probability
score for each label is given.
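To make the contrast between these measures concrete, the following minimal sketch computes subset accuracy and Hamming loss for a toy set of predictions (the arrays are invented for illustration; Python with NumPy is assumed):

import numpy as np

# Toy data: 3 answers, 4 labels (rows are answers, columns are labels).
y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 1]])
y_pred = np.array([[1, 0, 1, 0],   # fully correct label set
                   [0, 1, 1, 0],   # one label wrong
                   [1, 0, 0, 1]])  # one label wrong

# Subset accuracy: fraction of answers whose entire label set is correct.
subset_accuracy = np.mean(np.all(y_true == y_pred, axis=1))   # 1/3

# Hamming loss: fraction of individual label decisions that are wrong.
hamming_loss = np.mean(y_true != y_pred)                      # 2/12

In this toy example Hamming loss credits the ten correct individual label decisions, whereas subset accuracy scores the last two answers as complete failures.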
In this paper we develop a methodology for subset
accuracy (equivalently, in terms of loss, 0/1 loss). This is a strict metric
because a zero score is given even if all labels except one are correctly classified. However, subset accuracy is appropriate for semi-automated
classification because if an algorithm has difficulty classifying even a single
label, the entire observation needs to be manually classified. That is,
automated classification should be performed only if the model is highly
confident in the entire predicted label set.
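As a minimal sketch of this decision rule (the function name and the threshold value of 0.9 are hypothetical choices, not values from this paper):

def route_answer(y_pred, joint_prob, threshold=0.9):
    """Accept the automated prediction only when the model is confident
    in the entire predicted label set; otherwise defer to manual coding."""
    if joint_prob >= threshold:
        return "auto", y_pred     # classify automatically
    return "manual", None         # send the answer to a human coder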
Because subset accuracy requires that all labels are simultaneously correctly classified, we are interested in finding the label set $\mathbf{y}$ that maximizes the joint probability $\Pr(\mathbf{y} \mid \mathbf{x})$ conditional on a text answer $\mathbf{x}$. In the next section we discuss common approaches to estimating the joint probability that have been proposed in the machine learning community.
3.2 Multi-label approaches that optimize subset accuracy
Various approaches have been proposed for predicting
multi-label outcomes. Since we use subset accuracy as the evaluation measure,
we focus on methods that aim to maximize the joint conditional distribution.
The simplest approach, called Binary Relevance (BR),
transforms a multi-label problem into separate binary problems. That is, BR
constructs a binary classification model for each label independently. For an
unseen observation, the prediction set of labels is obtained simply by
combining the individual binary results. In other words, the predicted label
set is the union of the results predicted from the $L$ binary models. If each of the $L$ binary models produces probability outcomes, BR can produce an estimate for $\prod_{j=1}^{L} \Pr(y_j \mid \mathbf{x})$. Note that this coincides with the joint probability $\Pr(\mathbf{y} \mid \mathbf{x})$ if the labels are independent (conditional on $\mathbf{x}$). This implies that the product of the probabilities obtained by BR will estimate $\Pr(\mathbf{y} \mid \mathbf{x})$ accurately only if the labels are conditionally independent. The joint probability may be inaccurate if the labels are substantially correlated given $\mathbf{x}$.
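The following sketch illustrates BR and its product-form estimate of the joint probability, using one logistic regression per label; scikit-learn and the helper names are illustrative assumptions, not part of the methods discussed here:

import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_br(X, Y):
    """Fit one independent binary classifier per label.
    X is an n x p feature matrix; Y is an n x L binary label matrix."""
    return [LogisticRegression(max_iter=1000).fit(X, Y[:, j])
            for j in range(Y.shape[1])]

def br_joint_prob(models, x, y):
    """Estimate Pr(y | x) as the product of the per-label probabilities.
    Accurate only when labels are conditionally independent given x."""
    prob = 1.0
    for j, model in enumerate(models):
        p1 = model.predict_proba(x.reshape(1, -1))[0, 1]  # Pr(y_j = 1 | x)
        prob *= p1 if y[j] == 1 else 1.0 - p1
    return prob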
Another approach tailored for subset accuracy is Label
Powerset learning (LP). This approach transforms a multi-label classification
into a multi-class (i.e., multinomial) problem by treating each unique label
set $\mathbf{y}$ that exists in the training data as a single class. For example, with $L$ possible labels there could be up to $2^L$ classes observed in the training data. Then any algorithm for multi-class problems can be applied using the transformed classes. Training a multi-class classifier on these classes takes into consideration dependencies between labels. For a new observation, LP predicts the most probable class (i.e., the most probable label set). If an algorithm for multi-class data gives probabilistic outputs (some algorithms classify without computing probabilities), LP directly estimates the class probabilities (i.e., the joint probability $\Pr(\mathbf{y} \mid \mathbf{x})$). However, this approach cannot estimate the joint probability for any label set unseen in the training data. As a consequence, if the true label set of a new observation does not appear in the training data, the prediction cannot be correct. Another drawback of LP is that the number of classes in the transformed problem can increase exponentially (up to $2^L$ classes). This can be problematic when $L$ is large, since each combination of labels may be present in just one or a few observations in the training data, which makes the learning process difficult.
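A minimal LP sketch under the same illustrative assumptions (scikit-learn, hypothetical names): each distinct label set in the training data is encoded as one class of a multinomial classifier.

import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_lp(X, Y):
    """Encode each distinct row of Y as a class and fit a multi-class model."""
    label_sets, class_ids = np.unique(Y, axis=0, return_inverse=True)
    model = LogisticRegression(max_iter=1000).fit(X, class_ids)
    return model, label_sets

def lp_predict(model, label_sets, x):
    """Return the most probable label set. Label sets never seen in
    training were never encoded as classes, so they cannot be predicted."""
    class_id = model.predict(x.reshape(1, -1))[0]
    return label_sets[class_id]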
A third approach to multi-label learning is Classifier Chains
(CC) (Read, Pfahringer, Holmes and
Frank, 2009, 2011). As in binary relevance, a binary model is fit for each label. However, CC fits the binary models sequentially and uses
the binary label results obtained from previous models as additional predictors
in subsequent models. That is, the model for the $j$-th label $y_j$ uses $\mathbf{x}$ and $y_1, \ldots, y_{j-1}$ as features. (For example, the model for $y_2$ uses $\mathbf{x}$ and $y_1$ as features, the model for $y_3$ uses $\mathbf{x}$, $y_1$ and $y_2$ as features, and so on.) Passing label information
between binary classifiers allows CC to take label dependencies into account.
In the prediction stage, CC successively predicts the labels one at a time. The
prediction results of the previous labels are used for predicting the next
label in the chain.
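A minimal CC sketch (same illustrative assumptions): the model for label $j$ is fit on the features augmented with the preceding labels, and at prediction time the predicted labels are fed forward along the chain.

import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_cc(X, Y):
    """Fit the chain: the model for label j sees X plus labels 1..j-1."""
    models = []
    for j in range(Y.shape[1]):
        Xj = np.hstack([X, Y[:, :j]])  # features plus earlier true labels
        models.append(LogisticRegression(max_iter=1000).fit(Xj, Y[:, j]))
    return models

def cc_predict(models, x):
    """Predict labels one at a time, feeding predictions down the chain."""
    y = []
    for model in models:
        xj = np.concatenate([x, y]).reshape(1, -1)
        y.append(int(model.predict(xj)[0]))
    return np.array(y)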
This idea is extended to Probabilistic Classifier Chains
(PCC) (Dembczyński et al., 2010). PCC explains
CC using a probabilistic model. Specifically, the conditional joint
distribution can be described as
$$\Pr(\mathbf{y} \mid \mathbf{x}) = \prod_{j=1}^{L} \Pr(y_j \mid y_1, \ldots, y_{j-1}, \mathbf{x}), \tag{3.1}$$
and PCC estimates the probabilities $\Pr(y_j \mid y_1, \ldots, y_{j-1}, \mathbf{x})$.
PCC finds the label set that maximizes the right hand
side of equation (3.1). However, there is no closed-form solution for finding
the label set. A few different solutions have been suggested. Dembczyński et al. (2010) used an
exhaustive search (ES) that considers all possible combinations. However, an
exhaustive search may not be practical when $L$ is large, because the number of possible combinations $2^L$ increases exponentially. To overcome this problem, optimization strategies based on the uniform cost search (UCS) (Dembczyński, Waegeman and Hüllermeier, 2012) and the $A^*$ algorithm (Mena, Montañés, Quevedo and Del Coz, 2015) have been proposed. First, the estimated joint conditional probability is represented by a binary probability tree. Then a search algorithm finds the optimal path (in our case, the path that gives the highest joint probability) from the root to a terminal node. Compared with
ES, UCS substantially reduces the computational cost for PCC to reach the label
set with the highest joint probability (Dembczyński et al., 2012).
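As an illustration, the following sketch performs ES for PCC inference, reusing the chained models from the fit_cc sketch above: every candidate label set is scored by the product in equation (3.1) and the maximizer is kept. UCS or $A^*$ would search the same probability tree without visiting all $2^L$ leaves.

import itertools
import numpy as np

def pcc_exhaustive(models, x):
    """Score all 2**L label sets by the chained product of probabilities
    (equation (3.1)) and return the set with the highest joint probability."""
    best_y, best_prob = None, -1.0
    for y in itertools.product([0, 1], repeat=len(models)):
        prob = 1.0
        for j, model in enumerate(models):
            xj = np.concatenate([x, y[:j]]).reshape(1, -1)
            p1 = model.predict_proba(xj)[0, 1]  # Pr(y_j = 1 | y_1..j-1, x)
            prob *= p1 if y[j] == 1 else 1.0 - p1
        if prob > best_prob:
            best_y, best_prob = np.array(y), prob
    return best_y, best_prob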
In theory, when applying the product rule, the order of
the categories $y_1, \ldots, y_L$ does not matter. For example, both $\Pr(y_1 \mid \mathbf{x}) \Pr(y_2 \mid y_1, \mathbf{x})$ and $\Pr(y_2 \mid \mathbf{x}) \Pr(y_1 \mid y_2, \mathbf{x})$ equal $\Pr(y_1, y_2 \mid \mathbf{x})$. In practice, the two chains may lead to
different estimates. This means the performance of PCC may be affected by the
order of the labels in the chain.
To alleviate the influence of the category order, an
ensembling approach (EPCC) (Dembczyński et al., 2010) that combines
multiple probabilistic chains has been proposed. First, $m$ PCC models are trained, where each PCC model is based on a randomized order of the labels. In the prediction stage, the average conditional joint probability over the $m$ PCC models is computed for each possible label set. Then the predicted label set is the label set with the highest average predicted probability. Let $\widehat{\Pr}_k(\mathbf{y} \mid \mathbf{x})$ be the conditional joint probability estimated by the $k$-th PCC model. The ensemble strategy predicts the label set $\hat{\mathbf{y}}$ such that
$$\hat{\mathbf{y}} = \arg\max_{\mathbf{y}} \frac{1}{m} \sum_{k=1}^{m} \widehat{\Pr}_k(\mathbf{y} \mid \mathbf{x}).$$
Note that EPCC does not combine the predicted label sets but the conditional joint probabilities. To find the highest average probability from $m$ PCC models, all individual probabilities are required, and this forces us to use ES to compute the conditional joint probability for all $2^L$ label combinations from all $m$ PCC models. Hence, although EPCC reduces the problem of the influence of label order, the method will not be useful if the problem deals with a large number of labels or when $m$ is
large. To reduce the computational cost for combining multiple PCC models, we
propose a new approach to ensembling the PCC models in the next section.
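For completeness, a minimal EPCC sketch under the same illustrative assumptions, reusing fit_cc from the sketch above; note the exhaustive enumeration over all $2^L$ label sets for each of the $m$ chains, which is exactly what limits EPCC when $L$ or $m$ is large.

import itertools
import numpy as np

def chain_joint_prob(models, x, y):
    """Product of Pr(y_j | y_1, ..., y_{j-1}, x) along one chain."""
    prob = 1.0
    for j, model in enumerate(models):
        xj = np.concatenate([x, y[:j]]).reshape(1, -1)
        p1 = model.predict_proba(xj)[0, 1]
        prob *= p1 if y[j] == 1 else 1.0 - p1
    return prob

def fit_epcc(X, Y, m=10, seed=0):
    """Train m chains, each on a random permutation of the label order."""
    rng = np.random.default_rng(seed)
    chains = []
    for _ in range(m):
        order = rng.permutation(Y.shape[1])
        chains.append((order, fit_cc(X, Y[:, order])))
    return chains

def epcc_predict(chains, x, L):
    """Pick the label set with the highest average joint probability."""
    best_y, best_avg = None, -1.0
    for y in itertools.product([0, 1], repeat=L):
        y = np.array(y)
        # Evaluate each chain on the label set rearranged into its own order.
        avg = np.mean([chain_joint_prob(models, x, y[order])
                       for order, models in chains])
        if avg > best_avg:
            best_y, best_avg = y, avg
    return best_y, best_avg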