Analysis of Convolution Neural Network-Based Algorithm for Annotation Cost Estimation

Analysis of Convolution Neural Network-Based Algorithm for Annotation Cost Estimation in the Field of Supervised Machine Learning

Abstract

Active learning is a machine learning process that selectively queries users to label or annotate examples, with the goal of reducing the overall annotation cost. Most existing convolutional neural network (CNN) work rests on the simple assumption that the annotation cost of every labeling query is the same or fixed, but this assumption may not be realistic: in practice, the cost of annotation may vary between data instances. In this work, I study and present various annotation cost-sensitive active learning algorithms, which need to estimate the utility and cost of each query simultaneously. The goal is to build or merge different machine learning models and reduce the total labelling cost of training the model. Hence, I propose a technique that combines Latent Semantic Indexing (LSI) and Word Mover’s Distance (WMD) into an efficient architecture that can work across different datasets, thus reducing the overall labelling/annotation cost in supervised machine learning, and I validate that the proposed method is generally superior to other annotation cost-sensitive algorithms.

Keywords: active learning, annotation, CNN, labelling, supervised learning

Table of Contents

Introduction

Architecture and Proposed Procedure

Task Analysis

Task1: Data Understanding

Task2: Data Preparation

Task3: Modelling

Task4: Evaluation of Results

Project Roadmap and Timeline

Credentials

Conclusion

Visual Graphics

References & Citations

Introduction

Traditional machine learning algorithms use whatever labeled data is available to induce a model. By contrast, an active learning algorithm can select which instances are labeled and added to the training set. A learner typically starts with a small set of labeled instances, selects a few informative instances from a pool of unlabeled data, and queries an oracle (e.g., a human annotator) for their labels. The objective is to reduce the overall annotation cost of training a model. To genuinely reduce the labeling costs required to build an accurate model, the notion of annotation cost must be better understood and incorporated into the active learning process. Hence, I propose a technique that combines Latent Semantic Indexing (LSI) and Word Mover’s Distance (WMD) into an efficient architecture that can work across different datasets, thus reducing the overall labelling/annotation cost in supervised machine learning.

Active learning is a machine learning setup that enables machines to strategically “ask questions” of a labeling oracle (Settles, 2010) in order to reduce the cost of labeling. Annotation costs have traditionally been measured by the number of examples, but it has been widely recognized that different examples may require different annotation efforts (Settles et al., 2008).

Vast quantities of unlabeled instances can be easily acquired in many machine learning scenarios, yet high-quality labels are expensive to obtain. For example, in fields such as medicine (Liu, 2004) or biology (King et al., 2004), a massive number of experiments and analyses are needed to label a single instance, while collecting samples is a relatively easy task. There are some variations in how cost-sensitive active learning is set up. In (Margineantu, 2005), it is assumed that the labeling cost of every data instance is known before querying, while in (Settles et al., 2008), the cost of a data instance can only be observed after querying its label. In this work, I concentrate on the latter setup, which closely matches the real-world human annotation scenario. Existing works (Haertel et al., 2008) must therefore simultaneously estimate the utility and cost of each instance in this setup and select instances with a high utility and low cost.
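The selection of instances with high utility and low cost can be illustrated with a minimal sketch. It assumes that a utility score and a predicted cost are already available for each unlabeled instance and that the two are traded off with a simple utility-per-cost ratio; the ratio, the function name `select_query`, and the numbers are illustrative assumptions, not the exact criterion of the cited works.

```python
def select_query(utilities, predicted_costs):
    """Return the index of the instance with the best utility-per-cost ratio.

    The ratio is an assumed trade-off rule chosen for illustration.
    """
    if len(utilities) != len(predicted_costs):
        raise ValueError("utilities and predicted_costs must have the same length")
    best_index, best_ratio = None, float("-inf")
    for i, (utility, cost) in enumerate(zip(utilities, predicted_costs)):
        if cost <= 0:
            raise ValueError("predicted costs must be positive")
        ratio = utility / cost
        if ratio > best_ratio:
            best_index, best_ratio = i, ratio
    return best_index


# Hypothetical pool: utility scores and predicted annotation times in seconds.
utilities = [0.9, 0.8, 0.3]
predicted_costs = [30.0, 10.0, 5.0]
chosen = select_query(utilities, predicted_costs)
```

In this toy pool the first instance has the highest utility, but the second is chosen because it is almost as useful and three times cheaper to annotate.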


The idea of uncertainty sampling (Lewis and Gale, 1994) is to query the label of the data instance about which the classifier is most uncertain. For example, for a support vector machine (SVM), (Tong and Koller, 2001) propose querying the data instance closest to the decision boundary; (Holub et al., 2008) select the data instances to be queried based on the entropy of the label probabilities produced by a probabilistic classifier.
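The two uncertainty criteria described above can be sketched in a few lines. This is a minimal illustration that assumes the classifier outputs (class probabilities, or signed distances to an SVM decision boundary) have already been computed; the function names and the toy values are hypothetical.

```python
import math


def entropy(probabilities):
    """Shannon entropy (natural log) of one predicted label distribution."""
    return -sum(p * math.log(p) for p in probabilities if p > 0.0)


def most_uncertain_by_entropy(class_probabilities):
    """Index of the instance whose predicted label distribution has the highest entropy."""
    return max(range(len(class_probabilities)),
               key=lambda i: entropy(class_probabilities[i]))


def closest_to_boundary(decision_values):
    """Index of the instance whose signed distance to the decision boundary is smallest in magnitude."""
    return min(range(len(decision_values)),
               key=lambda i: abs(decision_values[i]))


# Hypothetical classifier outputs for a pool of three unlabeled instances.
pool_probabilities = [[0.95, 0.05], [0.55, 0.45], [0.70, 0.30]]
pool_margins = [2.1, -0.2, 0.9]

entropy_pick = most_uncertain_by_entropy(pool_probabilities)
margin_pick = closest_to_boundary(pool_margins)
```

Both criteria pick the second instance here, since its probabilities are closest to uniform and its margin is closest to zero.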

In (Kang et al., 2004), the data instances closest to each cluster’s centroid are selected before any other selection criterion is applied; (Huang et al., 2010) measures the representativeness of each data instance from both the cluster structure of the unlabeled data instances and the class assignments of the labeled data; and (Xu et al., 2003) clusters the data instances close to the SVM decision boundary and queries the labels of the instances closest to the center of each cluster. In (Nguyen and Smeulders, 2004), clustering is used to estimate the label probabilities of unlabeled data instances, which is the key component in measuring data instance utilities.

Various works target annotation cost-sensitive active learning under different problem settings, such as the querying target (Greiner et al., 2002), the number of labelers (Donmez and Carbonell, 2008), the target classification problem (Yan and Huang, 2018), and the applied data domain (Vijayanarasimhan and Grauman, 2011).

To discuss cost-sensitive active learning with unknown costs, the first question to answer is whether the cost of human annotation can be estimated accurately. In (Arora et al., 2009), various unsupervised models are proposed to estimate the annotation cost for corpus datasets, while (Settles et al., 2008) shows that the annotation cost can be estimated accurately using a supervised learning model.

Active learning is a widespread framework that can automatically select the most informative unlabeled examples for annotation. The motivation behind uncertainty sampling here is to find the unlabeled examples closest to the labeled data set (nearest neighbors) and use that proximity to assign labels. To achieve this, I build a CNN-based document classifier for any input article whose target label is unknown and use cosine similarity to find the most similar documents as neighbors of an unlabeled document in the training set. It is then fairly reasonable to assume that the closest document carries the same label, which lets the oracle label a smaller set of inputs.
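The nearest-neighbor step can be sketched as follows. This is a minimal illustration that assumes each document has already been turned into a fixed-length vector (for example by the classifier's feature layer); the vectors, the label names, and the helper `propose_label` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def propose_label(query_vector, labeled_vectors, labels):
    """Return the label of the most similar labeled document and its similarity score."""
    scores = [cosine_similarity(query_vector, v) for v in labeled_vectors]
    best = max(range(len(scores)), key=lambda i: scores[i])
    return labels[best], scores[best]


# Hypothetical document vectors and labels.
labeled_vectors = [[1.0, 0.0, 0.2], [0.0, 1.0, 0.1]]
labels = ["sports", "politics"]
label, score = propose_label([0.9, 0.1, 0.3], labeled_vectors, labels)
```

The returned score can serve as a confidence signal: a proposed label with a low similarity score is a natural candidate to send to the human oracle instead.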

Architecture and Proposed Procedure

The architecture combines two major components. The first collects and preprocesses the documents, defines the similarity measures, and develops the related models. The second captures unlabeled data and uses the different models to perform similarity checks. The output of the system is the set of neighboring documents/articles identified by the effective models. I evaluate multiple models in this work to improve document similarity and thereby reduce the overall labeling effort. For the similarity score, I use Word2Vec. Based on the vector space model, two word2vec-based similarity measures (“Centroids” and “Word Mover’s Distance (WMD)”) are studied and compared with the commonly used Latent Semantic Indexing (LSI). The 20 Newsgroups dataset is also used to compare the document similarity measures.
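The two word2vec-based measures can be illustrated with a small sketch. It assumes a toy dictionary of two-dimensional word vectors in place of a trained word2vec model, and it does not solve the full WMD transport problem: `relaxed_wmd` computes the cheaper relaxed lower bound, in which every word simply moves to its nearest word in the other document. All names and vectors are hypothetical.

```python
import math
from collections import Counter

# Hypothetical 2-d word vectors; trained word2vec vectors have hundreds of dimensions.
EMBEDDINGS = {
    "president": [0.9, 0.1],
    "obama": [0.8, 0.2],
    "speaks": [0.1, 0.9],
    "greets": [0.2, 0.8],
    "banana": [-0.7, -0.6],
}


def centroid(tokens):
    """Average of the word vectors of the in-vocabulary tokens."""
    vectors = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not vectors:
        raise ValueError("document has no in-vocabulary tokens")
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(len(vectors[0]))]


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def centroid_similarity(doc_a, doc_b):
    """Centroid method: cosine similarity between the two document centroids."""
    return cosine(centroid(doc_a), centroid(doc_b))


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def one_sided_relaxed_wmd(source, target):
    """Move each source word (weighted by normalized frequency) to its nearest target word."""
    weights = Counter(t for t in source if t in EMBEDDINGS)
    target_vectors = [EMBEDDINGS[t] for t in set(target) if t in EMBEDDINGS]
    if not weights or not target_vectors:
        raise ValueError("both documents need in-vocabulary tokens")
    total = sum(weights.values())
    return sum(
        (count / total) * min(euclidean(EMBEDDINGS[word], tv) for tv in target_vectors)
        for word, count in weights.items()
    )


def relaxed_wmd(doc_a, doc_b):
    """Relaxed lower bound on WMD: the larger of the two one-sided relaxations."""
    return max(one_sided_relaxed_wmd(doc_a, doc_b), one_sided_relaxed_wmd(doc_b, doc_a))


doc_1 = ["obama", "speaks"]
doc_2 = ["president", "greets"]
doc_3 = ["banana"]
```

Although `doc_1` and `doc_2` share no words, both measures rate them as close (high centroid similarity, small relaxed distance), while `doc_3` is far from both. This is the property that makes embedding-based measures attractive for short texts.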

The following figure gives an overview of the methodology:

Figure 1: Overall Architecture

Task Analysis

To implement this design modularly, I have divided the project into four independent tasks:

Data Understanding

Data Preparation – Prepare the data for the machine learning algorithms

Modelling – Select and train the models

Evaluation of Results

Task1: Data Understanding

To conduct the testing, I first had to assess the data situation and obtain access to the data; once the data was available, it needed to be explored. I used the ETL data pipeline tool Informatica PowerCenter to build the data warehouse. It was deployed on a virtual machine with the following specifications:

Operating System: Windows Server 2012 R2 Standard

RAM: 32 GB

CPU Cores: 8 cores, 2.40 GHz processor

Kernel Version: 9.3.9600.18821

Task2: Data Preparation

Data preparation is the process of gathering, cleaning, and consolidating data into a single file or data table, primarily for analysis. I used Datawatch Monarch, a self-service data preparation tool. The recommended specifications for Datawatch Monarch are as follows:

Windows 10 – 8 GB memory

5 GB disk space

2GHz or faster processor

Google Chrome

.NET Framework 4.5.2

Microsoft Access Database Engine 2010 version

Microsoft SQL Server

Task3: Modelling

I chose scikit-learn, a machine learning library for the Python programming language, to implement some models quickly during this project. To get the data ready for machine learning, I have to take some basic steps: missing value imputation, encoding of categorical variables, and, optionally, feature selection if the input dimension is too large. The scikit-learn library requires the following dependencies:

Python (>= 2.7 or >= 3.4)

NumPy (>= 1.8.2)

SciPy (>= 0.13.3)
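The first two preparation steps named above, missing value imputation and encoding of categorical variables, can be sketched in plain Python. This is a stand-in written for illustration, not scikit-learn's own API (scikit-learn offers `SimpleImputer` and `OneHotEncoder` for the same purpose); the column names and values are hypothetical.

```python
def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def one_hot(categories):
    """Encode a categorical column as 0/1 indicator rows, one column per sorted category."""
    vocabulary = sorted(set(categories))
    rows = [[1 if c == v else 0 for v in vocabulary] for c in categories]
    return vocabulary, rows


# Hypothetical columns: document length (one value missing) and document source.
lengths = [120.0, None, 80.0]
sources = ["forum", "news", "forum"]

filled = impute_mean(lengths)
vocabulary, encoded = one_hot(sources)
```

Mean imputation is only one possible strategy; the median or a constant may suit skewed or sparse columns better.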

Task4: Evaluation of Results

As part of testing, I compared the three methods (LSI, Centroid and WMD). First, a local analysis on a single example gives a sense of how well the methods work. Then a global analysis is done with a clustering task. A lemmatization step was applied, and duplicates were removed to make the table readable. The quality of the clustering for each method is given by the Normalized Mutual Information (NMI) values in Table 1.

Data Set | LSI | Centroid | WMD
ng20 | 0.40 | 0.31 | -
snippets | 0.25 | 0.46 | 0.39

Table 1: Normalized Mutual Information
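For reference, NMI between a set of true classes and a set of predicted clusters can be computed as sketched below. The sketch normalizes the mutual information by the arithmetic mean of the two entropies; this choice is an assumption, since the normalization used for Table 1 is not stated, and other variants (for example the geometric mean) give slightly different values.

```python
import math
from collections import Counter


def entropy_of(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(labels_a, labels_b):
    """Mutual information between two label assignments over the same items."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    return sum(
        (n_ab / n) * math.log(n * n_ab / (count_a[a] * count_b[b]))
        for (a, b), n_ab in joint.items()
    )


def nmi(labels_true, labels_pred):
    """NMI with arithmetic-mean normalization (an assumed convention)."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    h_true = entropy_of(labels_true)
    h_pred = entropy_of(labels_pred)
    if h_true == 0.0 and h_pred == 0.0:
        return 1.0  # both assignments put every item in a single group
    return mutual_information(labels_true, labels_pred) / ((h_true + h_pred) / 2.0)


# Toy example: the same partition under renamed cluster ids, and an unrelated one.
same_partition = nmi([0, 0, 1, 1], [1, 1, 0, 0])
unrelated = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

NMI ignores how clusters are named: a clustering that matches the true classes up to a relabeling scores close to 1, and one that is statistically independent of them scores close to 0.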

Finally, I compared the overall performance of the methods considered against common clustering methods such as K-medoids, K-means, complete-linkage and Ward hierarchical clustering, and DBSCAN.

Figure 2: Top Score Overall Comparison

Project Roadmap and Timeline

Building the distributed architecture explained in the sections above requires six steps, as described in the timeline below:

The first step involves reading and analyzing various relevant research papers and documents. This initial part will take around two weeks.

For the next three steps, I have selected various existing algorithms, and I am going to test each one and record its results. Testing and recording the results of the LSI algorithm will take a week.

In the third step, I will test and record the results of the Centroid algorithm. This part will take a week.

In the fourth step, I will test and record the results of the WMD algorithm. This part will also take a week.

In the fifth step, the results of the algorithms from the steps above are compared across various metrics to identify the bottlenecks. This step will take one week.

The final step involves combining the LSI and WMD algorithms and applying various optimization steps to address the issues identified, so that the final algorithm reduces the total labelling/annotation cost in the field. This step will take two weeks.

The following chart shows the detailed steps and timeline for accomplishing the proposed study:

Figure 3: Gantt Chart for the steps and timeline

Credentials

I am a lead member of technical staff at Salesforce with over 15 years of experience in the software industry. I am responsible for the design, development, and execution of test frameworks and harnesses for unit testing of Java-based cloud services. I have also developed Java-based load and performance tools for applications running on large-scale Linux clusters. I have professional-level experience in the following technologies: Java, Python, Big Data technologies, functional testing, automation, and performance engineering. I lead the Sales Cloud prediction quality team at Salesforce. Prior to Salesforce, I worked at Intuit Inc. as a Staff Engineer.

Conclusion

The existing literature provides well-defined explanations and comparisons of various algorithms for calculating annotation/labeling costs in supervised machine learning. However, it does not include improvements such as combining various models into an architecture that works uniformly across different datasets, taking the behavior and volume of the data into account. In this work, I demonstrated that for long texts, such as those in the 20 Newsgroups dataset, LSI is the best method, while WMD and the Centroid method both yield better clustering than LSI on the Web snippets dataset. The main focus of future work will be to investigate cost-sensitive active learning strategies that are more robust when given approximate, predicted annotation costs.

Visual Graphics

Figure 4: Steps involved for Analysis

Figure 5: Histogram for Annotation Time

References & Citations

Settles B (2010). Active learning literature survey. University of Wisconsin, Madison 52(55-66):11

Settles B, Craven M, Friedland L (2008). Active learning with real annotation costs. In: Proceedings of the NIPS workshop on cost-sensitive learning, pp 1–10

Liu Y (2004). Active learning with support vector machine applied to gene expression data for cancer classification. Journal of Chemical Information and Computer Sciences

King RD, Whelan KE, Jones FM, Reiser PG, Bryant CH, Muggleton SH, Kell DB, Oliver SG (2004). Functional genomic hypothesis generation and experimentation by a robot scientist. Nature 427(6971):247–252

Margineantu DD (2005). Active cost-sensitive learning. In: Proceedings of International Joint Conference on Artificial Intelligence, pp 1622–1623

Lewis DD, Gale WA (1994). A sequential algorithm for training text classifiers. In: Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval, Springer-Verlag New York, Inc., pp 3–12

Tong S, Koller D (2001). Support vector machine active learning with applications to text classification. Journal of machine learning research 2(Nov):45–66

Holub A, Perona P, Burl MC (2008). Entropy-based active learning for object recognition. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, IEEE, pp 1–8

Kang J, Ryu KR, Kwon HC (2004). Using cluster-based sampling to select initial training set for active learning in text classification. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, pp 384–388

Huang SJ, Jin R, Zhou ZH (2010). Active learning by querying informative and representative examples. In: Advances in neural information processing systems, pp 892–900

Xu Z, Yu K, Tresp V, Xu X, Wang J (2003). Representative sampling for text classification using support vector machines. In: European Conference on Information Retrieval, Springer, pp 393–407

Nguyen HT, Smeulders A (2004). Active learning using pre-clustering. In: Proceedings of the 21st international conference on Machine learning, ACM, p 79

Donmez P, Carbonell JG (2008). Proactive learning: cost-sensitive active learning with multiple imperfect oracles. In: Proceedings of the 17th ACM conference on Information and knowledge management, ACM, pp 619–628

Guillory A, Bilmes J (2009). Average-case active learning with costs. In: International Conference on Algorithmic Learning Theory, Springer, pp 141–155

Cuong N, Xu H (2016). Adaptive maximization of pointwise submodular functions with budget constraint. In: Advances in Neural Information Processing Systems, pp 1244–1252

Vijayanarasimhan S, Grauman K (2011). Cost-sensitive active visual category learning. International Journal of Computer Vision 91(1):24–44

Arora S, Nyberg E, Rosé CP (2009). Estimating annotation cost for active learning in a multi-annotator environment. In: Proceedings of the NAACL HLT 2009 Workshop on Active Learning for Natural Language Processing, Association for Computational Linguistics, pp 18–26
