=Paper=
{{Paper
|id=Vol-1670/paper-25
|storemode=property
|title=Assessing the Quality of Unstructured Data: An Initial Overview
|pdfUrl=https://ceur-ws.org/Vol-1670/paper-25.pdf
|volume=Vol-1670
|authors=Cornelia Kiefer
|dblpUrl=https://dblp.org/rec/conf/lwa/Kiefer16
}}
==Assessing the Quality of Unstructured Data: An Initial Overview==
Cornelia Kiefer
Graduate School of Excellence Advanced Manufacturing Engineering
Nobelstr. 12, 70569 Stuttgart, Germany
cornelia.kiefer@gsame.uni-stuttgart.de
http://www.gsame.uni-stuttgart.de

Abstract. In contrast to structured data, unstructured data such as texts, speech, videos and pictures do not come with a data model that enables a computer to use them directly. Nowadays, computers can interpret the knowledge encoded in unstructured data using methods from text analytics, image recognition and speech recognition. Therefore, unstructured data are used increasingly in decision-making processes. But although decisions are commonly based on unstructured data, data quality assessment methods for unstructured data are lacking. We consider data analysis pipelines built upon two types of data consumers: human consumers, which usually come at the end of the pipeline, and non-human / machine consumers (e.g., natural language processing modules such as part-of-speech taggers and named entity recognizers), which mainly operate at intermediate stages. We define data quality of unstructured data via (1) the similarity of the input data to the data expected by these consumers of unstructured data and via (2) the similarity of the input data to the data representing the real world. We deduce data quality dimensions from the elements in analytic pipelines for unstructured data and characterize them. Finally, we propose automatically measurable indicators for assessing the quality of unstructured text data and give hints towards an implementation.

Keywords: quality of unstructured data, quality of text data, data quality dimensions, data quality assessment, data quality metrics

1 Introduction

In recent years the methods for knowledge extraction from unstructured data have improved, and unstructured data sources such as texts, speech, videos and pictures have gained importance. Nowadays, sentiment analysis of social media data leads to decisions in marketing campaign design, images are classified automatically and unstructured information can be retrieved easily using search engines [6, 19]. But methods which determine the quality of the data are lacking. To be able to make good decisions, the quality of the underlying data must be determined. Similar to the concepts, frameworks and systems developed for structured data, we need means to ensure high quality of unstructured data. We focus on data consumers of unstructured data and define them as humans or non-humans / machines (e.g., algorithms) that are using or processing data. The quality of the data consumed by the final consumer, such as a human who needs to derive a decision from the data, depends on the quality assessed for earlier consumers. This is especially true for unstructured data, which is analyzed in a pipeline.

The remainder of this paper is organized as follows: First, we motivate research in assessing the quality of unstructured data in section 2. In section 3 we define data quality of unstructured data. Furthermore, we describe the data quality dimensions interpretability, relevancy and accuracy. Based on this, in section 4 we present data quality indicators for unstructured text data. In section 5 we discuss related work, and finally we conclude and highlight future work in section 6.

2 Motivation

Low data quality is dangerous because it can lead to wrong or missing decisions, strategies and operations.
It can slow down innovation processes, and the losses that low data quality causes for organizations are estimated at billions of dollars per year [8]. Bad data is a huge problem: 60% of enterprises suffer from data quality issues, 10-30% of the data in organizational databases is inaccurate, and individual reports of incomplete, inaccurate and ambiguous organizational data are numerous [13, 18].

The most important information sources in organizations, such as the workers, managers and customers, produce unstructured data. About 90% of all data outside of organizations and still more than 50% inside are estimated to be unstructured [20]. In the era of Big Data the amount of data is increasing immensely, and filtering relevant and high-quality data gets more and more important. Organizations need to leverage the information hidden in unstructured data to stay competitive [14]. Therefore, the quality of texts, pictures, videos and speech data needs to be ensured. But while the need for data quality assessment and improvement strategies for unstructured data has been recognized (e.g., [2, 23]), no concrete approach to assessing the quality of unstructured data has been suggested yet. We fill this gap and provide data quality dimensions and executable indicators for unstructured data. By focusing on automatically calculable indicators of data quality, we aim to support real-time analytics of stream data (such as social media data) with real-time data quality assessment techniques, both running concurrently.

3 Definition of Data Quality and of Data Quality Dimensions for Unstructured Data

The definitions of data quality in [24, 30] focus on structured data which is consumed by humans. They define data quality via the similarity of the data D to the data set D' which is expected by the data consumer [24] and via the fitness for use by the data consumer [30]. We extend these existing definitions by pointing out that, in the case of unstructured data, machine consumers and the many different consumers in a pipeline need to be considered as well as human end consumers. Furthermore, data quality needs to be defined in terms of accuracy. Accuracy describes the similarity between the input data and the data which would represent the real world. This definition of accuracy is equal to existing ones, e.g., [11].

The quality of data has a multi-faceted nature, and many lists of data quality dimensions and indicators for structured data exist (see section 5). All of the dimensions that were found to be relevant in the literature, such as completeness, timeliness and accuracy, are relevant to structured as well as unstructured data. From these dimensions we selected three which are relevant to mining processes on unstructured data. We deduce the dimensions from the elements involved in mining processes on unstructured data: the input data, the real world, data consumers, a task and the knowledge extracted. Based on these elements, the quality of data D can be determined by comparing it to three classes of ideal data sets: the data as expected by the current data consumer D_C (we will call this the Interpretability dimension), the data as it would be optimal for the task D_T (Relevancy) and the data set which represents the real world D_W (Accuracy). The deduced dimensions are also in line with the data quality definitions stated above.
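Read as comparisons with these ideal data sets, the three dimensions can be summarized as follows. This is a hedged formalization of ours; the paper states the comparisons in prose only, and sim denotes any similarity measure normalized to the interval [0,1] used for the indicators in section 4:

```latex
% Sketch of the three dimensions as similarity comparisons (assumption: sim maps to [0,1]).
\mathrm{interpretability}(D) = \mathrm{sim}(D, D_C), \qquad
\mathrm{relevancy}(D)        = \mathrm{sim}(D, D_T), \qquad
\mathrm{accuracy}(D)         = \mathrm{sim}(D, D_W),
\qquad \mathrm{sim}(\cdot,\cdot) \in [0,1],\ 0 = \text{low quality},\ 1 = \text{high quality}.
```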
In Fig. 1, we illustrate the three data sets in the context of an ideal mining process on unstructured data. Ideally, D would match the real world D_W and would be exactly the same as the data expected by the first data consumer. Since unstructured data is analyzed in a pipeline, the output of the first data consumer is input to the second and should therefore match the data expected by the second data consumer, and so on (as indicated in Fig. 1 with the analysis pipeline). An ideal result of the mining process can be D_T (which is still bound to D, D_W and D_C and is usually equal to the data expected by the final consumer). By basing the data quality dimensions on the elements involved in a mining process on unstructured data, we focus on the quality of unstructured data which is analyzed automatically in analytics pipelines.

Fig. 1. The three ideal data sets D_C, D_T and D_W in the context of an ideal mining process on unstructured data

In the following, we describe the deduced data quality dimensions in more detail:

Interpretability can be assessed as the degree of similarity between D and D_C. For example, consider a statistical preprocessor which is used to segment a text into sentences. If it was trained on Chinese texts and is used to segment English texts, D and D_C are not similar and data quality is low. Since often many different data consumers are involved in interpreting unstructured data, this dimension is crucial for unstructured data.

Relevancy can be assessed as the similarity between D and D_T. Usually D_T will be very similar to the D_C of the end consumer (which we will call D_CE) who wants to use the data to accomplish the task. While differences between D_T and the data expected by the end consumer D_CE indicate problems, these are not related to data quality and we will therefore assume D_T and D_CE to be equivalent. As an example for relevancy, consider a worker on the shop floor who is searching a knowledge base for a solution to an urgent problem with a machine. If he only finds information on the price of the machine, the data quality of the result is low because it does not help him with his task of solving the problem.

We assess the Interpretability and Relevancy of a data set D by its similarity to the data sets D_C and D_CE which are expected by the data consumers. Expectations differ between human and machine consumers. What a human data consumer expects depends on factors such as his knowledge, experiences and goals. Expectations of machine consumers are very precise and depend on the algorithm, training data, statistical models, rules and knowledge resources available. This holds for all types of unstructured data. As illustrated in Fig. 2, unstructured data such as textual documents may be consumed by machines or humans, and the data set D_C or D_CE depends on factors such as the native language of the human and the statistical language models available to the machine. For example, a human data consumer expects a manual for a machine to be in his native language or in a language he knows. He also expects the manual to explain the machine in a way he understands with his technical expertise. When a machine consumes unstructured data, similar factors influence the interpretability, more precisely the similarity of the input data and the data expected. The knowledge of a machine consumer can be represented by machine-readable domain knowledge encoded in semantic resources (such as taxonomies), by training data, statistical models or by rules. As an example, imagine a machine consumer that uses a simple rule-based approach to the extraction of proper names from German text data, where all uppercased words are extracted. This machine consumer expects a data set D_C with correct upper- and lowercased words. If D is all lowercased, D_C and D are not similar and the data is not fit for use by that data consumer.
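A minimal sketch of this mismatch is shown below. It is illustrative only: the casing rule, the function name and the example sentences are ours and are not taken from the paper.

```python
# Hedged illustration of the rule-based proper-name extractor described above:
# a consumer that expects correctly cased German text (D_C) and extracts every
# capitalized word. On an all-lowercase input D it finds nothing.
import re

def extract_proper_names(text):
    """Toy rule: treat every word starting with an uppercase letter as a proper name."""
    return [w for w in re.findall(r"\b\w+\b", text) if w[0].isupper()]

expected_input = "Anna Schmidt arbeitet bei Bosch in Stuttgart."   # resembles D_C
actual_input   = "anna schmidt arbeitet bei bosch in stuttgart."   # all-lowercase D

print(extract_proper_names(expected_input))  # ['Anna', 'Schmidt', 'Bosch', 'Stuttgart']
print(extract_proper_names(actual_input))    # [] -> D and D_C are not similar; quality is low
```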
Fig. 2. Machine and human data consumers and factors that influence the data expected

Unstructured data is usually consumed by many different data consumers with many different expected data sets D_C. In an analytics pipeline, the raw data is consumed and processed by several consumers in a row; the output of the previous consumer is the input to the next consumer, and so on. Data quality problems at intermediate consumers may be automatically propagated to following consumers. By considering all intermediate (machine and/or human) consumers, the exact points for data quality improvement can be determined. In Fig. 3 we illustrate an analytics pipeline involving three machine consumers and one human end consumer of the data. Machine consumers are in this illustration represented by three high-level machine consumers which are present in many analytic pipelines for unstructured data: preprocessors, classifiers and visualizers. For example, as depicted in Fig. 3, the output of the preprocessor is input to automatic classification and the results are then visualized. The visualizations are finally the input to a human consumer of the data, who, e.g., derives decisions from them.

Fig. 3. Assessing and improving data quality for each data consumer on the way from, e.g., raw text documents to the final consumer

As for structured data, the Accuracy of data and information is a very important data quality dimension. It is hard to measure, because the data set D_W, which represents the real world, is often not known, and creating it involves the work of human experts, is time-consuming, costly or even impossible. The solution is usually to abstract away from details, e.g., by using rules to check the general conformance of data points with expected patterns (e.g., e-mail addresses containing an @ sign), or to build D_W manually for a part of the data set only (see [28, 29]). D_W may be represented by a so-called gold standard data set with the accurate values annotated manually by human experts. For example, statistical classifiers are evaluated by comparing the predictions of the statistical classifier with those in a gold standard with manually annotated classes. Since D_W is not known for all data sets D, many statistical classifiers cannot be evaluated and the number of problems with accuracy in big databases can only be approximated.
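Such rule-based conformance checks are straightforward to implement. The following small sketch (the pattern and the example values are ours, not from the paper) reports the share of values matching an expected e-mail pattern as a number in [0,1]:

```python
# Hedged sketch: approximate accuracy by checking conformance of data points
# with an expected pattern (here: e-mail addresses containing an @ sign).
import re

EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # deliberately simple rule

def conformance_ratio(values, pattern=EMAIL_PATTERN):
    """Share of values that match the expected pattern, in the interval [0,1]."""
    if not values:
        return 1.0
    return sum(1 for v in values if pattern.match(v)) / len(values)

emails = ["a.miller@example.org", "no-at-sign.example.org", "b.smith@example.org"]
print(conformance_ratio(emails))  # 0.666... -> two of three values conform to the pattern
```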
4 Data Quality Indicators for Unstructured Text Data

A data quality dimension can be measured by exploiting data quality indicators. Data quality indicators must be transferable to a number in the interval [0,1], where 0 indicates low data quality and 1 indicates high data quality (this is similar to the standard characterizations of data quality metrics, such as in [1]). Therefore, indicators can, e.g., be represented by yes/no questions, by proportions of data items which have a certain characteristic or by evaluation metrics. The standard approaches to more concrete indicators for the quality of structured data involve counting the number of missing values, wrong values or outliers. For the case of unstructured data, different indicators are needed. We compiled an extensive list of indicators for all three dimensions. The definition of indicators is based on the dimensions discussed in the previous section and on related work in natural language processing, information retrieval, automated assessment and machine learning (see section 5.2). Here, we limit the indicators presented to those which are (1) automatically measurable and (2) applicable to unstructured text data. Furthermore, we selected indicators which we already implemented or which are straightforward to implement (since libraries with good documentation are available), so that the indicators can be verified in experiments in the near future. In Table 1, we describe each dimension with these more concrete indicators of data quality.

Table 1. Indicators for the quality of unstructured text data

| Dimension        | Indicator                   |
|------------------|-----------------------------|
| Interpretability | Fit of training data        |
|                  | Confidence                  |
|                  | Noisy data                  |
| Relevancy        | Frequent keywords           |
|                  | Specificity                 |
| Accuracy         | Precision                   |
|                  | Accuracy                    |
|                  | Quality of gold annotations |

While the concepts behind the indicators confidence, precision, accuracy and quality of gold annotations are applicable to all types of unstructured data which are processed by statistical machine learning components, the remaining indicators are text-specific. With a different definition of noisy data and fit of training data, the concepts may be transferred to other data types as well, e.g., measuring the similarity between input pictures and training data pictures, or measuring the percentage of noisy data, defined as the percentage of background noise, in speech. In the following we describe the indicators in more detail and give hints towards possible implementations.

The first indicator, fit of training data, directly follows from the definition of Interpretability given in section 3 when considering statistical classifiers as data consumers. The quality of text data with respect to a machine consumer can be measured by calculating the similarity of the input text data and the data expected by the data consumer. In the case of statistical classifiers such as a part-of-speech tagger (which automatically assigns a part of speech to each token, such as a word, in a text) or a sentiment classifier (which automatically detects opinions in texts and assigns, e.g., the classes positive, negative and neutral to texts), D_C may be represented by the training data. For unstructured text data the similarity can be measured using text similarity measures. For example, consider the situation where Twitter data is consumed by a statistical classifier such as a part-of-speech tagger that was trained on newspaper texts. By the definition of interpretability used in this work, data quality is lower than for another tagger that was also trained on text data from Twitter. Examples of measures for this indicator are text similarity measures such as Cosine Similarity and Greedy String Tiling, which are, e.g., implemented in the DKPro Similarity package (see [7]). Using the DKPro Similarity library in Java, two lists of tokens can easily be compared and a similarity score in the interval [0,1] calculated by following the instructions on the project web site (https://dkpro.github.io/dkpro-similarity/).
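DKPro Similarity is a Java library; as a rough, library-free illustration of the same idea, the fit of training data can be approximated in Python by the cosine similarity of token-frequency vectors. This is a minimal sketch under simplifying assumptions of ours (whitespace/lowercase tokenization, raw term counts), not the DKPro implementation:

```python
# Hedged sketch: fit of training data as cosine similarity of token-count vectors.
import math
from collections import Counter

def token_counts(text):
    return Counter(text.lower().split())

def cosine_similarity(counts_a, counts_b):
    """Cosine similarity of two sparse count vectors, in the interval [0,1]."""
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[t] * counts_b[t] for t in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

training_corpus = "the chancellor met the press in berlin on monday"
input_corpus    = "omg the tagger haz no idea lol #fail"

fit = cosine_similarity(token_counts(training_corpus), token_counts(input_corpus))
print(round(fit, 2))  # low value -> input data differs strongly from the training data
```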
The second indicator, confidence, also focuses on the data quality of text data as perceived from the point of view of a statistical classifier. A statistical classifier estimates the probabilities for each class from a fixed list of classes, given the data. These probabilities are also called confidence values (for more details, see [12]). If the probability of a classification decision is very high, the confidence of the statistical classifier is said to be high. Confidence is expressed as a number in the interval [0,1] and may be used for measuring data quality. For example, confidence measures are available and can be retrieved for the natural language processing tools in OpenNLP (https://opennlp.apache.org/), such as the tokenizer and the part-of-speech tagger; OpenNLP is a Java library for natural language processing which is heavily used in industry applications because it has an Apache license. To get these confidence values, follow the documentation of the OpenNLP library (e.g., for the part-of-speech tagger, just call the probs method, which returns an array of the probabilities for all tagging decisions).

The third indicator in the interpretability dimension is the percentage of noisy data. This is a relevant indicator for human and machine consumers, since reading a text is more difficult for a human if it is full of misspelled words, non-grammatical sentences and abbreviations. Since most machine consumers of text data expect clean text data such as newspaper texts, the degree of noisy data also measures data quality from the viewpoint of such standard machine consumers. The percentage of noisy data may be measured as the percentage of sentences which cannot be parsed by an automatic syntax parser, of unknown words, punctuation, very long/short sentences, incorrect casing, special signs, URLs, mail addresses, emoticons, abbreviations, pause-filling words, rare words, or as the percentage of spelling mistakes (the latter as already suggested by [26]). Non-parsable sentences can be identified using an automatic syntax parser such as the parsers implemented in natural language processing libraries like OpenNLP or the Natural Language Toolkit NLTK (http://www.nltk.org/). The number of punctuation tokens and of unknown words (e.g., defined as words unknown to a standard part-of-speech tagger) may, e.g., be calculated using the standard part-of-speech tagger implemented in NLTK (which has individual classes for punctuation and unknown words). Very long/short sentences can be identified using a tokenizer and a sentence segmenter from a natural language processing library and by counting the automatically determined tokens and sentences. Incorrect casing may be detected using supervised machine learning methods, such as suggested in [17]. Regular expressions can be used to automatically identify the percentage of special signs, URLs, mail addresses, emoticons, abbreviations and pause-filling words in texts, as sketched below.
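A minimal sketch of such regular-expression based measures follows. The patterns are simplified illustrations of ours and would need tuning for real data; only URLs, mail addresses and a few emoticons are covered here:

```python
# Hedged sketch: percentage of noisy tokens (URLs, mail addresses, emoticons)
# identified with simple regular expressions. Patterns are illustrative only.
import re

NOISE_PATTERNS = [
    re.compile(r"^https?://\S+$"),              # URLs
    re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),  # mail addresses
    re.compile(r"^[:;]-?[)(DP]$"),              # a few common emoticons
]

def noisy_token_ratio(text):
    """Share of whitespace-separated tokens matching a noise pattern, in [0,1]."""
    tokens = text.split()
    if not tokens:
        return 0.0
    noisy = sum(1 for t in tokens if any(p.match(t) for p in NOISE_PATTERNS))
    return noisy / len(tokens)

tweet = "check this out :) https://example.org or mail me@example.org"
print(round(noisy_token_ratio(tweet), 2))  # 0.38 -> 3 of 8 tokens are noisy
```

Since the convention in this work is that 1 indicates high quality, one minus this ratio could serve as the corresponding quality score.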
Rare words can be identified internally by counting all words that occur less than a specified number of times in the text corpus, or by counting words that are not found in a standard dictionary or in a generated dictionary (such as a dictionary generated from a very comprehensive text corpus of the domain). The number of spelling mistakes in a text corpus may be calculated using the Python package PyEnchant (http://pythonhosted.org/pyenchant/) or any other spelling correction module. Most of the measures suggested for the indicator noisy data can be implemented using the NLTK library, which comes with very good documentation and an active community.

But it is not sufficient if data is merely interpretable. Interpretable data which is not relevant to the end data consumer and his goal is of low quality. Therefore, its Relevancy needs to be calculated. For text data this can be done following approaches already developed for information retrieval systems. The relevance metric used in information retrieval systems determines the relevance of search results with respect to the information need of the searcher. The information need is captured via keywords or documents first and can then be compared, e.g., to the frequent keywords in the input texts (see [16] for the relevance metric in information retrieval). Again, textual similarity measures such as cosine similarity are used to determine the similarity of the information need and a text (as implemented in [7] and accessible via the well-documented DKPro Similarity library). Besides the frequent keywords, specificity can also indicate the relevance of unstructured text data for the task a certain end consumer wants to accomplish. The specificity of language in texts and speech can be determined via the coverage of a domain-specific semantic resource which contains all relevant technical terms. In the simplest version this would be a text file listing all domain words, which is used to determine the percentage of domain words in a corpus. Coverage of domain-specific taxonomies may, e.g., be calculated with a concept matcher such as the one presented in [22].

If the data is interpretable and relevant, the remaining question is whether it reflects the real world or not, that is, whether it is accurate. The Accuracy of unstructured text data may be indicated by evaluation metrics such as precision and accuracy. These metrics compare the automatically annotated data to parts of the data which represent the real world, such as manually annotated gold standard corpora. Statistical classifiers are evaluated by comparing them to gold standards and by determining how many of the entities assigned to a class really belong to that class (precision) and the percentage of classification decisions that were correct (accuracy), see [16]. The metrics precision and accuracy were already suggested as indicators for text data quality by [26] and [23]. Furthermore, the quality of the gold annotations of training and test data is an indicator in the accuracy dimension. It can be calculated according to [10] by measuring the inter-rater agreement, which captures how often two or more annotators agree. Evaluation metrics and inter-rater metrics are, e.g., implemented in NLTK.
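As a small illustration of the accuracy indicators, precision, accuracy and a simple pairwise inter-rater agreement can be computed from gold annotations in a few lines. This is a sketch with made-up labels; NLTK's evaluation and agreement modules provide more complete implementations:

```python
# Hedged sketch: precision and accuracy against gold labels, plus a simple
# observed pairwise agreement between two annotators. Labels are made up.
def precision(predicted, gold, positive_class):
    """Share of items predicted as positive_class that carry positive_class in the gold data."""
    predicted_pos = sum(1 for p in predicted if p == positive_class)
    if predicted_pos == 0:
        return 0.0
    correct = sum(1 for p, g in zip(predicted, gold)
                  if p == positive_class and g == positive_class)
    return correct / predicted_pos

def accuracy(predicted, gold):
    """Share of classification decisions that are correct."""
    return sum(1 for p, g in zip(predicted, gold) if p == g) / len(gold)

def observed_agreement(annotator_a, annotator_b):
    """Share of items on which two annotators agree (no chance correction)."""
    return sum(1 for a, b in zip(annotator_a, annotator_b) if a == b) / len(annotator_a)

gold      = ["pos", "neg", "neu", "pos", "neg"]
predicted = ["pos", "neg", "pos", "pos", "neu"]
rater_b   = ["pos", "neg", "neu", "neg", "neg"]

print(precision(predicted, gold, "pos"))   # 2/3 of predicted positives are correct
print(accuracy(predicted, gold))           # 3/5 of all decisions are correct
print(observed_agreement(gold, rater_b))   # the two annotators agree on 4/5 items
```

In practice, chance-corrected agreement coefficients such as those described in [10] are preferable; the sketch only shows the raw observed agreement.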
In this section we presented automatically measurable indicators for text data which are executable. Not all indicators presented here are relevant and applicable in all cases. Only a few of the many statistical tools give access to a confidence metric, and precision and accuracy can only be calculated when gold test data is available.

5 Related Work

While research on the quality of structured data is numerous, the quality of unstructured data has hardly been considered yet. We present related work in the field of data quality in section 5.1 and list isolated methods useful in assessing unstructured text data quality in section 5.2.

5.1 Related Work in Data Quality

Many frameworks and data quality dimensions dedicated to the quality of structured data have been suggested (e.g., [24, 30]), and special frameworks and dimensions for social media data and big data have also been developed [5, 21]. In these works, data quality dimensions are defined from a human end consumer's point of view and no automatic measures for the assessment of unstructured data are given. Several sources [2, 23, 26] address the need for data quality measures on unstructured data, but none of them gives executable dimensions and indicators. In these works, interesting starting points for quality dimensions and indicators are defined, such as:

– The quality of the technologies used to interpret unstructured data and the author's expertise [23]
– Accuracy, readability, consistency and accessibility [2]
– Precision and spelling quality [26]

No hints towards possible implementations of these dimensions and indicators are given, though. As demanded in [26], we also support the view that textual data quality needs to be measured for both human consumers and machine consumers. We have furthermore motivated the need to measure data quality at every stage. This is also demanded in [15, 27]. A closely related idea is expressed in the concept of data provenance, which aims at collecting information on all data sources and on the transformation or merging steps applied to the data (see [4]).

5.2 Isolated Methods for Data Quality Assessment of Unstructured Text Data

In the definition of the quality indicators in this article we focused on unstructured text data. Therefore, we limit the list of isolated methods to those relevant for the assessment of textual data. For example, a considerable amount of work in the field of natural language processing focuses on the interaction between textual data characteristics and the performance of natural language processing (NLP) tools. In [3] the authors consider factors that affect the accuracy of automatic text-based language identification (such as the size of the text fragment and the amount of training data). Furthermore, work on correcting the upper and lowercasing of words in texts (re-casing), spelling correction, abbreviation expansion and text simplification is related to our work (e.g., [17]). In the context of search engines, the quality of the search results and of the data basis is discussed as well [9]. In automated assessment, methods to automatically assess the quality of hand-written essays and short answers (e.g., student essays and answers to free-text questions) are developed (for a good overview, see [31]). Work on training data selection in machine learning, which chooses the subsets of the training data which fit best to the domain of the test set (e.g., [25]), is also related to our work.
The idea expressed in these works is similar to the idea behind the indicator fit of training data, which we added to our list of indicators for unstructured text data quality. However, we are the first to suggest the fit of training data as a data quality indicator. Furthermore, we do not suggest using it to select parts of the training data, as done in these works, but to choose between different text corpora.

6 Conclusion and Future Work

We listed dimensions and indicators for determining the quality of unstructured data based on the basic elements of mining processes on unstructured data. The indicators proposed are executable and easily transfer into a data quality metric in the interval [0,1]. In future work we will determine the most suitable implementations for the indicators and validate them in experiments. We will furthermore explore how indicators may be combined to measure the overall data quality of unstructured data, and how the improvement of data quality as perceived by intermediate consumers influences data quality from the end consumer's viewpoint.

Acknowledgments. The authors would like to thank the German Research Foundation (DFG) for financial support of this project as part of the Graduate School of Excellence advanced Manufacturing Engineering (GSaME) at the University of Stuttgart. Moreover, we thank B. Mitschang and L. Kassner for important feedback.

References

1. C. Batini, D. Barone, F. Cabitza, and S. Grega. A data quality methodology for heterogeneous data. International Journal of Database Management Systems (IJDMS), 3(1):60–79, 2011.
2. C. Batini and M. Scannapieco. Data and Information Quality. Springer International Publishing, Cham, 2016.
3. G. R. Botha and E. Barnard. Factors that affect the accuracy of text-based language identification. Computer Speech & Language, 26(5):307–320, 2012.
4. P. Buneman and S. B. Davidson. Data provenance – the foundation of data quality. 2010.
5. L. Cai and Y. Zhu. The challenges of data quality and data quality assessment in the big data era. Data Science Journal, 14(0):2, 2015.
6. F. Camastra and A. Vinciarelli. Machine Learning for Audio, Image and Video Analysis: Theory and Applications. Advanced Information and Knowledge Processing. Springer, London, second edition, 2015.
7. D. Bär, T. Zesch, and I. Gurevych. DKPro Similarity: An open source framework for text similarity. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (System Demonstrations) (ACL 2013), pages 121–126, Stroudsburg, PA, USA, 2013. Association for Computational Linguistics.
8. D. Dey and S. Kumar. Reassessing data quality for information products. Management Science, 56(12):2316–2322, 2010.
9. C. Feilmayr. Decision guidance for optimizing web data quality - a recommendation model for completing information extraction results. 24th International Workshop on Database and Expert Systems Applications, pages 113–117, 2013.
10. J. L. Fleiss and B. Levin. The measurement of interrater agreement. In J. L. Fleiss, B. Levin, and M. C. Paik, editors, Statistical Methods for Rates and Proportions, Wiley series in probability and statistics, pages 598–626. J. Wiley, Hoboken, N.J., 2003.
11. C. Fox, A. Levitin, and T. Redman. The notion of data and its quality dimensions. Inf. Process. Manage., 30(1):9–19, 1994.
12. S. Gandrabur, G. Foster, and G. Lapalme. Confidence estimation for NLP applications. ACM Transactions on Speech and Language Processing (TSLP), 3(3):1–29, 2006.
13. J. Han, K. Chen, and J. Wang. Web article quality ranking based on web community knowledge. Computing, 97(5):509–537, 2015.
14. K. Hartl and O. Jacob. Determining the business value of business intelligence with data mining methods. The Fourth International Conference on Data Analytics, pages 87–91, 2015.
15. A. Immonen, P. Paakkonen, and E. Ovaska. Evaluating the quality of social media data in big data architecture. IEEE Access, (3):1, 2015.
16. C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, New York, 2008.
17. C. Niu, W. Li, J. Ding, and R. K. Srihari. Orthographic case restoration using supervised learning without manual annotation. International Journal on Artificial Intelligence Tools, (13), 2003.
18. J. R. Nurse, S. S. Rahman, S. Creese, M. Goldsmith, and K. Lamberts. Information quality and trustworthiness: A topical state-of-the-art review. International Conference on Computer Applications and Network Security (ICCANS 2011), 2011.
19. B. Pang, L. Lee, and S. Vaithyanathan. Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of EMNLP, pages 79–86, 2002.
20. P. Russom. BI search and text analytics: New additions to the BI technology stack. 2007.
21. M. Schaal, B. Smyth, R. M. Mueller, and R. MacLean. Information quality dimensions for the social web. In Proceedings of the International Conference on Management of Emergent Digital EcoSystems, pages 53–58. ACM, 2012.
22. M. Schierle and D. Trabold. Multilingual knowledge-based concept recognition in textual data. In A. Fink, B. Lausen, W. Seidel, and A. Ultsch, editors, Advances in Data Analysis, Data Handling and Business Intelligence, Studies in Classification, Data Analysis, and Knowledge Organization, pages 327–336. Springer Berlin Heidelberg, Berlin, Heidelberg, 2010.
23. A. Schmidt, C. Ireland, E. Gonzales, M. Del Pilar Angeles, and D. D. Burdescu. On the quality of non-structured data, 2012.
24. L. Sebastian-Coleman. Measuring Data Quality for Ongoing Improvement: A Data Quality Assessment Framework. Elsevier Science, Burlington, 2013.
25. Y. Song, P. Klassen, F. Xia, and C. Kit. Entropy-based training data selection for domain adaptation. Proceedings of COLING 2012, 2012.
26. D. Sonntag. Assessing the quality of natural language text data. In GI Jahrestagung, pages 259–263, 2004.
27. I.-G. Todoran, L. Lecornu, A. Khenchaf, and J.-M. Le Caillec. A methodology to evaluate important dimensions of information quality in systems. Journal of Data and Information Quality, 6(2-3):1–23, 2015.
28. T. Vogel, A. Heise, U. Draisbach, D. Lange, and F. Naumann. Reach for gold. Journal of Data and Information Quality, 5(1-2):1–25, 2014.
29. H. Wang, M. Li, Y. Bu, J. Li, H. Gao, and J. Zhang. Cleanix. ACM SIGMOD Record, 44(4):35–40, 2016.
30. R. Y. Wang and D. M. Strong. Beyond accuracy: What data quality means to data consumers. J. Manage. Inf. Syst., 12(4):5–33, 1996.
31. R. Ziai, N. Ott, and D. Meurers. Short answer assessment: Establishing links between research strands. In Proceedings of the 7th Workshop on Innovative Use of NLP for Building Educational Applications (BEA7), Montreal, Canada, 2012. Association for Computational Linguistics.