Measuring the 3V’s of Big Data: A Rigorous Approach

Olga Ormandjieva¹ [0000-0001-5641-0976], Mandana Omidbakhsh² [0000-0003-0845-6339] and Sylvie Trudel² [0000-0002-4983-1679]

¹ Concordia University, Montreal, Canada
² Université du Québec à Montréal, Montreal, Canada

Copyright ©2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. Although the success of big data technologies depends highly on the quality of the underlying data, no standard measurement model has yet been established for assessing quantitatively the quality of big data. This research aims at investigating thoroughly the quality of big data and laying rigorous foundations for its theoretically valid measurement. We recently proposed a quality measurement hierarchy for methodically selected 10V’s of big data, based on the existing ISO/IEC standards and on NIST (National Institute of Standards and Technology) definitions and taxonomies. In this paper, pursuant to our latest research, we derive a measurement information model for the most widely used 3V’s of big data: Volume, Velocity and Variety. The proposed 3V’s measures, declined into a hierarchy of 3 indicators, 2 derived measures and 4 base measures, are validated theoretically based on the representational theory of measurement. Our future research will enhance the theoretical findings presented in this paper with empirical evidence through evaluation of these measures with open-access data.

Keywords: Big Data Quality, Measurement Information Model, ISO/IEC Standards, Volume, Velocity, Variety, Representational Theory of Measurement.

1 Introduction

Big data refers to the vast amount of digital data stored in and originating from different sources of digital and physical environments. Businesses, organizations and, more recently, governments rely remarkably on the interpretation and analysis of big data to enhance their domain knowledge, make efficient decisions and, in consequence, improve their profitability, productivity and performance robustly [1]. Although big data analysis and interpretation depend highly on the quality of the underlying data, there has been no standard measurement model to evaluate the quality of big data. In this paper, we derive a new measurement information model that aims to lessen this gap by proposing valid measures specifically for these 3V’s of big data characteristics: (i) Volume, (ii) Variety and (iii) Velocity.

Motivation of this research. Big data has been evolving through three phases along with the generation of its core components, ranging from Database Management System (DBMS)-based structured content to web-based, unstructured content and then, in the last decade, to mobile and sensor-based content. Firstly, relational DBMS (RDBMS) and data warehousing, Extract-Transform-Load (ETL), online analytical processing, data mining and statistical analysis were established. Secondly, information retrieval and extraction, opinion mining, question answering, web/intelligence analytics, social media/network analytics, social network analysis and spatio-temporal analysis originated. And thirdly, location-aware analytics, person-centered analysis, content-relevant analysis, mobile utilization and human-computer interaction were developed [2].

Why 3V’s? The increase of the Volume, Velocity and Variety of data contributed majorly to big data evolution; they formed primarily the 3V’s of big data characteristics.
With the addition of the veracity, valence, value, volatility, vitality, validity and vincularity characteristics, the 10V’s of big data were shaped.

Challenges. One of the main challenges that research and industry face nowadays is the lack of visibility and transparency of big data’s quality: even though researchers and practitioners invest largely in the most sophisticated technologies such as deep learning, big data analytics, training of personnel, etc., the statistical and other research findings obtained from big data content are still only as good as the underlying data.

Approach. The aim of this research is to bridge the gap between the industrial usage of big data and the underlying big data quality, in order to verify its quality for the intended purposes, by bringing practical goal-driven solutions and assessing quantitatively, and thus objectively, the big data 3V’s.

The rest of the paper is organized as follows: in section 2, the background work, including our published hierarchical model for quality measurement of big data, is briefly described. In section 3, we define the quality measurements for the big data characteristics of Volume, Velocity and Variety on the basis of NIST (National Institute of Standards and Technology) definitions and taxonomies, and in accordance with ISO/IEC/IEEE Std. 15939 guidelines. In section 4, our quality assessment of the above-mentioned 3V’s measures is illustrated on real data. The proposed measures are validated theoretically based on the representational theory of measurement in section 5. The conclusion and future work directions are outlined in section 6.

2 Background and Related Work

2.1 The 3V’s of Big Data

Although big data has been commonly characterized by many different criteria known as the V’s of big data, there are three main characteristics that are generally agreed upon, known as the 3V’s of big data: Volume, Velocity and Variety, as illustrated in Fig. 1.

Fig. 1. The 3V’s of Big Data, extracted from [7].

Volume. It refers to the magnitude of data [8].

Velocity. It refers to the speed at which data is being generated, which can occur in the following ways:

• Real-time is when the data is generated immediately, such as in streaming, radar systems, customer service systems, and bank ATMs. There is continual input, constant processing and steady output.
• Near real-time is when speed is important but the data is not generated immediately, mostly for the production of operational intelligence, which is a combination of data processing and CEP (Complex Event Processing) combining data from multiple sources in order to detect patterns, such as: sensor data processing, IT systems monitoring, financial transaction processing.
• Batch is when the data is generated with delays, such as payroll, billing, data analysis from operational data, historical and archived data, data from social media, service data, etc.

Variety. It refers to the ever-increasing different forms of data, as in text, images, voice, and geospatial data. The different types of data are categorized as structured, unstructured and semi-structured data.

• Structured data refers to data stored in databases in an ordered manner. The data in library catalogues (such as: date, author, place, subject, etc.) and economic data (such as: GDP, PPI, ASX) are considered structured.
• Unstructured data refers to any data with unknown form/structure.
The data in the form of media (mp3, digital photos, audio, video), text (word processing, spreadsheets, presentations) and social media (data from Facebook, Twitter, LinkedIn) are considered unstructured.
• Semi-structured data refers to a form of structured data that does not conform to the formal structure of data models as in RDBMS or other data tables, but contains tags or other markers which separate semantic elements and enforce hierarchies of records and fields within the data. It is also known as a self-describing structure. An example of this form of data is a personal record stored in an XML file, such as: <name>Harry</name> <gender>Female</gender> <age>23</age>

2.2 Our Proposed Quality Measurement Model

Big data analysis and interpretation depend highly on the quality of data as an eminent factor for its maturity [3]. Although NIST has developed a taxonomy towards the standardization of big data technology [4], in which the characteristics of data are pivoted at different levels of granularity, there has been no standard measurement model in order to assess quantitatively the quality of big data, its analysis and interpretation techniques. Therefore, we recently proposed our new hierarchical goal-driven measurement model for the 10V’s of big data characteristics [4]. We adopted the NIST taxonomy [4], in which a hierarchy of roles/actors and activities, including data elements as the smallest level, records as groups of elements, datasets as groups of records and finally multiple datasets, is provided at different levels and defined as below:

“Data elements are individual data elements with the same definition in the big data paradigm and are populated by their actual value, constrained by its data type definition (e.g.: numeric, string, date) and chosen data format. The actual value can be constrained by a set of allowed values or a specific standard vocabulary for interoperability with others in the field. For example, in the context of unstructured text, a data element would refer to a single token such as a word. Records are groups of data elements that describe a specific entity or event or transaction. Records have structure and they are grouped as structured, semi-structured and unstructured, as with the increase of mobile and web data (e.g.: online texts, images and videos) more emphasis is on unstructured data, as so a data record can refer to phrase or sentence or entire document data in context of unstructured data. Records can be grouped to form datasets. In the context of unstructured text, a record could be a sentence, paragraph, or section and the dataset could refer to the complete data. Multiple datasets are group of datasets with the emphasis on the integration and fuse of data. The variety characteristic of big data is concerned at this level.” [11]

In our quality measurement model, we related each of the 10V’s of big data characteristics to their corresponding levels in the NIST taxonomy, namely: data element, record, dataset and multiple datasets. We adapted the ISO/IEC 25024 international standard’s data quality measures in order to define the quality model on the basis of the 10V’s of big data. Fig. 2 shows an overview of our approach to eliciting measurements for assessing the big data characteristics (the 10V’s).

Fig. 2.
Overview of our Approach to Big Data Quality Measurement [4]

The quality model is tailored in a way that facilitates the evaluation of such systems in terms of the ISO/IEC 25024 standard’s measures of data characteristics: Availability, Accuracy, Accessibility, Credibility, Completeness, Compliance, Currentness, Efficiency, Portability, Traceability and Understandability. The validity of the proposed big data quality measurement model is rooted in the standardization of: i) the NIST big data taxonomy, and ii) the data measurements defined in ISO/IEC. For a more detailed explanation of our new hierarchical quality model designed specifically for the purpose of measuring the quality of the selected 10V’s of big data, please refer to [4].

According to our research findings in [4], for the big data characteristics of Volume, Velocity and Variety (the 3V’s), there are no related measurements found in the existing ISO/IEC standards. In this paper, we propose a new measurement information model for quality assessment of the above-mentioned 3V’s.

3 Quality Measurements for 3V’s of Big Data

The goal of this section is to derive theoretically valid measures for the 3V’s characteristics of big data, namely, Volume, Velocity and Variety.

3.1 Mathematical Modeling of the Measurement Entities

A theoretically valid measure is founded on a mathematical model of the entities of interest. According to ISO/IEC/IEEE Std. 15939, an object that is to be characterized by measuring its attributes is named an “entity” [10]. In this work, the entities of interest correspond to the hierarchical levels of the NIST hierarchy (data element, record, dataset, and multiple datasets). We undertake a set-theoretical approach to modeling these NIST hierarchy elements, as described next:

Data Elements. As explained in section 2, the data elements in big data originate from heterogeneous sources, including attributes from traditional databases and newer ones, for instance, text from social media and sensor data. To be able to model a collection of heterogeneous data elements as a set, we first label each data element with a unique identifier (UIDE). This also can be traced to our view of Variety, discussed later in section 3.2. We take as universe the fixed set of all distinct data elements in the multiple datasets, and form a set, referred to as DE, of the UIDEs of all distinct data elements. Every reference to a data element below is to be interpreted as a reference to its UIDE.

Record. Data elements are stored in records. Informally, a record can be seen as a collection of data elements. Every record is referred to by a unique record ID (UIDR). As explained in section 2, records in big data originate from heterogeneous sources, including traditional databases and newer, less structured sources such as social media; therefore, a record can refer to a phrase or an entire document in the context of unstructured data. We model a record r mathematically as a multiset, which may be formally defined as a 2-tuple (DEr, m), where DEr is the underlying set of the multiset formed from its distinct elements (DEr ⊆ DE). The multiplicity m: DEr → ℕ⁺ is a function from DEr to the set of positive integers, giving the number of occurrences of each element el ∈ DEr as the number m(el).

Dataset. The term dataset refers to a collection of one or more records. We model a dataset DS as a set of records’ unique identifiers (UIDR). Every unique dataset is referred to by a unique dataset ID (UIDDST).

Multiple datasets. Big data is viewed as multiple datasets and thus can be formally modeled as a set of datasets MDS (in mathematical terms, as a set of multisets).
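To make the set-theoretic model above concrete, the following Python sketch shows one possible encoding of the four entity levels. It is given for illustration only; the identifier names (DataElementID, Record, Dataset, MultipleDatasets) are illustrative choices and are not part of the model itself.

```python
from collections import Counter

# Illustrative encoding of the NIST-hierarchy entities modeled above.
DataElementID = str  # a UIDE labelling one distinct data element


class Record:
    """A record r modeled as a multiset (DEr, m) of data-element identifiers."""

    def __init__(self, uid_r: str, elements: list[DataElementID]):
        self.uid_r = uid_r                     # UIDR
        self.multiplicity = Counter(elements)  # m: DEr -> N+, occurrences per UIDE

    @property
    def underlying_set(self) -> set[DataElementID]:
        """DEr: the distinct data elements occurring in the record."""
        return set(self.multiplicity)


class Dataset:
    """A dataset DS, identified by a UIDDST and holding its records keyed by UIDR."""

    def __init__(self, uid_dst: str, records: list[Record]):
        self.uid_dst = uid_dst
        self.records = {r.uid_r: r for r in records}  # the set of UIDRs (records kept for convenience)


# Big data is viewed as multiple datasets (MDS), here simply a list of Dataset objects.
MultipleDatasets = list[Dataset]
```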
The aim of this section 3.1 was to propose a mathematical model on which to base the definition of the 3V’s measurement hierarchy. Such an approach is justified by the fact that mathematical models greatly simplify the automation of the measurement procedures. It is to be noted that the automation of the 3V’s measurements is out of the scope of this paper and will be tackled in our future work. The measurements built hierarchically upon the mathematical model are described in section 3.2.

3.2 Proposed 3V’s Measurement Information Model

In section 3.2 we closely follow the terminology and guidelines outlined in the International ISO/IEC/IEEE Standard 15939 on measurement processes in software engineering [10]. Fig. 3 illustrates the hierarchical relationships among the key components of the proposed measurement information model. The model defines three types of measures: base measures, derived measures, and indicators, as detailed below.

Base Measures. A base measure is defined in ISO/IEC/IEEE Std. 15939 as functionally independent of other measures [10]. The base measures in the 3V’s measurement model are depicted at the lowest level of the hierarchy (see Fig. 3). Their definitions are provided next:

Number of distinct data elements (Ndde). Ndde reflects the Variety of the data elements, as stated in section 3.1 (see the data element mathematical modeling). The measurement method for the base measure Ndde is counting the number of distinct labels (UIDE) of data elements in the set DE, formally defined as Ndde (DE) = |DE| (the cardinality of the set DE, which is the total number of UIDEs in the multiple datasets). The scale type of the measurement is absolute [9] and the measurement unit is a UIDE.

Number of records in a dataset (Nrec). Nrec assesses the Variety of a dataset in terms of the diversity of records and their sources. This base measure is defined formally as Nrec (DS) = |DS| (the cardinality of the set DS formed by the unique identifiers of records (UIDR), that is, the number of records in the dataset DS). The scale type of the measurement is absolute [9] and the measurement unit of Nrec is a UIDR; the corresponding measurement method is counting.

Number of datasets in big data (Nds). Nds reflects an aspect of Variety in terms of the diversity of datasets in the multiple datasets (MDS). The measurement method for the base measure Nds is counting the total number of unique identifiers UIDDST of datasets in the multiple datasets. Formally, Nds (MDS) = |MDS|. The scale type of the measurement is absolute [9] and the measurement unit is a UIDDST.

Time (T). T models the absolute time. It is required to accompany the measurement of the Volume at a specific time, in order to calculate Velocity and understand big data growth over time.
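As an illustrative continuation of the sketch given in section 3.1, the four base measures reduce to simple counting functions over the model (again, the function names are illustrative choices that only make the definitions concrete):

```python
from datetime import datetime


def Ndde(de: set[DataElementID]) -> int:
    """Number of distinct data elements: |DE| (measurement unit: UIDE)."""
    return len(de)


def Nrec(ds: Dataset) -> int:
    """Number of records in a dataset: |DS| (measurement unit: UIDR)."""
    return len(ds.records)


def Nds(mds: MultipleDatasets) -> int:
    """Number of datasets in the multiple datasets: |MDS| (measurement unit: UIDDST)."""
    return len(mds)


def T() -> datetime:
    """Absolute time accompanying a Volume measurement, needed later for Velocity."""
    return datetime.now()
```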
Derived Measures and Indicators. A derived measure is defined as a measurement function of two or more values of base and derived measures [10]. According to ISO/IEC/IEEE Std. 15939, the indicators of the 3V’s are defined here as measures that provide an evaluation of the big data characteristics of Volume, Velocity and Variety, derived from the big data measurement needs stipulated earlier (see the introduction section). These indicators serve as a basis for the analysis of big data quality and for decision-making based on an interpretation of the measurement results. The derived measures and indicators defined in this research are specified next:

Length of big data (Lbd). Lbd is defined informally in this work as the total number of records in MDS. The measurement formula is as follows:

Lbd (MDS) = Σ_{DS ∈ MDS} Nrec (DS)

The measurement unit of Lbd is a UIDR.

Big Data Volume (Mvol). In this research we define Volume informally in terms of the information content of multiple datasets, which allows us to apply information-theoretic measurement of Volume in mathematical bits and thus makes it possible to define a Velocity measure of big data. In information theory, the amount of information is often expressed with the binary logarithm, corresponding to making the bit the fundamental unit of information.

Assume that there are n elements in the set DE (that is, Ndde (DE) = n). Each record may, or may not, include each individual element from the set DE; thus there are n binary Yes/No decisions to be made for each record in terms of selecting data elements. This is equivalent to 2^n choices per record. Each binary decision can be represented by an information bit. Thus, the binary logarithm can be used to calculate the number of information bits needed to encode a record.

In general, in engineering applications we do not take the logarithm of a dimensioned number, only of dimensionless quantities [13]. To avoid this problem, we define a function bin_dec: Ndde (DE) → ℕ to convert Ndde (DE), measured in UIDE, into a dimensionless quantity represented by a natural number. This approach allows us to apply information-theoretic measurement of Volume in information bits, which is independent from the specific technologies used to store the masses of raw data.

Mvol of multiple datasets (MDS). It is defined formally as:

Mvol (MDS) = Lbd (MDS) log2 (bin_dec (Ndde (DE)))

Mvol measures the number of information bits, across all records, required to specify the information content of the multiple datasets. The measurement unit is an information bit; thus, the measurement results of Volume allow multiple datasets to be compared objectively in terms of their information content. Hence, the Volume measure proposed in this work gives organizations the ability to objectively assess and compare multiple datasets in terms of their information content, expressed in information bits. The trend of Mvol depicts graphically the Volume of big data over time.

Big Data Velocity (Mvel). We define informally the notion of Velocity of big data (a set of multiple datasets) in terms of the relative growth of big data over a period of Time (T), that is, the speed of increase in big data Volume. The Velocity measure function is defined as:

Mvel (MDS) = ((Mvol (MDS_T2) − Mvol (MDS_T1)) / Mvol (MDS_T1)) * 100

where MDS_T1 is the multiple datasets at time T1 and MDS_T2 represents the multiple datasets at time T2, with T2 > T1. Therefore, Mvel (MDS) is expressed as a percentage of Volume growth over the interval of time T2 − T1, along with its adequate unit of measure (seconds, minutes, hours, weeks, etc.). Velocity can be used to assist organizations in understanding the relative growth of their big data’s information content over a period of time. Negative values indicate that data has been archived or removed from the source, or is no longer relevant. In other words, negative values of Velocity show that the data’s life span is over.
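Continuing the same illustrative sketch, the derived measure Lbd and the Volume and Velocity indicators can be written directly from the formulas above (here bin_dec is implicit, since the cardinality of DE is already a dimensionless natural number in code):

```python
import math


def DE_of(mds: MultipleDatasets) -> set[DataElementID]:
    """The set DE: union of the distinct data-element identifiers occurring in MDS."""
    return {el for ds in mds for r in ds.records.values() for el in r.underlying_set}


def Lbd(mds: MultipleDatasets) -> int:
    """Length of big data: total number of records across all datasets (unit: UIDR)."""
    return sum(Nrec(ds) for ds in mds)


def Mvol(mds: MultipleDatasets) -> float:
    """Volume indicator: Lbd(MDS) * log2(bin_dec(Ndde(DE))), in information bits."""
    n = Ndde(DE_of(mds))
    return Lbd(mds) * math.log2(n) if n > 0 else 0.0


def Mvel(mds_t1: MultipleDatasets, mds_t2: MultipleDatasets) -> float:
    """Velocity indicator: relative growth of Volume between T1 and T2, in percent."""
    vol_t1 = Mvol(mds_t1)
    return (Mvol(mds_t2) - vol_t1) / vol_t1 * 100
```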
Big Data Variety (Mvar). In our work we consider a three-fold root cause of Variety: i) the diversity of data elements in the multiple datasets (MDS), ii) the diversity of data records in MDS, and iii) the diversity of datasets in MDS. Mvar (MDS) is defined as the 3-tuple (Ndde (DE), Lbd (MDS), Nds (MDS)), which reflects correspondingly the diversity of unique data elements, the diversity of records, and the diversity of datasets in MDS. As mentioned earlier, the diversity of data elements is inherently captured in the DE set, where the data elements are mapped to UIDEs and thus abstracted from their diverse sources. Mvar allows different multiple datasets to be compared objectively in terms of the above three objective measures.

The measurement information model we proposed in section 3.2 is based on the ISO/IEC/IEEE Std. 15939 terminology and guidelines. The objective of the next section is to present graphically this new hierarchical measurement model, tailored specifically to the 3V’s of big data: Volume, Velocity and Variety.

3.3 Hierarchy of the 3V’s Measures

The measurement information model proposed in section 3.2 is a hierarchical structure linking information needs to the relevant entities and attributes of concern, such as the number of distinct data elements, the number of distinct records, the number of distinct datasets, or length. It defines how the relevant attributes are quantified and converted to indicators that provide a basis for decision making. In our approach, the 3V’s characteristics were decomposed through three layers, as depicted in Fig. 3.

Fig. 3. Hierarchy of Big Data 3V’s Measures.

In section 4 we demonstrate the measurement approach with a short example used for illustrative purposes only.

4 Measurement Illustration

A sample dataset of Facebook users is provided in Fig. 4 to illustrate the measurement data collection and analysis procedures.

We first assign unique identifiers to the distinct data elements that need to be distinguished. For instance, Age21, Age19, Age9, Age33, Age22, Age25, Age5, Age26, Age27, Age33, Age35 will represent the different data elements in the Age category, 11 in total. Nationalities can be represented by N_S (for Saudi), N_P (for Pakistan), and N_Y (for Yemen) (3 in total). The 4 different values in the column Own Pic/week can be represented by the labels O7, O2, O3 and O8 correspondingly. The 3 different types of exposure are mapped to E_H, E_M and E_L. The collection St_3, St_12, St_7, St_2, St_8, St_22, St_33 labels the different Status/week data elements, 7 in total. The distinct Ratio values can be mapped to R0.1, R5, R4, R3, R1, R7, R2 correspondingly (7 in total). The union of all unique identifiers forms the set DE = {Age21, Age19, Age9, Age33, Age22, Age25, Age5, Age26, Age27, Age33, Age35, N_S, N_P, N_Y, O7, O2, O3, O8, E_H, E_M, E_L, St_3, St_12, St_7, St_2, St_8, St_22, St_33, R0.1, R5, R4, R3, R1, R7, R2}. Thus, the value of the base measure Ndde (DE) is 35 UIDEs.

Similarly, the 20 records listed in the sample dataset 1 can be labeled as REC1 ... REC20. Thus, the number of distinct records in this dataset (Nrec) is 20 UIDR. The number of datasets in this example is 1 (Nds = 1 UIDDST). The corresponding value of the length of big data (Lbd) is 20 UIDR (the total number of records in MDS).

Fig. 4. Sample Data (dataset 1)

The Big Data Volume indicator (Mvol) is calculated as follows:

Mvol (MDS_T1) = Lbd (MDS_T1) log2 (bin_dec (Ndde (DE_T1))) = 20 log2 (35) ≈ 103 information bits.
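The dataset 1 figures above can be reproduced with the illustrative sketch from section 3; the record contents below are placeholders, since only the counts Lbd = 20 and Ndde = 35 matter for the computation:

```python
import math

# Placeholder records standing in for REC1 ... REC20 of dataset 1 (Fig. 4);
# their element lists are dummies, since only the record count matters here.
records_t1 = [Record(f"REC{i}", [f"el{i}"]) for i in range(1, 21)]
dataset1 = Dataset("DST1", records_t1)
mds_t1 = [dataset1]

ndde_t1 = 35                                 # Ndde(DE) as counted above for the sample data
mvol_t1 = Lbd(mds_t1) * math.log2(ndde_t1)   # 20 * log2(35)
print(Lbd(mds_t1), round(mvol_t1, 1))        # -> 20 102.6  (reported above as ~103 bits)
```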
In order to illustrate the quantification of the Velocity of big data (Mvel) indicator, we need to compare data at two different instances of time, T1 and T2. Assume that the following records (see Fig. 5) were added to dataset 1 by time T2, in addition to the dataset 1 records (see Fig. 4) that were measured at time T1 (T2 > T1):

Fig. 5. Additional Entries to the Sample Data

Lbd (MDS_T2) has increased to 24 records, while the set DE remains the same (DE_T1 = DE_T2). The Volume of big data after the change is:

Mvol (MDS_T2) = Lbd (MDS_T2) log2 (bin_dec (Ndde (DE_T2))) = 24 log2 (35) ≈ 123 information bits.

Therefore, the Velocity of big data (Mvel) shows an increase of:

Mvel (MDS) = (123 − 103) / 103 * 100 ≈ 19.4%,

that is, a growth of roughly 20%, as expected from the increase of Lbd from 20 to 24 records while DE remained unchanged.

In this example, the Variety indicator of big data at time T1, Mvar (MDS_T1), is (35 data elements, 20 records, 1 dataset).

The major requirement in measurement is that the measurements are defined consistently with the entity’s real-world behavior, to ensure that they accurately measure those attributes they purport to quantify. Our view of measurement validation is explained in section 5.

5 Validation of the 3V’s Measures

Validation is critical to the success of big data measurement. Measurement validation is “the act or process of ensuring that (a measure) reliably predicts or assesses a quality factor” [12]. In other words, a given measure is valid if it reflects the real meaning of the concept under consideration and is based on the representational theory of measurement. Two approaches to validation have been prescribed and practiced in software engineering: (a) theoretical validation, and (b) empirical validation [9]. These two types of validation are respectively used to demonstrate that a measure is really measuring the attribute it purports to measure.

5.1 Theoretical Validation vs. Empirical Validation

The 3V’s measures are theoretically validated with respect to the Tracking and Consistency criteria introduced in [12], as described below:

The Tracking Criterion. This criterion assesses whether or not a measurement is capable of tracking changes in product or process quality over the life cycle of that product or process [12]. A change in the attributes at different times should be accompanied by a corresponding change in the measurement data. It can be expressed formally as follows: if a measure M is directly related to a quality characteristic F, for a given product or process, then a change in the quality characteristic value from F_T1 to F_T2, at times T1 and T2, shall be accompanied by a change in the measurement value from M_T1 to M_T2. This change shall be in the same direction (e.g., if F increases, M increases). If M is inversely related to F, then a change in F shall be accompanied by a change in M in the opposite direction (e.g., if F increases, M decreases).

The Consistency Criterion. This criterion assesses whether or not there is consistency between the ranks of the characteristics of big data quality (the 3V’s) and the ranks of the measurement values of the corresponding indicator for the same set. It is used to determine whether or not a measurement can accurately rank, by quality, a set of products or processes [12].
The change of ranks should be in the same direction in both the quality characteristics and the measurement values; that is, the order of preference of the 3V’s will be preserved in the measurement data. This can be expressed as follows: if quality characteristic values F1, F2, …, Fn, corresponding to multiple datasets 1 … n, have the relationship F1 ≻ F2 ≻ … ≻ Fn, then the corresponding indicator values shall have the relationship M1 > M2 > … > Mn. This preservation of the relationship means that the measure must be objective and subjective at the same time: objective in that it does not vary with the measurer, but subjective in that it reflects the intuition of the measurer [9].

Tracking and consistency are a way to validate the representational condition without collecting and analyzing large amounts of measurement data, and thus can be checked manually. Empirical validation is a process for establishing software measurement accuracy by empirical means. Ultimately, it is clear that both theoretical and empirical validation are necessary and complementary.

In this paper, we target manual theoretical validation of the 3V’s measurements. The reason is that the empirical validation of big data’s measurements would require a large amount of measurement data that cannot be collected manually. As stated in the introduction section of this paper, the implementation of the proposed measurement data collection and the empirical validation of the proposed measurement procedures will be tackled in our future work.

5.2 Theoretical Validation of the 3V’s Indicators

Mvol. Based on the meaning of Volume, the more information big data contains (that is, the higher the information content of the big data), the larger the Mvol indicator value. The perception of ‘more’ should be preserved in the mathematics of the measure: if we increase the information content of big data, its Volume will increase, as expected. For instance, Mvol (MDS_T1) = 103 information bits. In the case of MDS_T2, the dataset is larger than that of MDS_T1 and consequently Mvol (MDS_T2) is higher (123 > 103).

Mvel. The above expectation is also confirmed by the indicator Mvel, which shows the anticipated increase (roughly 20%).

Mvar. The Variety of MDS_T2 is not expected to differ much: there was no change in the set DE at time T2, and there is still only one dataset. This is reflected by the indicator values: Mvar (MDS_T2) is (35 UIDE, 24 UIDR, 1 UIDDST), which shows an increase in the MDS length parameter only, as compared to Mvar (MDS_T1).

These calculations establish the theoretical validity of the big data 3V’s measures, as required by the representational theory of measurement. In our future work, the 3V’s measures will be validated empirically through controlled experiments.

6 Conclusion and Future Work

In this paper, we proposed a new measurement information model to quantify three aspects of big data (Volume, Variety, and Velocity), known as the 3V’s indicators. Four levels of entities have been considered, derived from the NIST hierarchy: data element, record, dataset, and multiple datasets. The model elements comply with the ISO/IEC/IEEE Std. 15939 guidelines for their definitions: four base measures are first defined, then assembled into two derived measures, and finally evolved into three indicators, thus the 3V’s. Theoretical validation of these 3V’s has been demonstrated. The model is suitable for big data in any form: structured, unstructured, or semi-structured.
As anyone may inquire about the relevance of such a model for industry, we can respond with simple examples of how these measures and indicators can be used:

• Velocity (Mvel): The data owner can oversee the growth of Volume over time. A slower growth than expected might trigger an investigation, as a source of data could be damaged or unavailable. A faster growth than expected could also be an indication of a corrupted source resending data multiple times. Both cases would affect data quality.
• Volume (Mvol) and its trend: Volume is useful to objectively compare multiple datasets in terms of their information content, as well as to oversee their growth over time.
• Variety (Mvar): Variety reflects the structural heterogeneity at different levels (data elements, records and datasets) and thus allows multiple datasets to be compared easily and objectively in terms of their Ndde (number of distinct data elements), Nrec (number of records) and Nds (number of datasets) values.

Our future research will enhance the theoretical findings presented in this paper with empirical evidence through evaluation of these measures with open-access data and industry data. The automation of the 3V’s measurements is out of the scope of this paper and will be tackled in our future work.

References

1. Agbo, B. et al.: Big Data: The Management Revolution, https://hbr.org/2012/10/big-data-the-management-revolution (2018).
2. McAfee, A., Brynjolfsson, E.: The Second Machine Age, https://www.amazon.com/Second-Machine-Age-Prosperity-Technologies/dp/0393350649; see also: https://www.bigdataframework.org/short-history-of-big-data/, October 2012 issue of Harvard Business Review (2012).
3. Taleb, I., Serhani, M., Dssouli, R.: Big Data Quality: A Survey. https://doi.org/10.1109/BigDataCongress.2018.00029 (2018).
4. Omidbakhsh, M., Ormandjieva, O.: Toward a New Quality Measurement Model for Big Data. In: DATA 2020, The 9th International Conference on Data Science, Technology and Applications (2020).
5. Rouse, M., Agenda, I.: The 4th Industrial Revolution: Hidden Threats to Human and Cybersecurity. Servamus Community-based Safety and Security Magazine (2019).
6. Thompson, K., Mattalo, B.: The Internet of Things: Guidance, Regulation and the Canadian Approach. CyberLex (2016).
7. Hilbert, M., Lopez, P.: The World's Technological Capacity to Store, Communicate and Compute Information. Science 332(6025), 60-65 (2011).
8. Gandomi, A., Haider, M.: Beyond the Hype: Big Data Concepts, Methods, and Analytics. International Journal of Information Management 35(2), 137-144 (2015).
9. Fenton, N., Bieman, J.: Software Metrics: A Rigorous and Practical Approach, 3rd edn. CRC Press. https://doi.org/10.1201/b1746 (2014).
10. ISO/IEC/IEEE 15939:2017, Systems and Software Engineering - Measurement Process (2017).
11. NIST U.S. Department of Commerce, Special Publication 1500-1, NIST Big Data Interoperability Framework: Volume 1, Definitions; Volume 2, Big Data Taxonomies (2018).
12. IEEE Std 1061, IEEE Standard for a Software Quality Metrics Methodology (1998).
13. Al Qutaish, R.E., Abran, A.: An Analysis of the Design and Definitions of Halstead's Metrics. In: 15th Int. Workshop on Software Measurement (IWSM 2005), Shaker-Verlag, 337-352 (2005).