With the emergence of new and faster technologies it has become easier to record data for the efficiency of the business, but as collections grow larger and larger, businesses face the problem of data storage (Chen, Mao & Liu, 2014). Over time this growing data is moving from terabytes to petabytes in size. It is difficult to describe or define in exact terms, yet Big Data can be defined as growing, voluminous data that has the potential to be mined for information from which a business can gain much. Big Data can be described as huge amounts of structured, unstructured, semi-structured, real-time data, metadata, or data at rest (Kitchin, 2014). The term covers only those data sets that are very difficult to handle or process with the tools traditionally used for analysis on a single computer.
The term is also hard to define because Big Data may differ from one organization or business to another. For a small organization using fairly simple tools, a data set of a terabyte may be considered Big Data, while for companies with highly efficient processing tools, a terabyte of data might not be considered Big Data at all.
In modern terminology, Big Data can be defined as large data sets that are incredibly high in volume and velocity and vary greatly, and that can only be processed with the help of new and highly efficient technologies (Jagadish et al., 2014). These technologies and techniques help in capturing, storing, processing and analyzing the data so as to enhance decision making and optimization, supporting the organization with the insight they provide.
Such data are present all around us and are being collected by many businesses and organizations to gain profit and advantage in the market. The data can be our last purchase details, or even the bookmarked URLs on our PCs. Before the term Big Data was coined, all the data being collected were simply sets of files on a company’s servers. Today, however, everything from our regular purchases to the search history on our personal computers has turned into Big Data (Hashem et al., 2015). The main objective of collecting Big Data is to provide much better services to clients and consumers. It is essentially used to capture and analyze results in order to make future predictions and to project the company’s profits or losses.
When the size of data increases enormously, analysis becomes immensely difficult and requires advanced technological help. The data can neither be estimated nor manipulated with ordinary spreadsheet or database tools. It therefore requires specialized tools that provide the most appropriate projections and predictions.
The three V’s of Big Data are the characteristics that distinguish Big Data from conventional data processing. These three V’s help define Big Data more accurately and precisely.
Velocity: Another way in which Big Data differs significantly from other data systems is the speed at which data moves through the system (George, Haas, & Pentland, 2014). Data is frequently streaming into the system from multiple sources and is often expected to be processed in real time to gain insights and to update the current understanding of the system.
This focus on near-instant feedback has driven many Big Data practitioners away from a batch-oriented approach and towards real-time streaming systems. Data is constantly being added, cleaned, processed, and analyzed in order to keep up with the influx of new information and to surface valuable insights early, when they are most relevant (Hu et al., 2014). These ideas require robust systems with highly available components to guard against failures along the data pipeline.
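To make the streaming idea concrete, here is a minimal Python sketch (not taken from any of the cited systems) that updates a running aggregate as each simulated event arrives, instead of waiting for a complete batch. The event generator and its fields are invented for illustration only.

```python
import random
import time

def event_stream(n):
    """Simulate a stream of incoming purchase events (hypothetical data)."""
    for _ in range(n):
        yield {"item": random.choice(["book", "phone", "laptop"]),
               "amount": random.randint(5, 500)}
        time.sleep(0.01)  # events arrive continuously, not as one batch

# Running totals are updated per event, so insight is available immediately
totals = {}
for event in event_stream(100):
    totals[event["item"]] = totals.get(event["item"], 0) + event["amount"]
    # A production system would also checkpoint this state so a component
    # failure along the pipeline does not lose the running aggregate

print(totals)
```

A real deployment would replace this loop with a streaming framework, but the principle is the same: state is maintained incrementally rather than recomputed from a full data set.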
Volume: The sheer scale of the information being processed at any given time. Such datasets can be orders of magnitude larger than traditional datasets, which demands more thought at every single stage of the processing and storage life cycle. Often, because the work requirements exceed the capabilities of a single computer, this becomes a challenge of pooling, allocating, and coordinating resources from clusters of computers (Assunção et al., 2015). Cluster management and algorithms capable of breaking tasks into smaller pieces become increasingly important. Thus volume characterizes the scale of the data.
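As a single-machine analogy for this divide-and-coordinate pattern, the hedged Python sketch below splits a computation into chunks, hands them to a pool of worker processes, and combines the partial results. The dataset and worker count are made up; a real cluster scheduler would coordinate this across many machines rather than local processes.

```python
from multiprocessing import Pool

def partial_stats(chunk):
    """Each worker summarizes only its own slice of the data."""
    return sum(chunk), len(chunk)

if __name__ == "__main__":
    values = list(range(1_000_000))           # stand-in for a large dataset
    workers = 4
    chunks = [values[i::workers] for i in range(workers)]
    with Pool(workers) as pool:
        partials = pool.map(partial_stats, chunks)
    # Combining partial results correctly is the coordinator's job:
    # a mean cannot be averaged per chunk, so sums and counts travel together
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    print("mean =", total / count)
```

The design point is that each piece of work is small enough for one worker, and only compact summaries are shipped back for combination.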
Variety: Big Data problems are often unique because of the wide range of sources being processed and their relative quality.
Data can be ingested from internal systems like application and server logs, from social media feeds and other external APIs, from physical device sensors, and from other providers. Big Data seeks to handle potentially useful data regardless of where it comes from by consolidating all information into a single system.
The formats and types of media can vary significantly as well. Rich media such as images, video files, and audio recordings are ingested alongside text files, structured logs, and so on (Wamba et al., 2015). Whereas more traditional data processing systems may expect data to enter the pipeline already labeled, formatted, and organized, Big Data systems usually accept and store data closer to its raw state. Ideally, any transformations or changes to the raw data happen in memory at the time of processing.
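The following Python sketch illustrates this "store raw, structure at processing time" idea under invented assumptions: records from three hypothetical sources are kept in their original form and only normalized into a common shape when they are read.

```python
import csv
import io
import json

# Hypothetical raw records from three different sources, stored unchanged
raw_store = [
    ("json", '{"user": "alice", "action": "purchase", "amount": 42}'),
    ("csv",  "bob,view,0"),
    ("log",  "2017-04-29 10:31:02 user=carol action=search"),
]

def normalize(kind, payload):
    """Transform a raw record into a common shape at read time."""
    if kind == "json":
        return json.loads(payload)
    if kind == "csv":
        user, action, amount = next(csv.reader(io.StringIO(payload)))
        return {"user": user, "action": action, "amount": int(amount)}
    if kind == "log":
        return dict(p.split("=") for p in payload.split() if "=" in p)
    raise ValueError(kind)

for kind, payload in raw_store:   # raw data stays as-is; parsing happens here
    print(normalize(kind, payload))
```

Because nothing is discarded or reshaped at ingestion time, new normalization rules can be applied later to the same raw records.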
Hadoop is one such technology, and it is generally the software most commonly associated with Big Data. Apache calls it “a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.” Just as Big Data can be both a noun and a verb, Hadoop involves something that is and something that does, namely data storage and data processing (Demchenko, De Laat & Membrey, 2014). Both happen in a distributed fashion to improve efficiency and results. A set of tasks known as MapReduce coordinates the processing of data in different segments of the cluster and then breaks the results down into more manageable chunks, which are summarized.
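To show the shape of a MapReduce computation, here is a minimal word-count sketch in Python that imitates the map, shuffle/sort, and reduce phases in a single process. On a real cluster, Hadoop distributes the map and reduce functions across nodes and performs the sort itself (for example via Hadoop Streaming), so this is a local illustration only; the input lines are invented.

```python
from itertools import groupby

def mapper(lines):
    """Map phase: emit a (word, 1) pair for every word seen."""
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    """Reduce phase: sum the counts for each word after the shuffle/sort."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    lines = ["hadoop stores data", "hadoop processes data"]
    shuffled = sorted(mapper(lines))      # the framework sorts by key
    for word, total in reducer(shuffled):
        print(word, total)
```

Each phase sees only key-value pairs, which is what lets the framework split the work across many machines and recombine the summarized results.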
The Hadoop ecosystem includes both official Apache open source projects and a wide variety of commercial tools and solutions. Some of the best-known open source examples include Spark, Hive, Pig, Oozie and Sqoop. Commercial Hadoop offerings are even more diverse and include platforms and packaged distributions from vendors, as well as tools for specific Hadoop development, production, and maintenance tasks.
Spark is both a programming model and a computing model. It provides a gateway to in-memory computing for Hadoop, which is a major reason for its popularity and wide adoption. Spark offers an alternative to MapReduce that enables workloads to execute in memory rather than on disk. Spark accesses data from HDFS but bypasses the MapReduce processing framework, and thus eliminates the resource-intensive disk operations that MapReduce requires (Kumar et al., 2014). By using in-memory computing, Spark workloads typically run between 10 and 100 times faster than they would on disk.
Figure 1: Spark Computing System
Source: (“Introduction to Apache Spark Part 1 – Mammatus”, 2017)
Spark can be used independently of Hadoop. However, it is most commonly used with Hadoop as an alternative to MapReduce for data processing. Spark can easily coexist with MapReduce and with other ecosystem components that perform other tasks.
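As a small, hedged illustration of the in-memory model, the PySpark sketch below builds an RDD from invented input lines, caches the word counts, and reuses them across two actions so the second action avoids recomputation. It assumes a local pyspark installation; the application name and data are placeholders.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "InMemoryExample")

lines = sc.parallelize(["spark keeps data in memory",
                        "spark avoids repeated disk reads"])
words = lines.flatMap(lambda line: line.split())
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

counts.cache()            # keep this RDD in memory for reuse
print(counts.collect())   # first action computes and caches the result
print(counts.count())     # second action reuses the cached data
sc.stop()
```

The cache() call is the point of the example: intermediate results stay in memory between actions, which is what spares Spark the repeated disk reads and writes that a chain of MapReduce jobs would incur.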
Hive is data warehousing software that addresses how data is structured and queried in distributed Hadoop clusters. Hive is also a popular development environment used to write queries for data in the Hadoop environment. It provides tools for ETL operations and brings some SQL-like capabilities to the environment (Polato et al., 2014). Hive is a declarative language used to develop applications for the Hadoop environment; however, it does not support real-time queries.
Figure 2: Hive Data Warehouse
Source: (“Hadoop Hive Architecture, Data Modeling & Working Modes”, 2017)
Hive also permits MapReduce-compatible mapping and reduction code to perform more sophisticated functions. However, Hive does not allow row-level updates or support real-time queries, and it is not intended for OLTP workloads.
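The sketch below shows what typical declarative HiveQL looks like; the table, columns, and file names are all invented. The script is written out from Python so that, in principle, it could be run on a configured cluster with the Hive CLI (for example, hive -f sales_report.hql).

```python
# HiveQL embedded as a Python string; everything named here is hypothetical.
hiveql = """
CREATE TABLE IF NOT EXISTS sales (
    item    STRING,
    amount  DOUBLE,
    sold_on STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Declarative, SQL-like: we state the result we want, and Hive
-- compiles it into batch jobs over the cluster
SELECT item, SUM(amount) AS total
FROM sales
GROUP BY item
ORDER BY total DESC;
"""

with open("sales_report.hql", "w") as f:
    f.write(hiveql)
print("wrote sales_report.hql")
```

Note that the query only describes the desired result; Hive decides how to execute it, which is exactly why it suits batch analytics rather than OLTP-style row updates.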
Pig is a procedural language for developing parallel processing applications for large data sets in the Hadoop environment. Pig is an alternative to Java programming for MapReduce, and it automatically generates MapReduce functions. Pig includes Pig Latin, a scripting language. Pig translates Pig Latin scripts into MapReduce jobs, which can then run on YARN and process data in the HDFS cluster (Landset et al., 2015). Pig is popular because it automates some of the complexity of MapReduce development.
Figure 3: Pig: Analytics tool
Source: (Spideropsnet.com, 2017)
Pig is often used for complex use cases that require multiple data operations. It is more of a processing language than a query language. Pig builds applications that aggregate and sort data, and it supports multiple data sources and exports. It is highly adaptable, because users can write their own functions using their preferred scripting language.
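For contrast with Hive’s declarative style, here is a hedged sketch of a procedural Pig Latin script with an invented input path and schema, again written out from Python. On a configured cluster it could be run with the Pig CLI (for example, pig sales_report.pig), and Pig would translate each step into MapReduce jobs.

```python
# Pig Latin embedded as a Python string; the path and fields are hypothetical.
pig_latin = """
sales   = LOAD 'sales.csv' USING PigStorage(',')
          AS (item:chararray, amount:double);
-- Procedural: each statement names an intermediate relation,
-- building the data flow one explicit step at a time
by_item = GROUP sales BY item;
totals  = FOREACH by_item GENERATE group AS item,
                                   SUM(sales.amount) AS total;
ranked  = ORDER totals BY total DESC;
STORE ranked INTO 'sales_totals';
"""

with open("sales_report.pig", "w") as f:
    f.write(pig_latin)
print("wrote sales_report.pig")
```

Where the HiveQL example stated a single result, this script spells out the sequence of operations, which is why Pig is described above as a processing language rather than a query language.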
References
Assunção, M. D., Calheiros, R. N., Bianchi, S., Netto, M. A., & Buyya, R. (2015). Big Data computing and clouds: Trends and future directions. Journal of Parallel and Distributed Computing, 79, 3-15.
Chen, M., Mao, S., & Liu, Y. (2014). Big data: A survey. Mobile Networks and Applications, 19(2), 171-209.
Demchenko, Y., De Laat, C., & Membrey, P. (2014, May). Defining architecture components of the Big Data Ecosystem. In Collaboration Technologies and Systems (CTS), 2014 International Conference on (pp. 104-112). IEEE.
George, G., Haas, M. R., & Pentland, A. (2014). Big data and management. Academy of Management Journal, 57(2), 321-326.
Hadoop Hive Architecture, Data Modeling & Working Modes. (2017). A4academics.com. Retrieved 29 April 2017, from https://a4academics.com/tutorials/83-hadoop/836-hadoop-hive
Hashem, I. A. T., Yaqoob, I., Anuar, N. B., Mokhtar, S., Gani, A., & Khan, S. U. (2015). The rise of “big data” on cloud computing: Review and open research issues. Information Systems, 47, 98-115.
Hu, H., Wen, Y., Chua, T. S., & Li, X. (2014). Toward scalable systems for big data analytics: A technology tutorial. IEEE Access, 2, 652-687.
Introduction to Apache Spark Part 1 – Mammatus. (2017). Mammatustech.com. Retrieved 29 April 2017, from https://www.mammatustech.com/introduction-to-apache-spark
Jagadish, H. V., Gehrke, J., Labrinidis, A., Papakonstantinou, Y., Patel, J. M., Ramakrishnan, R., & Shahabi, C. (2014). Big data and its technical challenges. Communications of the ACM, 57(7), 86-94.
Kitchin, R. (2014). The real-time city? Big data and smart urbanism. GeoJournal, 79(1), 1-14.
Kumar, R., Gupta, N., Charu, S., & Jangir, S. K. (2014). Architectural paradigms of Big Data. In National Conference on Innovation in Wireless Communication and Networking Technology 2014, in association with The Institution of Engineers (India).
Landset, S., Khoshgoftaar, T. M., Richter, A. N., & Hasanin, T. (2015). A survey of open source tools for machine learning with big data in the Hadoop ecosystem. Journal of Big Data, 2(1), 24.
Polato, I., Ré, R., Goldman, A., & Kon, F. (2014). A comprehensive view of Hadoop research—A systematic literature review. Journal of Network and Computer Applications, 46, 1-25.
Spideropsnet.com. (2017). Retrieved 29 April 2017, from https://spideropsnet.com/site1/wp-content/uploads/2014/09/pigFlow.png
Wamba, S. F., Akter, S., Edwards, A., Chopin, G., & Gnanzou, D. (2015). How ‘big data’can make big impact: Findings from a systematic review and a longitudinal case study. International Journal of Production Economics, 165, 234-246.