Describe the critical evaluation and literature review of Advanced Database Systems.
The research study examines the use of a data analytics and data management tool designed specifically for incremental computation. The paper focuses on describing the architecture and infrastructural framework of a data manipulation system (Garcia, 2013). Thorough research in this field is necessary because more and more organizations face a growing need to manage the large pool of data originating from the different functions and operations of a business.
The research is based on an evaluation of database architectures and the implementation of a generic framework named Incoop. The significance of this research lies in the larger field of database management and data analytics, which must manage the ever-growing amount of business data and information that forms an integral part of an organization's services and operations (Zhang et al., 2015). The present study focuses on Incoop, an incremental computation system whose design allows computations to respond automatically to input and data updates by reusing the intermediate results of previous runs of the program.
The research followed a thorough process of devising a system to achieve transparency and efficiency in data processing and analytics used to meet business goals. It aims to resolve issues in the computation of input data by using algorithms and constructing programs based on large-scale incremental parallel data processing (Liu and Li, 2015). The researchers conducted this project to develop a framework that significantly improves the efficiency of incremental programs.
The framework helps in processing the large data sets generated by organizations in a distributed computing environment and establishes efficient approaches to incremental computation. The design of Incoop builds on the fundamental aspects of the Hadoop-based data analytics framework for storing and processing data across clusters, handling very large data sets with massive storage capacity (Liu and Li, 2015). The computation is divided into three kinds of tasks: Map tasks, Reduce tasks and Contraction tasks.
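To illustrate the role of the Contraction phase, the following is a minimal Java sketch written for this review, not taken from Incoop's code base: a large reduce input is combined pairwise, level by level, so that partial results can be memoized and reused when only part of the input changes.

import java.util.ArrayList;
import java.util.List;

/** Illustrative contraction: combine partial results level by level, as in a combiner tree. */
public class ContractionSketch {
    /** Associative combine of two partial sums; results at each level can be memoized. */
    static long combine(long left, long right) {
        return left + right;
    }

    /** Contracts a list of partial results pairwise until a single reduce output remains. */
    static long contract(List<Long> partials) {
        List<Long> level = new ArrayList<>(partials);
        while (level.size() > 1) {
            List<Long> next = new ArrayList<>();
            for (int i = 0; i < level.size(); i += 2) {
                long right = (i + 1 < level.size()) ? level.get(i + 1) : 0;
                next.add(combine(level.get(i), right));
            }
            level = next;
        }
        return level.isEmpty() ? 0 : level.get(0);
    }
}

Because the combine step is associative, a change in one partial result only invalidates the path from that leaf to the root, which is the property an incremental reduce exploits.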
The core design of Incoop involves an incremental HDFS (Inc-HDFS), a memoization server and a memoization-aware scheduler. More specifically, three phases are considered in implementing the system in a distributed database environment (Qian et al., 2012). The incremental map phase stores the intermediate results between iterative runs, as mentioned earlier; these results are then kept in the memoization server using hashing techniques.
Incoop therefore provides a memoization-based scheduling technique to enhance the efficiency and transparency of large-scale distributed data processing.
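A minimal Java sketch of the memoization idea follows; the class and method names are hypothetical and this is not Incoop's implementation. Intermediate map outputs are keyed by a hash of the input chunk that produced them, so a later run can reuse a stored result whenever that chunk is unchanged.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical in-memory memoization store keyed by the hash of an input chunk. */
public class MemoizationServer {
    private final Map<String, String> cache = new ConcurrentHashMap<>();

    /** Returns the cached map output for this chunk, or computes and stores it. */
    public String mapWithMemoization(String inputChunk, Function<String, String> mapTask) {
        String key = sha1(inputChunk);
        return cache.computeIfAbsent(key, k -> mapTask.apply(inputChunk));
    }

    private static String sha1(String data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(data.getBytes(StandardCharsets.UTF_8))) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}

In Incoop the memoized results live in a dedicated server and the scheduler places tasks near them; the in-memory map above stands in for that server only to show the keying scheme.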
Current research on the MapReduce paradigm has emphasized large data blocks and data-processing workflows. Existing MapReduce deployments mainly act as an execution layer for other frameworks (Lam et al., 2012). The researchers reviewed two workflows that benefit from efficient incremental processing in the Incoop context; the background of the study deals with these two workflows, namely Incremental Log Processing and Incremental Query Processing.
Figure 3.1: Speedup results for Incremental Log Processing
(Source: Bhatotia et al., 2011, p. 7)
Incremental Log Processing is important for Internet Service Provider (ISP) organizations, whose data logs are analyzed in several ways every day. Click logs and various web logs are stored together in a repository, and the data is processed for purposes such as counting clicks, checking statistics and building click sessions (Yan et al., 2012). Incremental Log Processing is performed with Apache Flume, a distributed, reliable service for collecting and aggregating large volumes of data. The process summarizes the data and stores it in the Inc-HDFS data store. Incoop then starts the data analysis, processing the appended data incrementally and storing the intermediate results for reuse.
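As a concrete illustration of this workflow, the sketch below is a minimal Hadoop MapReduce job, in Java, that counts clicks per URL from log lines. It assumes, purely for illustration, that each log line carries the clicked URL as its first whitespace-separated field; it is not the exact job used in the evaluation.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ClickCount {
    /** Emits (url, 1) for every log line; the clicked URL is assumed to be the first field. */
    public static class ClickMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text url = new Text();
        @Override
        protected void map(LongWritable key, Text line, Context ctx) throws IOException, InterruptedException {
            String[] fields = line.toString().split("\\s+");
            if (fields.length > 0 && !fields[0].isEmpty()) {
                url.set(fields[0]);
                ctx.write(url, ONE);
            }
        }
    }

    /** Sums the click counts per URL. */
    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text url, Iterable<LongWritable> counts, Context ctx) throws IOException, InterruptedException {
            long total = 0;
            for (LongWritable c : counts) total += c.get();
            ctx.write(url, new LongWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "click count");
        job.setJarByClass(ClickCount.class);
        job.setMapperClass(ClickMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Run unmodified under Incoop, such a job would reuse the memoized map outputs for unchanged log chunks and recompute only the chunks affected by newly appended entries.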
The performance evaluation with Flume compares the runtimes of Incoop and Hadoop on the Incremental Log Processing workload across the two frameworks. The test first analyzes an initial log document and then appends new log entries to it; the document is then reprocessed with the incremental approach over the larger data collection. The research reports speedups for Incoop of roughly 4 down to 2.5 relative to the Hadoop framework (Figure 3.1) as the size of the appended log grows from 5% to 25% of the initial log input size.
The researchers analyzed another workflow, Incremental Query Processing, which also derives significant benefits from Incoop. This workflow is important for ISP companies that run the same queries over changing data sets. Incoop was integrated with Pig to analyze the feasibility of incremental query processing (Doulkeridis and Norvag, 2014). Pig is a platform for analyzing large blocks of data built on the Hadoop framework; its high-level query language is similar to SQL. Pig makes large-scale data analysis easy to code and helps ISP companies analyze their information.
Pig programs are compiled into multi-stage MapReduce jobs, which underpin the execution of Pig applications. The applications used for benchmarking effectiveness are word count and PigMix (Kalavri and Vlassov, 2013). The overhead is estimated at about 15% of the first run, and the speedup is estimated at a factor of about three when unmodified input is processed with an incremental run. The results are shown in Table 6.2. The word count application, using the Group_by and Order_by filters, shows a speedup of 2.84 with a 15.65% performance overhead. The PigMix benchmark, using the Group_by feature, shows a speedup of 3.33 with a 14.5% overhead.
Application                         Features                        M/R stages   Overhead   Speedup
Word count                          Group_by and Order_by filter    3            15.65%     2.84
PigMix benchmark for scalability    Group_by feature                1            14.5%      3.33
Table 6.2: Results from Incremental Query Processing
(Source: Bhatotia et al., 2011, p. 7)
There are significant disadvantages to the Incoop-style methodology: such systems depart from the MapReduce programming paradigm and consequently require changes to the substantial existing base of MapReduce programs (Kalavri and Vlassov, 2013). In addition, a major issue is that they require the software engineer to formulate a dynamic algorithm in order to process data efficiently in an incremental way. Accordingly, several techniques exist that address the main drawbacks of this technology, as described below:
i2MapReduce (Incremental Iterative MapReduce): As a rule, changes affect only a small part of the data sets, and the newly converged iterative state is very close to the previously converged state. i2MapReduce exploits this observation to save re-computation by starting from the previously converged state and performing incremental updates on the changed data (Zhang et al., 2015). Several components are associated with the development of this system:
Iterative Processing: A series of distributed frameworks has recently emerged for large-scale iterative computation in the cloud, and several of these systems improve on MapReduce. HaLoop, a modified version of Hadoop, improves the efficiency of iterative computation by making the task scheduler loop-aware and by employing caching mechanisms.
One-time Incremental Processing: Some systems refresh MapReduce results incrementally by adapting view-maintenance techniques, which provides a general solution for the incremental maintenance of MapReduce programs that compute self-maintainable aggregates. In contrast to such one-time computation, i2MapReduce addresses the challenge of supporting incremental processing for iterative computation.
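To make the idea of starting from a previously converged state concrete, the following Java sketch (hypothetical, not the i2MapReduce API) refreshes a per-key aggregate by merging the state saved from the previous run with the deltas produced by the changed portion of the input, rather than recomputing every key from scratch.

import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of one-step incremental maintenance of a per-key sum. */
public class IncrementalAggregate {
    /**
     * previousState : per-key aggregates saved from the last run
     * deltas        : per-key changes (positive or negative) from the modified input
     * Returns the refreshed state; only keys touched by the deltas are updated.
     */
    public static Map<String, Long> refresh(Map<String, Long> previousState, Map<String, Long> deltas) {
        Map<String, Long> updated = new HashMap<>(previousState);
        for (Map.Entry<String, Long> delta : deltas.entrySet()) {
            updated.merge(delta.getKey(), delta.getValue(), Long::sum);
        }
        return updated;
    }
}

The approach assumes the aggregate is self-maintainable, i.e. it can be updated from the old value and the delta alone, which is exactly the class of programs the view-maintenance techniques above target.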
MadLINQ: MadLINQ addresses two critical research issues: the need for a highly scalable, efficient and fault-tolerant matrix computation framework that is also easy to program, and the seamless integration of such specialized engines into a general-purpose data-parallel computing system. MadLINQ exposes a unified programming model to both matrix-algorithm and application developers (Doulkeridis and Norvag, 2014). MadLINQ embeds a set of domain-specific language constructs into a general-purpose programming language (C#), similar to the approach taken by DryadLINQ and FlumeJava for data-parallel programming. This embedding exposes a unified programming model for developing both matrix algorithms and applications. The components of the MadLINQ project are outlined below (a tile-based computation sketch follows the list):
Programmability: MadLINQ expresses tile-based algorithms in a modern language and offers high expressiveness for experimental algorithms.
Execution Model: Dataflow at the tile level, with block-level pipelining across tile execution.
Scalability: No limitation on problem size; performance is bounded by tile-level parallelism and enhanced by block-level pipelining.
Handling of failures: Exact re-computation at the granularity of blocks.
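The sketch below illustrates, in Java rather than MadLINQ's C#, the tile-level decomposition behind this execution model: the matrices are partitioned into square tiles, and each tile of the product is accumulated from the matching row and column of tiles. It is a single-threaded, hypothetical illustration of tiling, not MadLINQ's pipelined implementation.

/** Hypothetical illustration of tile-based matrix multiplication (C = A * B), assuming square n x n matrices. */
public class TiledMatrixMultiply {
    public static double[][] multiply(double[][] a, double[][] b, int tile) {
        int n = a.length;
        double[][] c = new double[n][n];
        // Iterate over tiles of C; each (ti, tj) tile depends on a row of tiles of A
        // and a column of tiles of B, which is the unit a tile-level dataflow schedules.
        for (int ti = 0; ti < n; ti += tile)
            for (int tj = 0; tj < n; tj += tile)
                for (int tk = 0; tk < n; tk += tile)
                    for (int i = ti; i < Math.min(ti + tile, n); i++)
                        for (int j = tj; j < Math.min(tj + tile, n); j++) {
                            double sum = 0;
                            for (int k = tk; k < Math.min(tk + tile, n); k++)
                                sum += a[i][k] * b[k][j];
                            c[i][j] += sum;
                        }
        return c;
    }
}

In a distributed setting each tile-level iteration becomes an independent task, and a failed tile can be recomputed exactly at block granularity, as the list above notes.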
The current emphasis of the systems community on scalable engines such as MapReduce, DryadLINQ and Hive is not accidental: these frameworks provide scale-out implementations of a subset of the most useful relational algebra APIs.
The limitations of the study include the lack of sufficient comparison with other, similar incremental frameworks. Furthermore, the study does not properly explain how the methods are implemented to detect the incremental changes. In particular, the garbage-collection process used in the programming has been criticized by many other researchers. The main technique considered in the research is content-based chunking, which helps detect incremental changes in the input data (Gupta, Kumar and Gopal, 2015); a sketch of this technique follows the list below. The MapReduce programming framework, however, requires creating multiple splits, one per block, for the map tasks. Although the researchers tried to parallelize these powerful data-processing tools, the following main limitations remain:
Stateless: After the map and reduce tasks complete, the outputs are written to a distributed file system and the memoization scheduler is then informed (Holmes, 2012). The intermediate results are subsequently deleted by a cleanup method, so the system must create a new job each time new input data arrives; for this reason the process is referred to as stateless.
Stage independent: The two stages of the process, the map stage and the reduce stage, execute independently of one another. The map stage executes the map method over the allocated input splits (Tan, Meng and Zhang, 2012), and the reduce stage then fetches the intermediate data from local nodes. The tasks in the map and reduce phases therefore execute without depending on each other.
Single step: The order of execution for map and reduce tasks is maintained only once for a particular job. Map tasks may complete at different times, whereas reduce tasks copy the intermediate outputs once the map tasks have completed successfully.
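As noted above, content-based chunking splits the input at positions determined by the content itself rather than at fixed byte offsets, so an insertion early in a file does not shift every later chunk boundary and unchanged chunks keep their hashes. The Java sketch below uses a simple rolling sum as the boundary test; Incoop relies on a more robust fingerprinting scheme, so this is only an illustrative approximation with hypothetical constants.

import java.util.ArrayList;
import java.util.List;

/** Illustrative content-based chunking: boundaries depend on the data, not on fixed offsets. */
public class ContentBasedChunker {
    private static final int WINDOW = 48;            // bytes in the rolling window
    private static final int MASK = (1 << 12) - 1;   // roughly 4 KiB average chunk size
    private static final int MAX_CHUNK = 64 * 1024;  // hard upper bound on chunk size

    public static List<byte[]> chunk(byte[] data) {
        List<byte[]> chunks = new ArrayList<>();
        int start = 0;
        long rolling = 0;
        for (int i = 0; i < data.length; i++) {
            rolling += data[i] & 0xff;
            if (i - start >= WINDOW) rolling -= data[i - WINDOW] & 0xff; // slide the window
            boolean boundary = (rolling & MASK) == MASK || (i - start + 1) >= MAX_CHUNK;
            if (boundary) {
                chunks.add(java.util.Arrays.copyOfRange(data, start, i + 1));
                start = i + 1;
                rolling = 0;
            }
        }
        if (start < data.length) chunks.add(java.util.Arrays.copyOfRange(data, start, data.length));
        return chunks;
    }
}

Because the boundary test depends only on a local window of bytes, an edit can change at most the chunks it touches and their immediate neighbours, which is what allows the memoized results for all other chunks to be reused.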
The study undertook a project aimed at implementing large-scale data processing to achieve significant performance improvements in incremental computation. Since the launch of Hadoop MapReduce systems, there has been a significant amount of research and further progress uncovering the advantages and uses of this technology (Sakr, Liu and Fayoumi, 2013). In particular, fault tolerance is a significant issue that needs to be addressed in the future scope of research in this field. Although fault tolerance is valuable, MapReduce can be taken to the next level by properly addressing this issue; one way to do so is to quantify and balance the trade-off between performance and fault tolerance.
To this end, further study of the Hadoop framework could uncover capabilities that provide automatic fault-tolerance techniques and adjustment methods that depend on cluster characteristics and application programs. Another issue that can be addressed in future studies of this topic (Ghuli et al., 2015) is the lack of a standard benchmark with which the different implementations of the Hadoop framework can be compared effectively; at present the different systems are analyzed on separate data sets, sets of applications and deployments.
As numerous researchers have argued, commercial DBMSs have adopted a "one size fits all" strategy and are not suited to solving extremely large-scale data processing tasks, and there has been demand for special-purpose data processing tools tailored to such problems (Lam et al., 2012). While MapReduce is referred to as a new way of processing big data in data-center computing, it has also been criticized as a "major step backwards" in parallel data processing in comparison with DBMSs.
This study shows a clear trade-off between efficiency and fault tolerance. MapReduce increases the fault tolerance of long-running analyses through frequent checkpointing of completed tasks and data replication, but the frequent I/O required for fault tolerance reduces efficiency. Parallel DBMSs aim for efficiency rather than fault tolerance and effectively exploit pipelining of intermediate results between query operators; this, however, carries the risk that many operations must be redone when a failure occurs. With this central difference in mind, the advantages and disadvantages of the MapReduce system can be categorized as follows:
Usability: The MapReduce model is simple yet expressive. A programmer defines a job with only the Map and Reduce functions, without having to specify the physical distribution of the job across nodes.
Flexible: MapReduce has no dependence on a particular data model or schema (Bhatotia et al., 2014). It therefore helps a programmer handle irregular or unstructured data more easily than is possible with a DBMS.
Independent of storage: MapReduce is basically independent of the underlying storage layers, so it can work with diverse storage layers such as BigTable and others.
Fault tolerance: MapReduce is highly fault tolerant. For instance, it is reported that MapReduce can continue to work despite an average of 1.2 failures per analysis job at Google (Alam and Ahmed, 2014).
High scalability: The greatest advantage of using MapReduce is its high scalability. Yahoo! has stated that its Hadoop cluster could scale out to more than 4,000 nodes in 2008.
No high-level language: MapReduce offers no support for a high-level language such as SQL, nor for the query optimization techniques found in a DBMS. Users must code their operations as Map and Reduce functions.
No schema and no index: MapReduce is free of schemas and indexes. An MR job can run as soon as its data is loaded into the storage layer, but this impromptu processing discards the benefits of data modeling (Dittrich and Quiane-Ruiz, 2012). MapReduce must parse every item when reading the input and transform it into data objects for processing, which degrades performance.
A single fixed dataflow: MapReduce provides ease of use through a simple abstraction, but within a fixed dataflow. Consequently, many complex algorithms are hard to implement with Map and Reduce alone in a single MR job, and algorithms that require multiple inputs are not well supported, since the MapReduce dataflow is designed to read a single input and produce a single output.
Low efficiency: With fault tolerance and scalability as its primary goals, MapReduce operations are not always optimized for I/O efficiency. A transition to the next stage cannot be made until all the tasks of the current stage are finished, so pipeline parallelism cannot be exploited. In addition, block-level restarts, a one-to-one shuffling strategy and simple runtime scheduling can further lower per-node efficiency (Zhang and Chen, 2013). The framework has no specific execution plans and does not apply optimization techniques, such as those a DBMS uses, to minimize data transfer across nodes, so MapReduce often shows poorer performance than a DBMS. The MapReduce framework also has a latency problem that stems from its inherent batch-processing nature: all inputs for an MR job must be prepared in advance.
Conclusion
The MapReduce framework can support incremental processing of input data by storing intermediate results or preserving intermediate state. The map and reduce phases and their corresponding functions address the issues of data-processing efficiency and speed. The purpose of this research was to follow the API parameters and submit jobs as they arrive, processing them without modifying the application programming details or algorithms. The paper carries out a thorough evaluation of the overall performance of the Incoop system with respect to transparency and efficiency. Efficiency here rests on transparency: the abstraction created does not require users to know how the incremental data is processed. The study has therefore successfully evaluated the aspects and functionality of MapReduce as a programming model for processing and analyzing the massive data sets used by industry. In the present study, Incoop is a design that adds an incremental approach to this model; it builds on the Apache Hadoop framework, one of several existing frameworks, and Incoop's inputs are themselves Hadoop-based.
References
Ahmad, F., Chakradhar, S.T., Raghunathan, A. and Vijaykumar, T.N., (2012), March. Tarazu: optimizing MapReduce on heterogeneous clusters. In ACM SIGARCH Computer Architecture News (Vol. 40, No. 1, pp. 61-74). ACM.
Alam, A. and Ahmed, J., (2014), March. Hadoop Architecture and Its Issues. In Computational Science and Computational Intelligence (CSCI), 2014 International Conference on (Vol. 2, pp. 288-291). IEEE.
Bhatotia, P., Wieder, A., Acar, U.A. and Rodrigues, R., (2014). Incremental MapReduce. In Large Scale and Big Data: Processing and Management, p. 127.
Bhatotia, P., Wieder, A., Rodrigues, R., Acar, U.A. and Pasquin, R., (2011), October. Incoop: MapReduce for incremental computations. In Proceedings of the 2nd ACM Symposium on Cloud Computing (p. 7). ACM.
Dittrich, J. and Quiane-Ruiz, J.A., (2012). Efficient big data processing in Hadoop MapReduce. Proceedings of the VLDB Endowment, 5(12), pp. 2014-2015.
Doulkeridis, C., and Norvag, K. (2014). A survey of large-scale analytical query processing in MapReduce. The VLDB Journal—The International Journal on Very Large Data Bases, 23(3), 355-380.
Garcia, C. (2013). Demystifying MapReduce. Procedia Computer Science, 20, pp.484-489.
Ghuli, P., Shukla, A., Kiran, R., Jason, S. and Shettar, R., (2015). Multidimensional Canopy Clustering on Iterative MapReduce Framework Using Elefig Tool. IETE Journal of Research, 61(1), pp.14-21.
Gupta, P., Kumar, P. and Gopal, G., (2015). Sentiment Analysis on Hadoop with Hadoop Streaming. International Journal of Computer Applications, 121(11), pp. 4-8.
Holmes, A., (2012). Hadoop in practice. Manning Publications Co.
Kalavri, V. and Vlassov, V., (2013), July. Mapreduce: Limitations, optimizations and open issues. In Trust, Security and Privacy in Computing and Communications (TrustCom), 2013 12th IEEE International Conference on (pp. 1031-1038). IEEE.
Lam, W., Liu, L., Prasad, S. T. S., Rajaraman, A., Vacheri, Z., and Doan, A. (2012). Muppet: MapReduce-style processing of fast data. Proceedings of the VLDB Endowment, 5(12), 1814-1825.
Liu, Q. and Li, X. (2015). A New Parallel Item-Based Collaborative Filtering Algorithm Based on Hadoop. JSW, 10(4), pp.416-426.
Markonis, D., Schaer, R., Eggel, I., Muller, H. and Depeursinge, A., (2012), September. Using MapReduce for large-scale medical image analysis. In 2012 IEEE Second International Conference on Healthcare Informatics, Imaging and Systems Biology (p. 1). IEEE.
Qian, Z., Chen, X., Kang, N., Chen, M., Yu, Y., Moscibroda, T. and Zhang, Z., (2012), April. MadLINQ: large-scale distributed matrix computation for the cloud. In Proceedings of the 7th ACM European Conference on Computer Systems (pp. 197-210). ACM.
Sakr, S., Liu, A. and Fayoumi, A.G., (2013). The family of MapReduce and large-scale data processing systems. ACM Computing Surveys (CSUR),46(1), p.11.
Schildgen, J., Jorg, T., Hoffmann, M. and Debloch, S., (2014), June. Marimba: A Framework for Making MapReduce Jobs Incremental. In Big Data (BigData Congress), 2014 IEEE International Congress on (pp. 128-135). IEEE.
Song, J., Guo, C., Zhang, Y., Zhu, Z. and Yu, G., (2015). Research on MapReduce Based Incremental Iterative Model and Framework. IETE Journal of Research, 61(1), pp.32-40.
Tan, J., Meng, X. and Zhang, L., (2012), June. Coupling scheduler for mapreduce/hadoop. In Proceedings of the 21st international symposium on High-Performance Parallel and Distributed Computing (pp. 129-130). ACM.
Varian, H.R., (2014). Big data: New tricks for econometrics. The Journal of Economic Perspectives, pp.3-27.
Wang, L., Tao, J., Ranjan, R., Marten, H., Streit, A., Chen, J. and Chen, D., (2013). G-Hadoop: MapReduce across distributed data centers for data-intensive computing. Future Generation Computer Systems, 29(3), pp.739-750.
Yan, C., Yang, X., Yu, Z., Li, M. and Li, X., (2012), June. Incmr: Incremental data processing based on mapreduce. In Cloud Computing (CLOUD), 2012 IEEE 5th International Conference on (pp. 534-541). IEEE.
Yao, H., Xu, J., Luo, Z. and Zeng, D., (2015). MEMoMR: Accelerate MapReduce via reuse of intermediate results. Concurrency and Computation: Practice and Experience.
Yin, J., Liao, Y., Baldi, M., Gao, L. and Nucci, A., (2013), June. Efficient analytics on ordered datasets using MapReduce. In Proceedings of the 22nd international symposium on High-performance parallel and distributed computing (pp. 125-126). ACM.
Zaharia, M., Borthakur, D., Sarma, J.S., Elmeleegy, K., Shenker, S. and Stoica, I., (2012). Job scheduling for multi-user MapReduce clusters.
Zhang, Q., Gao, Y., Chen, Z. and Zhang, X. (2015). Scheduling Optimization Algorithm Based on Hadoop. JACN, 3(3), pp.197-200.
Zhang, Y. and Chen, S., (2013), August. i2MapReduce: incremental iterative MapReduce. In Proceedings of the 2nd International Workshop on Cloud Intelligence (p. 3). ACM.
Zhang, Y., Chen, S., Wang, Q. and Yu, G., (2015). i2MapReduce: Incremental MapReduce for Mining Evolving Big Data.