Task 1.1
Data Warehouse: A data warehouse is a relational database designed for querying and data analysis rather than for routine transaction processing. It stores historical data gathered from transactional systems and other sources, which allows an organization to separate the analysis workload from the transactional workload on its servers (Kimball 2013). Beyond its analysis capabilities, a data warehouse environment also provides solutions for data extraction, transportation, transformation and loading. It typically includes an online analytical processing (OLAP) engine together with client data analysis tools and applications that gather information and deliver it to users. A data warehouse is designed to help analyze the information collected from these sources: to learn more about a particular department, an organization can invest in a data warehouse that analyzes the information collected from that department (Vaisman and Zimányi 2014). This ability to analyze information section by section is what makes the warehouse subject oriented. A data warehouse is therefore subject oriented, integrated, time variant and nonvolatile. To implement a data warehouse correctly, an organization must follow a sound design methodology.
Data Lake: A data lake is a newer generation of data storage developed to meet emerging trends in data analysis. It can be described as a staging area for data collected from online sources for later analysis by the organization. The collected data is simply dropped into the data lake together with a unique identifier describing the data it holds; this identifier can be thought of as a metadata tag for the collected information (Miloslavskaya and Tolstoy 2016). When analysis is carried out, the identifiers are called upon through a query, the relevant information is retrieved and the result is returned; the fetched data is then analyzed and a decision is produced. The term data lake was coined in the context of Hadoop-oriented object storage. Using a data lake can provide effective information during data analysis or when data mining is performed on the organization's data (Fang 2015). The concept is a new trend in the digital world and is slowly being accepted. Because a data lake is one large store of information, there is no need to follow a schema when designing its storage facility.
Data Mart: A data mart is a small version of a data warehouse used by a particular group of workers to store the data they analyze. The term is often confused with data warehouse, but the two are quite different (Ramos, Alturas and Moro 2017). They may do similar work, but the working environment is different: a larger organization always has the option of using a full data warehouse. The data mart concept is newer and is slowly being accepted into the digital world (Golfarelli and Rizzi 2013).
Task 1.2
Data Warehouse: Data is stored in a data warehouse at a very granular level of detail. During analysis, all information related to a query is extracted, transformed and loaded: the information is first extracted from the sources and converted into a common format that the warehouse can read (Ross et al. 2014), and the revised information is then loaded into the database for further analysis. When a query is sent to the data warehouse, it locates and retrieves the relevant data and presents it in an integrated view for the user. A warehouse provides better query support than a traditional database, with enhanced spreadsheet functions, structured and faster query processing, data mining and efficient viewing; the enhanced spreadsheet functions help the organization view the analyzed data more clearly. An organization should adopt a data warehouse to perform competitive and comparative historical analysis, obtain real-time analysis of its financial information, simplify its data processing methods, identify market trends and reduce operational costs (Kimball and Ross 2013). Most organizations can benefit from the use of a data warehouse.
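The extract, transform and load cycle described above can be illustrated with a minimal sketch. This is not part of the original report: the file names, column names and the SQLite target table are all hypothetical, and the snippet only shows the general shape of pulling data from two transactional sources, mapping them onto one common format and loading the result into a warehouse table.

```python
# Minimal ETL sketch (illustrative only; all names are assumptions).
import pandas as pd
import sqlite3

# Extract: pull raw data from two hypothetical transactional sources
orders = pd.read_csv("orders_system_a.csv")      # e.g. columns: order_id, date, amount
invoices = pd.read_csv("invoices_system_b.csv")  # e.g. columns: inv_id, inv_date, total

# Transform: map both sources onto one common schema
orders = orders.rename(columns={"order_id": "id", "date": "sale_date", "amount": "value"})
invoices = invoices.rename(columns={"inv_id": "id", "inv_date": "sale_date", "total": "value"})
combined = pd.concat([orders, invoices], ignore_index=True)
combined["sale_date"] = pd.to_datetime(combined["sale_date"])

# Load: write the unified data into the warehouse table
with sqlite3.connect("warehouse.db") as conn:
    combined.to_sql("fact_sales", conn, if_exists="append", index=False)
```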
Data Lake: A data lake helps an organization analyze data of widely different variety and volume (O'Leary 2014). To implement a data lake successfully, an organization has to use different tools to collect information from multiple data sources, and it has to keep the collection domain specific: searching for information across different departments would cause confusion, because the only identifier of the data is a metadata tag. There should also be automated management of the metadata, so that the data lake can scan incoming information, classify it into categories, tag it and store it in the database (Roski, Bo-Linn and Andrews 2014). Following these steps, an organization will be able to implement a data lake. The schema that a traditional database follows is absent in a data lake, which makes the implementation easier, and analysis on an experimental basis can also be carried out on the data stored in the lake.
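The metadata-tagging idea behind a data lake can be sketched in a few lines. Again, this is only an illustration under assumed names: the directory layout, the JSON catalogue and the ingest/find helpers are hypothetical, but they mirror the steps described above of dropping data into the lake unchanged, attaching a unique identifier with tags, and later querying by tag.

```python
# Sketch of ingesting files into a data lake with metadata tags (assumed layout).
import json
import shutil
import uuid
from pathlib import Path

LAKE_DIR = Path("data_lake/raw")
CATALOG = Path("data_lake/catalog.json")

def ingest(source_file: str, domain: str, tags: list[str]) -> str:
    """Copy a file into the lake unchanged and record its metadata."""
    LAKE_DIR.mkdir(parents=True, exist_ok=True)
    file_id = str(uuid.uuid4())                      # unique identifier for the object
    target = LAKE_DIR / f"{file_id}_{Path(source_file).name}"
    shutil.copy(source_file, target)

    catalog = json.loads(CATALOG.read_text()) if CATALOG.exists() else []
    catalog.append({"id": file_id, "path": str(target),
                    "domain": domain, "tags": tags})  # metadata tag for later queries
    CATALOG.write_text(json.dumps(catalog, indent=2))
    return file_id

def find(tag: str) -> list[dict]:
    """Query the catalogue for objects carrying a given tag."""
    catalog = json.loads(CATALOG.read_text()) if CATALOG.exists() else []
    return [entry for entry in catalog if tag in entry["tags"]]
```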
Data Mart: Because a data mart is targeted at a single department of an organization, data analysis on the information it stores is easier. A large organization can save resources and time by analyzing its information department by department (Rahman, Riyadi and Prasetyo 2015), and the results of those analyses can be combined into more detailed information. The data mart uses the OLAP features of the data warehouse to analyze the information. Using data marts is helpful because the load of analyzing the data warehouse is shared among them: they provide separately authorizable subsets of the warehouse, they can be used to analyze the return on investment of a department, and they save time in the analysis of the information (Zhu et al. 2015). If a data mart is not used in the right manner, however, the whole data warehouse can collapse.
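As a rough sketch of how a departmental data mart relates to the warehouse, the snippet below filters a hypothetical warehouse fact table down to one department's rows and stores the pre-aggregated subset as its own mart table. The table and column names, including the department column, are assumptions made purely for illustration.

```python
# Sketch of deriving a departmental data mart from a hypothetical warehouse table.
import pandas as pd
import sqlite3

with sqlite3.connect("warehouse.db") as conn:
    # Read the (assumed) warehouse fact table
    sales = pd.read_sql("SELECT * FROM fact_sales", conn)

    # Keep only the rows belonging to one department and pre-aggregate them
    marketing = sales[sales["department"] == "marketing"]
    summary = marketing.groupby("sale_date", as_index=False)["value"].sum()

    # Load the subset into its own mart table for the departmental analysts
    summary.to_sql("mart_marketing_sales", conn, if_exists="replace", index=False)
```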
Task 2.1
The following set of images shows the charts created using Rapid Miner:
The report targets the analysis of the top five variables that determine the quality of the white wine. Quality has been plotted against the other variables of the wine samples in a set of scatter graphs, with quality on the x-axis and each remaining variable on the y-axis. The plots vary in color according to position: the points farthest from the origin and the axes are colored red, then green and yellow, and the closest points are colored blue, so the strongest points in each graph are the red spots. Analyzing the graphs, alcohol, sulphates, total sulfur dioxide, fixed acidity and volatile acidity appear to be the top five variables for determining the quality of the white wine; the graphs with the largest number of red spots are quality vs alcohol, quality vs sulphates, quality vs total sulfur dioxide, quality vs fixed acidity and quality vs volatile acidity. This is the initial assumption drawn from the graphs.
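A minimal sketch of how the same quality-versus-variable scatter plots could be reproduced outside Rapid Miner is given below, assuming the standard semicolon-separated winequality-white.csv file is available locally; the color grading used in the Rapid Miner charts is not reproduced here.

```python
# Scatter plots of quality against the five candidate variables (assumed file name).
import pandas as pd
import matplotlib.pyplot as plt

wine = pd.read_csv("winequality-white.csv", sep=";")

variables = ["alcohol", "sulphates", "total sulfur dioxide",
             "fixed acidity", "volatile acidity"]

fig, axes = plt.subplots(1, len(variables), figsize=(20, 4), sharex=True)
for ax, var in zip(axes, variables):
    # quality on the x-axis, the candidate variable on the y-axis
    ax.scatter(wine["quality"], wine[var], s=8, alpha=0.4)
    ax.set_xlabel("quality")
    ax.set_ylabel(var)
fig.tight_layout()
plt.show()
```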
After building the correlation table, the top five variables for improving the quality of the wine became clearer. The higher the attribute weight, the better suited the variable is for determining the quality of the wine. From the table, the top five variables are alcohol, density, chlorides, volatile acidity and total sulfur dioxide; these are the five variables that can be used to determine the quality of the white wine.
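The correlation check can be sketched in the same way: rank the input variables by the absolute value of their correlation with quality and keep the five strongest. The file name is again the assumed winequality-white.csv.

```python
# Rank variables by the strength of their correlation with quality (assumed file name).
import pandas as pd

wine = pd.read_csv("winequality-white.csv", sep=";")

correlations = wine.corr(numeric_only=True)["quality"].drop("quality")
top_five = correlations.abs().sort_values(ascending=False).head(5)
print(top_five)
# For this data set the strongest correlates are typically alcohol, density,
# chlorides, volatile acidity and total sulfur dioxide, matching the table above.
```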
Task 2.2
The linear regression table has been created in Rapid Miner and the results have been shared below:
To build the final regression model, the following steps were followed in Rapid Miner:
Figure 17: The chart showing the summary table of the results from the linear regression.
The linear regression equation is:
Y=A + BX1 + CX2 + DX3 + EX4 + FX5 + GX6 + HX7 + IX8 + JX9 + KX10
Here in the equation, Y = quality, A is the intercept, and B to K are the regression coefficients of the ten independent variables X1 to X10, namely fixed acidity, volatile acidity, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates and alcohol.
Therefore, the linear regression equation becomes:
Quality = 149.900 + 0.066*fixed acidity – 1.868*volatile acidity + 0.081*residual sugar – 0.234*chlorides + 0.004*free sulfur dioxide – 0.000*total sulfur dioxide – 149.986*density + 0.684*pH + 0.632*sulphates + 0.194*alcohol
Now, if all the terms other than the intercept (A) and BX1 are zero, the equation becomes:
Y=A + BX1
If A=0, then Y = BX1
From the above equation it is clear that Y is directly proportional to X1; in other words, the quality of the wine is directly related to each variable through its coefficient. When a coefficient is positive, a unit increase in that variable increases the quality by the value of the coefficient, and when the coefficient is negative, the quality decreases in the same manner.
Substituting the values into Y = BX1 gives Quality = 0.066*fixed acidity, which shows that for a single-unit increase in fixed acidity the quality of the wine increases by 0.066.
Following the same process with Y = CX2 gives Quality = -1.868*volatile acidity, which shows that for a unit increase in the volatile acidity of the wine the quality drops by 1.868.
For the given data set of white wine samples, the quality of the white wine increases with fixed acidity, residual sugar, free sulfur dioxide, pH, sulphates and alcohol, and decreases with increases in volatile acidity, chlorides, total sulfur dioxide and density.
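As a rough cross-check of the Rapid Miner output, the same kind of linear regression can be fitted with scikit-learn, assuming the winequality-white.csv file used earlier. The coefficients obtained this way should be close to, though not necessarily identical with, the figures quoted above.

```python
# Fit quality on the ten physico-chemical variables (assumed file name).
import pandas as pd
from sklearn.linear_model import LinearRegression

wine = pd.read_csv("winequality-white.csv", sep=";")
X = wine.drop(columns=["quality"])
y = wine["quality"]

model = LinearRegression().fit(X, y)

print("intercept:", round(model.intercept_, 3))
for name, coef in zip(X.columns, model.coef_):
    # Positive coefficients raise predicted quality; negative ones lower it.
    print(f"{name}: {coef:+.3f}")
```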
The data set provided describes the snowfall at Whistler, BC, Canada, along with other relevant weather information from 1972 to 2009. The table provides daily records of the maximum, minimum and mean temperatures and of the total amounts of rainfall, snowfall and precipitation for the area. The following diagram gives a view of the different graphs created in Tableau:
The above figure shows the total snowfall, total rain, total precipitation, average minimum temperature, average maximum temperature and average temperature over the period of years. The average snowfall throughout the period has been over 600 cm, and 1999 had the highest snowfall recorded in the area, at 1,564 cm. A slight decrease in rainfall was observed between 1986 and 2004, and there was again a large decrease in rainfall after 2005. The highest recorded rainfall was 735.9 mm, and the period had an average rainfall of about 700 mm. The average minimum temperature recorded for the area was below 0°C, the lowest average minimum being around -9.529°C. The maximum temperature fluctuated but stayed above 0°C after 1980; the average maximum temperature declined after 1992 and rose again after 2005. The lowest average temperature recorded for the period was 15.752°C.
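The yearly aggregation behind the Tableau charts can be sketched as follows, assuming a hypothetical whistler_weather.csv file with one row per day and column names such as "Total Snow (cm)" and "Mean Temp (C)"; the actual column names in the provided data set may differ.

```python
# Yearly totals and averages behind the Tableau charts (assumed file and column names).
import pandas as pd

weather = pd.read_csv("whistler_weather.csv", parse_dates=["Date"])
weather["Year"] = weather["Date"].dt.year

yearly = weather.groupby("Year").agg(
    total_snow=("Total Snow (cm)", "sum"),
    total_rain=("Total Rain (mm)", "sum"),
    total_precip=("Total Precip (mm)", "sum"),
    avg_min_temp=("Min Temp (C)", "mean"),
    avg_max_temp=("Max Temp (C)", "mean"),
    avg_mean_temp=("Mean Temp (C)", "mean"),
)
print(yearly.head())
```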
References
Fang, H., 2015, June. Managing data lakes in big data era: What’s a data lake and why has it became popular in data management ecosystem. In Cyber Technology in Automation, Control, and Intelligent Systems (CYBER), 2015 IEEE International Conference on (pp. 820-824). IEEE.
Golfarelli, M. and Rizzi, S., 2013. Data warehouse testing. Developments in Data Extraction, Management, and Analysis, pp.91-108.
Kimball, R. and Ross, M., 2013. The data warehouse toolkit: The definitive guide to dimensional modeling. John Wiley & Sons.
Kimball, R., 2013. The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling.
Miloslavskaya, N. and Tolstoy, A., 2016, August. Application of Big Data, Fast Data, and Data Lake Concepts to Information Security Issues. In Future Internet of Things and Cloud Workshops (FiCloudW), IEEE International Conference on (pp. 148-153). IEEE.
O’Leary, D.E., 2014. Embedding AI and crowdsourcing in the big data lake. IEEE Intelligent Systems, 29(5), pp.70-73.
Rahman, L., Riyadi, S. and Prasetyo, E., 2015. Development of Student Data Mart Using Normalized Data Store Architecture. Advanced Science Letters, 21(10), pp.3225-3229.
Ramos, J., Alturas, B. and Moro, S., 2017, June. Business intelligence in a public institution—Evaluation of a financial data mart. In Information Systems and Technologies (CISTI), 2017 12th Iberian Conference on (pp. 1-6). IEEE.
Roski, J., Bo-Linn, G.W. and Andrews, T.A., 2014. Creating value in health care through big data: opportunities and policy implications. Health affairs, 33(7), pp.1115-1122.
Ross, T.R., Ng, D., Brown, J.S., Pardee, R., Hornbrook, M.C., Hart, G. and Steiner, J.F., 2014. The HMO Research Network Virtual Data Warehouse: a public data model to support collaboration. EGEMS, 2(1).
Vaisman, A. and Zimányi, E., 2014. Data Warehouse Systems: Design and Implementation. Springer.
Zhu, Q., Liu, Y., Guo, S., Liu, S., Wang, G., Yan, S. and Tong, K., Linkedin Corporation, 2015. Data mart for machine learning. U.S. Patent Application 14/986,599.