FIT5146 Assignment 1: Critical Review Essay
The Importance of Data Curation – Steven Ruggles
Introduction
As academic research, business, and government intelligence become progressively more data-intensive, the role of data curation in turning raw data into comprehensible sources of information and knowledge has become all the more necessary. In The Importance of Data Curation (2018), Steven Ruggles stresses this point: rather than directing ever more funding into collecting new data, researchers should be more concerned with "appropriately integrating, disseminating, and preserving existing data" in order to maximise "the utility of existing datasets" (Ruggles, 2018, p. 303). Ruggles outlines four areas of data curation in which curators and researchers face the most challenging issues: data integration, electronic dissemination, data sustainability, and metadata. In this critical analysis, each of these areas will be defined and the challenges involved examined in depth. Because Ruggles is chiefly concerned with the curation of survey data, the key points made about each of his four challenges will also be related to findings and conclusions reached in other scholarly literature and case studies dealing with datasets from a variety of fields of research.
Data Integration
One of the four challenging areas Ruggles outlines is data integration. As defined by IBM, it is "the combination of technical and business processes used to combine data from disparate sources into meaningful and valuable information" (IBM, 2019). Ruggles emphasises the importance of this process because "it enables time-series and comparative analyses to be undertaken without each individual researcher generating their own systems for harmonising differences in datasets" (2018, p. 303). Integration increases the value of datasets, particularly for researchers who require large sets of survey data in a consistent format, and it allows for accurate cross-temporal comparability.
Several factors can make data integration a challenge, such as differing names, implicit constants, the degree of denormalisation in the datasets, and the different levels of granularity, or detail, recorded in each dataset (Brown et al., 2019). These factors can be categorised into two types: structural heterogeneity and semantic heterogeneity (Bergamaschi et al., 2011). Structural heterogeneity arises when datasets use different data models, or when the same data model is used but different conceptualisations are chosen to represent the same data in different sources (Bergamaschi et al., 2011). Semantic heterogeneity arises when two or more sets of data carry different meanings and interpretations: one source might use the same term to convey different concepts (homonyms), or different terms to convey the same concept (synonyms) (Bergamaschi et al., 2011). One of the factors mentioned earlier, differing names, falls into this category.
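To make the idea of harmonisation concrete, the short Python sketch below illustrates, under invented assumptions, the kind of work Ruggles describes: two hypothetical survey extracts record the same concept under differing variable names and coding schemes (semantic heterogeneity), and must be mapped onto one target scheme before they can be combined for cross-temporal comparison. The dataset names, column names, and codes are purely illustrative and are not drawn from Ruggles or any of the studies cited here.

```python
import pandas as pd

# Two hypothetical survey extracts describing the same concept (respondent sex)
# under differing variable names and coding schemes.
wave_1995 = pd.DataFrame({"person_id": [1, 2], "sex": ["M", "F"]})
wave_2015 = pd.DataFrame({"resp_id": [3, 4], "gender_code": [1, 2]})

# Harmonise names and codes onto one target scheme before combining.
harmonised_2015 = wave_2015.rename(columns={"resp_id": "person_id",
                                            "gender_code": "sex"})
harmonised_2015["sex"] = harmonised_2015["sex"].map({1: "M", 2: "F"})

# Once harmonised, the waves can be stacked for cross-temporal analysis.
combined = pd.concat([wave_1995, harmonised_2015], ignore_index=True)
print(combined)
```

The value of curated integration, as Ruggles argues, is that this mapping is done once by the archive rather than re-invented by every individual researcher.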
From these definitions, it can be discerned that the greatest difficulty in effective data integration lies in how two sets of data were recorded and collated prior to integration, even when they describe similar subject matter. A case study that examined the advantages and disadvantages of different computational data integration methods against biological datasets (Gligorijevic & Przulj, 2015) reported similar issues. The authors observed, for example, that the Bayesian network method was well suited to integrating small datasets but was incapable of handling larger-scale ones, whereas the kernel-based method could handle large-scale datasets but could not integrate heterogeneous data as effectively (Gligorijevic & Przulj, 2015). Although the case study concluded that the methods examined still delivered satisfactory results despite these imperfections, both the case study and Ruggles (2018) recommended further research and development to improve data integration within and between datasets.
Electronic Dissemination
The second major challenging aspect of data curation Ruggles outlines is the effective electronic dissemination of data. As Ruggles emphasises, "efficient data dissemination is important because the large investment of scarce resources for the social science infrastructure can only be justified if the data are widely used" (2018, p. 304). The author also argues that an "increased emphasis on ongoing improvements to dissemination methods and platforms is necessary to maximising the value of survey data" (Ruggles, 2018, p. 304) and suggests that data storage and access platforms will need to "undergo constant improvements to keep pace with technology advancements and user expectations" (Ruggles, 2018, p. 305).
While this argument is sound, other articles suggest that the technological currency of a database platform is not the only challenge to consider when implementing efficient electronic dissemination of datasets. One article on workflow-based construction research data management and dissemination makes a similar argument to Ruggles, noting that central to the difficulties in efficiently sharing and disseminating data are "many technical challenges" (Shahi et al., 2014, p. 245).
However, the authors also state that the factors that make sharing data difficult go "beyond technical issues" (Shahi et al., 2014, p. 245). These factors include some data owners' (i.e. researchers') lack of incentive to share their raw data, as well as the perception among researchers that sharing data, while it would further progress, would also increase competition within their research communities (Ceci, 1988, as cited in Shahi et al., 2014). Another concern researchers have raised about contributing to data-sharing platforms is the lack of a standard recognition system, comparable to article citation standards, for research data disseminated through such platforms (Fischer & Zigmond, 2010, as cited in Shahi et al., 2014).
In another article, on the benefits and challenges of opening government health data on a public data-sharing platform (Martin & Begany, 2017), a similar point was raised: the introduction of an open data-sharing platform was met with "cultural resistance" from data providers. One compelling reason for this resistance to sharing raw datasets was explained by a respondent in the article: "People take a lot of time to accumulate data that they have for their specific project or purpose and it's not often a first reaction for someone to say, 'Oh sure, I'll turn around and make that data open.'" (Martin & Begany, 2017).
If data providers and researchers are reluctant to contribute their data to data-sharing platforms because of the lack of appropriate recognition systems and positive incentives, two problems follow: curators cannot collect enough datasets for reliable cross-temporal comparability, and data that is never collected cannot be disseminated at all. Overcoming this reluctance is therefore a significant challenge for data curators.
Data Sustainability
The third key challenging aspect of data curation Ruggles identifies is data sustainability, which the author briefly defines as "the ability to preserve data, 'especially old data'" (Ruggles, 2018). In other words, data sustainability is a way to maintain the reusability of data for future research and to maintain broad access to the data for users. As Ruggles outlines, the importance of effective data sustainability lies in the great value gained from maintaining datasets that create opportunities for accurate cross-temporal analysis: the older and better maintained the datasets are, the greater their value for researchers (Ruggles, 2018). Ruggles suggests two approaches to ensuring data sustainability: introducing a formal preservation plan as a standard against which to determine whether data platforms are "functioning as effective data archives" (Ruggles, 2018, p. 306), and moving to "expand the use of persistent identifiers" (Ruggles, 2018, p. 307), such as the Digital Object Identifier (DOI), as a method of consistently identifying datasets over the long term.
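As a brief, purely illustrative sketch of the persistent-identifier suggestion, the Python example below shows how a dataset record might carry a DOI that resolves through the doi.org proxy to a stable landing page, so the dataset remains citable even if its hosting platform or URL changes. The dataset title, DOI, and URL are hypothetical and are not taken from Ruggles or any real archive.

```python
# A minimal, hypothetical dataset record with a persistent identifier (DOI).
dataset_record = {
    "title": "Example Household Survey, 1990-2020 (hypothetical)",
    "doi": "10.12345/example-survey-2020",   # invented DOI, not registered
    "current_host": "https://data.example.edu/survey-2020",
}

# DOIs resolve through the doi.org proxy; the archive can update the landing
# page even if the hosting platform behind it changes over time.
landing_url = f"https://doi.org/{dataset_record['doi']}"
print(landing_url)
```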
In Hanafin, Brooks, and Meaney's (2014) article on developing and maintaining sustainable data systems for reporting on children's lives and well-being, the authors took a slightly different approach to maintaining data. The research group used a triangulation methodology to conduct a data gap analysis on the datasets they had collected, and found that this method greatly assisted in making their data sustainable (Hanafin, Brooks, & Meaney, 2014). Through this method, the group was able to identify gaps in the data collated for storage, and in doing so discovered priority areas for data research, built upon its data curation strategy, and as a result enabled effective data sustainability.
However, in Shen's (2018) research article examining how a team of environmental scientists at Virginia Tech behaved and responded to data management issues, several other factors were found to make effective data sustainability difficult to achieve, and these cannot be easily solved by technical means alone. The issues raised by the research participants include a deficiency in appropriate documentation for existing datasets, which introduced quality assurance problems and potentially led to valuable data being deemed useless and discarded because it lacked a consistent means of being discovered and understood by researchers (Shen, 2018). Another factor in poor data sustainability was the failure to preserve old versions of the software that had created older sets of data, which meant valuable digital data was lost as well. In addition, the participants stated that poor, unstandardised preservation practices prior to collation left valuable datasets susceptible to being accidentally lost or destroyed by the researchers who owned them.
What can be drawn from each case is that technical measures, together with proper standardised plans and practices for researchers to follow, are both important in ensuring that data remains sustainable, reusable, and understandable in the long term, and that future researchers retain efficient access to it.
Metadata
Metadata is the last of Ruggles' four challenging areas of data curation to discuss. Ruggles (2018, p. 307) defines the term as the "class of information that relates to and describes that focal data", or "data about data", and describes it as vital to enabling efficient data integration and dissemination. For Ruggles, the main challenge concerning metadata is converting a large amount of data and documentation into a structured metadata format, which would require significant investment and future work to complete (2018). To resolve this difficulty, Ruggles suggests that "future research aimed at reducing the costs of conversion through the development of 'smart' automated processes" would be important for improving the conversion of documentation into metadata.
However, other articles on metadata show that the challenges in this area are much broader than reducing the costs of conversion. In Hsu et al.'s (2015) research article on the challenges, strategies, and opportunities found in the data management of experimental geomorphology, additional challenges are identified: the "lack of metadata standards for discoverability and sharing" (Hsu et al., 2015) and "insufficient workflow documentation and communication for experimental repeatability" (Hsu et al., 2015), documentation which ideally should have been converted into the datasets' metadata. The difficulty of creating quality metadata was also noted in Shen's (2018) article, where insufficient workflow documentation and metadata were further factors in poor data sustainability.
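As a purely illustrative sketch of what "converting documentation into structured metadata" can mean in practice, the short Python example below turns a piece of free-text documentation into a simple machine-readable record. The field names, values, and dataset are invented for illustration and do not follow any schema from Ruggles or Hsu et al.

```python
import json

# Hypothetical free-text documentation accompanying a survey extract.
documentation = "Household income survey, collected 1995-2000, 12,000 respondents."

# A minimal structured metadata record distilled from that documentation.
# Field names are invented for illustration, not taken from any real standard.
metadata = {
    "title": "Household income survey",
    "temporal_coverage": {"start": 1995, "end": 2000},
    "respondent_count": 12000,
    "description": documentation,
}

# Structured metadata is machine-readable, so it can support discovery,
# integration, and dissemination in a way free-text documentation cannot.
print(json.dumps(metadata, indent=2))
```

The "smart" automated processes Ruggles calls for would, in effect, perform this kind of distillation at scale rather than by hand.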
As with the difficulties identified in establishing data sustainability, it can be drawn from these findings that developing and following a set of common standards for documenting and generating metadata can improve the quality of datasets and lead to better data curation practices.
Conclusion
There are four main challenging aspects to consider when curating data: data integration, electronic dissemination, data sustainability, and metadata. The success of curating datasets depends on the effectiveness of each of these aspects. Throughout the article, Ruggles outlines the importance of each area of data curation and suggests solutions and further research for improving its effectiveness. While Ruggles (2018) argues mainly from a technical point of view, various other case studies and research articles also point to the need for a more effective and standardised set of policies and methodologies for data providers to follow, both to ensure the usability of data and to prevent data loss. In this way, data curators can create database platforms that enable effective data sustainability, reusability, and discoverability of datasets, as well as repeatability of the experimental methods behind them. When researching solutions to the challenges of data curation, we therefore need to consider establishing effective standardised policies and approaches that guide how researchers and other personnel handle their data responsibly, in addition to the technical approaches used to resolve these aspects of data curation.
References:
Ruggles, S. (2018). The Importance of Data Curation. In D.L. Vannette & J.A. Krosnick (Eds.), The Palgrave Handbook of Survey Research (pp. 303-308). Palgrave Macmillan, Cham. Retrieved from https://doi-org.ezproxy.lib.monash.edu.au/10.1007/978-3-319-54395-6
IBM. (2019). Data integration. Retrieved from https://www.ibm.com/analytics/data-integration
Bergamaschi, S., Beneventano, D., Guerra, F., Orsini, M., Embley, D., & Thalheim, B. (2011). Data Integration. In D.W. Embley & B. Thalheim (Eds.), Handbook of Conceptual Modeling: Theory, Practice, and Research Challenges (pp. 441-476). Berlin, Heidelberg: Springer Berlin Heidelberg. doi: 10.1007/978-3-642-15865-0
Brown, K.S., Spivak D.I., & Wisnesky R. (2019). Categorical data integration for computational science. Computational Materials Science, 164, 127-132. Retrieved from https://doi.org/10.1016/j.commatsci.2019.04.002
Gligorijevic, V., & Przulj, N. (2015). Methods for biological data integration: Perspectives and challenges. Journal Of The Royal Society Interface, 12(112), 26490630. https://doi-org.ezproxy.lib.monash.edu.au/10.1098/rsif.2015.0571
Shahi, A., Haas, C., West, J., & Akinci, B. (2014). Workflow-Based Construction Research Data Management and Dissemination. Journal of Computing in Civil Engineering, 28(2), 244-252. doi: 10.1061/(ASCE)CP.1943-5487.0000251.
Martin, E., & Begany, G. (2017). Opening government health data to the public: Benefits, challenges, and lessons learned from early innovators. Journal of the American Medical Informatics Association, 24(2), 345-351. doi: 10.1093/jamia/ocw076
Shen, Y. (2018). Data sustainability and reuse pathways of natural resources and environmental scientists. New Review of Academic Librarianship, 24(2), 136-156. doi: 10.1080/13614533.2018.1424642
Hanafin, S., Brooks, A., & Meaney, B. (2014). Data sustainability: Broad action areas to
develop data systems strategically. Child Indicators Research, 7(2), 229-243. doi: 10.1007/s12187-013-9218-2
Hsu, L., Martin, R.L., McElroy, B., Litwin-Miller, K., & Kim, W. (2015). Data management, sharing, and reuse in experimental geomorphology: Challenges, strategies, and scientific opportunities. Geomorphology, 244, 180-189. Retrieved from http://dx.doi.org/10.1016/j.geomorph.2015.03.039