PARALLEL DATA SORTING AND DEDUPLICATION
IN DISTRIBUTED FILE SYSTEMS
TABLE OF CONTENTS
i. Title Page
ii. Declaration
iii. Certificate
iv. Acknowledgement
v. Table of Contents
vi. Abstract
1. Introduction
2. Literature Survey
3. Project Design
    3.1. Description of Modules
    3.2. Software and Hardware Specifications
    3.3. Model Diagrams and Design Specifications
    3.4. Individual Contributions
    3.5. Project Plan and Milestones
4. Achievements
    4.1. Analysis Completed
    4.2. Results
    4.3. Future Work
    4.4. Scope for Publishing
5. References
6. Appendix A
    6.1. Data Pre-Processing
    6.2. Parallel Deduplication, Sorting, Client File
ABSTRACT
This project, titled “Parallel Data Sorting and Deduplication”, aims to build a distributed file system (DFS) in Python, modelled on the Hadoop Distributed File System (HDFS). In today’s age of cloud computing and big data, where the speed of computation is as critical as the reliability of storage and datasets are enormous, DFSs are the storage solution that will have a long-lasting impact into the future. Speed of computation and reliability of storage are the two cornerstones on which our distributed file system is built. For the first factor, speed, we have designed a parallel deduplication algorithm and a parallel sorting algorithm. The latter is a novel algorithm based on sequential merge sort that makes use of the nature of the DFS designed, i.e. the master-slave architecture of HDFS. These operations were chosen because this project serves as a basic demonstration of parallel processing in DFSs, and sorting and deduplication are fundamental operations that underlie almost every advanced data storage and manipulation task. The Python language has been used for creating the DFS and the Map-Reduce paradigm. Communication within the DFS is performed through remote procedure calls (RPCs), implemented with the RPyC library in Python. Finally, the DFS has been shown to provide fast, reliable storage together with parallel data computation and processing.
1. INTRODUCTION
In recent times, there has been a surge in high-performance computing, especially distributed computing and parallel operations, along with a major rise in network-based computing. All of these types of computing place several demands on their architectures, namely speed (of computing and of data storage/manipulation), reliability (of storage) and scalability (of the architecture used). We have attempted to create a distributed file system that fulfils all of these criteria.
As an outcome of the progress of web and data technology, massive amounts of data are produced in our everyday lives; large volumes, running into petabytes, are recorded each day. Beyond sheer size, big data has further characteristics such as variety and velocity. Consequently, big data analysis using machine learning and data mining techniques has become an important research problem.
Mining such enormous datasets is difficult with present techniques and data mining software tools because of their large size and complexity. In other words, executing a data mining task over very large datasets on a single computer incurs high computational cost. It is therefore necessary to use more powerful computing environments to process and analyse big data efficiently. Solutions to the problem of mining large-scale datasets can be based on parallel and distributed computing platforms. In principle, parallel processing focuses on partitioning the chosen (large) problem into smaller ones, each of which is handled by a single processor, so that a computation composed of many sub-computations is performed simultaneously in a distributed and parallel fashion.
In our distributed file system, we have used the master-slave architecture seen in the Hadoop Distributed File System and the Map-Reduce paradigm. Following the standard nomenclature, the master node is called the NameNode, the slave/minion nodes are called DataNodes, and the data is replicated by a factor replication_factor specified in the config file. The data is first deduplicated in parallel, then sorted using the parallel merge sort algorithm. This data is then replicated as mentioned above and stored in the DataNodes. The communication between the NameNode (master), the DataNodes (minions) and the client node takes place using remote procedure calls (RPCs), implemented with the RPyC library in Python.
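As a small illustration of how such a configuration might be read, the sketch below loads the block size and replication factor from a config file; the file name dfs.conf, the configparser-based format and the values shown are our assumptions, not the project’s actual settings.

# Hedged sketch: reading the DFS configuration. The file name "dfs.conf" and the
# values are assumptions; the report only states that the block size and the
# replication factor are kept in a config file.
#
# dfs.conf might contain:
#   [dfs]
#   block_size = 500
#   replication_factor = 2
import configparser

config = configparser.ConfigParser()
config.read("dfs.conf")

BLOCK_SIZE = config.getint("dfs", "block_size")
REPLICATION_FACTOR = config.getint("dfs", "replication_factor")
print(BLOCK_SIZE, REPLICATION_FACTOR)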
While reading data, the minion nodes are contacted in serial order. If the data for the requested file is held by a minion, it is streamed over RPC from the minion to the client and displayed on the screen. The data is replicated as a fail-safe against data corruption and power failures.
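The minion-side code is not reproduced in Appendix A, so the following is a minimal sketch, under our assumptions, of what a minion (DataNode) service could look like to serve the client calls shown in the appendix: put() stores a block on local disk and forwards it along the list of remaining minions for replication, while get() returns the block if it exists. The port number is an assumption; the minion_nodes directory mirrors the path checked by the client’s get() function.

# Hedged sketch of a minion (DataNode) service. The names mirror the calls made
# by the client code in Appendix A (con.root.Minion(), minion.put(), minion.get());
# the port number and internals are assumptions.
import os

import rpyc
from rpyc.utils.server import ThreadedServer

# directory where blocks are stored; mirrors the "minion_nodes" path used by the client
DATA_DIR = os.path.join(os.getcwd(), "minion_nodes")


class MinionService(rpyc.Service):
    class exposed_Minion():
        def exposed_put(self, block_uuid, data, minions):
            # store the block locally
            with open(os.path.join(DATA_DIR, str(block_uuid)), "w") as f:
                f.write(data)
            # forward the block to the next minion in the replication chain
            if len(minions) > 0:
                self.forward(block_uuid, data, minions)

        def exposed_get(self, block_uuid):
            block_path = os.path.join(DATA_DIR, str(block_uuid))
            if not os.path.isfile(block_path):
                return None
            with open(block_path) as f:
                return f.read()

        def forward(self, block_uuid, data, minions):
            host, port = minions[0]
            con = rpyc.connect(host, port=port)
            con.root.Minion().put(block_uuid, data, minions[1:])


if __name__ == "__main__":
    if not os.path.isdir(DATA_DIR):
        os.makedirs(DATA_DIR)
    ThreadedServer(MinionService, port=8888).start()  # port is an assumption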
The project has immense scope for future work. Distributed file systems are the future of data storage and manipulation. Today, data is stored in data centres all over the globe, and this data must be constantly synced across all the centres. To make this possible, DFSs provide the ultimate solution, both in terms of speed and reliability, and additional parallel computations (deduplication and sorting) only increase the speed of operations. This work can therefore be carried forward and modified for larger data, for additional file system features such as deleting DFS files, and for many more possibilities.
2. LITERATURE SURVEY
Large-scale Distributed L-BFGS
M. M. Najafabadi et al. (2017) suggest a solution to the problem of examining large amounts of data, or extracting patterns from it, for training machine learning algorithms [1]. Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) is an optimization method used for parameter estimation. Since the resources of a single computer can be insufficient for running this algorithm, the paper presents a parallel implementation of L-BFGS on a distributed file system, using HPCC as the distributed platform.
Scalable Distributed Implementation of a Biologically Inspired Parallel Model
G. Ciobanu (2015) introduces a distributed computing middleware inspired by biological models, which helps in solving various synchronization issues [2]. The MapReduce paradigm is used to develop a parallel and scalable implementation that permits the division of a single task into several subtasks. These subtasks are executed in parallel and their results are aggregated into a final result. The model can provide solutions to NP-complete problems.
Efficient Parallel Spectral Clustering Algorithm Design for Large Data Sets under Cloud Computing Environment
R. Jin et al. (2013) improve clustering speed by combining spectral clustering with MapReduce [3], evaluating sparse matrix eigenvalues and finding clusters in a distributed manner. The paper concludes that as the amount of processed data increases, the clustering rate increases linearly, indicating that the parallel spectral clustering algorithm scales well in a cloud computing environment.
iiHadoop: An Asynchronous Distributed Framework for Incremental Iterative Computations
A. G. B. Saadon et al. (2017) introduce iiHadoop as an extension to the existing Hadoop framework [4]. Although MapReduce was introduced to handle computations over massive amounts of data, it does not provide a solution for handling small amounts of incremental data. The proposed iiHadoop speeds up programs by performing incremental computations only on the small fraction of affected data, and further improves performance by executing iterations asynchronously.
Trustworthy Group Making Algorithm in Distributed Systems
A. Aikebaier et al. (2011) model a distributed, scalable system as a peer-to-peer system [5]. The security between group members, or peers, is the primary concern: each peer has to be trustworthy, since the behaviour of one peer can affect the whole system. The trustworthiness of each peer is a fundamental variable in such a distributed environment. The paper introduces a new approach for incrementally building a safe group in distributed system protocols.
Meta-MapReduce for Scalable Data Mining
X. Liu et al. (2015) tackle the time taken by MapReduce to solve machine learning problems involving iterations [6]. The existing MapReduce framework suffers from a significant weakness in that it cannot support iteration efficiently. A new algorithm, Meta-MapReduce, is introduced which reduces the computational complexity of processing the training data as the number of nodes increases, while also obtaining smaller error rates.
Lessons Learned from CPES Co-Simulation with Distributed, Heterogeneous Systems
C. Steinbrink et al. (2018) present a case study based on co-simulation with distributed, heterogeneous simulators [7]. The increasing integration of renewable energy sources into the conventional power grid often transforms the grid into a cyber-physical energy system. Although this provides options for stable, optimized control, it also introduces vulnerabilities through ignorance of certain setup characteristics. The authors present MOSAIK, a system that aims to bridge the gap between the requirements for special interfacing and high usability.
A Feasible MapReduce Peer-to-Peer Framework for Distributed Computing Applications
H. M. Tran et al. (2015) introduce a MapReduce peer-to-peer framework that brings MapReduce implementations to P2P networks [8]. This is useful for users who cannot afford dedicated clusters for occasional demands. Another advantage of the framework is that it allows Internet users to make use of large datasets on distributed systems. It also includes features to improve fault tolerance and to manage peer failures.
Parallel Backprojection: A Case Study in High-Performance Reconfigurable Computing
B. Cordes et al. (2009) present an implementation of backprojection for synthetic aperture radar (SAR) on a high-performance reconfigurable computing (HPRC) system [9]. Backprojection is an image synthesis algorithm that can be used as part of SAR processing. On HPRC machines, a novel approach combining general-purpose processors and FPGAs allows designers to exploit both fine-grained and coarse-grained parallelism, reaching very high degrees of computational speedup.
3. PROJECT DESIGN
3.1 Description of Modules
There are 5 major modules in this project:
Data Pre-processing: We have taken data from two sources: the Brown corpus available in NLTK, and the Wikipedia 100k dataset, which contains the 100,000 most frequently used words. The second dataset contained repeated words as well as non-English words. While repetition was not a major concern, the non-English words had to be removed. The result is a large dataset containing 306,694 words (with repetitions). All of these operations were performed in a Jupyter notebook.
Master Node: This module pertains to the master node or, more appropriately, the NameNode. It handles all communication among the other nodes over remote procedure calls through the RPyC module. The master node is responsible for keeping communication channels open, for linking the client and the minions, and for maintaining the integrity of the data nodes. A hedged sketch of what such a service might look like is given after this list.
Minion Node: This module creates the minion nodes that store data according to the Map-Reduce paradigm. It is responsible for ensuring the authenticity and integrity of the data, and for making sure that data is appropriately replicated and split between the nodes.
Client: The client module is responsible for carrying out the parallel tasks, i.e. deduplicating the data in parallel and then sorting it using parallel merge sort. Functions in the client file call objects exposed by the master and minion nodes to obtain information about the distributed file system and get these tasks done.
Configuration File: The config file contains basic information about the DFS, i.e. the block size and the replication factor. This information is used by the master and minion files to create the DFS and forms the basic building blocks of the file system.
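The NameNode code itself is not listed in Appendix A. Below is a minimal, hedged sketch of what the master service could look like so that the calls made by the client code in the appendix (write, get_file_table_entry, get_minions, get_block_size, get_replication_factor) resolve. The minion addresses, the random block-placement policy and the constants shown are assumptions; in the actual project the block size and replication factor come from the configuration file.

# Hedged sketch of the master (NameNode) service. Method names mirror the calls
# made by the client code in Appendix A; the minion addresses, constants and
# block-placement policy are assumptions.
import math
import random
import uuid

import rpyc
from rpyc.utils.server import ThreadedServer


class MasterService(rpyc.Service):
    class exposed_Master():
        file_table = {}                        # file name -> list of (block_uuid, [minion_ids])
        minions = {"1": ("127.0.0.1", 8888),   # assumed DataNode addresses
                   "2": ("127.0.0.1", 8889)}
        block_size = 500                       # in the project these come from the config file
        replication_factor = 2

        def exposed_get_block_size(self):
            return self.block_size

        def exposed_get_replication_factor(self):
            return self.replication_factor

        def exposed_get_minions(self):
            return self.minions

        def exposed_get_file_table_entry(self, fname):
            return self.file_table.get(fname)

        def exposed_write(self, dest, size, src):
            # allocate one entry per block copy; each copy is placed on one minion
            num_blocks = int(math.ceil(float(size) / self.block_size))
            replicas = self.replication_factor if src == "main" else 1
            blocks = []
            for _ in range(num_blocks * replicas):
                block_uuid = str(uuid.uuid1())
                nodes = random.sample(list(self.minions.keys()), 1)
                blocks.append((block_uuid, nodes))
            self.__class__.file_table[dest] = blocks
            return blocks


if __name__ == "__main__":
    # port 2131 matches the client's rpyc.connect("localhost", port=2131)
    ThreadedServer(MasterService, port=2131).start()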
3.2 Software and Hardware Specifications
The major software libraries used in this project are:
RPyC: This Python library offers a convenient way to use remote procedure calls and to create network-based communication. RPyC is the means through which the master, minions and client communicate with each other and transfer data over the network (localhost in this case). Since RPCs are used, the project can also be extended to run over physically distributed systems: even if the specifications of the different machines are not known, communication can still happen over remote procedure calls.
Multiprocessing: The multiprocessing library in Python plays a role analogous to that of OpenMP in C. It facilitates spawning worker processes so that all available cores can be used for parallel computation; the number of worker processes can be set, work can be distributed across a process pool, and mutually exclusive (critical) sections can be protected with locks. This library has been used for parallel data deduplication and parallel data sorting; a small illustration of the Pool API is given after this list.
NLTK: The Natural Language Toolkit (NLTK) library has been used to obtain the words of the Brown corpus and the English word list used when generating the dataset.
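As a small illustration of the multiprocessing Pool API that the parallel routines rely on, the sketch below splits a word list into chunks, deduplicates each chunk in a separate worker process, and merges the partial results. It is only an example of the library’s usage under our assumptions, not the project’s exact deduplication code (the project code is listed in Appendix A).

# Minimal illustration of multiprocessing.Pool usage (the same API that
# parallel_sort in Appendix A uses). This is an example, not the project's
# exact deduplication implementation.
import multiprocessing


def dedup_chunk(chunk):
    # remove duplicates within one chunk
    return set(chunk)


def parallel_dedup(words):
    processes = multiprocessing.cpu_count()
    size = max(1, (len(words) + processes - 1) // processes)
    chunks = [words[i:i + size] for i in range(0, len(words), size)]
    with multiprocessing.Pool(processes=processes) as pool:
        partial_sets = pool.map(dedup_chunk, chunks)
    merged = set()
    for s in partial_sets:
        merged |= s
    return list(merged)


if __name__ == "__main__":
    print(parallel_dedup(["apple", "banana", "apple", "cherry", "banana"]))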
The project has been developed on a Dell Inspiron 7560 notebook with 8 GB of RAM and an Intel Core i5-7200U dual-core (four-thread) processor with a base clock of 2.5 GHz and Turbo Boost. Apart from the integrated Intel HD Graphics 620, there is also an NVIDIA GeForce 940MX GPU with 4 GB of dedicated GDDR5 graphics memory. The main storage is a 1 TB SATA3 NTFS hard drive, with the operating system located on a Samsung 850 EVO M.2 solid-state drive for faster access to applications and programs.
3.3 Model Diagrams and Design Specifications
Fig. 1: Model of the distributed file system
Fig. 2: Flowchart of tasks in the DFS
3.4 Individual Contributions
The individual contributions of all the group members are listed below:
Yash Gupta (15BCE2073):
Literature Survey
Implement parallel sorting
Implement master node
Implement minion nodes
Perform integration testing
Sukriti Jain (15BCB0065):
Requirements analysis
Implement MapReduce
Perform unit tests
Implement client interaction
Gahana Agarwal (15BCE0552):
Find relevant research
Implement communication network
Parallel data deduplication
Final documentation
3.5 Project Plan and Milestones
Fig. 3: Gantt chart for project milestones
4. ACHIEVEMENTS
4.1 Analysis Completed
We have successfully finished the implementation of the project ahead of schedule. We have also completed unit and integration tests and analysed a comparison of the parallel operations with the corresponding sequential operations. Although the parallel operations take slightly more time than the sequential ones on our test data, we have observed that this gap shrinks as the size of the dataset increases. We have therefore concluded that parallel operations are better suited to huge datasets, in the order of hundreds of millions to billions of values; for smaller datasets, the overhead of creating extra processes outweighs the benefit of sorting or deduplicating the data in parallel.
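For reference, the comparison follows the general form sketched below; the module name client (assumed to hold the Appendix A code) and the data file name are our assumptions, not part of the original report.

# Hedged sketch of the sequential-vs-parallel timing comparison. "client" is a
# hypothetical module name for the Appendix A code; the data file name is also
# an assumption.
import time

from client import parallel_sort

if __name__ == "__main__":
    with open("data_clean.txt") as f:
        words = [line.rstrip("\n") for line in f]

    start = time.perf_counter()
    sorted_seq = sorted(words)
    t_seq = time.perf_counter() - start

    start = time.perf_counter()
    sorted_par = parallel_sort(words)
    t_par = time.perf_counter() - start

    assert sorted_seq == sorted_par
    print("sequential: {:.3f} s, parallel: {:.3f} s".format(t_seq, t_par))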
4.2 Results
Fig. 4: Jupyter notebook for data pre-processing
Fig. 5: Reading data from source and storing in nodes
Fig. 6: Client reading data from the source
Fig. 7: Accessing the data stored in nodes
Fig. 8: Data after being sorted parallelly
Fig. 9: Storing the sorted data back into the nodes
Fig. 10: Reading already sorted data
4.3 Future Work
As mentioned earlier, distributed file systems represent the future of data storage and manipulation. Data centres across the world must access data and remain in synchronization at all times; that is where DFSs come into the picture, and with parallel operations this can be made even faster. Thus, this project has considerable future scope and many applications. Moreover, further functions can be implemented, such as deleting nodes, deleting the DFS files that store data, and so on.
4.4 Scope for Publishing
This project has a strong scope for being published because the work done here is genuine, original and cutting-edge. DFSs are only emerging, and not a lot of research and development has gone into the area of distributed systems with parallel operations. Finally, after more extensive testing on larger datasets on more efficient systems, this work could be sent to journals to be published.
5. REFERENCES
Najafabadi, M. M., Khoshgoftaar, T. M., Villanustre, F., & Holt, J. (2017). Large-scale distributed L-BFGS. Journal of Big Data, 4(1), 22.
Ciobanu, G. (2015). Scalable distributed implementation of a biologically inspired parallel model. Complex & Intelligent Systems, 1(1-4), 69-80.
Jin, R., Kou, C., Liu, R., & Li, Y. (2013). Efficient parallel spectral clustering algorithm design for large data sets under cloud computing environment. Journal of Cloud Computing: Advances, Systems and Applications, 2(1), 18.
Saadon, A. G. B., & Mokhtar, H. M. (2017). iiHadoop: an asynchronous distributed framework for incremental iterative computations. Journal of Big Data, 4(1), 24.
Aikebaier, A., Enokido, T., & Takizawa, M. (2011). Trustworthy group making algorithm in distributed systems. Human-centric Computing and Information Sciences, 1(1), 6.
Liu, X., Wang, X., Matwin, S., & Japkowicz, N. (2015). Meta-MapReduce for scalable data mining. Journal of Big Data, 2(1), 14.
Steinbrink, C., Köhler, C., Siemonsmeier, M., & van Ellen, T. (2018). Lessons learned from CPES co-simulation with distributed, heterogeneous systems. Energy Informatics, 1(1), 38.
Tran, H. M., Ha, S. V. U., Huynh, T. K., & Le, S. T. (2015). A feasible MapReduce peer-to-peer framework for distributed computing applications. Vietnam Journal of Computer Science, 2(1), 57-66.
Cordes, B., & Leeser, M. (2009). Parallel backprojection: a case study in high-performance reconfigurable computing. EURASIP Journal on Embedded Systems, 2009, 1.
6. APPENDIX A
6.1 Data Pre-Processing
from nltk.stem import LancasterStemmer
from nltk.stem import PorterStemmer
from nltk.corpus import words   # may require a one-time nltk.download("words")

# set of valid English words used to filter the raw data
eng_words = set(words.words())

data_dirty = [line.rstrip("\n") for line in open("data_dirty.txt", encoding="utf-8")]
data_clean = []
porter_stemmer = PorterStemmer()
lancaster_stemmer = LancasterStemmer()

# keep a word if it (or one of its stems) is a valid English word
for word in data_dirty:
    if word.lower() in eng_words:
        data_clean.append(word)
    elif word in eng_words:
        data_clean.append(word.lower())
    elif porter_stemmer.stem(word.lower()) in eng_words:
        data_clean.append(porter_stemmer.stem(word))
    elif porter_stemmer.stem(word) in eng_words:
        data_clean.append(porter_stemmer.stem(word.lower()))
    elif lancaster_stemmer.stem(word.lower()) in eng_words:
        data_clean.append(lancaster_stemmer.stem(word))
    elif lancaster_stemmer.stem(word) in eng_words:
        data_clean.append(lancaster_stemmer.stem(word.lower()))

len(data_dirty), len(data_clean)

# add every dictionary word so the final dataset also covers the full word list
data_clean.extend(list(eng_words))
len(data_clean)

with open("data_clean.txt", "w") as f:
    for word in data_clean:
        f.write("%s\n" % word)
6.2 Parallel Deduplication, Sorting, Client File
import math
import multiprocessing
import os
import sys
import warnings

import rpyc

warnings.filterwarnings("ignore")
def merge(*args):
    # accept either a (left, right) tuple or two arguments, because
    # multiprocessing.Pool.map passes a single tuple
    left, right = args[0] if len(args) == 1 else args
    left_length, right_length = len(left), len(right)
    left_index, right_index = 0, 0
    merged = []
    while left_index < left_length and right_index < right_length:
        if left[left_index] <= right[right_index]:
            merged.append(left[left_index])
            left_index += 1
        else:
            merged.append(right[right_index])
            right_index += 1
    if left_index == left_length:
        merged.extend(right[right_index:])
    else:
        merged.extend(left[left_index:])
    return merged


def merge_sort(data):
    # standard sequential merge sort, used by each worker on its own chunk
    length = len(data)
    if length <= 1:
        return data
    middle = length // 2
    left = merge_sort(data[:middle])
    right = merge_sort(data[middle:])
    return merge(left, right)


def parallel_sort(data):
    # split the data into one chunk per core, sort the chunks in parallel,
    # then repeatedly merge pairs of sorted chunks until one list remains
    processes = multiprocessing.cpu_count()
    pool = multiprocessing.Pool(processes=processes, maxtasksperchild=1)
    size = int(math.ceil(float(len(data)) / processes))
    data = [data[i * size: (i + 1) * size] for i in range(processes)]
    data = pool.map(merge_sort, data, chunksize=1)
    while len(data) > 1:
        extra = data.pop() if len(data) % 2 == 1 else None
        data = [(data[i], data[i + 1]) for i in range(0, len(data), 2)]
        data = pool.map(merge, data) + ([extra] if extra else [])
    return data[0]
def send_to_minion(block_uuid, data, minions):
    # write the block to the first minion; it forwards to the rest of the chain
    print("sending to: " + str(block_uuid) + str(minions))
    minion = minions[0]
    minions = minions[1:]
    host, port = minion
    con = rpyc.connect(host, port=port)
    minion = con.root.Minion()
    minion.put(block_uuid, data, minions)


def read_from_minion(block_uuid, minion):
    host, port = minion
    con = rpyc.connect(host, port=port)
    minion = con.root.Minion()
    return minion.get(block_uuid)
def get(master, fname):
    file_table = master.get_file_table_entry(fname)
    if not file_table:
        print("404: File Not Found")
        return
    data_unsorted = ""
    print("\nData stored in nodes is:\n")
    for block in file_table:
        # try each replica of the block in turn until one minion returns data
        for m in [master.get_minions()[_] for _ in block[1]]:
            data = read_from_minion(block[0], m)
            if data:
                sys.stdout.write(data)
                data_unsorted += data
                break
        else:
            try:
                if os.path.getsize(os.getcwd() + "\\minion_nodes\\" + str(block[0])) != 0:
                    print("No blocks found. Possibly a corrupt file.")
            except:
                print("No blocks found. Possibly a corrupt file.")
    data_unsorted = data_unsorted.split("\n")
    if data_unsorted == [""]:
        return
    if data_unsorted[-1] == "":
        data_unsorted = data_unsorted[:-1]
    # sort the retrieved words in parallel and store the sorted copy back in the DFS
    data_sorted = parallel_sort(data_unsorted)
    if data_unsorted != data_sorted:
        print("\n\n\nData stored in nodes is now sorted:\n")
        with open("data_{}_sorted.txt".format(fname), "w") as f:
            for word in data_sorted:
                print(word)
                f.write("%s\n" % word)
        print()
        put(master, "data_{}_sorted.txt".format(fname), fname, "get")
    else:
        print("\nData is already sorted.")
def put(master, source, dest, src):
    if src == "main":
        # deduplicate the source data before it is first stored in the DFS
        data_dup = [line.rstrip("\n") for line in open(source)]
        data_dup = list(set(data_dup))
        source = "data_{}_deduplicated.txt".format(dest)
        with open(source, "w") as f:
            for word in data_dup:
                f.write("%s\n" % word)
    size = os.path.getsize(source)
    b_size = int(math.ceil(float(size) / master.get_block_size()))
    blocks = master.write(dest, size, src)
    # replicate only on the initial put; re-puts of sorted data are stored once
    if src == "main":
        rep = master.get_replication_factor()
    else:
        rep = 1
    for r in range(rep):
        with open(source) as f:
            for i in range(b_size):
                b = blocks[b_size * r + i]
                data = f.read(master.get_block_size())
                block_uuid = b[0]
                minions = [master.get_minions()[_] for _ in b[1]]
                send_to_minion(block_uuid, data, minions)
def main(args):
    con = rpyc.connect("localhost", port=2131)
    master = con.root.Master()
    if len(args) != 0 and args[0] == "get":
        get(master, args[1])
    elif len(args) != 0 and args[0] == "put":
        put(master, args[1], args[2], "main")
    else:
        print("TRY 'put srcFile.txt destFile' OR 'get destFile'")


if __name__ == "__main__":
    main(sys.argv[1:])
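Assuming the client code above is saved as client.py (the file name is our assumption, as it is not stated in the report), it would be invoked along the lines of "python client.py put data_clean.txt words" to deduplicate, split and store a file in the DFS, and "python client.py get words" to read it back, sort it in parallel and store the sorted copy, matching the usage string printed by main().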