PARALLEL DATA SORTING AND DEDUPLICATION
IN DISTRIBUTED FILE SYSTEMS
TABLE OF CONTENTS
i. Title Page
ii. Declaration
iii. Certificate
iv. Acknowledgement
v. Table of Contents
vi. Abstract
1. Introduction
2. Literature Survey
3. Project Design
    3.1. Description of Modules
    3.2. Software and Hardware Specifications
    3.3. Model Diagrams and Design Specifications
    3.4. Individual Contributions
    3.5. Project Plan and Milestones
4. Achievements
    4.1. Analysis Completed
    4.2. Results
    4.3. Future Work
    4.4. Scope for Publishing
5. References
6. Appendix A
    6.1. Data Pre-Processing
    6.2. Parallel Deduplication, Sorting, Client File
ABSTRACT
This project, titled “Parallel Data Sorting and Deduplication”, aims to build a distributed file system (DFS) in Python, modelled on the Hadoop Distributed File System (HDFS). In today’s age of cloud computing and big data, where the speed of computation is as critical as the reliability of storage and datasets are enormous, DFSs are the storage solution that will have a long-lasting impact into the future. Speed of computation and reliability of storage are the two cornerstones on which our distributed file system is built. For the first factor, speed, we have designed a parallel deduplication algorithm and a parallel sorting algorithm. The latter is a novel algorithm based on sequential merge sort that makes use of the nature of the DFS designed, i.e. the master-slave architecture of HDFS. These operations were chosen because this project serves as a basic demonstration of parallel processing in DFSs, and sorting and deduplication are fundamental operations that underlie almost every advanced data storage and manipulation task. The Python language has been used for creating the DFS and the Map-Reduce paradigm. Communication within the DFS is performed through remote procedure calls (RPCs), implemented with the RPyC library in Python. Finally, the DFS has been shown to provide fast, reliable storage together with parallel data computation and processing.
1. INTRODUCTION
In recent times, there has been a surge in high-performance computing, especially distributed computing and parallel operations, along with a major rise in network-based computing. All of these types of computing place several demands on their architectures, namely speed (of computing and of data storage/manipulation), reliability (of storage) and scalability (of the architecture used). We have attempted to create a distributed file system that fulfils all of these criteria.
As an outcome of the progress of web and data technology, massive amounts of data are produced in our everyday lives; large volumes, running into petabytes, are recorded each day. Beyond sheer size, big data has further characteristics such as variety and velocity. Consequently, big data analysis using machine learning and data mining techniques has become an important research problem.
Mining such enormous datasets is difficult with present techniques and data mining software tools because of their large size and complexity. In other words, executing a data mining task over very large datasets on a single computer incurs high computational cost. It is therefore necessary to use more powerful computing environments to process and analyse big data efficiently. Solutions to the problem of mining large-scale datasets can be based on parallel and distributed computing platforms. In principle, parallel processing focuses on partitioning the chosen (large) problem into smaller ones, each of which is handled by a single processor, so that a computation composed of many sub-computations is performed simultaneously in a distributed and parallel fashion.
In our distributed file system, we have used the master-slave architecture seen in the Hadoop Distributed File System and the Map-Reduce paradigm. Following the standard nomenclature, the master node is called the NameNode, the slave/minion nodes are called DataNodes, and the data is replicated by a factor replication_factor specified in the config file. The data is first deduplicated in parallel, then sorted using the parallel merge sort algorithm. This data is then replicated as mentioned above and stored in the DataNodes. The communication between the NameNode (master), the DataNodes (minions) and the client node takes place using remote procedure calls (RPCs), implemented with the RPyC library in Python.
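As a small illustration of how such a configuration might be read, the sketch below loads the block size and replication factor from a config file; the file name dfs.conf, the configparser-based format and the values shown are our assumptions, not the project’s actual settings.

# Hedged sketch: reading the DFS configuration. The file name "dfs.conf" and the
# values are assumptions; the report only states that the block size and the
# replication factor are kept in a config file.
#
# dfs.conf might contain:
#   [dfs]
#   block_size = 500
#   replication_factor = 2
import configparser

config = configparser.ConfigParser()
config.read("dfs.conf")

BLOCK_SIZE = config.getint("dfs", "block_size")
REPLICATION_FACTOR = config.getint("dfs", "replication_factor")
print(BLOCK_SIZE, REPLICATION_FACTOR)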
While reading data, the minion nodes are contacted in serial order. If the data for the requested file is held by a minion, it is streamed over RPC from the minion to the client and displayed on the screen. The data is replicated as a fail-safe against data corruption and power failures.
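The minion-side code is not reproduced in Appendix A, so the following is a minimal sketch, under our assumptions, of what a minion (DataNode) service could look like to serve the client calls shown in the appendix: put() stores a block on local disk and forwards it along the list of remaining minions for replication, while get() returns the block if it exists. The port number is an assumption; the minion_nodes directory mirrors the path checked by the client’s get() function.

# Hedged sketch of a minion (DataNode) service. The names mirror the calls made
# by the client code in Appendix A (con.root.Minion(), minion.put(), minion.get());
# the port number and internals are assumptions.
import os

import rpyc
from rpyc.utils.server import ThreadedServer

# directory where blocks are stored; mirrors the "minion_nodes" path used by the client
DATA_DIR = os.path.join(os.getcwd(), "minion_nodes")


class MinionService(rpyc.Service):
    class exposed_Minion():
        def exposed_put(self, block_uuid, data, minions):
            # store the block locally
            with open(os.path.join(DATA_DIR, str(block_uuid)), "w") as f:
                f.write(data)
            # forward the block to the next minion in the replication chain
            if len(minions) > 0:
                self.forward(block_uuid, data, minions)

        def exposed_get(self, block_uuid):
            block_path = os.path.join(DATA_DIR, str(block_uuid))
            if not os.path.isfile(block_path):
                return None
            with open(block_path) as f:
                return f.read()

        def forward(self, block_uuid, data, minions):
            host, port = minions[0]
            con = rpyc.connect(host, port=port)
            con.root.Minion().put(block_uuid, data, minions[1:])


if __name__ == "__main__":
    if not os.path.isdir(DATA_DIR):
        os.makedirs(DATA_DIR)
    ThreadedServer(MinionService, port=8888).start()  # port is an assumption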
The project has immense scope for future work. Distributed file systems are the future of data storage and manipulation. Today, data is stored in data centres all over the globe, and this data must be constantly synced across all the centres. To make this possible, DFSs provide the ultimate solution, both in terms of speed and reliability, and additional parallel computations (deduplication and sorting) only increase the speed of operations. This work can therefore be carried forward and modified for larger data, for additional file system features such as deleting DFS files, and for many more possibilities.
2. LITERATURE SURVEY
Large-scale Distributed L-BFGS
M. M. Najafabadi et al. (2017) suggest a solution to the problem of examining large amounts of data, or extracting patterns from it, for training machine learning algorithms [1]. Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) is an optimization method used for parameter estimation. Since the resources of a single computer can be insufficient for running this algorithm, the paper presents a parallel implementation of L-BFGS on a distributed file system, using HPCC as the distributed platform.
Scalable Distributed Implementation of a Biologically Inspired Parallel Model
G. Ciobanu (2015) introduces a distributed computing middleware inspired by biological models, which helps in solving various synchronization issues [2]. The MapReduce paradigm is used to develop a parallel and scalable implementation that permits the division of a single task into several subtasks. These subtasks are executed in parallel and their results are aggregated into a final result. The model can provide solutions to NP-complete problems.
Efficient Parallel Spectral Clustering Algorithm Design for Large Data Sets under Cloud Computing Environment
R. Jin et al. (2013) improve clustering speed by combining spectral clustering with MapReduce [3], evaluating sparse matrix eigenvalues and finding clusters in a distributed manner. The paper concludes that as the amount of processed data increases, the clustering rate increases linearly, indicating that the parallel spectral clustering algorithm scales well in a cloud computing environment.
iiHadoop: An Asynchronous Distributed Framework for Incremental Iterative Computations
A. G. B. Saadon et al. (2017) introduce iiHadoop as an extension to the existing Hadoop framework [4]. Although MapReduce was introduced to handle computations over massive amounts of data, it does not provide a solution for handling small amounts of incremental data. The proposed iiHadoop speeds up programs by performing incremental computations only on the small fraction of affected data, and further improves performance by executing iterations asynchronously.
Trustworthy Group Making Algorithm in Distributed Systems
A. Aikebaier et al. (2011) model a distributed, scalable system as a peer-to-peer system [5]. The security between group members, or peers, is the primary concern: each peer has to be trustworthy, since the behaviour of one peer can affect the whole system. The trustworthiness of each peer is a fundamental variable in such a distributed environment. The paper introduces a new approach for incrementally building a safe group in distributed system protocols.
Meta-MapReduce for Scalable Data Mining
X. Liu et al. (2015) tackle the time taken by MapReduce to solve machine learning problems involving iterations [6]. The existing MapReduce framework suffers from a significant weakness in that it cannot support iteration efficiently. A new algorithm, Meta-MapReduce, is introduced which reduces the computational complexity of processing the training data as the number of nodes increases, while also obtaining smaller error rates.
Lessons Learned from CPES Co-Simulation with Distributed, Heterogeneous Systems
C. Steinbrink et al. (2018) present a case study based on co-simulation with distributed, heterogeneous simulators [7]. The increasing integration of renewable energy sources into the conventional power grid often transforms the grid into a cyber-physical energy system. Although this provides options for stable, optimized control, it also introduces vulnerabilities through ignorance of certain setup characteristics. The authors present MOSAIK, a system that aims to bridge the gap between the requirements for special interfacing and high usability.
A Feasible MapReduce Peer-to-Peer Framework for Distributed Computing Applications
H. M. Tran et al. (2015) introduce a MapReduce peer-to-peer framework that brings MapReduce implementations to P2P networks [8]. This is useful for users who cannot afford dedicated clusters for occasional demands. Another advantage of the framework is that it allows Internet users to make use of large datasets on distributed systems. It also includes features to improve fault tolerance and to manage peer failures.
Parallel Backprojection: A Case Study in High-Performance Reconfigurable Computing
B. Cordes et al. (2009) present an implementation of backprojection for synthetic aperture radar (SAR) on a high-performance reconfigurable computing (HPRC) system [9]. Backprojection is an image synthesis algorithm that can be used as part of SAR processing. On HPRC machines, a novel approach combining general-purpose processors and FPGAs allows designers to exploit both fine-grained and coarse-grained parallelism, reaching very high degrees of computational speedup.
3. PROJECT DESIGN
3.1 Description of Modules
There are 5 major modules in this project:
Data Pre-processing: We have taken data from two sources: the Brown corpus available in NLTK, and the Wikipedia 100k dataset, which contains the 100,000 most frequently used words. The second dataset contained repeated words as well as non-English words. While repetition was not a major concern, the non-English words had to be removed. The result is a large dataset containing 306,694 words (with repetitions). All of these operations were performed in a Jupyter notebook.
Master Node: This module pertains to the master node or, more appropriately, the NameNode. It handles all communication among the other nodes over remote procedure calls through the RPyC module. The master node is responsible for keeping communication channels open, for linking the client and the minions, and for maintaining the integrity of the data nodes. A hedged sketch of what such a service might look like is given after this list.
Minion Node: This module creates the minion nodes that store data according to the Map-Reduce paradigm. It is responsible for ensuring the authenticity and integrity of the data, and for making sure that data is appropriately replicated and split between the nodes.
Client: The client module is responsible for carrying out the parallel tasks, i.e. deduplicating the data in parallel and then sorting it using parallel merge sort. Functions in the client file call objects exposed by the master and minion nodes to obtain information about the distributed file system and get these tasks done.
Configuration File: The config file contains basic information about the DFS, i.e. the block size and the replication factor. This information is used by the master and minion files to create the DFS and forms the basic building blocks of the file system.
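The NameNode code itself is not listed in Appendix A. Below is a minimal, hedged sketch of what the master service could look like so that the calls made by the client code in the appendix (write, get_file_table_entry, get_minions, get_block_size, get_replication_factor) resolve. The minion addresses, the random block-placement policy and the constants shown are assumptions; in the actual project the block size and replication factor come from the configuration file.

# Hedged sketch of the master (NameNode) service. Method names mirror the calls
# made by the client code in Appendix A; the minion addresses, constants and
# block-placement policy are assumptions.
import math
import random
import uuid

import rpyc
from rpyc.utils.server import ThreadedServer


class MasterService(rpyc.Service):
    class exposed_Master():
        file_table = {}                        # file name -> list of (block_uuid, [minion_ids])
        minions = {"1": ("127.0.0.1", 8888),   # assumed DataNode addresses
                   "2": ("127.0.0.1", 8889)}
        block_size = 500                       # in the project these come from the config file
        replication_factor = 2

        def exposed_get_block_size(self):
            return self.block_size

        def exposed_get_replication_factor(self):
            return self.replication_factor

        def exposed_get_minions(self):
            return self.minions

        def exposed_get_file_table_entry(self, fname):
            return self.file_table.get(fname)

        def exposed_write(self, dest, size, src):
            # allocate one entry per block copy; each copy is placed on one minion
            num_blocks = int(math.ceil(float(size) / self.block_size))
            replicas = self.replication_factor if src == "main" else 1
            blocks = []
            for _ in range(num_blocks * replicas):
                block_uuid = str(uuid.uuid1())
                nodes = random.sample(list(self.minions.keys()), 1)
                blocks.append((block_uuid, nodes))
            self.__class__.file_table[dest] = blocks
            return blocks


if __name__ == "__main__":
    # port 2131 matches the client's rpyc.connect("localhost", port=2131)
    ThreadedServer(MasterService, port=2131).start()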
3.2 Software and Hardware Specifications
The major software libraries used in this project are:
RPyC: This Python library offers a convenient way to use remote procedure calls and to create network-based communication. RPyC is the means through which the master, minions and client communicate with each other and transfer data over the network (localhost in this case). Since RPCs are used, the project can also be extended to run over physically distributed systems: even if the specifications of the different machines are not known, communication can still happen over remote procedure calls.
Multiprocessing: The multiprocessing library in Python plays a role analogous to that of OpenMP in C. It facilitates spawning worker processes so that all available cores can be used for parallel computation; the number of worker processes can be set, work can be distributed across a process pool, and mutually exclusive (critical) sections can be protected with locks. This library has been used for parallel data deduplication and parallel data sorting; a small illustration of the Pool API is given after this list.
NLTK: The Natural Language Toolkit (NLTK) library has been used to obtain the words of the Brown corpus and the English word list used when generating the dataset.
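As a small illustration of the multiprocessing Pool API that the parallel routines rely on, the sketch below splits a word list into chunks, deduplicates each chunk in a separate worker process, and merges the partial results. It is only an example of the library’s usage under our assumptions, not the project’s exact deduplication code (the project code is listed in Appendix A).

# Minimal illustration of multiprocessing.Pool usage (the same API that
# parallel_sort in Appendix A uses). This is an example, not the project's
# exact deduplication implementation.
import multiprocessing


def dedup_chunk(chunk):
    # remove duplicates within one chunk
    return set(chunk)


def parallel_dedup(words):
    processes = multiprocessing.cpu_count()
    size = max(1, (len(words) + processes - 1) // processes)
    chunks = [words[i:i + size] for i in range(0, len(words), size)]
    with multiprocessing.Pool(processes=processes) as pool:
        partial_sets = pool.map(dedup_chunk, chunks)
    merged = set()
    for s in partial_sets:
        merged |= s
    return list(merged)


if __name__ == "__main__":
    print(parallel_dedup(["apple", "banana", "apple", "cherry", "banana"]))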
The project has been developed on a Dell Inspiron 7560 notebook with 8 GB of RAM and an Intel Core i5-7200U dual-core (four-thread) processor with a base clock of 2.5 GHz and Turbo Boost. Apart from the integrated Intel HD Graphics 620, there is also an NVIDIA GeForce 940MX GPU with 4 GB of dedicated GDDR5 graphics memory. The main storage is a 1 TB SATA3 NTFS hard drive, with the operating system located on a Samsung 850 EVO M.2 solid-state drive for faster access to applications and programs.
3.3 Model Diagrams and Design Specifications
Fig. 1: Model of the distributed file system
Fig. 2: Flowchart of tasks in the DFS
3.4 Individual Contributions
The individual contributions of all the group members are listed below:
Yash Gupta (15BCE2073):
Literature Survey
Implement parallel sorting
Implement master node
Implement minion nodes
Perform integration testing
Sukriti Jain (15BCB0065):
Requirements analysis
Implement MapReduce
Perform unit tests
Implement client interaction
Gahana Agarwal (15BCE0552):
Find relevant research
Implement communication network
Parallel data deduplication
Final documentation
3.5 Project Plan and Milestones
Fig. 3: Gantt chart for project milestones
4. ACHIEVEMENTS
4.1 Analysis Completed
We have successfully finished the implementation of the project ahead of schedule. We have also completed unit and integration tests and analysed a comparison of the parallel operations with the corresponding sequential operations. Although the parallel operations take slightly more time than the sequential ones on our test data, we have observed that this gap shrinks as the size of the dataset increases. We have therefore concluded that parallel operations are better suited to huge datasets, in the order of hundreds of millions to billions of values; for smaller datasets, the overhead of creating extra processes outweighs the benefit of sorting or deduplicating the data in parallel.
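For reference, the comparison follows the general form sketched below; the module name client (assumed to hold the Appendix A code) and the data file name are our assumptions, not part of the original report.

# Hedged sketch of the sequential-vs-parallel timing comparison. "client" is a
# hypothetical module name for the Appendix A code; the data file name is also
# an assumption.
import time

from client import parallel_sort

if __name__ == "__main__":
    with open("data_clean.txt") as f:
        words = [line.rstrip("\n") for line in f]

    start = time.perf_counter()
    sorted_seq = sorted(words)
    t_seq = time.perf_counter() - start

    start = time.perf_counter()
    sorted_par = parallel_sort(words)
    t_par = time.perf_counter() - start

    assert sorted_seq == sorted_par
    print("sequential: {:.3f} s, parallel: {:.3f} s".format(t_seq, t_par))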
4.2 Results
Fig. 4: Jupyter notebook for data pre-processing
Fig. 5: Reading data from source and storing in nodes
Fig. 6: Client reading data from the source
Fig. 7: Accessing the data stored in nodes
Fig. 8: Data after being sorted parallelly
Fig. 9: Storing the sorted data back into the nodes
Fig. 10: Reading already sorted data
4.3 Future Work
As mentioned earlier, distributed file systems represent the future of data storage and manipulation. Data centres across the world must access data and remain in synchronization at all times; that is where DFSs come into the picture, and with parallel operations this can be made even faster. Thus, this project has considerable future scope and many applications. Moreover, further functions can be implemented, such as deleting nodes, deleting the DFS files that store data, and so on.
4.4 Scope for Publishing
This project has a strong scope for being published because the work done here is genuine, original and cutting-edge. DFSs are only emerging, and not a lot of research and development has gone into the area of distributed systems with parallel operations. Finally, after more extensive testing on larger datasets on more efficient systems, this work could be sent to journals to be published.
5. REFERENCES
Najafabadi, M. M., Khoshgoftaar, T. M., Villanustre, F., & Holt, J. (2017). Large-scale distributed L-BFGS. Journal of Big Data, 4(1), 22.
Ciobanu, G. (2015). Scalable distributed implementation of a biologically inspired parallel model. Complex & Intelligent Systems, 1(1-4), 69-80.
Jin, R., Kou, C., Liu, R., & Li, Y. (2013). Efficient parallel spectral clustering algorithm design for large data sets under cloud computing environment. Journal of Cloud Computing: Advances, Systems and Applications, 2(1), 18.
Saadon, A. G. B., & Mokhtar, H. M. (2017). iiHadoop: an asynchronous distributed framework for incremental iterative computations. Journal of Big Data, 4(1), 24.
Aikebaier, A., Enokido, T., & Takizawa, M. (2011). Trustworthy group making algorithm in distributed systems. Human-centric Computing and Information Sciences, 1(1), 6.
Liu, X., Wang, X., Matwin, S., & Japkowicz, N. (2015). Meta-MapReduce for scalable data mining. Journal of Big Data, 2(1), 14.
Steinbrink, C., Köhler, C., Siemonsmeier, M., & van Ellen, T. (2018). Lessons learned from CPES co-simulation with distributed, heterogeneous systems. Energy Informatics, 1(1), 38.
Tran, H. M., Ha, S. V. U., Huynh, T. K., & Le, S. T. (2015). A feasible MapReduce peer-to-peer framework for distributed computing applications. Vietnam Journal of Computer Science, 2(1), 57-66.
Cordes, B., & Leeser, M. (2009). Parallel backprojection: a case study in high-performance reconfigurable computing. EURASIP Journal on Embedded Systems, 2009, 1.
6. APPENDIX A
6.1 Data Pre-Processing
from nltk.stem import LancasterStemmer
from nltk.stem import PorterStemmer
from nltk.corpus import words   # may require a one-time nltk.download("words")

# set of valid English words used to filter the raw data
eng_words = set(words.words())

data_dirty = [line.rstrip("\n") for line in open("data_dirty.txt", encoding="utf-8")]
data_clean = []
porter_stemmer = PorterStemmer()
lancaster_stemmer = LancasterStemmer()

# keep a word if it (or one of its stems) is a valid English word
for word in data_dirty:
    if word.lower() in eng_words:
        data_clean.append(word)
    elif word in eng_words:
        data_clean.append(word.lower())
    elif porter_stemmer.stem(word.lower()) in eng_words:
        data_clean.append(porter_stemmer.stem(word))
    elif porter_stemmer.stem(word) in eng_words:
        data_clean.append(porter_stemmer.stem(word.lower()))
    elif lancaster_stemmer.stem(word.lower()) in eng_words:
        data_clean.append(lancaster_stemmer.stem(word))
    elif lancaster_stemmer.stem(word) in eng_words:
        data_clean.append(lancaster_stemmer.stem(word.lower()))

len(data_dirty), len(data_clean)

# add every dictionary word so the final dataset also covers the full word list
data_clean.extend(list(eng_words))
len(data_clean)

with open("data_clean.txt", "w") as f:
    for word in data_clean:
        f.write("%s\n" % word)
6.2 Parallel Deduplication, Sorting, Client File
import math
import multiprocessing
import os
import sys
import warnings

import rpyc

warnings.filterwarnings("ignore")
def merge(*args):
    # accept either a (left, right) tuple or two arguments, because
    # multiprocessing.Pool.map passes a single tuple
    left, right = args[0] if len(args) == 1 else args
    left_length, right_length = len(left), len(right)
    left_index, right_index = 0, 0
    merged = []
    while left_index < left_length and right_index < right_length:
        if left[left_index] <= right[right_index]:
            merged.append(left[left_index])
            left_index += 1
        else:
            merged.append(right[right_index])
            right_index += 1
    if left_index == left_length:
        merged.extend(right[right_index:])
    else:
        merged.extend(left[left_index:])
    return merged


def merge_sort(data):
    # standard sequential merge sort, used by each worker on its own chunk
    length = len(data)
    if length <= 1:
        return data
    middle = length // 2
    left = merge_sort(data[:middle])
    right = merge_sort(data[middle:])
    return merge(left, right)


def parallel_sort(data):
    # split the data into one chunk per core, sort the chunks in parallel,
    # then repeatedly merge pairs of sorted chunks until one list remains
    processes = multiprocessing.cpu_count()
    pool = multiprocessing.Pool(processes=processes, maxtasksperchild=1)
    size = int(math.ceil(float(len(data)) / processes))
    data = [data[i * size: (i + 1) * size] for i in range(processes)]
    data = pool.map(merge_sort, data, chunksize=1)
    while len(data) > 1:
        extra = data.pop() if len(data) % 2 == 1 else None
        data = [(data[i], data[i + 1]) for i in range(0, len(data), 2)]
        data = pool.map(merge, data) + ([extra] if extra else [])
    return data[0]
def send_to_minion(block_uuid, data, minions):
    # write the block to the first minion; it forwards to the rest of the chain
    print("sending to: " + str(block_uuid) + str(minions))
    minion = minions[0]
    minions = minions[1:]
    host, port = minion
    con = rpyc.connect(host, port=port)
    minion = con.root.Minion()
    minion.put(block_uuid, data, minions)


def read_from_minion(block_uuid, minion):
    host, port = minion
    con = rpyc.connect(host, port=port)
    minion = con.root.Minion()
    return minion.get(block_uuid)
def get(master, fname):
    file_table = master.get_file_table_entry(fname)
    if not file_table:
        print("404: File Not Found")
        return
    data_unsorted = ""
    print("\nData stored in nodes is:\n")
    for block in file_table:
        # try each replica of the block in turn until one minion returns data
        for m in [master.get_minions()[_] for _ in block[1]]:
            data = read_from_minion(block[0], m)
            if data:
                sys.stdout.write(data)
                data_unsorted += data
                break
        else:
            try:
                if os.path.getsize(os.getcwd() + "\\minion_nodes\\" + str(block[0])) != 0:
                    print("No blocks found. Possibly a corrupt file.")
            except:
                print("No blocks found. Possibly a corrupt file.")
    data_unsorted = data_unsorted.split("\n")
    if data_unsorted == [""]:
        return
    if data_unsorted[-1] == "":
        data_unsorted = data_unsorted[:-1]
    # sort the retrieved words in parallel and store the sorted copy back in the DFS
    data_sorted = parallel_sort(data_unsorted)
    if data_unsorted != data_sorted:
        print("\n\n\nData stored in nodes is now sorted:\n")
        with open("data_{}_sorted.txt".format(fname), "w") as f:
            for word in data_sorted:
                print(word)
                f.write("%s\n" % word)
        print()
        put(master, "data_{}_sorted.txt".format(fname), fname, "get")
    else:
        print("\nData is already sorted.")
def put(master, source, dest, src):
    if src == "main":
        # deduplicate the source data before it is first stored in the DFS
        data_dup = [line.rstrip("\n") for line in open(source)]
        data_dup = list(set(data_dup))
        source = "data_{}_deduplicated.txt".format(dest)
        with open(source, "w") as f:
            for word in data_dup:
                f.write("%s\n" % word)
    size = os.path.getsize(source)
    b_size = int(math.ceil(float(size) / master.get_block_size()))
    blocks = master.write(dest, size, src)
    # replicate only on the initial put; re-puts of sorted data are stored once
    if src == "main":
        rep = master.get_replication_factor()
    else:
        rep = 1
    for r in range(rep):
        with open(source) as f:
            for i in range(b_size):
                b = blocks[b_size * r + i]
                data = f.read(master.get_block_size())
                block_uuid = b[0]
                minions = [master.get_minions()[_] for _ in b[1]]
                send_to_minion(block_uuid, data, minions)
def main(args):
    con = rpyc.connect("localhost", port=2131)
    master = con.root.Master()
    if len(args) != 0 and args[0] == "get":
        get(master, args[1])
    elif len(args) != 0 and args[0] == "put":
        put(master, args[1], args[2], "main")
    else:
        print("TRY 'put srcFile.txt destFile' OR 'get destFile'")


if __name__ == "__main__":
    main(sys.argv[1:])
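Assuming the client code above is saved as client.py (the file name is our assumption, as it is not stated in the report), it would be invoked along the lines of "python client.py put data_clean.txt words" to deduplicate, split and store a file in the DFS, and "python client.py get words" to read it back, sort it in parallel and store the sorted copy, matching the usage string printed by main().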