Enterprise Search System Upgrade For Biotech Company

A biotech company specializing in the production of vaccines needs a system upgrade. As the appointed Senior Software Engineer, I have been assigned to implement the organization’s new enterprise search system. The upgrade aims to bring the rising number of internal documents under control.

The System Necessities Include:

The system has to be available to all staff members.
The system should allow staff members to share documents with one another.
The system must include access restrictions or privacy features that safeguard confidential documents from non-employees.
The system should return documents that are relevant to users’ information needs.

In This Report, Three Possible Solutions Are Analyzed:

Solution 1: Creation of an in-house search engine implementing Boolean retrieval.

Solution 2: Creation of an in-house search engine based on the Vector Space Model.

Solution 3: A website that allows the crawlers of commercial Web search engines to capture and index the documents.

We are living in a digital era in which retrieving information from a collection of resources has become critical to all aspects of life (Azad and Deepak, 2019). Information retrieval (IR) is a discipline concerned with structuring, analyzing, organizing, storing, searching, and retrieving information.

Information retrieval is a systematic process initiated by a query that expresses an information need (Bahri, 2021). The steps of the process are highlighted below.

Indexing a collection of documents.
Matching the query against the index and retrieving documents based on relevance.
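
The two steps above can be sketched in a few lines of Python. This is only an illustrative sketch: the sample documents, the tokenizer, and the function names build_index and match are assumptions made for the example, and the score (a count of shared query terms) is a stand-in for a real relevance measure.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return "".join(c.lower() if c.isalnum() else " " for c in text).split()


def build_index(docs):
    """Indexing step: map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def match(index, query):
    """Matching step: order documents by how many distinct query terms they contain."""
    scores = defaultdict(int)
    for term in set(tokenize(query)):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # Highest score first; ties are broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical internal documents, invented for this example.
docs = {
    1: "Vaccine trial protocol for phase one",
    2: "Cold chain storage guidelines for vaccine batches",
    3: "Employee onboarding checklist",
}
index = build_index(docs)
ranked = match(index, "vaccine storage")
```

Document 2 contains both query terms and document 1 contains one, so they are returned in that order, while document 3 is not returned at all.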

An IR model can be either a best-match model (such as the Vector Space Model), in which documents are retrieved and ranked by relevance (DeepAI, 2019), or an exact-match model (such as the Boolean Model), in which documents are either retrieved or not (Borisov, et al, 2018).

IR is applied in information filtering, recommender systems, search engines, and digital libraries.

The Boolean Retrieval Model is among the oldest known retrieval models. According to Azad and Deepak (2019), the Boolean Retrieval Model is a model for information retrieval that accepts any query in the form of a Boolean expression, in which terms are combined with the operators AND, OR, and NOT. The model applies the exact-matching method (George, 2018): it satisfies a user’s query by locating all documents that match the terms and operators specified in the query.

The purpose of the exact-match model is to classify a collection of documents into two categories: those that match the criteria of the query and those that fail to match it (Günther, et al, 2019). The retrieved documents are not graded; rather, they are ordered by a specific criterion such as document ID number. Exact-match retrieval is an easy and effective way of retrieving information. However, the model has scalability limitations as the size of the collection increases.
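
As a minimal sketch of exact-match retrieval, the Boolean operators can be mapped onto set operations over an inverted index. The sample documents and the helper names build_index and postings are assumptions made for this example, not part of any particular system.

```python
def build_index(docs):
    """Map each lowercase term to the set of ids of documents that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(doc_id)
    return index


# Hypothetical documents, invented for this example.
docs = {
    1: "vaccine trial protocol",
    2: "vaccine storage guidelines",
    3: "employee onboarding checklist",
}
index = build_index(docs)
all_ids = set(docs)


def postings(term):
    """Return the ids of documents containing the term (empty set if none)."""
    return index.get(term.lower(), set())


# vaccine AND storage: set intersection
vaccine_and_storage = postings("vaccine") & postings("storage")

# vaccine OR onboarding: set union
vaccine_or_onboarding = postings("vaccine") | postings("onboarding")

# NOT vaccine: complement with respect to the whole collection
not_vaccine = all_ids - postings("vaccine")

# vaccine AND NOT trial: set difference
vaccine_not_trial = postings("vaccine") - postings("trial")

# Results are not graded; here they are simply ordered by document id.
result = sorted(vaccine_or_onboarding)
```

Every document either belongs to the result set or does not, which is why the output carries no relevance ordering of its own.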

Gysel, et al, (2018) defined the Vector Space Model, also called the Term Vector Model, as an algebraic model used to represent text documents, or any other objects, as vectors of identifiers such as index terms. Leung, Lee, and Song, (2019) gave a more detailed definition of the Vector Space Model founded on its process. The Vector Space Model represents everything (queries and documents) as vectors of weights. The purpose of the model is to rank documents by the similarity of their vectors to the query vector.
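
The idea of representing queries and documents as weight vectors can be sketched as follows. The weighting scheme used here, term frequency multiplied by log(N / document frequency), is one common choice and is an assumption of this example, as are the sample documents and the function names.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Build a sparse TF-IDF weight vector (term -> weight) for each document."""
    tokenized = {d: text.lower().split() for d, text in docs.items()}
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for terms in tokenized.values() for term in set(terms))
    idf = {term: math.log(n / df[term]) for term in df}
    vectors = {
        d: {term: tf * idf[term] for term, tf in Counter(terms).items()}
        for d, terms in tokenized.items()
    }
    return vectors, idf


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query, vectors, idf):
    """Weight the query like a document and sort documents by similarity."""
    counts = Counter(query.lower().split())
    q = {term: tf * idf.get(term, 0.0) for term, tf in counts.items()}
    scored = [(cosine(q, vec), d) for d, vec in vectors.items()]
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [d for score, d in ordered if score > 0]


# Hypothetical documents, invented for this example.
docs = {
    1: "vaccine trial protocol",
    2: "vaccine storage guidelines",
    3: "employee onboarding checklist",
}
vectors, idf = tf_idf_vectors(docs)
ranking = rank("vaccine storage", vectors, idf)
```

Unlike the exact-match case, document 1 is still returned for this query even though it matches only one term; it is simply ranked below document 2.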

Technical Background

There are several approaches to measuring similarity, including cosine similarity, inner product, Dice similarity, and Jaccard similarity (Bahri, 2021).
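
For comparison, the four measures can be written out for two weight vectors of equal length. The formulas below are the commonly used vector forms (the Jaccard form is sometimes called the Tanimoto coefficient); the example vectors are made up for illustration.

```python
import math


def inner_product(x, y):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def cosine(x, y):
    """Inner product normalised by the lengths of both vectors."""
    norm = math.sqrt(inner_product(x, x)) * math.sqrt(inner_product(y, y))
    return inner_product(x, y) / norm if norm else 0.0


def dice(x, y):
    """Twice the inner product over the sum of squared lengths."""
    denom = inner_product(x, x) + inner_product(y, y)
    return 2 * inner_product(x, y) / denom if denom else 0.0


def jaccard(x, y):
    """Inner product over the squared lengths minus the inner product."""
    dot = inner_product(x, y)
    denom = inner_product(x, x) + inner_product(y, y) - dot
    return dot / denom if denom else 0.0


# Binary term vectors over a hypothetical four-term vocabulary.
query = [1, 1, 0, 0]
doc = [1, 1, 1, 0]

scores = {
    "inner product": inner_product(query, doc),
    "cosine": cosine(query, doc),
    "dice": dice(query, doc),
    "jaccard": jaccard(query, doc),
}
```

The inner product is unbounded and favours long documents, whereas the other three measures are normalised and, for non-negative weights, fall between 0 and 1.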

According to Azzopardi, et al, (2018), the aim of a general-purpose search engine is to index a sizeable share of the Web, independent of topic and domain.

There are three main components in a search engine that work together to satisfy a user’s information need. These are the Crawler, the Indexer, and the Query Engine.

The crawler is also known as a robot or spider. It helps in the collection of documents by repeatedly collecting links, beginning from a set of start pages (Azad and Deepak, 2019). The pages retrieved and their components are reduced and stored in a page repository. The URLs and their respective links are transferred to the crawler control module.

The pages previously collected by the crawler are then processed by the indexer (Lin, and Ma, 2021).

The query engine processes user queries and retrieves ranked matching answers (documents).
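
The crawler's link-following loop can be sketched as a breadth-first traversal. To keep the example self-contained it walks a small in-memory link graph instead of fetching real pages; the graph, the fetch_links callback, and the max_pages limit are all assumptions of this sketch.

```python
from collections import deque


def crawl(start_pages, fetch_links, max_pages=100):
    """Visit pages breadth-first from the start pages, following each link once."""
    frontier = deque(start_pages)   # URLs waiting to be visited
    seen = set(start_pages)         # URLs already queued, to avoid repeats
    repository = []                 # pages collected, in visiting order
    while frontier and len(repository) < max_pages:
        url = frontier.popleft()
        repository.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return repository


# Hypothetical site structure: page -> links found on that page.
web = {
    "/home": ["/docs", "/about"],
    "/docs": ["/docs/a", "/home"],
    "/about": [],
    "/docs/a": ["/docs"],
}
visited = crawl(["/home"], lambda url: web.get(url, []))
```

The pages in the repository would then be handed to the indexer, and the resulting index queried by the query engine.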

The documents retrieved from the query engines are then ranked in order of importance using these three approaches collectively: anchor, link, and content (similarity).

Link-based ranking technique - One of the best-known link-based techniques currently in use is a variant of the PageRank algorithm, which is implemented in the Google search engine.
Content-based ranking technique - Just as in the vector-based model, a similarity score is calculated between a specified topic and a document. The process is explained by the cosine similarity measure.
Anchor-based ranking technique - The anchor text is the visible hyperlinked text on the page. This technique analyses the quality of the page by pattern matching between the URL’s anchor text and the query vector.
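
To illustrate link-based ranking, the following is a sketch of the textbook PageRank computation by power iteration. It is not the variant used by any commercial search engine; the damping factor of 0.85, the iteration count, and the tiny link graph are assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Return a rank for each page given a mapping of page -> outgoing links."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph: "c" is linked to by three other pages.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
top_page = max(ranks, key=ranks.get)
```

Page "c" ends up with the highest rank because it collects links from "a", "b", and "d", while "d", which nobody links to, keeps only the baseline share.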

Boolean Retrieval Model

Pros

The results are predictable and relatively easy to explain.
Efficient processing because multiple documents can be eliminated from the search.
Many varying features can be incorporated into the system.

Cons

Complex queries tend to be difficult to formulate.
Its effectiveness depends entirely on the user.
Usually, simple queries don’t work well.
Structuring Boolean queries requires a certain level of skill.

Vector Space Model

Pros

It is simple and fast.
It sorts the documents according to their degree of similarity to the query.
Term-weighting increases the quality of the answer set.
It improves retrieval performance.

Cons

It is based on the assumption that the index terms are independent.
There are no predictions about the techniques for purposes of ranking.

Web Search Engine

Pros

It is easy to apply.
Useful in searching for specific or unique topics.
Users can retrieve results matching the words they are looking for.

Cons

It lacks security.
Search engines cannot update their indexes at the speed at which the Web evolves.

The Boolean retrieval technique follows the exact-match model, which is a precise method of retrieval (Azad and Deepak, 2019). A document either matches or does not match the query. Also, there is the risk of the system returning either too many or too few results depending on the operator used, which potentially results in an ineffective query.

The Vector Space Model uses weights to rank documents according to their similarity (relevance) to the query (Abu-Salih, 2018). This approach addresses the problem with the Boolean retrieval model (retrieval of too many or too few documents) by using a ranking system in order to retrieve the documents most appropriate to the user’s query.

There are several problems that exist in regard to the search effectiveness of Web Search Engines. They include:

Web search engines are not updated regularly enough; according to Pokorny (2004), search engines update monthly.
Lack of authority – spamming.
Lack of quality control – document precision and reliability are not inherently guaranteed.

Even if employees upload high-quality, specific documents to the Web, there is no guarantee that, when an employee needs to retrieve a document:

The Web search engine has been updated so that it can locate the document in the first place.
The document is ranked correctly, since there is the possibility of false ranking caused by spamming.

Boolean Retrieval Model

Due to the structure of Boolean querying with respect to operator usage, this method would require users to have some level of skill in formulating queries and to know specifically which document(s) they are looking for (Azzopardi, et al, 2018). However, it is not certain that the intended users (employees at the biotech company) can effectively formulate Boolean queries that bring back the documents they are looking for when needed.

Vector Space Model

Users write queries in natural language. This approach to querying is more user-friendly than formulating Boolean queries and caters to all employees.

Web Search Engine

It is user friendly and intuitive, given that all employees are familiar with how common search engines like Google work.

Analysis And Recommendations

Boolean Retrieval Model

Creation and maintenance of such a system would attract high development costs.

Vector Space Model

High development costs would be incurred in creating a VSM IR system (Priyadarshini, et al, 2010).

Web Search Engine

It is the most cost-efficient option.

Boolean Retrieval Model

An in-house search engine gets rid of the security dangers of information retrieval through the Web. Similarly, systems such as Dialog and Westlaw offer role-based identity features, such as user authentication, that secure the document collection against unwanted users (Rehma, Awan, and Butt, 2018). All of these must be implemented during system development.

Vector Space Model

An in-house search engine does away with the security risks of information retrieval through the Web (Zamani, et al, 2020). Similarly, systems such as Elasticsearch offer role-based identity features, such as user authentication, that secure the document collection against unwanted users. All of these have to be implemented during development.
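
One way to picture role-based identity in an in-house engine is as a filter applied to the ranked results before they are shown. The sketch below is hypothetical throughout: the Document and User classes, the role names, and the visible_results function are invented for illustration and do not correspond to the API of any existing product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    doc_id: int
    title: str
    allowed_roles: frozenset  # roles permitted to see this document


@dataclass(frozen=True)
class User:
    name: str
    roles: frozenset
    authenticated: bool = False


def visible_results(user, ranked_docs):
    """Keep only the ranked hits that the user is entitled to see."""
    if not user.authenticated:
        return []  # unauthenticated users, such as non-employees, see nothing
    # A document is visible if the user holds at least one of its allowed roles.
    return [doc for doc in ranked_docs if user.roles & doc.allowed_roles]


# Hypothetical ranked output of the query engine.
ranked = [
    Document(1, "Vaccine trial results", frozenset({"research"})),
    Document(2, "Cafeteria menu", frozenset({"research", "hr", "staff"})),
    Document(3, "Salary review", frozenset({"hr"})),
]
researcher = User("researcher", frozenset({"research", "staff"}), authenticated=True)
visitor = User("visitor", frozenset(), authenticated=False)

researcher_ids = [doc.doc_id for doc in visible_results(researcher, ranked)]
visitor_ids = [doc.doc_id for doc in visible_results(visitor, ranked)]
```

Filtering after ranking preserves the relevance order for each user while ensuring that confidential documents never appear in the results of someone who lacks the required role.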

Web Search Engine

Given today’s reliance on retrieving information through the Web, it is no surprise that there are security concerns. Using popular search engines such as Google returns multitudes of URLs from a single query. However, some of these compromise your security, for example phishing sites and Trojan sites (Priyadarshini, et al., 2010). While there are a few ways to mitigate the danger of attack, uploading private information to a website for the purpose of retrieval through the Web makes you prone to many security breaches.

Scoring scale: 1 is the lowest possible score and 5 is the highest possible score.

                                           Possible Solutions
Criteria                        Weighting  Solution A    Solution B    Solution C
                                           Score  Total  Score  Total  Score  Total
Search Effectiveness            5          3      15     5      25     4      20
Usability                       5          2      10     4      20     5      25
Budget                          4          3      12     4      16     4      16
Security & Role-based identity  5          5      25     5      25     3      15
Total                                             62            86            76
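
The totals in the matrix follow from multiplying each score by its criterion weighting and summing per solution. A short sketch of that arithmetic, using the weightings and scores from the table (the variable names are chosen for this example):

```python
# Weighting of each criterion, taken from the decision matrix.
weights = {
    "Search Effectiveness": 5,
    "Usability": 5,
    "Budget": 4,
    "Security & Role-based identity": 5,
}

# Score of each solution on each criterion (1 = lowest, 5 = highest).
scores = {
    "Solution A": {"Search Effectiveness": 3, "Usability": 2, "Budget": 3,
                   "Security & Role-based identity": 5},
    "Solution B": {"Search Effectiveness": 5, "Usability": 4, "Budget": 4,
                   "Security & Role-based identity": 5},
    "Solution C": {"Search Effectiveness": 4, "Usability": 5, "Budget": 4,
                   "Security & Role-based identity": 3},
}

# Weighted total per solution: sum of weighting * score over all criteria.
totals = {
    solution: sum(weights[criterion] * score for criterion, score in by_criterion.items())
    for solution, by_criterion in scores.items()
}
best = max(totals, key=totals.get)
```

This gives 62 for Solution A, 86 for Solution B, and 76 for Solution C, so Solution B has the highest weighted total.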

Based on the outcome of the Weighted Decision Matrix above, Solution B – the Vector Space Model – achieved the greatest score. It met all the set requirements, obtaining perfect scores in Search Effectiveness and in Security and Role-based identity. For the following reasons, it is ideal to develop an in-house search engine based on the Vector Space Model for the biotech company:

Search Effectiveness – It ranks documents by their similarity to the query, which avoids returning either too few or too many matches and returns only relevant, high-quality documents.
Usability – It uses user-friendly natural-language queries, which are familiar to all employees.
Budget – This model leans a little towards the costly end of the scale, but the cost is necessary to satisfy all the other requirements.
Security and Role-based identity – The documents are only accessible through the in-house search engine, which removes the security dangers of the Web. In addition, user authentication features would be implemented, adding another layer of protection.

Conclusion

The aim of this report was to evaluate three possible information retrieval solutions for upgrading the system of a biotech company in response to an increase in internal documents: Solution A – an in-house search engine based on the Boolean Retrieval Model; Solution B – an in-house search engine based on the Vector Space Model; and Solution C – uploading internal documents to a website and using a commercial Web search engine for information retrieval.

The system solution had to satisfy the following four criteria: Search Effectiveness, Usability, Budget, and Security and Role-based identity. The report presented a technical background of information retrieval and of every solution, highlighting the key attributes. It then analysed each solution on the basis of the specified system requirements and ranked the solutions using the weighted decision matrix in order to determine the best-suited one.

In conclusion, developing an in-house search engine based on the vector space model of information retrieval was most appropriate considering the system needs of the biotech company.

References

Abu-Salih, B., 2018. Applying Vector Space Model (VSM) techniques in information retrieval for Arabic language.

Azad, H.K. and Deepak, A., 2019. Query expansion techniques for information retrieval: a survey. Information Processing & Management, 56(5), pp.1698-1735.

Azzopardi, L., Thomas, P. and Craswell, N., 2018, June. Measuring the utility of search engine result pages: an information foraging based measure. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval (pp. 605-614).

Bahri, S., 2021. Aplikasi Pencarian Bahan Pustaka Di Perpustakaan Menggunakan Metode Vector Space Model. JIMP-Jurnal Informatika Merdeka Pasuruan, 5(2).

Billal, S., 2018. Development of search engine based on vector space model. A dissertation in fulfillment of the requirements of the degree of Master in Computer Science.

Borisov, A., Wardenaar, M., Markov, I. and de Rijke, M., 2018, June. A click sequence model for web search. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval (pp. 45-54).

DeepAI, 2019. Cosine Similarity. [online] Available at: https://deepai.org/machine-learningglossary-and-terms/cosine-similarity [Accessed 4 Nov. 2021].

George, M., 2018, September. Unsupervised Topic Detection based on 2D Vector Space model using Apriori Algorithm and NLP. In 2018 Thirteenth International Conference on Digital Information Management (ICDIM) (pp. 279-283). IEEE.

Günther, F., Rinaldi, L. and Marelli, M., 2019. Vector-space models of semantic representation from a cognitive perspective: A discussion of common misconceptions. Perspectives on Psychological Science, 14(6), pp.1006-1033.

Gysel, C.V., De Rijke, M. and Kanoulas, E., 2018. Neural vector spaces for unsupervised information retrieval. ACM Transactions on Information Systems (TOIS), 36(4), pp.1-25.

Leung, C., Lee, W. and Song, J.J., 2019. Information technology-based patent retrieval models. In Springer Handbook of Science and Technology Indicators (pp. 859-874). Springer, Cham.

Lin, J. and Ma, X., 2021. A few brief notes on deepimpact, coil, and a conceptual framework for information retrieval techniques. arXiv preprint arXiv:2106.14807.

Priyadarshini, Aishwarya, S. and Ahmed, A.A., 2010. Search engine vulnerabilities and threats – a survey and proposed solution for a secured censored search platform. In 2010 International Conference on Communication and Computational Intelligence (INCOCCI) (pp. 535-539).

Rehma, A.A., Awan, M.J. and Butt, I., 2018. Comparison and evaluation of information retrieval models.

Zamani, H., Dumais, S., Craswell, N., Bennett, P. and Lueck, G., 2020, April. Generating clarifying questions for information retrieval. In Proceedings of The Web Conference 2020 (pp. 418-428).
