Multilingual Information Retrieval

Introduction:

Is it possible to build an information retrieval system that crosses language barriers throughout the world? As AI-driven retrieval has risen with the advent of tools such as ChatGPT, would it be possible to create a multilingual information retrieval system on a cross-language platform, one that gives its users access to information from countries whose languages they do not know? In this paper, we will analyze three main issues in creating a multilingual information retrieval system: what the platform would have to be, and how images, language, and user behavior would affect the way it interacts with the available information. Over the years the internet has filled with information from around the world, yet no comprehensive multilingual system has been adopted across libraries. There are many possible reasons for this, including differences in how languages are written, such as languages written in Latin script versus East Asian languages.

Argument I: Multilingual Information Retrieval

When trying to understand Multilingual Information Retrieval, one of the first issues a new researcher encounters is the range of systems associated with such a platform. To find a platform that could one day become a multilingual IR system, one must first understand the differences in the technical jargon. One problem is that three different terms are associated with this kind of information retrieval system: Multilingual Information Retrieval, Cross Language Information Retrieval (CLIR), and Interactive Multilingual Information Access (MLIA). They share a common purpose but carry different names.

First, what exactly is Multilingual Information Retrieval? Chowdhury states that “Multilingual information retrieval in digital libraries involves two major issues: recognition, manipulation and display of multiple languages, and cross-language information search and retrieval.” (Chowdhury, 2010, p. 467) Second, for Cross Language Information Retrieval, “The goal of Cross-Language Information Retrieval is to build search engines that use a query expressed in one language (e.g., English) to find content that is expressed in some other language (e.g., French).” (Galuščáková et al., 2022, p. 1) It has also been argued “that cross-language information retrieval (CLIR) should have the function of providing information useful enough in identifying the relevance of the retrieved document, in the user’s language.” (Suzuki et al., 2001, p. 422)

Lastly, consider what has been written about MLIA: “Multilingual communication enables the dissemination of information beyond the boundaries of languages. Interactive Multilingual Information Access (MLIA) refers to a process in which a user and a system collaborate to find documents that satisfy his/her information needs, regardless of the language in which those documents are created.” (Wu et al., 2012, p. 524)

All three of these systems would function in essentially the same capacity, with only slight differences. This is one of the major obstacles to defining what a true multilingual IR system would be: multiple areas of study have each sought their own definition without finding ways to work as one system. What all three have in common is the goal of making it possible for individuals in one region of the world to access information in a language they could not otherwise use. Figure 1 shows how the MLIA process would take place, from the first query to the final access point, including the possible pathways for retrieving information and how the translation of specific terms can eventually lead to a result.

When writing about MLIA, the lines between these systems blur, since the query step relies on Cross-Language Information Retrieval, and the pathways in Figure 1 show how the systems interact with one another. “Translation ambiguity, the problem of not knowing which translation alternatives are appropriate for the current query, is an important issue in CLIR. Researchers have developed various methods to handle possible translation ambiguities, in which relevance feedback-based techniques are among the effective methods.” (Wu et al., 2012, p. 524) With CLIR, translation can be approached in a similar way to the MLIA system: “First we can ask which items should be translated—the queries or the documents? With that question answered, we must then ask how those documents or queries can be broken up into units (which we call terms) that can be translated.” (Galuščáková et al., 2022, p. 4)

Figure 1: Steps in interactive multilingual information access (Wu et al., 2012, p. 524)

For the barriers created by multiple languages to come down, there has to be software that delivers the results needed; even with the solutions offered by multiple groups over the years, agreement on how this should work has been slow to emerge. “The most common approach is query translation, where the user queries are translated into the content language(s). The query translation process requires several steps, which are either performed automatically or with assistance from the user.” (Stiller et al., 2013, p. 87) Query translation makes it easier for users to access information across digital libraries. Creating such a system would mean consolidating the three well-known areas, Multilingual Information Retrieval, Cross Language Information Retrieval, and Multilingual Information Access, since maintaining three separate lines of work has led to essentially the same process under different names.
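To make the idea of query translation, and the translation ambiguity mentioned earlier, more concrete, here is a minimal Python sketch. It is not taken from any of the cited systems: the tiny English-to-French dictionary and the function name translate_query are hypothetical, and a real system would rank or filter the candidates, automatically or with help from the user.

```python
from itertools import product

# Hypothetical English-to-French dictionary; entries are illustrative only.
EN_FR = {
    "bank": ["banque", "rive"],  # ambiguous: financial institution vs. riverbank
    "river": ["rivière"],
    "library": ["bibliothèque"],
}


def translate_query(query, dictionary):
    """Return every candidate translation of a query, term by term."""
    alternatives = []
    for term in query.lower().split():
        # Terms missing from the dictionary are passed through unchanged.
        alternatives.append(dictionary.get(term, [term]))
    # One candidate query per combination of translation alternatives.
    return [" ".join(combo) for combo in product(*alternatives)]


candidates = translate_query("river bank", EN_FR)
print(candidates)
```

A two-word query already yields two candidate translations here, which is the ambiguity a user or a feedback step would have to resolve.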

Argument II: Language and Images in Multilingual Information Retrieval

Once defined, multilingual information retrieval also has other categories that it must draw on, each with its own caveats. “Three major approaches commonly employed for cross-lingual information retrieval include controlled vocabulary, knowledge-based, and corpus-based.” (C.C. Yang et al., 2008, p. 597) What are these three approaches? Controlled vocabulary “adopts a predetermined set of vocabularies for user queries and document indexing,” the knowledge-based approach “adopts an ontology and dictionary to translate queries from one language to another language,” and the corpus-based approach “makes use of the statistical information of term usage in a parallel or comparable corpus to automatically construct a statistically based cross-lingual thesaurus to overcome the limitations of knowledge-based approach.” (C.C. Yang et al., 2008, p. 597) Of these, controlled vocabulary is the approach I have come to understand best: it is a list of terms, such as the subject headings the Library of Congress uses, under which items are indexed. Most people encounter it on a library website when searching for a book, since these vocabularies are used to index library items.
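The controlled vocabulary approach can be pictured with a short Python sketch. The headings and entry terms below are invented for illustration and are not actual Library of Congress headings; the point is only that terms in different languages are mapped onto one predetermined heading, both for indexing and for queries.

```python
# Hypothetical controlled vocabulary: each preferred heading lists
# entry terms in several languages that should map to it.
CONTROLLED_VOCAB = {
    "Cats": {"cat", "cats", "kitten", "chat", "gato"},
    "Libraries": {"library", "libraries", "bibliothèque", "biblioteca"},
}


def to_heading(term):
    """Map a free-text term to its preferred heading, or None if unknown."""
    normalized = term.strip().lower()
    for heading, entry_terms in CONTROLLED_VOCAB.items():
        if normalized in entry_terms:
            return heading
    return None


def index_document(keywords):
    """Index a document under the headings its keywords map to."""
    headings = {to_heading(k) for k in keywords}
    headings.discard(None)  # keywords outside the vocabulary are dropped
    return sorted(headings)


print(index_document(["Gato", "bibliothèque", "unknown"]))
```

Because queries pass through the same mapping, a search for "chat" and a search for "gato" would reach the same heading, at the cost of ignoring any term the vocabulary does not list.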

Beyond this, there is also the problem of indexing terms across multiple languages, as some terms cannot be translated into other languages. “The predominance of English on the Web has also resulted in a predominance of the Latin script on the Web, meaning that languages that do not use Latin script are often written in Latin script, for instance by non-speakers of that language, or out of convenience.” (Blanco & Lioma, 2009, p. 327) The problem with a predominant language on the internet is that when a term does not translate, it is unclear what it should mean in another language. “As a consequence, terms of a document containing term(s) with no entry in the translation dictionary are artificially enhanced compared to those of a document that all its terms are available in the dictionary.” (Rahimi et al., 2015, pp. 257-258). How to translate something that has no equivalent word in another language becomes the biggest problem in a multilingual system.
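As a rough illustration of the dictionary-coverage problem, the Python sketch below separates the terms that have a dictionary entry from those that do not and reports the share that could be translated. The Portuguese-to-English dictionary and both function names are hypothetical, and this is not the weighting method of Rahimi et al.; it only shows the bookkeeping a system would need before it could compensate for missing entries.

```python
# Hypothetical Portuguese-to-English dictionary with deliberately limited coverage.
PT_EN = {
    "livro": ["book"],
    "biblioteca": ["library"],
}


def split_by_coverage(terms, dictionary):
    """Separate terms that have a dictionary entry from those that do not."""
    translated = {t: dictionary[t] for t in terms if t in dictionary}
    untranslated = [t for t in terms if t not in dictionary]
    return translated, untranslated


def coverage(terms, dictionary):
    """Fraction of terms that could be translated (0.0 for an empty list)."""
    if not terms:
        return 0.0
    translated, _ = split_by_coverage(terms, dictionary)
    return len(translated) / len(terms)


doc_terms = ["livro", "saudade"]
print(split_by_coverage(doc_terms, PT_EN))
print(coverage(doc_terms, PT_EN))
```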

Language and images will also play a role in how the system works in the long run. When images are uploaded with information in a language other than the searcher’s own, how would the searcher find them without being able to read that language? Beyond the language used, there are “the difficulties faced by searchers during the image retrieval process in a multilingual context – that is, when the language of the query differs from the indexing language.” (Menard and Smithglass, 2013, p.99) How can users find an image when all they know is what is happening in it, or which colors it contains? A single image file can be described in many ways. “The main focus of image retrieval research has been on how people search for and describe images. However, despite widespread studies on user behaviour or performance when searching for images, little is known about the characteristics and functionalities of existing search interfaces and similar tools available for image retrieval.” (Menard and Smithglass, 2013, p.100)

How can controlled vocabulary be used for image files, or would images need more than one system, given that they carry their own metadata embedded in the image itself? Ménard tested how controlled and uncontrolled vocabularies perform when indexing images for a multilingual system, stating that “analysis of indexing terms shows that controlled and uncontrolled vocabulary indexing differ from one another at the terminological, perceptual and structural levels.” (Ménard, 2009, p. 73) Among the key takeaways from the article: controlled vocabularies use more terms to describe an image than uncontrolled vocabularies do, yet controlled vocabulary does little to describe the image itself, whereas uncontrolled vocabulary can make far greater use of physical, functional, identifier and other perceptual attributes, as well as structural relationships (Ménard, 2009, p. 74).

Argument III: User Behavior and Multilingual Information Retrieval

Those who would use this system have spent years running queries on monolingual information retrieval systems, so how would they use a multilingual one where they may not know the language they are searching in? “In monolingual IR, users do not always accurately specify their information needs, often using short and ambiguous queries. This problem is exacerbated in CLIR, where users may be able to read information in foreign languages but have limited capacity to formulate suitable queries in that language. It is inevitable that access to the relevant cross-language information is much more difficult than in a monolingual setting.” (Zhou et al., 2016, p. 449) When the first multilingual search platform went live, it was through Google’s web service, which let the user conduct a search in a different language: “The launch of the cross-language search by Google was a breakthrough because it signified the transition from cross-language information retrieval (CLIR) research to a real application” (Chen & Bao, 2009, p.2). Today Google has pages for most widely spoken languages. This was the start, but how does user behavior play into these functions, and how does it change how a multilingual system would work?

There is also the question of why these systems have never evolved into a true multilingual information retrieval system, and how such a program would integrate into a library information retrieval system. “Libraries have always been based on values that extend to today’s digital and hybrid instances. Libraries must continue to nurture and take care of these values today; while it is easy to say that Google or Bing will do it, it is important to understand and realize that these commercial entities do not really care much about the fundamental human things that libraries have cared about like access, cooperation, learning, intellectual freedom, fairness, quality, communication and stewardship.” (Marchionini, 2014, pp.143-144) Websites that serve as information hubs, such as Google, earn advertising revenue from searches and clicks, so it would run against their business model to build a system that eliminates the need for those clicks. When a user needs to find information for school or work, every additional search click benefits Google.

This is where libraries come in. User behavior in libraries is different: at home, users would normally turn to Google or Bing, while in the library they have help from a librarian or a library information system. “Thus libraries are serving both local patrons, the people who always come to the library but also serving people anywhere in the world who may be coming in through the internet. Libraries are extending their reach. They participate more in knowledge creation, publishing activities and many traditional libraries now are truly hybrid libraries, and they have almost as much digital resources as they have physical.” (Marchionini, 2014, p. 145) Libraries have the unique ability to find information that meets a patron’s needs, whether it is a physical item or a digital one. The caveat is that library search is typically monolingual, whereas Google can search across languages.

In a study of what users experienced with the Saudi Digital Library (SDL), a multilingual information retrieval system, one participant stated that “If I were to suggest [anything], I think one of the important things is to know what the users’ needs are to design search systems.” (Alsalmi, 2019, p. 95) The study further noted that “receiving wrong suggestions, incorrect translations and unrelated results point to a need to reevaluate the CLIR in the SDL to provide accurate results that satisfy its users’ needs.” (Alsalmi, 2019, p. 97) It also shows the difficulties that multilingual digital libraries face: “With the added variable of multiple languages, multilingual digital libraries face challenges that other libraries may not. Multilingual digital libraries might lack the resources to accommodate every language spoken by users. Users might even have difficulty accessing resources in the library’s database for different reasons. First, problems might arise with the multilingual search function itself. Second, characteristics of different languages may limit their search results.” (Alsalmi, 2019, p. 87)

What are the Implications of Multilingual Information Access

One of the first questions raised by a multilingual system is what its implications would be for a user. But also, what would it mean to create such a system in the age of artificial intelligence and Google? Those who would benefit most are students and researchers, as it would open up new avenues of information for their work. “Developing an open access, multi-institutional, multilingual, international digital library requires robust technological and institutional infrastructures that support the needs of individual institutions alongside the collaborative and ensure continuous communication and development of the shared vision for the digital library as a whole.” (Wooldridge et al., 2009, p. 43)

Digital libraries are an important factor in developing systems that support multiple languages. Applications such as Google, which can already do this, have the information needed to understand their users, as “User profiles are typically learned from a user’s usage information.” (Zhou et al., 2016, p. 454) But what would it mean for those whose languages are written differently? “People from different languages and cultures read in different ways. For instance, Semitic cultures read from right-to-left, while most Western cultures read from left-to-right and Pacific-Oceanic cultures read vertically from top-to-bottom in columnar format.” (Miraz et al., 2016, p. 433) For applications such as search engines, the infrastructure to support various languages exists to an extent; it is applying these information retrieval systems to libraries that would change how documents can be searched.
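One simple way to picture a profile learned from usage information is a count of the terms a user has searched for before, which can then be used to reorder results. The Python sketch below assumes exactly that; the function names build_profile and rerank are hypothetical, and this is not the profile representation studied by Zhou et al.

```python
from collections import Counter


def build_profile(past_queries):
    """Learn a term-frequency profile from a user's earlier queries."""
    profile = Counter()
    for query in past_queries:
        profile.update(query.lower().split())
    return profile


def rerank(results, profile):
    """Order result titles by how strongly they overlap with the profile."""
    def score(title):
        # Counter returns 0 for terms the user has never searched for.
        return sum(profile[word] for word in title.lower().split())
    return sorted(results, key=score, reverse=True)


profile = build_profile(["jaguar habitat", "jaguar diet"])
results = ["Jaguar cars history", "Jaguar habitat and diet"]
print(rerank(results, profile))
```

In this toy run, the user's history of animal-related queries pushes the wildlife result ahead of the car result, even though both match the word "jaguar".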

As for image search, it would come down to how language is used in association with the images. In libraries that use controlled vocabularies, users would be given the following options:

(1) a simple, natural language keyword search; or

(2) a more structured “advanced” search, offering multiple predefined categories in drop-down menus to use with search terms, and Boolean options for further limiting results. (Menard and Smithglass, 2013, p.105)
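The two search options listed above can be sketched in Python as follows. The image records, categories, and function names are invented for illustration and do not describe any interface reviewed by Menard and Smithglass; the sketch only contrasts a loose keyword match with a category filter combined with Boolean AND and NOT conditions.

```python
from dataclasses import dataclass


@dataclass
class ImageRecord:
    title: str
    category: str
    keywords: frozenset


# Hypothetical image records for illustration.
RECORDS = [
    ImageRecord("Harbour at dawn", "landscape", frozenset({"harbour", "boats", "sunrise"})),
    ImageRecord("Fishing boats", "maritime", frozenset({"boats", "nets"})),
    ImageRecord("Cathedral ruins", "architecture", frozenset({"cathedral", "ruins"})),
]


def simple_search(query, records):
    """Option 1: match any word of the query against the indexed keywords."""
    words = set(query.lower().split())
    return [r.title for r in records if words & r.keywords]


def advanced_search(records, category=None, all_of=(), none_of=()):
    """Option 2: predefined category plus Boolean AND / NOT on keywords."""
    hits = []
    for r in records:
        if category is not None and r.category != category:
            continue  # category chosen from a drop-down does not match
        if not set(all_of) <= r.keywords:
            continue  # AND: every required keyword must be present
        if set(none_of) & r.keywords:
            continue  # NOT: no excluded keyword may be present
        hits.append(r.title)
    return hits


print(simple_search("boats", RECORDS))
print(advanced_search(RECORDS, all_of=["boats"], none_of=["nets"]))
```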

Image search is still a developing area of study, but it can help libraries associated with archives and museums catalogue images for easier access to the materials and their metadata. Of a study in which a cross-language image retrieval system was used with St. Andrews University Library’s image archive, it was reported that “Their key findings were that the overall performance of the cross-language system was relatively close to the monolingual system and the image categories assisted users in cross-language search. Users tended to browse through pages of image results rather than viewing the image captions; concept hierarchy was frequently used and bilingual searching was preferable.” (Yang & Lam, 2006, p. 631) Browsing is a simpler way to find images in a multilingual system, as users find it easier to scan items that match their search while looking for what they need.

Unsolved Issues in Multilingual Information Retrieval

One unsolved issue in creating a multilingual information retrieval system is how the system would adapt to differences between languages, because “A navigation bar may be totally unsuitable on the right for one culture but may be perfectly normal for another.” (Miraz et al., 2016, p. 433) Differences in script would also make a search bar harder to use for some readers, since some languages, “such as Arabic, Chinese and Korean, are difficult to read at font sizes that are perfectly legible for European languages like English, French and Russian.” (Miraz et al., 2016, p. 435) Such a system would need to handle characters from many different writing systems, which calls for a carefully designed user interface able to detect the language being used. In a library setting, this would create a centralized way for users to access information worldwide without having to worry about where it comes from. It would especially aid students in their research, letting them use resources from around the world rather than only what their schools have on hand.
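Detecting what a user has typed can start at the level of script and reading direction. The Python sketch below uses the standard unicodedata module; taking the first word of each character's Unicode name as its script is a rough heuristic assumed here for illustration, not a full language detector, and the function names are hypothetical.

```python
import unicodedata


def dominant_script(text):
    """Guess the main script of a string from its Unicode character names."""
    counts = {}
    for ch in text:
        if ch.isalpha():
            # Heuristic: names look like "LATIN SMALL LETTER A" or
            # "ARABIC LETTER MEEM", so the first word hints at the script.
            script = unicodedata.name(ch, "UNKNOWN").split()[0]
            counts[script] = counts.get(script, 0) + 1
    return max(counts, key=counts.get) if counts else None


def text_direction(text):
    """Return 'rtl' or 'ltr' based on the first strongly directional character."""
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("R", "AL"):  # right-to-left letters (e.g. Hebrew, Arabic)
            return "rtl"
        if bidi == "L":  # left-to-right letters
            return "ltr"
    return "ltr"  # default when no directional character is found


for sample in ["library", "مكتبة", "도서관", "图书馆"]:
    print(sample, dominant_script(sample), text_direction(sample))
```

An interface could use results like these to choose a font size, align the search bar, or mirror the navigation layout, which are the design concerns raised in the passage above.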

When it comes to libraries and their growing digital collections, controlled vocabulary plays a major role in what a user can find: “In fact, real-world running applications in domain-specific sectors, even those that contain collections in multiple languages, are reluctant to adopt any serious cross-language functionality. Most of those that do adopt strategies that use some kind of controlled vocabulary.” (Gey et al., 2005, p. 426) Uncontrolled vocabulary, including social tagging (also known as a folksonomy), is still rare in library systems: “Social cataloging site records are created and maintained by users of the site rather than professional catalogers.” (Oudenaar & Bullard, 2023, p. 203) These digital spaces are created and maintained by those who have joined them, whereas physical libraries have systems already in place. If you ask a classroom of students to search for something in their native language and then to search for the same thing in another language, some of the students will respond that it is difficult to find material in the other language that is readily available in translation. The search succeeds only if the student knows the second language comparably well, as sites outside their native language may not offer a translation.

Conclusion

As libraries digitize much of their catalogues and archival materials become readily available, will those who don’t speak the language be able to access the materials within these new digital libraries? “With the growing focus in many countries on constructing digital libraries, more and more of the world’s cultural heritage is being preserved in digital form. Most digitized material related to cultural heritage is presented in its original language, which might be unknown to interested users.” (Wang et al., 2004, p. 247) The need for a comprehensive multilingual system becomes more apparent as researchers search for sources outside the languages they know. The next step toward a multilingual information retrieval system would be to start small: by working with one or two languages in the region testing it, it would be possible to identify ways to create a digital library, or a library IR system, that retrieves and translates materials. As we’ve seen, the path to a functioning system is not easy, and creating and maintaining one takes a great deal of dedication, but as the internet changes and information retrieval systems advance, it may become possible in the future.

References

Alsalmi, H.M. (2021), "Information-seeking in multilingual digital libraries: Comparative case studies of five university students", Library Hi Tech, Vol. 39 No. 1, pp. 80-100. https://doi-org.ezproxy.lib.uwm.edu/10.1108/LHT-06-2019-0119

Blanco, R., & Lioma, C. (2009). Mixed monolingual homepage finding in 34 languages: the role of language script and search domain. Information Retrieval Journal, 12(3), 324–351. https://doi-org.ezproxy.lib.uwm.edu/10.1007/s10791-008-9082-8

Chowdhury, G. G. (2010). Introduction to modern information retrieval (3rd ed.). Neal-Schuman Publishers.

Galuščáková, P., Oard, D. W., & Nair, S. (2022). Cross-language Information Retrieval. ArXiv.Org. https://doi.org/10.48550/arxiv.2111.05988

Gey, F. C., Kando, N., & Peters, C. (2005). Cross-Language Information Retrieval: the way ahead. Information Processing & Management, 41(3), 415–431. https://doi-org.ezproxy.lib.uwm.edu/10.1016/j.ipm.2004.06.006

Chen, J., & Bao, Y. (2009). Cross-language search: The case of Google Language Tools. First Monday, 14(3), 3. https://doi-org.ezproxy.lib.uwm.edu/10.5210/fm.v14i3.2335

Wang, J.-H., Lu, W.-H., & Chien, L.-F. (2004). Toward Web mining of cross-language query translations in digital libraries. International Journal on Digital Libraries, 4(4), 247–257. https://doi-org.ezproxy.lib.uwm.edu/10.1007/s00799-004-0091-y

Marchionini, G. (2014). Libraries of People. Information Studies, 20(3), 143–194.

Ménard, E. (2009). Images: indexing for accessibility in a multi-lingual environment – challenges and perspectives. Indexer, 27(2), 70–76. https://doi-org.ezproxy.lib.uwm.edu/10.3828/indexer.2009.23

Menard, E. and Smithglass, M. (2014), "Digital image access: an exploration of the best practices of online resources", Library Hi Tech, Vol. 32 No. 1, pp. 98-119. https://doi-org.ezproxy.lib.uwm.edu/10.1108/LHT-05-2013-0064

Miraz, M., Excell, P., & Ali, M. (2016). User interface (UI) design issues for multilingual users: a case study. Universal Access in the Information Society, 15(3), 431–444. https://doi-org.ezproxy.lib.uwm.edu/10.1007/s10209-014-0397-5

Oudenaar, H., & Bullard, J. (2023). NOT A BOOK: Goodreads and the Risks of Social Cataloging with Insufficient Direction. Cataloging & Classification Quarterly, 61(2), 203–227. https://doi-org.ezproxy.lib.uwm.edu/10.1080/01639374.2023.2207189

Rahimi, R., Shakery, A., & King, I. (2015). Multilingual information retrieval in the language modeling framework. Information Retrieval Journal, 18, 246–281. https://doi.org/10.1007/s10791-015-9255-1

Stiller, J., Gäde, M., & Petras, V. (2013). Multilingual Access to Digital Libraries: The Europeana Use Case / Mehrsprachiger Zugang zu Digitalen Bibliotheken: Europeana / Accès multilingue aux bibliothèques numériques: Le cas d’Europeana. Information - Wissenschaft & Praxis, 64(2-3), 86-95. https://doi.org/10.1515/iwp-2013-0014

Suzuki, M., Inoue, N., & Hashimoto, K. (2001). A Method for Supporting Document Selection in Cross-language Information Retrieval and its Evaluation. Computers & the Humanities, 35(4), 421–438. https://doi-org.ezproxy.lib.uwm.edu/10.1023/A:1011877503081

Wooldridge, B., Taylor, L., & Sullivan, M. (2009). Managing an Open Access, Multi-Institutional, International Digital Library: The Digital Library of the Caribbean. Resource Sharing & Information Networks, 20(1/2), 35–44. https://doi-org.ezproxy.lib.uwm.edu/10.1080/07377790903014534

Wu, D., He, D. and Xu, X. (2012), "A study of relevance feedback techniques in interactive multilingual information access", Library Hi Tech, Vol. 30 No. 3, pp. 523-544. https://doi-org.ezproxy.lib.uwm.edu/10.1108/07378831211266645

Yang, C. C., & Lam, W. (2006). Introduction to the special topic section on multilingual information systems. Journal of the American Society for Information Science & Technology, 57(5), 629–631. https://doi-org.ezproxy.lib.uwm.edu/10.1002/asi.20325

Yang, C. C., Wei, C.-P., & Li, K. W. (2008). Cross-lingual thesaurus for multilingual knowledge management. Decision Support Systems, 45(3), 596–605. https://doi-org.ezproxy.lib.uwm.edu/10.1016/j.dss.2007.07.005

Zhou, D., Lawless, S., Wu, X., Zhao, W. and Liu, J. (2016), "A study of user profile representation for personalized cross-language information retrieval", Aslib Journal of Information Management, Vol. 68 No. 4, pp. 448-477. https://doi-org.ezproxy.lib.uwm.edu/10.1108/AJIM-06-2015-0091

Stephanie Trinidad
