Retrieval Augmented Generation (RAG) using MediaWiki

Enhance your LLMs with MediaWiki and RAG integration, optimizing data retrieval for accurate and contextually rich responses.


When diving into the world of data and content management, one may encounter the term "RAG," which stands for Retrieval-Augmented Generation. Before delving into the exciting possibilities of integrating MediaWiki with RAG, it's essential to understand what RAG and LLMs are. If you have not read our blog on LLMs, we recommend starting there.

What Is Retrieval-Augmented Generation (RAG)?

Retrieval-augmented generation (RAG) is a technique for optimizing the output of large language models (LLMs) by referencing authoritative external knowledge bases. LLMs are trained on vast volumes of data and use billions of parameters to generate original output for tasks such as answering questions, translating languages, and completing sentences. However, RAG fills a gap in how these neural networks operate. While LLMs can generate coherent and contextually accurate responses based on their training data, they may lack access to the most up-to-date or domain-specific information.

By integrating RAG, LLMs fetch relevant facts from external sources before generating responses. This process enhances the accuracy and reliability of generative AI models without the need for retraining. As a cost-effective approach, RAG extends the powerful capabilities of LLMs, ensuring their outputs remain relevant, accurate, and useful in various contexts, including specific domains or an organization's internal knowledge base.

Integrating RAG with MediaWiki, a robust platform for collaborative content creation and management, can revolutionize how information is accessed, updated, and utilized. This blog will explore the benefits and implementation strategies for combining these two technologies, shedding light on the potential for creating a more dynamic and intelligent content ecosystem.

How does Retrieval-Augmented Generation work?

Without RAG: The LLM takes the user input and generates a response based on its training data—essentially, what it already knows.

With RAG: An information retrieval component is introduced, utilizing the user input to pull relevant information from new data sources. The user query and the relevant information are both fed to the LLM, enabling it to generate more accurate and informative responses by combining the new knowledge with its existing training data.

Create External Data

The new data outside of the LLM's original training dataset is called external data. This data can come from various sources, such as APIs, databases, or document repositories, and may exist in formats like files, database records, or long-form text. Another AI technique, embedding language models, converts this data into numerical representations and stores it in a vector database, creating a knowledge library that generative AI models can understand.
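To make the flow above concrete, here is a minimal sketch of building such a knowledge library. A real system would call an embedding model (such as OpenAI Embeddings); here a simple bag-of-words vector over a tiny, made-up vocabulary stands in, and a plain dictionary plays the role of the vector database. All names and documents are illustrative.

```python
# Toy sketch: converting external documents into vectors for a knowledge library.
# A bag-of-words count over a fixed vocabulary stands in for a real embedding model.
from collections import Counter

VOCAB = ["leave", "policy", "annual", "salary", "holiday", "records"]

def embed(text: str) -> list[float]:
    """Map text to a fixed-length vector (word counts over a tiny vocabulary)."""
    counts = Counter(text.lower().split())
    return [float(counts[w]) for w in VOCAB]

# The "vector database": document id -> (vector, original text)
vector_db = {
    "doc1": (embed("annual leave policy for all staff"),
             "annual leave policy for all staff"),
    "doc2": (embed("salary payment records"),
             "salary payment records"),
}
```

In production this dictionary would be replaced by a dedicated vector store such as FAISS or Chroma DB, and the embeddings would carry real semantic meaning rather than word counts.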

Retrieve Relevant Information

The next step is performing a relevancy search. The user query is converted into a vector representation and matched against the vector database. For instance, consider a smart chatbot designed to answer human resource questions for an organization. If an employee asks, "How much annual leave do I have?" the system will retrieve relevant documents like the annual leave policy and the employee's past leave records. These documents are returned because they are highly relevant to the query, with relevancy determined using mathematical vector calculations.
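The relevancy search described above can be sketched in a few lines: the query is embedded the same way as the documents, then ranked against them by cosine similarity. The `embed()` function and the two documents are illustrative stand-ins, not a real HR system.

```python
# Toy relevancy search: embed the query, rank stored documents by cosine similarity.
import math
from collections import Counter

VOCAB = ["leave", "policy", "annual", "salary", "holiday", "records"]

def embed(text):
    counts = Counter(text.lower().split())
    return [float(counts[w]) for w in VOCAB]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

docs = [
    "annual leave policy for all staff",
    "salary payment records for employees",
]

def retrieve(query, k=1):
    """Return the k documents most relevant to the query."""
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]
```

Asking `retrieve("How much annual leave do I have?")` surfaces the leave-policy document, because its vector shares the "annual" and "leave" dimensions with the query.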

Augment the LLM Prompt

The RAG model then augments the user input (or prompt) by adding the relevant retrieved data in context. This step employs prompt engineering techniques to communicate effectively with the LLM. The augmented prompt allows the LLM to generate an accurate and contextually rich answer to the user's query.
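A minimal sketch of this augmentation step might look like the following. The template wording is an illustrative choice, not a fixed LangChain or OpenAI format; the point is simply that retrieved chunks are placed into the prompt as context ahead of the user's question.

```python
# Sketch of prompt augmentation: prepend retrieved context to the user's question.
def augment_prompt(question, retrieved_chunks):
    context = "\n\n".join(f"- {chunk}" for chunk in retrieved_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = augment_prompt(
    "How much annual leave do I have?",
    ["Employees accrue 2 days of annual leave per month.",
     "Your leave record shows 10 days taken this year."],
)
```

The resulting string would then be sent to the LLM, which grounds its answer in the supplied context rather than relying solely on its training data.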

Update External Data

Maintaining current information for retrieval is crucial. To prevent external data from becoming outdated, documents and their embedding representations should be updated regularly. This can be achieved through automated real-time processes or periodic batch processing, addressing a common challenge in data analytics with various data-science approaches to change management.
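One simple way to implement this refresh, sketched here with content hashes, is to re-embed a document only when its text has actually changed since the last run. The `embed()` function is a placeholder for a real embedding call; skipping unchanged documents avoids paying for embeddings that would come out identical.

```python
# Sketch: re-embed a document only when its content hash has changed.
import hashlib

def embed(text):
    return [float(len(text))]  # placeholder for a real embedding call

store = {}  # doc_id -> {"hash": ..., "vector": ...}

def refresh(doc_id, text):
    """Re-embed `text` only if it changed; return True when an update happened."""
    digest = hashlib.sha256(text.encode()).hexdigest()
    entry = store.get(doc_id)
    if entry and entry["hash"] == digest:
        return False  # unchanged: skip the (potentially costly) embedding call
    store[doc_id] = {"hash": digest, "vector": embed(text)}
    return True
```

The same check works whether it runs in a real-time hook on page edits or in a periodic batch job over the whole wiki.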

By integrating these steps, RAG significantly improves the accuracy and reliability of responses generated by LLMs, making them more relevant and useful in a wide range of applications.

RAG Workflow

How Are RAGs Used?

Retrieval-augmented generation allows users to interact with data repositories in a conversational manner, creating new and dynamic experiences. The potential applications for RAG are vast, limited only by the datasets available.

For instance, a generative AI model enhanced with a medical index can serve as an invaluable assistant to doctors or nurses. Similarly, financial analysts can gain significant insights from an assistant connected to market data.

Virtually any business can transform its technical or policy manuals, videos, or logs into knowledge bases that enhance LLMs. These enriched sources can be utilized for various purposes such as customer or field support, employee training, and improving developer productivity.

This wide-ranging potential is why businesses of all sizes, from small enterprises to large corporations, are adopting RAG technology. It provides a cost-effective method for rapid access to extensive data, thereby boosting efficiency and productivity across multiple domains.

MediaWiki and RAG: A Powerful Combination

Integrating Retrieval-Augmented Generation with MediaWiki can revolutionize how we access and utilize vast repositories of knowledge. MediaWiki, widely recognized for its robust content management capabilities, serves as an excellent platform for creating RAGs. Given its extensive and well-organized database of information, MediaWiki can significantly enhance the efficiency of data retrieval. By leveraging RAG, users can seamlessly interact with the knowledge stored in MediaWiki, effortlessly pulling up relevant data and being directed to the precise pages containing the information they seek. This integration not only streamlines the process of finding accurate and up-to-date content but also ensures that users can maximize the potential of the comprehensive knowledge base that MediaWiki offers.

Using MediaWiki for Training LLMs and Creating RAG Systems

MediaWiki is an exceptional tool for training large language models (LLMs) and creating Retrieval-Augmented Generation (RAG) systems. Its flexibility allows you to load data from diverse sources, including PDFs, documents, images, and more, into a centralized repository. Once the data is in MediaWiki, it can be collaboratively sorted, enhanced, and edited, ensuring a comprehensive and well-organized knowledge base. This curated data can then be exported to train an LLM, providing it with a rich and diverse training dataset. Alternatively, the processed information can be converted into a vector database that a RAG system can access, enabling precise and efficient retrieval of relevant information. By leveraging MediaWiki's robust content management capabilities, you can streamline the process of data preparation for advanced AI applications, enhancing both the quality and accessibility of the information.

Workflow for creating a RAG using MediaWiki

Integrating a Retrieval-Augmented Generation (RAG) system into MediaWiki involves several key steps. Here's a simplified workflow to guide you through the process:

1) Generate an XML Dump:

  • Start by generating an XML dump of your MediaWiki content using the dumpBackup.php maintenance script. This dump will contain the articles, their revision history, and other page data from your MediaWiki instance.

2) Load the XML Dump:

  • Use a customized loader to import the XML dump into a processing environment. This loader should be capable of handling extra metadata, such as article URLs, to enhance the data's utility. A useful tool for this is a customized version of a MediaWiki XML dump loader that uses mwparserfromhell to parse the content effectively. Frameworks like LangChain provide such tools.

3) Split the Text:

  • Once the XML content is loaded, use a text-splitting tool, such as LangChain's text splitters, to break the articles into manageable chunks. This step is crucial for making the data more accessible and efficient to process.

4) Generate Embeddings:

  • Convert these text chunks into numerical representations, or embeddings, using a model like OpenAI Embeddings. These embeddings capture the semantic meaning of the text, making it easier for the AI to understand and retrieve relevant information.

5) Store in a Vector Database:

  • Store the generated embeddings in a vector database, such as FAISS or Chroma DB. These databases will allow quick and efficient retrieval of relevant information when queries are made.

6) Build the RAG System:

  • Integrate the vector database with a retrieval-augmented generation system. When a user makes a query, the system will use the embeddings to find relevant information from the MediaWiki content and generate accurate responses by combining this retrieved data with the model’s pre-trained knowledge.

7) Enhance and Edit the Data:

  • Continuously enhance and edit the MediaWiki data collaboratively. This iterative process ensures that the knowledge base remains up-to-date and comprehensive, providing better responses over time.
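Steps 2 and 3 of the workflow above can be sketched under simplifying assumptions: a tiny, namespace-free stand-in for a MediaWiki XML dump is parsed with the standard library, and each page's text is split into fixed-size, overlapping chunks. A real pipeline would instead run LangChain's loader and text splitters over the actual dumpBackup.php output, which uses the MediaWiki export XML namespace.

```python
# Simplified sketch: parse a (namespace-free) MediaWiki-style XML dump and
# split each page's text into fixed-size, overlapping character chunks.
import xml.etree.ElementTree as ET

DUMP = """<mediawiki>
  <page><title>Annual leave</title>
    <revision><text>Employees accrue two days of annual leave per month worked.</text></revision>
  </page>
</mediawiki>"""

def load_pages(xml_text):
    """Yield (title, body) pairs for every page in the dump."""
    root = ET.fromstring(xml_text)
    for page in root.iter("page"):
        yield page.findtext("title"), page.findtext("./revision/text")

def split_text(text, size=30, overlap=10):
    """Fixed-size character chunks with overlap, so context spans chunk edges."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

chunks = [(title, c) for title, body in load_pages(DUMP) for c in split_text(body)]
```

Each `(title, chunk)` pair would then be embedded (step 4) and written to the vector store (step 5), with the title preserved as metadata so responses can link back to the source page.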

By following these steps, you can effectively incorporate a RAG system into MediaWiki, transforming it into a powerful tool for knowledge retrieval and generation. This workflow allows you to leverage the vast repository of information within MediaWiki, making it easier to access and use in various applications.

Custom MediaWiki RAG and LLM

If you're interested in building a custom LLM or RAG using MediaWiki, we'd love to help! Schedule a free, no-obligation consultation with us, and let's discuss how we can create a tailored solution that meets your unique business needs.

Reach out today to start transforming your knowledge base into a powerful, customized tool.
