The Potential of RAG in DX for the Image Sensor Manufacturing Industry and the Challenge of Unstructured Data
1. Introduction
1.1 Background: Manufacturing DX and the Use of In-House Technical Knowledge
In manufacturing—including the image sensor manufacturing industry—AI-driven digital transformation (DX) is becoming the core of competitiveness. Underlying this is the sheer scale of the economic impact that generative AI brings. According to an industry report published by McKinsey & Company in 2023, across the 63 use cases analyzed, generative AI could add the equivalent of $2.6–4.4 trillion in value annually—a scale comparable to the United Kingdom’s 2021 GDP. The same report also argues that generative AI has the theoretical potential to automate 60–70% of the working hours that employees currently spend [1].
This potential of generative AI is being reflected directly in manufacturers’ management decisions. In an industry survey on smart manufacturing published by Deloitte in 2025, 92% of manufacturing executives answered that smart manufacturing would be the leading factor determining competitiveness over the next three years—a six-point increase from 2019. The same survey further reports that, as a net effect following the adoption of smart manufacturing, production volume rose by 10–20% and employee productivity by 7–20%, and that 78% of responding companies allocate more than 20% of their improvement budget to smart manufacturing [2]. These figures are based on an industry survey and include self-reported responses, a point that warrants caution, but they nonetheless indicate a trend of manufacturers expanding their investment.
This current is also surfacing as concrete corporate strategy. In March 2026, Samsung Electronics announced a strategy to transition all of its manufacturing sites into “AI-Driven Factories” by 2030. The company states that it will integrate AI across the entire manufacturing value chain—from procurement logistics through production, quality inspection, and final shipping—and progressively introduce digital-twin-based simulation, AI agents specialized for quality control, production, and logistics, and Agentic AI that autonomously plans, executes, and optimizes [3]. This is an official corporate announcement and may include forward-looking visions and information favorable to the company, a point to keep in mind; nonetheless, it is a concrete example of a major manufacturer accelerating AI-driven DX as a core management strategy. In this context, there is growing demand to make the vast technical knowledge accumulated in-house referenceable by generative AI and to put it to use in operations.
1.2 The Challenge: The Unstructured Nature of In-House Technical Documents
When attempting to leverage in-house technical knowledge with generative AI, the first thing one encounters is the fact that much of an organization’s technical documentation is unstructured data. In general, the technical materials generated and accumulated daily on the manufacturing floor exist in a diverse range of formats: report slides in PowerPoint, measurement data and management tables in Excel, images such as JPEGs, and even binary data such as the RAW images peculiar to the image sensor manufacturing industry. In these materials, information is often embedded within visual structures such as figures, graphs, tables, and layout, and simply extracting them as plain text loses the greater part of their meaning. For example, a line graph showing how some characteristic changes over time, or a table comparing multiple conditions, itself carries a logical assertion, but that assertion cannot be reconstructed by extracting the textual information alone.
In the image sensor manufacturing industry, binary data such as RAW images can also be an important knowledge source. A RAW image is the unprocessed raw data captured by the sensor; unlike an image file that humans view directly, it is not in a form a language model can interpret as-is. Thus, to use in-house technical documents as a knowledge source for generative AI, one runs into the preprocessing wall of “structuring”—converting them into a machine-readable structure.
The problem statement that this paper takes as its starting point is based on the author’s own experience confronting it on the floor of the image sensor manufacturing industry, and is not necessarily presented as a general claim backed by statistics. However, as discussed below, the difficulty of document question answering—which cannot be solved without interpreting figures, tables, and layout—and the context-length wall of retrieval-augmented generation handling multimodal documents are challenges that have been repeatedly reported in academic research as well. In Chapter 4, this paper addresses this structuring problem head-on from a technical standpoint.
1.3 Objectives and Structure of This Paper
In light of the above, this paper aims to discuss “to what extent Retrieval-Augmented Generation (hereafter RAG) can respond to the manufacturing floor’s demand to leverage in-house unstructured technical data.”
The structure of this paper is as follows. Chapter 2 reviews the definition and basic architecture of RAG, as well as its advantages over a standalone LLM, confirming the fundamentals of RAG. Chapter 3 surveys the history of RAG’s evolution—from Naive RAG to Advanced RAG, Modular RAG, GraphRAG, and Agentic RAG—and the progress of the elemental technologies that support it. Chapter 4, the core of this paper, delves into the family of technologies that convert unstructured data into a machine-readable structure, covering layout analysis, table extraction, chart/figure understanding, OCR, and document RAG, as well as the handling of RAW images peculiar to the image sensor manufacturing industry. Chapter 5 describes the technical foundations for incorporating RAG as a tool for AI agents, and Chapter 6 examines the feasibility, limitations, and remaining challenges of industrial application together with case studies from other companies. Chapter 7 presents the conclusion and an outlook.
2. Overview of RAG
2.1 Definition and Basic Architecture of RAG
RAG is a framework for knowledge-intensive tasks that combines and coordinates two components: a retriever, which searches an external knowledge source, and a generator, which produces an answer based on the retrieval results. This framework was first clearly formulated by Lewis et al. in 2020 [4]. Their concern was that a pretrained language model holds world knowledge implicitly within its parameters, which makes it difficult to manipulate and update that knowledge precisely and to present the basis for a judgment, and which leaves the model prone to hallucination (the generation of plausible-sounding misinformation).
To address this challenge, Lewis et al. proposed a configuration in which BART-large (400 million parameters), a pretrained seq2seq model, serves as the generator—that is, the parametric memory—while a dense vector index of Wikipedia serves as a non-parametric memory referenced by the retriever. A distinctive feature is that the retriever and generator are fine-tuned end-to-end. Specifically, they presented two schemes: RAG-Sequence, which conditions on the same retrieved document throughout the entire answer, and RAG-Token, which can select a different document for each generated token. With this configuration, they achieved state-of-the-art accuracy at the time on open-domain question answering (Natural Questions, WebQuestions, CuratedTrec), and on fact verification with FEVER came within 4.3% of state-of-the-art pipeline methods [4].
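The difference between the two schemes lies in where the sum over retrieved documents is taken. The following is a minimal sketch with invented toy probabilities, not the authors' implementation: `doc_probs` stands for the retriever's probability of each document, and `token_probs[z][i]` for the generator's probability of the i-th answer token given document z. Both names and all numbers are assumptions for illustration.

```python
from math import prod


def rag_sequence_prob(doc_probs, token_probs):
    """RAG-Sequence: score the whole answer under each document, then sum."""
    return sum(p_z * prod(tokens) for p_z, tokens in zip(doc_probs, token_probs))


def rag_token_prob(doc_probs, token_probs):
    """RAG-Token: sum over documents separately for every token, then multiply."""
    n_tokens = len(token_probs[0])
    return prod(
        sum(p_z * tokens[i] for p_z, tokens in zip(doc_probs, token_probs))
        for i in range(n_tokens)
    )


# Toy case (invented numbers): document 0 supports the first token,
# document 1 supports the second token.
doc_probs = [0.5, 0.5]
token_probs = [[0.9, 0.1], [0.1, 0.9]]

seq_prob = rag_sequence_prob(doc_probs, token_probs)
tok_prob = rag_token_prob(doc_probs, token_probs)
print(seq_prob, tok_prob)
```

In this toy case the per-token scheme assigns the answer a higher probability, because it can draw on a different document for each token, whereas the per-sequence scheme must explain the whole answer from one document at a time.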
An important technical backdrop for the retriever is Dense Passage Retrieval (DPR), proposed by Karpukhin et al. [5]. In conventional open-domain question answering, sparse-vector lexical-matching retrieval such as TF-IDF or BM25 was the mainstream, but it had the limitation of being unable to capture synonyms and paraphrases well. DPR is a bi-encoder retriever that embeds questions and passages into dense vectors using separate BERT encoders and learns the similarity of their inner products. It trains efficiently using a relatively small number of question–passage pairs together with in-batch negatives, and at inference time performs nearest-neighbor search with FAISS. As a result, it surpassed BM25 by 9–19% in absolute terms on Top-20 retrieval accuracy, and on Top-5 accuracy posted a large margin of 65.2% versus 42.9% [5]. The retriever in RAG is built upon this development of dense retrieval.
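The training signal with in-batch negatives can be sketched as follows. This is a simplified illustration under stated assumptions, not the DPR code: the vectors are hand-written stand-ins for encoder outputs, and the function name `in_batch_negative_loss` is hypothetical. For each question, its paired passage is the positive and every other passage in the batch serves as a negative; the loss is the negative log-likelihood of the positive under a softmax over inner products.

```python
import math


def dot(u, v):
    """Inner product used as the question-passage similarity."""
    return sum(a * b for a, b in zip(u, v))


def in_batch_negative_loss(question_vecs, passage_vecs):
    """Mean negative log-likelihood of each question's paired passage.

    passage_vecs[i] is the positive for question_vecs[i]; all other
    passages in the batch are treated as negatives.
    """
    losses = []
    for i, q in enumerate(question_vecs):
        scores = [dot(q, p) for p in passage_vecs]
        top = max(scores)  # subtract the max for a numerically stable log-sum-exp
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        losses.append(log_norm - scores[i])
    return sum(losses) / len(losses)


# Toy embeddings (invented): aligned pairs versus deliberately swapped pairs.
questions = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = in_batch_negative_loss(questions, [[2.0, 0.0], [0.0, 2.0]])
swapped_loss = in_batch_negative_loss(questions, [[0.0, 2.0], [2.0, 0.0]])
print(aligned_loss, swapped_loss)
```

The loss is small when each question scores highest against its own passage and large when the pairing is wrong, which is the pressure that pulls matching questions and passages together in the embedding space.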
2.2 Advantages over a Standalone LLM
RAG has three principal advantages over a standalone large language model (LLM) that does not reference external knowledge: suppression of hallucination, knowledge updatability, and transparency through source attribution.
To consider the first advantage, suppression of hallucination, one must first clarify what hallucination is. The survey on hallucination in natural language generation by Ji et al. classifies hallucination into intrinsic hallucination, in which the output contradicts the input, and extrinsic hallucination, in which the output cannot be verified from the input, and systematizes its causes, evaluation metrics, and mitigation methods [6]. Huang et al., meanwhile, propose a taxonomy suited to the LLM era, broadly dividing hallucination into factuality hallucination, which concerns discrepancies between the generated content and real-world facts, and faithfulness hallucination, which concerns deviation from user input, context, or self-consistency, and discuss RAG as one mitigation strategy along with its limits [7]. Neither survey is primarily aimed at quantitative metrics, but both show that hallucination is deeply related to a lack of grounding in external knowledge. They thus provide the theoretical basis for the claim that RAG, which grounds answers in retrieved documents, can contribute to suppressing it.
The second advantage, knowledge updatability, is a benefit derived directly from RAG’s architecture. In a standalone LLM, knowledge is fixed within the parameters, so reflecting new knowledge requires retraining. With RAG, by contrast, knowledge can be updated simply by swapping out the non-parametric memory—that is, the external retrieval index—without retraining the generator. Lewis et al. demonstrated that such knowledge updating by swapping is in fact possible, and also showed that RAG generates more factual and diverse output than the BART baseline [4]. As for the third advantage, source attribution, RAG can present retrieved documents as the basis for its answer, increasing the transparency of its judgments. This is a particularly important property in settings such as in-house technical decision-making, where an explicit basis is required.
2.3 The Lineage of RAG: The Development of Retrieval- and Memory-Referencing Language Models
The RAG of Lewis et al. did not appear out of nowhere; it is positioned atop a lineage of prior work seeking to incorporate retrieval or external memory into language models. Each of these studies opened up the design axis of “where and how to incorporate retrieval.”
First, the starting point for the idea of making retrieval itself an object of learning was ORQA (Open-Retrieval Question Answering) by Lee et al. Whereas conventional open-domain question answering relied on candidate evidence returned by a black-box information retrieval (IR) system, ORQA was the first framework to jointly learn the retriever and the reader from only question–answer string pairs, without a separate IR system, treating evidence retrieval from all of Wikipedia as a latent variable. In settings where the user genuinely seeks an unknown answer, it showed that such learned retrieval surpasses conventional BM25 by up to 19 points in exact match, opening the path toward “learning retrieval itself” that leads on to DPR and the RAG of Lewis et al. [8].
A representative of the line that integrates retrieval from the pretraining stage is REALM by Guu et al. REALM incorporates a latent knowledge retriever into language-model pretraining, retrieving and referencing documents from a large corpus such as Wikipedia, and was the first to show a method for training the retriever via retrieval over millions of documents using the unsupervised signal of masked language modeling. As a result, it surpassed prior methods by 4–16% in absolute accuracy on multiple open-domain question answering benchmarks, while also demonstrating advantages in interpretability and modularity [9]. A representative of the line that references external memory at inference time, on the other hand, is kNN-LM by Khandelwal et al. kNN-LM linearly interpolates a pretrained language model’s predictions with the results of a k-nearest-neighbor search keyed on context embeddings, showing that rare factual knowledge can be explicitly memorized and referenced without additional training. On WIKITEXT-103 it improved perplexity (a metric expressing how hard it is for a language model to predict the next word; lower is better) by 2.86 points without additional training, and also showed that a small model referencing a large datastore can surpass a larger model [10]. This property of “updating knowledge without additional training” is a precursor to the knowledge updatability of RAG described in the previous section.
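The linear interpolation used by kNN-LM can be sketched as below. This is an illustrative sketch under assumptions: the token names, distances, and the mixing weight `lam` are invented, and the neighbor weighting by `exp(-distance)` follows the usual description of the method rather than any specific implementation.

```python
import math


def knn_distribution(neighbors):
    """Turn retrieved (distance, next_token) pairs into a distribution.

    Each neighbor is weighted by exp(-distance); weights of neighbors
    that share the same next token are added together.
    """
    weights = {}
    for distance, token in neighbors:
        weights[token] = weights.get(token, 0.0) + math.exp(-distance)
    total = sum(weights.values())
    return {token: w / total for token, w in weights.items()}


def interpolate(p_lm, p_knn, lam):
    """p(y|x) = lam * p_kNN(y|x) + (1 - lam) * p_LM(y|x)."""
    vocab = set(p_lm) | set(p_knn)
    return {t: lam * p_knn.get(t, 0.0) + (1 - lam) * p_lm.get(t, 0.0) for t in vocab}


# Invented example: the base model prefers "CCD", but two of the three
# nearest stored contexts were followed by "CMOS".
p_lm = {"CMOS": 0.2, "CCD": 0.8}
neighbors = [(0.0, "CMOS"), (0.0, "CMOS"), (0.0, "CCD")]
p_knn = knn_distribution(neighbors)
mixed = interpolate(p_lm, p_knn, lam=0.25)  # lam is a hypothetical setting
print(mixed)
```

Because the datastore is consulted only at inference time, adding or replacing its entries changes `p_knn`, and therefore the final prediction, without touching the model's parameters.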
As a method for integrating multiple retrieved documents into the generator, the Fusion-in-Decoder (FiD) of Izacard and Grave is important. FiD processes each retrieved document independently in the encoder together with the question, and the decoder attends over the concatenation of all those representations, thereby integrating the evidence from multiple documents to generate an answer. Because the documents are processed independently on the encoder side, the computational cost is linear in the number of documents, and performance improves as the number of retrieved documents increases (up to 100 documents) [11]. Furthermore, Atlas is a retrieval-augmented language model that combines a dense retriever with an FiD-style generator and uses retrieval in both pretraining and fine-tuning, demonstrating that joint pretraining of each component is decisive for few-shot learning performance. The result that an 11-billion-parameter Atlas achieved 42.4% accuracy on Natural Questions with only 64 examples—surpassing the 50-times-larger, 540-billion-parameter PaLM—demonstrates that retrieval augmentation is effective even in the few-shot setting [12]. In terms of corpus scale, RETRO by Borgeaud et al. retrieves documents similar to preceding chunks from a database on the order of 2 trillion tokens and incorporates them via chunked cross-attention, achieving performance comparable to GPT-3 and Jurassic-1 with 25 times fewer parameters [13]. Each of these prior works is an exploration of how to combine parametric and non-parametric memory, and can be understood as a current converging toward the RAG formulated by Lewis et al.
3. The History of RAG’s Evolution: Key Technologies and Trends
3.1 Three Paradigms: Naive / Advanced / Modular RAG
The RAG of Lewis et al. described in Chapter 2 jointly trained the retriever and generator end-to-end. However, RAG’s subsequent evolution did not necessarily follow this joint-training route. Rather, much of the RAG that is now widely implemented does not retrain a massive language model such as GPT-4 or Claude, but treats it as a frozen “black box,” merely prepending retrieved documents to the input prompt. The work demonstrating the effectiveness of this approach was REPLUG [14]. REPLUG showed that, keeping the language model frozen as a black box and merely prepending retrieved documents to the input, one can improve the language modeling of GPT-3 (175 billion parameters) by 6.3% and Codex’s 5-shot MMLU by 5.1%. It further showed that additional improvements can be obtained by fine-tuning only the retriever, using the language model’s predictions as a supervisory signal. Similarly, In-Context RALM demonstrated that, without modifying the language model’s architecture at all and merely concatenating retrieved documents before the input prefix, a language-modeling improvement equivalent to increasing the model’s parameter count by 2–3 times can be obtained from a combination of an off-the-shelf general-purpose language model and a general-purpose retriever [15]. These works underpin the implementation philosophy of modern RAG—augmenting a massive foundation model with retrieval without retraining it. In the following, we examine how such retrieval-and-generation pipelines premised on a frozen model have been systematically advanced.
The central survey that gives a bird’s-eye view of this evolution is the one by Gao et al. [16]. They surveyed over 100 RAG studies and organized RAG’s development into three paradigms: Naive RAG, Advanced RAG, and Modular RAG.
Naive RAG is a simple framework with the linear “retrieve-then-generate” structure of retrieving documents and generating with them as context. However, this structure is vulnerable to variability in retrieval quality and to the inclusion of irrelevant documents, and the coordination between retrieval and generation is not optimized. Advanced RAG addresses these limitations by optimizing the stages before and after retrieval. For the pre-retrieval stage, for example, it introduces improvements to chunking and indexing, such as sliding windows, fine-grained segmentation, and the addition of metadata. Modular RAG goes further, decomposing the RAG system into independent modules and operators and treating it as a reconfigurable framework. The survey by Gao et al. also cites studies comparing RAG with unsupervised fine-tuning and summarizes them as showing that RAG consistently outperforms fine-tuning for both existing and new knowledge [16].
The concept of Modular RAG was further elaborated in the Modular RAG framework, also by Gao et al. [17]. It decomposes a complex RAG system into independent modules and dedicated operators, yielding a highly flexible framework that can be reconfigured like LEGO blocks. Going beyond the conventional linear architecture, it integrates mechanisms such as routing, scheduling, and fusion, and identifies common RAG patterns such as linear, conditional, branching, and loop.
3.2 Concrete Examples of Advancement: Self-Reflection and Correction
As concrete manifestations of Advanced / Modular RAG, methods emerged in which the system itself controls whether retrieval is needed and the quality of the retrieval results. Naive RAG, regardless of whether retrieval is necessary or whether the retrieved passages are truly relevant, uniformly incorporates a fixed number of documents, raising the problem that irrelevant documents can actually degrade generation quality.
Self-RAG (Self-Reflective RAG) addresses this problem by training a single language model end-to-end to generate special “reflection tokens” [18]. That is, it generates retrieval tokens that judge whether retrieval is needed, and critique tokens that evaluate the relevance, support, and output quality of the retrieved passages, and at inference time it can control its behavior according to task requirements via decoding that scores with a weighted linear sum of these token probabilities. It has been reported that a relatively small Self-RAG of 7B or 13B significantly surpasses ChatGPT and a retrieval-augmented version of Llama2-chat on open-domain question answering, reasoning, and fact verification tasks, greatly improving factuality and citation accuracy in long-form generation.
A method that evaluates and corrects the quality of the retrieval results themselves is Corrective RAG (CRAG) [19]. In CRAG, a lightweight retrieval evaluator assesses the quality of retrieved documents and computes a confidence score, based on which it triggers different knowledge-acquisition actions: {Correct, Incorrect, Ambiguous}. To compensate for the limits of a static corpus, it integrates large-scale web search as an extension, and further uses a decompose-then-recompose algorithm to selectively extract key information from documents and remove irrelevant parts. It is reported that this further improves the performance of standard RAG and the aforementioned Self-RAG.
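The mapping from confidence scores to actions can be sketched as follows. This is a simplified reading, not the CRAG implementation: the rule (any document above an upper threshold, all documents below a lower threshold, otherwise in between), the threshold values, and the source labels returned by `knowledge_sources` are all assumptions for illustration.

```python
def select_action(scores, upper=0.5, lower=-0.5):
    """Map per-document evaluator scores to one of three actions.

    Thresholds are hypothetical; in practice they would be tuned.
    """
    if any(s > upper for s in scores):
        return "Correct"      # at least one document looks reliable
    if all(s < lower for s in scores):
        return "Incorrect"    # nothing retrieved looks usable
    return "Ambiguous"        # the evaluator cannot decide either way


def knowledge_sources(action):
    """Assumed mapping from action to the knowledge passed to the generator."""
    return {
        "Correct": ["refined_retrieved_documents"],
        "Incorrect": ["web_search_results"],
        "Ambiguous": ["refined_retrieved_documents", "web_search_results"],
    }[action]


for scores in ([0.9, -0.8], [-0.9, -0.8], [0.1, -0.8]):
    action = select_action(scores)
    print(scores, action, knowledge_sources(action))
```

The point of the design is that a cheap evaluator decides how much to trust the static corpus before the expensive generator runs, so that poor retrieval results are replaced or supplemented rather than passed through unchanged.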
3.3 GraphRAG
Naive RAG, which relies solely on semantic similarity, also has fundamental limitations: an inability to capture the structural and relational knowledge between entities (neglect of relationships); the phenomenon whereby concatenating many documents into a prompt makes information placed in the middle easy to overlook (the so-called lost-in-the-middle problem); and difficulty in handling holistic questions that require an overview of the entire corpus (Query-Focused Summarization, QFS). To compensate for these limitations, GraphRAG, which handles knowledge in a graph structure, emerged.
Peng et al. authored the first systematic survey on GraphRAG, formalizing its workflow into three stages: graph-based indexing (G-Indexing), graph-guided retrieval (G-Retrieval), and graph-enhanced generation (G-Generation) [20]. This is a survey aimed primarily at classifying and organizing methods rather than at quantitative metrics, but it provides an overall picture of the GraphRAG field. A concrete representative method is the GraphRAG proposed by researchers at Microsoft [21]. To address the problem that ordinary vector-retrieval RAG cannot handle global, sensemaking questions such as “what are the main themes in this dataset,” it uses an LLM to build a graph index in two stages. In the first stage, it derives an entity knowledge graph from the source documents; in the second, it pre-generates a summary for each group of closely related entities (a community). Given a question, it generates a partial response from each community summary (the map step) and integrates these into the final global answer (the reduce step), that is, map-reduce-style answer generation. On global questions over datasets on the order of one million tokens, an evaluation using an LLM as the judge showed that GraphRAG with GPT-4 substantially outperforms conventional vector-retrieval RAG in both the comprehensiveness and diversity of answers [21].
Handling holistic, bird’s-eye questions can be realized even without converting knowledge into a graph structure. RAPTOR, proposed by Sarthi et al., is a method that recursively repeats the operations of embedding, clustering, and summarizing chunks, building bottom-up a tree of summaries at different levels of abstraction as the index [22]. Whereas Naive RAG retrieves only short, contiguous chunks and thus cannot survey the context of the entire document, RAPTOR retrieves from multiple levels of this tree at inference time, allowing it to answer questions that require both detail and the overall picture. On QuALITY, a reading-comprehension benchmark requiring complex multi-step reasoning, combining RAPTOR’s retrieval with GPT-4 is reported to have improved the state of the art by 20% in absolute accuracy [22]. Whereas GraphRAG gains its bird’s-eye view by explicitly representing the relationships between entities as a graph, RAPTOR gains it through a hierarchy of summaries—a contrast that presents two different solutions to questions concerning an entire corpus.
3.4 The Development into Agentic RAG
RAG is evolving from a static, linear pipeline into Agentic RAG, which autonomously iterates retrieval and reasoning. Conventional RAG pipelines lacked adaptability to complex tasks requiring multi-step reasoning or iterative refinement of the response.
Positioned as a precursor to this evolution is the family of active, multi-step retrieval methods that do not end retrieval after a single attempt but actively trigger retrieval during the generation process. Many retrieval-augmented models retrieve only once at the start, based on the input, but this is insufficient for long-form generation and multi-step question answering. FLARE (Forward-Looking Active REtrieval), proposed by Jiang et al., is a method that provisionally generates the next sentence to look ahead at future content and, if that sentence contains low-confidence tokens, uses the provisional sentence as a query to retrieve relevant documents and regenerate the sentence, repeating this process [23]. They report that on four long-form, knowledge-intensive generation tasks, FLARE performs as well as or better than baselines. Similarly, IRCoT by Trivedi et al. interweaves each step of Chain-of-Thought with retrieval, guiding retrieval with reasoning and improving reasoning with retrieval results in an iterative manner [24]. On multi-step question answering, where “what to retrieve” depends on “what has been derived so far,” IRCoT is reported to improve retrieval by up to 21 points and downstream question answering by up to 15 points with GPT-3, while reducing hallucination. These methods, which dynamically judge “when and what to retrieve,” form the bridge to Agentic RAG, in which the model itself controls whether and what to retrieve.
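The control flow of such active retrieval can be sketched as a loop. This is a schematic illustration of the FLARE-style procedure described above, not the authors' code: `generate_sentence` and `retrieve` are hypothetical callables injected by the caller, the confidence threshold is an assumed value, and the toy stand-ins below exist only so that the loop can be run.

```python
def flare_generate(question, generate_sentence, retrieve, threshold=0.4, max_sentences=5):
    """Draft each sentence first; retrieve and regenerate only when unsure.

    generate_sentence(question, so_far, docs) -> (sentence or None, token_probs)
    retrieve(query) -> list of documents
    """
    answer = []
    for _ in range(max_sentences):
        draft, probs = generate_sentence(question, answer, None)
        if draft is None:  # the generator signals that the answer is complete
            break
        if probs and min(probs) < threshold:
            # Low-confidence token found: use the draft itself as the query.
            docs = retrieve(draft)
            draft, probs = generate_sentence(question, answer, docs)
        answer.append(draft)
    return answer


# Toy stand-ins (invented) for a language model and a retriever.
def toy_generate(question, so_far, docs):
    if len(so_far) >= 2:
        return None, []
    if len(so_far) == 0:
        return "The sensor uses a stacked structure.", [0.9, 0.8]
    if docs is None:
        return "Its pixel pitch is unknown.", [0.9, 0.2]
    return "Its pixel pitch is given in " + docs[0] + ".", [0.9, 0.9]


retrieval_log = []


def toy_retrieve(query):
    retrieval_log.append(query)
    return ["doc-A"]


answer = flare_generate("hypothetical question", toy_generate, toy_retrieve)
print(answer, retrieval_log)
```

In this run retrieval is triggered only for the second sentence, which illustrates the contrast with pipelines that retrieve once, up front, regardless of where the model is actually uncertain.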
Singh et al. authored a survey that analytically reviews Agentic RAG, which incorporates autonomous AI agents into the RAG pipeline [25]. There, agent design patterns—reflection, planning, tool use, and multi-agent collaboration—and workflow patterns—prompt chaining, routing, parallelization, orchestrator-worker, and evaluator-optimizer—are organized, and a taxonomy based on the number of agents, control structure, autonomy, and knowledge representation is presented. As a study organizing frameworks that mutually drive retrieval and reasoning, there is the RAG-Reasoning survey [26]. It classifies methods into three types—the direction in which reasoning enhances RAG (Reasoning-Enhanced RAG), the direction in which RAG enhances reasoning (RAG-Enhanced Reasoning), and the direction in which the two iteratively cooperate (Synergized RAG-Reasoning)—depicting a structure of cooperation in which retrieval supports reasoning and reasoning guides the next retrieval. Since neither of these surveys is primarily aimed at quantitative metrics, this paper references them as sources for definitions and classification. Agentic RAG offers the perspective of treating RAG not as a mere retrieval mechanism but as part of an agent, bridging to the technical foundations of agent integration described in Chapter 5.
3.5 The Evolution of Elemental Technologies
The advancement of RAG described thus far is supported by the evolution of the elemental technologies that underpin retrieval: embeddings, approximate nearest neighbor search, reranking, hybrid search, query transformation, and document segmentation (chunking).
First, regarding embeddings, Sentence-BERT established the foundation for representing a sentence as a fixed-length vector. Because conventional BERT has a cross-encoder structure that takes a sentence pair as joint input every time, finding the most similar pair among 10,000 sentences required about 65 hours, making it inapplicable to large-scale retrieval. Sentence-BERT fine-tunes BERT in a siamese structure to produce fixed-length sentence embeddings comparable by cosine similarity, shortening this search to about 5 seconds [27]. Subsequently, general-purpose, high-performance embedding models emerged. E5, trained by weakly-supervised contrastive pretraining on large-scale text pairs curated from semi-structured data, was the first model to surpass BM25 on BEIR in a zero-shot setting without fine-tuning, and was evaluated on 56 datasets including BEIR and MTEB [28]. BGE (C-Pack), which integrates an evaluation benchmark, training data, a model family, and a training recipe as a single package, surpassed the existing models at the time of publication by up to 10% on the Chinese benchmark C-MTEB and also showed top-tier performance in English [29].
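The gap between the two structures follows from simple counting, as the short calculation below shows. It assumes that the cross-encoder must score every unordered sentence pair and that the reported time of about 65 hours is spread evenly over those passes; the per-pair figure is therefore only a rough derived estimate.

```python
from math import comb

n_sentences = 10_000

# Cross-encoder: every unordered sentence pair needs its own forward pass.
cross_encoder_passes = comb(n_sentences, 2)  # 49,995,000 pairs

# Siamese bi-encoder: each sentence is encoded once, then compared by cosine similarity.
bi_encoder_passes = n_sentences

# Assumption: the ~65 hours is divided evenly across the pairwise passes.
ms_per_pair = 65 * 3600 * 1000 / cross_encoder_passes

print(cross_encoder_passes, bi_encoder_passes, round(ms_per_pair, 2))
```

The number of model invocations drops from roughly fifty million to ten thousand, which is why precomputed sentence embeddings make large-scale similarity search practical.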
Even once embeddings convert documents into vectors, it is impractical to exhaustively and exactly search for similar vectors within a document set on the order of millions to hundreds of millions. What makes this feasible at production scale is Approximate Nearest Neighbor (ANN) search, whose representative algorithm is HNSW (Hierarchical Navigable Small World) [30]. HNSW places each element into a multi-layer proximity graph with exponentially decaying probability and begins the search from the upper sparse graph, descending into the lower dense graph, thereby keeping the search complexity logarithmic. This makes it possible to search a large vector set at low latency while maintaining recall close to that of exact search. HNSW is widely adopted in major vector indexes and vector databases, including FAISS, and constitutes the infrastructure layer that physically supports RAG’s retrieval. The cost and scalability of industrial application discussed in Chapter 6 depend heavily on the efficiency of this retrieval infrastructure.
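The “exponentially decaying probability” refers to how the top layer of each inserted element is drawn. The sketch below illustrates only that level-drawing rule, floor(-ln(u) * m_L) with u uniform on (0, 1], and not the graph construction or search. The connectivity parameter `M = 16`, the choice `m_L = 1 / ln(M)`, the sample size, and the random seed are illustrative assumptions.

```python
import math
import random


def draw_level(m_l, rng):
    """Draw the top layer of a new element: floor(-ln(u) * m_l)."""
    u = 1.0 - rng.random()  # shift to (0, 1] so that log(0) cannot occur
    return int(-math.log(u) * m_l)


M = 16                    # assumed connectivity parameter
m_l = 1.0 / math.log(M)   # a commonly used normalization for the level factor
rng = random.Random(0)

levels = [draw_level(m_l, rng) for _ in range(100_000)]
share_layer0_only = levels.count(0) / len(levels)
print(round(share_layer0_only, 3), max(levels))
```

With this normalization the chance of reaching layer l or above is M^(-l), so about 15 of every 16 elements live only in the dense bottom layer while a small fraction form the sparse upper layers used for long-range hops at the start of a search.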
Next is reranking. ColBERT encodes the query and document independently into sets of contextualized embeddings and estimates relevance via the inexpensive late interaction of MaxSim (the sum, over each query term, of its maximum cosine similarity). Because document embeddings can be precomputed offline, it maintains effectiveness competitive with BERT-based methods while accelerating overall retrieval by two orders of magnitude, achieving more than a 170-fold speedup over existing BERT-based methods particularly for reranking [31]. Also, monoT5, which adapts the sequence-transformation model T5 as a reranker by having it generate “true/false” from the query and document and reordering by that probability, is a representative example of generative reranking; on the MS MARCO dev set it reached an MRR@10 of 0.383, surpassing the combination of BM25 and BERT-large [32].
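The MaxSim operation itself is compact enough to write out. The following sketch uses hand-written two-dimensional vectors as stand-ins for contextualized token embeddings; the values and document names are invented, and no claim is made about how ColBERT's own code is organized.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def maxsim_score(query_embs, doc_embs):
    """Sum, over query tokens, of the best-matching document token similarity."""
    return sum(max(cosine(q, d) for d in doc_embs) for q in query_embs)


# Invented token embeddings: doc_a covers both query tokens, doc_b only the first.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
doc_b = [[1.0, 0.0]]

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(score_a, score_b)
```

Since `doc_embs` does not depend on the query, it can be computed and stored offline; only the query encoding and the inexpensive max-and-sum step remain at query time.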
The fusion of retrieval methods, that is, hybrid search, is also an important element. On the sparse-retrieval side, BM25, based on the probabilistic relevance framework, is positioned as the classical canon [33]. This reference presents a theoretical framework for probabilistically estimating relevance; it is not an experimental paper aimed at quantitative metrics, but it provides the theoretical foundation for lexical-matching retrieval. On the dense-retrieval side there is the aforementioned DPR [5], and the method that integrates the two is rank fusion. Reciprocal Rank Fusion (RRF) is a simple method that scores each document by summing, across all systems, the reciprocal 1/(k+r) of its rank r in each system’s ranking; with the constant k fixed at 60, it integrates multiple retrieval results without training examples and was shown to surpass the best individual system by 4–5% on average [34]. Furthermore, as a query-transformation method, HyDE has an instruction-following LLM generate a hypothetical document that answers the query, embeds it with an unsupervised encoder, and retrieves the real documents nearest to that embedding. With this, in a zero-shot setting using no relevance labels at all, it achieved an nDCG@10 of 61.3 on TREC DL19, greatly surpassing BM25 (50.6) and Contriever (44.5) [35].
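Reciprocal Rank Fusion is short enough to state in full. The sketch below fuses a sparse and a dense ranking using the 1/(k+r) score with k = 60; the document identifiers and the two input rankings are invented, and the alphabetical tie-break is an assumption added only to make the output deterministic.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several rankings by summing 1 / (k + rank) for each document."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks start at 1
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by identifier (an assumption).
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Invented rankings from a lexical and a dense retriever.
bm25_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d3", "d1", "d4"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

Only ranks are used, never raw scores, so retrievers whose scores live on incompatible scales (such as BM25 weights and inner products) can be combined without calibration or training data.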
Finally, we take up chunking, the preprocessing that precedes retrieval—the step of dividing a document into small units called chunks. How the granularity and boundaries of chunks are set ripples across all stages of embedding, indexing, and retrieval, greatly affecting retrieval accuracy and answer quality. The most basic are fixed-length splitting, which mechanically divides by token or character count with overlap between adjacent chunks, and recursive splitting to a target size following the hierarchy of delimiters such as paragraphs, sentences, and words; neither has a single canonical paper, and both are widely used as implementation-level baselines. The aforementioned survey by Gao et al. also lists chunking improvements such as sliding windows and fine-grained segmentation as one of the core advancements of Advanced RAG [16].
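Fixed-length splitting with overlap can be sketched in a few lines. This is a baseline illustration only: the function name is hypothetical, integers stand in for tokens, and the chunk size and overlap are arbitrary example values rather than recommended settings.

```python
def fixed_length_chunks(tokens, size, overlap):
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):  # this window already reached the end
            break
    return chunks


tokens = list(range(10))  # stand-in for a tokenized document
chunks = fixed_length_chunks(tokens, size=4, overlap=1)
print(chunks)
```

The overlap means that a statement straddling a boundary appears intact in at least one chunk when it is no longer than the overlap, at the cost of indexing some tokens twice.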
In recent years, peer-reviewed studies targeting the splitting method itself have appeared. Dense X Retrieval by Chen et al. reconsidered the granularity of the retrieval unit [36]. In addition to document, passage, and sentence, they proposed the proposition (text decomposed into the smallest self-contained factual units) as a new retrieval unit, and constructed FactoidWiki by splitting Wikipedia at three granularities. On five open-domain question answering benchmarks, proposition-level retrieval improved the Recall@5 of unsupervised retrievers by 9–12 points on average, and improved downstream question answering accuracy (EM@500) by 2.7–4.1 points [36]. LumberChunker by Duarte et al., on the other hand, proposed dynamic chunking in which, rather than applying fixed rules, an LLM judges the points where the content shifts and divides the text at variable length [37]. Given consecutive groups of passages, the LLM iteratively identifies the semantic breakpoints, so that a long document is divided into semantically coherent units. On the evaluation benchmark GutenQA, it surpassed recursive splitting, the most competitive baseline, by 7.37% in retrieval performance (DCG@20), and when integrated into a RAG pipeline it surpassed competitors such as Gemini 1.5 Pro.
Other work reconsiders the order of splitting and embedding. Late chunking by Günther et al. does not split a document into chunks and embed each one individually; instead, it first embeds the entire token sequence of the document with a long-context embedding model and then pools at the chunk level [38]. This allows each chunk embedding to retain information about its surrounding context, and is reported to improve nDCG@10 by a relative 2.7–3.6% over naive splitting on multiple retrieval benchmarks. However, it should be noted that this work is a preprint from a company (Jina AI) and has not been peer-reviewed.
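The pooling step of late chunking can be sketched as follows. The sketch assumes that a long-context embedding model has already produced one contextualized vector per token for the whole document, and that pooling is a plain mean over each chunk's token span; the toy vectors, the spans, and the function name `late_chunk_embeddings` are hypothetical.

```python
def late_chunk_embeddings(token_embeddings, chunk_spans):
    """Mean-pool document-level token vectors over each (start, end) span."""
    pooled = []
    for start, end in chunk_spans:
        span = token_embeddings[start:end]
        if not span:
            raise ValueError(f"empty chunk span: ({start}, {end})")
        dim = len(span[0])
        # Each token vector was computed with the full document as context,
        # so the chunk average carries information from outside the chunk.
        pooled.append([sum(vec[i] for vec in span) / len(span) for i in range(dim)])
    return pooled


# Toy stand-in for the output of a long-context embedding model (4 tokens, 2 dims).
token_embeddings = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 2.0]]
chunk_spans = [(0, 2), (2, 4)]  # token index ranges, end-exclusive
chunk_vectors = late_chunk_embeddings(token_embeddings, chunk_spans)
print(chunk_vectors)
```

In naive splitting the order is reversed: each chunk is passed to the encoder on its own, so its tokens never attend to text outside the chunk.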
That said, more sophisticated chunking is not always worth the cost. Qu et al. of Vectara, a commercial RAG vendor, systematically examined—across three tasks of document retrieval, evidence retrieval, and answer generation—whether semantic chunking, which finds semantic breakpoints from the embedding similarity of each sentence, brings a consistent performance improvement worth the additional computational cost compared with simple fixed-length splitting [39]. The result was clearly situation-dependent. That is, while semantic chunking was superior on topically diverse datasets that concatenate multiple documents, on evidence retrieval fixed-length splitting was best on three of five datasets; the superiority of semantic chunking was not consistent. They concluded that the computational cost of semantic chunking is not justified by a consistent performance improvement. The choice of chunking is a design decision whose cost-effectiveness must be assessed according to the use case; in manufacturing practice as well, rather than simply adopting a sophisticated method, one must select the approach in light of the nature of the target documents and the balance of required accuracy and cost.
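For reference, breakpoint-based semantic chunking of the kind examined above can be sketched as follows. This is a simplified illustration, not the procedure evaluated in [39]: it assumes one embedding per sentence is already available, starts a new chunk whenever the cosine similarity between adjacent sentences falls below a fixed threshold, and uses hypothetical toy vectors and a hypothetical threshold of 0.5.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def semantic_chunks(sentences, embeddings, threshold=0.5):
    """Group consecutive sentences, breaking where adjacent similarity drops."""
    if len(sentences) != len(embeddings):
        raise ValueError("need exactly one embedding per sentence")
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine(embeddings[i - 1], embeddings[i]) < threshold:
            chunks.append([])  # similarity dropped: treat as a topic boundary
        chunks[-1].append(sentences[i])
    return chunks


sentences = ["Etch step A.", "Etch step B.", "Shipping rule 1.", "Shipping rule 2."]
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]  # toy vectors
grouped = semantic_chunks(sentences, embeddings)
print(grouped)
```

The additional cost discussed above comes from embedding every sentence before any chunk is formed, which fixed-length splitting avoids entirely.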
3.6 Trend: The Relationship with Long-Context LLMs
No account of RAG’s recent evolution can avoid its relationship with the long-context LLM (hereafter LC). With the emergence of LLMs whose context length reaches hundreds of thousands to one million tokens, the question arises whether, rather than retrieving documents and feeding them into the prompt with RAG, one might simply place the entire set of relevant documents directly into the context.
The study that systematically examined this question is the work by Li et al. [40]. Using the latest LLMs, they compared RAG and the LC approach across multiple datasets and showed that, given sufficient computational resources, the LC approach consistently surpasses RAG in nearly all settings. However, the predictions of the two agree for many queries, and so they proposed a hybrid approach (Self-Route) that first processes cheaply with RAG and routes only the queries on which RAG is not confident to the LC approach. With this, they claim that quality equivalent to the LC approach can be achieved at substantially lower cost. Yu et al., by contrast, take the position of defending the effectiveness of RAG even in the long-context era [41]. They proposed order-preserving RAG (OP-RAG), which arranges retrieved chunks not in order of relevance but in their order of appearance in the original document, and showed that this achieves higher answer quality with far fewer tokens than the LC approach of feeding the entire context as-is.
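The ordering idea behind OP-RAG can be illustrated with a short sketch: chunks are selected by relevance, but the selected chunks are then placed back in their original document order. The sketch assumes relevance scores have already been computed by some retriever; the function name `order_preserving_context`, the chunk texts, and the scores are hypothetical.

```python
def order_preserving_context(chunks, scores, top_k):
    """Pick the top_k chunks by score, then restore original document order."""
    if len(chunks) != len(scores):
        raise ValueError("need exactly one score per chunk")
    # Indices of chunks sorted by descending relevance.
    ranked = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    # Sorting the selected indices puts the chunks back in reading order.
    selected = sorted(ranked[:top_k])
    return [chunks[i] for i in selected]


chunks = ["c0 intro", "c1 process", "c2 defects", "c3 yield"]
scores = [0.10, 0.80, 0.30, 0.95]
context = order_preserving_context(chunks, scores, top_k=2)
print(context)
```

A relevance-ordered prompt would place "c3 yield" before "c1 process"; the order-preserving variant keeps the sequence in which the source document presented them.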
These studies indicate that the rise of long-context LLMs does not render RAG unnecessary; rather, the two should be used selectively within the trade-off between cost and quality. In particular, in manufacturing, which holds vast amounts of in-house technical documentation, placing all documents into the context every time is unrealistic in terms of both cost and confidentiality management, so retrieving and supplying only the necessary documents with RAG remains highly significant.
4. Technologies for Structuring Unstructured Data
4.1 A Technical Redefinition of the Challenge
Let us redefine, from a technical standpoint, the challenge of “the unstructured nature of in-house technical documents” raised in Chapter 1. It reduces to the problem of converting information embedded within visual structures—figures, tables, layout—into a machine-readable structure.
The difficulty of this problem becomes vivid through the concrete task of document question answering. DocVQA, constructed by Mathew et al., is a question answering dataset created from real industrial documents, and answering its questions requires not only reading text but also interpreting document structures such as forms, tables, and figures. On this dataset of 50,000 questions over 12,767 document images, it was shown that a large gap still remains between existing models and human performance (94.36% ANLS) [42]. This plainly shows that merely extracting text cannot fully capture the meaning of a document.
Then, would it suffice to hand the whole document as an image to a general-purpose multimodal LLM? Here a context-length wall stands in the way. There is a trade-off: methods combining OCR and an LLM tend to lose structural information, while natively multimodal LLMs are poor at handling long contexts. According to a survey on multimodal document RAG, representative document-RAG benchmarks require 20–200M visual tokens, which greatly exceeds the typical context length (128K–1M) of existing multimodal LLMs [43]. This survey is not primarily aimed at quantitative metrics, but it corroborates the essential difficulty of handling documents multimodally. In the following sections, we examine the family of technologies that tackle this structuring problem from diverse angles.
4.2 Layout Analysis and Document Understanding
The starting point for structuring unstructured documents is document AI, which jointly models text together with layout and visual information. Conventional NLP pretraining focused only on text and ignored layout and style information essential to document understanding. Against this, LayoutLM was the first model to jointly pretrain text and 2D layout (position embeddings) in a single framework, and was pretrained on over 11 million scanned document images. As a result, it achieved new state-of-the-art accuracy on multiple downstream tasks, including improving FUNSD form understanding from 70.72 to 79.27 [44]. The subsequent LayoutLMv2 integrated text, vision, and layout in a multimodal Transformer from the pretraining stage, surpassing the first version on six tasks, including improving document image question answering on DocVQA from 0.7295 to 0.8672 [45]. LayoutLMv3 further, in addition to unified masking of text and image, learned cross-modal alignment via Word-Patch Alignment, which predicts whether the corresponding image patch is masked, becoming the first document AI model that does not depend on a CNN-based backbone. It achieved a state-of-the-art F1 of 92.08 on FUNSD with its LARGE model, while its BASE model remains relatively small at 133M parameters [46].
Whereas these LayoutLM-family models use text and layout obtained from OCR together, there is also a line that seeks to understand documents without going through OCR itself. Donut proposed an end-to-end Transformer that generates structured output directly from a raw image, addressing the problems of relying on OCR as external preprocessing (computational cost, low language flexibility, and propagation of OCR errors to later stages). Consisting of a Swin Transformer encoder and a decoder, it handles multiple languages and domains via a synthetic data generator. This is a promising approach to handling unstructured data—such as PowerPoint or image-based data—for which OCR tends to be unstable [47]. Similarly, DocFormer introduced a multimodal self-attention layer that shares spatial embeddings across modalities, realizing an end-to-end document understanding architecture that does not depend on an object detection network or custom OCR [48]. Also, from the standpoint of converting PDFs into structured text, Nougat is important. Nougat is a Visual Transformer that converts an image of a document page into lightweight markup (Markdown), able to convert scientific documents containing mathematical expressions and superscripts/subscripts—which existing OCR could not handle—into machine-readable text [49]. These technologies provide a realistic means of distilling in-house documents such as PowerPoint and PDFs into structured text.
4.3 Table Extraction
Management tables in Excel and tables embedded in PDFs and images appear frequently in the technical documentation of the manufacturing floor. Such tabular data can be made machine-readable through table structure recognition technology.
Table extraction research has been underpinned by the development of large-scale benchmarks. TableBank constructed a table dataset of 417,234 images by automatically assigning labels as weak supervision from the source code of Word and LaTeX documents [50]. PubTabNet released a large-scale table recognition dataset derived from PubMed Central, proposed the Encoder-Dual-Decoder (EDD), in which a structure decoder that recovers table structure assists the cell decoder, and introduced TEDS (Tree-Edit-Distance-based Similarity), an evaluation metric based on tree edit distance [51]. Because TEDS captures cell shifts and content errors more appropriately than earlier adjacency-relation metrics, it has become a standard metric for measuring the accuracy of table structure recognition. TableFormer is a Transformer-based approach that predicts table structure and cell bounding boxes simultaneously and end-to-end, greatly improving TEDS from 91% to 98.5% on simple tables and from 88.7% to 95% on complex tables [52]. Furthermore, PubTables-1M constructed a large-scale dataset of about one million tables and was the first to apply DETR (Detection Transformer) to the three tasks of table detection, table structure recognition, and functional analysis. Through a canonicalization procedure, the dataset corrects the annotation inconsistency known as “oversegmentation,” in which multiple valid interpretations arise for the same table, and the authors showed that improving data quality alone significantly improves table structure recognition performance [53].
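For reference, TEDS as commonly stated compares the predicted table and the ground-truth table as HTML trees. The notation below is an assumption of this sketch: T_a and T_b denote the two trees, EditDist their tree edit distance, and |T| the number of nodes in a tree.

```latex
\mathrm{TEDS}(T_a, T_b) = 1 - \frac{\mathrm{EditDist}(T_a, T_b)}{\max\left(|T_a|, |T_b|\right)}
```

A score of 1 means the two trees are identical, and the score decreases as more node insertions, deletions, or substitutions are needed to turn one tree into the other.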
4.4 Chart, Figure, and Graph Understanding
Figures and graphs, which are particularly abundant in technical documentation, are among the most difficult targets to structure, but they are becoming interpretable through technology that converts graphs into tables and through multimodal pretraining.
ChartQA provided the evaluation foundation for this field. It is a question answering benchmark over 20,882 charts collected from four real-world sources, characterized by its inclusion of complex visual and logical reasoning such as aggregation, comparison, and maximum-value computation. Complex reasoning is required by 43.0% of the questions, reflecting a realistic difficulty that conventional template-generated datasets did not capture [54]. On the method side, a two-stage approach that converts a graph into a structured table and then reasons with an LLM is promising. DePlot is a module that converts a plot into a linearized table; when its output is passed to an LLM with Chain-of-Thought prompting, it achieved a 29.4% one-shot improvement on human-authored questions over the then-state-of-the-art model fine-tuned on thousands of examples [55]. MatCha, by contrast, strengthened graph understanding itself through pretraining. It introduced two lines of pretraining tasks: derendering, which generates the original data table or rendering code from a graph image, and mathematical reasoning. With these it surpassed the previous state of the art among methods that do not assume access to a data table by up to about 20% [56]. These approaches, which first distill unstructured charts and figures into a structured table and then handle them with an LLM, are directly tied to this paper’s concern.
4.5 OCR and Multimodal LLMs
OCR (optical character recognition), the foundational technology for reading text from documents, is also indispensable to the discussion of structuring. Its representative engine is Tesseract. Adopting a staged pipeline starting from connected-component analysis and combining a classifier that adapts to the fonts within a document, it is known as a foundational OCR engine that operates even under conditions that commercial engines of the time found difficult [57].
In addition to OCR, general-purpose multimodal LLMs have in recent years become a promising means of document, chart, and figure understanding. GPT-4V demonstrated the ability to read specialized figures in scientific papers and figures containing text, while limitations have also been reported, such as erroneously merging adjacent but distinct text within an image [58]. LLaVA, a representative example of an open multimodal LLM, is a model trained by using a language-only GPT-4 to generate multimodal instruction-following data consisting of conversation, detailed description, and complex reasoning, and connecting a vision encoder with a language decoder; it achieved a relative score of 85.1% versus GPT-4 on synthetic data [59]. Also, as a vision-language model strong at reading text, there is Qwen-VL. It is a model with a total of 9.6B parameters that introduces a position-aware cross-attention adapter and possesses fine-grained visual understanding such as grounding and text reading [60]. These general-purpose multimodal LLMs are powerful, but as the example of GPT-4V shows, they retain reliability challenges of hallucination and misreading, and caution is required in applying them to specialized documents.
4.6 An Approach That Handles Data Without Structuring: The Document RAG Pipeline
Thus far we have examined technologies that convert unstructured documents into a machine-readable structure, but in contrast, an approach that handles unstructured documents without explicit structuring is also promising—namely, the method of making the page image itself the target of retrieval.
This method is supported by cross-modal embedding, which maps images and text into a common vector space. Its foundation was laid by CLIP, which uses 400 million image–text pairs collected from the internet and, through the contrastive task of predicting “which caption corresponds to which image,” embeds images and text into the same space [61]. After such pretraining, visual concepts can be referenced in natural language, and zero-shot transfer is possible on more than 30 datasets without task-specific training. For example, on ImageNet it matched the accuracy of the original ResNet-50 zero-shot, without using any of the 1.28 million training examples [61]. This framework, which can measure similarity across text and images, forms the foundation of the page-image retrieval method described next.
ColPali starts from the concern that, although documents convey information not only through text but also through figures, layout, tables, and fonts, existing retrieval systems depend on OCR-extracted text and fragile preprocessing and fail to leverage visual cues. ColPali extends a vision-language model to generate ColBERT-style multi-vector embeddings from a page image and matches them to the query via late interaction. On the multi-domain, multilingual retrieval benchmark ViDoRe, ColPali achieved an average nDCG@5 of 81.3, greatly surpassing a strong baseline combining OCR and text. Furthermore, its indexing speed of 0.39 seconds per page is faster than that of a PDF parser (7.22 seconds per page), which is also a practical advantage [62]. Because this method does not depend on fragile OCR or preprocessing and can handle unstructured documents without structuring them, it is one promising solution to this paper’s challenge.
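The late-interaction matching used by ColBERT-style models can be sketched as a MaxSim score: for each query vector, take the maximum similarity to any vector of the page, then sum these maxima over the query vectors. The sketch below assumes dot-product similarity and tiny hand-written vectors in place of real model outputs; `maxsim_score` and all values are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vectors, page_vectors):
    """Sum, over query vectors, of the best dot product with any page vector."""
    if not page_vectors:
        raise ValueError("page must have at least one vector")
    return sum(max(dot(q, p) for p in page_vectors) for q in query_vectors)


# Toy multi-vector embeddings: two query token vectors and two patch vectors per page.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[0.9, 0.1], [0.2, 0.8]]
page_b = [[0.5, 0.5], [0.4, 0.3]]

score_a = maxsim_score(query, page_a)
score_b = maxsim_score(query, page_b)
best_page = "page_a" if score_a > score_b else "page_b"
print(score_a, score_b, best_page)
```

Because every query vector finds its own best-matching part of the page, a query term can match a table cell or a figure label without the page being reduced to a single vector.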
4.7 Turning RAW Images into RAG: A Two-Stage Pipeline of Development and Multimodal Understanding
The discussion so far has targeted general documents, but the image sensor manufacturing industry has a data format unique to it: the RAW image. A RAW image is the unprocessed raw data captured by the sensor, and in this paper we propose handling it not directly but as a two-stage pipeline: first developing it into an 8-bit RGB image, then interpreting its meaning with a multimodal model. Note that, because direct peer-reviewed literature on the handling specific to RAW images is scarce, this section is a discussion based on the author’s own experience, combining the general image processing of development with existing research on multimodal understanding.
First, as background, let us confirm the nature of RAW images. The RAW format retains the unprocessed sensor data so that white balance and tone mapping can be adjusted after shooting. This characteristic is also described in ISO 12234-4:2026, which internationally standardized Adobe’s DNG format as one of the standard file formats for RAW images [63]. Technically, the sensor is covered by a color filter array (usually a Bayer filter), and because each pixel captures only partial color information, a RAW file stores the mosaic-pattern data before demosaicing at a high bit depth, typically 12 or 14 bits [64]. A RAW image as-is is therefore not in a form that a language model can interpret, and it requires some conversion.
The first stage, development, is the step of obtaining an 8-bit RGB image through processing such as demosaicing and tone mapping. Conventionally this step was handled by a complex, hand-crafted pipeline individually designed by the camera’s ISP (image signal processing), but in recent years research replacing this with learned models has progressed. PyNET is a single end-to-end deep learning model that directly converts RAW Bayer data into high-quality RGB images without prior knowledge of the sensor or optics, achieving PSNR 21.19 and MS-SSIM 0.8620 and reported to surpass the target smartphone’s built-in ISP in perceptual quality evaluation [65]. Also, Learning to See in the Dark is a method that directly takes RAW images shot in extremely low light as input and replaces the entire conventional ISP pipeline with an end-to-end network—a representative example of directly processing RAW sensor data into an image [66]. These show that the step of developing RAW into 8-bit RGB can be realized as a single process by a learned model.
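As a point of reference for the first stage, the following is a deliberately simple classical sketch of development, not the learned PyNET or Learning to See in the Dark models. It assumes an RGGB Bayer layout and 12-bit data, collapses each 2x2 block into one RGB pixel (half resolution), omits black-level subtraction and white balance, and substitutes a plain gamma of 2.2 for real tone mapping; `develop_rggb` is a hypothetical helper.

```python
def develop_rggb(raw, bit_depth=12, gamma=2.2):
    """Convert an RGGB Bayer mosaic (2D list of ints) to half-resolution 8-bit RGB."""
    white = (1 << bit_depth) - 1  # assumed white level, e.g. 4095 for 12 bits

    def to_8bit(value):
        normalized = min(max(value / white, 0.0), 1.0)
        # Gamma encoding as a simple stand-in for tone mapping.
        return round(255 * normalized ** (1.0 / gamma))

    height, width = len(raw), len(raw[0])
    if height % 2 or width % 2:
        raise ValueError("mosaic dimensions must be even")
    rgb = []
    for y in range(0, height, 2):
        row = []
        for x in range(0, width, 2):
            r = raw[y][x]                              # top-left: red
            g = (raw[y][x + 1] + raw[y + 1][x]) / 2    # two green sites averaged
            b = raw[y + 1][x + 1]                      # bottom-right: blue
            row.append((to_8bit(r), to_8bit(g), to_8bit(b)))
        rgb.append(row)
    return rgb


# Toy 2x4 mosaic: one saturated red block and one saturated green block.
raw = [
    [4095, 0, 0, 4095],
    [0, 0, 4095, 0],
]
image = develop_rggb(raw)
print(image)
```

The resulting 8-bit RGB array is the kind of input that the multimodal models of the preceding sections expect; a production pipeline would add the omitted steps or use a learned ISP.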
In the second stage, the RGB image thus obtained is interpreted for meaning by the multimodal models described in the preceding sections. That is, existing technologies—Donut [47] for OCR-free document understanding, GPT-4V [58] and LLaVA [59] for general-purpose chart and figure understanding, or ColPali [62], which makes the page image a direct target of retrieval—are applied to the developed image. Through this two-stage pipeline, a path emerges for, in principle, handling even RAW images—an extremely unstructured form of data—as a knowledge source for RAG. This is a response unique to the image sensor manufacturing industry to the challenge raised in Chapter 1.
4.8 Practical Aspects: Cloud / OSS Services
Many of the structuring technologies described above are provided as cloud services or open-source software (OSS) in a form usable in practice.
At the document ETL (extract, transform, load) layer, the OSS Unstructured ingests over 25 diverse document formats and provides splitting into structured elements and semantic chunking [67]. At the OCR, table, and form extraction layer, AWS’s Amazon Textract provides, in addition to OCR of printed and handwritten characters, the extraction of tables and forms and the extraction of specified information [68]. At the managed RAG layer, Amazon Bedrock Knowledge Bases manages everything from data ingestion through embedding, indexing, and retrieval, and supports automatic parsing by document type, Agentic Retrieval that performs multi-hop reasoning, responses with citations, and integration that allows it to be called as a tool from MCP-compatible agents [69]. On the Microsoft side, Azure AI Document Intelligence handles the extraction of layout, tables, and forms [70], and Azure AI Search provides hybrid search combining vector search and keyword search [71]. Furthermore, the integrated vectorization feature of Azure AI Search performs chunking and embedding generation at ingestion and automatic vectorization at query time in one go, eliminating the need to build a separate vectorization pipeline [72]. Google also provides Document AI, which converts unstructured data into structured data, with processors for digitization via OCR, the extraction of forms and tables, and document classification [73]. These services divide the labor of each layer—document ETL, OCR, layout/table extraction, managed RAG, and vector search—and in practice one combines them according to the use case.
5. Technologies for Incorporation into AI Agents
5.1 Academic Foundations: Reasoning, Acting, and Tool Use
The idea of integrating RAG not as a mere retrieval mechanism but as a tool used by an AI agent rests on a series of academic foundations that connect reasoning and acting.
One of its starting points is ReAct, which integrates reasoning and acting. ReAct identified the problems that reasoning (chain-of-thought) and acting in LLMs had been studied separately, and that chain-of-thought reasoning alone is not grounded in the external world and is prone to hallucinating facts. It therefore proposed a paradigm that alternately generates reasoning traces and task-specific actions, planning actions through reasoning and, through acting, interacting with external information sources such as Wikipedia to take in knowledge. With this, it suppressed hallucination on HotpotQA and Fever, and on ALFWorld and WebShop—benchmarks for interactive decision-making—surpassed imitation-learning and reinforcement-learning methods in success rate by 34% and 10% in absolute terms, respectively [74]. This structure of reasoning while interacting with external information sources can be called the prototype of the idea of incorporating RAG as an agent’s action.
Toolformer is a framework in which an LLM learns to use tools autonomously. Through self-supervised learning, it learns which API to call, when, with what arguments, and how to incorporate the result: it samples candidate API calls and fine-tunes while keeping only the calls that actually reduce next-token-prediction loss. A 6.7-billion-parameter GPT-J-based model achieved zero-shot performance comparable to the much larger GPT-3, for instance spontaneously calling a calculation tool on 97.9% of calculation tasks [75]. Furthermore, the survey on LLM-based autonomous agents systematized the elements composing such agents. It organizes the construction of an agent into a unified framework of four modules (profile, memory, planning, and action) [76] and, while not primarily aimed at quantitative metrics, provides a framework that positions RAG as being incorporated into an agent as a means of “action” or of access to external “memory.”
5.2 The Foundation of Tool Use (Function Calling)
The core of bringing the academic foundations down into implementation is function calling, by which an LLM calls external tools. This provides the practical foundation for connecting RAG to an agent.
According to Anthropic’s official documentation, tool use in Claude is a feature that allows the model to call user-defined functions or Anthropic-provided functions and to integrate with external tools and APIs [77]. The model judges whether a call is needed based on the user’s request and the tool’s description, and tools are divided into client tools executed on the application side and server tools executed on the Anthropic side (web search, code execution, web fetch, etc.). The execution flow is structured as a loop (the agent loop) in which the model returns a block requesting tool use and the application returns the result. Retrieval by RAG, too, can be connected as one tool within this framework. That is, by defining a RAG that searches in-house documents as a tool, one realizes a configuration in which the agent calls it in situations it judges necessary and generates an answer based on the retrieval results.
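The configuration described above, in which an in-house-document RAG is exposed as one client tool, can be sketched without any network call. The tool name `search_internal_docs`, the toy corpus, and the keyword retriever are hypothetical. The `name` / `description` / `input_schema` layout and the `tool_use` / `tool_result` block shapes reflect the format of the tool-use documentation [77] as understood here and should be checked against the current documentation before use.

```python
import json

# Hypothetical tool definition for a client tool that searches in-house documents.
SEARCH_TOOL = {
    "name": "search_internal_docs",
    "description": "Search in-house technical documents and return relevant passages.",
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Natural-language search query."},
            "top_k": {"type": "integer", "description": "Maximum number of passages."},
        },
        "required": ["query"],
    },
}

# Toy corpus standing in for a real retrieval backend.
DOCS = {
    "spec-001": "dark current measurement procedure for the pixel array",
    "spec-002": "wafer dicing and package bonding steps",
}


def search_internal_docs(query, top_k=2):
    """Toy keyword retriever: rank documents by the number of shared words."""
    terms = set(query.lower().split())
    scored = []
    for doc_id, text in DOCS.items():
        overlap = len(terms & set(text.lower().split()))
        if overlap:
            scored.append((overlap, doc_id, text))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [{"doc_id": doc_id, "text": text} for _, doc_id, text in scored[:top_k]]


def handle_tool_use(block):
    """Application side of the loop: run the requested tool, return its result block."""
    if block["name"] != SEARCH_TOOL["name"]:
        raise ValueError(f"unknown tool: {block['name']}")
    hits = search_internal_docs(**block["input"])
    return {
        "type": "tool_result",
        "tool_use_id": block["id"],
        "content": json.dumps(hits),
    }


# A tool-use request as the model might emit it (the id value is made up).
example_block = {
    "type": "tool_use",
    "id": "toolu_example",
    "name": "search_internal_docs",
    "input": {"query": "dark current"},
}
result = handle_tool_use(example_block)
print(result)
```

In a full agent loop the application would send `result` back to the model in the next message and repeat until the model stops requesting tools.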
5.3 MCP: A Standard for Connecting Data Sources
What emerged as a mechanism to connect diverse data sources—including RAG’s knowledge sources—to agents in a standardized way is the Model Context Protocol (MCP).
MCP is an open standard published by Anthropic in November 2024, aimed at securely connecting AI assistants and data sources [78]. Conventionally, connecting an AI assistant to each data source required implementing a separate integration for each data source, and that fragmentation was an obstacle to adoption. MCP aims to replace such individual integrations with a single protocol. By providing, in addition to the protocol specification and SDK, prebuilt connectors for representative data sources such as Google Drive, Slack, GitHub, and PostgreSQL, it realizes standardized connection without per-data-source individual implementation. In connecting to an agent a RAG whose knowledge source is technical documentation scattered across an organization, such a standardized protocol greatly reduces the burden of implementation and operation.
5.4 Skills: Packaging Domain Knowledge
As a mechanism to give an agent domain-specific procedures and knowledge, there are modular extensions such as Anthropic’s Agent Skills.
Agent Skills is a feature that packages instructions, metadata, and optional scripts or templates, which Claude automatically uses in relevant situations [79]. Provided on a file-system basis, it adopts “progressive disclosure,” first reading the metadata and then progressively reading the body instructions and accompanying resources as needed, thereby referencing only the necessary knowledge while curbing the waste of context. Noteworthy is that, as prebuilt document-related Skills, ones handling PowerPoint, Excel, Word, and PDF are provided. This corresponds exactly to the formats of in-house unstructured documents that this paper has treated as a challenge, and constitutes a concrete means of giving an agent domain-specific document-processing procedures.
On the foundation of tool use, MCP, and Skills described above, Agentic RAG, which incorporates RAG as an agent’s tool, is realized (see [25] in Chapter 3 and the Agentic Retrieval of Amazon Bedrock Knowledge Bases [69]). That is, it has become technically possible for an agent to judge autonomously whether retrieval is needed, call the in-house-document RAG as necessary, and incorporate the results into its reasoning to answer. The next chapter examines to what extent such technologies have been realized on the industrial floor.
6. Discussion
6.1 The Feasibility of Industrial Application
The application of RAG on the industrial floor is beginning to produce concrete results, while much of it remains at the demonstration or prototype stage. What best illustrates this reality is the interview study of industrial practitioners by Brehme et al. According to this study, which analyzed semi-structured interviews with 13 practitioners, the majority of RAG implementations are concentrated on question answering tasks, and 12 of the 13 were below technology readiness level (TRL) 7—that is, they remained at the prototype stage. Also, in terms of requirement importance, while answer quality, confidentiality, and privacy were highly rated, data preprocessing was cited as the main challenge governing quality [80].
That said, there are also cases in which RAG has shown its effectiveness in concrete manufacturing use cases. Chen et al. built an interactive knowledge management system that retrieves information from industry-domain-specific unstructured documents and responds to inquiries about technical services and internal regulations. With a configuration that retrieves chunks using both BM25 and embeddings, reranks them with a reranker, and generates, it achieved high accuracy: a mean reciprocal rank (MRR) of 88% and recall of 85% for technical services, and an MRR of 97.97% and recall of 91.62% for internal regulation documents [81]. Heredia et al., taking quality control of ceramic tile manufacturing as their subject, built an advanced RAG combining bi-encoder retrieval, cross-encoder reranking, and generation. They achieved a Jaccard similarity of 92.68% and an F1 score of 85.81% in retrieval evaluation and an average ROUGE-L of 0.61 in generation evaluation, processing 830 queries at a cost of about $1 while being more accurate than a general-purpose GPT-4 [82]. These are concrete examples showing that domain-specialized RAG can function in manufacturing practice.
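For readers less familiar with the retrieval metrics quoted above, the following sketch computes mean reciprocal rank and recall. The exact definitions used in [81] are not restated in this paper, so the sketch assumes common ones: MRR averages the reciprocal rank of the first relevant document per query, and recall@k averages the fraction of each query’s relevant documents found in the top k. All document IDs are hypothetical.

```python
def mean_reciprocal_rank(results, relevant):
    """Average over queries of 1 / rank of the first relevant document (0 if none)."""
    if not results:
        return 0.0
    total = 0.0
    for ranked, gold in zip(results, relevant):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in gold:
                total += 1.0 / rank
                break
    return total / len(results)


def recall_at_k(results, relevant, k):
    """Average over queries of the share of relevant documents found in the top k."""
    if not results:
        return 0.0
    total = 0.0
    for ranked, gold in zip(results, relevant):
        if gold:
            total += len(gold & set(ranked[:k])) / len(gold)
    return total / len(results)


# Two toy queries: ranked retrieval output and the set of relevant documents.
results = [["d1", "d2", "d3"], ["d4", "d5", "d6"]]
relevant = [{"d2"}, {"d6", "d9"}]

mrr = mean_reciprocal_rank(results, relevant)
recall = recall_at_k(results, relevant, k=3)
print(round(mrr, 4), round(recall, 4))
```

MRR rewards placing the first relevant document near the top, while recall rewards finding as many relevant documents as possible, which is why the two figures are reported side by side.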
Incidentally, whether such domain specialization should be realized through retrieval augmentation (RAG) or retraining (fine-tuning) is a design issue in industrial application. What this paper has assumed thus far is frozen-type RAG, which prepends retrieved documents to the prompt without retraining the generator, but the two are not necessarily mutually exclusive. RAFT (Retrieval Augmented Fine-Tuning), proposed by Zhang et al., is a method that compromises between the two: it mixes, into the group of retrieved documents, the documents that form the correct basis and irrelevant distractor documents, and fine-tunes the generator to ignore the distractors and answer with chain-of-thought while quoting the basis verbatim from the relevant documents. It is reported to consistently surpass domain-specific fine-tuning and RAG alone in domain-specialized settings such as PubMed and HotpotQA [83]. In a domain like manufacturing, which contains much specialized terminology and in-house-specific knowledge, such a combination of RAG and fine-tuning also comes into view as an option among adaptation methods.
6.2 The Limits of Accuracy and Its Evaluation
While industrial application advances, RAG’s accuracy has clear limits, and moreover the methods for measuring that accuracy are themselves still developing.
The limits of accuracy were shown quantitatively by the benchmark study RGB by Chen et al. It measures the four fundamental abilities required for RAG (noise robustness, negative rejection, information integration, and counterfactual robustness) and revealed how accuracy degrades when noise is mixed into the retrieval results. Under a noise ratio of 0.8, ChatGPT’s accuracy fell sharply from 96.33% to 76.00%, and the success rate of negative rejection, that is, declining to answer when no answer should be given, remained at most 45% in English [84]. This shows that RAG harbors fundamental weaknesses in handling retrieval noise, rejecting appropriately when no answer exists, and integrating information from multiple documents.
As for the method of how to measure accuracy, Ragas, proposed by Es et al., provided a framework for automatic evaluation that requires no reference answer. Ragas automatically measures, via prompts to an LLM, three axes: faithfulness, which measures whether the answer is grounded in the context; answer relevance, which measures whether it accurately answers the question; and context relevance, which measures whether the retrieved context is focused [85]. However, as the aforementioned interview study reports, evaluation on the industrial floor is still done almost entirely by hand, with only 2 of 13 cases automated [80]. The automation and standardization of evaluation is a remaining challenge in the practical application of RAG.
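The idea behind a faithfulness-style score can be sketched as the share of statements in an answer that the retrieved context supports. In the Python sketch below, the two LLM-driven steps (splitting the answer into statements and judging support) are passed in as callables; `faithfulness_score` and the toy stand-ins `split_sentences` and `substring_support` are hypothetical names for illustration and are not the API of the Ragas library.

```python
from typing import Callable


def faithfulness_score(answer: str, context: str,
                       extract_statements: Callable[[str], list[str]],
                       is_supported: Callable[[str, str], bool]) -> float:
    """Fraction of the answer's statements that the context supports."""
    statements = extract_statements(answer)
    if not statements:
        # Assumption: an answer with no checkable statements scores 0.0.
        return 0.0
    supported = sum(1 for s in statements if is_supported(s, context))
    return supported / len(statements)


def split_sentences(answer: str) -> list[str]:
    """Toy stand-in for the LLM step that decomposes an answer."""
    return [s.strip() for s in answer.split(".") if s.strip()]


def substring_support(statement: str, context: str) -> bool:
    """Toy stand-in for the LLM step that verifies a statement."""
    return statement.lower() in context.lower()
```

In a real evaluation both stand-ins would be replaced by prompts to an LLM, and the reliability of the score would then depend on that judge model, which is one reason automated scores are usually spot-checked against human review.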
6.3 Remaining Challenges of Confidentiality and Cost
As long as confidential in-house data serves as the knowledge source, the risk of data leakage and the cost of operation are unavoidable challenges.
The privacy study by Zeng et al. raised a serious concern about confidentiality. They proposed a composite structured prompt attack consisting of an information part and an instruction part, and demonstrated that confidential information can be extracted verbatim from a RAG system's retrieval database. The attack success rate reached nearly 50%; summarization as a mitigation reduced the risk of untargeted attacks by about 50% but had limited effect against targeted attacks [86]. For manufacturing, where confidential in-house data is the knowledge source, this risk cannot be overlooked. In the aforementioned Deloitte survey, too, 55% of companies report concern about unauthorized access and 47% about theft of intellectual property in manufacturing OT environments, and an average of 15.74% of the IT budget is reported to be allocated to cybersecurity [2].
In terms of cost, on the other hand, there are cases showing that RAG can be inexpensive. The aforementioned quality-control RAG of Heredia et al. processed 830 queries at a cost of about $1 [82], and the company cases discussed later also report reductions in labor hours. Ensuring confidentiality requires correspondingly careful operational design, but depending on the use case the cost itself can stay within a realistic range.
6.4 The Gap Between the Technology in Papers and Current Services
A substantial gap separates the cutting-edge technologies described so far from the services actually implemented and operated in the field. Methods such as GraphRAG, Agentic RAG, and multimodal document understanding have achieved remarkable results in research, but their field application remains limited. The aforementioned interview study illustrates this plainly: it cited data preprocessing as the main challenge and reported that the majority of implementations remain at the prototype stage [80]. In practice, what determines the success or failure of a system is less the cutting-edge algorithms themselves than the steady, unglamorous work of data preprocessing and preparation, integration with in-house authentication, and file-level access control. The company cases examined in the next section corroborate this picture.
6.5 Implications from Other Companies’ Cases
Official cases of companies that have applied RAG to their own in-house data empirically corroborate the issues this paper has discussed: feasibility, cost, confidentiality, and data preparation. However, all of the following are based on official announcements by the companies or their vendors and must be read with the caveat that they may include information favorable to the company.
The case of Panasonic Connect corroborates the importance of data preparation in the words of those directly involved. The company deployed its in-house AI assistant “ConnectAI” to all of its approximately 12,400 domestic employees and, from April 2024, has had RAG reference 630 confidential quality-control documents totaling 11,743 pages in addition to its own public information (about 3,700 website pages, 495 news-release pages, and 3,200 external homepage pages). The assistant is reported to have reduced working hours across all employees by 186,000 hours over one year of deployment, yet the company itself clearly states that it could not adequately answer questions requiring proprietary data and that preparing its own data is extremely important [87]. This is the most important case for this paper, because it directly connects the paper's core, Chapter 4 (the structuring of in-house unstructured data), with this chapter's discussion of limits.
The case of Toyota Motor Corporation's Advanced R&D and Engineering Company on AWS shows that a confidentiality-conscious platform is feasible. The company consolidated the RAG systems that had proliferated department by department and built a secure RAG platform featuring hybrid search that combines semantic search and vector search, query expansion, integration with the in-house authentication system, and file-level access control. As of December 2024, 11 departments and about 150 users were using it, and it is reported to have reduced research time by about 20% and the labor hours spent querying already-registered information by about 50% [88]. It is a good example that presents both concrete confidentiality measures (in-house authentication integration and file-level access control) and quantitative effects.
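The source does not describe how this platform fuses its search results or enforces permissions, so the following Python sketch only illustrates the general pattern: two rankings are merged with Reciprocal Rank Fusion [34], and the fused list is then filtered by a file-level access control list. The function names, the ACL layout (document ID mapped to a set of permitted groups), and the deny-by-default rule are assumptions made for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs by summing 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def filter_by_access(doc_ids, doc_acl, user_groups):
    """Keep only documents that share at least one group with the user.

    Assumption: a document without an ACL entry is denied by default.
    """
    return [d for d in doc_ids if doc_acl.get(d, set()) & user_groups]


if __name__ == "__main__":
    # Hypothetical rankings from two retrievers over the same corpus.
    first_ranking = ["a", "b", "c"]
    second_ranking = ["b", "c", "d"]
    fused = reciprocal_rank_fusion([first_ranking, second_ranking])
    acl = {"a": {"design"}, "b": {"quality"}, "c": {"quality", "design"}}
    print(filter_by_access(fused, acl, {"quality"}))
```

Filtering after fusion keeps the sketch short. A production system might instead apply the permission check inside each retriever, so that documents a user is not allowed to see are never scored or passed to the generator.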
An example of field application that goes beyond Naive RAG is Panasonic Connect's observation-driven AI agent. This technology uses a knowledge graph rather than text as the reference target for RAG and improves answer accuracy by iterating three stages: observation, action, and reflection. Its results were accepted at ACL 2024, a top international conference in natural language processing [89]. It can be regarded as a field effort that combines the GraphRAG described in Chapter 3 with the Agentic RAG described in Chapter 5. Similarly, Fujitsu's enterprise generative AI framework “Fujitsu Kozuchi” offers “knowledge-graph-augmented RAG,” which structures enterprise data as a knowledge graph and can reference data on the scale of 10 million tokens, and “generative AI audit technology,” which verifies whether output complies with corporate rules and laws; cited uses include technical succession and compliance [90]. This supplementary case points toward handling large volumes of documents while ensuring confidentiality and reliability.
These official cases should be read against the findings of peer-reviewed literature such as the aforementioned industrial RAG study [80] and the privacy study [86]. Behind the impressive results presented in official announcements, the challenges identified by peer-reviewed research, namely the difficulty of data preparation and the risk to confidentiality, still remain. Reading the two together gives a clearer picture of the actual state of industrial application.
7. Conclusion
7.1 Conclusion
This paper took as its starting point the question of “to what extent RAG can respond to the manufacturing floor’s demand to leverage in-house unstructured technical data.” Drawing on the discussion from Chapters 2 through 6, we summarize the answer to this question.
RAG has a clear advantage over a standalone LLM: by referencing external knowledge sources, it suppresses hallucination, allows knowledge to be updated without retraining, and can present the basis for an answer. Research on RAG has evolved systematically from Naive RAG to Advanced RAG, Modular RAG, GraphRAG, and Agentic RAG, and has advanced steadily with the support of component technologies such as embeddings, reranking, and hybrid search. For the structuring of unstructured data, which is the core of this paper, a diverse set of technologies is now available, ranging from layout analysis, table extraction, chart and figure understanding, OCR, and multimodal LLMs to document RAG that handles page images directly. For the RAW images peculiar to the image sensor manufacturing industry as well, a path can be drawn, at least in principle, for handling them as a RAG knowledge source via a two-stage pipeline that first converts them into 8-bit RGB images through development and then interprets their content with a multimodal model. Furthermore, building on function calling, MCP, and Skills, it has become technically possible to incorporate RAG as one of an agent's tools.
The conclusion of this paper is therefore that RAG is a promising technology for leveraging unstructured data in the image sensor manufacturing industry, though not unconditionally. In actual industrial application, many implementations remain at the prototype stage, and practical challenges persist: the limits on accuracy caused by retrieval noise, the risk of confidential information leaking from the retrieval database, and the lag in automating evaluation. In particular, both peer-reviewed research and company cases consistently point out that the structuring and preparation of data is the main challenge governing quality. The promise of RAG is realized only when these challenges are addressed realistically.
7.2 Outlook
Finally, we outline directions for future work. First, the two-stage pipeline for RAW images proposed in this paper (development followed by multimodal understanding) remains a conceptual proposal by the author and requires refinement and verification with real data from the image sensor manufacturing industry. Second, field application of methods beyond Naive RAG, such as GraphRAG and Agentic RAG, is beginning, as the company cases show, and integration of these methods with multimodal document understanding is anticipated. A key issue there is how to overcome the context-length wall of multimodal document RAG [43]. Third, regarding accuracy evaluation, which has repeatedly surfaced as a challenge in this paper, it will be important to advance methods such as reference-free automatic evaluation [85] and to establish an evaluation foundation that can be operated as standard practice in industrial settings. We believe that tackling these challenges opens the path to truly leveraging in-house unstructured technical data through RAG.
References
The type of each reference (peer-reviewed / preprint / vendor official documentation / industry report / standard or tertiary source) is noted alongside it. The access date for all vendor official documentation, standards, and tertiary sources is June 19, 2026. Each entry is followed by a brief summary of what the work actually presents.
[1] McKinsey & Company. The economic potential of generative AI: The next productivity frontier. Industry report, 2023. Link (Industry report. The full text of the report itself was not obtained; only figures confirmable from the public summary and third-party reporting were used.)
An industry report by McKinsey & Company estimates that generative AI could add USD 2.6-4.4 trillion in value annually across the 63 use cases analyzed, with roughly 75% of this value concentrated in four areas: customer operations, marketing and sales, software development, and research and development.
[2] Deloitte. 2025 Smart Manufacturing and Operations Survey. Industry report, 2025. Link (Industry survey.)
A Deloitte industry survey of 600 executives at major U.S. manufacturers found that 92% regard smart manufacturing as the key driver of competitiveness over the next three years, with adoption yielding 10-20% gains in production volume and 7-20% gains in workforce productivity; 65% rank operational risk among their top concerns.
[3] Samsung Electronics. Samsung Electronics Announces Strategy To Transition Global Manufacturing Into ‘AI-Driven Factories’ by 2030. Official press release, March 2026. Link (Official corporate announcement. Includes forward-looking vision.)
An official Samsung Electronics press release announcing a strategy to transform all manufacturing sites into AI-driven factories by 2030, integrating AI across the entire production value chain from procurement logistics to quality inspection and shipment, with the deployment of digital twins and dedicated AI agents.
[4] Lewis, P., Perez, E., Piktus, A., et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020. arXiv:2005.11401. Peer-reviewed paper.
This work proposes Retrieval-Augmented Generation (RAG), which combines a pre-trained seq2seq model (parametric memory) with a dense Wikipedia index accessed by a DPR retriever (non-parametric memory) and fine-tunes the retriever and generator end-to-end. It achieved state-of-the-art accuracy on several open-domain QA benchmarks and demonstrated that knowledge can be updated by swapping the non-parametric memory.
[5] Karpukhin, V., Oğuz, B., Min, S., et al. Dense Passage Retrieval for Open-Domain Question Answering. EMNLP 2020. arXiv:2004.04906. Peer-reviewed paper.
This work proposes Dense Passage Retrieval (DPR), a bi-encoder approach that embeds questions and passages into dense vectors using two BERT encoders and learns inner-product similarity. It outperformed BM25 by 9-19 absolute points in Top-20 retrieval accuracy and surpassed ORQA in end-to-end QA (41.5% vs. 33.3% on Natural Questions).
[6] Ji, Z., Lee, N., Frieske, R., et al. Survey of Hallucination in Natural Language Generation. ACM Computing Surveys, Vol.55, No.12, 2023. DOI:10.1145/3571730 (arXiv:2202.03629). Peer-reviewed paper (survey).
A comprehensive survey of hallucination in deep-learning-based natural language generation, classifying it into intrinsic hallucination (output contradicting the input) and extrinsic hallucination (output unverifiable from the input). It also presents a dichotomy of data-related versus training-and-inference-related causes and organizes evaluation metrics and mitigation methods by task.
[7] Huang, L., Yu, W., Ma, W., et al. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions. Preprint, 2023. arXiv:2311.05232. Preprint (not peer-reviewed).
A survey presenting a hallucination taxonomy tailored to the LLM era, distinguishing factuality hallucination (inconsistency with real-world facts) from faithfulness hallucination (deviation from user input, context, or self-consistency). It organizes causes across the data, training, and inference stages and discusses RAG as a mitigation method along with its limitations.
[8] Lee, K., Chang, M.-W., Toutanova, K. Latent Retrieval for Weakly Supervised Open Domain Question Answering (ORQA). ACL 2019. arXiv:1906.00300. Peer-reviewed paper.
This work proposes ORQA, the first open-retrieval QA system that jointly trains a retriever and a reader using only question-answer string pairs, pre-training the retriever with the Inverse Cloze Task. On datasets where users genuinely seek unknown answers, learned retrieval proved decisive, outperforming BM25 by 6-19 points in exact match.
[9] Guu, K., Lee, K., Tung, Z., Pasupat, P., Chang, M.-W. REALM: Retrieval-Augmented Language Model Pre-Training. ICML 2020. arXiv:2002.08909. Peer-reviewed paper.
This work proposes REALM, a retrieval-augmented language model that incorporates a latent knowledge retriever into language model pre-training to retrieve and attend over documents from a large text corpus. It outperformed prior methods by 4-16 absolute points on three open-domain QA benchmarks while offering advantages in interpretability and modularity.
[10] Khandelwal, U., Levy, O., Jurafsky, D., Zettlemoyer, L., Lewis, M. Generalization through Memorization: Nearest Neighbor Language Models. ICLR 2020. arXiv:1911.00172. Peer-reviewed paper.
This work proposes kNN-LM, which linearly interpolates a pre-trained language model’s predictions with the results of k-nearest-neighbor retrieval. By building a datastore that maps context embeddings (keys) to next tokens (values), it can explicitly reference rare patterns without additional training, achieving a perplexity of 15.79 on WIKITEXT-103 and setting a new state of the art.
[11] Izacard, G., Grave, E. Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering (Fusion-in-Decoder). EACL 2021. arXiv:2007.01282. Peer-reviewed paper.
This work proposes Fusion-in-Decoder (FiD), which processes each retrieved passage independently with the encoder alongside the question and lets the decoder attend over the concatenation of all representations to generate an answer. It achieved state-of-the-art results on Natural Questions and TriviaQA (NQ EM 51.4, TriviaQA EM 67.6) and showed that performance improves as the number of retrieved passages increases up to 100.
[12] Izacard, G., Lewis, P., Lomeli, M., et al. Atlas: Few-shot Learning with Retrieval Augmented Language Models. JMLR Vol.24, 2023. arXiv:2208.03299. Peer-reviewed paper.
This work proposes Atlas, a retrieval-augmented language model combining a Contriever-based dense retriever with a Fusion-in-Decoder generator and using retrieval in both pre-training and fine-tuning. With 11B parameters and only 64 examples, it achieved 42.4% accuracy on Natural Questions, outperforming the 50-times-larger 540B PaLM by about 3 points.
[13] Borgeaud, S., Mensch, A., Hoffmann, J., et al. Improving Language Models by Retrieving from Trillions of Tokens (RETRO). ICML 2022. arXiv:2112.04426. Peer-reviewed paper.
This work proposes RETRO (Retrieval-Enhanced Transformer), a retrieval-augmented autoregressive model that retrieves document chunks similar to preceding chunks from a large corpus and incorporates them via a chunked cross-attention mechanism. Using a 2-trillion-token database, it matched the performance of GPT-3 and Jurassic-1 on The Pile with 25 times fewer parameters.
[14] Shi, W., Min, S., Yasunaga, M., Seo, M., James, R., Lewis, M., Zettlemoyer, L., Yih, W.-t. REPLUG: Retrieval-Augmented Black-Box Language Models. NAACL 2024. arXiv:2301.12652. Peer-reviewed paper.
This work proposes REPLUG, a retrieval-augmented framework that keeps the language model frozen as a black box and simply prepends retrieved documents to the input, along with REPLUG LSR, which fine-tunes the retriever using the language model’s predictions as a supervisory signal. With the tuned retriever, it improved GPT-3 (175B) language modeling by 6.3% and Codex’s five-shot MMLU performance by 5.1%.
[15] Ram, O., Levine, Y., Dalmedigos, I., Muhlgay, D., Shashua, A., Leyton-Brown, K., Shoham, Y. In-Context Retrieval-Augmented Language Models. TACL Vol.11, 2023. arXiv:2302.00083. Peer-reviewed paper.
This work proposes In-Context RALM, which leaves the language model architecture entirely unchanged and simply concatenates retrieved documents before the input prefix. Usable with off-the-shelf general-purpose language models and retrievers, it achieved language-modeling improvements equivalent to a 2-3x increase in parameter count across all corpora tested.
[16] Gao, Y., Xiong, Y., Gao, X., et al. Retrieval-Augmented Generation for Large Language Models: A Survey. Preprint, 2023/2024. arXiv:2312.10997. Preprint (a widely cited central survey).
A survey reviewing over 100 RAG studies and organizing their development into three paradigms—Naive RAG, Advanced RAG, and Modular RAG. It analyzes each stage through the three components of Retrieval, Generation, and Augmentation, and reports that RAG consistently outperforms unsupervised fine-tuning.
[17] Gao, Y., Xiong, Y., Wang, M., Wang, H. Modular RAG: Transforming RAG Systems into LEGO-like Reconfigurable Frameworks. Preprint, 2024. arXiv:2407.21059. Preprint (not peer-reviewed).
Proposes the Modular RAG framework, which decomposes increasingly complex RAG systems into independent modules and dedicated operators that can be reconfigured like LEGO blocks. Going beyond the conventional linear structure, it integrates routing, scheduling, and fusion mechanisms and identifies common RAG patterns such as linear, conditional, branching, and looping.
[18] Asai, A., Wu, Z., Wang, Y., Sil, A., Hajishirzi, H. Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. ICLR 2024. arXiv:2310.11511. Peer-reviewed paper.
Proposes Self-RAG, which trains a single LM end-to-end to generate “reflection tokens” that judge whether retrieval is needed and assess the relevance, support, and output quality of retrieved passages. Its 7B and 13B models significantly outperform ChatGPT and retrieval-augmented Llama2-chat on open-domain QA, reasoning, and fact verification, substantially improving factuality and citation accuracy in long-form generation.
[19] Yan, S.-Q., Gu, J.-C., Zhu, Y., Ling, Z.-H. Corrective Retrieval Augmented Generation (CRAG). Preprint, 2024. arXiv:2401.15884. Preprint (not peer-reviewed).
Proposes Corrective RAG (CRAG), in which a lightweight retrieval evaluator assesses the quality of retrieved documents and triggers different knowledge-acquisition actions—Correct, Incorrect, or Ambiguous—based on confidence. With integrated web search and a decompose-then-recompose extraction step, it significantly improves over standard RAG and Self-RAG across four datasets.
[20] Peng, B., Zhu, Y., Liu, Y., et al. Graph Retrieval-Augmented Generation: A Survey. Preprint, 2024. arXiv:2408.08921. Preprint (not peer-reviewed).
The first systematic survey of GraphRAG, formalizing its workflow into three stages: Graph-Based Indexing, Graph-Guided Retrieval, and Graph-Enhanced Generation. To capture structural and relational knowledge among entities that semantic similarity alone misses, it analyzes the model choices, method designs, and enhancement strategies at each stage.
[21] Edge, D., Trinh, H., Cheng, N., et al. From Local to Global: A Graph RAG Approach to Query-Focused Summarization (Microsoft GraphRAG). Preprint, 2024. arXiv:2404.16130. Preprint (not peer-reviewed).
Proposes Microsoft GraphRAG, which uses an LLM to build an entity knowledge graph from source documents, pre-generates summaries for each community of related entities, and answers global queries via a map-reduce process. On sensemaking questions over million-token datasets, it substantially outperforms conventional vector RAG in both comprehensiveness and diversity of answers when using GPT-4.
[22] Sarthi, P., Abdullah, S., Tuli, A., Khanna, S., Goldie, A., Manning, C. D. RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval. ICLR 2024. arXiv:2401.18059. Peer-reviewed paper.
Proposes RAPTOR, which recursively embeds, clusters, and summarizes chunks to build a tree of summaries at varying levels of abstraction in a bottom-up manner. By retrieving from multiple tree layers at inference time, it can grasp an entire document holistically, and RAPTOR combined with GPT-4 improves the best result on the QuALITY benchmark by 20% in absolute accuracy.
[23] Jiang, Z., Xu, F. F., Gao, L., Sun, Z., Liu, Q., Dwivedi-Yu, J., Yang, Y., Callan, J., Neubig, G. Active Retrieval Augmented Generation (FLARE). EMNLP 2023. arXiv:2305.06983. Peer-reviewed paper.
Generalizes active retrieval-augmented generation, which actively decides when and what to retrieve during generation, and proposes the concrete method FLARE. By tentatively generating the upcoming sentence to look ahead and re-retrieving using that sentence as a query when it contains low-confidence tokens, FLARE achieves superior or competitive performance across all four long-form, knowledge-intensive generation tasks.
[24] Trivedi, H., Balasubramanian, N., Khot, T., Sabharwal, A. Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions (IRCoT). ACL 2023. arXiv:2212.10509. Peer-reviewed paper.
Proposes IRCoT, which interleaves retrieval and chain-of-thought (CoT) steps, using CoT to guide retrieval and retrieval results to refine the CoT iteratively. With GPT-3, IRCoT improves retrieval by up to 21 points and downstream QA by up to 15 points across four multi-hop QA datasets while reducing hallucination.
[25] Singh, A., Ehtesham, A., Kumar, S., Talaei Khoei, T. Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG. Preprint, 2025. arXiv:2501.09136. Preprint (not peer-reviewed).
A survey providing an analytical overview of Agentic RAG, which embeds autonomous AI agents into the RAG pipeline. It organizes design patterns such as reflection, planning, tool use, and multi-agent collaboration along with various workflow patterns, and presents a principled taxonomy based on the number of agents, control structure, autonomy, and knowledge representation.
[26] Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs. Preprint, 2025. arXiv:2507.09477. Preprint (not peer-reviewed).
A survey that organizes retrieval and reasoning under a unified perspective, presenting a taxonomy that classifies methods, datasets, and challenges into three types: Reasoning-Enhanced RAG, RAG-Enhanced Reasoning, and Synergized (agentic) RAG-Reasoning that iteratively coordinates the two. It discusses the developmental direction of iterative coordination between retrieval and reasoning.
[27] Reimers, N., Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP-IJCNLP 2019. arXiv:1908.10084. Peer-reviewed paper.
Proposes Sentence-BERT, which fine-tunes BERT/RoBERTa with siamese and triplet structures and applies pooling to produce fixed-length sentence embeddings comparable by cosine similarity. It reduces the search for the most similar pair among 10,000 sentences from about 65 hours to roughly 5 seconds and outperforms InferSent by 11.7 points on average across seven STS tasks.
[28] Wang, L., Yang, N., Huang, X., et al. Text Embeddings by Weakly-Supervised Contrastive Pre-training (E5). Preprint, 2022. arXiv:2212.03533. Preprint (not peer-reviewed).
Trains the general-purpose embedding model E5 via weakly supervised contrastive pre-training with in-batch negatives, using a large set of text pairs (CCPairs) selected by a consistency-based filter. Evaluated on 56 datasets including BEIR and MTEB, it becomes the first model to surpass BM25 in a zero-shot setting without any labels.
[29] Xiao, S., Liu, Z., Zhang, P., Muennighoff, N., et al. C-Pack: Packed Resources For General Chinese Embeddings (BGE). SIGIR 2024. arXiv:2309.07597. Peer-reviewed paper.
Develops and releases C-Pack, a resource package for general-purpose Chinese embeddings comprising the C-MTEB evaluation benchmark, the large-scale C-MTP training data, the BGE family of embedding models, and a three-stage training recipe. BGE outperforms existing Chinese embeddings by over 10% on C-MTEB at release and is offered in small, base, and large sizes.
[30] Malkov, Yu. A., Yashunin, D. A. Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs (HNSW). IEEE TPAMI, Vol.42, No.4, 2020. arXiv:1603.09320. Peer-reviewed paper.
Proposes HNSW, an approximate nearest-neighbor search that incrementally builds a multi-layer proximity graph in which elements are assigned to layers by an exponentially decaying probability distribution and links are separated by characteristic distance scale. By starting the search from upper layers, it achieves logarithmic complexity scaling and substantially outperforms the state-of-the-art vector-specific methods of its time in the speed-recall trade-off.
[31] Khattab, O., Zaharia, M. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. SIGIR 2020. arXiv:2004.12832. Peer-reviewed paper.
ColBERT independently encodes queries and documents into sets of contextualized embeddings and estimates relevance through cheap late interaction via MaxSim. It matches BERT-based methods in effectiveness while achieving over 170x speedup and four orders of magnitude fewer FLOPs in reranking.
[32] Nogueira, R., Jiang, Z., Pradeep, R., Lin, J. Document Ranking with a Pretrained Sequence-to-Sequence Model (monoT5). Findings of EMNLP 2020. arXiv:2003.06713. Peer-reviewed paper.
monoT5 adapts the sequence-to-sequence model T5 as a reranker, generating “true/false” from query-document pairs and ranking by the probability of those logits. It reaches MRR@10 of .383 on MS MARCO, surpassing BERT-large (.372), and shows particular strength under data-scarce conditions.
[33] Robertson, S., Zaragoza, H. The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval, Vol.3, No.4, 2009. DOI:10.1561/1500000019. Peer-reviewed paper.
This expository work systematizes the probabilistic relevance framework, which probabilistically estimates relevance in information retrieval and ranks documents by descending probability of relevance, presenting in unified form the term-weighting function BM25 and BM25F derived from the binary independence model and relevance feedback. It is the canonical reference for the sparse side of hybrid retrieval.
[34] Cormack, G. V., Clarke, C. L. A., Büttcher, S. Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods (RRF). SIGIR 2009, pp.758–759. DOI:10.1145/1571941.1572114. Peer-reviewed paper.
Reciprocal Rank Fusion (RRF) scores each document by summing the reciprocal ranks 1/(k+r) across all systems. It fuses rankings using only rank positions and no training examples, outperforming the best individual systems and methods such as CombMNZ by 4-5% on average across TREC tasks (with k=60 optimal).
[35] Gao, L., Ma, X., Lin, J., Callan, J. Precise Zero-Shot Dense Retrieval without Relevance Labels (HyDE). ACL 2023. arXiv:2212.10496. Peer-reviewed paper.
HyDE has an instruction-following LLM generate a hypothetical document answering the query, then embeds it with an unsupervised encoder to retrieve neighbors of real documents. Requiring no labels in a zero-shot setting, it reaches nDCG@10 of 61.3 on TREC DL19, far exceeding Contriever (44.5) and BM25 (50.6).
[36] Chen, T., Wang, H., Chen, S., et al. Dense X Retrieval: What Retrieval Granularity Should We Use? EMNLP 2024. arXiv:2312.06648. Peer-reviewed paper.
This work proposes the “proposition”—a minimal, self-contained factual unit—as a new retrieval unit and builds FactoidWiki by segmenting English Wikipedia at three granularities. Proposition-level retrieval improves unsupervised retrievers’ Recall@5 by +9 to +12 points on average and also raises downstream QA EM@500.
[37] Duarte, A. V., Marques, J., Graça, M., et al. LumberChunker: Long-Form Narrative Document Segmentation. Findings of EMNLP 2024. arXiv:2406.17526. Peer-reviewed paper.
LumberChunker is a dynamic chunking method that has an LLM judge content shift points to split documents into variable-length segments. It outperforms the most competitive baseline by 7.37% in retrieval performance (DCG@20) and, when integrated into a RAG pipeline, is more effective than other chunking methods and Gemini 1.5 Pro.
[38] Günther, M., Mohr, I., Williams, D. J., Wang, B., Xiao, H. Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models. Preprint, 2024. arXiv:2409.04701. Preprint (not peer-reviewed).
Late chunking first embeds the entire document with a long-context embedding model and only then pools at the chunk level, rather than splitting into chunks before embedding. Each chunk embedding thus retains the document-wide context, improving nDCG@10 by a relative 2.7 to 3.6% over naive splitting.
[39] Qu, R., Tu, R., Bao, F. S. Is Semantic Chunking Worth the Computational Cost? Findings of NAACL 2025. arXiv:2410.13070. Peer-reviewed paper.
This study evaluates at scale, across three proxy tasks (document retrieval, evidence retrieval, and answer generation), whether semantic chunking yields performance gains commensurate with its added computational cost. It concludes that the gains are highly task-dependent, inconsistent, and often fail to justify the extra cost, serving as a counterpoint on the cost-effectiveness of advanced chunking.
[40] Li, Z., Li, C., Zhang, M., Mei, Q., Bendersky, M. Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach. EMNLP 2024 (Industry Track). arXiv:2407.16833. Peer-reviewed paper.
Using state-of-the-art LLMs, this work systematically compares RAG with long-context (LC) LLMs and proposes Self-Route, a hybrid that first processes queries cheaply with RAG and routes to LC only the queries that RAG is not confident it can answer. While LC outperforms RAG in nearly all settings, Self-Route achieves LC-equivalent performance at substantially lower cost.
[41] Yu, T., Xu, A., Akkiraju, R. In Defense of RAG in the Era of Long-Context Language Models. Preprint, 2024. arXiv:2409.01666. Preprint (not peer-reviewed).
Order-preserve RAG (OP-RAG) arranges retrieved chunks in their original order of appearance in the document rather than by relevance. Answer quality follows an inverted-U curve as the number of retrieved chunks grows, and OP-RAG attains higher answer quality with far fewer tokens than feeding the full context to a long-context LLM.
[42] Mathew, M., Karatzas, D., Jawahar, C. V. DocVQA: A Dataset for VQA on Document Images. WACV 2021. arXiv:2007.00398. Peer-reviewed paper.
DocVQA is an extractive QA dataset built from real industry documents, where answering requires interpreting structure such as layout, tables, forms, and figures. It comprises 50,000 questions over 12,767 document images and shows that a large gap remains between existing models and human performance (94.36% accuracy).
[43] Scaling Beyond Context: A Survey of Multimodal Retrieval-Augmented Generation for Document Understanding. Preprint, 2025. arXiv:2510.15253. Preprint (not peer-reviewed).
This is the first systematic survey of multimodal RAG for document understanding, classifying methods by domain, retrieval modality, granularity, graph integration, and agentic extensions. It notes that representative document-RAG benchmarks require 20–200M visual tokens, far exceeding the context lengths of existing MLLMs, and that the number of papers has surged since 2024.
[44] Xu, Y., Li, M., Cui, L., Huang, S., Wei, F., Zhou, M. LayoutLM: Pre-training of Text and Layout for Document Image Understanding. KDD 2020. arXiv:1912.13318. Peer-reviewed paper.
LayoutLM is the first model to jointly pre-train text and layout (2D positional embeddings) within a single framework. It improves form understanding (FUNSD) from 70.72 to 79.27 and document image classification (RVL-CDIP) from 93.07 to 94.42, achieving new state-of-the-art on multiple document understanding tasks.
[45] Xu, Y., Xu, Y., Lv, T., Cui, L., Wei, F., et al. LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding. ACL-IJCNLP 2021. arXiv:2012.14740. Peer-reviewed paper.
LayoutLMv2 integrates text, vision, and layout via a multimodal Transformer from the pre-training stage, introducing new Text-Image Alignment and Matching tasks and a spatial-aware self-attention mechanism. It surpasses the original on six tasks, including FUNSD and DocVQA (0.7295 to 0.8672), achieving state-of-the-art.
[46] Huang, Y., Lv, T., Cui, L., Lu, Y., Wei, F. LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking. ACM MM 2022. arXiv:2204.08387. Peer-reviewed paper.
LayoutLMv3 is the first document-AI multimodal model free of CNN dependence, pre-training text and image with unified masking (MLM plus MIM) and learning cross-modal alignment via Word-Patch Alignment. It achieves F1=92.08 on FUNSD and mAP 95.1 on PubLayNet layout analysis, reaching state of the art on both text-centric and image-centric tasks.
[47] Kim, G., Hong, T., Yim, M., et al. OCR-free Document Understanding Transformer (Donut). ECCV 2022. arXiv:2111.15664. Peer-reviewed paper.
Donut is an end-to-end Transformer (Swin Transformer encoder plus BART decoder) that generates structured output directly from raw images without OCR. It surpasses OCR-dependent methods in accuracy, speed, and memory to reach state of the art, offering a strong approach for unstructured data on which OCR is unreliable.
[48] Appalaraju, S., Jasani, B., Urala Kota, B., Xie, Y., Manmatha, R. DocFormer: End-to-End Transformer for Document Understanding. ICCV 2021. arXiv:2106.11539. Peer-reviewed paper.
DocFormer is an encoder-only Transformer with a novel multimodal self-attention layer that shares spatial embeddings across modalities, fusing the visual, textual, and spatial modalities for document understanding without relying on object detectors or custom OCR. It achieves state of the art on four datasets, on some tasks surpassing models roughly four times its size.
[49] Blecher, L., Cucurull, G., Scialom, T., Stojnic, R. Nougat: Neural Optical Understanding for Academic Documents. Preprint, 2023. arXiv:2308.13418. Preprint (not peer-reviewed).
Nougat is a Visual Transformer (Swin Transformer encoder plus mBART decoder) that converts document page images into lightweight markup (Markdown). Trained as a 350M-parameter model on data built from arXiv papers, it demonstrates the conversion of scientific documents containing equations into structured text.
[50] Li, M., Cui, L., Huang, S., Wei, F., Zhou, M., Li, Z. TableBank: Table Benchmark for Image-based Table Detection and Recognition. LREC 2020. arXiv:1903.01949. Peer-reviewed paper.
TableBank is a large-scale table dataset built by automatically applying markup as weak supervision to the source code of Word and LaTeX documents. It contains 417,234 labeled tables; models trained on it reach F1=0.9625 on ICDAR2013, and the results show that cross-domain training aids generalization in table detection and recognition.
[51] Zhong, X., ShafieiBavani, E., Jimeno Yepes, A. Image-based Table Recognition: Data, Model, and Evaluation (PubTabNet / EDD). ECCV 2020. arXiv:1911.10683. Peer-reviewed paper.
This work releases PubTabNet, one of the largest public table-recognition datasets (about 568k images with HTML representations) derived from PubMed Central, and proposes the Encoder-Dual-Decoder (EDD), comprising a structure decoder and a cell decoder, along with TEDS, a tree-edit-distance-based metric. EDD surpasses the prior state of the art by an absolute 9.7% in TEDS.
[52] Nassar, A., Livathinos, N., Lysak, M., Staar, P. TableFormer: Table Structure Understanding with Transformers. CVPR 2022. arXiv:2203.01017. Peer-reviewed paper.
TableFormer is a Transformer-based model that jointly predicts table structure and cell bounding boxes end to end, avoiding the training of a custom OCR and remaining language-independent so as to handle non-English tables. It improves TEDS to 98.5% for simple tables and 95% for complex tables, surpassing the prior state of the art.
[53] Smock, B., Pesala, R., Abraham, R. PubTables-1M: Towards Comprehensive Table Extraction From Unstructured Documents. CVPR 2022. arXiv:2110.00061. Peer-reviewed paper.
PubTables-1M is a large-scale dataset of about one million tables (948K) built from PMCOA, covering the three tasks of table detection, structure recognition, and functional analysis and annotating all rows, columns, and cells including empty ones. It introduces a canonicalization procedure to correct oversegmentation and first applies DETR to the three tasks, demonstrating its effectiveness.
[54] Masry, A., Do, X. L., Tan, J. Q., Joty, S., Hoque, E. ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning. Findings of ACL 2022. arXiv:2203.10244. Peer-reviewed paper.
ChartQA is a large-scale benchmark for question answering over bar, line, and pie charts that addresses compositional visual and logical reasoning, comprising 32,719 questions over 20,882 charts collected from four real-world sources. It proposes a Transformer-based QA model that integrates visual features with structured data tables extracted from the charts.
[55] Liu, F., Eisenschlos, J. M., Piccinno, F., et al. DePlot: One-shot Visual Language Reasoning by Plot-to-Table Translation. Findings of ACL 2023. arXiv:2212.10505. Peer-reviewed paper.
DePlot decomposes visual-language reasoning into plot-to-table translation and LLM reasoning over the translated table, proposing a module that converts plots into linearized tables and plugs into existing LLMs. In a one-shot setting it achieves a 29.4% improvement on human-written questions over the then state of the art fine-tuned on thousands of examples.
[56] Liu, F., Piccinno, F., Krichene, S., et al. MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering. ACL 2023. arXiv:2212.09662. Peer-reviewed paper.
MatCha strengthens visual-language pre-training by introducing two families of pre-training tasks, chart derendering and mathematical reasoning, starting from Pix2Struct. On ChartQA and PlotQA it surpasses even state-of-the-art models assuming access to the underlying data table, exceeding the prior state of the art without data tables by up to roughly 20%.
[57] Smith, R. An Overview of the Tesseract OCR Engine. ICDAR 2007, pp.629–633. DOI:10.1109/ICDAR.2007.4376991. Peer-reviewed paper.
Tesseract is an OCR engine built as a staged pipeline that begins with connected-component analysis and proceeds through line finding, baseline fitting, character segmentation, and two-pass recognition with an adaptive classifier. On the UNLV Fourth Annual OCR Accuracy Test its new version improves the overall character error rate by 7.31% and the word error rate by 5.39% over the prior version. It serves as a foundational technology for extracting text from unstructured documents.
[58] OpenAI. GPT-4V(ision) System Card. Vendor technical report, 2023. Link (Not peer-reviewed.)
The GPT-4V System Card evaluates and mitigates multimodal-specific safety concerns and limitations, such as hallucination, person identification, and jailbreaks, ahead of the deployment of image-capable GPT-4V. In internal evaluation it refuses over 98% of person-identification requests and, combined with a refusal system, reaches a 100% refusal rate for image jailbreaks. It also reports limitations: the model can read specialized documents but sometimes erroneously merges nearby text.
[59] Liu, H., Li, C., Wu, Q., Lee, Y. J. Visual Instruction Tuning (LLaVA). NeurIPS 2023. arXiv:2304.08485. Peer-reviewed paper.
LLaVA is a multimodal LLM trained end to end by connecting CLIP’s visual encoder and Vicuna via a linear projection, using multimodal instruction-following data (158K instances total) generated with the language-only GPT-4. It attains a relative score of 85.1% against GPT-4 on synthetic data and, combined with GPT-4 on Science QA, achieves a new state-of-the-art accuracy of 92.53%.
[60] Bai, J., Bai, S., Yang, S., et al. Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. Preprint, 2023. arXiv:2308.12966. Preprint (not peer-reviewed).
Qwen-VL is a 9.6B-parameter vision-language model built on Qwen-7B that introduces a ViT visual encoder and a position-aware cross-attention adapter compressing image features into a fixed length of 256, trained through a three-stage pipeline on a multilingual multimodal corpus. Equipped with grounding and text-reading abilities, it sets a new general-purpose VL state of the art at its scale across a wide range of benchmarks such as ChartQA and TextVQA.
[61] Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., et al. Learning Transferable Visual Models From Natural Language Supervision (CLIP). ICML 2021. arXiv:2103.00020. Peer-reviewed paper.
CLIP jointly trains an image encoder and a text encoder with a contrastive objective on 400 million image-text pairs collected from the internet, learning to predict which caption goes with which image. It transfers zero-shot to more than 30 existing vision datasets and matches the accuracy of the original ResNet-50 on ImageNet without using any of its 1.28 million training examples.
[62] Faysse, M., Sibille, H., Wu, T., et al. ColPali: Efficient Document Retrieval with Vision Language Models. ICLR 2025. arXiv:2407.01449. Peer-reviewed paper.
ColPali is a document retrieval model that embeds page images directly with a vision-language model (PaliGemma) and matches queries against the resulting multi-vector embeddings via ColBERT-style late interaction, avoiding OCR and layout-analysis preprocessing. The paper also introduces the ViDoRe benchmark, on which ColPali outperforms existing document retrieval pipelines while being faster and end-to-end trainable.
[63] ISO. ISO 12234-4:2026 — Digital imaging — Image storage — Part 4: Digital negative format. International standard, March 2026. Link (Paid standard. Referenced within the scope of catalog information.)
The international standard ISO 12234-4:2026, published in March 2026, specifies Adobe’s DNG (Digital Negative) format as the standard file format for creating, processing, managing, and archiving RAW images. It emphasizes that the RAW format retains unprocessed sensor data and enables post-capture adjustment, and in this essay it serves as the primary source for treating RAW images (DNG) as a representative example of unstructured data.
[64] Wikipedia contributors. Raw image format. Tertiary source. Link (Non-peer-reviewed tertiary source. Used only for background explanation.)
A tertiary reference outlining the RAW image format, explaining that a RAW file stores unprocessed mosaic sensor data captured through a color filter array (typically a Bayer filter) at high bit depth (usually 12 or 14 bits), intentionally deferring processing to later stages to maximize the flexibility of adjustments such as white balance and tone mapping. In this essay it is used as a supplementary source for background on sensor formats.
[65] Ignatov, A., Van Gool, L., Timofte, R. Replacing Mobile Camera ISP with a Single Deep Learning Model (PyNET). CVPRW 2020. arXiv:2002.05509. Peer-reviewed paper.
PyNET is an inverted-pyramid CNN that directly converts RAW Bayer data into RGB images without prior knowledge of the sensor or optics, jointly learning ISP stages such as demosaicing, color correction, and denoising in a single end-to-end model. On the Zurich RAW to RGB dataset it achieves PSNR 21.19 and MS-SSIM 0.8620, and in a user study it scored a MOS of 2.77, surpassing the Huawei P20’s built-in ISP (MOS 2.56).
[66] Chen, C., Chen, Q., Xu, J., Koltun, V. Learning to See in the Dark. CVPR 2018. arXiv:1805.01934. Peer-reviewed paper.
Learning to See in the Dark replaces the entire conventional ISP pipeline with an end-to-end fully convolutional network (U-Net) that takes extremely low-light short-exposure RAW images directly as input, with the amplification ratio supplied as an external parameter. It introduces the SID dataset of 5,094 RAW images captured with Sony and Fujifilm cameras, achieving PSNR 28.88 and SSIM 0.787 on the Sony set and winning 92.4% preference over BM3D in perceptual evaluation.
[67] Unstructured Technologies. Unstructured (open-source document ETL library). OSS official documentation. Link (Not peer-reviewed.)
Unstructured is an open-source document-preprocessing toolkit that ingests more than 25 document formats—including PDF, HTML, Word, and images—and formats and structures them for LLMs. It provides partitioning of documents into structured elements, cleaning, information extraction, and semantic chunking, and in this essay it is cited as a concrete example of the practical workflow for preparing unstructured data for RAG.
[68] Amazon Web Services. Amazon Textract. Vendor official documentation. Link (Not peer-reviewed.)
Amazon Textract is an AWS managed service that extracts text, tables, and forms from images and PDFs without requiring machine-learning expertise. In addition to OCR detection of printed and handwritten text, it offers table and form extraction, targeted extraction via Queries, AnalyzeExpense for invoices and receipts, and AnalyzeID for identity documents, and in this essay it is cited as a practical example of a service for structuring unstructured documents.
[69] Amazon Web Services. Amazon Bedrock Knowledge Bases. Vendor official documentation. Link (Not peer-reviewed.)
Amazon Bedrock Knowledge Bases is an AWS managed RAG platform that integrates proprietary data into generative-AI applications to improve the relevance and accuracy of responses via RAG. It offers a Managed type that handles ingestion, embedding, indexing, and retrieval, as well as a Self-managed type, and supports automatic per-document-type parsing, multi-hop reasoning, cited responses, and reranking. In this essay it is cited as an example of integrating ingestion through retrieval and generation.
[70] Microsoft. Azure AI Document Intelligence. Vendor official documentation. Link (Not peer-reviewed.)
Azure AI Document Intelligence (formerly Form Recognizer) is a Microsoft Azure cloud service that performs machine-learning-based OCR and intelligent document processing to automate the extraction of key data from forms and documents. It provides a Read model for printed and handwritten text, a Layout model for text, tables, and structure, numerous Prebuilt models for invoices and similar documents, and a custom-trained Custom model, and in this essay it is cited as a cloud document-processing counterpart to Textract.
[71] Microsoft. Azure AI Search — Vector search overview. Vendor official documentation. Link (Not peer-reviewed.)
Vector search in Azure AI Search is an information-retrieval approach that supports indexing and querying over embedding vectors of content, enabling semantic-similarity matching across multiple languages and content types. In addition to similarity search, it provides hybrid search that combines vector and keyword search in a single request, as well as multimodal search, and in this essay it is cited as an example of a vector-search platform serving as the retrieval layer of RAG.
[72] Microsoft. Azure AI Search — Integrated vectorization. Vendor official documentation. Link (Not peer-reviewed.)
Integrated vectorization in Azure AI Search extends the indexing and query pipeline to perform vectorization automatically at both ingestion and query time. During indexer-driven ingestion it executes chunking via the Text Split skill and embedding generation via the AzureOpenAIEmbedding skill in one pass, and at query time it automatically converts the search string into a vector. In this essay it is cited as an example of reducing effort through automatic vectorization at ingestion time.
[73] Google. Google Cloud Document AI. Vendor official documentation. Link (Not peer-reviewed.)
Google Cloud Document AI is a Google Cloud document-processing platform that converts unstructured data within documents into structured data suitable for databases. It provides processors for Digitize (OCR with quality assessment and deskewing), Extract (value extraction and normalization for forms and tables via Form Parser, Layout Parser, etc.), and Classify (classification and splitting of document types), and in this essay it is cited as a comparison counterpart alongside Textract and Document Intelligence.
[74] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y. ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023. arXiv:2210.03629. Peer-reviewed paper.
ReAct is a paradigm that interleaves reasoning traces with task-specific actions, using reasoning to guide and update action plans while using actions to interact with external sources such as Wikipedia to acquire knowledge. It suppresses hallucination on HotpotQA and FEVER, and on the interactive decision-making benchmarks ALFWorld and WebShop it surpasses prior methods by 34% and 10% in absolute success rate using only one or two in-context examples.
[75] Schick, T., Dwivedi-Yu, J., Dessì, R., et al. Toolformer: Language Models Can Teach Themselves to Use Tools. NeurIPS 2023. arXiv:2302.04761. Peer-reviewed paper.
Toolformer is an LLM that learns, in a self-supervised manner, which APIs to call, when, with what arguments, and how to incorporate the results, by sampling candidate API calls and retaining only those that actually reduce the next-token prediction loss for fine-tuning. Based on the 6.7B GPT-J, it substantially improves zero-shot performance, often rivaling GPT-3 (175B), and improves by up to 18.6 points on LAMA.
[76] Wang, L., Ma, C., Feng, X., et al. A Survey on Large Language Model based Autonomous Agents. Frontiers of Computer Science, 2023. arXiv:2308.11432. Peer-reviewed paper (survey).
A comprehensive survey of LLM-based autonomous agents organized along three axes—construction, application, and evaluation—proposing a unified framework comprising four modules: profile, memory, planning, and action. It serves as a reference for surveying the constituent components and application scope of agents.
[77] Anthropic. Tool use with Claude — Overview. Vendor official documentation, 2024–2025. Link (Not peer-reviewed.)
Anthropic’s official overview of tool use (function calling), describing how Claude decides whether to invoke a tool based on the request and the tool descriptions, and distinguishing client tools from server tools (such as web_search and code_execution). The agent loop is formed by Claude returning tool_use blocks and the client submitting tool_result messages in response.
[78] Anthropic. Introducing the Model Context Protocol (MCP). Vendor official documentation, November 2024. Link (Specification: Link) (Not peer-reviewed.)
Anthropic’s official overview of the open standard Model Context Protocol (MCP), which aims to securely connect AI assistants with data sources and replace fragmented bespoke integrations with a single protocol. It provides the protocol specification and SDKs along with prebuilt connectors for Google Drive, Slack, GitHub, PostgreSQL, and others.
[79] Anthropic. Agent Skills — Overview. Vendor official documentation, October 2025. Link (Not peer-reviewed.)
Anthropic’s official overview of the modular Agent Skills feature, which packages instructions, metadata, and optional scripts or templates that Claude invokes automatically in relevant situations. It employs progressive disclosure, loading from metadata to body instructions incrementally, and serves as a means of supplying domain knowledge to agents.
[80] Brehme, L., Dornauer, B., Ströhle, T., Ehrhart, M., Breu, R. Retrieval-Augmented Generation in Industry: An Interview Study on Use Cases, Requirements, Challenges, and Evaluation. KDIR 2025. arXiv:2508.14066. Peer-reviewed paper (conference).
A peer-reviewed interview study analyzing the practical adoption of RAG (use cases, requirements, challenges, and evaluation) from semi-structured interviews with 13 industry practitioners, finding that most implementations are QA tasks and that 12 of 13 remain at the prototype stage. Privacy/data protection was rated most important among requirements (8.9), while evaluation was almost entirely manual, with only two cases automated.
[81] Chen, L.-C., Pardeshi, M. S., Liao, Y.-X., Pai, K.-C. Application of retrieval-augmented generation for interactive industrial knowledge management via a large language model. Computer Standards & Interfaces, Vol.94, 103995, 2025. DOI:10.1016/j.csi.2025.103995. Peer-reviewed paper (journal).
A case study designing and implementing a custom RAG system for interactive knowledge management over industry-specific unstructured documents, retrieving top-k chunks using both BM25 and embeddings, reranking with a BAAI reranker, and generating with GPT-3.5 Turbo. On internal regulation documents it achieved 91.62% recall, 97.97% MRR, and 91.12% mAP.
[82] Heredia Álvaro, J. A., González Barreda, J. An advanced retrieval-augmented generation system for manufacturing quality control. Advanced Engineering Informatics, Vol.64, 103007, 2025 (online publication and DOI dated 2024). DOI:10.1016/j.aei.2024.103007. Peer-reviewed paper (journal).
Targeting quality control in ceramic tile manufacturing, this work builds an advanced RAG pipeline of preprocessing, indexing, retrieval, post-retrieval, and generation. It retrieves with a bi-encoder, reranks with a cross-encoder, and generates with gpt-3.5-turbo-instruct, using a defect catalog and academic papers as knowledge sources. It achieved an 85.81% F1 in retrieval and a mean ROUGE-L of 0.61 in generation, outperforming general-purpose GPT-4.
[83] Zhang, T., Patil, S. G., Jain, N., Shen, S., Zaharia, M., Stoica, I., Gonzalez, J. E. RAFT: Adapting Language Model to Domain Specific RAG. COLM 2024. arXiv:2403.10131. Peer-reviewed paper.
Proposes Retrieval Augmented Fine-Tuning (RAFT), a hybrid of RAG and fine-tuning that presents both helpful documents and irrelevant distractors and fine-tunes the model to ignore distractors while quoting relevant documents verbatim and reasoning in chain-of-thought form. It consistently improved over domain-specific fine-tuning and RAG alone on PubMed, HotpotQA, and Gorilla.
[84] Chen, J., Lin, H., Han, X., Sun, L. Benchmarking Large Language Models in Retrieval-Augmented Generation (RGB). AAAI 2024, Vol.38 No.16, pp.17754–17762. arXiv:2309.01431. Peer-reviewed paper.
Constructs RGB, an English–Chinese benchmark measuring four fundamental abilities required for RAG—noise robustness, negative rejection, information integration, and counterfactual robustness—and evaluates six LLMs. At a noise ratio of 0.8 ChatGPT’s accuracy fell from 96.33% to 76.00%, and negative rejection rates reached at most 45% in English, quantifying capability bottlenecks under RAG.
[85] Es, S., James, J., Espinosa-Anke, L., Schockaert, S. Ragas: Automated Evaluation of Retrieval Augmented Generation. EACL 2024 (System Demonstrations). arXiv:2309.15217. Peer-reviewed paper (demo paper).
Proposes Ragas, a reference-free automated evaluation framework for RAG that measures three axes—faithfulness, answer relevance, and context relevance—via LLM prompting. On WikiEval its agreement with human judgments reached 0.95 for faithfulness, 0.78 for answer relevance, and 0.70 for context relevance, surpassing GPT Score and GPT Ranking.
[86] Zeng, S., Zhang, J., He, P., et al. The Good and The Bad: Exploring Privacy Issues in Retrieval-Augmented Generation (RAG). Findings of ACL 2024. arXiv:2402.16893. Peer-reviewed paper.
A study examining leakage risks from both the retrieval database and the training data of RAG, demonstrating extraction of confidential retrieval data via structured prompt attacks composed of {information} plus {command}. Llama2-7b-Chat and GPT-3.5-turbo could output retrieval data verbatim with success rates near 50%, and summarization reduced untargeted-attack risk by about 50% but was limited against targeted attacks.
[87] Panasonic Connect. Results of one year of generative AI deployment and future utilization plans. Official press release, June 25, 2024. Link (Official corporate announcement. Not peer-reviewed.)
Panasonic Connect’s report on one year of deploying its in-house AI assistant ‘ConnectAI,’ built on an OpenAI LLM and rolled out to about 12,400 employees, which from April 2024 references 630 confidential quality-control documents (11,743 pages) via RAG. Over one year it reduced employee working hours by 186,000 hours, with access counts reaching about 1.4 million over twelve months.
[88] Toyota Motor Corporation, Advanced R&D and Engineering Company; Amazon Web Services. Toyota Motor Corporation — secure RAG environment on AWS. Vendor official customer story, 2024. Link (Official corporate and vendor announcement. Not peer-reviewed.)
An AWS case study in which Toyota’s Advanced R&D and Engineering Company consolidated departmentally fragmented RAG systems into a company-wide secure RAG infrastructure featuring hybrid search and query expansion on Amazon OpenSearch Service, plus file-level access control via internal authentication integration. It reduced research time by about 20% and the effort for information lookups by about 50%.
[89] Panasonic Connect. Development of a new technology in which an observation-driven AI agent references a knowledge graph for generative AI RAG to answer. Official press release, October 3, 2024 (results accepted at ACL 2024). Link (Official corporate announcement. Not peer-reviewed.)
Panasonic Connect’s announcement of an observation-driven AI agent technology that uses a knowledge graph rather than text as the reference source for generative-AI RAG, iterating three stages (observation, action, and reflection) to filter information and improve answer accuracy. The technology was accepted at ACL 2024 and achieved the best results on real-time performance.
[90] Fujitsu. Automatically generating specialized generative AI that meets enterprise needs with world-first technology! Providing an enterprise generative AI framework (Fujitsu Kozuchi). Official press release, June 4, 2024. Link (Official corporate announcement. Not peer-reviewed.)
Fujitsu’s announcement of an enterprise generative-AI framework comprising ‘knowledge-graph-enhanced RAG’ that structures enterprise data into a knowledge graph to expand the data an LLM can reference to over 10 million tokens, a ‘generative-AI mixing technology’ that automatically selects optimal models, and a world-first ‘generative-AI auditing technology’ that verifies compliance of answers with rules and laws. It achieved world-best accuracy on multi-hop QA (HotpotQA).