

GRNET announces, in the context of SmartAttica EDIH (European Digital Innovation Hub), the 13th module in its series of training modules for SMEs, on the subject "Retrieval-Augmented Generation (RAG) with Docling Document Parsing".
Date: June 24th, 2026
Location: Online via Zoom
Presentation Languages: Greek, English
Instructors: Nikos Bakas (GRNET), Roman Dolgopolyi (GRNET)
Duration: 3 hours
Description: This is a hands-on introduction to turning real-world documents into knowledge a language model can use. Real documents are not plain text: they contain headings, tables, figures, and captions, and they follow a reading order. Participants learn how Docling converts a PDF into structured content using layout-aware vision models, how to inspect and chunk that content intelligently, and how to assemble a complete Retrieval-Augmented Generation (RAG) pipeline: chunk, embed, retrieve, and generate a grounded, citable answer.
Target Audience: This module is designed for SME developers, technical leads, and data scientists who want to build question-answering systems over their own documents. It is ideal for those looking to ground language models in trusted internal knowledge sources.
Learning Objectives:
By the end of this module, participants will be able to:
- Convert PDFs into structured documents with Docling and inspect the resulting document tree.
- Extract and work with text blocks, tables, figures, and captions.
- Split documents into retrieval-ready chunks while preserving structure and context.
- Embed chunks and retrieve the most relevant ones using semantic similarity.
- Assemble a RAG prompt and generate an answer that cites its sources.
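To give a flavour of the third objective, the following is a minimal sketch of structure-aware chunking, not the module's actual material. It does not call Docling: it assumes the document has already been exported to Markdown-like text with `#` headings, and the names `Chunk`, `chunk_by_heading`, and `max_chars`, as well as the sample text, are hypothetical and chosen for illustration only.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Chunk:
    heading: str  # nearest heading above the text, kept as context
    text: str     # body text of the chunk


def chunk_by_heading(markdown: str, max_chars: int = 200) -> list[Chunk]:
    """Split Markdown-like text into chunks that each carry their heading."""
    chunks: list[Chunk] = []
    heading = ""
    buffer: list[str] = []

    def flush() -> None:
        # Emit the buffered lines as one chunk, if there are any.
        if buffer:
            chunks.append(Chunk(heading, " ".join(buffer)))
            buffer.clear()

    for line in markdown.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            flush()  # a new section starts: close the previous chunk
            heading = line.lstrip("#").strip()
        else:
            # Start a new chunk when adding this line would exceed the budget.
            if buffer and len(" ".join(buffer)) + 1 + len(line) > max_chars:
                flush()
            buffer.append(line)
    flush()
    return chunks


# Made-up sample text, standing in for a converted document.
sample = """# Warranty
Devices are covered for 24 months.
Batteries are covered for 12 months.
# Returns
Items may be returned within 30 days."""

chunks = chunk_by_heading(sample, max_chars=60)
for c in chunks:
    print(f"[{c.heading}] {c.text}")
```

Because every chunk keeps the heading it was found under, a chunk that is split off in the middle of a section still tells the retriever, and later the language model, which section it came from.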
Prerequisites:
Participants should have:
- Basic understanding of Python programming.
- Familiarity with embeddings and semantic similarity.
- Interest in NLP and document-processing applications.
- Some experience with machine learning (helpful but not required).
Indicative Content:
- Why Document Structure Matters. The gap between raw PDF text and structured, machine-usable content.
- Converting PDFs with Docling. Layout detection, reading order, and the structured document tree.
- Inspecting the Document. Exploring texts, groups, figures, and captions; previewing as Markdown.
- Working with Tables. Recovering rows, columns, and headers and exporting tables to Markdown.
- Chunking for RAG. Structure-aware splitting that keeps headings and context attached to each chunk.
- Embedding the Chunks. Turning chunks into vectors with Sentence Transformers.
- Retrieval. Embedding the query and finding the most relevant chunks via cosine similarity.
- Building the Prompt. Assembling labelled context and instructing the model to cite section and page.
- Generating the Answer. Passing the retrieved context to an LLM for a grounded response.
- Summary and Q&A. Key takeaways and open discussion.
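The retrieval and prompt-building steps listed above can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the code used in the session: the `embed` function is a toy word-count vector standing in for a real embedding model such as Sentence Transformers, the chunk dictionaries with `section`, `page`, and `text` keys are a hypothetical layout, and the sample chunks are made up.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list, k: int = 2) -> list:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c["text"])), reverse=True)
    return ranked[:k]


def build_prompt(query: str, hits: list) -> str:
    """Assemble labelled context and ask the model to cite section and page."""
    context = "\n".join(
        f"[{i}] (section: {c['section']}, page: {c['page']}) {c['text']}"
        for i, c in enumerate(hits, start=1)
    )
    return (
        "Answer using only the context below. "
        "Cite the section and page of every source you use.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )


# Made-up chunks, as a chunking step might produce them.
chunks = [
    {"section": "Warranty", "page": 3, "text": "Devices are covered by a warranty of 24 months."},
    {"section": "Returns", "page": 5, "text": "Items may be returned within 30 days."},
    {"section": "Shipping", "page": 7, "text": "Orders ship within two business days."},
]

query = "How long is the warranty on devices?"
hits = retrieve(query, chunks, k=1)
prompt = build_prompt(query, hits)
print(prompt)
```

The resulting prompt string would then be passed to a language model of choice for the final generation step. Swapping the toy `embed` for a real embedding model changes the vectors but leaves the retrieve-then-prompt flow the same.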
The project is co-funded by the European Union. Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Commission. Neither the European Union nor the granting authority can be held responsible for them.