Unified Document Representation

Research topic/area
Document Analysis, Document Understanding, Large Language Model, Artificial Intelligence, Deep Learning, Structure Analysis, Computer Vision
Type of thesis
Master
Start date
-
Application deadline
31.05.2026
Duration of the thesis
-

Description

The figure shows a document page annotated with structural elements. This example demonstrates how real-world documents typically contain both textual content (e.g., paragraphs, headers, bullet points) and visual components (e.g., tables, figures, and graphics). These elements are interrelated and must be understood in the broader context of the document to capture its meaning fully.

Unified Document Representation aims to comprehensively extract and structurally represent all of this information, bridging the gap between textual and visual modalities. It achieves this by combining formatted OCR outputs, which preserve textual layout and content, with embedded image captions or alternative text, which provide essential context.
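
As an illustration, such a unified representation could be sketched as a small set of Python data classes. This is only a minimal sketch; all class and field names below are hypothetical and not part of an existing code base. It shows how formatted text blocks and captioned visual elements can live in one structure with a shared reading order.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple, Union

# Hypothetical sketch (not an existing code base): text blocks keep their
# layout role and position, visual elements keep a caption or alt text,
# and a single reading-order list ties both modalities together.

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in page coordinates

@dataclass
class TextBlock:
    role: str                      # e.g. "paragraph", "header", "bullet"
    text: str
    bbox: BBox

@dataclass
class VisualBlock:
    kind: str                      # e.g. "table", "figure", "graphic"
    bbox: BBox
    caption: Optional[str] = None  # embedded caption or alternative text

@dataclass
class UnifiedPage:
    page_no: int
    blocks: List[Union[TextBlock, VisualBlock]] = field(default_factory=list)

    def to_text(self) -> str:
        """Serialize the page in reading order; visuals fall back to their caption."""
        lines = []
        for b in self.blocks:
            if isinstance(b, TextBlock):
                prefix = {"header": "# ", "bullet": "- "}.get(b.role, "")
                lines.append(prefix + b.text)
            else:
                lines.append(f"[{b.kind}] {b.caption or ''}".strip())
        return "\n".join(lines)
```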

The research will investigate whether an end-to-end approach that integrates these heterogeneous data sources into a single pipeline can improve document understanding. The ultimate goal is to process documents as unified, structured entities rather than handling text and visuals as separate components.
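
One possible shape for such a pipeline, reusing the data classes sketched above, is shown below. This is a sketch under the assumption that the OCR/layout model and the visual-element detector/captioner are injected as callables; no specific model or library API is implied.

```python
from typing import Callable, Iterable

# Hypothetical pipeline sketch: both model calls are passed in as functions,
# so the structure stays independent of any particular OCR or captioning model.

def build_unified_page(
    page_image,
    page_no: int,
    layout_ocr: Callable[[object], Iterable[TextBlock]],    # OCR + layout model
    find_visuals: Callable[[object], Iterable[VisualBlock]],  # detector + captioner
) -> UnifiedPage:
    page = UnifiedPage(page_no=page_no)

    # 1. Formatted OCR: text content plus layout roles and bounding boxes.
    page.blocks.extend(layout_ocr(page_image))

    # 2. Visual elements with captions / alternative text for context.
    page.blocks.extend(find_visuals(page_image))

    # 3. Merge both modalities into one reading order
    #    (here simply top-to-bottom, left-to-right by bounding box).
    page.blocks.sort(key=lambda b: (b.bbox[1], b.bbox[0]))
    return page
```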

What you do:
● Survey existing work on structured OCR and document representation techniques.
● Implement state-of-the-art end-to-end methods for document information extraction and evaluate their performance (a minimal evaluation sketch follows this list).
● (Optional) Enhance the structured representation by integrating multimodal document understanding methods (combining text and image features).
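
For the evaluation item above, a common starting point is to compare the predicted serialization of a document with a ground-truth reference as plain text. Below is a minimal sketch, assuming character-level normalized edit distance as the metric (a measure commonly reported for end-to-end document conversion); the final metric set for the thesis is left open.

```python
# Minimal evaluation sketch: character-level normalized edit distance
# between a predicted document serialization and a reference,
# using only the Python standard library.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution / match
        prev = curr
    return prev[-1]

def normalized_edit_distance(pred: str, ref: str) -> float:
    """0.0 = exact match, 1.0 = completely different."""
    if not pred and not ref:
        return 0.0
    return levenshtein(pred, ref) / max(len(pred), len(ref))

if __name__ == "__main__":
    pred = "# Introduction\nDocuments mix text and figures."
    ref = "# Introduction\nDocuments mix text, tables and figures."
    print(f"normalized edit distance: {normalized_edit_distance(pred, ref):.3f}")
```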

What we offer:
● Getting started quickly with our open-source code
● Compute resources for model training and deployment
● Experienced guidance and open discussions with other team members
● Support in publishing your work at top conferences (including attending conferences in person)

Related Work:
1. General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model
2. SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion
3. Éclair -- Extracting Content and Layout with Integrated Reading Order for Documents

Further Information:
We offer further topics in areas such as Computer Vision, Large Language Models (LLMs), Generative Models, Retrieval-Augmented Generation (RAG), and Document Analysis and Understanding. Please feel free to contact me (yufan.chen@kit.edu) with your CV and transcript of records.

Prerequisites

Requirements for students
  • Interest in computer vision and in doing task-oriented research
  • Python programming skills and knowledge of PyTorch/TensorFlow are desirable

Degree program areas
  • Engineering Sciences
    Electrical Engineering & Information Technology
    Geodesy & Geoinformatics
    Computer Science
    Mechatronics & Information Technology
  • Other fields of study
    Remote Sensing and Geoinformatics
    Information System Engineering and Management


Supervision

Title, first name, last name
M.Sc., Yufan, Chen
Organizational unit
Computer Vision for Human-Computer Interaction Lab, Institute for Anthropomatics and Robotics (IAR)
Email address
yufan.chen@kit.edu
Link to personal homepage/profile page
Website

Application by email

Application documents
  • Curriculum vitae (CV)
  • Transcript of records

Email address for applications
Please send the application documents listed above by email to yufan.chen@kit.edu

