Definition
The process of extracting structured information from unstructured or semi-structured sources, such as PDFs, web pages, or spreadsheets.
Why it matters (in Poovi’s context)
Essential for converting raw data into a format that LLMs and RAG systems can use, especially when dealing with diverse and complex file types.
Key properties or components
- Text Extraction
- Table Extraction
- Image/Diagram Handling
- Format Conversion (e.g., PDF to Markdown)
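As a rough illustration of how table extraction and format conversion fit together, the sketch below pulls the rows out of an HTML table and re-emits them as a Markdown table, using only Python's standard-library `html.parser`. The names `TableExtractor` and `html_table_to_markdown` are hypothetical, chosen for this example. It assumes a single, simple table with the header in the first row and no nested tables or merged cells; real-world PDFs and web pages usually need a dedicated parsing library.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Hypothetical helper: collects the cell text of every <tr> it sees."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row being built, or None when outside a <tr>
        self._cell = None  # text pieces of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace so each cell is a single clean string.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def html_table_to_markdown(html):
    """Convert a simple HTML table to Markdown (assumes the first row is the header)."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return ""
    header, *body = parser.rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


if __name__ == "__main__":
    sample = (
        "<table>"
        "<tr><th>Name</th><th>Qty</th></tr>"
        "<tr><td>Apple</td><td> 3 </td></tr>"
        "</table>"
    )
    print(html_table_to_markdown(sample))
```

Markdown is a convenient target here because it keeps the row and column structure in plain text that can be passed directly into an LLM prompt or a RAG indexing step.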
Contradictions or debates
None.