Pierre-Carl Langlais

Pclanglais

AI & ML interests

Open data & open LLMs

Pclanglais's activity

posted an update about 2 months ago
We release today our first foundation model and experiment with a new category: specialized pre-training.

OCRonos-Vintage is a 124M-parameter model trained end-to-end by PleIAs with llm.c on 18 billion tokens from cultural heritage archives. Despite its small size, it achieves near state-of-the-art results for OCR correction of historical English sources. OCRonos-Vintage is also a historical model with an unusual cut-off date: December 29th, 1955…

We look forward to replicating this approach very soon on other "hard" tasks commonly associated with generalist LLMs/SLMs: RAG, function calling, summarization, document segmentation…

OCRonos-Vintage: PleIAs/OCRonos-Vintage
CPU Demo: PleIAs/OCRonos-Vintage-CPU
GPU Demo: PleIAs/OCRonos-Vintage-GPU
Our announcement and call for specialized pre-training: https://huggingface.co/blog/Pclanglais/specialized-pre-training
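
For readers who want to try it, here is a minimal sketch of running OCRonos-Vintage through transformers as a standard causal language model. The prompt markers ("### Text ###" / "### Correction ###") and generation settings are assumptions on my part rather than a documented contract: check the model card for the exact input format.

```python
# Minimal sketch: OCRonos-Vintage loaded as a GPT-2-style causal LM with transformers.
# The prompt markers below are an assumption; see the model card for the real format.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "PleIAs/OCRonos-Vintage"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

noisy_ocr = "Tlie city counciI met on tbe 3rd of Marcb, 1882, to dis cuss the new raiI-way."
prompt = f"### Text ###\n{noisy_ocr}\n### Correction ###\n"

inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```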
posted an update 2 months ago
Since it is release season, at PleIAs we announce our first suite of specialized language models for document processing tasks (OCR correction, text segmentation, bibliographic extraction) and the release of Finance Commons, the largest multimodal dataset of financial documents: https://huggingface.co/blog/Pclanglais/finance-commons-bad-data-toolbox

LLM research is currently focused on quality data. We went in the opposite direction and deliberately trained models on bad data. Far from degrading the models, this made them more resilient to the text sources commonly used in production.

Having a wider range of real-life data proved critical for this project. A few months after the release of Common Corpus, we expanded our pool of "training data commons" with a major multimodal resource: documents released as open financial data. Finance Commons comprises 17 billion tokens and 1.25 million corporate PDF documents released by the SEC, the WTO, the AMF, and EU Tenders, in multiple languages, with a large variety of document layouts and challenging sources to train more robust models.
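
As a sketch of how the collection could be consumed, the snippet below streams a Finance Commons dataset with the datasets library. The repository id and column names are hypothetical placeholders (the actual datasets are listed in the Finance Commons collection on the hub), so adjust them before running.

```python
# Hedged sketch: stream a Finance Commons dataset without downloading it in full.
# "PleIAs/Finance-Commons" is a placeholder repo id and the columns are assumptions;
# look up the actual dataset names in the Finance Commons collection first.
from datasets import load_dataset

docs = load_dataset("PleIAs/Finance-Commons", split="train", streaming=True)

for i, doc in enumerate(docs):
    print(sorted(doc.keys()))  # inspect available fields (text, source institution, language, ...)
    if i == 2:
        break
```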

With Hugging Face compute support, we release an entire pipeline to process bad data sources and make them usable in production for LLMOps or simply for retrieval: PleIAs/PleIAs-Editor

This approach is based on our new series of specialized models for document processing, the "bad data toolbox", comprising:
* OCRonos, the best available model to date for OCR correction. PleIAs/OCRonos
* Segmentext, a purely semantic small model for text segmentation, working without any visual reference. PleIAs/Segmentext
* BibTexer, a small model for bibliographic data extraction acting as a "reversed Zotero." PleIAs/BibTexer
posted an update 5 months ago
Announcing that we are on our way to solving a long-standing issue of document processing: correction of OCR mistakes. PleIAs publishes the largest dataset to date of automated OCR correction: 1 billion words in English, French, German, and Italian.

OCR quality is a long-standing issue of digitization. Cultural heritage texts are especially affected, since the primary sources are old documents (with many artifacts, blots, and degradation) and OCR technology remains limited for historical scripts. When we released Common Corpus, a 500-billion-word corpus in the public domain, this was the primary criticism.

This recent breakthrough in post-OCR correction has been made possible by progress in open LLM research, several months of dedicated training and alignment by PleIAs, and the HPC resources of GENCI–IDRIS (Grant 2023-AD011014736) on Jean Zay.

Announcement: https://huggingface.co/blog/Pclanglais/post-ocr-correction

Post-OCR-Correction dataset: PleIAs/Post-OCR-Correction
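
A minimal way to inspect the dataset is to stream it with the datasets library, as sketched below; the split and field names are assumptions to verify against the dataset card (the corpus may be organized by language configuration).

```python
# Minimal sketch: stream a couple of rows of the Post-OCR-Correction dataset.
# Split/configuration and field names are assumptions; the dataset may expose one
# configuration per language (English, French, German, Italian).
from datasets import load_dataset

ds = load_dataset("PleIAs/Post-OCR-Correction", split="train", streaming=True)

for i, row in enumerate(ds):
    print({k: str(v)[:80] for k, v in row.items()})  # peek at truncated field values
    if i == 1:
        break
```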
posted an update 6 months ago
Announcing today the release of Common Corpus, the largest collection of fully open corpora on Hugging Face: nearly 500 billion words (600-700 billion tokens) in the public domain.

PleIAs/common-corpus-65d46e3ea3980fdcd66a5613

Common Corpus is an international initiative coordinated by @pleias_fr with the support of the state start-up LANGU:IA (start-up d'État), backed by the French Ministry of Culture and DINUM, and with the involvement of the open-science LLM community (Occiglot, EleutherAI) and cultural heritage researchers.

We aim to create, at the pretraining stage, the same kind of ecosystem that now exists for fine-tuning, by building a strong commons without copyright issues or "trade secret" gatekeeping. Contrary to what many AI companies claim, Common Corpus shows it is possible to train large language models on a fully open corpus. Due to the complexity of copyright checks, we have only released part of the text we hold and will release much more in the coming months.

Common Corpus is multilingual. It includes the largest open collections to date in French (110 billion words), German (30 billion words), Spanish (23 billion words), Dutch (18 billion words), and Italian (10 billion words), as well as a very long tail of mid- to low-resource languages.

Our conviction is that open corpora make future models more inclusive, democratic, and respectful of cultural diversity, as well as higher quality. Common Corpus holds many long, edited texts in book form, with reasoning-rich content that has never before been used for LLM pretraining.

Common Corpus is an ongoing effort and still needs to be enhanced and completed. Sharing is caring: Common Corpus still needs more care to become a commons like Wikipedia or Wikisource.

https://huggingface.co/blog/Pclanglais/common-corpus
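
Since Common Corpus is published as a Hugging Face collection, one way to enumerate its datasets programmatically is through huggingface_hub, as in the sketch below (the collection slug is the one linked above).

```python
# Minimal sketch: list the dataset repositories gathered in the Common Corpus collection.
from huggingface_hub import get_collection

collection = get_collection("PleIAs/common-corpus-65d46e3ea3980fdcd66a5613")
for item in collection.items:
    if item.item_type == "dataset":
        print(item.item_id)
```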
posted an update 7 months ago
Today I'm releasing marginalia, a Python library to perform corpus analysis and retrieve structured annotations with open LLMs like Mistral Open-Hermes-2.5: https://github.com/Pleias/marginalia

marginalia leverages vLLM's inference speed to regenerate outputs until they all match an expected JSON structure, and to send batches of several unstructured elements for enhanced pattern detection. It works especially well for bibliographies. The demo transforms a very old list (Benjamin Franklin's favorite books from 1744) into well-structured data: https://colab.research.google.com/drive/1xKjK2mDDpXMaKG5YLpFhOM7jehxt0kEt?usp=sharing

While marginalia can be quite flexible, it definitely isn't a general-purpose tool for JSON generation (like outlines). I don't intend, for now, to extend support to more complex JSON structures, but I'm really looking forward to potential feedback and suggestions.
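
To make the mechanism concrete, here is a sketch of the underlying "regenerate until the output parses as the expected JSON" idea, written against vLLM's standard interface. This is not marginalia's actual API; the model id, retry count, and key check are illustrative assumptions.

```python
# Sketch of the regenerate-until-valid-JSON pattern described above, using vLLM's
# LLM/SamplingParams interface. NOT marginalia's real API; model id, retry count,
# and key check are illustrative assumptions.
import json
from vllm import LLM, SamplingParams

llm = LLM(model="teknium/OpenHermes-2.5-Mistral-7B")  # assumed OpenHermes-2.5 repo id
params = SamplingParams(temperature=0.7, max_tokens=512)

def annotate(prompt: str, expected_keys: set, max_retries: int = 5):
    """Regenerate until the model returns JSON containing the expected keys."""
    for _ in range(max_retries):
        text = llm.generate([prompt], params)[0].outputs[0].text
        try:
            parsed = json.loads(text)
        except json.JSONDecodeError:
            continue  # malformed JSON: try again
        if expected_keys.issubset(parsed):
            return parsed  # structure matches the expectation
    return None  # give up after max_retries

result = annotate(
    "Return a JSON object with keys 'author', 'title', 'year' for: "
    "'B. Franklin, Poor Richard improved, Philadelphia, 1744.'",
    expected_keys={"author", "title", "year"},
)
print(result)
```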

replied to JustinLin610's post 8 months ago

Congratulations! With all the US/EU big players being more secretive than ever, you're not just bringing good models, but really making an incredible contribution to open research.

And I slightly disagree on one point: Qwen-500m is SOTA. I never thought it would be possible to get results like this from such a small multilingual model on RAG tasks in French.

posted an update 8 months ago
Hi everyone,
For my first post, I'm announcing a big release (in multiple ways): probably the largest open corpus in French to date, with 85 billion words in the public domain.
The dataset has been prepared in collaboration with Benoît de Courson and Benjamin Azoulay from Gallicagram (https://shiny.ens-paris-saclay.fr/app/gallicagram). Gallicagram is a major cultural analytics project for French, an open (and better) version of the Ngram Viewer for large-scale search of word and ngram occurrences.
The corpus is made of two different datasets: monographs (16B words, PleIAs/French-PD-Books) and newspapers/periodicals (69B words, PleIAs/French-PD-Newspapers). Along with the full text, it also includes core provenance metadata.
Beyond research in digital humanities, the corpus can also be used to train open and reproducible LLMs. Being in the public domain, it can be released anywhere, in any shape, without restrictions.
The corpus is not perfect: digitization of cultural heritage is challenging and, especially for newspapers, we have to deal with layout issues and a significant rate of optical character recognition mistakes. Our conviction is that releasing the corpus as a commons is the best way to improve on this. Sharing is caring.
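
As a sketch of the kind of Gallicagram-style query the corpus enables, the snippet below streams the newspapers dataset and counts occurrences of a word over a small sample; the "complete_text" column name is an assumption to verify against the dataset card.

```python
# Hedged sketch: stream PleIAs/French-PD-Newspapers and count a word over a small sample,
# in the spirit of Gallicagram-style queries. The "complete_text" field name is an
# assumption; check the dataset card for the actual schema.
from datasets import load_dataset

newspapers = load_dataset("PleIAs/French-PD-Newspapers", split="train", streaming=True)

count = 0
for i, record in enumerate(newspapers):
    text = record.get("complete_text") or ""
    count += text.lower().count("liberté")
    if i == 99:  # only look at the first 100 records
        break
print(f"'liberté' occurrences in the sample: {count}")
```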