joy larkin's picture
4

joy larkin

joylarkin

AI & ML interests

AI Data, Datasets, Data workflows, AGI, LLMs, Fine-tuning, Evaluations, etc. ••• Head of Marketing/GTM @ Airtrain AI.

Organizations

Posts 2

view post
Post
2290
💬 Chat as a way to query SQL! The Airtrain AI team is happy to share a new Hugging Face Space that lets you interact with Hugging Face Hub datasets using a natural language chatbot. 🤗

Start Exploring 👉 airtrain-ai/hf-dataset-chat-to-sql

This Space is forked from davidberenstein1957/text-to-sql-hub-datasets by  @davidberenstein1957 and features chat capability with improved table naming. The tool works with Hugging Face’s recently released in-browser DuckDB-based SQL query engine for datasets.



view post
Post
2984
Introducing Fineweb-Edu-Fortified: An enhanced Fineweb-Edu dataset. 📚

This dataset is tailored for NLP tasks and helps streamline model training by offering a more refined, unique dataset. Perfect for startups and researchers looking for high-quality educational content to train, evaluate, or fine-tune AI models. The dataset is based on the Fineweb-Edu subset of the large Fineweb dataset and includes:

- Exact-match deduplication across all crawls
- Embeddings for each row using the TaylorAI/bge-micro model
- Count column indicating duplication frequency
- Includes data from 95 Common Crawl crawls (2013-2024)
- Rows have been reduced from 1.279B to 0.324B after deduplication
- It is comprised of ~375B tokens (down from 1,320B in Fineweb-Edu)

Access the entire Fineweb-Edu-Fortified dataset on Hugging Face → airtrain-ai/fineweb-edu-fortified

Try a semantic search demo via this Hugging Face Space → airtrain-ai/fineweb-edu-fortified-search-demo

Many thanks to the amazing @josh-sematic for his work on this project, the Fineweb/Fineweb-Edu team at Hugging Face for producing the original datasets and for their support during our work on Fineweb-Edu-Fortified, and also thanks to  @underspirit for pointing out the reduction in dataset size that could be achieved via deduplication. 🤗

models

None public yet

datasets

None public yet