Building an LLM fine-tuning Dataset
sentdex

Published on Mar 6, 2024

A walkthrough of building a QLoRA fine-tuning dataset for a language model.
NVIDIA GTC signup: https://nvda.ws/3XTqlB6

Fine-tuning code: https://github.com/Sentdex/LLM-Finetu...
5000-step Walls1337bot adapter: https://huggingface.co/Sentdex/Walls1...
WSB Dataset: https://huggingface.co/datasets/Sentd...
"I have every reddit comment" original reddit post and torrent info:   / i_have_every_publicly_available_reddit_com...  
2007-2015 Reddit Archive.org: https://archive.org/download/2015_red...
Reddit BigQuery 2007-2019 (this has other data besides reddit comments too!):   / 17_billion_reddit_comments_loaded_on_bigquery  

Contents:

00:00 - Introduction to dataset building for fine-tuning
02:53 - The Reddit dataset options (Torrent, Archive.org, BigQuery)
06:07 - Exporting BigQuery Reddit (and some other data)
14:44 - Decompressing all of the gzip archives
25:13 - Re-combining the archives for target subreddits
28:29 - How to structure the data
40:40 - Building training samples and saving to database
48:49 - Creating customized training JSON files
54:11 - QLoRA training and results
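
For readers who want a rough picture of the middle steps above (decompressing the gzip archives, keeping only the target subreddits, and writing training samples), here is a minimal Python sketch. It is not the code from the video or the linked repository. It assumes each archive is newline-delimited JSON with `id`, `parent_id`, `subreddit` and `body` fields, and that comment parents carry Reddit's `t1_` prefix. The function names and the prompt/response output layout are hypothetical, and the database step from the video is skipped in favor of an in-memory dictionary.

```python
import gzip
import json


def iter_comments(path):
    """Yield one dict per line of a gzip-compressed, newline-delimited JSON file."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def build_samples(comments, subreddit):
    """Pair each reply in `subreddit` with the text of its parent comment.

    Assumption: fields are named id / parent_id / subreddit / body, and a
    parent_id starting with "t1_" points at another comment.
    """
    wanted = subreddit.lower()
    # Index the target subreddit's comments by id so parents can be looked up.
    by_id = {
        c["id"]: c
        for c in comments
        if c.get("subreddit", "").lower() == wanted
    }
    samples = []
    for c in by_id.values():
        parent_id = c.get("parent_id", "")
        if not parent_id.startswith("t1_"):
            continue  # top-level comment: its parent is a submission
        parent = by_id.get(parent_id[3:])
        if parent is None:
            continue  # parent missing from this archive
        samples.append({"prompt": parent["body"], "response": c["body"]})
    return samples


def write_jsonl(samples, path):
    """Write samples as one JSON object per line (hypothetical output layout)."""
    with open(path, "w", encoding="utf-8") as fh:
        for sample in samples:
            fh.write(json.dumps(sample) + "\n")
```

A call such as `write_jsonl(build_samples(iter_comments("comments.json.gz"), "wallstreetbets"), "train.jsonl")` would chain the three steps, where both file names are placeholders. Holding every comment in a dictionary only works for data that fits in memory, which is one reason a database is a sensible choice at full Reddit scale.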


Neural Networks from Scratch book: https://nnfs.io
Channel membership:    / @sentdex  
Discord:   / discord  
Reddit:   / sentdex  
Support the content: https://pythonprogramming.net/support...
Twitter:   / sentdex  
Instagram:   / sentdex  
Facebook:   / pythonprogramming.net  
Twitch:   / sentdex  
