Solving Real-World Data Science Problems with LLMs! (Historical Document Analysis)
Keith Galli
217K subscribers
12,296 views

Published on Mar 20, 2024

In this video we walk through the process of analyzing historical documents using Python and large language models (LLMs). We start by setting up an LLM with both a closed-source option (the OpenAI API) and an open-source option (Llama 2 via Ollama). Next, we use the LLMs to parse entities out of text. After that we start working with our actual data: we load a specific subcategory of documents from Kaggle and work out how to connect pages that belong to the same document. Once that is done, we repeat the entity parsing on the real data to pull out information such as names, ages, and locations. Finally, we analyze these entities to draw some insights from our document database.
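
To give a feel for the entity parsing step, here is a minimal sketch of turning an LLM's string reply into a Python object. It is not the exact code from the video: the reply text, the key names ("name", "age", "location"), and the helper parse_entities are all made up for illustration, and it assumes the model was asked to answer with a JSON object.

```python
import json

# Made-up example of a raw LLM reply; real replies often wrap the JSON in extra text.
raw_reply = (
    "Sure! Here is the result:\n"
    '{"name": "John Smith", "age": 12, "location": "Richmond, Virginia"}'
)


def parse_entities(reply: str) -> dict:
    """Parse the text between the first '{' and the last '}' as JSON."""
    start = reply.find("{")
    end = reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in reply")
    return json.loads(reply[start:end + 1])


entities = parse_entities(raw_reply)
print(entities["name"], entities["age"], entities["location"])
```

Slicing from the first brace to the last is a simple way to tolerate chatty replies, but it will still fail if the model returns invalid JSON, so in practice you may want to catch json.JSONDecodeError and retry.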

Kaggle Dataset: https://www.kaggle.com/datasets/keith...
GitHub Repo: https://github.com/keithgalli/histori...
Project Website: https://freedmensbureau.info

Contributors:
Abdessalem Boukil (NLP Research & Analysis): / abdessalem-boukil-37923637
Trent Self (Kaggle Dataset Setup): / trentonself

If you enjoyed this project video, make sure to throw it a thumbs up & subscribe! Let me know in the comments if you have any questions. It would also be helpful for people to upvote the Kaggle dataset for visibility!

---------------------------

Video timeline!
0:00 - Video Overview & Reference Material
3:05 - Data & Code Setup
5:04 - Task #0: Configure LLM to use with Python (OpenAI API)
20:10 - Task #0 (continued): LLM Configuration with Open-Source Model (Llama 2 via Ollama)
27:39 - Task #1: Use LLM to Parse Simple Sentence Examples
41:22 - Sub-task #1: Convert string to Python Object
44:29 - Task #1 (continued): Use Open-Source LLM to Parse Sentence Examples w/ LangChain
56:24 - Quick note on a benefit of using LangChain (easily switching between models)
58:06 - Task #2 (warmup): Grab Apprenticeship Agreement rows from Dataframe
1:06:22 - Task #2: Connect Pages that Belong to the Same Documents
1:56:36 - Task #3: Parse out values from merged documents
2:12:44 - Task #4 (setup): Analyze Results
2:17:52 - Fixing up our results from task #3 quickly
2:20:41 - Task #4: Find the average age of apprentices in our merged contract documents
2:30:59 - Other analysis: who had the most apprentices?
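
For a rough idea of what the Task #4 analysis looks like, here is a small sketch using only the Python standard library (the video itself works with a dataframe). The records, the field names ("apprentice", "age", "employer"), and the helper to_age are invented for illustration and are not taken from the dataset.

```python
from collections import Counter
from statistics import mean

# Made-up records shaped like the output of an entity parsing step.
records = [
    {"apprentice": "A", "age": 12, "employer": "X"},
    {"apprentice": "B", "age": "14", "employer": "X"},
    {"apprentice": "C", "age": None, "employer": "Y"},
]


def to_age(value):
    """Return value as an int, or None when it is missing or not numeric."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


# Keep only the ages that could be converted to integers.
ages = [a for a in (to_age(r["age"]) for r in records) if a is not None]
average_age = mean(ages)

# Count how many apprentices each employer appears with.
top_employer, count = Counter(r["employer"] for r in records).most_common(1)[0]

print(average_age, top_employer, count)
```

LLM output is messy, so ages can come back as strings or be missing entirely; converting them defensively before averaging avoids a crash on the first bad row.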

-------------------------
If you are curious to learn how I make my tutorials, check out this video: How to Make a High Quality Tutorial V...

Practice your Python Pandas data science skills with problems on StrataScratch!
https://stratascratch.com/?via=keith

Join the Python Army to get access to perks!
YouTube - / @keithgalli
Patreon - / keithgalli

*I use affiliate links on the products that I recommend. I may earn a purchase commission or a referral bonus from the usage of these links.
