History Lab
OCR Article Parser
A Python application that downloads the plain-text versions of OCR scans of old newspapers and extracts all articles related to given keywords. Developed for the UBC History Lab course during terms 2017W2 and 2018W1.
Features
- Uses Python's requests library to download the OCR scans of the newspapers.
- Uses the Levenshtein distance metric to compare words, so keywords are detected in an article even when the OCR text contains small spelling errors.
- Implemented for the UBC History Lab course taught by Dr. Heidi Tworek in January 2018.
- Supports download and extraction from the following sources:
  - "ChroniclingAmerica" : chroniclingamerica.loc.gov
  - "BC" : open.library.ubc.ca
  - "Oregon" : oregonnews.uoregon.edu
  - "NewYork" : nyshistoricnewspapers.org
  - "Georgia" : gahistoricnewspapers.galileo.usg.edu
  - "Newspaper" : newspaper.com
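
The keyword detection described under Features can be illustrated with a minimal sketch. This is not the project's actual code: the function names `levenshtein` and `contains_keyword`, the `max_distance` threshold, and the simple word tokenization are all assumptions chosen for illustration. The idea is that a word in the OCR text counts as a match when its edit distance to the keyword is small, which tolerates typical OCR misreadings.

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Return the edit distance between a and b.

    Counts the minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b.
    """
    # Keep the shorter string in b so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep on a match)
            ))
        previous = current
    return previous[-1]


def contains_keyword(text: str, keyword: str, max_distance: int = 1) -> bool:
    """Return True if any word in text is within max_distance of keyword.

    Hypothetical helper: the threshold and the tokenization are assumptions.
    """
    words = re.findall(r"[a-z]+", text.lower())
    keyword = keyword.lower()
    return any(levenshtein(word, keyword) <= max_distance for word in words)


if __name__ == "__main__":
    # "influcnza" stands in for an OCR misreading of "influenza".
    print(contains_keyword("The influcnza epidemic spread", "influenza"))
    print(contains_keyword("Wheat prices rose sharply", "influenza"))
```

A larger `max_distance` catches more OCR errors but also produces more false matches, particularly for short keywords, so the threshold would likely need tuning per keyword.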