Chronicling America Text Mining and Visualization Tool
This tool uses a local parquet dataset of the Chronicling America historical newspapers to run keyword and date searches that build a corpus, and to perform collocation analysis and geocoding (based on the city of the publisher).
The newspaper files are from AmericanStories (https://huggingface.co/datasets/dell-research-harvard/AmericanStories), Update (3/25/2025). The JSON files have been converted to parquet for efficient storage and searching. The local parquet files differ from the AmericanStories parquet files available on Hugging Face (https://huggingface.co/datasets/davanstrien/AmericanStories-parquet), which were based on version 1 of the AmericanStories dataset.
- Download the newspaper articles stored as parquet datasets here: https://emailsc-my.sharepoint.com/:f:/r/personal/w_kennedy_sc_edu/Documents/data_tx?csf=1&web=1&e=gHy9xJ
- If you downloaded the zip archive with all years between 1900 and 1922 ("AmericanStories_parquet1900-1922.zip"), unzip the archive.
- In Finder/File Explorer, you should see one parquet file for each year, roughly 2-3 GB each (for 1900-1910s).
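To confirm that the download and unzip worked, a short script can list the parquet files and their sizes. This is a minimal sketch using only the Python standard library; the function name `list_parquet_files` and the folder path in the usage example are hypothetical and not part of the tool.

```python
from pathlib import Path


def list_parquet_files(folder):
    """Return (file name, size in GB) pairs for every .parquet file in folder."""
    results = []
    for path in sorted(Path(folder).glob("*.parquet")):
        size_gb = path.stat().st_size / 1024**3  # bytes to gigabytes
        results.append((path.name, round(size_gb, 2)))
    return results


if __name__ == "__main__":
    # Hypothetical location: replace with the folder where you unzipped the archive.
    for name, size_gb in list_parquet_files("AmericanStories_parquet1900-1922"):
        print(f"{name}: {size_gb} GB")
```

If a year is missing or a file is far smaller than its neighbors, the download was probably interrupted and is worth repeating.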
VS Code Instructions for first-time use
VS Code needs Git installed and signed in to a GitHub account, if you have not done so already.
- Set folder location for program files using File > Open Folder...
- Clone Repository (https://github.com/wrightkennedy/chronAm-project.git)
  Press Cmd/Ctrl + Shift + P, type Clone, and press Enter; then paste https://github.com/wrightkennedy/chronAm-project.git and press Enter.
- Open Terminal with Ctrl + Shift + `. The current directory should be set to the project folder by default.
- In the Terminal, create a new virtual environment with the following command:
  python3 -m venv venv
- Activate the virtual environment:
  source venv/bin/activate
- Install dependencies:
  pip install -r requirements.txt
- Run the Python script:
  python app.py
- The first startup is often slow, since the software downloads the dependencies.
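A common stumbling block is installing the dependencies or running the script without the virtual environment active. The check below is a minimal sketch, not part of the tool: it uses the standard-library `sys` module to report whether the current interpreter is running inside a venv. The function name `in_virtual_environment` is hypothetical.

```python
import sys


def in_virtual_environment():
    """Return True when Python is running inside a venv-style virtual environment."""
    # Inside a venv, sys.prefix points at the venv folder while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("Interpreter:", sys.executable)
    print("Virtual environment active:", in_virtual_environment())
```

If this reports that no virtual environment is active, repeat the activation step before running pip or app.py.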
VS Code Instructions for first-time use
- Click Source Control tab on left
- Click "Download Git for Windows"
  a. https://git-scm.com/downloads/win
  b. Download and install
  c. Choose the Default editor used by Git > "Use Visual Studio Code as Git's default editor"
  d. Use default selections for everything else in the installer
- In VS Code > Source Control (tab), click "Reload" to see Git
- Select "Initialize Repository"
- Cmd/Ctrl + Shift + P, then type "Clone"
  a. Paste https://github.com/wrightkennedy/chronAm-project.git
  b. Select folder for clone
  c. If this is your first time using Git in VS Code, you will be asked to "Sign in [to GitHub] with your browser"
  d. Would you like to open the repository or add it to the workspace > Add it to the workspace
- Select the Explorer tab on the left
- Top Menu > Terminal > New Terminal (or Ctrl + Shift + `)
  a. Select chronAm-project as the workspace; we will install the Python virtual environment in this folder, and the .gitignore will make sure it is not uploaded to GitHub
  b. Windows (python -m venv venv)
     i. "We noticed a new virtual environment; would you like to select it" > "Yes"
- To activate the venv on Windows, use
  venv\Scripts\activate
  a. If terminal returns a security error, do the following:
     i. Open PowerShell as Administrator
        1) Right-click the Start button and select Windows PowerShell (Admin) or search for "PowerShell," then right-click and choose "Run as Administrator."
     ii. Check current execution policy: Run the following command to check the current execution policy:
        Get-ExecutionPolicy
     iii. Change execution policy: To allow scripts to run, you need to set the execution policy to RemoteSigned (which allows locally created scripts to run). Run this command:
        Set-ExecutionPolicy RemoteSigned -Scope CurrentUser
     iv. Return to VS Code and try activating the venv again:
        .\venv\Scripts\Activate
  b. To undo the change made to the execution policy:
     i. Open PowerShell as Administrator
        1) Right-click the Start button and select Windows PowerShell (Admin) or search for "PowerShell," then right-click and choose "Run as Administrator."
     ii. To set the execution policy back to its default value (Restricted), run the following command:
        Set-ExecutionPolicy Restricted -Scope CurrentUser
     iii. Verify the change to the execution policy: Run the following command to verify the current execution policy:
        Get-ExecutionPolicy
- Use File > New Project to create a new project folder (it is generally a good idea to keep it close to or in the chronam-project folder).
- In Finder/File Explorer, add a folder named parquet to chronam/data:
  chronam/data/parquet/
  and move the parquet files to this folder (downloaded in "Download Parquet Files").
- [Optional] If you have already used the chronam software, use File > Open Project to open the folder location of the previous chronAm-project folder (this folder should contain your parquet folder with datasets).
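The folder setup can also be scripted rather than done by hand in Finder/File Explorer. The following is a minimal sketch using only the standard library; the function name `move_parquet_files` and both folder arguments are hypothetical, and the `data/parquet` layout is taken from the step above.

```python
import shutil
from pathlib import Path


def move_parquet_files(download_dir, project_dir):
    """Move every .parquet file from download_dir into <project_dir>/data/parquet."""
    target = Path(project_dir) / "data" / "parquet"
    target.mkdir(parents=True, exist_ok=True)  # create data/parquet if it is missing
    moved = []
    for path in sorted(Path(download_dir).glob("*.parquet")):
        destination = target / path.name
        if destination.exists():
            continue  # leave files that are already in place untouched
        shutil.move(str(path), str(destination))
        moved.append(destination.name)
    return moved


if __name__ == "__main__":
    # Hypothetical paths: adjust both to match your own download and project folders.
    print(move_parquet_files("AmericanStories_parquet1900-1922", "chronam"))
```

Files that already exist in the target folder are skipped, so the script can be rerun safely after downloading additional years.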
- Start by selecting "A) Search Dataset"
- Enter a Search Term into the box. Search is case-insensitive.
- Enter a Start Date and End Date. Mind the format:
  [YYYY]-[MM]-[DD]
  Tip: start small. Either use a less common term/phrase or a small date range.
- Select "Run Download". If using the parquet files, the tool runs locally and is not downloading new data.
- Watch as the tool identifies and extracts all articles containing the search term within the date range.
- The tool creates a JSON file in:
  data/processed/[search term]/[search term]_[start date]_[end date].json
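The search steps above can be illustrated with a small standalone sketch. It is not the tool's actual implementation: it works on an in-memory list of records instead of the parquet files, the field names `date` and `article` are assumptions, the date range is assumed to be inclusive, and all function names are hypothetical. It shows the date format check, the case-insensitive match, and the output path pattern.

```python
import json
from datetime import datetime
from pathlib import Path


def parse_date(text):
    """Parse a year-month-day string such as 1906-04-18; raises ValueError otherwise."""
    return datetime.strptime(text, "%Y-%m-%d").date()


def search_articles(records, term, start, end):
    """Return records that contain term (case-insensitive) and fall within the dates."""
    start_date, end_date = parse_date(start), parse_date(end)
    needle = term.casefold()
    hits = []
    for record in records:
        # Assumed field names: "date" (YYYY-MM-DD) and "article" (full text).
        in_range = start_date <= parse_date(record["date"]) <= end_date
        if in_range and needle in record["article"].casefold():
            hits.append(record)
    return hits


def output_path(term, start, end, root="data/processed"):
    """Build data/processed/[search term]/[search term]_[start date]_[end date].json."""
    return Path(root) / term / f"{term}_{start}_{end}.json"


def save_results(hits, term, start, end, root="data/processed"):
    """Write the matching records to the JSON file and return its path."""
    path = output_path(term, start, end, root)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(hits, indent=2), encoding="utf-8")
    return path


if __name__ == "__main__":
    # Invented sample records, for illustration only.
    sample = [
        {"date": "1906-04-19", "article": "The EARTHQUAKE struck before dawn."},
        {"date": "1906-05-02", "article": "Earthquake relief continues."},
    ]
    hits = search_articles(sample, "earthquake", "1906-04-18", "1906-04-30")
    print(len(hits), "match(es); would be saved to",
          output_path("earthquake", "1906-04-18", "1906-04-30"))
```

Validating the dates before searching means a mistyped date such as 04/18/1906 fails immediately with a ValueError instead of silently returning no results.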