Project Date: Fall 2025
This project is an NLP-powered platform for analyzing LinkedIn job postings, applying techniques including Named Entity Recognition (NER), Topic Modeling, and Word Embeddings.
Final Presentation PDF - Comprehensive project presentation covering business background, methodology, results, and evaluation.
Samar Deen
Email: [email protected]
Applied Text and Natural Language Analytics - Fall 2025

Team Members:
- Boni Vasius Rosen @bvr2105-boni
- Minkyung (Ginny) Kim @ginny-1334
- Kas Kiatsukasem
- Kibaek Kim @hyper07
- Suchakrey (Philip) Nitisanon
- LinkedIn Job Scraper: Automated scraping of job postings from LinkedIn
- HuggingFace Job Data: Additional job postings from HuggingFace datasets
- Named Entity Recognition (NER): Extract skills, technologies, qualifications, and other entities
- Topic Modeling: Discover themes using LDA and LSA
- Word Embeddings: Word2Vec and Sentence-BERT for semantic analysis
- Resume Matching: Match resumes to job descriptions
- Streamlit Web App: User-friendly interface for data exploration
- EDA Visualizations: Interactive charts and statistics
- NLP Analytics: Run NLP models and view results
- Real-time Analysis: Process job descriptions on-demand
nlp-fall-2025/
├── app-streamlit/ # Streamlit web application
│ ├── Home.py # Main dashboard
│ ├── utils.py # Utility functions and path helpers
│ ├── pages/ # Application pages
│ │ ├── 0_Job_Crawling.py # LinkedIn job scraper interface
│ │ ├── 1_EDA.py # Exploratory data analysis
│ │ ├── 2_Data_Cleaning.py # Data cleaning and preprocessing
│ │ ├── 3_NLP_Analytics.py # NLP analysis tools (NER, Topic Modeling)
│ │ ├── 4_Import_Embeddings.py # Import word embeddings
│ │ ├── 5_Synthetic_Resume_Generator.py # Generate synthetic resumes
│ │ ├── 6_Resume_Evaluation.py # Evaluate resume quality
│ │ ├── 7_Resume_Matching.py # Match resumes to jobs
│ │ └── 8_Meet_Our_Team.py # Team information
│ ├── functions/ # Helper functions and utilities
│ │ ├── database.py # Database operations
│ │ ├── nlp_database.py # NLP data loading utilities
│ │ ├── nlp_models.py # NLP model implementations
│ │ ├── nlp_components.py # NLP visualization components
│ │ ├── nlp_config.py # NLP configuration
│ │ ├── eda_components.py # EDA visualization components
│ │ ├── visualization.py # General visualization utilities
│ │ ├── components.py # UI components
│ │ ├── downloaderCSV.py # CSV download utilities
│ │ └── menu.py # Navigation menu
│ ├── components/ # Reusable UI components
│ │ ├── header.py # Page header component
│ │ ├── footer.py # Page footer component
│ │ └── card.py # Card component
│ ├── styles/ # CSS styling
│ │ └── app.css # Main stylesheet
│ ├── images/ # Image assets
│ ├── locales/ # Internationalization files
│ ├── requirements.txt # Python dependencies
│ ├── requirements-dev.txt # Development dependencies
│ ├── Dockerfile # Docker configuration
│ └── create_jobs_table.sql # Database schema
├── workspace/ # Analysis notebooks and data
│ ├── Data/ # Job datasets and processing
│ ├── Data_Cleaning/ # Data cleaning notebooks
│ ├── NER/ # Named Entity Recognition notebooks
│ ├── Topic Modeling/ # LDA and LSA implementations
│ ├── Word Embedding/ # Word2Vec and SBERT notebooks
│ ├── Scrapers/ # Scraper testing notebooks
│ ├── Resume_testing/ # Resume matching experiments
│ ├── models/ # Trained models and embeddings
│ │ ├── word2vec_model.joblib
│ │ ├── job_embeddings_sbert_*.npy
│ │ ├── job_embeddings_w2v_*.npy
│ │ ├── topic_model_lda_*.joblib
│ │ ├── topic_model_lsa_*.joblib
│ │ └── ner_results.json
│ └── Proposal/ # Project proposal documents
├── jupyter/ # Jupyter notebook environment
│ ├── Dockerfile # Jupyter Docker configuration
│ └── requirements.txt # Jupyter dependencies
├── linkedin-jobs-scraper/ # LinkedIn scraper library
├── scraps/ # Raw scraped data (CSV files)
├── images/ # Project images and screenshots
├── linkedin.py # LinkedIn scraper script
├── docker-compose.yml # Docker Compose configuration
├── Final_Presentation_NLP.pdf # Final project presentation
├── Final_Presentation_NLP.pptx # Final project presentation (PowerPoint)
└── README.md # This file
- Python 3.8+
- Chrome browser (for scraping)
- LinkedIn account
- Clone the repository
git clone https://github.com/yourusername/nlp-fall-2025.git
cd nlp-fall-2025
Install spaCy models (Required for NER functionality)
**Manual installation:**
python -m spacy download en_core_web_sm
python -m spacy download en_core_web_md
python -m spacy download en_core_web_lg  # Optional: large model for better NER

Note: If using Docker, the spaCy models are installed automatically during the Docker build process.
The project includes Docker Compose configuration for easy setup:
# Start all services (PostgreSQL, Jupyter, Streamlit)
docker-compose up -d

Services will be available at:
- Streamlit App: http://localhost:48501
- Jupyter Notebook: http://localhost:48888
- PostgreSQL: localhost:45432
Note: If running locally, ensure PostgreSQL is set up and configured in your .env file.
You need to obtain your li_at cookie from LinkedIn (see "Additional Notes" section below for detailed instructions).
Then run:
LI_AT_COOKIE="your_cookie_here" python linkedin.py

The Streamlit app provides interactive interfaces for:
- Job Crawling (Page 0): Scrape LinkedIn jobs directly from the UI
- EDA (Page 1): Explore and visualize job data
- Data Cleaning (Page 2): Clean and preprocess job descriptions
- NLP Analytics (Page 3): Run NER, Topic Modeling, and other NLP analyses
- Import Embeddings (Page 4): Import and manage word embeddings
- Synthetic Resume Generator (Page 5): Generate synthetic resumes for testing
- Resume Evaluation (Page 6): Evaluate resume quality
- Resume Matching (Page 7): Match resumes to job postings
- Meet Our Team (Page 8): Team information
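Outside Docker, the app can also be launched directly with Streamlit's CLI (a minimal sketch, assuming the dependencies in app-streamlit/requirements.txt are installed and Home.py is the entry point shown in the project structure):

pip install -r app-streamlit/requirements.txt
streamlit run app-streamlit/Home.py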
Each NLP technique has dedicated Jupyter notebooks in the workspace/ directory:
Named Entity Recognition:
jupyter notebook workspace/NER/NER.ipynb
# or workspace/NER/NER_kas_edit.ipynb

Topic Modeling:
jupyter notebook workspace/Topic\ Modeling/

Word Embeddings:
jupyter notebook workspace/Word\ Embedding/

Data Cleaning:
jupyter notebook workspace/Data_Cleaning/

Complete Job Analysis:
jupyter notebook workspace/Final_Job_Anaylsis.ipynb

This comprehensive notebook includes:
- Vector search with PostgreSQL (see the sketch after this list)
- Final score calculation for job matching
- Complete resume evaluation pipeline
- Integration of all NLP techniques
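A minimal sketch of the vector-search step with pgvector, assuming psycopg2 is installed and a jobs table with an embedding vector column exists; the table and column names (jobs, title, company, embedding) are illustrative, with the actual schema in app-streamlit/create_jobs_table.sql:

```python
import psycopg2  # pip install psycopg2-binary

# Connection string matches the .env example in the Configuration section
conn = psycopg2.connect("postgresql://admin:PassW0rd@localhost:5432/db")
cur = conn.cursor()

# Hypothetical 384-dim resume embedding (e.g., from an SBERT model)
resume_emb = [0.0] * 384
vec_literal = "[" + ",".join(str(x) for x in resume_emb) + "]"

# pgvector's <=> operator is cosine distance; 1 - distance gives similarity
cur.execute(
    """
    SELECT title, company, 1 - (embedding <=> %s::vector) AS similarity
    FROM jobs
    ORDER BY embedding <=> %s::vector
    LIMIT 10
    """,
    (vec_literal, vec_literal),
)
for title, company, sim in cur.fetchall():
    print(f"{sim:.3f}  {title} @ {company}")
```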
The scraper can be run via:
- Command line: python linkedin.py (with the LI_AT_COOKIE environment variable set)
- Streamlit UI: Navigate to the "Job Crawling" page (Page 0)
Scrapes job listings with filters for:
- Full-time and internship positions
- Remote work options
- Mid to senior experience level
- $100K+ base salary
- Posted within the last month
Output CSV includes:
- Job Title
- Company
- Company Link
- Date
- Date Text
- Job Link
- Insights
- Description Length
- Description
Scraped data is saved to the scraps/ directory with timestamped filenames: linkedin_jobs_YYYYMMDD_HHMMSS.csv
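A minimal sketch of producing such a filename (illustrative, not necessarily the exact code in linkedin.py):

```python
from datetime import datetime
from pathlib import Path

# Timestamp format matches the linkedin_jobs_YYYYMMDD_HHMMSS.csv convention
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
out_path = Path("scraps") / f"linkedin_jobs_{timestamp}.csv"
print(out_path)  # e.g., scraps/linkedin_jobs_20251212_143022.csv
```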
1. Named Entity Recognition (NER)
- Extract skills and technologies
- Custom entity types for job-specific information
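A minimal spaCy sketch using one of the models installed earlier. Note that stock spaCy models emit generic labels such as ORG and GPE, so job-specific entity types like skills typically require a custom component such as an EntityRuler:

```python
import spacy

nlp = spacy.load("en_core_web_md")  # assumes the model was downloaded above
doc = nlp("Acme Corp is hiring a Data Scientist in New York with Python and SQL experience.")

for ent in doc.ents:
    print(ent.text, "->", ent.label_)  # e.g., "Acme Corp -> ORG", "New York -> GPE"
```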
2. Topic Modeling
- LDA (Latent Dirichlet Allocation)
- LSA (Latent Semantic Analysis)
- Discover hidden themes in job descriptions
- Visualize topic distributions
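A minimal Gensim LDA sketch on a toy corpus (illustrative only; the project's implementations live in the workspace/Topic Modeling/ notebooks):

```python
from gensim import corpora
from gensim.models import LdaModel

# Toy tokenized "job descriptions"
docs = [
    ["python", "machine", "learning", "model"],
    ["sales", "customer", "crm", "pipeline"],
    ["python", "data", "model", "sql"],
]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

lda = LdaModel(corpus=corpus, num_topics=2, id2word=dictionary, random_state=42)
for topic_id, words in lda.print_topics():
    print(topic_id, words)  # top weighted words per discovered topic
```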
3. Word Embeddings
- Word2Vec for word-level semantics
- Sentence-BERT for document similarity
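Minimal sketches of both embedding approaches; the SBERT model name here is a common default and an assumption, not necessarily what the notebooks use:

```python
from gensim.models import Word2Vec
from sentence_transformers import SentenceTransformer, util

# Word-level: train Word2Vec on tokenized job descriptions (toy data)
sentences = [["python", "sql", "aws"], ["excel", "sales", "crm"]]
w2v = Word2Vec(sentences=sentences, vector_size=100, window=5, min_count=1)
print(w2v.wv.most_similar("python", topn=3))

# Document-level: SBERT similarity between a job posting and a resume
sbert = SentenceTransformer("all-MiniLM-L6-v2")
emb = sbert.encode(["Data Scientist with Python and SQL", "Experienced Python developer"])
print(util.cos_sim(emb[0], emb[1]))
```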
4. Resume Matching
- Extract skills from resumes (PDF parsing)
- Compute similarity scores using embeddings
- Comprehensive Final Score Calculation: Combines semantic, topic, and skill scores
- Semantic Score: Vector similarity using SBERT or Word2Vec embeddings
- Topic Score: LSA-based topic modeling similarity
- Skill Score: Jaccard similarity of extracted skills
- Final Score: Weighted combination:
avg_ts + (1 - avg_ts) * skill_score, where avg_ts = (topic_score + semantic_score) / 2
- Rank jobs by compatibility using final scores
- Provide detailed match explanations with breakdown of all scores
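The score combination in plain Python, transcribing the formula above; the Jaccard helper matches the skill-score definition, and function names are illustrative:

```python
def jaccard_similarity(a, b):
    """Jaccard similarity between two skill sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def final_score(semantic_score, topic_score, skill_score):
    """final = avg_ts + (1 - avg_ts) * skill_score, per the formula above."""
    avg_ts = (topic_score + semantic_score) / 2
    return avg_ts + (1 - avg_ts) * skill_score

# Example: strong semantic/topic match, partial skill overlap (0.5 Jaccard)
skills = jaccard_similarity({"python", "sql", "aws"}, {"python", "sql", "docker"})
print(final_score(semantic_score=0.82, topic_score=0.74, skill_score=skills))  # 0.89
```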
5. Synthetic Resume Generation
- Generate synthetic resumes for testing
- Customize resume content and skills
- Export in various formats
6. Resume Evaluation
- Evaluate resume quality and completeness
- Multi-dimensional Scoring: Comprehensive evaluation using multiple metrics
- Semantic similarity (embedding-based)
- Topic modeling similarity (LSA)
- Skill matching (Jaccard similarity)
- Final composite score combining all metrics
- Score resumes against job requirements with detailed breakdowns
- Provide improvement suggestions based on missing skills and score components
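A sketch of the missing-skills half of the suggestions, assuming skills have already been extracted as sets; variable names are illustrative:

```python
job_skills = {"python", "sql", "aws", "tableau"}
resume_skills = {"python", "sql"}

# Skills required by the job but absent from the resume
missing = sorted(job_skills - resume_skills)
for skill in missing:
    print(f"Consider adding evidence of experience with: {skill}")
```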
Create a .env file in the project root or app-streamlit/ directory:
# Database Configuration
POSTGRES_USER=admin
POSTGRES_PASSWORD=PassW0rd
POSTGRES_DB=db
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
DATABASE_URL=postgresql://admin:PassW0rd@localhost:5432/db
# LinkedIn Scraper
LI_AT_COOKIE=your_linkedin_cookie_here
# Optional: OpenAI API (for advanced features)
OPENAI_API_KEY=your_openai_api_key_here

Edit linkedin.py or use the Streamlit UI (Page 0) to customize job titles. The default list includes 70+ job titles across various industries.
In linkedin.py or the Streamlit interface, adjust filters:
- Experience level
- Work location (remote/hybrid/onsite)
- Salary range
- Time posted
- Job type (full-time/internship)
- Collection: LinkedIn scraper → CSV files in scraps/
  - Can be run via command line or Streamlit UI (Page 0)
- Data Cleaning: Clean and preprocess data → workspace/Data/
  - Use Streamlit Page 2 or Jupyter notebooks in workspace/Data_Cleaning/
- Storage: Store cleaned data in PostgreSQL database
  - Database schema defined in app-streamlit/create_jobs_table.sql
- Analysis: NLP models process descriptions → extract insights
  - NER: Extract skills, technologies, qualifications
  - Topic Modeling: Discover themes using LDA/LSA
  - Word Embeddings: Generate semantic representations
- Visualization: Streamlit app displays results
  - Interactive dashboards and visualizations
  - Export capabilities for analysis results
- Only scrapes publicly available job postings
- Respects LinkedIn's rate limits with delays
- No personal data collection
- For educational/research purposes only
- Final Presentation PDF - Complete project presentation with methodology, results, and evaluation
- spaCy Documentation
- Gensim Topic Modeling
- Sentence-BERT
- Streamlit Documentation
- LinkedIn Jobs Scraper
- pgvector (PostgreSQL vector extension)
This project is available for educational and research purposes. If you would like to use this code, data, or any part of this project for commercial purposes or in your own projects, please contact the project maintainers to request permission.
For use requests, please include:
- Your name and affiliation
- Intended use case
- Scope of usage (commercial, academic, personal, etc.)
We're happy to discuss usage terms and are generally open to collaboration and sharing, but we'd like to know how the project is being used.
You MUST obtain the li_at cookie from your LinkedIn account for the scraper to work.
- Open Chrome (or your preferred browser) and log into your LinkedIn account
- Open Developer Tools (F12 or Right-click → Inspect)
- Go to the Application tab (Chrome) or Storage tab (Firefox)
- In the left sidebar, expand Cookies and click on https://www.linkedin.com
- Find the cookie named li_at
- Copy the Value of the li_at cookie (it will be a long string)
Note: Keep this cookie value private and secure. Do not share it or commit it to version control. Store it in your .env file or pass it as an environment variable.
- Uses random delays (60-240 seconds) between job title searches to avoid rate limiting
- 2-second delay between individual requests
- Runs in headless mode by default
- All scraped data is timestamped: linkedin_jobs_YYYYMMDD_HHMMSS.csv
- Data saved to the scraps/ directory
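A minimal sketch of that delay scheme, using the values from the bullets above (the actual logic lives in linkedin.py):

```python
import random
import time

def polite_pause(between_titles: bool) -> None:
    """Sleep to stay under LinkedIn's rate limits: a random 60-240 s pause
    between job-title searches, a fixed 2 s pause between requests."""
    time.sleep(random.uniform(60, 240) if between_titles else 2)
```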