CompanyProfileCrawler

Brief Description of Crawler Design

The link to each company's profile is retrieved from a given listing page. A single thread keeps going through the page containing the links and puts every link it finds onto a queue.
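The link-collecting step above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the repository's actual code: `LinkCollector`, `enqueue_links`, and `SAMPLE_PAGE` are hypothetical names, and the HTML is a hard-coded stand-in for a page that a real crawler would download.

```python
import queue
import threading
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def enqueue_links(html, link_queue):
    """Parse one listing page and put every link it contains on the queue."""
    collector = LinkCollector()
    collector.feed(html)
    for link in collector.links:
        link_queue.put(link)
    return len(collector.links)


# Assumed sample input; the real listing page and its markup are not known here.
SAMPLE_PAGE = (
    '<ul>'
    '<li><a href="/company/1">Company One</a></li>'
    '<li><a href="/company/2">Company Two</a></li>'
    '</ul>'
)

link_queue = queue.Queue()

# A single producer thread fills the queue, as described above.
producer = threading.Thread(target=enqueue_links, args=(SAMPLE_PAGE, link_queue))
producer.start()
producer.join()
```

`queue.Queue` is thread-safe, so the producer can keep adding links while other threads take them off without any extra locking.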

A thread pool retrieves the links from the queue, reads each page, and stores the company profile in MongoDB.

There will be an "Event" object used to block the main thread. After all links are retrieved, this "Event" will be set, so it no longer blocks the main thread.

After the "Event" is set AND the queue is empty, the main thread exits.
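The coordination described above might look something like the sketch below. It assumes the "Event" is Python's `threading.Event`, whose `wait()` blocks until `set()` is called. The names `find_targets`, `worker`, `crawl`, and `NUM_WORKERS` are hypothetical, and a plain dictionary stands in for the MongoDB collection so the example runs without a database.

```python
import queue
import threading

NUM_WORKERS = 4                      # assumed pool size
link_queue = queue.Queue()
all_links_enqueued = threading.Event()
profiles = {}                        # stand-in for the MongoDB collection
profiles_lock = threading.Lock()


def find_targets(links):
    """Producer: enqueue every link, then signal that no more are coming."""
    for link in links:
        link_queue.put(link)
    all_links_enqueued.set()


def worker():
    """Consumer: take links off the queue until the producer is done."""
    while True:
        try:
            link = link_queue.get(timeout=0.1)
        except queue.Empty:
            # Once the event is set nothing more is enqueued, so an empty
            # queue at this point means there is no work left.
            if all_links_enqueued.is_set() and link_queue.empty():
                return
            continue
        # A real worker would download the page and parse the profile here.
        with profiles_lock:
            profiles[link] = {"url": link}
        link_queue.task_done()


def crawl(links):
    workers = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
    for thread in workers:
        thread.start()
    producer = threading.Thread(target=find_targets, args=(links,))
    producer.start()

    all_links_enqueued.wait()        # main thread blocks until the event is set
    link_queue.join()                # ...and until every queued link is processed
    producer.join()
    for thread in workers:
        thread.join()
    return profiles
```

Calling `crawl(["/company/1", "/company/2"])` returns once both conditions hold: the event has been set and the queue has been drained.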

This is now finished as my first web crawler. I have just figured out what to do next, so let's keep going.

Folders and files (31 commits):

.idea
MongoDB_playground
MultiThreading_Playground
PhantomJS_Playground
Selenium_Playground
Main.py
README.md
TargetFinder.py
Worker.py
function_test.py
function_test_worker.py
