@honzajavorek honzajavorek commented Jan 14, 2026

Wikipedia is unreliable as a scraping target; people are advised to use its API to get the content of articles. Also, after recent security issues, GitHub tightened up the npm registry. So this PR is mostly about replacing the exercises that target those two websites.
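
For context on the API route mentioned above, here is a minimal sketch of building a request URL for the MediaWiki Action API instead of scraping article HTML. It is not part of this PR. The helper name `build_extract_url` and the example title are hypothetical, and `prop=extracts` assumes the TextExtracts extension, which Wikipedia has enabled.

```python
from urllib.parse import urlencode

# Public endpoint of the MediaWiki Action API on English Wikipedia.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_extract_url(title: str) -> str:
    """Build a URL that asks the API for the plain-text content of one article.

    Hypothetical helper for illustration; it only builds the URL and does not
    send any request.
    """
    params = {
        "action": "query",
        "prop": "extracts",   # provided by the TextExtracts extension
        "explaintext": 1,     # plain text instead of HTML
        "titles": title,
        "format": "json",
        "formatversion": 2,
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    # The title is an arbitrary example.
    print(build_extract_url("Web scraping"))
```

Fetching the resulting URL with any HTTP client returns JSON, so no HTML parsing is needed.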

Meanwhile, IMDb changed its structure, so I fixed that here too. I also had to change some tests, as they proved to be too strict.

Some of the time I spent on this was not fruitful: I reworked the examples to use the UNESCO website, but it later proved to be very unreliable. I also found apify/crawlee-python#1673 and spent some time debugging it. Fixes #2113, at least for now 😅

Here is proof from my local machine:

npm run test:academy

> [email protected] test:academy
> bats --print-output-on-failure -r .

./sources/academy/webscraping/scraping_basics_javascript/exercises/test.bats
 ✓ outputs the HTML with Star Wars products
 ✓ counts the number of F1 Academy teams
 ✓ counts the number of F1 Academy drivers
 ✓ lists IMO countries
 ✓ lists IMO countries with a single selector
 ✓ lists Guardian F1 article titles
 ✓ prints warehouse stock counts
 ✓ prints warehouse stock counts using regex
 ✓ prints Guardian F1 titles with publish dates
 ✓ filters products from JSON
 ✓ lists WTA player links
 ✓ lists Guardian F1 article links
 ✓ lists WTA player birthplaces
 ✓ lists Guardian F1 authors
 ✓ lists JavaScript GitHub repos with the LLM topic
 ✓ finds the shortest CNN sports article
 ✓ scrapes F1 Academy driver details with Crawlee
 ✓ scrapes Netflix ratings with Crawlee
./sources/academy/webscraping/scraping_basics_python/exercises/test.bats
 ✓ outputs the HTML with Star Wars products
 ✓ counts the number of F1 Academy teams
 ✓ counts the number of F1 Academy drivers
 ✓ lists IMO countries
 ✓ lists IMO countries with a single selector
 ✓ lists Guardian F1 article titles
 ✓ prints warehouse stock counts
 ✓ prints warehouse stock counts using regex
 ✓ prints Guardian F1 titles with publish dates
 ✓ filters products from JSON
 ✓ lists WTA player links
 ✓ lists Guardian F1 article links
 ✓ lists WTA player birthplaces
 ✓ lists Guardian F1 authors
 ✓ lists Python database jobs
 ✓ finds the shortest CNN sports article
 ✓ scrapes F1 Academy driver details with Crawlee
 ✓ scrapes Netflix ratings with Crawlee

36 tests, 0 failures

I have no idea whether all of the exercises will work correctly from the data center IPs of GitHub Actions. We'll see once the tests run there. But at least they'll work for students trying to pass the courses.


Note

Shifts Academy scraping exercises to more reliable targets and aligns code, lessons, and tests.

  • Replace Wikipedia- and npm-based exercises with: IMO members (listing), WTA rankings (links and player birthplaces), and GitHub Topics LLM projects (JS); remove old Wikipedia/npm scripts and add new implementations for both JS and Python
  • Update lesson markdowns (JS/Python) to import new exercises, tweak instructions, examples, and hints; add tip for Cheerio .eq(); adjust headings and sample outputs
  • Modify Crawlee Netflix ratings: change IMDb search result selector to .ipc-title-link-wrapper, only process first 5 films, and export dataset; mirror changes in Python version
  • Revise tests: update expectations to new sources/outputs, relax strict counts (e.g., drivers > 6), change publish-date check to Mon, stock count expectation 77→76, add quieter uv -q, and adapt JSON filtering threshold (min_price > 50000)
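
As a rough illustration of two of the test relaxations in the last bullet, a minimal Python sketch might look like this. The sample products, the string-typed `min_price` values, and the helper names `filter_expensive` and `has_enough_drivers` are assumptions made for illustration; the actual checks live in the `test.bats` files and the exercise scripts.

```python
import json

# Hypothetical sample data; the real exercise reads the course's own product export.
SAMPLE_JSON = """
[
  {"title": "Cheap speaker", "min_price": "7495"},
  {"title": "Premium TV", "min_price": "139800"},
  {"title": "Borderline item", "min_price": "50000"}
]
"""

MIN_PRICE_THRESHOLD = 50000  # threshold named in the summary above


def filter_expensive(products, threshold=MIN_PRICE_THRESHOLD):
    """Keep only products whose min_price is strictly above the threshold."""
    return [p for p in products if int(p["min_price"]) > threshold]


def has_enough_drivers(output, minimum=6):
    """Relaxed check: pass when the printed count exceeds a lower bound,
    instead of requiring one exact number that breaks when the site changes."""
    return int(output.strip()) > minimum


if __name__ == "__main__":
    for product in filter_expensive(json.loads(SAMPLE_JSON)):
        print(product["title"])
```

A lower-bound check like `has_enough_drivers` trades precision for stability: it still catches an empty or broken scrape, but it survives small changes on the target site.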

Written by Cursor Bugbot for commit d1cc411.

@honzajavorek honzajavorek added the t-academy label (Issues related to Web Scraping and Apify academies.) Jan 14, 2026
@honzajavorek honzajavorek changed the title from "Fix broken Academy exercises" to "fix: rework broken Academy exercises" Jan 14, 2026
@honzajavorek honzajavorek force-pushed the honzajavorek/fix-wikipedia branch from b56a4fc to 0e71496 January 15, 2026 08:43
@apify-service-account

Preview for this PR was built for commit b2dee1b and is ready at https://pr-2180.preview.docs.apify.com!

@apify-service-account

Preview for this PR was built for commit 404760c and is ready at https://pr-2180.preview.docs.apify.com!

@apify-service-account

Preview for this PR was built for commit c01c0a8 and is ready at https://pr-2180.preview.docs.apify.com!

@apify-service-account

Preview for this PR was built for commit d1cc411 and is ready at https://pr-2180.preview.docs.apify.com!

@honzajavorek honzajavorek marked this pull request as ready for review January 15, 2026 15:08
@honzajavorek honzajavorek requested a review from TC-MO as a code owner January 15, 2026 15:08

@cursor cursor bot left a comment

This PR is being reviewed by Cursor Bugbot

@apify-service-account

Preview for this PR was built for commit 37a06ef and is ready at https://pr-2180.preview.docs.apify.com!

Development

Successfully merging this pull request may close these issues.

Fix broken exercises

3 participants