fix: rework broken Academy exercises #2180

honzajavorek · 2026-01-14T16:30:54Z

Wikipedia is unreliable as a scraping target; people are advised to use its API to get the content of articles. Also, after recent security issues, GitHub tightened up the npm registry. So this PR is mostly about replacing the exercises that target those two websites.

Meanwhile, IMDb changed its structure, so I fixed that here too. I also had to change some tests, as they proved to be too draconian.

Some of the time I spent on this was not fruitful: I reworked the examples to use the UNESCO website, which later proved to be very unreliable. I also found apify/crawlee-python#1673 and spent some time debugging it. Fixes #2113, at least for now 😅

Here is proof from my local machine:

npm run test:academy

> [email protected] test:academy
> bats --print-output-on-failure -r .

./sources/academy/webscraping/scraping_basics_javascript/exercises/test.bats
 ✓ outputs the HTML with Star Wars products
 ✓ counts the number of F1 Academy teams
 ✓ counts the number of F1 Academy drivers
 ✓ lists IMO countries
 ✓ lists IMO countries with a single selector
 ✓ lists Guardian F1 article titles
 ✓ prints warehouse stock counts
 ✓ prints warehouse stock counts using regex
 ✓ prints Guardian F1 titles with publish dates
 ✓ filters products from JSON
 ✓ lists WTA player links
 ✓ lists Guardian F1 article links
 ✓ lists WTA player birthplaces
 ✓ lists Guardian F1 authors
 ✓ lists JavaScript GitHub repos with the LLM topic
 ✓ finds the shortest CNN sports article
 ✓ scrapes F1 Academy driver details with Crawlee
 ✓ scrapes Netflix ratings with Crawlee
./sources/academy/webscraping/scraping_basics_python/exercises/test.bats
 ✓ outputs the HTML with Star Wars products
 ✓ counts the number of F1 Academy teams
 ✓ counts the number of F1 Academy drivers
 ✓ lists IMO countries
 ✓ lists IMO countries with a single selector
 ✓ lists Guardian F1 article titles
 ✓ prints warehouse stock counts
 ✓ prints warehouse stock counts using regex
 ✓ prints Guardian F1 titles with publish dates
 ✓ filters products from JSON
 ✓ lists WTA player links
 ✓ lists Guardian F1 article links
 ✓ lists WTA player birthplaces
 ✓ lists Guardian F1 authors
 ✓ lists Python database jobs
 ✓ finds the shortest CNN sports article
 ✓ scrapes F1 Academy driver details with Crawlee
 ✓ scrapes Netflix ratings with Crawlee

36 tests, 0 failures

I have no idea whether all of the exercises will work correctly from the data center IPs of GitHub Actions; we'll see once the tests run there. But at least they'll work for students trying to pass the courses.

Note

Shifts Academy scraping exercises to more reliable targets and aligns code, lessons, and tests.

- Replace Wikipedia- and npm-based exercises with IMO members (listing), WTA rankings (links and player birthplaces), and GitHub Topics LLM projects (JS); remove old Wikipedia/npm scripts and add new implementations for both JS and Python
- Update lesson markdowns (JS/Python) to import new exercises, tweak instructions, examples, and hints; add tip for Cheerio `.eq()`; adjust headings and sample outputs
- Modify Crawlee Netflix ratings: change IMDb search result selector to `.ipc-title-link-wrapper`, only process first 5 films, and export dataset; mirror changes in Python version
- Revise tests: update expectations to new sources/outputs, relax strict counts (e.g., drivers > 6), change publish-date check to Mon, stock count expectation 77→76, add quieter `uv -q`, and adapt JSON filtering threshold (`min_price > 50000`)

Written by Cursor Bugbot for commit d1cc411.
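
To illustrate two of the changes summarized above (relaxing exact counts to a lower bound such as drivers > 6, and processing only the first 5 films), here is a minimal sketch. It is written in Python rather than Bats, and the helper names and sample values are hypothetical; they are not taken from the exercise or test files in this PR.

```python
def count_exceeds(output: str, minimum: int) -> bool:
    """Return True if the printed count is strictly above a lower bound.

    Hypothetical helper: it checks a threshold instead of an exact number,
    so the check keeps passing when the target website adds entries.
    """
    return int(output.strip()) > minimum


def first_n(items: list[str], limit: int = 5) -> list[str]:
    """Keep only the first `limit` items, e.g. to process just a few films."""
    return items[:limit]


if __name__ == "__main__":
    # Sample values are made up for illustration.
    print(count_exceeds("18\n", minimum=6))
    print(len(first_n([f"film-{i}" for i in range(12)])))
```

A lower-bound check trades precision for stability: it still catches an empty or broken scrape, but it does not fail every time the live website gains or loses an entry.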

apify-service-account · 2026-01-15T10:48:18Z

Preview for this PR was built for commit b2dee1b and is ready at https://pr-2180.preview.docs.apify.com!

apify-service-account · 2026-01-15T14:07:00Z

Preview for this PR was built for commit 404760c and is ready at https://pr-2180.preview.docs.apify.com!

apify-service-account · 2026-01-15T14:43:59Z

Preview for this PR was built for commit c01c0a8 and is ready at https://pr-2180.preview.docs.apify.com!

apify-service-account · 2026-01-15T14:57:40Z

Preview for this PR was built for commit d1cc411 and is ready at https://pr-2180.preview.docs.apify.com!

cursor

This PR is being reviewed by Cursor Bugbot

sources/academy/webscraping/scraping_basics_python/exercises/crawlee_netflix_ratings.py

apify-service-account · 2026-01-15T19:42:10Z

Preview for this PR was built for commit 37a06ef and is ready at https://pr-2180.preview.docs.apify.com!

honzajavorek added the t-academy label (Issues related to Web Scraping and Apify academies.) Jan 14, 2026

honzajavorek changed the title ~~Fix broken Academy exercises~~ fix: rework broken Academy exercises Jan 14, 2026

honzajavorek added 9 commits January 15, 2026 09:41

feat: use IMO website instead of Wikipedia

5e531ee

Also fix stock unit counts and make some tests more benevolent.

feat: improve exercise guidance, shorten the JS code example

804fa84

fix: price should be in cents

f5aee7b

feat: migrate follow-up exercise from Wikipedia to IMO

08b9e3f

chore: remove redundant exercise code

5469966

feat: move links exercise from Wikipedia to UNESCO

b1d441c

feat: move links exercise from Wikipedia to UNESCO, Python course

3296ec9

feat: move away from Wikipedia to UNESCO for a crawling exercise

3bafd42

feat: move away from Wikipedia to UNESCO for a crawling exercise, JavaScript

0e71496

honzajavorek force-pushed the honzajavorek/fix-wikipedia branch from b56a4fc to 0e71496 Compare January 15, 2026 08:43

honzajavorek added 6 commits January 15, 2026 09:48

fix: edit exercise text to describe the new, simplified approach

79cceba

fix: use PascalCase so that we do UnescoWhsCount, not UNESCOWHSCount

33a7195

fix: bad extension in an import

a4618ae

style: make linter happy

c896843

fix: limit UNESCO scraping to 10 countries not to DoS them

1c1767d

fix: replace UNESCO with WTA, because UNESCO is super unreliable

b2dee1b

fix: modify Netflix/IMDb exercise so that the tests pass

404760c

Discovered apify/crawlee-python#1673 when working on this.

fix: don't scrape the npm registry, as it became highly protected

c01c0a8

honzajavorek mentioned this pull request Jan 15, 2026

Replace the 'finds the shortest CNN sports article' exercise #2183

Open

style: make linter happy

d1cc411

honzajavorek marked this pull request as ready for review January 15, 2026 15:08

honzajavorek requested a review from TC-MO as a code owner January 15, 2026 15:08

cursor bot reviewed Jan 15, 2026

sources/academy/webscraping/scraping_basics_python/exercises/crawlee_netflix_ratings.py (outdated, resolved)

fix: remove leftover line

37a06ef
