
feat: implement pydantic + automatic OpenAPI spec generation#1357

Open
twangodev wants to merge 45 commits into main from feat/pydantic

Conversation


@twangodev twangodev commented Dec 30, 2025

Summary by CodeRabbit

  • New Features

    • Public OpenAPI spec, generated API types, manifest file, and direct upload-to-S3 option; unified API client for the frontend.
  • Refactor

    • Data models consolidated into validated schemas; search switched to API-first data loading; frontend types aligned with generated schemas.
  • Bug Fixes

    • Improved null handling and parsing resilience, expanded retries and pagination, and cleanup of stale remote objects after uploads.
  • Chores

    • Simplified CI/deploy, removed legacy submodule, updated dependencies and docs.


Copilot AI review requested due to automatic review settings December 30, 2025 05:57

coderabbitai Bot commented Dec 30, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Walkthrough

Move internal models to Pydantic schemas, replace JsonSerializable with BaseModel.model_dump, add an endpoint registry and OpenAPI generation, introduce S3 upload utilities and an optional upload step, emit a manifest, refactor search to an API-first DataLoader, and migrate frontend to a generated API client and OpenAPI-derived types.

Changes

  • Schemas package & models (generation/schemas/__init__.py, generation/schemas/course.py, generation/schemas/enrollment.py, generation/schemas/grades.py, generation/schemas/instructor.py, generation/schemas/requirement_ast.py, generation/schemas/openapi.py): New Pydantic schema modules and central re-exports exposing Course/CourseReference/CoursePrerequisites, GradeData, School/Meeting/MeetingLocation/Enrollment/TermData, AST Leaf/Node/RequirementAbstractSyntaxTree; OpenAPI component extraction and spec assembly.
  • Generation modules → schema-backed (generation/course.py, generation/enrollment_data.py, generation/instructors.py, generation/requirement_ast.py, generation/enrollment.py, generation/madgrades.py, generation/cache.py): Replace legacy models with schema subclasses, add factory methods (from_json/from_block/from_madgrades_async), switch serialization to model_dump, adjust call sites (keyword args, TermData init), and update prerequisites/term/grade parsing.
  • Endpoint registry & declarations (generation/schemas/endpoint.py, generation/schemas/endpoints.py): Add Endpoint, EndpointParam, EndpointRegistry and an endpoint() factory; declare centralized endpoints (paths, params, response models) used by the OpenAPI generator.
  • OpenAPI generator & docs (generation/schemas/openapi.py, docs/public/openapi.json, docs/public/static-openapi.yaml, docs/usage/static-api.md): New generator to emit an OpenAPI 3.1 spec from registered endpoints and Pydantic schemas; adds the generated openapi.json, removes the old static-openapi.yaml, and updates the docs reference.
  • I/O, save & manifest (generation/save.py, generation/json_serializable.py): Remove JsonSerializable, support BaseModel.model_dump in recursive_sort_data and write_file, change course output to index files, and write a top-level manifest.
  • S3 upload & CLI/CI deploy (generation/upload.py, generation/main.py, .github/workflows/generation.yml): New S3 utilities (get_s3_client, upload_directory, delete_stale_objects, upload_file), an "upload" CLI step and OpenAPI export in main.py; the CI workflow is simplified to a direct S3 upload using S3 env vars.
  • Search refactor → DataLoader (search/data.py, search/data_loader.py, search/app.py, search/pyproject.toml): Remove local JSON helpers, add DataLoader (API-first with local fallback, manifest-aware), update the search app to use DataLoader, and add a requests dependency.
  • Frontend: API client & generated types (package.json, src/lib/api.ts, src/lib/types/api.d.ts, src/lib/types/*.ts, src/routes/**, src/lib/components/**): Add an openapi-fetch client and createApiClient, generate TypeScript API types, replace ad-hoc fetch calls with api / createApiClient, and update routes/components to use OpenAPI-derived models (Course, FullInstructor, CourseReference, etc.).
  • UI prop/type adjustments (src/lib/components/**, src/lib/seo/**, src/lib/types/**, src/routes/**): Components and SEO helpers updated to accept null for numeric props, use regenerated FullInstructor/Course types, rename functions (getFullInstructor), and update signatures to API types.
  • Requirement AST & parser relocation (generation/requirement_ast.py → generation/schemas/requirement_ast.py): Replace the older AST with Pydantic AST models, add RequirementParser, model_dump serialization, tree printing, and a course_combinations utility.
  • Config / deps / repo layout (generation/pyproject.toml, search/pyproject.toml, .gitmodules, .gitignore, docker-compose.yml, .env.example, data): Add dependencies (pydantic, boto3, tenacity, requests, OpenAPI tooling), remove the git submodule and its reference, add data/ to .gitignore, remove the search volume mount, add S3 env vars, and update PUBLIC_API_URL.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor CLI as "generation CLI"
  participant Gen as "Generator (main.py)"
  participant Registry as "Endpoint Registry"
  participant SaveMod as "generation.save"
  participant OpenAPI as "schemas.openapi"
  participant UploadMod as "generation.upload"
  participant S3 as "S3-compatible storage"

  CLI->>Gen: run build (generate data)
  Gen->>Registry: import endpoints (register models & paths)
  Gen->>SaveMod: write files (serialize BaseModel.model_dump)
  Gen->>OpenAPI: export_openapi_spec(output_path)
  OpenAPI-->>Gen: wrote openapi.json

  alt upload step requested and creds present
    CLI->>Gen: run upload step
    Gen->>UploadMod: upload_directory(data_dir, endpoint, creds)
    UploadMod->>S3: concurrent PUTs (files)
    S3-->>UploadMod: per-file responses
    UploadMod-->>Gen: return (successes, failures)
    Gen-->>CLI: log upload summary
  else upload skipped or creds missing
    Gen-->>CLI: skip upload
  end

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

Poem

🐇 I hopped from JSON shells to Pydantic green,
I parsed prereqs and courses, tidy and keen.
I sketched an OpenAPI and packed schemas tight,
Then flapped my boto3 wings and sent bytes into night.
A rabbit’s tiny build — neat, swift, and bright.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 71.79%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check (✅ Passed): Check skipped; CodeRabbit's high-level summary is enabled.
  • Title check (✅ Passed): The title accurately summarizes the main change: implementing Pydantic models for data schemas and automatic OpenAPI specification generation, which are the dominant themes throughout the changeset.

📜 Recent review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e330c20 and 970789d.

⛔ Files ignored due to path filters (1)
  • generation/uv.lock is excluded by !**/*.lock
📒 Files selected for processing (3)
  • generation/enrollment.py
  • generation/http_utils.py
  • generation/pyproject.toml
🚧 Files skipped from review as they are similar to previous changes (2)
  • generation/pyproject.toml
  • generation/http_utils.py
🧰 Additional context used
📓 Path-based instructions (1)
generation/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

generation/**/*.py: Implement the data generation pipeline in the generation service following the documented steps
Use GIST Large Embedding v0 for embedding generation (prefer over deprecated OpenAI embedding)

Files:

  • generation/enrollment.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build-web
  • GitHub Check: build-web
🔇 Additional comments (10)
generation/enrollment.py (10)

1-1: LGTM: Imports align with new retry and concurrency features.

The additions of Semaphore for concurrency control, tenacity for retry logic, and updated enrollment_data imports for Pydantic models are appropriate for the enhanced data fetching pipeline.

Also applies to: 9-9, 14-14


23-27: LGTM: Constants are well-documented.

The MAX_PAGE_SIZE and MAX_CONCURRENT_REQUESTS constants are appropriately defined with clear documentation explaining their purpose.


92-97: LGTM: Graceful error handling for initial fetch.

The try/except block properly handles failures when fetching the course count, with appropriate logging and graceful degradation by returning an empty dictionary.


136-137: LGTM: Proper concurrency control.

The Semaphore is correctly initialized with MAX_CONCURRENT_REQUESTS to limit concurrent API requests and prevent rate limiting.
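The concurrency pattern the review describes can be sketched in isolation. The names and the placeholder sleep below are illustrative, not the PR's actual code:

```python
import asyncio

MAX_CONCURRENT_REQUESTS = 5  # illustrative value, not the PR's constant


async def fetch_page(page: int, semaphore: asyncio.Semaphore) -> int:
    # At most MAX_CONCURRENT_REQUESTS coroutines run this body concurrently;
    # the semaphore is released automatically on exit from the async with.
    async with semaphore:
        await asyncio.sleep(0)  # stand-in for the real HTTP call
        return page


async def fetch_all(pages: int) -> list[int]:
    semaphore = asyncio.Semaphore(MAX_CONCURRENT_REQUESTS)
    tasks = [fetch_page(p, semaphore) for p in range(pages)]
    # gather preserves submission order regardless of completion order
    return await asyncio.gather(*tasks)
```

Sharing one semaphore across all tasks, as the PR does by passing it into each process_hit call, is what bounds the in-flight request count.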


141-152: LGTM: Task creation properly passes semaphore.

The task list comprehension correctly passes the semaphore to each process_hit call and properly iterates over all_hits.


325-325: LGTM: Proper parameter addition.

The semaphore: Semaphore parameter is correctly added to the function signature with proper type annotation.


336-336: LGTM: Keyword arguments improve clarity.

Using explicit keyword arguments (subjects=, course_number=) for Course.Reference construction enhances code readability and maintainability.


436-441: LGTM: Proper MeetingLocation construction.

The call to MeetingLocation.get_or_create_with_capacity correctly passes building name, room, coordinates, and class capacity to construct meeting locations.


445-454: LGTM: Meeting construction includes all required fields.

The Meeting constructor properly passes all relevant fields including timing, type, location, enrollment data, instructors, and course reference.


463-463: LGTM: TermData initialization updated.

The initialization of TermData() without arguments aligns with the new Pydantic-based model structure.




@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (14)
generation/schemas/grades.py (1)

94-116: Consider defensive error handling in from_madgrades.

Direct dictionary key access will raise KeyError if the API response is missing expected fields. For better debuggability, consider wrapping in a try-except or using .get() with validation.

This is optional since the Madgrades API format is likely stable.
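The suggested defensive pattern might look like the following sketch. The field names and helper are hypothetical, not taken from the Madgrades API or the PR:

```python
from typing import Any


def parse_grade_counts(payload: dict[str, Any]) -> dict[str, int]:
    """Illustrative only: extract expected count fields from an API-style
    payload, raising a descriptive error instead of a bare KeyError."""
    required = ["total", "aCount", "abCount", "bCount"]
    missing = [key for key in required if key not in payload]
    if missing:
        # One actionable message instead of a KeyError naming a single field
        raise ValueError(f"response missing expected fields: {missing}")
    return {key: int(payload[key]) for key in required}
```

The benefit over raw indexing is purely diagnostic: all missing fields are reported at once, with context.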

generation/schemas/openapi.py (1)

21-42: DRY violation: model list is duplicated.

The models list appears twice - here in get_all_schemas() and again in generate_openapi_spec() (lines 57-69). Consider extracting to a module-level constant.

🔎 Proposed fix
+# Core models for schema generation
+_SCHEMA_MODELS = [
+    GradeData,
+    CourseReference,
+    CoursePrerequisites,
+    Course,
+    School,
+    MeetingLocation,
+    Meeting,
+    EnrollmentData,
+    TermData,
+    RMPData,
+    FullInstructor,
+]
+

 def get_all_schemas() -> dict[str, Any]:
     """Get JSON schemas for all models."""
-    models = [
-        GradeData,
-        CourseReference,
-        CoursePrerequisites,
-        Course,
-        School,
-        MeetingLocation,
-        Meeting,
-        EnrollmentData,
-        TermData,
-        RMPData,
-        FullInstructor,
-    ]
-
     schemas = {}
-    for model in models:
+    for model in _SCHEMA_MODELS:
         schema = model.model_json_schema(ref_template="#/components/schemas/{model}")
         schemas[model.__name__] = schema

     return schemas
generation/schemas/__init__.py (1)

29-52: Minor: TermData is listed under wrong section comment.

TermData is imported from schemas.enrollment (line 14) but listed under the # Course comment section in __all__. Consider moving it to the # Enrollment section for consistency.

🔎 Proposed fix
 __all__ = [
     # Grades
     "GradeData",
     # Course
     "CourseReference",
     "CoursePrerequisites",
-    "TermData",
     "Course",
     # Enrollment
     "School",
     "MeetingLocation",
     "Meeting",
     "EnrollmentData",
+    "TermData",
     # Instructor
generation/schemas/enrollment.py (1)

63-64: Mutable default [] is safe in Pydantic but worth noting.

Pydantic v2 handles mutable defaults by creating a copy for each instance, so instructors: list[str] = [] is safe. However, for explicitness you could use Field(default_factory=list).

This is a minor style preference - the current implementation works correctly.
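The explicit alternative mentioned above can be shown with a minimal model (the class and field here are illustrative):

```python
from pydantic import BaseModel, Field


class Meeting(BaseModel):
    # default_factory makes the per-instance copy explicit; Pydantic v2
    # also deep-copies a plain `= []` default, so behavior is identical.
    instructors: list[str] = Field(default_factory=list)
```

Either spelling gives each instance its own list; mutating one instance's default never leaks into another.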

generation/instructors.py (2)

101-139: Duplicate from_rmp_data implementation.

The from_rmp_data method (lines 111-135) duplicates the implementation in schemas/instructor.py (lines 52-74 from snippets). Since RMPData extends _RMPDataBase, it inherits this method.

If the implementations are identical, consider removing this override. If there are subtle differences, document why.

#!/bin/bash
# Compare from_rmp_data implementations
echo "=== schemas/instructor.py ===" 
rg -A 25 "def from_rmp_data" generation/schemas/instructor.py

echo "=== instructors.py ==="
rg -A 25 "def from_rmp_data" generation/instructors.py

142-145: Potentially redundant model_config.

The base class _FullInstructorBase (from schemas/instructor.py line 83) already defines model_config = ConfigDict(arbitrary_types_allowed=True). This redefinition on line 145 is redundant unless you intend to override or extend it.

generation/requirement_ast.py (1)

324-339: Consider using keyword arguments for Pydantic model construction.

Positional arguments work here because Node has fields operator then children in order, but using keyword arguments would be more explicit and resilient to field reordering:

Node(operator="OR", children=children)

This is a minor readability suggestion.

generation/course.py (3)

45-81: Code duplication: from_string is duplicated between base and extended class.

The from_string method here (lines 51-69) duplicates the same implementation in generation/schemas/course.py (lines 47-65 in relevant snippets). Consider removing this override since the base class already provides the same functionality, or call super().from_string() if you need additional processing.

🔎 Suggested simplification

If no additional behavior is needed, remove the override entirely:

 class CourseReference(_CourseReferenceBase):
     """Extended CourseReference for backward compatibility."""

     # Allow use as dict key
     model_config = ConfigDict(frozen=True)

-    @classmethod
-    def from_string(cls, course_reference_str: str) -> CourseReference:
-        """Parse a course reference from a string like 'CS/ECE 252'."""
-        course_reference_str = cleanup_course_reference_str(course_reference_str)
-        if not course_reference_str:
-            raise ValueError("Empty course reference string")
-
-        match = re.match(r"(\D+)(\d+)", course_reference_str)
-        if not match:
-            raise ValueError(f"Invalid course reference format: {course_reference_str}")
-
-        course_subject_str = match.group(1).replace(" ", "").strip()
-        raw_course_subjects = course_subject_str.split("/")
-        course_subjects = frozenset(
-            subject.replace(" ", "") for subject in raw_course_subjects
-        )
-        course_number = int(match.group(2).strip())
-
-        return cls(subjects=course_subjects, course_number=course_number)
-
     @classmethod
     def from_json(cls, json_data: dict) -> CourseReference:

115-125: Consider parsing abstract_syntax_tree into the proper type.

The abstract_syntax_tree is passed as a raw dict from JSON (line 124), while the CoursePrerequisites schema types it as Any | None (RequirementAbstractSyntaxTree). For consistency with the Pydantic migration, consider parsing it:

from requirement_ast import RequirementAbstractSyntaxTree
# ...
abstract_syntax_tree=RequirementAbstractSyntaxTree.from_json(prereqs_data.get("abstract_syntax_tree"))
    if prereqs_data.get("abstract_syntax_tree") else None,

This would ensure full type safety and enable proper serialization behavior.


218-225: Potential unhandled exception from CourseReference.from_string.

Line 220 calls CourseReference.from_string(title) which can raise ValueError for invalid course reference formats. Unlike the parser.parse() call (which has exception handling), this could cause the entire course parsing to fail on a malformed link.

Consider wrapping this in a try-except:

🔎 Proposed fix
         for node in requisites_data.contents:
             if isinstance(node, NavigableString):
                 linked_requisite_text.append(node.get_text())
             elif node.name == "a":
                 title = node.get("title", "").strip()
-                reference = CourseReference.from_string(title)
-
-                requisites_courses.add(reference)
-                linked_requisite_text.append(reference)
+                try:
+                    reference = CourseReference.from_string(title)
+                    requisites_courses.add(reference)
+                    linked_requisite_text.append(reference)
+                except ValueError:
+                    # Malformed course reference, treat as plain text
+                    linked_requisite_text.append(title)
             else:
                 linked_requisite_text.append(node.get_text(strip=True))
generation/schemas/instructor.py (1)

53-77: Consider adding defensive handling for missing API fields.

The from_rmp_data factory directly accesses nested keys (e.g., rmp_data["ratings"]["edges"], rmp_data["id"]) which will raise KeyError if the RMP API response is malformed or incomplete. Consider using .get() with defaults for optional fields to improve robustness.

generation/schemas/course.py (1)

173-178: Potential ValueError if term keys are non-numeric.

get_latest_term_data assumes all term keys can be converted to integers (line 177). If the data contains a non-numeric term key, int(x) will raise ValueError.

Consider adding validation or using a safer approach:

🔎 Proposed defensive approach
     def get_latest_term_data(self) -> TermData | None:
         """Get the latest term data for the course."""
         if not self.term_data:
             return None
-        latest_term = max(self.term_data.keys(), key=lambda x: int(x))
+        # Filter to numeric term keys only
+        numeric_terms = [k for k in self.term_data.keys() if k.isdigit()]
+        if not numeric_terms:
+            return None
+        latest_term = max(numeric_terms, key=int)
         return self.term_data[latest_term]
generation/enrollment_data.py (2)

35-64: Custom __init__ override may conflict with Pydantic internals.

Overriding __init__ in a Pydantic model (lines 47-55) can lead to subtle issues with validation and serialization. The private _instructors_set (line 45) won't be serialized or properly handled by Pydantic.

Consider using a model_validator with mode="after" instead for post-initialization logic:

🔎 Suggested pattern using model_validator
from pydantic import model_validator

class GradeData(_GradeDataBase):
    """Extended GradeData with set-based instructors for internal processing."""

    model_config = ConfigDict(populate_by_name=True, frozen=False)

    _instructors_set: set[str] | None = None

    @model_validator(mode="after")
    def _convert_instructors_to_set(self) -> "GradeData":
        """Cache instructors as a set for efficient operations."""
        if self.instructors is not None:
            object.__setattr__(self, "_instructors_set", set(self.instructors))
        return self

184-239: Class-level cache _all_locations may cause issues in testing and long-running processes.

The _all_locations dict (line 190) is a class variable that:

  1. Persists indefinitely, potentially causing memory growth in long-running processes
  2. Is not thread-safe for concurrent access
  3. Can cause test pollution across test cases

Consider adding a clear_cache() class method for testing and monitoring the cache size in production:

🔎 Suggested addition
@classmethod
def clear_cache(cls) -> None:
    """Clear the location cache. Useful for testing."""
    cls._all_locations.clear()

@classmethod
def cache_size(cls) -> int:
    """Return the current cache size for monitoring."""
    return len(cls._all_locations)
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8b0fd80 and 1f29a06.

⛔ Files ignored due to path filters (1)
  • generation/uv.lock is excluded by !**/*.lock
📒 Files selected for processing (14)
  • generation/course.py
  • generation/enrollment_data.py
  • generation/instructors.py
  • generation/json_serializable.py
  • generation/pyproject.toml
  • generation/requirement_ast.py
  • generation/save.py
  • generation/schemas/__init__.py
  • generation/schemas/course.py
  • generation/schemas/enrollment.py
  • generation/schemas/grades.py
  • generation/schemas/instructor.py
  • generation/schemas/openapi.py
  • generation/schemas/requirement_ast.py
💤 Files with no reviewable changes (1)
  • generation/json_serializable.py
🧰 Additional context used
📓 Path-based instructions (2)
{generation,search}/**/{pyproject.toml,uv.lock}

📄 CodeRabbit inference engine (CLAUDE.md)

Manage Python service dependencies with uv (not pip/pipenv/conda)

Files:

  • generation/pyproject.toml
generation/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

generation/**/*.py: Implement the data generation pipeline in the generation service following the documented steps
Use GIST Large Embedding v0 for embedding generation (prefer over deprecated OpenAI embedding)

Files:

  • generation/schemas/grades.py
  • generation/schemas/__init__.py
  • generation/instructors.py
  • generation/schemas/enrollment.py
  • generation/save.py
  • generation/course.py
  • generation/schemas/openapi.py
  • generation/schemas/instructor.py
  • generation/schemas/course.py
  • generation/schemas/requirement_ast.py
  • generation/requirement_ast.py
  • generation/enrollment_data.py
🧬 Code graph analysis (7)
generation/schemas/grades.py (1)
generation/enrollment_data.py (4)
  • GradeData (35-181)
  • empty (67-88)
  • merge_with (90-129)
  • from_madgrades (132-153)
generation/instructors.py (1)
generation/schemas/instructor.py (3)
  • RMPData (40-77)
  • FullInstructor (80-108)
  • from_rmp_data (54-77)
generation/save.py (1)
generation/schemas/requirement_ast.py (2)
  • model_dump (28-30)
  • model_dump (73-75)
generation/course.py (2)
generation/schemas/course.py (5)
  • CourseReference (26-87)
  • CoursePrerequisites (90-127)
  • Course (130-193)
  • cleanup_course_reference_str (18-23)
  • from_string (48-65)
generation/requirement_ast.py (3)
  • to_dict (265-267)
  • to_dict (287-289)
  • to_dict (305-307)
generation/schemas/instructor.py (1)
generation/instructors.py (4)
  • RMPData (101-139)
  • from_rmp_data (112-135)
  • to_dict (137-139)
  • to_dict (179-181)
generation/schemas/requirement_ast.py (1)
generation/requirement_ast.py (6)
  • Leaf (250-267)
  • to_dict (265-267)
  • to_dict (287-289)
  • to_dict (305-307)
  • Node (270-289)
  • RequirementAbstractSyntaxTree (292-307)
generation/requirement_ast.py (1)
generation/schemas/requirement_ast.py (5)
  • Leaf (10-33)
  • Node (36-58)
  • RequirementAbstractSyntaxTree (61-116)
  • model_dump (28-30)
  • model_dump (73-75)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build-web
  • GitHub Check: build-web
🔇 Additional comments (26)
generation/pyproject.toml (1)

22-22: LGTM!

The Pydantic dependency addition is appropriate for the migration. The version constraint >=2.12.5 ensures access to Pydantic v2 features used throughout the PR (ConfigDict, model_dump, field_serializer, etc.).

generation/save.py (3)

10-11: LGTM!

Clean migration from JsonSerializable to Pydantic BaseModel. The import is correctly placed.


384-386: LGTM!

The BaseModel handling correctly delegates to model_dump() before recursively sorting. This works well even with custom model_dump() implementations (like in requirement_ast.py) that may return strings instead of dicts.


417-430: LGTM!

The type validation and conversion logic correctly handles Pydantic models:

  • Type check on line 417 accepts BaseModel
  • Conversion on lines 429-430 properly calls model_dump()
  • Error message is updated to reflect the new supported type
generation/schemas/grades.py (2)

60-72: LGTM!

The merge_with method correctly handles all numeric field summation and instructor set union. The use of set() for deduplication is appropriate even though it doesn't preserve original ordering.


8-15: The GradeData classes serve distinct purposes and are not duplicates. generation/enrollment_data.py explicitly inherits from schemas.grades.GradeData (imported as _GradeDataBase) to create an extended version for internal processing. The base schema in schemas/grades.py is the public API model, while the extended version in enrollment_data.py adds set-based instructor handling via _instructors_set and instructors_as_set for optimization. This is intentional architecture, not duplication.

Likely an incorrect or invalid review comment.

generation/schemas/openapi.py (1)

493-519: LGTM!

The CLI implementation is clean with well-documented arguments. The --schemas-only flag provides useful flexibility for different use cases.

generation/schemas/enrollment.py (2)

29-49: Redundant __hash__ and __eq__ with frozen=True.

When frozen=True is set in ConfigDict, Pydantic automatically generates __hash__ based on all fields. The explicit implementations here use only a subset of fields (building, room, coordinates), excluding capacity.

If this exclusion is intentional (e.g., two locations are "equal" regardless of capacity), the explicit implementation is correct. Otherwise, consider removing the manual implementations.


119-137: LGTM!

The from_enrollment factory method cleanly maps API response fields to the model. The late import of safe_parse avoids potential circular imports. The empty instructors={} suggests it's populated in a subsequent processing step.

generation/instructors.py (1)

147-177: LGTM!

The from_json factory method cleanly handles:

  • Optional fields via .get()
  • Nested object deserialization (GradeData, Course.Reference, RMPData)
  • Proper null handling for optional nested structures
generation/requirement_ast.py (2)

310-363: LGTM! Clean recursive descent parser.

The RequirementParser implementation is well-structured with clear separation of precedence levels (parse_or > parse_and > parse_primary). The tokenizer integration and AST construction are clean.


250-307: LGTM!

The extended classes (Leaf, Node, RequirementAbstractSyntaxTree) properly implement:

  • from_json factory methods for deserialization with appropriate type handling
  • to_dict methods delegating to model_dump() for consistent serialization

The inheritance pattern (extending schema base classes with factory methods) is clean and maintains separation between schema definitions and processing logic.

generation/schemas/requirement_ast.py (4)

10-34: LGTM on Leaf model structure.

The Leaf model correctly handles polymorphic payload serialization. The model_dump override returning the payload directly is appropriate for this flattened serialization approach.


36-58: LGTM on Node model.

The recursive serialization of children is correctly implemented, and the structural difference from Leaf (returning a dict with operator/children vs. just the payload value) is intentional and appropriate for the AST representation.


80-116: LGTM on course_combinations logic.

The recursive implementation correctly handles AND (cartesian product) and OR (union) operators. The deduplication at the end appropriately filters empty combinations and uses a sorted tuple key for uniqueness.
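The AND/OR semantics described above can be sketched independently of the PR's actual AST classes; the tuple-based node encoding below is purely illustrative:

```python
from itertools import product


def combinations(node) -> list[frozenset[str]]:
    """Enumerate course sets satisfying a toy AST where a leaf is a course
    code (str) and a node is a tuple ("AND" | "OR", children)."""
    if isinstance(node, str):  # leaf: exactly one way to satisfy it
        return [frozenset([node])]
    operator, children = node
    child_results = [combinations(child) for child in children]
    if operator == "AND":
        # Cartesian product: pick one satisfying set per child, union them
        return [frozenset().union(*combo) for combo in product(*child_results)]
    # OR: any single child's satisfying sets suffice
    return [combo for result in child_results for combo in result]
```

For example, "MATH221 or (CS200 and CS300)" yields two satisfying sets: {MATH221} and {CS200, CS300}.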


119-148: LGTM on tree representation and forward reference handling.

The _tree_repr helper provides useful debugging output, and the model_rebuild() calls are correctly placed to resolve forward references for Node and RequirementAbstractSyntaxTree.

generation/course.py (1)

256-258: LGTM on serialization method.

The to_dict method correctly delegates to model_dump() for Pydantic-based serialization.

generation/schemas/instructor.py (2)

12-38: LGTM on supporting data models.

MandatoryAttendance, RatingsDistribution, and Rating are well-structured with appropriate defaults. These provide clear type definitions for the RMP data structures.


80-108: LGTM on FullInstructor schema.

The use of list[Any] for courses_taught with a comment is a reasonable approach to avoid circular imports with CourseReference. The custom serializer correctly handles different object types.

generation/schemas/course.py (4)

14-23: LGTM on utility functions.

The remove_extra_spaces and cleanup_course_reference_str functions provide consistent string normalization for course reference parsing.


26-87: LGTM on CourseReference model.

The frozen configuration with explicit __hash__ and __eq__ implementations ensures reliable dict key usage. The from_string parser correctly handles multi-subject courses like 'CS/ECE 252'.


90-127: LGTM on CoursePrerequisites model.

The serializers correctly handle the polymorphic linked_requisite_text and the duck-typed abstract_syntax_tree. Using Any for the AST type is a reasonable trade-off to avoid circular imports.


180-193: Verify: __hash__ and __eq__ use different fields.

__hash__ uses only get_identifier() (line 181) while __eq__ also checks course_title and description (lines 186-189). This is technically valid (hash collisions are allowed), but can lead to unexpected behavior: two courses with the same identifier but different titles will hash the same but not be equal, potentially causing set/dict lookup issues.

Consider aligning these methods or documenting this intentional asymmetry.
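The asymmetry can be reproduced with a toy class (names are illustrative, not the PR's Course model). Note that because equality checks a superset of the hashed fields, the hash contract (equal objects hash equally) still holds; the cost is that same-identifier objects share a hash bucket yet remain distinct set members:

```python
class CourseKey:
    """Toy model mirroring the described asymmetry: hash on the identifier
    only, equality on identifier plus title."""

    def __init__(self, identifier: str, title: str):
        self.identifier = identifier
        self.title = title

    def __hash__(self) -> int:
        return hash(self.identifier)

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, CourseKey):
            return NotImplemented
        return (self.identifier, self.title) == (other.identifier, other.title)
```

Two instances with the same identifier but different titles collide into one bucket but both survive in a set, which may surprise code that expects identifier-keyed deduplication.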

generation/enrollment_data.py (3)

257-286: LGTM on Meeting factory method.

The inline import of CourseReference is a reasonable approach to avoid circular dependencies, and the use of model_validate aligns with Pydantic v2 patterns.


289-331: LGTM on EnrollmentData factory methods.

The from_json and from_enrollment methods correctly handle nested structures and provide sensible defaults. The credit_count conversion from list to tuple is safe given the default of [0, 0].


384-389: LGTM on GradeData reconstruction with instructors.

The pattern of using **grade_data.model_dump() with an updated instructors field is the correct way to create a modified copy of a Pydantic model.
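For reference, Pydantic v2 also offers model_copy(update=...) for this copy-with-changes pattern; unlike re-constructing from model_dump(), it skips re-validation. A minimal sketch with an illustrative model:

```python
from pydantic import BaseModel, Field


class GradeData(BaseModel):
    total: int
    instructors: list[str] = Field(default_factory=list)


original = GradeData(total=120)
# Produces a modified copy; note the update values are NOT re-validated,
# so re-construction via model_dump() is safer for untrusted input.
updated = original.model_copy(update={"instructors": ["Doe"]})
```

Which to prefer depends on whether the updated field needs validation; for internally produced values, model_copy is the cheaper idiom.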

Comment thread generation/schemas/openapi.py Outdated

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Comment on lines 366 to +367
logger.error(f"Failed to fetch Madgrades data from {url}: {e}")
raise

Copilot AI Dec 30, 2025


The exception is raised but the flow continues. If an exception occurs after all retry attempts, it should be raised to propagate the error properly. However, the current code raises the exception after logging but doesn't prevent the subsequent code from executing if the raise somehow fails. Consider restructuring to ensure the error is properly propagated.

Comment on lines +187 to +188
model_config = ConfigDict(frozen=False) # Allow mutation for capacity updates


Copilot AI Dec 30, 2025


The model_config setting frozen=False conflicts with the frozen=True setting from the base class. The base _MeetingLocationBase in schemas/enrollment.py is frozen, but this extended class overrides it to allow mutation. This inconsistency could lead to confusion. Consider either keeping the base model mutable or finding a different approach to handle capacity updates (e.g., using a separate mutable cache class).

Suggested change
model_config = ConfigDict(frozen=False) # Allow mutation for capacity updates


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
generation/schemas/openapi.py (2)

22-76: Extract the models list to eliminate duplication.

The same models list is defined in both get_all_schemas() (lines 24-39) and generate_openapi_spec() (lines 61-76). This creates a maintenance burden—adding or removing a model requires changes in both places, increasing the risk of inconsistency.

🔎 Proposed refactor
+# Shared list of all models for schema generation
+_ALL_MODELS = [
+    GradeData,
+    CourseReference,
+    CoursePrerequisites,
+    Course,
+    School,
+    MeetingLocation,
+    Meeting,
+    EnrollmentData,
+    TermData,
+    RMPData,
+    FullInstructor,
+    RequirementAbstractSyntaxTree,
+    Leaf,
+    Node,
+]
+
+
 def get_all_schemas() -> dict[str, Any]:
     """Get JSON schemas for all models."""
-    models = [
-        GradeData,
-        CourseReference,
-        CoursePrerequisites,
-        Course,
-        School,
-        MeetingLocation,
-        Meeting,
-        EnrollmentData,
-        TermData,
-        RMPData,
-        FullInstructor,
-        RequirementAbstractSyntaxTree,
-        Leaf,
-        Node,
-    ]
-
     schemas = {}
-    for model in models:
+    for model in _ALL_MODELS:
         schema = model.model_json_schema(ref_template="#/components/schemas/{model}")
         schemas[model.__name__] = schema

     return schemas

Then in generate_openapi_spec():

     # Get component schemas
     component_schemas = {}

-    # Core models with their schemas
-    models = [
-        GradeData,
-        CourseReference,
-        CoursePrerequisites,
-        Course,
-        School,
-        MeetingLocation,
-        Meeting,
-        EnrollmentData,
-        TermData,
-        RMPData,
-        FullInstructor,
-        RequirementAbstractSyntaxTree,
-        Leaf,
-        Node,
-    ]
-
-    for model in models:
+    for model in _ALL_MODELS:
         schema = model.model_json_schema(ref_template="#/components/schemas/{model}")

481-497: Consider adding error handling for file I/O operations.

Both export functions write to files without error handling. While not critical, adding try-except blocks would provide clearer error messages if file operations fail (e.g., directory doesn't exist, insufficient permissions).

🔎 Example implementation
 def export_openapi_spec(
     output_path: str = "openapi.json",
     **kwargs,
 ) -> None:
     """Export the OpenAPI spec to a JSON file."""
-    spec = generate_openapi_spec(**kwargs)
-    with open(output_path, "w") as f:
-        json.dump(spec, f, indent=2)
-    print(f"OpenAPI spec exported to {output_path}")
+    try:
+        spec = generate_openapi_spec(**kwargs)
+        with open(output_path, "w") as f:
+            json.dump(spec, f, indent=2)
+        print(f"OpenAPI spec exported to {output_path}")
+    except OSError as e:
+        print(f"Error writing OpenAPI spec to {output_path}: {e}")
+        raise


 def export_json_schemas(output_path: str = "schemas.json") -> None:
     """Export all JSON schemas to a file."""
-    schemas = get_all_schemas()
-    with open(output_path, "w") as f:
-        json.dump(schemas, f, indent=2)
-    print(f"JSON schemas exported to {output_path}")
+    try:
+        schemas = get_all_schemas()
+        with open(output_path, "w") as f:
+            json.dump(schemas, f, indent=2)
+        print(f"JSON schemas exported to {output_path}")
+    except OSError as e:
+        print(f"Error writing JSON schemas to {output_path}: {e}")
+        raise
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1f29a06 and bc5db06.

📒 Files selected for processing (1)
  • generation/schemas/openapi.py
🧰 Additional context used
📓 Path-based instructions (1)
generation/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

generation/**/*.py: Implement the data generation pipeline in the generation service following the documented steps
Use GIST Large Embedding v0 for embedding generation (prefer over deprecated OpenAI embedding)

Files:

  • generation/schemas/openapi.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build-web
  • GitHub Check: build-web
🔇 Additional comments (1)
generation/schemas/openapi.py (1)

36-38: Past review feedback addressed.

The requirement AST models (RequirementAbstractSyntaxTree, Leaf, Node) are now properly included in the models list, addressing the previous review comment. These types will be correctly exported in the OpenAPI schema.

Comment thread generation/schemas/openapi.py Outdated
Comment thread .github/workflows/generation.yml Fixed

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
.env.example (1)

23-26: Quote the empty string values for consistency.

The empty string values for S3 configuration should be wrapped in quotes for consistency with other environment variables in the file and to ensure compatibility with various .env parsers. Additionally, add a blank line at the end of the file.

🔎 Proposed fix
-S3_ENDPOINT_URL='' # S3-compatible endpoint URL (e.g., https://<account_id>.r2.cloudflarestorage.com)
-S3_ACCESS_KEY_ID='' # S3 access key ID
-S3_SECRET_ACCESS_KEY='' # S3 secret access key
-S3_BUCKET='' # S3 bucket name
+S3_ENDPOINT_URL='' # S3-compatible endpoint URL (e.g., https://<account_id>.r2.cloudflarestorage.com)
+S3_ACCESS_KEY_ID='' # S3 access key ID
+S3_SECRET_ACCESS_KEY='' # S3 secret access key
+S3_BUCKET='' # S3 bucket name
+

📥 Commits

Reviewing files that changed from the base of the PR and between bc5db06 and b4773d7.

⛔ Files ignored due to path filters (1)
  • generation/uv.lock is excluded by !**/*.lock
📒 Files selected for processing (7)
  • .env.example
  • .github/workflows/generation.yml
  • generation/main.py
  • generation/pyproject.toml
  • generation/save.py
  • generation/schemas/openapi.py
  • generation/upload.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • generation/save.py
  • generation/pyproject.toml
🧰 Additional context used
📓 Path-based instructions (1)
generation/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

generation/**/*.py: Implement the data generation pipeline in the generation service following the documented steps
Use GIST Large Embedding v0 for embedding generation (prefer over deprecated OpenAI embedding)

Files:

  • generation/schemas/openapi.py
  • generation/upload.py
  • generation/main.py
🧠 Learnings (2)
📚 Learning: 2025-08-24T08:36:25.768Z
Learnt from: CR
Repo: twangodev/uw-coursemap PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-08-24T08:36:25.768Z
Learning: Expose and consume configuration via environment variables across services (e.g., PUBLIC_API_URL, PUBLIC_SEARCH_API_URL)

Applied to files:

  • .env.example
📚 Learning: 2025-08-24T08:36:25.768Z
Learnt from: CR
Repo: twangodev/uw-coursemap PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-08-24T08:36:25.768Z
Learning: Applies to generation/**/*.py : Implement the data generation pipeline in the generation service following the documented steps

Applied to files:

  • generation/main.py
🧬 Code graph analysis (1)
generation/main.py (2)
generation/schemas/openapi.py (1)
  • export_openapi_spec (482-490)
generation/upload.py (1)
  • upload_directory (70-136)
🪛 dotenv-linter (4.0.0)
.env.example

[warning] 8-8: [QuoteCharacter] The value has quote characters (', ")

(QuoteCharacter)


[warning] 23-23: [UnorderedKey] The S3_ENDPOINT_URL key should go before the SITEMAP_BASE key

(UnorderedKey)


[warning] 23-23: [ValueWithoutQuotes] This value needs to be surrounded in quotes

(ValueWithoutQuotes)


[warning] 25-25: [EndingBlankLine] No blank line at the end of the file

(EndingBlankLine)


[warning] 25-25: [UnorderedKey] The S3_SECRET_ACCESS_KEY key should go before the SITEMAP_BASE key

(UnorderedKey)


[warning] 25-25: [ValueWithoutQuotes] This value needs to be surrounded in quotes

(ValueWithoutQuotes)

🪛 GitHub Check: CodeQL
.github/workflows/generation.yml

[warning] 57-91: Workflow does not contain permissions
Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {{contents: read}}

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build-web
  • GitHub Check: build-web
🔇 Additional comments (12)
generation/main.py (4)

108-114: LGTM: Clean opt-in pattern for sensitive operations.

The OPT_IN_STEPS mechanism ensures that sensitive operations like uploading to S3 require explicit opt-in and won't run automatically with --step all. This is a good safety pattern.


446-448: Verify OpenAPI export timing.

The OpenAPI spec is generated unconditionally whenever --no_build is not set, regardless of which step is executed. This means running --step courses will also generate the OpenAPI spec. Confirm this is the intended behavior, as it may be more appropriate to only generate the spec during the "aggregate" or a dedicated step.


450-476: LGTM: Robust credential validation and error handling.

The upload step implementation includes:

  • Proper credential validation with clear warning messages
  • Graceful handling of missing credentials
  • Appropriate logging for both success and failure cases

49-50: No changes needed. The imports in lines 49-50 work correctly as written. The generation directory is executed as a script module with working-directory: ./generation set before running python main.py, making the unqualified imports (from schemas.openapi import, from upload import) resolve correctly. This pattern is consistent throughout other files in the generation module and requires no modification.

.github/workflows/generation.yml (1)

29-29: LGTM: Using version file improves consistency.

Using .python-version file ensures consistent Python versions across jobs and makes version management easier.

generation/schemas/openapi.py (3)

23-479: LGTM: Comprehensive OpenAPI 3.1.0 specification.

The implementation includes:

  • All relevant data models (including AST types)
  • Proper schema reference handling with $defs flattening
  • Comprehensive endpoint definitions with proper documentation
  • Correct OpenAPI 3.1.0 structure

482-527: LGTM: Clean export utilities and CLI.

The export functions are simple and focused, and the CLI provides useful options for generating either the full OpenAPI spec or just the JSON schemas.


8-18: The imports in generation/schemas/openapi.py (lines 8-18) use the standard absolute import pattern that is consistently applied throughout the entire codebase. The presence of generation/schemas/__init__.py and the project structure in generation/pyproject.toml confirms schemas is properly configured as a package. This same import pattern is used in generation/main.py (line 49: from schemas.openapi import export_openapi_spec) and across all other modules in the generation package, demonstrating it is the correct and intentional convention for this project. No changes are required.

generation/upload.py (4)

18-32: LGTM: Appropriate S3 client configuration.

The client configuration uses adaptive retry mode with 3 attempts, which is suitable for handling transient network issues during uploads.
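The retry setup praised here reduces to a small mapping; a sketch of building it as a plain dict (the key names follow the shape botocore's `Config(retries=...)` accepts — `max_attempts` and `mode` — but the helper itself is hypothetical, not code from this PR):

```python
def build_retry_config(max_attempts: int = 3) -> dict:
    # "adaptive" mode layers client-side rate limiting on top of the
    # usual exponential backoff, which suits bursty bulk uploads.
    return {"max_attempts": max_attempts, "mode": "adaptive"}


retry_config = build_retry_config()
# In real code this would feed the client, roughly:
#   boto3.client("s3", config=botocore.config.Config(retries=retry_config))
```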


35-42: LGTM: Robust content-type detection.

The fallback chain ensures proper MIME types:

  1. Standard mimetypes library
  2. Explicit .json handling
  3. Safe default (application/octet-stream)

45-67: LGTM: Proper file upload with appropriate headers.

The function correctly:

  • Uses binary mode for file reading
  • Sets ContentType and CacheControl headers
  • Handles exceptions gracefully
  • Returns clear success/failure status

The 1-week cache control is appropriate for relatively static API data.
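The header logic described above can be sketched as a pure helper (a hypothetical function, assuming the fallback chain and 1-week TTL the review describes; the real `upload.py` may differ):

```python
import mimetypes

ONE_WEEK_SECONDS = 7 * 24 * 60 * 60  # 604800


def build_upload_headers(filename: str) -> dict:
    # 1. standard mimetypes lookup
    content_type, _ = mimetypes.guess_type(filename)
    # 2. explicit .json handling (belt-and-suspenders; most platforms
    #    already map .json, so this branch is a fallback)
    if content_type is None and filename.endswith(".json"):
        content_type = "application/json"
    # 3. safe default for anything unrecognized
    if content_type is None:
        content_type = "application/octet-stream"
    return {
        "ContentType": content_type,
        "CacheControl": f"max-age={ONE_WEEK_SECONDS}",
    }
```

These are the keyword arguments an S3 `put_object` call would receive for each file.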


70-136: LGTM: Well-designed concurrent upload implementation.

The function demonstrates good practices:

  • Efficient concurrent uploads with ThreadPoolExecutor
  • Progress reporting via tqdm
  • Proper error tracking and logging
  • The .json suffix removal (line 106) aligns with serving JSON files without extensions in URLs

The 10-worker default provides a good balance between throughput and resource usage.
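The concurrent-upload shape being reviewed looks roughly like this stdlib sketch (the uploader is a stand-in lambda; the real code wires in S3 calls and tqdm progress, which are omitted here):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def upload_all(files, upload_one, max_workers=10):
    """Upload files concurrently; return the keys whose upload failed."""
    failed = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(upload_one, f): f for f in files}
        for future in as_completed(futures):
            if not future.result():  # upload_one returns True on success
                failed.append(futures[future])
    return failed


# Stand-in uploader: pretend anything ending in ".tmp" fails.
results = upload_all(
    ["a.json", "b.json", "c.tmp"],
    upload_one=lambda key: not key.endswith(".tmp"),
)
```

Mapping futures back to their keys is what makes per-file error tracking possible after `as_completed` yields them out of order.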

Comment thread .github/workflows/generation.yml
name: Deploy to S3
needs: generate
if: github.ref == 'refs/heads/main'
#if: github.ref == 'refs/heads/main'


🛠️ Refactor suggestion | 🟠 Major

Remove or activate the commented-out condition.

The if: github.ref == 'refs/heads/main' guard is commented out. Either remove it if intentional, or activate it to prevent accidental deployments from non-main branches.

🔎 Proposed fix

Remove the comment if deployments should run from any branch:

-    #if: github.ref == 'refs/heads/main'

Or activate it to restrict to main branch only:

-    #if: github.ref == 'refs/heads/main'
+    if: github.ref == 'refs/heads/main'
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
#if: github.ref == 'refs/heads/main'
Suggested change
#if: github.ref == 'refs/heads/main'
if: github.ref == 'refs/heads/main'
🤖 Prompt for AI Agents
.github/workflows/generation.yml around line 59: the workflow contains a
commented-out guard "#if: github.ref == 'refs/heads/main'"; either remove the
commented line if you intend the job to run on all branches, or activate it by
un-commenting (removing the leading '#') to restrict the job to only run on the
main branch; update the file accordingly and commit the change so the workflow
behavior matches the intended deployment policy.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
.github/workflows/generation.yml (1)

59-59: Commented-out branch guard remains unresolved.

The if: github.ref == 'refs/heads/main' condition is still commented out. This was flagged in a previous review. Either remove it if deployments should run from any branch, or activate it to restrict deployments to the main branch only.


📥 Commits

Reviewing files that changed from the base of the PR and between b4773d7 and 6af907d.

📒 Files selected for processing (1)
  • .github/workflows/generation.yml
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: CR
Repo: twangodev/uw-coursemap PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-08-24T08:36:25.768Z
Learning: Applies to generation/**/*.py : Implement the data generation pipeline in the generation service following the documented steps
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build-web
  • GitHub Check: build-web
🔇 Additional comments (3)
.github/workflows/generation.yml (3)

61-62: LGTM! Permissions block correctly added.

The explicit permissions: contents: read block has been added as recommended in previous reviews, properly restricting the deploy job's token scope to read-only access.


70-79: LGTM! Deploy job setup mirrors generate job correctly.

The checkout, Python setup, and dependency installation steps are properly configured and necessary for running the Python upload script in the deploy job.


85-93: The S3 upload step is properly configured. The main.py script supports the --step upload and --no_build arguments, and all required environment variables (DATA_DIR, SITEMAP_BASE, S3_ENDPOINT_URL, S3_ACCESS_KEY_ID, S3_SECRET_ACCESS_KEY, S3_BUCKET) are consumed. The implementation includes error handling for missing S3 credentials with an informative message.

Comment thread .github/workflows/generation.yml Outdated

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
.github/workflows/generation.yml (1)

59-59: The commented-out branch guard remains unresolved.

The if: github.ref == 'refs/heads/main' condition is still commented out. This was previously flagged but remains unchanged. Consider either removing it entirely if deployments should run from any branch, or activating it to restrict deployments to the main branch only.


📥 Commits

Reviewing files that changed from the base of the PR and between 6af907d and 3e7dba4.

📒 Files selected for processing (1)
  • .github/workflows/generation.yml
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: CR
Repo: twangodev/uw-coursemap PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-08-24T08:36:25.768Z
Learning: Applies to generation/**/*.py : Implement the data generation pipeline in the generation service following the documented steps
📚 Learning: 2025-08-24T08:36:25.768Z
Learnt from: CR
Repo: twangodev/uw-coursemap PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-08-24T08:36:25.768Z
Learning: Applies to {generation,search}/**/{pyproject.toml,uv.lock} : Manage Python service dependencies with uv (not pip/pipenv/conda)

Applied to files:

  • .github/workflows/generation.yml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: build-web
  • GitHub Check: build-web
  • GitHub Check: Generate Data
🔇 Additional comments (3)
.github/workflows/generation.yml (3)

85-93: LGTM! S3 credentials are properly secured.

The S3 upload step correctly uses GitHub secrets for all sensitive credentials (endpoint URL, access keys, bucket name). The --no_build flag appropriately skips data generation since the artifact was already built in the generate job.


63-65: Good addition of working-directory default.

Setting the working directory to ./generation is consistent with the generate job and necessary since the deploy job now runs Python commands that expect to execute from the generation directory.


29-29: No action needed — generation/.python-version exists and is properly configured.

The file generation/.python-version exists at the referenced path and contains a valid Python version (3.12). Both the generate and deploy jobs will successfully set up Python using this file.

Comment thread .github/workflows/generation.yml Outdated

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
generation/requirement_ast.py (1)

292-307: Consider using Leaf.from_json for consistency.

Line 301 directly constructs Leaf(payload=json_data) while line 299 uses Node.from_json(json_data). For consistency, consider:

🔎 Proposed consistency fix
     def from_json(cls, json_data) -> RequirementAbstractSyntaxTree | None:
         """Create RequirementAbstractSyntaxTree from JSON data."""
         if isinstance(json_data, dict):
             return cls(root=Node.from_json(json_data))
         elif isinstance(json_data, str):
-            return cls(root=Leaf(payload=json_data))
+            return cls(root=Leaf.from_json(json_data))
         else:
             return None

📥 Commits

Reviewing files that changed from the base of the PR and between 3e7dba4 and bef64f4.

📒 Files selected for processing (4)
  • .gitmodules
  • generation/enrollment.py
  • generation/madgrades.py
  • generation/requirement_ast.py
💤 Files with no reviewable changes (1)
  • .gitmodules
🧰 Additional context used
📓 Path-based instructions (1)
generation/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

generation/**/*.py: Implement the data generation pipeline in the generation service following the documented steps
Use GIST Large Embedding v0 for embedding generation (prefer over deprecated OpenAI embedding)

Files:

  • generation/requirement_ast.py
  • generation/madgrades.py
  • generation/enrollment.py
🧠 Learnings (1)
📓 Common learnings
Learnt from: CR
Repo: twangodev/uw-coursemap PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-08-24T08:36:25.768Z
Learning: Applies to generation/**/*.py : Implement the data generation pipeline in the generation service following the documented steps
🧬 Code graph analysis (2)
generation/requirement_ast.py (1)
generation/schemas/requirement_ast.py (5)
  • Leaf (10-33)
  • Node (36-58)
  • RequirementAbstractSyntaxTree (61-116)
  • model_dump (28-30)
  • model_dump (73-75)
generation/enrollment.py (1)
generation/course.py (1)
  • Course (88-258)
🪛 GitHub Actions: Python Linting
generation/requirement_ast.py

[error] 1-1: ruff format check failed. 1 file would be reformatted (requirement_ast.py). Run 'ruff format' to fix formatting.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: build-web
  • GitHub Check: build-web
  • GitHub Check: Generate Data
🔇 Additional comments (8)
generation/requirement_ast.py (4)

1-11: LGTM! Clean refactor to Pydantic base classes.

The import structure clearly distinguishes base classes from the extended implementations.


250-267: LGTM! Factory method correctly handles both string and course reference payloads.

The delegation to model_dump() ensures consistent serialization with the Pydantic base class.


270-290: LGTM! Recursive factory method correctly handles nested AST structures.

The union return type Node | Leaf properly reflects that the method can return either type depending on the JSON structure.


310-363: LGTM! Well-structured recursive descent parser with proper operator precedence.

The parser correctly implements OR < AND precedence and properly constructs typed AST nodes. The EOF edge case (line 358) is handled by returning an empty Leaf.
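The precedence structure described (OR binds loosest, AND tighter, with an empty leaf at EOF) follows the classic one-function-per-level recursive descent shape; a minimal sketch over pre-split tokens, using tuples in place of the repo's Node/Leaf classes:

```python
# Tokens are course names or the operators "and"/"or"; "or" binds loosest.
def parse_or(tokens, pos=0):
    node, pos = parse_and(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "or":
        right, pos = parse_and(tokens, pos + 1)
        node = ("OR", node, right)
    return node, pos


def parse_and(tokens, pos):
    node, pos = parse_leaf(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "and":
        right, pos = parse_leaf(tokens, pos + 1)
        node = ("AND", node, right)
    return node, pos


def parse_leaf(tokens, pos):
    if pos >= len(tokens):  # EOF edge case: return an empty leaf
        return ("LEAF", ""), pos
    return ("LEAF", tokens[pos]), pos + 1


tree, _ = parse_or(["MATH 221", "and", "CS 300", "or", "CS 400"])
```

Because `parse_or` calls `parse_and` for its operands, the AND group is assembled first, so the input parses as `(MATH 221 AND CS 300) OR CS 400`.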

generation/enrollment.py (2)

403-409: No issues found. The code is correct as-is.

The TermData() no-argument instantiation is valid because both enrollment_data and grade_data are optional fields with None defaults. The pattern of creating an instance, retrieving existing data if present, updating only enrollment_data, and storing back is safe—partial field updates are intentional and supported by the optional field design. This pattern is consistent with usage elsewhere in the codebase (e.g., generation/madgrades.py:57).


264-264: No issues found with the CourseReference constructor call.

The CourseReference schema correctly defines subjects (frozenset[str]) and course_number (int) as fields. The keyword arguments at line 264 match the constructor signature exactly. The field validator automatically converts the subjects parameter to a frozenset, and this pattern is consistent across the codebase (e.g., generation/madgrades.py uses the same approach). The code is correct.

generation/madgrades.py (2)

57-57: TermData default constructor is correct.

The migration from TermData(None, None) to TermData() aligns with the schema definition where both enrollment_data and grade_data have default values of None. This pattern is used consistently in generation/madgrades.py and generation/enrollment.py when initializing empty term data before conditional assignment. The use of explicit keyword arguments in generation/course.py when values are immediately available is also valid and consistent.


43-43: Keyword argument migration is correct and consistent across the codebase.

The change to Course.Reference(subjects=subject_set, course_number=course_number) on line 43 aligns with the CourseReference Pydantic model definition. Both CourseReference instantiations in the codebase (generation/madgrades.py:43 and generation/enrollment.py:264) use consistent keyword arguments. The field validator automatically converts the passed set to frozenset as required by the schema.

Comment thread generation/requirement_ast.py

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
generation/enrollment_data.py (1)

187-239: Clarify the frozen configuration and behavior.

Line 187 sets frozen=False to "allow mutation for capacity updates," but the get_or_create_with_capacity method creates new instances (lines 222-229) instead of mutating, which is the pattern for frozen/immutable models. Meanwhile, the base class in schemas/enrollment.py has frozen=True (line 32).

This creates confusion about the intended behavior. If instances should be immutable, keep frozen=True and continue creating new instances. If they should be mutable, remove the new instance creation and mutate directly.

🔎 Recommended approach: Keep frozen=True
-    model_config = ConfigDict(frozen=False)  # Allow mutation for capacity updates
+    # Inherits frozen=True from base - creates new instances for updates

This matches the actual behavior in get_or_create_with_capacity where new instances are created.

🧹 Nitpick comments (3)
generation/schemas/enrollment.py (1)

64-64: Consider using a forward reference for better type safety.

The course_reference field is typed as Any to avoid circular imports. Consider using a string literal forward reference or importing under TYPE_CHECKING for better type safety.

🔎 Suggested improvement
 from __future__ import annotations
 
-from typing import Any
+from typing import TYPE_CHECKING, Any
 
 from pydantic import BaseModel, ConfigDict, field_serializer
 
 from schemas.grades import GradeData
 
+if TYPE_CHECKING:
+    from schemas.course import CourseReference

 # ... 

 class Meeting(BaseModel):
     """A class meeting/section."""
 
     model_config = ConfigDict(arbitrary_types_allowed=True)
 
     name: str
     type: str
     start_time: int | None
     end_time: int | None
     location: MeetingLocation | None = None
     current_enrollment: int | None = None
     instructors: list[str] = []
-    course_reference: Any | None = None  # CourseReference
+    course_reference: "CourseReference | None" = None
generation/enrollment_data.py (2)

377-389: Consider using Pydantic's model_copy for clarity.

Lines 387-389 create a new GradeData instance by unpacking model_dump() and adding instructors. While functional, this approach bypasses Pydantic's built-in mutation methods.

🔎 Alternative approach using model_copy
             for section in sections:
                 for instructor in section["instructors"]:
                     name = instructor["name"]
                     if name:
                         instructors.add(name)
 
             # Update instructors on the grade data
-            grade_data = GradeData(
-                **{**grade_data.model_dump(), "instructors": list(instructors)}
-            )
+            grade_data = grade_data.model_copy(update={"instructors": list(instructors)})
 
             by_term[term_code] = grade_data

44-64: The _instructors_set field is a reasonable performance optimization; no validation issues found.

The custom __init__ does not bypass Pydantic validation—it processes initialization data before calling super().__init__(). The _instructors_set is a private cache that stores the set representation for efficient operations in merge_with(), where set union is used to combine instructors. The implementation correctly handles None and empty cases, and the property provides transparent access to the set representation.

Consider using a @computed_field decorator if you want a more explicit Pydantic-native approach, but the current implementation is sound and avoids unnecessary complexity.


📥 Commits

Reviewing files that changed from the base of the PR and between bef64f4 and 9ee9a04.

📒 Files selected for processing (2)
  • generation/enrollment_data.py
  • generation/schemas/enrollment.py
🧰 Additional context used
📓 Path-based instructions (1)
generation/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

generation/**/*.py: Implement the data generation pipeline in the generation service following the documented steps
Use GIST Large Embedding v0 for embedding generation (prefer over deprecated OpenAI embedding)

Files:

  • generation/schemas/enrollment.py
  • generation/enrollment_data.py
🧠 Learnings (1)
📓 Common learnings
Learnt from: CR
Repo: twangodev/uw-coursemap PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-08-24T08:36:25.768Z
Learning: Applies to generation/**/*.py : Implement the data generation pipeline in the generation service following the documented steps
🧬 Code graph analysis (1)
generation/schemas/enrollment.py (1)
generation/schemas/grades.py (1)
  • GradeData (8-116)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: build-web
  • GitHub Check: build-web
  • GitHub Check: Generate Data
🔇 Additional comments (7)
generation/schemas/enrollment.py (4)

12-27: LGTM!

The School model is well-structured with a clear factory method for API response parsing.


29-50: LGTM!

The custom __hash__ and __eq__ methods correctly implement deduplication by location identity (building, room, coordinates), intentionally excluding capacity. This is the right design for location-based deduplication.


103-138: LGTM!

The EnrollmentData model correctly handles credit count serialization (tuple to list) and provides a clean factory method for API response parsing.


140-144: LGTM!

The TermData model provides a clean composition of enrollment and grade data.

generation/enrollment_data.py (3)

67-181: LGTM!

The factory methods (empty, from_madgrades, from_json) and utilities (merge_with, to_dict) are well-implemented with correct logic for grade data aggregation and transformation.


257-286: LGTM!

The Meeting class extension provides clean factory methods with proper handling of nested models and circular import avoidance.


356-368: LGTM!

The exception handling with retry logic is correctly implemented. The raise statement on line 367 properly propagates the exception after all retries are exhausted. The previous review comment's concern about flow control is unfounded—the raise will immediately exit the function.

Comment on lines +311 to +327
@classmethod
def from_enrollment(cls, data: dict, terms: dict[int, str]) -> EnrollmentData:
"""Create EnrollmentData from enrollment API response."""
last_taught_term_code = safe_int(data.get("lastTaught"))
last_taught_term = None
if last_taught_term_code and last_taught_term_code in terms:
last_taught_term = terms[last_taught_term_code]

return cls(
school=School.from_enrollment(data["subject"]["schoolCollege"]),
last_taught_term=last_taught_term,
typically_offered=data.get("typicallyOffered"),
credit_count=(data["minimumCredits"], data["maximumCredits"]),
general_education=data.get("generalEd") is not None,
ethnics_studies=data.get("ethnicStudies") is not None,
instructors={},
)

🛠️ Refactor suggestion | 🟠 Major

Remove duplicated from_enrollment method.

The from_enrollment method here is identical to the base class implementation in generation/schemas/enrollment.py (lines 119-137). Since this class extends _EnrollmentDataBase, it inherits the method automatically. This duplication adds maintenance burden with no benefit.

🔎 Proposed change
     @classmethod
     def from_json(cls, data: dict) -> EnrollmentData:
         """Create EnrollmentData from JSON dict."""
         school = None
         if data.get("school"):
             school = School.model_validate(data["school"])
 
         credit_count = data.get("credit_count", [0, 0])
 
         return cls(
             school=school,
             last_taught_term=data.get("last_taught_term"),
             typically_offered=data.get("typically_offered"),
             credit_count=(credit_count[0], credit_count[1]),
             general_education=data.get("general_education", False),
             ethnics_studies=data.get("ethnics_studies", False),
             instructors=data.get("instructors", {}),
         )
 
-    @classmethod
-    def from_enrollment(cls, data: dict, terms: dict[int, str]) -> EnrollmentData:
-        """Create EnrollmentData from enrollment API response."""
-        last_taught_term_code = safe_int(data.get("lastTaught"))
-        last_taught_term = None
-        if last_taught_term_code and last_taught_term_code in terms:
-            last_taught_term = terms[last_taught_term_code]
-
-        return cls(
-            school=School.from_enrollment(data["subject"]["schoolCollege"]),
-            last_taught_term=last_taught_term,
-            typically_offered=data.get("typicallyOffered"),
-            credit_count=(data["minimumCredits"], data["maximumCredits"]),
-            general_education=data.get("generalEd") is not None,
-            ethnics_studies=data.get("ethnicStudies") is not None,
-            instructors={},
-        )
-
     def to_dict(self) -> dict:
         """Convert to dict for JSON serialization."""
         return self.model_dump()
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
@classmethod
def from_enrollment(cls, data: dict, terms: dict[int, str]) -> EnrollmentData:
"""Create EnrollmentData from enrollment API response."""
last_taught_term_code = safe_int(data.get("lastTaught"))
last_taught_term = None
if last_taught_term_code and last_taught_term_code in terms:
last_taught_term = terms[last_taught_term_code]
return cls(
school=School.from_enrollment(data["subject"]["schoolCollege"]),
last_taught_term=last_taught_term,
typically_offered=data.get("typicallyOffered"),
credit_count=(data["minimumCredits"], data["maximumCredits"]),
general_education=data.get("generalEd") is not None,
ethnics_studies=data.get("ethnicStudies") is not None,
instructors={},
)
@classmethod
def from_json(cls, data: dict) -> EnrollmentData:
"""Create EnrollmentData from JSON dict."""
school = None
if data.get("school"):
school = School.model_validate(data["school"])
credit_count = data.get("credit_count", [0, 0])
return cls(
school=school,
last_taught_term=data.get("last_taught_term"),
typically_offered=data.get("typically_offered"),
credit_count=(credit_count[0], credit_count[1]),
general_education=data.get("general_education", False),
ethnics_studies=data.get("ethnics_studies", False),
instructors=data.get("instructors", {}),
)
def to_dict(self) -> dict:
"""Convert to dict for JSON serialization."""
return self.model_dump()
🤖 Prompt for AI Agents
In generation/enrollment_data.py around lines 311 to 327, there is a duplicate
from_enrollment classmethod that is identical to the implementation in the base
class generation/schemas/enrollment.py (lines 119-137); remove this entire
from_enrollment method from generation/enrollment_data.py so the class uses the
inherited implementation from _EnrollmentDataBase, and run tests/typechecks to
ensure nothing else depended on a local override.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (3)
generation/enrollment_data.py (3)

39-43: Verify frozen=False is intentional given base model constraints.

The frozen=False configuration allows mutation but may conflict with the base class _GradeDataBase. A past review comment noted this inconsistency could lead to confusion. Consider whether mutations are truly needed during processing, or if creating new instances would be clearer.

Based on past review comments.


188-240: Inconsistent mutability approach in capacity updates.

The model has frozen=False but the get_or_create_with_capacity method creates new instances when updating capacity (lines 223-230), treating the model as immutable. This is inconsistent:

  • If frozen=False allows mutation, you could update existing_location.capacity directly
  • If immutability is desired, set frozen=True in the base and keep the instance-creation pattern

The current approach is confusing and may lead to unexpected behavior.

Based on past review comments.


312-328: Remove duplicated from_enrollment method.

This method is identical to the base class implementation in generation/schemas/enrollment.py (lines 119-137). The duplication was flagged in a previous review and should be removed to avoid maintenance burden.

🔎 Proposed fix
-    @classmethod
-    def from_enrollment(cls, data: dict, terms: dict[int, str]) -> EnrollmentData:
-        """Create EnrollmentData from enrollment API response."""
-        last_taught_term_code = safe_int(data.get("lastTaught"))
-        last_taught_term = None
-        if last_taught_term_code and last_taught_term_code in terms:
-            last_taught_term = terms[last_taught_term_code]
-
-        return cls(
-            school=School.from_enrollment(data["subject"]["schoolCollege"]),
-            last_taught_term=last_taught_term,
-            typically_offered=data.get("typicallyOffered"),
-            credit_count=(data["minimumCredits"], data["maximumCredits"]),
-            general_education=data.get("generalEd") is not None,
-            ethnics_studies=data.get("ethnicStudies") is not None,
-            instructors={},
-        )
-
     def to_dict(self) -> dict:
         """Convert to dict for JSON serialization."""
         return self.model_dump()

Based on past review comments.

🧹 Nitpick comments (1)
generation/enrollment_data.py (1)

91-109: Simplify instructor merging logic using the instructors_as_set property.

The merge logic manually checks and converts instructors to sets, but the instructors_as_set property (lines 58-65) already handles this conversion. This creates unnecessary duplication.

🔎 Proposed simplification
-    def merge_with(self, other: GradeData | None) -> GradeData:
-        """Merge this grade data with another, summing all values."""
-        if not other:
-            return self
-
-        self_instructors = self.instructors_as_set
-        other_instructors = (
-            other.instructors_as_set
-            if hasattr(other, "instructors_as_set")
-            else (set(other.instructors) if other.instructors else None)
-        )
-
-        merged_instructors: list[str] | None = None
-        if self_instructors is not None and other_instructors is not None:
-            merged_instructors = list(self_instructors | other_instructors)
-        elif self_instructors is not None:
-            merged_instructors = list(self_instructors)
-        elif other_instructors is not None:
-            merged_instructors = list(other_instructors)
+    def merge_with(self, other: GradeData | None) -> GradeData:
+        """Merge this grade data with another, summing all values."""
+        if not other:
+            return self
+
+        # Use instructors_as_set property for consistent handling
+        self_instructors = self.instructors_as_set
+        other_instructors = other.instructors_as_set if hasattr(other, "instructors_as_set") else (set(other.instructors) if other.instructors else None)
+
+        merged_instructors: list[str] | None = None
+        if self_instructors and other_instructors:
+            merged_instructors = list(self_instructors | other_instructors)
+        elif self_instructors:
+            merged_instructors = list(self_instructors)
+        elif other_instructors:
+            merged_instructors = list(other_instructors)
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9ee9a04 and fe237b7.

📒 Files selected for processing (8)
  • .gitignore
  • data
  • generation/cache.py
  • generation/enrollment.py
  • generation/enrollment_data.py
  • generation/schemas/course.py
  • generation/schemas/enrollment.py
  • generation/schemas/grades.py
💤 Files with no reviewable changes (1)
  • data
🚧 Files skipped from review as they are similar to previous changes (1)
  • generation/schemas/course.py
🧰 Additional context used
📓 Path-based instructions (1)
generation/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

generation/**/*.py: Implement the data generation pipeline in the generation service following the documented steps
Use GIST Large Embedding v0 for embedding generation (prefer over deprecated OpenAI embedding)

Files:

  • generation/cache.py
  • generation/schemas/enrollment.py
  • generation/schemas/grades.py
  • generation/enrollment_data.py
  • generation/enrollment.py
🧠 Learnings (2)
📓 Common learnings
Learnt from: CR
Repo: twangodev/uw-coursemap PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-08-24T08:36:25.768Z
Learning: Applies to generation/**/*.py : Implement the data generation pipeline in the generation service following the documented steps
📚 Learning: 2025-08-24T08:36:25.768Z
Learnt from: CR
Repo: twangodev/uw-coursemap PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-08-24T08:36:25.768Z
Learning: Applies to generation/**/*.py : Implement the data generation pipeline in the generation service following the documented steps

Applied to files:

  • .gitignore
🧬 Code graph analysis (4)
generation/cache.py (1)
generation/enrollment_data.py (5)
  • Meeting (258-287)
  • from_json (157-178)
  • from_json (243-251)
  • from_json (262-283)
  • from_json (294-310)
generation/schemas/enrollment.py (1)
generation/enrollment_data.py (4)
  • from_enrollment (313-328)
  • MeetingLocation (185-255)
  • Meeting (258-287)
  • EnrollmentData (290-332)
generation/schemas/grades.py (2)
generation/enrollment_data.py (4)
  • GradeData (36-182)
  • empty (68-89)
  • merge_with (91-130)
  • from_madgrades (133-154)
generation/main.py (1)
  • instructors (142-159)
generation/enrollment.py (4)
generation/schemas/enrollment.py (4)
  • EnrollmentData (103-137)
  • MeetingLocation (29-49)
  • Meeting (52-100)
  • TermData (140-144)
generation/enrollment_data.py (4)
  • EnrollmentData (290-332)
  • MeetingLocation (185-255)
  • Meeting (258-287)
  • get_or_create_with_capacity (194-240)
generation/schemas/course.py (1)
  • Course (130-193)
generation/course.py (1)
  • Course (88-258)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: build-web
  • GitHub Check: build-web
  • GitHub Check: Generate Data
🔇 Additional comments (14)
.gitignore (1)

29-29: Appropriate addition to support generated data artifacts.

The addition of data/ to the ignore list aligns well with the PR's data generation and S3 upload workflows. Placing it under the "# Generation" section and adjacent to generation/.cache maintains logical organization and prevents runtime-generated artifacts (manifests, OpenAPI schemas, etc.) from being accidentally committed.

generation/cache.py (2)

8-8: LGTM - Clean migration to public Meeting model.

The import update correctly reflects the new public API structure where Meeting is a top-level export from enrollment_data rather than a nested class.


339-339: LGTM - Consistent usage of new Meeting API.

The update to use Meeting.from_json(meeting) correctly aligns with the new public schema structure. This change is consistent with the broader refactoring effort.

generation/schemas/grades.py (1)

8-131: LGTM - Well-designed Pydantic model with efficient instructor handling.

The GradeData model demonstrates good practices:

  • Field validator (lines 34-42) and serializer (lines 44-49) provide efficient set-based operations internally while maintaining list compatibility for JSON
  • The merge_with method correctly handles instructor union logic
  • Factory methods provide clean construction paths from different data sources
generation/enrollment.py (5)

13-13: LGTM - Correct import of public schema models.

The updated import statement properly reflects the migration from nested classes (EnrollmentData.Meeting, EnrollmentData.MeetingLocation) to top-level public models.


371-380: LGTM - Improved coordinates validation and location creation.

The coordinates handling correctly validates both latitude and longitude exist before creating the tuple (line 372). The MeetingLocation.get_or_create_with_capacity call properly uses the new public API.


384-393: LGTM - Meeting construction with proper field mapping.

The Meeting instantiation correctly maps all required fields including location, instructors, and course_reference. The migration to the top-level Meeting class is complete and correct.


402-402: TermData() constructor correctly uses default arguments.

The TermData class in generation/schemas/enrollment.py defines both enrollment_data and grade_data with None defaults, so the change from TermData(None, None) to TermData() is correct and idiomatic.


264-264: Course.Reference keyword arguments are correct.

The usage of Course.Reference(subjects=subjects, course_number=course_code) matches the CourseReference constructor signature. The CourseReference class (defined as a ClassVar alias on Course) accepts these keyword arguments, as confirmed by consistent usage patterns throughout the codebase.

generation/schemas/enrollment.py (4)

12-27: LGTM - Clean School model with factory method.

The School model and from_enrollment factory correctly map enrollment API fields to the Pydantic model.


29-50: LGTM - Proper MeetingLocation deduplication logic.

The frozen model with hash/eq based on (building, room, coordinates) correctly enables deduplication while excluding capacity (which can vary). This is the correct base for the extended class in enrollment_data.py.


103-137: LGTM - EnrollmentData factory with proper field mapping.

The from_enrollment factory correctly:

  • Handles optional last_taught_term lookup via terms dict
  • Maps API response fields to model fields
  • Uses local import of safe_int to avoid circular dependencies

140-144: LGTM - TermData with sensible defaults.

Both fields defaulting to None confirms the TermData() constructor change in generation/enrollment.py line 402 is correct.

generation/enrollment_data.py (1)

354-368: Exception handling is correct and properly raises after retry exhaustion.

The retry logic correctly exhausts attempts before raising. The raise on line 368 properly halts execution—lines 370+ are only reached on successful data fetch. Exceptions propagate cleanly to callers.
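The retry-then-raise shape this comment verifies reduces to a small synchronous sketch (a hypothetical `fetch_with_retries` helper; the real code is async over aiohttp):

```python
def fetch_with_retries(fetch, attempts: int = 3):
    """Try `fetch` up to `attempts` times; re-raise the last failure."""
    last_exc = None
    for _ in range(attempts):
        try:
            return fetch()
        except Exception as exc:  # broad catch kept for illustration
            last_exc = exc
    raise last_exc  # only reached once every attempt has failed

calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ValueError("transient")
    return "ok"
```

Success on any attempt returns immediately; the trailing `raise` is unreachable unless all attempts failed, which is exactly the behavior verified above.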

Comment on lines +75 to +100
def __hash__(self) -> int:
section = self.name.split("#")[0].strip() if self.name else ""
return hash(
(
section,
self.type,
self.start_time,
self.end_time,
tuple(sorted(self.instructors)),
)
)

def __eq__(self, other: object) -> bool:
if not isinstance(other, Meeting):
return False

self_section = self.name.split("#")[0].strip() if self.name else ""
other_section = other.name.split("#")[0].strip() if other.name else ""

return (
self_section == other_section
and self.type == other.type
and self.start_time == other.start_time
and self.end_time == other.end_time
and sorted(self.instructors) == sorted(other.instructors)
)

⚠️ Potential issue | 🟡 Minor

Meeting hash/eq relies on name parsing that may be fragile.

The hash and equality logic extracts section identifiers using name.split("#")[0].strip() (lines 76, 91-92). This assumes meeting names always follow the format "SECTION_ID #INDEX". If the format varies or names lack "#", this could cause incorrect deduplication.

Consider adding validation or documentation about the expected name format.

🔎 Suggested defensive code
     def __hash__(self) -> int:
-        section = self.name.split("#")[0].strip() if self.name else ""
+        # Extract section identifier before occurrence number
+        section = self.name.split("#")[0].strip() if self.name and "#" in self.name else (self.name or "")
         return hash(
             (
                 section,
                 self.type,
                 self.start_time,
                 self.end_time,
                 tuple(sorted(self.instructors)),
             )
         )

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
generation/enrollment.py (1)

282-283: Prefer a specific exception type over generic Exception.

While checking the status code before parsing is good defensive programming, raising a generic Exception makes error handling less precise. Consider using aiohttp.ClientResponseError or a custom exception type.

🔎 Suggested refactor
             if response.status != 200:
-                raise Exception(f"HTTP {response.status}")
+                response.raise_for_status()  # Raises ClientResponseError
             data = await response.json()

Or define a custom exception:

# At module level
class EnrollmentDataFetchError(Exception):
    pass

# Then in the code:
             if response.status != 200:
-                raise Exception(f"HTTP {response.status}")
+                raise EnrollmentDataFetchError(f"HTTP {response.status}: {response.reason}")
             data = await response.json()
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cb856b7 and 1a04f57.

📒 Files selected for processing (1)
  • generation/enrollment.py
🧰 Additional context used
📓 Path-based instructions (1)
generation/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

generation/**/*.py: Implement the data generation pipeline in the generation service following the documented steps
Use GIST Large Embedding v0 for embedding generation (prefer over deprecated OpenAI embedding)

Files:

  • generation/enrollment.py
🧠 Learnings (1)
📓 Common learnings
Learnt from: CR
Repo: twangodev/uw-coursemap PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-08-24T08:36:25.768Z
Learning: Applies to generation/**/*.py : Implement the data generation pipeline in the generation service following the documented steps
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: build-web
  • GitHub Check: build-web
  • GitHub Check: Generate Data
🔇 Additional comments (5)
generation/enrollment.py (5)

13-13: LGTM! Import migration to top-level models.

The updated imports align with the Pydantic schema migration, moving from nested EnrollmentData classes to top-level models. Usage throughout the file is consistent with these changes.


264-264: LGTM! Improved clarity with keyword arguments.

The switch to keyword arguments (subjects=, course_number=) makes the API more explicit and aligns with the updated CourseReference public interface.


377-382: LGTM! Updated model usage.

The call to MeetingLocation.get_or_create_with_capacity correctly uses the top-level model and passes appropriate parameters.


386-386: LGTM! Meeting instantiation updated.

The Meeting instantiation correctly uses the top-level model with all required parameters properly populated from the enrollment data.


404-404: LGTM! Cleaner constructor usage.

The updated TermData() constructor call leverages default parameter values, simplifying the code and aligning with Pydantic model conventions.

Comment thread: generation/enrollment.py
Comment on lines +373 to +374
coords = (building.get("latitude"), building.get("longitude"))
coordinates = coords if all(coords) else None

⚠️ Potential issue | 🟡 Minor

Coordinate validation incorrectly treats 0 as invalid.

The logic all(coords) will reject coordinates where either latitude or longitude is exactly 0.0 (equator or prime meridian), as all() treats 0.0 as falsy. While Wisconsin coordinates are unlikely to be affected, this is technically incorrect.

🔎 Proposed fix
-                    coords = (building.get("latitude"), building.get("longitude"))
-                    coordinates = coords if all(coords) else None
+                    lat = building.get("latitude")
+                    lng = building.get("longitude")
+                    coordinates = (lat, lng) if lat is not None and lng is not None else None
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
coords = (building.get("latitude"), building.get("longitude"))
coordinates = coords if all(coords) else None
lat = building.get("latitude")
lng = building.get("longitude")
coordinates = (lat, lng) if lat is not None and lng is not None else None
🤖 Prompt for AI Agents
In generation/enrollment.py around lines 373-374, the coordinate validation uses
all(coords) which treats 0.0 as falsy and will incorrectly drop valid
coordinates at the equator or prime meridian; change the check to explicitly
test for None (e.g., coordinates = coords if coords[0] is not None and coords[1]
is not None else None) or use any(x is None for x in coords) to ensure zero
values are accepted while missing values remain rejected.
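A two-line check makes the falsiness problem concrete (illustrative coordinates, not real building data):

```python
coords = (0.0, -89.4)  # valid point on the equator
assert not all(coords)  # all() wrongly treats latitude 0.0 as missing

# Explicit None checks accept zeros while still rejecting missing values
lat, lng = coords
coordinates = (lat, lng) if lat is not None and lng is not None else None
```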

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (2)
generation/upload.py (2)

84-95: The previously reported issue with Quiet: True masking deletion failures remains unaddressed.

Setting "Quiet": True on line 87 suppresses per-object error information in the S3 response. The code unconditionally increments deleted by len(batch) on line 90, which means if individual objects within a batch fail to delete, they will still be counted as successfully deleted.

This can result in:

  • Incorrect deletion counts in logs
  • Stale objects remaining in S3 without detection
  • False confidence in cleanup operations
🔎 Recommended fix
     for i in tqdm(range(0, len(stale_keys), 1000), desc="Deleting stale objects"):
         batch = stale_keys[i : i + 1000]
-        delete_request = {"Objects": [{"Key": key} for key in batch], "Quiet": True}
+        delete_request = {"Objects": [{"Key": key} for key in batch]}
         try:
-            client.delete_objects(Bucket=bucket, Delete=delete_request)
-            deleted += len(batch)
+            response = client.delete_objects(Bucket=bucket, Delete=delete_request)
+            deleted += len(response.get("Deleted", []))
+            
+            # Log any errors
+            for error in response.get("Errors", []):
+                logger.error(f"Failed to delete {error['Key']}: {error['Message']}")
         except Exception as e:
             logger.error(f"Failed to delete batch: {e}")

156-162: The previously reported Windows path separator issue remains unaddressed.

Line 160 converts relative_path (a Path object) to a string using str(relative_path), which uses OS-specific path separators. On Windows, this produces backslashes (\) instead of forward slashes (/), resulting in invalid S3 keys.

This will cause:

  • Broken S3 keys on Windows systems
  • Files to be inaccessible via S3
  • Deployment failures on Windows CI/CD runners
🔎 Required fix
         for filename in files:
             local_path = Path(root) / filename
             relative_path = local_path.relative_to(data_path)
-            key = f"{prefix}{relative_path}" if prefix else str(relative_path)
+            key = f"{prefix}{relative_path.as_posix()}" if prefix else relative_path.as_posix()
             key = key.removesuffix(".json")
             files_to_upload.append((local_path, key))
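The difference is easy to demonstrate with pathlib's pure path classes, which model Windows path semantics on any platform:

```python
from pathlib import PureWindowsPath

rel = PureWindowsPath("course", "data", "file.json")
windows_key = str(rel)   # backslash-separated, as produced on Windows
s3_key = rel.as_posix()  # always forward slashes, valid as an S3 key
```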
🧹 Nitpick comments (2)
generation/upload.py (2)

18-34: Add return type hint for better type safety.

The function lacks a return type hint. While boto3 clients are dynamically typed, adding a type hint improves code clarity and enables better IDE support.

🔎 Suggested improvement
+from typing import Any
+
 def get_s3_client(
     endpoint_url: str,
     access_key_id: str,
     secret_access_key: str,
     max_pool_connections: int = 100,
-):
+) -> Any:
     """Create an S3 client for S3-compatible object storage."""

Note: You can use Any for the boto3 client type, or annotate the return as S3Client (via from mypy_boto3_s3 import S3Client) if you add the boto3-stubs package for better type checking.


152-164: Consider validating the data directory before processing.

The function doesn't verify that data_dir exists or is a valid directory before walking it. This could lead to unclear error messages if the path is invalid.

🔎 Suggested improvement
     )
     data_path = Path(data_dir)
+    
+    if not data_path.exists():
+        raise ValueError(f"Data directory does not exist: {data_dir}")
+    if not data_path.is_dir():
+        raise ValueError(f"Path is not a directory: {data_dir}")

     # Collect all files to upload
     files_to_upload = []
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 074042d and f6af783.

📒 Files selected for processing (2)
  • generation/request_util.py
  • generation/upload.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • generation/request_util.py
🧰 Additional context used
📓 Path-based instructions (1)
generation/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

generation/**/*.py: Implement the data generation pipeline in the generation service following the documented steps
Use GIST Large Embedding v0 for embedding generation (prefer over deprecated OpenAI embedding)

Files:

  • generation/upload.py
🧠 Learnings (1)
📓 Common learnings
Learnt from: CR
Repo: twangodev/uw-coursemap PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-08-24T08:36:25.768Z
Learning: Applies to generation/**/*.py : Implement the data generation pipeline in the generation service following the documented steps
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build-web
  • GitHub Check: build-web
🔇 Additional comments (3)
generation/upload.py (3)

37-44: LGTM!

The content type detection logic is correct with appropriate fallbacks for JSON files and binary content.


98-120: LGTM!

The file upload logic is correct with proper error handling and resource management. The function correctly:

  • Determines content type
  • Streams file content in a context manager
  • Sets appropriate headers (ContentType, CacheControl)
  • Returns success status

Consider adding a type hint for the client parameter (same suggestion as for get_s3_client), but this is a minor enhancement.


149-176: Concurrent S3 upload implementation is correct and follows best practices.

The code properly implements thread-safe concurrent uploads by creating a single boto3 S3 client before spawning threads and sharing it across the ThreadPoolExecutor. Boto3 clients are documented as thread-safe for concurrent operations, and the configuration with max_pool_connections appropriately tunes the connection pool for parallel requests. No changes needed.
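The share-one-client pattern described here looks roughly like this (`FakeClient` stands in for the boto3 S3 client so the sketch stays self-contained):

```python
from concurrent.futures import ThreadPoolExecutor

class FakeClient:
    """Stand-in for a thread-safe boto3 S3 client."""
    def upload(self, key: str) -> str:
        return f"uploaded:{key}"

client = FakeClient()  # created once, before any worker threads start
keys = [f"file-{i}.json" for i in range(8)]

with ThreadPoolExecutor(max_workers=4) as pool:
    # All workers share the single client instance
    results = list(pool.map(client.upload, keys))
```

`pool.map` preserves input order, so results line up with the submitted keys even though uploads complete concurrently.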

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
generation/embeddings.py (1)

250-251: Consider renaming the comprehension variable to avoid shadowing.

The list comprehension variable course shadows the function parameter of the same name (line 230). While the logic is correct, renaming the comprehension variable (e.g., to prereq_course) would improve readability.

🔎 Suggested refactor
 branch_embeddings = [
-    get_embedding(cache_dir, model, course.get_full_summary())
-    for course in branch_as_courses
+    get_embedding(cache_dir, model, prereq_course.get_full_summary())
+    for prereq_course in branch_as_courses
 ]
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f6af783 and c03ff04.

📒 Files selected for processing (1)
  • generation/embeddings.py
🧰 Additional context used
📓 Path-based instructions (1)
generation/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

generation/**/*.py: Implement the data generation pipeline in the generation service following the documented steps
Use GIST Large Embedding v0 for embedding generation (prefer over deprecated OpenAI embedding)

Files:

  • generation/embeddings.py
🧠 Learnings (1)
📓 Common learnings
Learnt from: CR
Repo: twangodev/uw-coursemap PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-08-24T08:36:25.768Z
Learning: Applies to generation/**/*.py : Implement the data generation pipeline in the generation service following the documented steps
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: build-web
  • GitHub Check: build-web
  • GitHub Check: Generate Data
🔇 Additional comments (1)
generation/embeddings.py (1)

285-290: LGTM! Defensive guard prevents potential crash.

The AST null-check correctly prevents an AttributeError on line 292 when attempting to call course_combinations(). The fallback to the original course_references is appropriate and consistent with the early-return pattern on lines 278-283.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
generation/upload.py (2)

84-95: Quiet mode masks per-object deletion failures.

Setting "Quiet": True suppresses individual object error details. Line 90 unconditionally increments the counter, potentially overcounting deletions when some objects fail.


159-161: Windows path separators create invalid S3 keys.

Line 160 converts Path to string using OS-specific separators. On Windows, backslashes produce invalid S3 keys. Use .as_posix() to ensure forward slashes.

🧹 Nitpick comments (2)
generation/upload.py (1)

160-160: Consider ensuring prefix ends with separator.

If prefix doesn't end with /, concatenation produces keys without proper separation (e.g., "v1course/..." instead of "v1/course/..."). The docstring example shows a trailing slash, but enforcing it would be more robust.

🔎 Optional enhancement
-            key = f"{prefix}{relative_path}" if prefix else str(relative_path)
+            # Ensure prefix ends with / if provided
+            normalized_prefix = prefix if not prefix or prefix.endswith("/") else f"{prefix}/"
+            key = f"{normalized_prefix}{relative_path.as_posix()}" if normalized_prefix else relative_path.as_posix()
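A hedged sketch of the key-building logic being suggested (`make_key` is a hypothetical helper, not the project's code; `PureWindowsPath` is used only to simulate a Windows-origin path):

```python
from pathlib import PurePath, PureWindowsPath

def make_key(prefix: str, relative_path: PurePath) -> str:
    # as_posix() guarantees forward slashes even when the path came
    # from a Windows filesystem (backslashes are invalid in S3 keys).
    rel = relative_path.as_posix()
    if not prefix:
        return rel
    # Ensure exactly one "/" between the prefix and the relative path.
    return f"{prefix.rstrip('/')}/{rel}"

print(make_key("v1", PureWindowsPath("course", "MATH_101", "index.json")))
print(make_key("v1/", PurePath("course/MATH_101/index.json")))
```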
generation/save.py (1)

417-430: Consider updating the docstring to reflect BaseModel support.

The function now accepts BaseModel instances and converts them via model_dump(). The implementation is correct, but the docstring at line 407 states "Writes a sorted dictionary or list to a JSON file" which doesn't mention BaseModel or the fact that model_dump() can return other types (including strings, as seen in RequirementNode and Requirement per the relevant code snippets).

Suggested docstring update
 def write_file(directory, directory_tuple: tuple[str, ...], filename: str, data):
     """
-    Writes a sorted dictionary or list to a JSON file.
+    Writes a sorted dictionary, list, or Pydantic BaseModel to a JSON file.
+    BaseModel instances are converted via model_dump() before serialization.
     - directory_tuple: Tuple representing the directory path.
     - filename: Name of the JSON file (without the .json extension).
-    - data: Dictionary or list to be written to the file.
+    - data: Dictionary, list, set, tuple, or Pydantic BaseModel to be written to the file.
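A duck-typed sketch of the normalization `write_file` is described as doing (`FakeModel` and `to_sorted_json` are illustrative stand-ins, not the project's actual code; only the `model_dump()` hook is assumed, so pydantic is not required to run it):

```python
import json

class FakeModel:
    """Stand-in for a Pydantic BaseModel: only model_dump() is assumed."""
    def __init__(self, **fields):
        self._fields = fields

    def model_dump(self):
        return dict(self._fields)

def to_sorted_json(data) -> str:
    # Normalize BaseModel-like objects first, then sort keys
    # via json.dumps(sort_keys=True) for deterministic output.
    if hasattr(data, "model_dump"):
        data = data.model_dump()
    if isinstance(data, (set, tuple)):
        data = sorted(data)
    return json.dumps(data, sort_keys=True)

print(to_sorted_json(FakeModel(b=2, a=1)))
print(to_sorted_json({3, 1, 2}))
```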
     """
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c03ff04 and a4ecdc9.

📒 Files selected for processing (2)
  • generation/save.py
  • generation/upload.py
🧰 Additional context used
📓 Path-based instructions (1)
generation/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

generation/**/*.py: Implement the data generation pipeline in the generation service following the documented steps
Use GIST Large Embedding v0 for embedding generation (prefer over deprecated OpenAI embedding)

Files:

  • generation/upload.py
  • generation/save.py
🧠 Learnings (1)
📓 Common learnings
Learnt from: CR
Repo: twangodev/uw-coursemap PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-08-24T08:36:25.768Z
Learning: Applies to generation/**/*.py : Implement the data generation pipeline in the generation service following the documented steps
🧬 Code graph analysis (1)
generation/save.py (1)
generation/schemas/requirement_ast.py (2)
  • model_dump (28-30)
  • model_dump (73-75)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: build-web
  • GitHub Check: build-web
  • GitHub Check: Generate Data
🔇 Additional comments (4)
generation/upload.py (1)

18-34: LGTM: S3 client configuration follows best practices.

The adaptive retry strategy and configurable connection pool align well with concurrent upload requirements.

generation/save.py (3)

10-10: LGTM: Pydantic import added.

The import of BaseModel from pydantic is necessary for the migration from JsonSerializable to Pydantic-based schemas.


368-389: LGTM: BaseModel handling in recursive_sort_data is correct.

The function correctly handles Pydantic BaseModel instances by converting them via model_dump() before sorting. The implementation properly handles cases where custom model_dump() implementations may return different types (str, dict, list) as seen in the relevant code snippets for RequirementNode and Requirement.


634-641: Manifest generation correctly aligns with API consumer expectations.

The manifest schema matches the documented OpenAPI specification exactly. All fields (subjects, courses, instructors, updated_on) are present and properly typed as defined in generation/schemas/endpoints.py and the generated TypeScript types.

Comment thread generation/save.py
identifier_to_course.items(), desc="Courses", unit="course"
):
-write_file(data_dir, ("course",), identifier, course)
+write_file(data_dir, ("course", identifier), "index", course)

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for hardcoded references to the old course path pattern
rg -nP --type=ts --type=py --type=js -C3 'course.*\.json|loadCourse|fetchCourse' --glob='!generation/save.py'

Repository: twangodev/uw-coursemap

Length of output: 2298


🏁 Script executed:

#!/bin/bash
# Look for course data loading/reading patterns more broadly
rg -nP 'course' --type=ts --type=py --type=js | grep -i 'path\|file\|read\|load\|fetch' | head -30

Repository: twangodev/uw-coursemap

Length of output: 2664


🏁 Script executed:

#!/bin/bash
# Check for any API endpoint definitions related to courses
rg -nP '/(course|api)' --type=ts --type=py --type=js | grep -E '(route|endpoint|path)' | head -20

Repository: twangodev/uw-coursemap

Length of output: 2004


🏁 Script executed:

#!/bin/bash
# Look for write_file calls related to courses to understand the full pattern
rg -nP 'write_file.*course' --type=py -A2 -B2

Repository: twangodev/uw-coursemap

Length of output: 2003


🏁 Script executed:

#!/bin/bash
# Check if there are any migration or cleanup scripts
fd -t f \( -name '*migrate*' -o -name '*cleanup*' -o -name '*upgrade*' \) --type f

Repository: twangodev/uw-coursemap

Length of output: 296


🏁 Script executed:

#!/bin/bash
# Look for documentation or README files that might reference course path
fd -t f \( -name 'README*' -o -name '*.md' \) | xargs grep -l 'course' 2>/dev/null || echo "No matches found"

Repository: twangodev/uw-coursemap

Length of output: 330


🏁 Script executed:

#!/bin/bash
# Verify the glob pattern issue in search/data_loader.py
cat -n search/data_loader.py | sed -n '175,190p'

Repository: twangodev/uw-coursemap

Length of output: 826


🏁 Script executed:

#!/bin/bash
# Check if search/data_loader.py has been updated in this PR
git diff HEAD -- search/data_loader.py | head -50

Repository: twangodev/uw-coursemap

Length of output: 48


🏁 Script executed:

#!/bin/bash
# Check the load_json method to understand the path resolution
rg -A5 'def load_json' search/data_loader.py

Repository: twangodev/uw-coursemap

Length of output: 307


Fix search/data_loader.py to load courses from the new nested directory structure.

The course path change from /course/{identifier}.json to /course/{identifier}/index.json breaks the search indexing layer. The search/data_loader.py file was not updated and has two incompatibilities:

  1. Line 182: local_dir.glob("*.json") finds no matches since course files are now in subdirectories (/course/{identifier}/index.json), not directly in /course/
  2. Line 189: self.load_json(f"course/{course_id}") attempts to load from course/{courseId} instead of course/{courseId}/index

Update search/data_loader.py to:

  • Use local_dir.glob("*/index.json") or iterate subdirectories to discover courses
  • Load courses using self.load_json(f"course/{course_id}/index") to match the new structure
🤖 Prompt for AI Agents
In search/data_loader.py around lines 182 and 189, the loader currently globs
local_dir.glob("*.json") and calls self.load_json(f"course/{course_id}"), which
fails with the new structure; change discovery to find nested course index files
(e.g., local_dir.glob("*/index.json") or iterate subdirectories and look for
index.json), extract course_id from the subdirectory name (parent folder of
index.json), and call self.load_json(f"course/{course_id}/index") so the loader
reads /course/{identifier}/index.json.
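The discovery pattern described here can be sketched as follows (`discover_courses` is a hypothetical helper; the temporary directory layout mirrors the new `/course/{identifier}/index.json` structure):

```python
import json
import tempfile
from pathlib import Path

def discover_courses(course_dir: Path) -> dict:
    """Load every {identifier}/index.json under course_dir; the course
    identifier is the parent directory name of each index.json file."""
    courses = {}
    for index_file in sorted(course_dir.glob("*/index.json")):
        course_id = index_file.parent.name
        courses[course_id] = json.loads(index_file.read_text())
    return courses

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "MATH_101").mkdir()
    (root / "MATH_101" / "index.json").write_text('{"title": "Calculus"}')
    (root / "CS_200").mkdir()
    (root / "CS_200" / "index.json").write_text('{"title": "Programming"}')
    loaded = discover_courses(root)

print(sorted(loaded))
```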

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
generation/request_util.py (1)

8-8: Remove HTTP 403 from the retry status codes list.

HTTP 403 (Forbidden) indicates a client authorization error that won't resolve through retries. This issue was flagged in a previous review but the code still includes 403 in the status_forcelist. Including it violates HTTP semantics, may mask real authentication bugs, and wastes resources with up to 50 retry attempts and exponential backoff. Keep only genuinely transient codes: 429 (rate limit) and 5xx (server errors).

🔎 Recommended fix
-        status_forcelist=[202, 403, 429, 500, 502, 503, 504],  # Retry on these HTTP status codes
+        status_forcelist=[429, 500, 502, 503, 504],  # Retry on these HTTP status codes
generation/enrollment.py (1)

378-379: Coordinate validation incorrectly treats 0 as invalid.

This issue was previously identified: all(coords) will reject coordinates where latitude or longitude is exactly 0.0 (equator or prime meridian) since 0.0 is falsy. While Wisconsin coordinates won't hit this edge case, the logic is technically incorrect.

🔎 Proposed fix
-                    coords = (building.get("latitude"), building.get("longitude"))
-                    coordinates = coords if all(coords) else None
+                    lat = building.get("latitude")
+                    lng = building.get("longitude")
+                    coordinates = (lat, lng) if lat is not None and lng is not None else None
🧹 Nitpick comments (2)
generation/enrollment.py (2)

245-262: Well-designed retry mechanism.

The tenacity-based retry with exponential jitter is a good pattern for handling transient API failures. The configuration (10 attempts, 1-60s exponential backoff with jitter) provides good resilience without being too aggressive.

Minor style note: The pass in the exception class is unnecessary since the docstring already provides a body.

🔎 Optional cleanup
 class EnrollmentFetchError(Exception):
     """Raised when enrollment package fetch fails."""
-    pass

300-308: Overly broad exception handling.

Catching generic Exception along with specific exceptions will mask unexpected errors. If fetch_enrollment_package raises something other than JSONDecodeError or EnrollmentFetchError (e.g., asyncio.CancelledError, KeyboardInterrupt), those should likely propagate rather than being silently caught and logged as a fetch failure.

🔎 Proposed fix
     try:
         data = await fetch_enrollment_package(
             session, enrollment_package_url, course_ref.get_identifier()
         )
-    except (JSONDecodeError, EnrollmentFetchError, Exception) as e:
+    except (JSONDecodeError, EnrollmentFetchError) as e:
         logger.warning(
             f"Failed to fetch enrollment data for {course_ref.get_identifier()} after retries: {str(e)}"
         )
         return None
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a4ecdc9 and 5ca2f9d.

⛔ Files ignored due to path filters (1)
  • generation/uv.lock is excluded by !**/*.lock
📒 Files selected for processing (3)
  • generation/enrollment.py
  • generation/pyproject.toml
  • generation/request_util.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • generation/pyproject.toml
🧰 Additional context used
📓 Path-based instructions (1)
generation/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

generation/**/*.py: Implement the data generation pipeline in the generation service following the documented steps
Use GIST Large Embedding v0 for embedding generation (prefer over deprecated OpenAI embedding)

Files:

  • generation/request_util.py
  • generation/enrollment.py
🧠 Learnings (1)
📓 Common learnings
Learnt from: CR
Repo: twangodev/uw-coursemap PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-08-24T08:36:25.768Z
Learning: Applies to generation/**/*.py : Implement the data generation pipeline in the generation service following the documented steps
🧬 Code graph analysis (1)
generation/enrollment.py (3)
generation/schemas/course.py (3)
  • Course (130-211)
  • get_identifier (67-70)
  • get_identifier (165-167)
generation/enrollment_data.py (4)
  • EnrollmentData (290-332)
  • MeetingLocation (185-255)
  • Meeting (258-287)
  • get_or_create_with_capacity (194-240)
generation/schemas/enrollment.py (4)
  • EnrollmentData (103-137)
  • MeetingLocation (29-49)
  • Meeting (52-100)
  • TermData (140-144)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: build-web
  • GitHub Check: build-web
  • GitHub Check: Generate Data
🔇 Additional comments (5)
generation/enrollment.py (5)

9-14: LGTM!

The new imports properly support the tenacity-based retry mechanism and the updated Pydantic model imports (MeetingLocation, Meeting, TermData) align with the refactored data model usage throughout the file.


284-284: LGTM!

Using keyword arguments for Course.Reference improves readability and aligns with the updated constructor pattern.


382-387: LGTM!

The updated call to MeetingLocation.get_or_create_with_capacity correctly uses the new import and passes all required parameters matching the factory method signature.


391-400: LGTM!

The Meeting instantiation correctly passes all fields matching the Pydantic model definition. Using keyword arguments enhances readability and maintainability.


409-409: LGTM!

Using TermData() without arguments is cleaner and correctly leverages the default None values defined in the Pydantic model.

total=50, # Total number of retries
backoff_factor=1, # Exponential backoff factor (e.g., 1, 2, 4 seconds)
-status_forcelist=[429, 500, 502, 503, 504], # Retry on these HTTP status codes
+status_forcelist=[202, 403, 429, 500, 502, 503, 504], # Retry on these HTTP status codes

⚠️ Potential issue | 🔴 Critical

Remove HTTP 202 from the retry status codes list.

HTTP 202 (Accepted) is a success response indicating that a request has been accepted for asynchronous processing. It is not an error condition and should never be retried. Including it in status_forcelist is semantically incorrect and may cause duplicate requests, potential data corruption if operations aren't idempotent, and indicates a misunderstanding of async API patterns.

🔎 Recommended fix
-        status_forcelist=[202, 403, 429, 500, 502, 503, 504],  # Retry on these HTTP status codes
+        status_forcelist=[429, 500, 502, 503, 504],  # Retry on these HTTP status codes
🤖 Prompt for AI Agents
In generation/request_util.py around line 8, the status_forcelist includes HTTP
202 which should be removed; edit the list to exclude 202 so only error/status
codes (e.g. 403, 429, 500, 502, 503, 504) remain, update the inline comment if
needed to reflect retry-only status codes, and run relevant tests/linters to
ensure no other code depends on 202 being retried.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
generation/enrollment.py (1)

422-423: Coordinate validation incorrectly treats 0 as invalid.

The logic all(coords) will reject coordinates where either latitude or longitude is exactly 0.0 (equator or prime meridian), as all() treats 0.0 as falsy. While Wisconsin coordinates are unlikely to be affected, this is technically incorrect.

🔎 Proposed fix
-                    coords = (building.get("latitude"), building.get("longitude"))
-                    coordinates = coords if all(coords) else None
+                    lat = building.get("latitude")
+                    lng = building.get("longitude")
+                    coordinates = (lat, lng) if lat is not None and lng is not None else None
🧹 Nitpick comments (1)
generation/enrollment.py (1)

109-128: Consider tracking and reporting failed pages.

Failed page fetches are logged and then skipped, which could lead to incomplete data. Adding a summary of how many pages failed would improve visibility into data completeness.

🔎 Track and report failed pages
         # Fetch all pages and combine hits
         all_hits = []
+        failed_pages = []
         for page_num in range(1, total_pages + 1):
             page_post_data = {
                 "selectedTerm": selected_term,
                 "queryString": "",
                 "filters": [],
                 "page": page_num,
                 "pageSize": MAX_PAGE_SIZE,
             }
             try:
                 page_data = await fetch_course_listing_page(session, page_post_data, page_num)
                 page_hits = page_data.get("hits", [])
                 all_hits.extend(page_hits)
                 logger.debug(f"Fetched page {page_num}/{total_pages}: {len(page_hits)} courses")
             except EnrollmentFetchError as e:
                 logger.error(f"Failed to fetch page {page_num} for {term_name}: {e}")
+                failed_pages.append(page_num)
                 continue

         logger.info(f"Fetched {len(all_hits)} total course listings for {term_name}")
+        if failed_pages:
+            logger.warning(
+                f"Failed to fetch {len(failed_pages)} pages for {term_name}: {failed_pages}"
+            )
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5ca2f9d and a2dfad4.

📒 Files selected for processing (1)
  • generation/enrollment.py
🧰 Additional context used
📓 Path-based instructions (1)
generation/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

generation/**/*.py: Implement the data generation pipeline in the generation service following the documented steps
Use GIST Large Embedding v0 for embedding generation (prefer over deprecated OpenAI embedding)

Files:

  • generation/enrollment.py
🧠 Learnings (1)
📓 Common learnings
Learnt from: CR
Repo: twangodev/uw-coursemap PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-08-24T08:36:25.768Z
Learning: Applies to generation/**/*.py : Implement the data generation pipeline in the generation service following the documented steps
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: build-web
  • GitHub Check: build-web
  • GitHub Check: Generate Data
🔇 Additional comments (4)
generation/enrollment.py (4)

9-9: LGTM: Imports aligned with new retry logic and data models.

The tenacity imports enable robust retry handling, and the updated enrollment_data imports reflect the refactored public API surface.

Also applies to: 14-14


23-24: LGTM: Page size constant appropriately defined.

The constant correctly reflects the API's maximum page size limit and is well-documented.


328-328: LGTM: Data model usage aligned with refactored schemas.

The usage of Course.Reference, MeetingLocation.get_or_create_with_capacity, Meeting, and TermData correctly reflects the refactored public API surface and Pydantic schema migration.

Also applies to: 345-348, 426-431, 435-444, 453-453


88-94: LGTM: Appropriate error handling for initial fetch.

Returning an empty dict on failure allows the pipeline to continue gracefully while logging the error for visibility.

Comment thread generation/enrollment.py
Comment on lines +279 to +306
@retry(
stop=stop_after_attempt(10),
wait=wait_exponential_jitter(initial=1, max=60, jitter=5),
retry=retry_if_exception_type(EnrollmentFetchError),
reraise=True,
)
async def fetch_course_listing_page(session, post_data, page_num):
"""Fetch a page of course listings with retry logic for 202 responses."""
async with session.post(url=query_url, json=post_data) as response:
if response.status != 200:
logger.warning(f"HTTP {response.status} for course listing page {page_num}, retrying...")
raise EnrollmentFetchError(f"HTTP {response.status}")
return await response.json()


@retry(
stop=stop_after_attempt(10),
wait=wait_exponential_jitter(initial=1, max=60, jitter=5),
retry=retry_if_exception_type(EnrollmentFetchError),
reraise=True,
)
async def fetch_enrollment_package(session, url, course_identifier):
"""Fetch enrollment package with tenacity retry logic."""
async with session.get(url=url) as response:
if response.status != 200:
logger.warning(f"HTTP {response.status} for {course_identifier}, retrying...")
raise EnrollmentFetchError(f"HTTP {response.status}")
return await response.json()

⚠️ Potential issue | 🟠 Major

Retry logic doesn't cover network errors.

The retry_if_exception_type(EnrollmentFetchError) configuration only retries when HTTP status != 200. Transient network errors (connection timeouts, DNS failures, etc.) raise aiohttp.ClientError or its subclasses, which will not trigger retry and cause immediate failure. This leads to incomplete data in the generation pipeline.

🔎 Expand retry to cover network errors
+from aiohttp import ClientError
+
 @retry(
     stop=stop_after_attempt(10),
     wait=wait_exponential_jitter(initial=1, max=60, jitter=5),
-    retry=retry_if_exception_type(EnrollmentFetchError),
+    retry=retry_if_exception_type((EnrollmentFetchError, ClientError)),
     reraise=True,
 )
 async def fetch_course_listing_page(session, post_data, page_num):
     """Fetch a page of course listings with retry logic for 202 responses."""
     async with session.post(url=query_url, json=post_data) as response:
         if response.status != 200:
             logger.warning(f"HTTP {response.status} for course listing page {page_num}, retrying...")
             raise EnrollmentFetchError(f"HTTP {response.status}")
         return await response.json()


 @retry(
     stop=stop_after_attempt(10),
     wait=wait_exponential_jitter(initial=1, max=60, jitter=5),
-    retry=retry_if_exception_type(EnrollmentFetchError),
+    retry=retry_if_exception_type((EnrollmentFetchError, ClientError)),
     reraise=True,
 )
 async def fetch_enrollment_package(session, url, course_identifier):
     """Fetch enrollment package with tenacity retry logic."""
     async with session.get(url=url) as response:
         if response.status != 200:
             logger.warning(f"HTTP {response.status} for {course_identifier}, retrying...")
             raise EnrollmentFetchError(f"HTTP {response.status}")
         return await response.json()
🤖 Prompt for AI Agents
In generation/enrollment.py around lines 279 to 306, the tenacity retry
currently only retries on EnrollmentFetchError which misses transient network
errors; update the retry condition to include aiohttp.ClientError (and
optionally asyncio.TimeoutError) so network-related exceptions are retried as
well, add the necessary imports (import aiohttp and import asyncio if adding
TimeoutError), and adjust the retry_if_exception_type argument to a tuple like
(EnrollmentFetchError, aiohttp.ClientError, asyncio.TimeoutError) while keeping
reraise=True and the same stop/wait settings.
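Where tenacity is unavailable, the retry-on-selected-exception-types behavior can be approximated with a short stdlib decorator (`retry_on`, `FetchError`, and `NetworkError` are illustrative stand-ins for the tenacity decorator and the aiohttp exception types; no backoff is modeled here):

```python
import functools

def retry_on(exc_types, attempts=3):
    """Minimal stand-in for tenacity's retry_if_exception_type: retry
    only when the raised exception matches one of exc_types."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except exc_types:
                    if attempt == attempts:
                        raise  # reraise=True behaviour
        return wrapper
    return decorator

class FetchError(Exception): ...
class NetworkError(Exception): ...  # stands in for aiohttp.ClientError

calls = {"n": 0}

@retry_on((FetchError, NetworkError), attempts=5)
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise NetworkError("transient")
    return "ok"

result = flaky()
print(result, calls["n"])
```

Passing a tuple of exception types mirrors `retry_if_exception_type((EnrollmentFetchError, ClientError))`: anything outside the tuple propagates immediately.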

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (3)
generation/enrollment.py (3)

286-298: Retry logic doesn't cover network errors.

The retry decorator only handles EnrollmentFetchError (HTTP status != 200), but transient network errors like aiohttp.ClientError or asyncio.TimeoutError won't trigger retry and will cause immediate failure.

As noted in previous review, expand the retry to include network errors:

from aiohttp import ClientError

@retry(
    stop=stop_after_attempt(10),
    wait=wait_exponential_jitter(initial=1, max=60, jitter=5),
    retry=retry_if_exception_type((EnrollmentFetchError, ClientError)),
    reraise=True,
)

301-313: Retry logic doesn't cover network errors.

Same issue as fetch_course_listing_page - the retry only handles EnrollmentFetchError, missing transient network failures.

Apply the same fix as above to include ClientError in the retry condition.


432-433: Coordinate validation incorrectly treats 0 as invalid.

The logic all(coords) will reject coordinates where either latitude or longitude is exactly 0.0 (equator or prime meridian), as all() treats 0.0 as falsy.

🔎 Proposed fix
-                    coords = (building.get("latitude"), building.get("longitude"))
-                    coordinates = coords if all(coords) else None
+                    lat = building.get("latitude")
+                    lng = building.get("longitude")
+                    coordinates = (lat, lng) if lat is not None and lng is not None else None
🧹 Nitpick comments (1)
generation/enrollment.py (1)

127-129: Consider failing fast or tracking partial data on page fetch errors.

When a page fetch fails, the code logs and continues, which could result in incomplete enrollment data. While the error is logged, downstream processing has no visibility into whether the data is complete or partial. Consider either:

  • Failing the entire term if any page fails (ensuring data completeness)
  • Tracking and propagating a "partial data" flag to inform consumers
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a2dfad4 and a758fe2.

📒 Files selected for processing (1)
  • generation/enrollment.py
🧰 Additional context used
📓 Path-based instructions (1)
generation/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

generation/**/*.py: Implement the data generation pipeline in the generation service following the documented steps
Use GIST Large Embedding v0 for embedding generation (prefer over deprecated OpenAI embedding)

Files:

  • generation/enrollment.py
🧠 Learnings (1)
📓 Common learnings
Learnt from: CR
Repo: twangodev/uw-coursemap PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-08-24T08:36:25.768Z
Learning: Applies to generation/**/*.py : Implement the data generation pipeline in the generation service following the documented steps
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build-web
  • GitHub Check: build-web
🔇 Additional comments (2)
generation/enrollment.py (2)

23-27: Good practice: Module-level constants with clear documentation.

The constants are well-documented and will make it easier to adjust pagination and rate limiting behavior in the future.


352-357: Good practice: Semaphore correctly limits concurrent API requests.

The semaphore usage ensures that no more than MAX_CONCURRENT_REQUESTS are in-flight simultaneously, helping avoid rate limiting from the API.
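The semaphore pattern can be sketched with pure asyncio (all names are illustrative; the short sleep stands in for the HTTP request):

```python
import asyncio

MAX_CONCURRENT_REQUESTS = 3  # illustrative limit

async def fetch(semaphore, i, peak):
    async with semaphore:
        # Track how many tasks hold the semaphore at once.
        peak["active"] += 1
        peak["max"] = max(peak["max"], peak["active"])
        await asyncio.sleep(0.01)  # stands in for the HTTP request
        peak["active"] -= 1
    return i

async def main():
    semaphore = asyncio.Semaphore(MAX_CONCURRENT_REQUESTS)
    peak = {"active": 0, "max": 0}
    results = await asyncio.gather(*(fetch(semaphore, i, peak) for i in range(10)))
    return results, peak["max"]

results, peak_concurrency = asyncio.run(main())
print(peak_concurrency)
```

All ten coroutines are scheduled at once, but the semaphore caps how many are past `async with` at any moment.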

Comment thread generation/enrollment.py
Comment on lines +358 to 362
except (JSONDecodeError, EnrollmentFetchError, Exception) as e:
logger.warning(
-f"Failed to fetch enrollment data for {course_ref.get_identifier()}: {str(e)}"
+f"Failed to fetch enrollment data for {course_ref.get_identifier()} after retries: {str(e)}"
)
if attempts > 0:
logger.info(f"Retrying {attempts} more times...")
await asyncio.sleep(1)
return await process_hit(
hit,
i,
course_count,
selected_term,
term_name,
terms,
course_ref_to_course,
session,
attempts - 1,
)
return None

🛠️ Refactor suggestion | 🟠 Major

Narrow the exception handling to avoid masking bugs.

Catching bare Exception will suppress programming errors (e.g., AttributeError, TypeError, KeyError) that indicate bugs in the code, making debugging difficult. Since the retry logic in fetch_enrollment_package already re-raises after exhausting attempts, you should only catch expected exceptions.

🔎 Proposed fix
-    except (JSONDecodeError, EnrollmentFetchError, Exception) as e:
+    except (JSONDecodeError, EnrollmentFetchError) as e:
         logger.warning(
             f"Failed to fetch enrollment data for {course_ref.get_identifier()} after retries: {str(e)}"
         )
         return None

If you need to catch additional specific exceptions (e.g., KeyError during data processing), add them explicitly rather than using the catch-all Exception.

🤖 Prompt for AI Agents
generation/enrollment.py around lines 358 to 362: the except block currently
catches (JSONDecodeError, EnrollmentFetchError, Exception) which masks
programming errors; change it to catch only the expected exceptions (e.g.,
JSONDecodeError, EnrollmentFetchError) and, if there are known data-processing
errors you expect (for example KeyError), add those explicitly; remove the bare
Exception, let unexpected exceptions propagate (or re-raise them) so they are
not silently swallowed, and keep the existing logger.warning message for the
explicitly caught exceptions.
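
A tiny standalone demonstration of the masking problem (hypothetical parse helpers, not code from this PR):

```python
def parse(payload):
    try:
        return payload["sections"][0]
    except Exception:  # also swallows TypeError, AttributeError, ...
        return None

def parse_narrow(payload):
    try:
        return payload["sections"][0]
    except (KeyError, IndexError):  # only the expected "missing data" cases
        return None

# A bug (passing None instead of a dict) is silently hidden by the broad handler:
print(parse(None))                 # None — the TypeError never surfaces

# The narrow handler still covers the expected miss...
print(parse_narrow({"other": 1}))  # None
# ...but parse_narrow(None) raises TypeError, exposing the bug at its source.
```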

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (3)
generation/enrollment_data.py (3)

46-56: Consider using Pydantic's PrivateAttr for private instance attributes.

The _instructors_set attribute is defined as a class-level annotation but used as instance state. In Pydantic 2.x, private attributes should be declared using PrivateAttr for proper initialization and handling.

🔎 Proposed refactor
+from pydantic import ConfigDict, PrivateAttr
 
 class GradeData(_GradeDataBase):
     """Extended GradeData with set-based instructors for internal processing."""
 
     model_config = ConfigDict(
         populate_by_name=True,
-        # Allow mutation for the instructors field during processing
         frozen=False,
     )
 
-    # Override to use set internally for efficient operations
-    _instructors_set: set[str] | None = None
+    _instructors_set: set[str] | None = PrivateAttr(default=None)
 
     def __init__(self, **data):
         # Convert list to set if needed for internal processing
         instructors = data.get("instructors")
+        super().__init__(**data)
         if isinstance(instructors, list):
-            self._instructors_set = set(instructors) if instructors else None
+            object.__setattr__(self, '_instructors_set', set(instructors) if instructors else None)
         elif isinstance(instructors, set):
-            self._instructors_set = instructors
-            data["instructors"] = list(instructors) if instructors else None
-        super().__init__(**data)
+            object.__setattr__(self, '_instructors_set', instructors)

91-101: Type hint vs runtime check inconsistency.

The method signature declares other: GradeData | None, but lines 97-101 use hasattr(other, "instructors_as_set") suggesting other might be a base class instance. Consider updating the type hint to GradeData | _GradeDataBase | None if mixing types is intentional, or remove the hasattr check if only GradeData instances are expected.


387-390: Consider using model_copy for cleaner attribute updates.

The pattern of unpacking model_dump() and reconstructing works but Pydantic 2.x provides model_copy(update={...}) for this use case.

🔎 Proposed refactor
-            # Update instructors on the grade data
-            grade_data = GradeData(
-                **{**grade_data.model_dump(), "instructors": list(instructors)}
-            )
+            # Update instructors on the grade data
+            grade_data = grade_data.model_copy(update={"instructors": list(instructors)})
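
For reference, `model_copy` with `update` on a hypothetical stand-in model (not the PR's `GradeData`):

```python
from pydantic import BaseModel

class Grades(BaseModel):
    total: int
    instructors: list[str] | None = None

original = Grades(total=120, instructors=None)

# model_copy(update=...) swaps fields without the model_dump()/reconstruct
# round trip; note it does not re-run validation on the updated values.
updated = original.model_copy(update={"instructors": ["SMITH", "DOE"]})

print(updated.instructors)   # ['SMITH', 'DOE']
print(original.instructors)  # None — copies never mutate the source
```
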
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a758fe2 and 7106c23.

📒 Files selected for processing (1)
  • generation/enrollment_data.py
🧰 Additional context used
📓 Path-based instructions (1)
generation/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

generation/**/*.py: Implement the data generation pipeline in the generation service following the documented steps
Use GIST Large Embedding v0 for embedding generation (prefer over deprecated OpenAI embedding)

Files:

  • generation/enrollment_data.py
🧠 Learnings (1)
📓 Common learnings
Learnt from: CR
Repo: twangodev/uw-coursemap PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-08-24T08:36:25.768Z
Learning: Applies to generation/**/*.py : Implement the data generation pipeline in the generation service following the documented steps
🧬 Code graph analysis (1)
generation/enrollment_data.py (2)
generation/schemas/grades.py (4)
  • GradeData (8-131)
  • empty (52-73)
  • merge_with (75-107)
  • from_madgrades (110-131)
generation/schemas/enrollment.py (7)
  • School (12-26)
  • MeetingLocation (29-49)
  • Meeting (52-100)
  • EnrollmentData (103-137)
  • TermData (140-144)
  • from_enrollment (20-26)
  • from_enrollment (120-137)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: build-web
  • GitHub Check: build-web
  • GitHub Check: Generate Data
🔇 Additional comments (4)
generation/enrollment_data.py (4)

1-34: LGTM!

The module setup, imports, and __all__ exports are well-organized. The aliased imports with underscore prefixes clearly indicate base classes while the re-exports provide a clean public API.


258-288: LGTM!

The Meeting class extension is clean. The inline import for CourseReference is an acceptable pattern to avoid circular imports.


312-328: Override is necessary for correct return type, not a duplicate.

Contrary to the past review comment, this from_enrollment override is intentional and necessary. The base class _EnrollmentDataBase.from_enrollment returns _EnrollmentDataBase instances, but this override ensures EnrollmentData (the extended subclass) is returned when called on this class. Removing it would cause type mismatches when consumers expect EnrollmentData instances.


367-368: Exception handling is correct.

The past review comment suggesting "flow continues" is incorrect. The raise statement on line 368 properly re-raises the exception after logging, terminating execution as expected.

Comment on lines +191 to +240
    _all_locations: ClassVar[dict[tuple, "MeetingLocation"]] = {}

    @classmethod
    def get_or_create_with_capacity(
        cls,
        building: str | None,
        room: str | None,
        coordinates: tuple[float, float] | None,
        class_capacity: int | None,
    ) -> MeetingLocation:
        """
        Get existing MeetingLocation or create new one, updating max capacity.

        Args:
            building: Building name
            room: Room identifier
            coordinates: Tuple of (latitude, longitude)
            class_capacity: Current class capacity (used to estimate room capacity)

        Returns:
            MeetingLocation instance with updated max capacity
        """
        location_key = (building, room, coordinates)

        if location_key in cls._all_locations:
            existing_location = cls._all_locations[location_key]
            # Update capacity to maximum seen so far
            if existing_location.capacity is None or (
                class_capacity is not None
                and class_capacity > existing_location.capacity
            ):
                # Create new instance with updated capacity (Pydantic models are immutable by default)
                updated = cls(
                    building=existing_location.building,
                    room=existing_location.room,
                    coordinates=existing_location.coordinates,
                    capacity=class_capacity,
                )
                cls._all_locations[location_key] = updated
                return updated
            return existing_location
        else:
            new_location = cls(
                building=building,
                room=room,
                coordinates=coordinates,
                capacity=class_capacity,
            )
            cls._all_locations[location_key] = new_location
            return new_location

⚠️ Potential issue | 🟠 Major

Class-level cache lacks cleanup and thread safety.

_all_locations is a class-level mutable dict that accumulates entries indefinitely, creating:

  1. Memory leak: No mechanism to clear entries after processing completes
  2. Thread safety: Concurrent access without synchronization in async contexts
  3. Test isolation: Cache persists across test runs, potentially causing flaky tests

Consider adding a clear_cache() classmethod and documenting the expected lifecycle, or using a context-managed approach for the cache scope.

🔎 Proposed addition
     # Class-level dict to track all unique meeting locations for O(1) lookup
     _all_locations: ClassVar[dict[tuple, "MeetingLocation"]] = {}
 
+    @classmethod
+    def clear_cache(cls) -> None:
+        """Clear the location deduplication cache. Call after processing completes."""
+        cls._all_locations.clear()
+
     @classmethod
     def get_or_create_with_capacity(
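
The context-managed alternative mentioned above could look roughly like this — `LocationCache` is a hypothetical stand-in for the class holding `_all_locations`, not the PR's code:

```python
from contextlib import contextmanager
from typing import ClassVar

class LocationCache:
    _entries: ClassVar[dict] = {}

    @classmethod
    @contextmanager
    def scope(cls):
        """Yield the cache, guaranteeing it is cleared when processing ends."""
        try:
            yield cls._entries
        finally:
            cls._entries.clear()

with LocationCache.scope() as cache:
    cache[("Bascom Hall", "B101")] = {"capacity": 250}
    entries_during = len(cache)

print(entries_during)               # 1 — populated while processing
print(len(LocationCache._entries))  # 0 — cleared on exit, even on error
```

Tying the cache lifetime to a `with` block removes the need for callers to remember a `clear_cache()` call, and keeps test runs isolated as a side effect.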

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (3)
generation/enrollment.py (3)

286-313: Retry logic doesn't cover network errors.

The retry_if_exception_type(EnrollmentFetchError) configuration only retries when HTTP status != 200. Transient network errors (connection timeouts, DNS failures, etc.) that raise aiohttp.ClientError or its subclasses will not trigger retry and cause immediate failure, leading to incomplete data.

🔎 Expand retry to cover network errors
+from aiohttp import ClientError
+
 @retry(
     stop=stop_after_attempt(20),
     wait=wait_exponential_jitter(initial=2, max=120, jitter=5),
-    retry=retry_if_exception_type(EnrollmentFetchError),
+    retry=retry_if_exception_type((EnrollmentFetchError, ClientError)),
     reraise=True,
 )
 async def fetch_course_listing_page(session, post_data, page_num):
     """Fetch a page of course listings with retry logic for 202 responses."""
     async with session.post(url=query_url, json=post_data) as response:
         if response.status != 200:
             logger.warning(f"HTTP {response.status} for course listing page {page_num}, retrying...")
             raise EnrollmentFetchError(f"HTTP {response.status}")
         return await response.json()


 @retry(
     stop=stop_after_attempt(20),
     wait=wait_exponential_jitter(initial=2, max=120, jitter=5),
-    retry=retry_if_exception_type(EnrollmentFetchError),
+    retry=retry_if_exception_type((EnrollmentFetchError, ClientError)),
     reraise=True,
 )
 async def fetch_enrollment_package(session, url, course_identifier):
     """Fetch enrollment package with tenacity retry logic."""
     async with session.get(url=url) as response:
         if response.status != 200:
             logger.warning(f"HTTP {response.status} for {course_identifier}, retrying...")
             raise EnrollmentFetchError(f"HTTP {response.status}")
         return await response.json()

Verify that aiohttp-client-cache uses the same exception hierarchy as aiohttp:

Does aiohttp-client-cache raise aiohttp.ClientError exceptions for network errors like connection timeouts and DNS failures?

352-362: Narrow the exception handling to avoid masking bugs.

Catching bare Exception (line 358) will suppress programming errors (e.g., AttributeError, TypeError, KeyError) that indicate bugs in the code, making debugging difficult. Since the retry logic in fetch_enrollment_package already re-raises after exhausting attempts, you should only catch expected exceptions.

🔎 Proposed fix
-    except (JSONDecodeError, EnrollmentFetchError, Exception) as e:
+    except (JSONDecodeError, EnrollmentFetchError) as e:
         logger.warning(
             f"Failed to fetch enrollment data for {course_ref.get_identifier()} after retries: {str(e)}"
         )
         return None

If you need to catch additional specific exceptions (e.g., KeyError during data processing), add them explicitly rather than using the catch-all Exception.


432-433: Coordinate validation incorrectly treats 0 as invalid.

The logic all(coords) (line 433) will reject coordinates where either latitude or longitude is exactly 0.0 (equator or prime meridian), as all() treats 0.0 as falsy. While Wisconsin coordinates are unlikely to be affected, this is technically incorrect.

🔎 Proposed fix
-                    coords = (building.get("latitude"), building.get("longitude"))
-                    coordinates = coords if all(coords) else None
+                    lat = building.get("latitude")
+                    lng = building.get("longitude")
+                    coordinates = (lat, lng) if lat is not None and lng is not None else None
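
The falsiness pitfall is easy to reproduce in isolation:

```python
# all() treats 0.0 as falsy, so a coordinate on the equator or prime
# meridian fails the original truthiness check.
coords = (0.0, -89.4)  # latitude 0.0 is a perfectly valid value
print(all(coords))     # False

# Explicit None checks accept 0.0 while still rejecting missing values.
def has_coordinates(lat, lng):
    return lat is not None and lng is not None

print(has_coordinates(0.0, -89.4))   # True
print(has_coordinates(None, -89.4))  # False
```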
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7106c23 and 70598fd.

📒 Files selected for processing (1)
  • generation/enrollment.py
🧰 Additional context used
📓 Path-based instructions (1)
generation/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

generation/**/*.py: Implement the data generation pipeline in the generation service following the documented steps
Use GIST Large Embedding v0 for embedding generation (prefer over deprecated OpenAI embedding)

Files:

  • generation/enrollment.py
🧬 Code graph analysis (1)
generation/enrollment.py (1)
generation/enrollment_data.py (3)
  • MeetingLocation (185-255)
  • Meeting (258-287)
  • get_or_create_with_capacity (194-240)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: build-web
  • GitHub Check: build-web
  • GitHub Check: Generate Data
🔇 Additional comments (5)
generation/enrollment.py (5)

1-14: LGTM!

The imports are well-organized and all appear to be used appropriately. The addition of Semaphore for concurrency control and tenacity for retry logic aligns with the new pagination and error-handling features.


23-27: LGTM!

The constants are well-documented and the values are appropriate. MAX_PAGE_SIZE = 500 aligns with the API limit, and MAX_CONCURRENT_REQUESTS = 20 provides reasonable rate-limiting protection.


316-343: LGTM!

The addition of the semaphore parameter enables proper concurrency control, and the use of keyword arguments for Course.Reference (line 336) improves code clarity and maintainability.


436-456: LGTM!

The usage of MeetingLocation.get_or_create_with_capacity (lines 436-441) correctly delegates deduplication and capacity management to the factory method. The Meeting instantiation (lines 445-454) properly uses all required fields and aligns with the new Pydantic schema structure.


463-463: LGTM!

The simplified TermData() instantiation (previously TermData(None, None)) aligns with the Pydantic schema migration and provides a cleaner API using default parameter values.

Comment thread generation/enrollment.py
Comment on lines +114 to +129
for page_num in range(1, total_pages + 1):
    page_post_data = {
        "selectedTerm": selected_term,
        "queryString": "",
        "filters": [],
        "page": page_num,
        "pageSize": MAX_PAGE_SIZE,
    }
    try:
        page_data = await fetch_course_listing_page(session, page_post_data, page_num)
        page_hits = page_data.get("hits", [])
        all_hits.extend(page_hits)
        logger.debug(f"Fetched page {page_num}/{total_pages}: {len(page_hits)} courses")
    except EnrollmentFetchError as e:
        logger.error(f"Failed to fetch page {page_num} for {term_name}: {e}")
        continue

⚠️ Potential issue | 🟠 Major

Failed pages are silently dropped, leading to incomplete data.

When fetch_course_listing_page raises EnrollmentFetchError after exhausting retries, the exception is caught (line 127) and the loop continues without re-attempting that page. This means courses on the failed page are permanently missing from all_hits, and downstream processing will have incomplete enrollment data.

Consider either:

  1. Failing the entire term processing if any page fails (remove continue, let exception propagate)
  2. Collecting failed page numbers and re-attempting them after the initial loop
  3. At minimum, add a warning after the loop if len(all_hits) < course_count
🔎 Proposed fix to detect incomplete data
         logger.info(f"Fetched {len(all_hits)} total course listings for {term_name}")
+        
+        if len(all_hits) < course_count:
+            logger.warning(
+                f"Incomplete data: expected {course_count} courses but only fetched {len(all_hits)}. "
+                f"Some pages may have failed after retries."
+            )

Committable suggestion skipped: line range outside the PR's diff.
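
Option 2 (collect failed pages, then re-attempt them) can be sketched like this — `fetch_pages` and `flaky_fetch` are illustrative, not code from the PR:

```python
def fetch_pages(page_nums, fetch_page):
    """Fetch every page, retrying failures in a second pass instead of dropping them."""
    hits, failed = [], []
    for page in page_nums:
        try:
            hits.extend(fetch_page(page))
        except Exception:
            failed.append(page)  # remember the page rather than losing its courses
    for page in failed:
        try:
            hits.extend(fetch_page(page))
        except Exception as e:
            # Surface incompleteness instead of silently continuing.
            raise RuntimeError(f"page {page} failed twice: {e}")
    return hits

# Simulated fetcher: page 2 fails once, then succeeds on the second pass.
state = {"page2_failures": 0}
def flaky_fetch(page):
    if page == 2 and state["page2_failures"] == 0:
        state["page2_failures"] += 1
        raise ConnectionError("transient failure")
    return [f"course-{page}-{i}" for i in range(3)]

result = fetch_pages([1, 2, 3], flaky_fetch)
print(len(result))  # 9 — all three pages recovered
```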

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
generation/madgrades.py (1)

50-52: The 50 retry attempts is intentional, but consider adopting exponential backoff.

The increase to 50 attempts was a deliberate feature addition to handle Madgrades API resilience (confirmed by commit history). However, with a fixed 1-second delay between retries, this could result in 50+ second wait times on failures.

The codebase already uses exponential backoff successfully in other async operations (e.g., fetch_course_listing_page in enrollment.py with wait_exponential_jitter(initial=2, max=120)). Consider applying the same pattern to from_madgrades_async to reduce unnecessary delays on transient failures while maintaining robustness for persistent issues.
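
The shape of that schedule can be sketched without any library (tenacity's `wait_exponential_jitter` has its own exact semantics; this only illustrates the growth-and-cap behavior):

```python
import random

def backoff_delays(attempts, initial=2.0, maximum=120.0, jitter=5.0, seed=0):
    """Exponentially growing delays, capped at `maximum`, plus uniform jitter."""
    rng = random.Random(seed)
    delays = []
    for attempt in range(attempts):
        base = min(initial * (2 ** attempt), maximum)
        delays.append(base + rng.uniform(0, jitter))
    return delays

delays = backoff_delays(8)
print([round(d, 1) for d in delays])
# early retries return quickly (~2s, ~4s, ...); later ones plateau near 120s
```

Compared with 50 fixed 1-second sleeps, transient failures recover almost immediately while persistent ones back off instead of hammering the API.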

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 70598fd and 8d53a52.

📒 Files selected for processing (1)
  • generation/madgrades.py
🧰 Additional context used
📓 Path-based instructions (1)
generation/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

generation/**/*.py: Implement the data generation pipeline in the generation service following the documented steps
Use GIST Large Embedding v0 for embedding generation (prefer over deprecated OpenAI embedding)

Files:

  • generation/madgrades.py
🧬 Code graph analysis (1)
generation/madgrades.py (1)
generation/course.py (1)
  • Course (88-258)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: build-web
  • GitHub Check: build-web
  • GitHub Check: Generate Data
🔇 Additional comments (2)
generation/madgrades.py (2)

43-43: LGTM: Explicit keyword arguments improve readability.

The change to keyword arguments aligns with Pydantic best practices and makes the CourseReference construction more explicit and maintainable.


57-57: LGTM: Cleaner initialization with Pydantic defaults.

The change from TermData(None, None) to TermData() is more Pythonic and aligns with the Pydantic migration, leveraging default or optional field values.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (5)
generation/enrollment.py (5)

122-131: Failed pages are silently dropped, leading to incomplete data.

This issue was previously flagged. When fetch_course_listing_page raises EnrollmentFetchError after exhausting retries (line 127), the exception is caught and the loop continues without re-attempting that page. Courses on the failed page are permanently missing from all_hits.

Additionally, there's no validation after the loop to detect incomplete data (e.g., if len(all_hits) < course_count).


286-298: Retry logic doesn't cover network errors.

This issue was previously flagged. The retry_if_exception_type(EnrollmentFetchError) only retries on HTTP status != 200. Transient network errors (connection timeouts, DNS failures, etc.) raise aiohttp.ClientError or its subclasses, which will not trigger retry and cause immediate failure.

🔎 Expand retry to cover network errors
+from aiohttp import ClientError
+
 @retry(
     stop=stop_after_attempt(20),
     wait=wait_exponential_jitter(initial=2, max=120, jitter=5),
-    retry=retry_if_exception_type(EnrollmentFetchError),
+    retry=retry_if_exception_type((EnrollmentFetchError, ClientError)),
     reraise=True,
 )

301-313: Retry logic doesn't cover network errors.

Same issue as fetch_course_listing_page: only retries EnrollmentFetchError, missing aiohttp.ClientError for network failures.

🔎 Expand retry to cover network errors
 @retry(
     stop=stop_after_attempt(20),
     wait=wait_exponential_jitter(initial=2, max=120, jitter=5),
-    retry=retry_if_exception_type(EnrollmentFetchError),
+    retry=retry_if_exception_type((EnrollmentFetchError, ClientError)),
     reraise=True,
 )

352-362: Narrow the exception handling to avoid masking bugs.

This issue was previously flagged. Catching bare Exception will suppress programming errors (e.g., AttributeError, TypeError, KeyError) that indicate bugs, making debugging difficult. Only catch expected exceptions.

🔎 Proposed fix
-    except (JSONDecodeError, EnrollmentFetchError, Exception) as e:
+    except (JSONDecodeError, EnrollmentFetchError) as e:
         logger.warning(
             f"Failed to fetch enrollment data for {course_ref.get_identifier()} after retries: {str(e)}"
         )
         return None

432-433: Coordinate validation incorrectly treats 0 as invalid.

This issue was previously flagged. The logic all(coords) will reject coordinates where either latitude or longitude is exactly 0.0 (equator or prime meridian), as all() treats 0.0 as falsy. While Wisconsin coordinates are unlikely to be affected, this is technically incorrect.

🔎 Proposed fix
-                    coords = (building.get("latitude"), building.get("longitude"))
-                    coordinates = coords if all(coords) else None
+                    lat = building.get("latitude")
+                    lng = building.get("longitude")
+                    coordinates = (lat, lng) if lat is not None and lng is not None else None
🧹 Nitpick comments (1)
generation/enrollment.py (1)

287-288: Consider the impact of aggressive retry settings.

With stop_after_attempt(20) and wait_exponential_jitter(max=120), a single failing request could retry for an extended period (potentially 20+ minutes in the worst case). While this may be intentional for flaky APIs, ensure this aligns with your pipeline timeout expectations.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8d53a52 and e330c20.

⛔ Files ignored due to path filters (1)
  • generation/uv.lock is excluded by !**/*.lock
📒 Files selected for processing (3)
  • generation/enrollment.py
  • generation/http_utils.py
  • generation/pyproject.toml
🚧 Files skipped from review as they are similar to previous changes (1)
  • generation/pyproject.toml
🧰 Additional context used
📓 Path-based instructions (1)
generation/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

generation/**/*.py: Implement the data generation pipeline in the generation service following the documented steps
Use GIST Large Embedding v0 for embedding generation (prefer over deprecated OpenAI embedding)

Files:

  • generation/enrollment.py
  • generation/http_utils.py
🧬 Code graph analysis (1)
generation/enrollment.py (1)
generation/enrollment_data.py (4)
  • EnrollmentData (290-332)
  • MeetingLocation (185-255)
  • Meeting (258-287)
  • get_or_create_with_capacity (194-240)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Generate Data
  • GitHub Check: build-web
  • GitHub Check: build-web
  • GitHub Check: build-docs
🔇 Additional comments (9)
generation/enrollment.py (8)

23-27: Good addition of rate-limiting and pagination constants.

These constants improve maintainability and make it easy to adjust API limits and concurrency behavior.


136-153: LGTM! Proper concurrency control.

The semaphore correctly limits concurrent API requests to avoid WAF blocks, and is properly passed to process_hit for enforcement.


281-283: LGTM!

Custom exception class follows standard patterns for retry control.


316-326: LGTM!

Function signature correctly updated to accept semaphore for concurrency control.


336-336: LGTM!

Correctly updated to use keyword arguments for Course.Reference construction, aligning with the Pydantic refactor.


436-441: LGTM!

Correctly uses the MeetingLocation.get_or_create_with_capacity factory method with proper keyword arguments, aligning with the Pydantic refactor.


445-454: LGTM!

Correctly updated to use Meeting constructor directly, matching the refactored import structure.


463-463: LGTM!

Correctly updated to use simplified TermData() instantiation, matching the refactored Pydantic model with default field values.

generation/http_utils.py (1)

10-10: LGTM!

The safe dictionary access with an empty string fallback is appropriate and prevents KeyError exceptions.

Comment thread generation/http_utils.py Outdated
Comment thread generation/http_utils.py Outdated