-
Notifications
You must be signed in to change notification settings - Fork 373
feat(orchestrator): do not upgrade to recalled GuestOS versions #8609
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
feat(orchestrator): do not upgrade to recalled GuestOS versions #8609
Conversation
| started: Arc<AtomicBool>, | ||
| cancelled: Arc<AtomicBool>, | ||
| init_time: Time, | ||
| latest_certified_time: Arc<RwLock<Option<Time>>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We actually do not need to store the Time. It would suffice to store a boolean (also see fn has_replicated_all_versions_certified_before_init). Though, I thought it could bring some added debugging value to be able to observe this value in metrics and dashboard.
| // certified after the replicator was started. | ||
| let latest_certified_time = Arc::new(RwLock::new( | ||
| (latest_certified_time > init_time).then_some(latest_certified_time), | ||
| )); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An alternative to favor simplicity over pure performance is to keep initialize_local_store returning nothing and initializing self.latest_certified_time with Arc::new(RwLock::new(None)). Then the next call to poll will actually set self.latest_certified_time accordingly.
Writing this, I'm now actually quite convinced I'd prefer this simpler solution just described.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, especially since initialize_local_store should return early anyway on nodes that have already registered.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in daea5d8
| pub poll_duration: Histogram, | ||
| pub poll_count: IntCounterVec, | ||
| pub registry_version: IntGauge, | ||
| pub latest_certified_time: GenericGauge<AtomicU64>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IntGauge is signed, so I preferred using an AtomicU64 to avoid conversions when setting the metric.
rs/orchestrator/src/upgrade.rs
Outdated
| .inc(); | ||
|
|
||
| return Err(OrchestratorError::UpgradeError(format!( | ||
| "Not upgrading to recalled replica version {} at regitry version {}", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| "Not upgrading to recalled replica version {} at regitry version {}", | |
| "Not upgrading to recalled replica version {} at registry version {}", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| from_version: RegistryVersion, | ||
| ) -> Result<RegistryVersion, String> { | ||
| let (records, _, _) = registry_canister | ||
| ) -> Result<(RegistryVersion, RegistryVersion, Time), String> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I always find tuples where two or more fields have the same type a bit confusing. Maybe we can return a struct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, done in 0c2367b
| // certified after the replicator was started. | ||
| let latest_certified_time = Arc::new(RwLock::new( | ||
| (latest_certified_time > init_time).then_some(latest_certified_time), | ||
| )); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, especially since initialize_local_store should return early anyway on nodes that have already registered.
| *latest_certified_time.write().unwrap() = maybe_certified_time; | ||
| if let Some(certified_time) = maybe_certified_time { | ||
| metrics | ||
| .latest_certified_time | ||
| .set(certified_time.as_nanos_since_unix_epoch()); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe like this to avoid taking the lock if there are no changes
| *latest_certified_time.write().unwrap() = maybe_certified_time; | |
| if let Some(certified_time) = maybe_certified_time { | |
| metrics | |
| .latest_certified_time | |
| .set(certified_time.as_nanos_since_unix_epoch()); | |
| } | |
| if let Some(certified_time) = maybe_certified_time { | |
| metrics | |
| .latest_certified_time | |
| .set(certified_time.as_nanos_since_unix_epoch()); | |
| *latest_certified_time.write().unwrap() = Some(certified_time); | |
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if there are no changes
Do you mean no changes in the latest_certified_time Option, i.e. from None to None? Indeed, your suggestion could save us from taking the lock for the first few seconds/minute until getting up-to-date.
Because even if there are no changes in the canister (including the certified time), the function will return a Some so we would grab the lock anyways.
Don't you think it would actually read more easily to always update latest_certified_time? The dashboard would better reflect the actual value of the variable, in case we ever change some logic in the future.
No description provided.