improvements in db.py#1210
Conversation
in aggregate_issues, we have both the Issue object and the deserialized data, so we can compute the identifier without relying on a fragile iteration on list of key_lists Also add some more precise typing
38c2e4b to
ed18bbb
Compare
ryneeverett
left a comment
There was a problem hiding this comment.
Only looked at the first commit so far (removed unused hamming distance functions) -- seems like hamdist could be replaced by an equality check without changing the behavior, but no longer shortening the longer of the two strings is a behavior change. Are we sure the hamming behavior isn't desirable? (I must admit, I can't think of any good reason for it myself but somebody did bother to implement it...)
If we're going to remove the hamming behavior we should at the very least rename the hamming parameter to annotations and update the docstring accordingly.
You might consider making a separate PR for this commit for a faster review/merge cycle.
|
Good point about string truncation. Even though I don't really see why we should handle this case (why would an annotation's length should change between calls ?), that's still a behavior change, I'll fix that.
My point was: About
Also note that the two other commits (simplify key_list and Move make_identifier function to collect.py) are pure code refactoring, and don't change the current behavior. |
All else equal, I don't think we should deduplicate annotations. If somebody accidentally posts the same comment twice on Github, ideally it would show up twice in the annotations. However, since taskw already uses sets I think it's probably better to do so more consistently and deduplicate on task creation. I also think it would be fine for the taskchampion backend to have the same behavior if it simplifies to the implementation.
I see your point but the only consequence of this is that all issues will be closed and re-opened and any data the user manually added to the task is lost. Given that we've been using this mechanism for quite a while and never needed to do this, it seems like both a low-likelihood and low-impact concern. |
ryneeverett
left a comment
There was a problem hiding this comment.
I didn't do a thorough review, but the basic ideas of the other two commits seems sound:
- simplify key_list: The keys of the former
key_listwere unused so we may as well just have lists (well actually tuples -- seems like thelist[str]type annotations were wrong before?). And since the UNIQUE_KEY is the same for each given service even if the user has multiple targets, they may as well be deduplicated in aset()to simplify the query. 👍 - Move make_identifier function to collect.py: We were formerly making the unique identifier after we'd already "forgotten" which service the issue belongs to which means we had to check each service's UNIQUE_KEY to figure out which one we were dealing with. So instead we calculate it at an earlier stage when we still have access to that data. This is a good opportunity to make a proper data structure
CollectedIssuerather than temporarily tack more data onto the task like we formerly were withtarget. 👍
| if issue['priority'] == '': | ||
| issue['priority'] = None | ||
|
|
||
| # Target was only tacked on to pass configuration to this function. |
There was a problem hiding this comment.
The above comment should be removed as it was only explaining the pop().
| log.info("Done aggregating remote issues.") | ||
|
|
||
|
|
||
| # NOTE: should this be a method of Issue instead ? |
There was a problem hiding this comment.
Did you mean a method of TaskConstructor? That would make sense to me. I think this is the kind of higher-level concern that I was trying to extract out of the Service/Issue base classes to minimize their scope.
There was a problem hiding this comment.
No I was really suggesting Issue, as UNIQUE_KEY is an attribute of Issue, and the make_unique_identifier method depends only on this attribute, and Issue.get_taskwarrior_record
we could have something like
from functools import cached_property
class Issue(abc.ABC):
UNIQUE_KEY: tuple[str, ...]
@abc.abstractmethod
@cached_property
def to_taskwarrior(self) -> TaskwarriorData:
"""Transform a foreign record into a taskwarrior dictionary."""
raise NotImplementedError()
@property
def unique_identifier(self) -> str:
subset = {key: self.to_taskwarrior[key] for key in self.UNIQUE_KEY}
return json.dumps(subset, sort_keys=True)Issue could also be a pydantic model, where to_taskwarrior is simply the model_dump method, and unique_identifier could be a computed_field, as I think it makes sense to attach the identifier to the Issue object.
But if you wanted to extract some logic away from this, I'm ok with what we have now.
There was a problem hiding this comment.
Yeah, I'd rather keep it out of Issue at this point since that would make it accessible to services as part of the stabilized API.
Issue could also be a pydantic model, where
to_taskwarrioris simply themodel_dumpmethod, andunique_identifiercould be a computed_field, as I think it makes sense to attach the identifier to theIssueobject.
This is very interesting. My original motivation for using pydantic for configuration is that I think the more we can make services declarative data rather than arbitrary code the better off we are and this seems like another great step in that direction.
There was a problem hiding this comment.
Maybe key_lists should be renamed unique_keys / build_unique_keys? It's been a while since any lists were involved and the name was already quite vague.
|
I created 1 MR for each commit in this one, for faster iterations. |
I've started looking at db.py to prepare for #1077 and #1078
I've removed some unused code and simplified some logic.
There are two improvements I have not implemented yet, because I think it's worth discussing it @ryneeverett
the smaller one:
merge_left/replace_left, and duplication / orderingtags, but taskwarrior treats them as a simple (ordered) set, so I don't really understand why we can simply dolocal["tags"] |= remote["tags"](merge) /local["tags"] = remote["tags"](replace), and order them if needed.annotationsthe tests I added intests/test_synchronize_ordering.pyalso show that deduplication and ordering are inconsistent, becausetaskwarriorconsider them as a normal list, buttaskwuses a set. Depending on task creation / update, we could end up with duplicates or not, ordered or not. My suggestion would be to always treat them as set (astaskwdoes), I don't see a good use case for annotation duplication, but maybe I'm wrong.bigger one: all functions involving
key_listsrely on finding issues in DB based on the UDAs that identify them (defined byUNIQUE_KEY. I see multiple issues with thatthe queries to retrieve a single issue from taskwarrior (
find_taskwarrior_uuid), or all of them (get_managed_task_uuids) are complex (especially the first one). As we already compute a bugwarrior identifier, why can't we store it as a UDA (hashing it first) for all tasks created by bugwarrior ? Retrieving a single issue from taskwarrior (or all of them) would be much simpler and straightforward.I'm not sure about this one, but if I understand correctly, the current behavior also explains why each service needs to have specific UDAs (
githuburl,gitlaburl,linearurl...). Wouldn't it be cleaner to have more generic UDAs (urlin that case) ? TheUNIQUE_KEYtuple could simply contain thetarget/service(depending on whether we want to consider duplicated issues across targets from the same service as actual duplicates or not) to ensure the identifier is indeed unique. We probably can't change this overnight, but I still think we should consider the ideaThis behavior in case of multiple targets using the same service (e.g. in my own config, multiple Gmail targets) is not explicit: should we consider duplicated issues from multiple targets as the same one (currently the case, based on
make_identifier) or not ?