Skip to content

improvements in db.py#1210

Open
Lotram wants to merge 4 commits into
GothenburgBitFactory:developfrom
Lotram:refactor-synchronize
Open

improvements in db.py#1210
Lotram wants to merge 4 commits into
GothenburgBitFactory:developfrom
Lotram:refactor-synchronize

Conversation

@Lotram
Copy link
Copy Markdown
Collaborator

@Lotram Lotram commented May 8, 2026

I've started looking at db.py to prepare for #1077 and #1078
I've removed some unused code and simplified some logic.

There are two improvements I have not implemented yet, because I think it's worth discussing it @ryneeverett

  • the smaller one: merge_left / replace_left, and duplication / ordering

    • on tags, but taskwarrior treats them as a simple (ordered) set, so I don't really understand why we can simply do local["tags"] |= remote["tags"] (merge) / local["tags"] = remote["tags"] (replace), and order them if needed.
    • on annotations the tests I added in tests/test_synchronize_ordering.py also show that deduplication and ordering are inconsistent, because taskwarrior consider them as a normal list, but taskw uses a set. Depending on task creation / update, we could end up with duplicates or not, ordered or not. My suggestion would be to always treat them as set (as taskw does), I don't see a good use case for annotation duplication, but maybe I'm wrong.
  • bigger one: all functions involving key_lists rely on finding issues in DB based on the UDAs that identify them (defined by UNIQUE_KEY. I see multiple issues with that

    • the queries to retrieve a single issue from taskwarrior (find_taskwarrior_uuid), or all of them (get_managed_task_uuids) are complex (especially the first one). As we already compute a bugwarrior identifier, why can't we store it as a UDA (hashing it first) for all tasks created by bugwarrior ? Retrieving a single issue from taskwarrior (or all of them) would be much simpler and straightforward.

    • I'm not sure about this one, but if I understand correctly, the current behavior also explains why each service needs to have specific UDAs (githuburl, gitlaburl, linearurl...). Wouldn't it be cleaner to have more generic UDAs (url in that case) ? The UNIQUE_KEY tuple could simply contain the target / service (depending on whether we want to consider duplicated issues across targets from the same service as actual duplicates or not) to ensure the identifier is indeed unique. We probably can't change this overnight, but I still think we should consider the idea

    • This behavior in case of multiple targets using the same service (e.g. in my own config, multiple Gmail targets) is not explicit: should we consider duplicated issues from multiple targets as the same one (currently the case, based on make_identifier) or not ?

Lotram added 4 commits May 7, 2026 17:41
in aggregate_issues, we have both the Issue object
and the deserialized data, so we can compute the identifier
without relying on a fragile iteration on list of key_lists
Also add some more precise typing
@Lotram Lotram force-pushed the refactor-synchronize branch from 38c2e4b to ed18bbb Compare May 8, 2026 09:13
Copy link
Copy Markdown
Collaborator

@ryneeverett ryneeverett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only looked at the first commit so far (removed unused hamming distance functions) -- seems like hamdist could be replaced by an equality check without changing the behavior, but no longer shortening the longer of the two strings is a behavior change. Are we sure the hamming behavior isn't desirable? (I must admit, I can't think of any good reason for it myself but somebody did bother to implement it...)

If we're going to remove the hamming behavior we should at the very least rename the hamming parameter to annotations and update the docstring accordingly.

You might consider making a separate PR for this commit for a faster review/merge cycle.

@ryneeverett
Copy link
Copy Markdown
Collaborator

  • merge_left / replace_left, and duplication / ordering
    • tags: 👍
    • annotations: Doesn't "always treat them as set" mean we wouldn't preserve order any more? That seems wrong but perhaps I misunderstand.
  • UNIQUE_KEY
    • bugwarrior identifier: I suspect the reason we didn't go with our own UDA is to avoid adding visual junk to the default task report. I don't think the complexity of find_taskwarrior_uuid outweighs that virtue.
    • url prefixes: My recollection is that the UDA prefixes predate the use of UDA's to identify tasks and we used to use description pattern matching (which I think is why the default descriptions have "(bw)" in them). Regardless, I agree that removing the prefixes would be an improvement and since that is a "visual junk" reduction I think this helps justify a bugwarrior id UDA. I do think this would trigger a major version though.
    • duplicate issues from multiple targets: I think the current behavior is correct. Why would one want duplicate tasks?

@Lotram
Copy link
Copy Markdown
Collaborator Author

Lotram commented May 11, 2026

Are we sure the hamming behavior isn't desirable?
We already remove any special char from the string, I'm not sure hamming distance is necessary here. I don't see why annotations could differ slightly between calls. If that happens, I think it should be fixed per service, not here.

Good point about string truncation. Even though I don't really see why we should handle this case (why would an annotation's length should change between calls ?), that's still a behavior change, I'll fix that.

annotations: Doesn't "always treat them as set" mean we wouldn't preserve order any more? That seems wrong but perhaps I misunderstand

My point was: taskw does not preserver the order when updating annotations, but you're right, we should try to preserve it in our side. Are you ok with deduplicating annotations though ?

About UNIQUE_KEY:

  • One other argument in favor of a bugwarrior ID: for now, if we ever change the naming or the number of fields in UNIQUE_KEY, all behaviors related to key_list are broken, whereas with an identifier, the identifier computation becomes an implementation detail inside bugwarrior. If we later change the way we compute this identifier (which can be customized per service), the only broken feature is task deduplication (and we can find solutions to mitigate this), we'll still be able to retrieve all bugwarrior tasks. I'll write an issue for this
  • duplicate issues from multiple targets: I'm fine with this, but since the config is "per-target" rather than "per-service", I think it would make sense to explicit this in the documentation (and the code). It's quite hidden inside the key_list behavior right now

Also note that the two other commits (simplify key_list and Move make_identifier function to collect.py) are pure code refactoring, and don't change the current behavior.

@ryneeverett
Copy link
Copy Markdown
Collaborator

Are you ok with deduplicating annotations though ?

All else equal, I don't think we should deduplicate annotations. If somebody accidentally posts the same comment twice on Github, ideally it would show up twice in the annotations.

However, since taskw already uses sets I think it's probably better to do so more consistently and deduplicate on task creation. I also think it would be fine for the taskchampion backend to have the same behavior if it simplifies to the implementation.

One other argument in favor of a bugwarrior ID: for now, if we ever change the naming or the number of fields in UNIQUE_KEY, all behaviors related to key_list are broken...

I see your point but the only consequence of this is that all issues will be closed and re-opened and any data the user manually added to the task is lost. Given that we've been using this mechanism for quite a while and never needed to do this, it seems like both a low-likelihood and low-impact concern.

Copy link
Copy Markdown
Collaborator

@ryneeverett ryneeverett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't do a thorough review, but the basic ideas of the other two commits seems sound:

  • simplify key_list: The keys of the former key_list were unused so we may as well just have lists (well actually tuples -- seems like the list[str] type annotations were wrong before?). And since the UNIQUE_KEY is the same for each given service even if the user has multiple targets, they may as well be deduplicated in a set() to simplify the query. 👍
  • Move make_identifier function to collect.py: We were formerly making the unique identifier after we'd already "forgotten" which service the issue belongs to which means we had to check each service's UNIQUE_KEY to figure out which one we were dealing with. So instead we calculate it at an earlier stage when we still have access to that data. This is a good opportunity to make a proper data structure CollectedIssue rather than temporarily tack more data onto the task like we formerly were with target. 👍

Comment thread bugwarrior/db.py
if issue['priority'] == '':
issue['priority'] = None

# Target was only tacked on to pass configuration to this function.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The above comment should be removed as it was only explaining the pop().

Comment thread bugwarrior/collect.py
log.info("Done aggregating remote issues.")


# NOTE: should this be a method of Issue instead ?
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you mean a method of TaskConstructor? That would make sense to me. I think this is the kind of higher-level concern that I was trying to extract out of the Service/Issue base classes to minimize their scope.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No I was really suggesting Issue, as UNIQUE_KEY is an attribute of Issue, and the make_unique_identifier method depends only on this attribute, and Issue.get_taskwarrior_record

we could have something like

from functools import cached_property


class Issue(abc.ABC):
    UNIQUE_KEY: tuple[str, ...]

    @abc.abstractmethod
    @cached_property
    def to_taskwarrior(self) -> TaskwarriorData:
        """Transform a foreign record into a taskwarrior dictionary."""
        raise NotImplementedError()

    @property
    def unique_identifier(self) -> str:
        subset = {key: self.to_taskwarrior[key] for key in self.UNIQUE_KEY}
        return json.dumps(subset, sort_keys=True)

Issue could also be a pydantic model, where to_taskwarrior is simply the model_dump method, and unique_identifier could be a computed_field, as I think it makes sense to attach the identifier to the Issue object.

But if you wanted to extract some logic away from this, I'm ok with what we have now.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'd rather keep it out of Issue at this point since that would make it accessible to services as part of the stabilized API.

Issue could also be a pydantic model, where to_taskwarrior is simply the model_dump method, and unique_identifier could be a computed_field, as I think it makes sense to attach the identifier to the Issue object.

This is very interesting. My original motivation for using pydantic for configuration is that I think the more we can make services declarative data rather than arbitrary code the better off we are and this seems like another great step in that direction.

Comment thread bugwarrior/db.py
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe key_lists should be renamed unique_keys / build_unique_keys? It's been a while since any lists were involved and the name was already quite vague.

@Lotram
Copy link
Copy Markdown
Collaborator Author

Lotram commented May 19, 2026

I created 1 MR for each commit in this one, for faster iterations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants