Add a dummy site at the end of the sequence#1058
Add a dummy site at the end of the sequence#1058jeromekelleher merged 3 commits intotskit-dev:mainfrom
Conversation
|
I'm not sure "breakpoint" here is a great bit of terminology as @hyanwong points out. It can be confused with How about |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #1058 +/- ##
==========================================
+ Coverage 90.26% 90.35% +0.09%
==========================================
Files 19 19
Lines 7086 7205 +119
Branches 1170 1187 +17
==========================================
+ Hits 6396 6510 +114
- Misses 559 562 +3
- Partials 131 133 +2
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
|
I finally got all the tests working (just need to fix some CI issues with Python 3.13). The main issue was that many tests create custom ancestor data/ tree sequences without a terminal site. I first tried adding a site in before matching, but I realised that this doesn't make sense: both I still need to clean up the edge extension code and add some tests, so it isn't ready for review quite yet. |
jeromekelleher
left a comment
There was a problem hiding this comment.
Overall looks good. Some low-level changes would help. I can take a look when these are done,.
910f071 to
394c93f
Compare
bd81216 to
571e7be
Compare
|
@jeromekelleher this is ready for another review. Changing the low-level code to have a separate terminal site function simplified things considerably. Both match_ancestors and match_samples now always carry through any terminal site to the output and edges are unmodified until post-processing. This meant that I had to manually fix every test that didn't post-process the output tree sequence, but I think this is the most elegant solution by far. |
|
Great, thanks @Duncan-JR. This looks good to me - I'll pick it up next week once I've cleared off a couple off a couple of things. I'll take it from here. Can you squash the commits down please? |
2f8b2a8 to
ccfa7b3
Compare
f6aca21 to
5b38506
Compare
Since we use half-open intervals in tskit, and tsinfer works with discrete sites, ancestors that include the rightmost inference site (let's call its index
r) have to be defined as ending at site indexr+1. Since this extra site has no position or haplotypes, we have to handle it as a special case at times, hard-coding it to have a position ofts.sequence_length.This work-in-progress PR seeks to explicitly encode this rightmost extra site, which I've tentatively called a breakpoint, in the ancestor data. This is the first step toward adding support for multiple user-provided breakpoints; for example, the user could specify the start of the centromere as a breakpoint, such that tsinfer would run on each chromosome arm separately.
So far, I've implemented the extra site in the Python engine up until ancestor generation. Next step is to get ancestor and sample matching to work. Here is a small simulated example to show how the change affects ancestral haplotypes: