Training data contains too many invalid IPv6 addresses #959

@Elsensee

I’m operating a small instance and I’m amazed by the statistics and the tech behind this app. I was wondering, however, why precision and recall seem to vary a lot more for IPv4 than for IPv6. Having read about it on your blog, I have a hunch as to why the IPv6 model may appear to be more stable.

It really is just an RNG for any number in the respective IP address space,

This works fine for IPv4. The address space is relatively small and only a small fraction (224.0.0.0/3, still about 1/8, though) is reserved and will never appear as a valid login attempt. (There are about 21,000,000 additional invalid addresses (> 2^24), but they are negligible. Also, address exhaustion makes some companies use these addresses creatively, so we cannot know how many of them will actually never appear as a login attempt.)

The IPv6 address space is far larger, though. So large that the reverse is true here: only 1/8 of the total address space is currently allocated as Global Unicast (2000::/3). Of that, only the first half is even touched, and probably not even half of that is actually assigned. Granted, there are also the ULA space (fc00::/7) and link-local addresses (fe80::/10), but together they add less than 1% of the space, so the total stays at roughly 1/8 of all possible addresses.
This means that roughly 7 out of 8 randomly generated addresses will never appear as a login attempt on the server, so they are garbage data. This cannot be a useful training data set, can it?
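
To put rough numbers on the fractions above, here is a quick back-of-the-envelope check using Python's standard `ipaddress` module. This is only my own sketch and not code from this app; the helper name `fraction_of_space` is made up for illustration.

```python
import ipaddress


def fraction_of_space(cidr: str) -> float:
    """Share of the whole IPv4 or IPv6 address space covered by one prefix."""
    net = ipaddress.ip_network(cidr)
    # max_prefixlen is 32 for IPv4 and 128 for IPv6.
    return net.num_addresses / 2 ** net.max_prefixlen


# IPv4: the reserved block that never shows up as a source address.
ipv4_reserved = fraction_of_space("224.0.0.0/3")  # 0.125, i.e. 1/8

# IPv6: everything that could plausibly be a source address at all.
ipv6_prefixes = {
    "global unicast": "2000::/3",
    "unique local (ULA)": "fc00::/7",
    "link-local": "fe80::/10",
}
ipv6_usable = sum(fraction_of_space(c) for c in ipv6_prefixes.values())
# 1/8 + 1/128 + 1/1024 = 0.1337890625

if __name__ == "__main__":
    print(f"IPv4 reserved share:  {ipv4_reserved:.4f}")
    print(f"IPv6 plausible share: {ipv6_usable:.4f}")
    print(f"IPv6 garbage share:   {1 - ipv6_usable:.4f}")
```

So for IPv4 about 7/8 of uniformly random addresses are at least plausible, while for IPv6 it is the other way around.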

To be honest, I’m not an expert in the field of AI or neural nets, but as I understand it, this app should train against login attempts that could theoretically occur. Only then can it distinguish them from the real, valid login attempts.

I would therefore consider adjusting the random number generator so that it only generates IP addresses that can actually occur as a valid source IP.
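
A minimal sketch of what I have in mind, written in Python purely for illustration. I don't know how the app's generator is actually implemented, so every name here (`ALLOWED_PREFIXES`, `random_address_in`, `sample_negative`) is hypothetical, and restricting to `2000::/3` alone is just my assumption about what counts as a valid source range.

```python
import ipaddress
import random

# Assumption: only Global Unicast can appear as a remote login source.
# More prefixes (or narrower, actually allocated ones) could be listed here.
ALLOWED_PREFIXES = [ipaddress.ip_network("2000::/3")]


def random_address_in(net, rng):
    """Draw one address uniformly from a single prefix."""
    offset = rng.randrange(net.num_addresses)
    return net.network_address + offset


def sample_negative(rng, prefixes=ALLOWED_PREFIXES):
    """Draw a random address that lies inside one of the allowed prefixes.

    The prefix is picked uniformly, which is an arbitrary choice; weighting
    by prefix size or by observed traffic would be equally possible.
    """
    return random_address_in(rng.choice(prefixes), rng)


if __name__ == "__main__":
    rng = random.Random()
    for _ in range(3):
        print(sample_negative(rng))
```

Compared with drawing a random 128-bit number, every sample here is at least inside the currently allocated unicast range, although most of them will still be unassigned.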

What do you think about it? Am I misunderstanding something about training and AI?
