Releases: InternetHealthReport/internet-yellow-pages
v4.0.2
This release fixes two crawlers and adds rerun functionality to the postprocess scripts.
What's Changed
- RIPE Atlas introduced new fields to the measurement metadata, which broke the crawler.
- Cloudflare crawlers sometimes run into rate limiting, which was not handled correctly because Cloudflare's responses do not include a Retry-After header even though Cloudflare claims they do.
- Add rerun functionality to postprocess scripts.
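The rate-limiting fix above can be illustrated with a minimal sketch. It is not the crawler's actual code: the function name `retry_delay` and the `base` and `cap` parameters are assumptions made for illustration. The idea is to honor a Retry-After header when one is present and parseable, and otherwise fall back to capped exponential backoff.

```python
def retry_delay(headers, attempt, base=1.0, cap=60.0):
    """Return the number of seconds to wait before retrying a rate-limited request.

    headers: mapping of response header names to values.
    attempt: zero-based retry counter.
    base, cap: hypothetical backoff parameters, not taken from the project.
    """
    # Header names are case-insensitive, so normalize before the lookup.
    normalized = {key.lower(): value for key, value in headers.items()}
    value = normalized.get('retry-after')
    if value is not None:
        try:
            return max(0.0, float(value))
        except ValueError:
            # Not a number of seconds (e.g., an HTTP date); use backoff instead.
            pass
    # Fallback when the header is missing: capped exponential backoff.
    return min(cap, base * 2 ** attempt)
```

A caller would sleep for `retry_delay(response.headers, attempt)` seconds after a 429 response and then retry, where `response` stands for whatever HTTP client object is in use.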
Full Changelog: v4.0.1...v4.0.2
v4.0.1
What's Changed
- InetIntel changed their format, so this release fixes that crawler.
Full Changelog: v4.0.0...v4.0.1
v4.0.0
(Another) Prefix Node Rework
Remember the changes to the prefix nodes introduced in v3.0.0? Although adding multiple labels to one node is clean, querying was getting too complicated. We also found some cases where normal querying just did not work. This is why we are remodeling the prefix nodes (again)!
Instead of adding multiple labels to a single node, each prefix type now has its own node (but all still have the Prefix label). While this increases the number of nodes and relationships in the graph, it makes querying considerably simpler:
- Each IP node has one PART_OF relationship to the most-specific covering prefix of each type.
- Each Prefix node type has one PART_OF relationship to the most-specific covering prefix of each type.
For the inter-prefix PART_OF relationships there is one catch: A PART_OF relationship between two different types can be between two nodes with the same prefix property (e.g., a BGPPrefix that is also a RPKIPrefix). For relationships between the same type, the PART_OF relationship indicates the most-specific covering prefix.
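The rules above can be sketched in a few lines of Python. This is only an illustration of the model, not code from the project: the `PREFIXES` table is hypothetical toy data, and the helper names are made up. Within the same type, a prefix points to the most-specific prefix that strictly covers it; across types, the target may be the identical prefix.

```python
import ipaddress

# Hypothetical toy data: prefix type -> prefixes of that type.
PREFIXES = {
    'BGPPrefix': ['102.218.128.0/22', '102.218.130.0/24'],
    'RIRPrefix': ['102.218.130.0/24'],
    'RPKIPrefix': ['102.218.130.0/24'],
}


def most_specific_cover(target, candidates, strict=False):
    """Return the longest candidate network covering target, or None.

    target may be an IP address or a prefix. With strict=True, a candidate
    equal to target is skipped (used for same-type relationships).
    """
    net = ipaddress.ip_network(target)
    best = None
    for cand in map(ipaddress.ip_network, candidates):
        if cand.version != net.version or not net.subnet_of(cand):
            continue
        if strict and cand == net:
            continue
        if best is None or cand.prefixlen > best.prefixlen:
            best = cand
    return best


def part_of_targets(target, own_type=None):
    """Map each prefix type to the PART_OF target of the given IP or prefix."""
    targets = {}
    for ptype, prefixes in PREFIXES.items():
        cover = most_specific_cover(target, prefixes, strict=(ptype == own_type))
        if cover is not None:
            targets[ptype] = str(cover)
    return targets
```

With this toy data, `part_of_targets('102.218.130.10')` yields the /24 for every type, while `part_of_targets('102.218.130.0/24', own_type='BGPPrefix')` points to the /22 for BGPPrefix and to the identical /24 for the other two types.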
Warning: The prefix property is now only unique within each subtype. The general Prefix type is still there for convenience, but querying it will just return some prefix node, even when a prefix filter is specified. The Prefix type should thus not be used in most queries.
Example
MATCH p = (:IP {ip: '102.218.130.10'})-[:PART_OF]->+(:Prefix)
RETURN p
Although hard to see due to the cutoff label, here we see an IP node (blue) that is connected to its most-specific RIRPrefix (green), BGPPrefix (pink), RPKIPrefix (yellow), and RDNSPrefix (orange). The symmetric relationships between these prefixes indicate that they are actually all the same. The BGP and RDNS prefixes then have larger covering prefixes.
MATCH p = ((:RPKIPrefix)-[:PART_OF]->(:RPKIPrefix)){8}
RETURN p
LIMIT 1
The Donut of covering RPKI prefixes. Starts at a /24 and goes all the way to a /16.
What's Changed
- Prefix remodeling in #191
Full Changelog: v3.1.0...v4.0.0
v3.1.0
New Dataset
- OpenINTEL CrUX data in #181
- DNS Graph crawler for CrUX data in #185. Due to the size of this dataset, it is currently not included in the weekly dumps.
What's Changed
- Update Cloudflare crawler for better performance in #166
- Various documentation updates (OpenINTEL #183, gallery #186)
- Miscellaneous crawler fixes
Full Changelog: v3.0.1...v3.1.0
v3.0.1
This is a minor release that fixes several small bugs.
Full Changelog: v3.0.0...v3.0.1
v3.0.0
Prefix Node Rework
We are releasing this as a new major version since the changes introduced in #168 require special care when fetching PART_OF relationships of IP and Prefix nodes in the future.
In general, all Prefix nodes now have one or more subtypes. Possible types (at the moment, see here for updated info) are:
- BGPPrefix
- GeoPrefix
- PeeringLAN
- RDNSPrefix
- RIRPrefix
- RPKIPrefix
This complicates the generation of PART_OF relationships for IP and Prefix nodes. As a tradeoff between ease-of-use and number of relationships, we proceed as follows:
- Build a prefix tree for all prefixes of the same type and connect them with PART_OF relationships
- Map an IP to the most-specific prefix of each type
However, since a prefix can have multiple types (e.g., a BGPPrefix can also be an RPKIPrefix) this would create a lot of redundant relationships. Furthermore, it can lead to cases where the correct PART_OF relationship cannot be inferred with a simple query.
For example, an IP x is part of a BGPPrefix a/24. Now there might exist another BGPPrefix b/23 that covers a/24 and is also an RDNSPrefix. This would create PART_OF relationships from x to both a and b (since both are the longest match for one prefix type). Therefore, it would not be possible to return only the most-specific BGPPrefix with a single query.
// This query returns both BGPPrefix nodes.
MATCH p = (:IP {ip: x})-[:PART_OF]->(:BGPPrefix)
RETURN p
As a solution to this, and to reduce the number of relationships, we add a prefix_types property to the PART_OF relationship. It is a list that contains the labels of the prefix types for which this relationship indicates the longest match. If there are multiple PART_OF relationships originating from an IP node, the prefix_types properties of these will be non-overlapping. Thus, to get the most-specific BGPPrefix node from the example above:
MATCH p = (:IP {ip: x})-[po:PART_OF]->(:BGPPrefix)
WHERE 'BGPPrefix' in po.prefix_types
RETURN p
A query for a specific prefix type with PART_OF thus always requires a filter on the prefix_types property!
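To make the role of prefix_types concrete, here is a small Python simulation of the a/24 and b/23 example. It is a sketch under stated assumptions, not the project's implementation: the `PREFIX_LABELS` table uses made-up documentation prefixes, and the function names are hypothetical. For each label it keeps only the longest match and then groups labels per target prefix, which yields non-overlapping prefix_types lists.

```python
import ipaddress

# Hypothetical multi-label prefix nodes mirroring the a/24 and b/23 example.
PREFIX_LABELS = {
    '192.0.2.0/24': {'BGPPrefix'},                # "a"
    '192.0.2.0/23': {'BGPPrefix', 'RDNSPrefix'},  # "b"
}


def part_of_relationships(ip):
    """Return {prefix: prefix_types} for the PART_OF relationships of an IP."""
    addr = ipaddress.ip_address(ip)
    longest = {}  # label -> most-specific covering network for that label
    for prefix, labels in PREFIX_LABELS.items():
        net = ipaddress.ip_network(prefix)
        if addr not in net:
            continue
        for label in labels:
            if label not in longest or net.prefixlen > longest[label].prefixlen:
                longest[label] = net
    relationships = {}
    for label, net in longest.items():
        relationships.setdefault(str(net), []).append(label)
    return {prefix: sorted(types) for prefix, types in relationships.items()}


def most_specific(ip, label):
    """Emulate the WHERE filter on prefix_types for a single label."""
    for prefix, types in part_of_relationships(ip).items():
        if label in types:
            return prefix
    return None
```

Here the IP has relationships to both prefixes, but only the relationship to the /24 carries 'BGPPrefix' in its prefix_types, so `most_specific('192.0.2.10', 'BGPPrefix')` selects the /24 even though the /23 is also a BGPPrefix.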
First Steps Towards Geographical Data
In #177 we introduced modelling of geometric points that are already in our existing datasets. We introduced a new node type Point connected with LOCATED_IN relationships to existing resources (for now AS from CAIDA, and Facility / Organization from PeeringDB). This enables the use of spatial functions available in Cypher. The modelling is still very basic and will be enhanced in the future.
New Dataset
- IPinfo IP-to-country mapping by @maxmouchet in #178
- SimulaMet rDNS data: The crawler for this dataset was already implemented, but blocked by the prefix node rework.
What's Changed
- Add more specific prefix node types #168
- Introduce Point node label and add geolocation modelling to existing crawlers #177
- Added unsorted status code to PCH crawler
- Neo4j updated to version 5.26.3
- Updated pre-commit hooks
New Contributors
- @maxmouchet made their first contribution in #178
Full Changelog: v2.2.0...v3.0.0
v2.2.0
New Dataset
- CAIDA AS to organization mappings by @jehuddleston in #172
What's Changed
- Link to crawler READMEs from dataset page
- Update OpenINTEL data endpoint and API
- Handle invalid example tests in OONI webconnectivity
New Contributors
- @jehuddleston made their first contribution in #172
Full Changelog: v2.1.0...v2.2.0
v2.1.0
New Datasets
- OpenINTEL rDNS
- SimulaMet rDNS RIR data
- World Bank population estimate
- OONI
- CAIDA AS Relationship
- Google CrUX
- CNAMEs in OpenINTEL crawlers
What's Changed
- Add autodeploy scripts
- Enforce canonical IPv6 formatting
- Documentation updates
  - Tables for data sources, node types, and relationship types
  - Gallery updates
  - More instructions
- Rework logging
- Use elementId() instead of deprecated id() in Neo4j
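As an illustration of the canonical IPv6 formatting item above, the following sketch normalizes addresses and prefixes with Python's standard `ipaddress` module. The release notes do not say how the project enforces this, so the helper names and the choice of the compressed, lowercase form are assumptions.

```python
import ipaddress


def canonical_ip(value):
    """Return the compressed, lowercase text form of an IP address."""
    return ipaddress.ip_address(value).compressed


def canonical_prefix(value):
    """Return the compressed, lowercase text form of an IP prefix."""
    return ipaddress.ip_network(value).compressed
```

Normalizing before writing to the graph means that two spellings of the same address, such as an expanded uppercase form and a compressed lowercase form, map to the same property value.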
Full Changelog: v2.0.0...v2.1.0
v2.0.0
Summary
The main change for this version is the remodeling of DNS data (including new node types, e.g. HostName), inclusion of a lot of new datasets, new reference attributes for relationships, and a lot of code cleaning and bug fixes.
List of changes
- new datasets:
  - DNS resolution chain (OpenINTEL)
  - DNS resolution for Umbrella, NS, and MX (OpenINTEL)
  - URL classification (Citizenlab)
  - sibling ASes (InetIntel)
  - Atlas probe (RIPE)
  - Atlas measurement (RIPE)
  - IXP (PCH)
  - url2hostname (post-process)
  - Umbrella (Cisco)
  - IPv6 AS Hegemony (IHR)
  - AS Relationship IPv4 & IPv6 (BGPkit)
  - Alice looking glass
  - RoVista (Virginia Tech)
- support for nodes with multiple labels
- new reference attributes (reference_time_modification, reference_time_fetch and reference_url_data, reference_url_info) replacing reference_time and reference_url
- most (all?) crawlers push nodes and links in batches
- docker service for public instance
- pre-commit checks
- automatically add Neo4j constraints and indexes
- updated to neo4j 5.16
- code cleaning and numerous bug fixes
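The batching item in the list above can be pictured with a generic helper. This is a minimal sketch and not the project's code: the name `batched` and the batch size are assumptions. It splits any iterable of nodes or links into fixed-size chunks, so each chunk can be sent to the database in one request instead of one request per item.

```python
from itertools import islice


def batched(items, size):
    """Yield consecutive lists of at most `size` elements from `items`."""
    if size < 1:
        raise ValueError('size must be at least 1')
    iterator = iter(items)
    # islice consumes up to `size` elements per round; an empty list ends the loop.
    while batch := list(islice(iterator, size)):
        yield batch
```

A crawler could then loop over `batched(links, 10000)` and submit each batch separately, which keeps memory use and transaction size bounded.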
v1.1.0
Summary
Change the node labels to conform to the Neo4j naming convention.
Features
- Renaming of node labels (e.g. DOMAIN_NAME is now DomainName)
- Simplified docker usage with docker_compose file
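The label renaming described above follows a simple pattern for the given example. The sketch below is a naive illustration with a hypothetical function name; it is not how the project performed the rename, and it would not preserve acronym labels, which need explicit exceptions.

```python
def to_camel_case_label(old_label):
    """Convert an UPPER_SNAKE_CASE label to CamelCase (naive, no acronym handling)."""
    return ''.join(part.capitalize() for part in old_label.split('_') if part)
```

For the example from the release notes, `to_camel_case_label('DOMAIN_NAME')` gives `'DomainName'`.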