Releases · bodleian/wacksy

31 Oct 12:26

Immutable

v0.1.3

a07fc59

v0.1.3 Latest

Latest

After reading about the Lean Crate Initiative, I edited the Cargo manifest to include only the files necessary for building the crate when publishing.

I added a small function to read the Gzip magic number (0x1f0x8b) at the beginning of the file, to check whether or not it's a Gzip archive. This is going to be part of a bigger rewrite of the WARC reading code which I'm working on in a branch.

Dependencies

Motivated by easy performance wins, I've been looking at the dependencies of this library.

Remove surt-rs and replace it with the original simplified surt creating function. This effectively reverts commit 15d73c9. My main motivation for this was that surt-rs relies on the regex library, which made up about half of the size of the library. I might revisit this decision in future, with some tweaks to surt-rs.
Remove short uuids used for page identifiers, and replace them with a simple incrementing counter. Much like the above, this cut out some complexity and eliminated unnecessary code at no real cost.

Assets 3

30 Sep 09:25

github-actions

Immutable

v0.1.1

7369dcc

v0.1.1

This first minor release includes a new API, made up of two functions: from_file() takes a WARC file, indexes it and produces a structured representation of a WACZ object, and as_zip_archive() takes that structured representation and writes it out to a zip archive.

Note

Despite the more serious committment to a stable API in this version, this library is still not ready for professional use. I'm still not sure the indexer is correctly calculating the byte offset of each record in the WARC file. Output WACZes do not replay properly with ReplayWeb.page.

I've been trying to model the construction pattern to fit the conditions of the format. For example, all the resources in a collection must be defined in the datapackage, or a page record cannot be created without a corresponding CDX record. These conditions should be satisfied by a carefully ordered process, with each step flowing logically into the next. The indexer and the new API were (re)written with this goal in mind.

Pages.jsonl

Each page listed in pages.jsonl now gets assigned a short uuid, I am not sure if this is necessary, but without an id the page records don't pass validation against the frictionless datapackage schema.
The pages file also includes a header line in the form {"format":"json-pages-1.0","id":"pages","title":"All Pages"}".

Unit test coverage and JSON schemas

I've converted most of the integration tests into unit tests, and where possible replaced string comparison with validation against a JSON Schema.
Serialisation of values to JSON still feels too ✨magical✨ to me; I don't necessarily want to change any of the code here, but I would like to use serde more confidently.

Fixes

(indexer) move WARC record counter forward by 1 because the iterator ennumeration is zero-indexed. Easy mistake.

Other

Renamed zip() to as_zip_archive(), thanks to @anna-hope for the suggestion and @eviejmorris for the fix.
Replaced example code with a doctest, and added usage example to readme.
Bumped the MSRV to 1.87.
Moved the repository to the Bodleian organisation on GitHub.
Use pretty assertions in tests.
Wrote this changelog.

Dependencies

Updated rawzip to version 0.4.1, and refactored as_zip_archive to handle the new API.
Added short uuid for generating page ids.

Contributors

anna-hope and eviejmorris

Assets 3

06 Aug 15:34

github-actions

Immutable

v0.0.2

1f355dd

v0.0.2

This release involves some refactoring, different parts of the indexer are now in their own modules.
As a result of this, it was easier to write unit tests for each resource, so I've now done that, along with two integration tests.
The tests just cover the basics, I expect to expand these in future to check errors and other things.

The page record indexer now only indexes records according to a set of conditions which guarantee the record is a web document.
Unfortunately the WACZ spec does not define what a page is in terms we can use here, so I have come up with the following conditions:

The WARC record type is either Response, Revisit, or Resource
The HTTP content-type is either text/html, application/xhtml+xml, or text/plain.
The HTTP status code is 200 OK.

This is an imperfect best-guess attempt to pick out things which might be pages from a WARC file.
The reason I filter for successful status codes is I realised that some failed requests return HTML pages in the response along with a 404 error.
Those are definitely pages, but I guess they're not what people want out of the pages.jsonl index.

I made a brief attempt to replace sha256 with the faster blake3 hashing algorithm, but this breaks compatibility with py-wacz.
I think this is something which will have to wait until blake3 can be integrated into the python standard library as part of hashlib.

Dependencies

This library now depends on surt-rs to create searchable url strings. It's a fairly minimal library and is more comprehensive than my own attempt to write a surt-ing function.
Bump rawzip to 0.3 (#41), thanks @nickbabcock!

Contributors

nickbabcock

Assets 3

20 Jun 09:16

github-actions

v0.0.1

c8fab4c

v0.0.1

As of this point, the WACZ and indexer can output (almost) everything needed from a WARC file to a fully spec-compliant WACZ file.
The last thing missing was the pages.jsonl file, which is now produced when reading through the WARC file as part of the indexer.
I want to avoid reading through the WARC twice to produce two files, so have wrapped everything into one indexer, again there's probably a better way of doing this.

The other happy change in this release is removing code duplication from the WARC reader in case of gzipped and non-gzipped files.
First time I've tried using type generics in Rust, the code is messy, but it works.

Added

(indexer) Use type generics to eliminate code duplication when iterating through records, this finally gets rid of an awkward situation where I was having to maintain two separate iterators .
add pages indexer to wacz writer, with a struct for page records, this is the main thing in this release.

Fixed

add newline to page records, needed for pages.jsonl format, closes #37, nice and easy change
(indexer) skip serialising null fields in page record
(datapackage) pass cdxj_index_bytes through to the datapackage

Other

Lots more little documentation/readme changes and additions. Code refactoring, etc.

(indexer) use core instead of standard libraries for error formatting
add serde features to dependencies, update cargofile
(datapackage) move compose_datapackage into datapackage implementation
(datapackage) DataPackageResource::new now returns a result/error rather than panicking
(indexer) use httparse to parse http status code from response and remove the happily redundant cut_http_headers_from_record function

Assets 2

16 May 10:55

github-actions

v0.0.1-beta

91c46b6

v0.0.1-beta

Work on this version was mostly refactoring, adding structured types and error handling, and some documentation (only just started).

Still on my todo list is to use the indexer to also create pages.jsonl files.

Fixed

replace wrapping_add in loop counter with enumerate, closes #29
(indexer) return the same error message for gzipped and non-gzipped files. I have tried to simplify the code for processing both gzipped and non-gzipped files. There's still unnecessary duplication but it's the best I can do for the moment.

Other

document some DataPackage structs, better documentation coming once this is properly finished!
as a style change, this now uses explicit returns everywhere, and I have set lints in cargo.toml to enforce this
(indexer) many of the index functons are now implemented on types. The completed index is returned as a struct, which has a display implementation to write it out to json(l).
(datapackage) propogate errors upwards, there are still some panics, but structured error handling is a lot more comprehensive now. Happy and unhappy paths are a little clearer to identify.
update README with link to a funny meme :)

Assets 2

05 Apr 21:27

github-actions

v0.0.1-alpha

7e87fef

v0.0.1-alpha Pre-release

Pre-release

At this stage the library can read a WARC file to produce a CDXJ index, and a datapackage.

Added

(indexer) types for DataPackage and DataPackageResource
(indexer) various types for CXDJIndexRecord

Assets 2

Releases: bodleian/wacksy

v0.1.3

Dependencies

Uh oh!

v0.1.1

Pages.jsonl

Unit test coverage and JSON schemas

Fixes

Other

Dependencies

Contributors

Uh oh!

v0.0.2

Dependencies

Contributors

Uh oh!

v0.0.1

Added

Fixed

Other

Uh oh!

v0.0.1-beta

Fixed

Other

Uh oh!

v0.0.1-alpha

Added

Uh oh!