Releases: iipc/jwarc

v0.20.0: Release 0.20.0

10 Nov 12:48
@ato ato

New features

  • WarcRevisit.Builder.refersTo() now accepts a String for targetURI #66 (Robert van Loenhout)

  • Added the --save-ca-certificate option to the recorder tool.

  • Certificates issued by the recording proxy now include the subjectAltName extension to satisfy clients with stricter validation.

v0.19.0: Release 0.19.0

12 Sep 06:26
@ato ato

New features

  • jwarc now leniently parses HTTP messages that declare Transfer-Encoding: chunked but whose body does not begin with a valid chunk header, by assuming the body is not actually chunk-encoded. This improves compatibility with tools like Browsertrix that strip the chunked encoding but leave the HTTP header in place.

  • ExtractTool will now extract multiple records when given multiple offsets

  • CdxTool gained support for the 'N' (normalized SURT) field

  • CdxTool gained partial support for pywb's method of encoding request bodies in CDX records. This is still a work in progress and not yet fully compatible with pywb in all cases.
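
The lenient chunked handling described in the first item above can be illustrated with a small heuristic. This is a minimal sketch and not jwarc's actual parser: the class and method names are hypothetical, and it only checks whether a body starts with something shaped like a chunk header (hex size, optional extension, CRLF). A caller would fall back to treating the body as plain content when the check fails.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of the jwarc API.
public class ChunkedSniffer {
    /**
     * Returns true if the body starts with something that looks like a chunk
     * header: one or more hex digits, an optional ";extension", then CRLF.
     */
    static boolean startsWithChunkHeader(byte[] body) {
        int i = 0;
        // chunk-size: at least one hex digit
        while (i < body.length && Character.digit(body[i], 16) >= 0) {
            i++;
        }
        if (i == 0) {
            return false;
        }
        // optional chunk extension: skip everything up to the CR
        if (i < body.length && body[i] == ';') {
            while (i < body.length && body[i] != '\r') {
                i++;
            }
        }
        // the header line must end with CRLF
        return i + 1 < body.length && body[i] == '\r' && body[i + 1] == '\n';
    }

    public static void main(String[] args) {
        byte[] chunked = "5\r\nhello\r\n0\r\n\r\n".getBytes(StandardCharsets.US_ASCII);
        byte[] plain = "<html>hello</html>".getBytes(StandardCharsets.US_ASCII);
        System.out.println(startsWithChunkHeader(chunked)); // true
        System.out.println(startsWithChunkHeader(plain));   // false
    }
}
```

Being a heuristic, this can misfire on an unchunked body that happens to begin with hex digits followed by CRLF, which is one reason to treat it as a fallback rather than a validator.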

v0.18.1: Release 0.18.1

04 Apr 05:57
@ato ato

Bugs fixed

  • WarcReader.position(long) did not reset the GunzipChannel's buffer correctly, causing an exception or the wrong record to be returned after seeking in a gzipped WARC.
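
To show why decoder state matters when seeking, the sketch below uses only java.util.zip and no jwarc classes. It assumes the common convention that each record in a gzipped WARC is its own gzip member, so decompression can start at any member boundary. The class name, helper methods, and record contents are hypothetical. Creating a fresh decoder at the target offset mirrors the idea of discarding stale buffered input after a seek.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration; not jwarc's GunzipChannel implementation.
public class GzipMemberSeek {
    /** Compresses one record as its own gzip member. */
    static byte[] gzipMember(String record) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(record.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Concatenates two members, as a writer appending records would. */
    static byte[] concat(byte[] a, byte[] b) {
        byte[] result = new byte[a.length + b.length];
        System.arraycopy(a, 0, result, 0, a.length);
        System.arraycopy(b, 0, result, a.length, b.length);
        return result;
    }

    /** Decompresses from the given offset using a fresh decoder (no stale buffer). */
    static String readFrom(byte[] file, int offset) {
        try (GZIPInputStream gz = new GZIPInputStream(
                new ByteArrayInputStream(file, offset, file.length - offset))) {
            return new String(gz.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] first = gzipMember("record one");
        byte[] second = gzipMember("record two");
        byte[] file = concat(first, second);
        // The second member starts exactly where the first one ends.
        System.out.println(readFrom(file, first.length)); // record two
    }
}
```

If a decoder kept bytes buffered from the previous position, it would resume in the middle of the wrong member, which matches the symptoms described above: an exception or the wrong record.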

v0.18.0: Release 0.18.0

04 Apr 02:05
@ato ato

New features

  • New cdx package for reading and writing CDX files

  • Added --format option to the cdx tool

  • Added a basic dedupe tool that can deduplicate records against a CDX server (such as OutbackCDX)

  • Added a stats tool that prints counts and sizes of records by status, type, and host

  • WarcReader now has a .position(long) method for seeking to the start of a particular record (if the underlying channel supports seeking)

Bugs fixed

  • HttpRequest/Response.serializeHeader() now returns the exact original bytes for records read from a WarcReader. This means the extract tool no longer reformats extracted HTTP headers.

  • Fixed an issue where null record ids would be written as "" instead of omitting the appropriate header. Notably this means revisit records can be constructed with WARC-Refers-To-Target-URI and WARC-Refers-To-Date but without WARC-Refers-To.

v0.17.2: Release 0.17.2

21 Mar 08:16
@ato ato

Bugs fixed

  • Fixed 'pushback would result in negative position' exception when parsing HTTP messages missing CRLF at the end of the headers

v0.17.1: Release 0.17.1

21 Mar 08:15
@ato ato

Bug fixes

  • The lenient HTTP parser now accepts messages missing the terminating CRLFs at the end of the headers (provided there is no message body).

0.17.0

07 Mar 00:19
@ato ato

New features

  • Added a new constructor to WarcRequest.Builder and WarcResponse.Builder that accepts a String instead of a URI. #64 (Robert van Loenhout)

v0.16.5: Release 0.16.5

23 Sep 17:35
@ato ato

Bugs fixed

  • ValidateTool: fix infinite loop and invalid digest calculation due to incorrect buffer handling

Sorry, no native binaries this time: I haven't gotten around to re-implementing their build process as a GitHub Action.

0.16.4

08 Sep 02:38

Bug fixes

  • Fixed "is not a jwarc command" error in native builds by upgrading GraalVM #61

v0.16.3: Release 0.16.3

08 Sep 01:47

Bugs fixed

  • The ARC parser now accepts non-ASCII characters in URIs. (The ARC
    format was fairly loosely defined so we've been progressively
    relaxing the grammar as we discovered counter-examples.)