File hashes (MD5) #398

geoffjukes · 2019-01-30T18:36:53Z

Hi,

We use a standalone manifest file to describe our audiobooks, which includes MD5s for all file assets (audio files, artwork, and any supplemental material). The audiobook player uses the MD5 to check downloaded assets, and for asset update/invalidation.

Any spec that we adopt must at least allow these hashes somewhere in the dataset.

Geoff

GarthConboy · 2019-01-30T18:58:20Z

Particularly wedded to MD5, or something more modern/secure okay? SHA-2?

dauwhe · 2019-01-30T19:00:49Z

I wonder if we can align with the W3C's Subresource Integrity specification. Perhaps we add an "integrity" member to the manifest? And then use algo+hash as in the spec?

{
    "type": "LinkedResource",
    "url": "fonts/STIXGeneral.otf",
    "encodingFormat": "application/vnd.ms-opentype",
    "integrity": "sha384-dOTZf16X8p34q2/kYyEFm0jh89uTjikhnzjeLeF0FHsEaYKb1A1cv+Lyv4Hk8vHd"
}
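
A minimal sketch of how such an "integrity" value could be produced, assuming the SRI form of an algorithm name, a hyphen, and the base64-encoded digest. The function name `sri_integrity` and the sample bytes are hypothetical; a real tool would read the font file from disk.

```python
import base64
import hashlib


def sri_integrity(data: bytes, algorithm: str = "sha384") -> str:
    """Return an SRI-style hash expression: '<algorithm>-<base64 digest>'."""
    if algorithm not in ("sha256", "sha384", "sha512"):
        raise ValueError(f"unsupported algorithm: {algorithm}")
    digest = hashlib.new(algorithm, data).digest()
    return f"{algorithm}-{base64.b64encode(digest).decode('ascii')}"


# Hypothetical bytes standing in for the contents of the font file.
font_bytes = b"example font data"

entry = {
    "type": "LinkedResource",
    "url": "fonts/STIXGeneral.otf",
    "encodingFormat": "application/vnd.ms-opentype",
    "integrity": sri_integrity(font_bytes),
}
```

The digest is base64 rather than hex, which is the main visible difference from a typical MD5 checksum string.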

iherman · 2019-01-30T19:02:11Z

Can you give some details? Do you store a hash for each resource separately, or one global hash (some sort of a merkle tree)?

iherman · 2019-01-30T19:03:26Z

B.t.w., this should probably be a 'core' manifest feature and not audio specific.

dauwhe · 2019-01-30T19:03:30Z

> Can you give some details? Do you store a hash for each resource separately, or one global hash (some sort of a merkle tree)?

See https://github.com/blackstoneaudio/audiobook-spec/blob/master/draft.yaml

geoffjukes · 2019-01-30T19:09:51Z

@GarthConboy We use MD5s because they are cheap to compute, and they are used for download validation, not security. Same purpose as 'ETag' keys in S3 or equivalent.

We use the word 'md5' as the key, but anything equivalent (hash, checksum, etc) would be fine, as long as we know what it is. For us the string value would be an MD5. For others it could be anything.

@dauwhe That seems very reasonable to me.

@iherman We store a hash for each resource separately; see https://github.com/blackstoneaudio/audiobook-spec/blob/cfd468bb27b890b0e4a59a3345e806221a702fce/draft.json#L59

We do also store a 'hash of hashes', which we use as a sort of 'version'; see https://github.com/blackstoneaudio/audiobook-spec/blob/cfd468bb27b890b0e4a59a3345e806221a702fce/draft.json#L11
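
One way a 'hash of hashes' could be derived is sketched here. The thread does not say how the value is actually computed, so the sorting, the 'url:hash' line format, the choice of SHA-256, and the name `hash_of_hashes` are all assumptions made for illustration.

```python
import hashlib


def hash_of_hashes(resource_hashes: dict[str, str]) -> str:
    """Combine per-resource hashes into one digest usable as a version tag.

    Assumption: entries are sorted by URL and joined as 'url:hash' lines,
    so the result does not depend on dictionary order.
    """
    lines = [f"{url}:{digest}" for url, digest in sorted(resource_hashes.items())]
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()


# Hypothetical asset names; the per-resource hashes are computed from sample bytes.
hashes = {
    "audio/chapter-01.mp3": hashlib.md5(b"chapter one").hexdigest(),
    "images/cover.jpg": hashlib.md5(b"cover art").hexdigest(),
}
version = hash_of_hashes(hashes)
```

Any change to a single resource hash changes the combined value, which is what lets it act as a coarse version indicator.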

HadrienGardeur · 2019-03-23T17:00:44Z

Is there anything in schema.org that we could use for that?

I don't think that we should be tied to any specific algorithm, which potentially means:

- identifying the algorithm that we use (URI)
- plus providing the value of the hash (string)

plinss · 2019-03-23T18:10:48Z

Please use the Subresource Integrity syntax. The last thing we need to add to the web platform is yet another way to compute, store, and parse hashes. Use the platform, use existing mechanisms rather than inventing new ones.

Also, just because your current use isn't thinking about security doesn't mean future uses won't. Adding weak hashes is doing a disservice to future users.

iherman · 2019-03-24T08:10:29Z

@HadrienGardeur

> Is there anything in schema.org that we could use for that?

I haven't found any... :-(

geoffjukes · 2019-04-11T20:37:21Z

I am fully on board with using the algo+hash syntax from the Subresource Integrity spec, per Dave's suggestion.

mattgarrish · 2019-04-15T21:46:55Z

One question looking at how to integrate this: given that we aren't restricted to an HTML attribute, how do we handle the ability to define multiple hash expressions for each resource? Do we:

1. restrict wpub to a single hash expression to keep things simple;
2. use spaces to delimit each to remain consistent with SRI; or
3. allow an array of values, where each value is one hash expression?

dauwhe · 2019-04-15T21:53:02Z

I think [1] is too limited. [2] has the advantage of being consistent with SRI:

"integrity": "sha384-dOTZf16X8p34q2/kYyEFm0jh89uTjikhnzjeLeF0FHsEaYKb1A1cv+Lyv4Hk8vHd sha512-Q2bFTOhEALkN8hOms2FKTDLy7eugP2zFZ1T8LCvX42Fp3WoNr3bjZSAHeOsHrbV1Fu9/A0EzCinRE7Af1ofPrw=="

Not sure about [2] vs [3].
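
For option [2], a consumer could split the value on whitespace and then split each token at its first hyphen. This is a rough sketch with a hypothetical function name, `parse_integrity`; it ignores the optional `?` options that the SRI grammar allows after a digest.

```python
def parse_integrity(value: str) -> list[tuple[str, str]]:
    """Split a space-delimited integrity value into (algorithm, digest) pairs."""
    pairs = []
    for expression in value.split():  # split() handles any run of whitespace
        algorithm, sep, digest = expression.partition("-")
        if not sep or not digest:
            continue  # skip malformed tokens rather than failing
        pairs.append((algorithm.lower(), digest))
    return pairs


value = (
    "sha384-dOTZf16X8p34q2/kYyEFm0jh89uTjikhnzjeLeF0FHsEaYKb1A1cv+Lyv4Hk8vHd "
    "sha512-Q2bFTOhEALkN8hOms2FKTDLy7eugP2zFZ1T8LCvX42Fp3WoNr3bjZSAHeOsHrbV1Fu9/A0EzCinRE7Af1ofPrw=="
)
pairs = parse_integrity(value)
```

Splitting at the first hyphen is safe here because the algorithm names themselves contain no hyphen.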

mattgarrish · 2019-04-15T21:58:51Z

> Not sure about [2] vs [3].

Ya, this is the particularly tricky thing to answer. We don't have to use whitespace to delimit, but SRI is defined with that expectation. It feels like we should seek input from that spec's authors.

dauwhe · 2019-04-15T22:01:33Z

I guess I lean towards [2] both because of consistency, and because it's way easier to type a space than to create an array in JSON. Consider users over authors over implementors over specifiers over theoretical purity ;)

plinss · 2019-04-15T22:20:38Z

I agree with @dauwhe's reasoning. In addition, if you use a json array, then you either have to always use an array (even for one value, which is likely to be the most common case, putting an additional burden on authors), or give users the burden of testing for string vs array values.

Keeping it entirely consistent with SRI also makes it easier to copy values between the manifest and an attribute should the need ever arise. It also allows the wpub manifest spec to simply refer to the SRI spec and avoid re-specifying something, which could introduce inconsistencies as each spec evolves.
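
The string-versus-array burden can be illustrated with a short sketch. It assumes a hypothetical manifest design in which "integrity" may be either a string or an array; the helper name `hash_expressions` is made up for this example.

```python
from typing import Union


def hash_expressions(value: Union[str, list]) -> list:
    """Return the hash expressions whether `value` is a string or an array."""
    if isinstance(value, str):
        # SRI-style value: expressions are separated by whitespace.
        return value.split()
    # Array form: keep only string items, ignoring anything else.
    return [item for item in value if isinstance(item, str)]
```

With the SRI string form alone, the type check disappears and `value.split()` is all a consumer needs.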

iherman · 2019-04-16T07:21:45Z

This issue was discussed in a meeting.

RESOLVED: add the optional integrity property for linked resources, using the subresource integrity format

file hashes
Wendy Reid: #398
Laurent Le Meur: we just need a name for the resource level property …
Wendy Reid: the issue is around file hashes, so content creators can provide identifiable hashes to individual resources
… the proposal is to use SRI
Ivan Herman: what term should we use
… this is not in schema, so we need to pick a term
Dave Cramer: Garth brought up the question of requirements on reading systems, it’s a problem in RSs, EPUB has signatures but RSs don’t always understand them
… if an integrity hash is present, the UA must check it and terminate processing if it does not pass
Brady Duga: hashes are great. If you want to pretend that these have anything to do with security or integrity I object.
… they do not provide this at all.
… they do not provide security.
Laurent Le Meur: I agree with the objection about security. I think it says something about integrity.
… I’m worried that some user agents might not be able to deal with any algorithm that is expressed
… is there a closed list of algorithms?
Dave Cramer: Can someone educate me as to why the SRI spec exists?
Ivan Herman: the big difference with SRI in HTML is that there it is mainly used for the JS you bring in when you use external JS
… I can’t really answer Brady’s concerns
… if what I get from a URL as JS has the same hash that I expected, then I can believe it’s the correct JS
… but it may be different for audio files
Garth Conboy: I was going to disagree with Dave. I have no objection to this, but don’t want user agents to have to deal with this.
Geoff Jukes: it doesn’t provide security or integrity. We use it to communicate to our apps that a file was downloaded completely.
… we just use it to detect bad downloads.
Wendy Reid: do we want to include this?
Ivan Herman: how important is this?
Geoff Jukes: our apps rely on this utterly. We deliver to cellphones. Not everyone has 5G. We have to deal with unreliable delivery. We’re OK with this in the spec and optional.
… we will use this
Wendy Reid: this sounds like something that a distributor/reading system can handle on its own
… perhaps we ask other distributors/UAs?
Ivan Herman: isn’t that the definition of an optional thing?
… we know someone uses it.
… is it important to have a standard format?
Proposed resolution: add the optional integrity property for linked resources, using the subresource integrity format (Ivan Herman)
Wendy Reid: let’s add it as optional
Wendy Reid: +1
Garth Conboy: +1
Brady Duga: +1
Geoff Jukes: +1
Laurent Le Meur: +1
Ivan Herman: +1
Bill Kasdorf: +1
Tzviya Siegman: +1 (i think)
Joshua Pyle: +1
Tim Cole: +1
Resolution #5: add the optional integrity property for linked resources, using the subresource integrity format
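
The download check described in the minutes could look roughly like the sketch below. It is not a normative algorithm: the function name `verify_download` is hypothetical, the preference order simply ranks longer digests first, and treating a value with no recognized algorithm as a pass is an assumption.

```python
import base64
import hashlib

# Strongest first; these are the three algorithms SRI requires user agents to support.
PREFERRED = ("sha512", "sha384", "sha256")


def verify_download(data: bytes, integrity: str) -> bool:
    """Return True if `data` matches the strongest supported hash in `integrity`."""
    by_algorithm = {}
    for expression in integrity.split():
        algorithm, _, digest = expression.partition("-")
        by_algorithm.setdefault(algorithm.lower(), []).append(digest)
    for algorithm in PREFERRED:
        if algorithm in by_algorithm:
            raw = hashlib.new(algorithm, data).digest()
            actual = base64.b64encode(raw).decode("ascii")
            return actual in by_algorithm[algorithm]
    # Assumption: with no recognizable hash there is nothing to check, so accept the data.
    return True
```

An app that receives `False` could discard the file and retry the download, which matches the "detect bad downloads" use case.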

iherman · 2019-04-16T07:26:12Z

@llemeurfr asked, during the meeting (quoting the minutes):

> Laurent Le Meur: … I’m worried that some user agents might not be able to deal with any algorithm that is expressed
> … is there a closed list of algorithms?

The SRI Recommendation says in section 3.2:

> Conformant user agents MUST support the SHA-256, SHA-384 and SHA-512 cryptographic hash functions for use as part of a request’s integrity metadata and may support additional hash functions.

Though we refer to SRI normatively, i.e., we inherit this list, it is probably worth calling this out in our document as well.

Cc @mattgarrish

mattgarrish · 2019-04-16T10:46:15Z

> Though we refer to SRI normatively, i.e., we inherit this list, it is probably worth calling this out in our document as well.

I'd prefer to avoid duplicating the requirement, if that's what you mean. We can refer across to the list, of course, but once we replicate the statement we put ourselves in the position of falling out of synch.

iherman · 2019-04-16T11:45:07Z

@mattgarrish I agree, and I did not mean to repeat the list. Just put a note in the text that there is such a list, with a reference to the Rec.
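
A manifest author could check whether an integrity value uses at least one of the algorithms on the SRI list. The sketch below assumes the three names from the SRI text quoted earlier in the thread; the function name `has_guaranteed_hash` is hypothetical.

```python
# Algorithms that SRI says conformant user agents must support.
REQUIRED_BY_SRI = {"sha256", "sha384", "sha512"}


def has_guaranteed_hash(integrity: str) -> bool:
    """True if at least one hash expression uses an algorithm on the SRI list."""
    algorithms = {expr.partition("-")[0].lower() for expr in integrity.split()}
    return bool(algorithms & REQUIRED_BY_SRI)
```

A value containing only an MD5 expression would return `False`, signalling that a user agent might not be able to verify it.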

wareid added topic:manifest topic:audio labels Jan 30, 2019

geoffjukes mentioned this issue Feb 5, 2019

Allow signing the components of a package w3c/pwpub#31

Closed

mattgarrish mentioned this issue Apr 16, 2019

add integrity property #425

Merged

wareid closed this as completed Apr 17, 2019
