File hashes (MD5) #398

Closed
geoffjukes opened this issue Jan 30, 2019 · 19 comments

Comments

@geoffjukes

Hi,

We use a standalone manifest file to describe our audiobooks, which includes MD5s for all file assets (audio files, artwork, and any supplemental material). The audiobook player uses the MD5 to check downloaded assets, and for asset update/invalidation.

Any spec that we adopt must at least allow these hashes somewhere in the dataset.

Geoff

@GarthConboy
Contributor

Particularly wedded to MD5, or something more modern/secure okay? SHA-2?

@dauwhe
Contributor

dauwhe commented Jan 30, 2019

I wonder if we can align with the W3C's Subresource Integrity specification. Perhaps we add an "integrity" member to the manifest? And then use algo+hash as in the spec?

{
    "type": "LinkedResource",
    "url": "fonts/STIXGeneral.otf",
    "encodingFormat": "application/vnd.ms-opentype",
    "integrity": "sha384-dOTZf16X8p34q2/kYyEFm0jh89uTjikhnzjeLeF0FHsEaYKb1A1cv+Lyv4Hk8vHd"
}
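
A minimal sketch of how such an "integrity" value could be produced, assuming Python's standard hashlib and base64 modules. The function name sri_value and the sample bytes are hypothetical; only the algorithm-plus-base64-digest shape comes from the Subresource Integrity syntax.

```python
import base64
import hashlib


def sri_value(data: bytes, algorithm: str = "sha384") -> str:
    """Return an SRI-style 'algorithm-base64digest' string for the given bytes."""
    digest = hashlib.new(algorithm, data).digest()
    return f"{algorithm}-{base64.b64encode(digest).decode('ascii')}"


# Hypothetical usage: in practice the bytes would be read from the font file.
resource = {
    "type": "LinkedResource",
    "url": "fonts/STIXGeneral.otf",
    "integrity": sri_value(b"example font bytes"),
}
```

Because the algorithm name travels with the digest, a publisher could switch algorithms later without changing the manifest structure.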

@iherman
Member

iherman commented Jan 30, 2019

Can you give some details? Do you store a hash for each resource separately, or one global hash (some sort of a merkle tree)?

@iherman
Member

iherman commented Jan 30, 2019

B.t.w., this should probably be a 'core' manifest feature and not audio specific.

@dauwhe
Contributor

dauwhe commented Jan 30, 2019

Can you give some details? Do you store a hash for each resource separately, or one global hash (some sort of a merkle tree)?

See https://github.com/blackstoneaudio/audiobook-spec/blob/master/draft.yaml

@geoffjukes
Author

geoffjukes commented Jan 30, 2019

@GarthConboy We use MD5s because they are cheap to compute, and they are used for download validation, not security. They serve the same purpose as ETag values in S3 or equivalent.

We use the word 'md5' as the key, but anything equivalent (hash, checksum, etc) would be fine, as long as we know what it is. For us the string value would be an MD5. For others it could be anything.

@dauwhe That seems very reasonable to me.

@iherman We store a hash for each resource separately; see https://github.com/blackstoneaudio/audiobook-spec/blob/cfd468bb27b890b0e4a59a3345e806221a702fce/draft.json#L59

We also store a 'hash of hashes', which we use as a sort of 'version'; see https://github.com/blackstoneaudio/audiobook-spec/blob/cfd468bb27b890b0e4a59a3345e806221a702fce/draft.json#L11
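
A rough sketch of the per-resource hash plus 'hash of hashes' idea, assuming Python's hashlib. How the hashes are actually combined is not stated in this thread, so sorting by URL and concatenating the hex digests is purely an assumption, and the names md5_hex and hash_of_hashes are hypothetical.

```python
import hashlib
from typing import Mapping


def md5_hex(data: bytes) -> str:
    """Per-resource checksum, used for download validation rather than security."""
    return hashlib.md5(data).hexdigest()


def hash_of_hashes(resource_hashes: Mapping[str, str]) -> str:
    """Combine per-resource hashes into one value usable as a 'version'.

    Assumption: hashes are joined in sorted-URL order so that the result
    does not depend on the order in which resources were listed.
    """
    joined = "".join(resource_hashes[url] for url in sorted(resource_hashes))
    return hashlib.md5(joined.encode("ascii")).hexdigest()


# Hypothetical assets; real values would come from the audio and artwork files.
hashes = {
    "audio/ch01.mp3": md5_hex(b"chapter one"),
    "cover.jpg": md5_hex(b"cover art"),
}
version = hash_of_hashes(hashes)
```

Under this scheme, changing any single asset changes the combined value, which is what would make it usable for update and invalidation checks.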

@HadrienGardeur

Is there anything in schema.org that we could use for that?

I don't think that we should be tied to any specific algorithm, which potentially means:

  • identifying the algorithm that we use (URI)
  • plus providing the value of the hash (string)

@plinss
Member

plinss commented Mar 23, 2019

Please use the Subresource Integrity syntax. The last thing we need to add to the web platform is yet another way to compute, store, and parse hashes. Use the platform, use existing mechanisms rather than inventing new ones.

Also, just because your current use isn't thinking about security doesn't mean future uses won't. Adding weak hashes is doing a disservice to future users.

@iherman
Member

iherman commented Mar 24, 2019

@HadrienGardeur

Is there anything in schema.org that we could use for that?

I haven't found any... :-(

@geoffjukes
Author

I am fully on board with using the algo+hash syntax from the Subresource Integrity spec, per Dave's suggestion.

@mattgarrish
Member

One question looking at how to integrate this: given that we aren't restricted to an HTML attribute, how do we handle the ability to define multiple hash expressions for each resource? Do we:

  1. restrict wpub to a single hash expression to keep things simple;
  2. use spaces to delimit each to remain consistent with SRI; or
  3. allow an array of values, where each value is one hash expression?

@dauwhe
Contributor

dauwhe commented Apr 15, 2019

I think [1] is too limited. [2] has the advantage of being consistent with SRI:

"integrity": "sha384-dOTZf16X8p34q2/kYyEFm0jh89uTjikhnzjeLeF0FHsEaYKb1A1cv+Lyv4Hk8vHd sha512-Q2bFTOhEALkN8hOms2FKTDLy7eugP2zFZ1T8LCvX42Fp3WoNr3bjZSAHeOsHrbV1Fu9/A0EzCinRE7Af1ofPrw=="

Not sure about [2] vs [3].
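
For illustration, a sketch of how a consumer might read the space-delimited form of option [2]. It assumes plain Python with no external libraries; parse_integrity is a hypothetical helper, and the optional "?options" part of the SRI grammar is deliberately ignored here.

```python
from typing import List, Tuple


def parse_integrity(value: str) -> List[Tuple[str, str]]:
    """Split an SRI-style string into (algorithm, base64 digest) pairs."""
    expressions = []
    for token in value.split():  # any run of whitespace separates expressions
        algorithm, separator, digest = token.partition("-")
        if not separator or not digest:
            continue  # skip malformed tokens rather than failing outright
        expressions.append((algorithm.lower(), digest))
    return expressions


integrity = (
    "sha384-dOTZf16X8p34q2/kYyEFm0jh89uTjikhnzjeLeF0FHsEaYKb1A1cv+Lyv4Hk8vHd "
    "sha512-Q2bFTOhEALkN8hOms2FKTDLy7eugP2zFZ1T8LCvX42Fp3WoNr3bjZSAHeOsHrbV1Fu9/A0EzCinRE7Af1ofPrw=="
)
pairs = parse_integrity(integrity)
```

Splitting on the first hyphen only keeps digests intact even if they contain "-" characters of their own.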

@mattgarrish
Member

Not sure about [2] vs [3].

Ya, this is the particularly tricky thing to answer. We don't have to use whitespace to delimit, but SRI is defined with that expectation. It feels like we should seek input from that spec's authors.

@dauwhe
Contributor

dauwhe commented Apr 15, 2019

I guess I lean towards [2] both because of consistency, and because it's way easier to type a space than create an array in JSON. Consider users over authors over implementors over specifiers over theoretical purity ;)

@plinss
Member

plinss commented Apr 15, 2019

I agree with @dauwhe's reasoning. In addition, if you use a json array, then you either have to always use an array (even for one value, which is likely to be the most common case, putting an additional burden on authors), or give users the burden of testing for string vs array values.

Keeping it entirely consistent with SRI also makes it easier to copy values between the manifest and an attribute should the need ever arise. It also allows the wpub manifest spec to simply refer to the SRI spec and avoid re-specifying something potentially introducing inconsistencies as each spec evolves.
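
To make the string-versus-array burden concrete, here is a sketch of the normalization step every consumer would need if the manifest allowed both forms. The helper name integrity_expressions is hypothetical, and allowing both forms is an assumption made only for illustration.

```python
from typing import Any, List


def integrity_expressions(value: Any) -> List[str]:
    """Normalize an 'integrity' value that may be a string or an array of strings."""
    if isinstance(value, str):
        # SRI-style form: one string, expressions separated by whitespace.
        return value.split()
    if isinstance(value, list) and all(isinstance(item, str) for item in value):
        # Array form: one hash expression per element.
        return list(value)
    raise TypeError("integrity must be a string or an array of strings")
```

With the SRI-consistent string form alone, the whole function reduces to value.split().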

@iherman
Member

iherman commented Apr 16, 2019

This issue was discussed in a meeting.

  • RESOLVED: add the optional integrity property for linked resources, using the subresource integrity format
Transcript: file hashes
Wendy Reid: #398
Laurent Le Meur: we just need a name for the resource level property …
Wendy Reid: the issue is around file hashes, so content creators can provide identifiable hashes to individual resources
… the proposal is to use SRI
Ivan Herman: what term should we use
… this is not in schema, so we need to pick a term
Dave Cramer: Garth brought up the question of requirements on reading systems, it’s a problem in RSs, EPUB has signatures but RSs don’t always understand them
… if an integrity hash is present, the UA must check it and terminate processing if it does not pass
Brady Duga: hashes are great. If you want to pretend that these have anything to do with security or integrity I object.
… they do not provide this at all.
… they do not provide security.
Laurent Le Meur: I agree with the objection about security. I think it says something about integrity.
… I’m worried that some user agents might not be able to deal with any algorithm that is expressed
… is there a closed list of algorithms?
Dave Cramer: Can someone educate me as to why the SRI spec exists?
Ivan Herman: the big difference with SRI in HTML is that there it is mainly used for the external JS you bring in
… I can’t really answer Brady’s concerns
… if what I get from a URL as JS has the same hash that I expected, then I can believe it’s the correct JS
… but it may be different for audio files
Garth Conboy: I was going to disagree with Dave. I have no objection to this, but don’t want user agents to have to deal with this.
Geoff Jukes: it doesn’t provide security or integrity. We use it to communicate to our apps that a file was downloaded completely.
… we just use it to detect bad downloads.
Wendy Reid: do we want to include this?
Ivan Herman: how important is this?
Geoff Jukes: our apps rely on this utterly. We deliver to cellphones. Not everyone has 5G. We have to deal with unreliable delivery. We’re OK with this in the spec and optional.
… we will use this
Wendy Reid: this sounds like something that a distributor/reading system can handle on its own
… perhaps we ask other distributors/UAs?
Ivan Herman: isn’t that the definition of an optional thing?
… we know someone uses it.
… is it important to have a standard format?
Proposed resolution: add the optional integrity property for linked resources, using the subresource integrity format (Ivan Herman)
Wendy Reid: let’s add it as optional
Wendy Reid: +1
Garth Conboy: +1
Brady Duga: +1
Geoff Jukes: +1
Laurent Le Meur: +1
Ivan Herman: +1
Bill Kasdorf: +1
Tzviya Siegman: +1 (I think)
Joshua Pyle: +1
Tim Cole: +1
Resolution #5: add the optional integrity property for linked resources, using the subresource integrity format
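
As a sketch of the "detect bad downloads" use discussed in the meeting, the check below compares downloaded bytes against an SRI-style integrity string. It assumes Python's hashlib and base64; download_is_intact is a hypothetical name, and treating a value with no recognized algorithm as "nothing to check" is an assumption, not something this resolution specifies.

```python
import base64
import hashlib

# The three algorithms SRI requires, ordered weakest to strongest.
SUPPORTED = ("sha256", "sha384", "sha512")


def download_is_intact(data: bytes, integrity: str) -> bool:
    """Return True if data matches the strongest supported hash expression."""
    candidates = {}
    for token in integrity.split():
        algorithm, _, expected = token.partition("-")
        if algorithm in SUPPORTED and expected:
            # Drop any '?options' suffix allowed by the SRI grammar.
            candidates.setdefault(algorithm, []).append(expected.split("?")[0])
    if not candidates:
        return True  # assumption: no usable expression means no check is applied
    strongest = max(candidates, key=SUPPORTED.index)
    actual = base64.b64encode(hashlib.new(strongest, data).digest()).decode("ascii")
    return actual in candidates[strongest]
```

A reading system could call this after each download and retry the transfer when it returns False. As noted in the discussion, a match only shows that the bytes arrived as described by the manifest; it says nothing about whether the manifest itself can be trusted.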

@iherman
Member

iherman commented Apr 16, 2019

@llemeurfr asked, during the meeting (quoting the minutes):

Laurent Le Meur: … I’m worried that some user agents might not be able to deal with any algorithm that is expressed
… is there a closed list of algorithms?

The SRI Recommendation says in section 3.2:

Conformant user agents MUST support the SHA-256, SHA-384 and SHA-512 cryptographic hash functions for use as part of a request’s integrity metadata and may support additional hash functions.

Though we refer to SRI normatively, i.e., we inherit this list, it is probably worth calling this out in our document as well.

Cc @mattgarrish

@mattgarrish
Member

Though we refer to SRI normatively, i.e., we inherit this list, it is probably worth calling this out in our document as well.

I'd prefer to avoid duplicating the requirement, if that's what you mean. We can refer across to the list, of course, but once we replicate the statement we put ourselves in the position of falling out of synch.

@iherman
Member

iherman commented Apr 16, 2019

@mattgarrish I agree, and I did not mean to repeat the list. Just put a note in the text that there is such a list, with a reference to the Rec.
