update manifest processing to use infra types #103

mattgarrish · 2019-10-08T23:18:57Z

Opening this as a PR to get feedback, as I'm going cross-eyed working on this section. Beyond just switching terminology, I also tried to tighten up the processing to account for potentially other bits of invalid input, to simplify the duplication of language/direction steps, and to consolidate a number of validation checks that were all dependent on recursion.

The WebIDL section is moved into the processing section, and to de-emphasize the technology in the interim I changed the title to "Data Visualization" (but other ideas welcome).

Preview | Diff

remove commented-out bidi example

mattgarrish · 2019-10-09T11:19:57Z

One question: should the webidl section also be informative?

We're not actually requiring webidl from the processing, so it's more an example of the final data than a usable schema.

rdeltour

I made a cursory read and it looks like the right direction.

A couple comments:

one of my concern is that the output of the algorithm is too vaguely defined. It currently says it's the internal representation, but this has no tangible meaning. My suggestion is to explicitly say that for the purpose of the algorithm, the internal definition is expressed as an Infra map.
I think I understand the idea behind the "visualization" terminology, but I'm not sure I like it, at least as a normative section. If it's normative, then we have to normatively define how to convert from the Infra type (result of the algorithm) to this normative structure. If it's not normative, then let's move to an appendix.

If the idea is that we want a simple and terse way to represent the data structure, is it a matter of defining a notation for Infra types?
In that case, we could open an issue to Infra, and work with with Infra people to help define that.

rdeltour · 2019-10-09T11:32:30Z

One question: should the webidl section also be informative?

You beat me to it, I was writing my review 😊: yes in its current form I think it should be informative.

mattgarrish · 2019-10-09T11:32:56Z

If the idea is that we want a simple and terse way to represent the data structure, is it a matter of defining a notation for Infra types?

Yes, that would be ideal, if it can happen in time with our charter. Given our current options, however, webidl has proven the most stable and understandable way of expressing this info.

But I agree with webidl being informative if we keep it. It's not expressing the actual final representation.

index.html

iherman · 2019-10-09T11:45:27Z

Actually, I may revise my earlier comment in another thread, and would opt to move the WebIDL into an (informative) appendix. If we want to change later by either adding other formalism or even formalisms (see #98 (comment)), we can do that more easily... And with the predominance of infra, this would probably be a better place for it.

iherman · 2019-10-09T11:47:03Z

I must admit that, looking through the text, I am reinforced in my opinion that something like the WebIDL is necessary. For my taste the move towards infra did not make the spec more easy to read (maybe even on the contrary, but that may be personal), and such an extra aid is really a good thing to have.

mattgarrish · 2019-10-09T11:58:52Z

For my taste the move towards infra did not make the spec more easy to read

No, that's what I've been struggling with the last couple of days... ;)

But by the end I did find it to be more precise than the rather loose references to javascript and generic object/arrays that we were using. Plus there is the benefit that if devs are already reading this notation in a growing list of other specs, we're not expecting them to figure out something new.

(I couldn't make sense of the final data without a cleaner representation, either.)

rdeltour · 2019-10-09T12:00:52Z

As an attempt to define the data structure using Infra, here's what it could look like:

A localizable string is a struct with the following items:

value, which is a Javascript string
language, which is a Javascript string
direction, which is `ltr` or `rtl`

(…)

A publication manifest is a struct with the following items:

type, which is a list of Javascript strings
conformsTo, which is a a list of Javascript strings
id, which is a Javascript string
abridged, which is a boolean
…

It may not be as concise as a WebIDL dictionary, but is reasonably easy to read?

mattgarrish · 2019-10-09T12:03:26Z

* one of my concern is that the output of the algorithm is too vaguely defined. It currently says it's the _internal representation_, but this has no tangible meaning

The statements are somewhat split up. In the first para, yes, all we say is it defines the internal representation, but right before the note there is:

If successful, the algorithm returns a map containing the internal representation of the data.

These can probably be linked better to be clear that the internal representation is defined in terms of infra and results in an infra map, but you can use any language to implement the conversion (provided it can represent the infra data types, of course).

rdeltour · 2019-10-09T12:05:32Z

Note that in my Infra definition above the types are defined as structs rather than maps as currently used in the PR, which seem to be a better fit…

mattgarrish · 2019-10-09T12:17:19Z

the types are defined as structs rather than maps as currently used in the PR, which seem to be a better fit…

For expressing the structure, yes, but the conversion of JS to infra changes objects to maps, so the PR is correct.

rdeltour · 2019-10-09T12:35:12Z

the conversion of JS to infra changes objects to maps

Ah right, good point. I knew I had seen that somewhere.

rdeltour · 2019-10-09T12:36:48Z

the conversion of JS to infra changes objects to maps

However, maps can be used as the starting point and converted to structs by the processing algorithm… Anyways, it's probably bike shedding at this proof-of-concept stage 😊

mattgarrish · 2019-10-09T14:53:23Z

However, maps can be used as the starting point and converted to structs by the processing algorithm…

Could you do this easily? Don't structs come with a restricted set of members? The incoming data may not match the struct because of the extensibility of the manifest, so would it require a new struct that extends the base struct?

(WebIDL doesn't really capture this extensibility, either, I suppose, but if we're actually converting...)

Anyways, it's probably bike shedding at this proof-of-concept stage

Agreed.

fix some links referencing the wrong infra definitions; fix some phrasing to better reflect the infra spec

iherman · 2019-10-10T04:24:17Z

I am not sure what is happening, and I cannot locate it on the diff view above: It seems that part of 7.1 (setting the name to the value of <title> as a default) is missing toward the end. This is what I see on my screen (I checked it with two different browsers):

mattgarrish · 2019-10-10T10:46:03Z

It seems that part of 7.1 (setting the name to the value of <title> as a default) is missing toward the end.

No, I put a pre where there should be a code. It's saying to set name to a list with title as the only item:

Set processed["name"] to « title ».

rdeltour

Looks good! The requested changes are only the editorial stuff (missing back references). The other questions can be either dealt in this PR or moved to separate issues (to not block this PR).

rdeltour · 2019-10-11T07:42:20Z

index.html

+									<p>otherwise, if <var>normalized</var> is a <a
+											href="https://infra.spec.whatwg.org/#list">list</a>, perform this step on
+										each <a href="https://infra.spec.whatwg.org/#list-item">item</a> of the
+										list.</p>
+								</li>
+								<li>
+									<p>otherwise, if <var>normalized</var> is a <a
+											href="https://infra.spec.whatwg.org/#ordered-map">map</a>, perform this step
+										on each <a href="https://infra.spec.whatwg.org/#map-key">key</a> of the map.</p>


what is "this step" here? there should be a link to said step to clarify that.
is the idea to resolve all the URLs in the list? but then the term still expects a single URL, not a list?
I'm probably missing something.

No, looks like I pasted them into the wrong place at some point. Those two steps are supposed to come after the normalizing steps, as they're intended to recursively descend into normalized to run these operations on subproperties.

If we only go through the primary keys of the manifest, we'll miss normalizing properties inside LinkedResources, for example (or whatever an extension might defined).

rdeltour · 2019-10-11T07:55:19Z

index.html

+									<p>if <var>normalized</var> is a <a href="https://infra.spec.whatwg.org/#string"
+											>string</a> that is not an <a data-cite="!url/#absolute-url-string">absolute
+											URL string</a>, resolve it to an absolute URL string using the value of
+											<var>base</var>&#160;[[!rfc1808]].</p>


Note: this comment shouldn't block this PR, it can be moved to a separate issue.

this is tangential to the goal of this PR, but why referencing RFC 1808 here and not use the URL parsing algorithm from the URL standard? This algo takes a base URL as input to perform URL resolution.

Also btw, this step could result in a URL record, not a string. That would even simplify the logic down to:

if normalized is a string, set normalized to the result of running the URL parser with normalized as input and base url as the base URL.

Ya, I can't track down where 1808 came from, but since we use URL standard everywhere else, I don't see why we wouldn't use it for resolving.

rdeltour · 2019-10-11T08:01:32Z

index.html

@@ -3020,36 +2919,34 @@ <h2>Processing a Manifest</h2>
 						manifest</a>, when available.</li>


(commenting here has this specific text wasn't affected by the PR):

base: a URL string that represents the base URL for the manifest.

should link back to our definition of the manifest base URL.

rdeltour · 2019-10-11T08:09:59Z

index.html

+							<p>(<a href="#duration"></a>) If <var>processed["duration"]</var> is set and is not a valid
+								duration value, per&#160;[[!iso8601]], <a>validation error</a>, <a
+									href="https://infra.spec.whatwg.org/#map-remove">remove</a>
+								<var>processed["duration"]</var>.</p>
 						</li>

 						<li id="processing-checks-last-mod">
 							<p>(<a href="#last-modification-date"></a>) If <var>processed["dateModified"]</var> is set
-								and its value is not a valid date or date-time per&#160;[[!iso8601]], this is a
-									<a>validation error</a>.</p>
+								and is not a valid date or date-time per&#160;[[!iso8601]], <a>validation error</a>, <a
+									href="https://infra.spec.whatwg.org/#map-remove">remove</a>
+								<var>processed["dateModified"]</var>.</p>
 						</li>

 						<li id="processing-checks-pub-date">
-							<p>(<a href="#publication-date"></a>) If <var>processed["datePublished"]</var> is set and
-								its value is not a valid date or date-time per&#160;[[!iso8601]], this is a
-									<a>validation error</a>.</p>
+							<p>(<a href="#publication-date"></a>) If <var>processed["datePublished"]</var> is set and is
+								not a valid date or date-time per&#160;[[!iso8601]], <a>validation error</a>, <a
+									href="https://infra.spec.whatwg.org/#map-remove">remove</a>
+								<var>processed["datePublished"]</var>.</p>


Note: this comment shouldn't block this PR, it can be further discussed in #104 and —if it results in any change— dealt-with in another PR.

do we want to allow smart UA to recover from these? Some might be able to fix an invalid date string, or convert a date in natural language to the ISO format.

One possibility could be to define a generic step like "convert input to the expected type" as a UA-defined step, and either return that or remove.

I'm not sure; what do you think?

Specifically for dates this may make sense (I am a little bit afraid we create a precedence, but maybe we are fine), but the step you propose can be read as if the UA MUST try to recover. This may be a step too far (let alone the fact that we would have to test this, and I am not sure how many different strings there are that can be interpreted correctly).

"... is set and is not a valid date or date-time per [[!iso8601]], validation error. If the User Agent is able to interpret the value as a valid date or date-time, replace the value with that generated value; otherwise remove it."

Or something like that, converted to infra-speak...

Doesn't this lead to interop problems if we can't define how to fix the invalid input? There are a whole host of reasons why the input could be invalid, from the data type being wrong to the date string not conforming.

Yeah... what is worse, it could lead to wrong interpretations. E.g., what date is meant by "01/02/03"? 1st of February 2003 (in the UK), 2nd of January 2001 (in the US) and, I believe, 3rd of February in 2001 in Japan (or in Hungary, b.t.w.).

Maybe we should keep it as is after all.

but the step you propose can be read as if the UA MUST try to recover.

sorry I wasn't clear. I was suggesting a MAY at best 🙂

Doesn't this lead to interop problems if we can't define how to fix the invalid input?

yeah, that's why I'm not sure.It's not so uncommon to have implementation-defined behavior/processing. There's not a huge risk for interoperability I think, given the nature of the metadata. At worst there will be cases like users not getting their publications sorted in the same order on various UAs if they sort it by date, maybe this isn't too bad? Again, these are open questions, I don't have a strong opinion on this.

it could lead to wrong interpretations

Right. I think that given a language and a date string, a lot can be done (e.g. "Dec 2019", "01/02/03" are not unambiguous in US English). I'm more worried of implementation starting using a default date instead of removing the value if they can't make sense of the authored date.

As always it would be interesting to see if "invalid" dates are commonly found in today's EPUB's for instance, to have an idea whether this would be useful at all.

In any case, keeping it as is may be safer for a first version?

In any case, keeping it as is may be safer for a first version?

Yes

move alternate to a validation check; add default for conformsto

mattgarrish · 2019-10-11T19:34:13Z

I just pushed a change to unlink the webidl from the property definitions, but could be convinced to undo this.

Given that we've made the idl informative and moved to an appendix, the bidirectional linking is confusing.

mattgarrish · 2019-10-11T19:41:07Z

In addition to @rdeltour's comments, I also added the default profile URL from #101 (I still don't quite understand how this will work, but we can tweak that).

Also, per w3c/audiobooks#27, I removed alternate from the step to expand LinkedResources and put it into the validation steps to make sure that url and encodingFormat are always set.

The only way we could allow alternate resources to be strings is if we get into inferring media types from extensions. It may be more complicated than that, though, as we'd have to lower the requirement for encodingFormat to a should (i.e., we'd have to bless media type guessing).

We could permit guessing to avoid alternate resources being thrown out, but like the other issue with dates it gets into how much work we want to push onto user agents.

mattgarrish · 2019-10-11T19:45:56Z

Hm, I should have double-checked the definition. I was thinking encodingFormat was required except for HTML, but now I see it's just:

Non-HTML resources SHOULD be expressed as LinkedResource objects with their encodingFormat values set.

So guess we can do both. Expand strings and also validate that encodingFormat is set (but not omit if not).

iherman · 2019-10-12T08:12:34Z

I just pushed a change to unlink the webidl from the property definitions, but could be convinced to undo this.

Given that we've made the idl informative and moved to an appendix, the bidirectional linking is confusing.

Indeed. However, there is a strange effect now (respec is getting in our way): in, say, A.1.1., the terms themselves (like abridged) are active links which link to, I believe, themselves, ie, not back into the text. Which is a bit strange.

I wonder whether we should not get rid of respec's WebIDL processing here...

Also: you have a subsection A.1 for WebIDL. Does it mean that we would want to put other formalisms as "A.2"?

iherman · 2019-10-12T08:13:46Z

Also, per w3c/audiobooks#27, I removed alternate from the step to expand LinkedResources and put it into the validation steps to make sure that url and encodingFormat are always set.

Is this related to #105 ? My feeling is that there should be no exception for audiobooks, in fact...

mattgarrish · 2019-10-12T12:51:01Z

I wonder whether we should not get rid of respec's WebIDL processing here...

That works for me.

Does it mean that we would want to put other formalisms as "A.2"?

Leaving open that possibility, yes. We can always revise at the end of CR if we still only have webidl to show.

mattgarrish · 2019-10-12T13:17:07Z

My feeling is that there should be no exception for audiobooks, in fact...

Ya, I screwed that up by not double-checking what I thought I knew. I put back the expansion of strings, but I'm still confused about why HTML resources get a pass from encodingType. As I just mentioned in #105, either warnings for all or no warnings.

iherman · 2019-10-12T14:32:04Z

Leaving open that possibility, yes. We can always revise at the end of CR if we still only have webidl to show.

As a new convert to Typescript I believe it would be useful:-) but I am not sure it would bring anything new, so it may be unnecessary.

add check that conformsto is a list; add variable for storing profile; fix recursion steps in normalization

…as the later recursive steps already handle this

…ck and the value property of each object in the name array needs to be inspected, too; merges the validation checks for linkedresources as there is no need to have them separate; some minor cleanup of variable names

iherman · 2019-10-14T17:26:56Z

This issue was discussed in a meeting.

No actions or resolutions

View the transcript

update manifest processing to use infra types
Garth Conboy: See Issue #101
Garth Conboy: See Issue #98
Ivan Herman: See PR #103
Garth Conboy: There are 2 issues in this pull request - this is also one that Ivan and Matt have…
… Does anyone on the call want to run through this and hope to get to a consensus?
Ivan Herman: The work was done by Matt.
… I would prefer Romain to comment on this as he was the one initiating this whole work and can give a more appropriate background.
Romain Deltour: Basically it originated from trying to get rid of the webIGL as a way to describe a datastructure. In trying to harmonize the various manifests in the W3C, we wanted to use infra-types to represent the data structure. Matt’s editing the spec to replace webIGL for infra types.
Ivan Herman: See Infra standard
Romain Deltour: The WebIDL representative is moved to informative edits. A few different issues were raised when we looked at the algorithm. I think it’s clearer now, but we need more reviews there. It will be interesting to have more.
Garth Conboy: Looking at the PR, Romain you have one requested change outstanding and Ivan has approved it.
Ivan Herman: I am now repeating what Dave said earlier on the other PR. This was a pretty large change that made the document better and Matt did great work. Now it becomes difficult to discuss other issues and details with this as a PR.
Romain Deltour: I think my requests have been addressed in Matt’s latest changes
Ivan Herman: I would be in favor of merging this ASAP (as soon as Matt is comfortable) then we can comment and modify once it’s in. Again, this is something he should do…
Romain Deltour: +1 on merging and doing more reviews/changes based on Matt’s work
Garth Conboy: Sounds good for me. If there are subsequent changes, they will be smaller. Barring objection, why don’t we go ahead and put that as a request for Matt…

update manifest processing to use infra types;

e1532a4

remove commented-out bidi example

mattgarrish requested review from iherman and rdeltour October 8, 2019 23:18

rdeltour reviewed Oct 9, 2019

View reviewed changes

iherman requested changes Oct 9, 2019

View reviewed changes

mattgarrish added 2 commits October 9, 2019 20:19

updates per comments to the PR

31ff500

prune invalid properties/values as requested in #104;

a5279d1

fix some links referencing the wrong infra definitions; fix some phrasing to better reflect the infra spec

additional fixes per recent comments

12edcc6

mattgarrish mentioned this pull request Oct 10, 2019

Handling of validation errors #104

Closed

iherman approved these changes Oct 10, 2019

View reviewed changes

rdeltour requested changes Oct 11, 2019

View reviewed changes

mattgarrish added 2 commits October 11, 2019 15:54

fixes per romain's comments;

1fd3675

move alternate to a validation check; add default for conformsto

unlink webidl

9c6f47e

undo removing alternate string conversion

5f28264

iherman mentioned this pull request Oct 12, 2019

What are "html resources"? #105

Closed

mattgarrish added 6 commits October 12, 2019 11:51

remove idl classes;

ad94e1b

add check that conformsto is a list; add variable for storing profile; fix recursion steps in normalization

fix links

b7e1c0a

fix audiobook profile

34e1b49

fix id in css

d2cc718

remove step e to descend into maps/lists to find localizable strings …

5b85662

…as the later recursive steps already handle this

This was referenced Oct 13, 2019

value type check complicated and draconian #109

Closed

Lack of conformsTo too draconian? #101

Closed

WebIDL, yea or nay? #98

Closed

mattgarrish merged commit 07f264d into master Oct 14, 2019

mattgarrish deleted the editorial/infra-compat branch October 14, 2019 21:01

This was referenced Oct 15, 2019

require extension steps be identified by profile? #107

Closed

conformsTo single and/or array of literals #106

Closed

		@@ -3020,36 +2919,34 @@ <h2>Processing a Manifest</h2>
		manifest</a>, when available.</li>

update manifest processing to use infra types #103

update manifest processing to use infra types #103

Conversation

mattgarrish commented Oct 8, 2019 • edited by pr-preview bot

mattgarrish commented Oct 9, 2019 • edited

rdeltour left a comment

Choose a reason for hiding this comment

rdeltour commented Oct 9, 2019

mattgarrish commented Oct 9, 2019

iherman commented Oct 9, 2019

iherman commented Oct 9, 2019

mattgarrish commented Oct 9, 2019

rdeltour commented Oct 9, 2019 • edited

mattgarrish commented Oct 9, 2019

rdeltour commented Oct 9, 2019

mattgarrish commented Oct 9, 2019

rdeltour commented Oct 9, 2019

rdeltour commented Oct 9, 2019

mattgarrish commented Oct 9, 2019

iherman commented Oct 10, 2019

mattgarrish commented Oct 10, 2019

rdeltour left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mattgarrish commented Oct 11, 2019

mattgarrish commented Oct 11, 2019 • edited

mattgarrish commented Oct 11, 2019

iherman commented Oct 12, 2019

iherman commented Oct 12, 2019

mattgarrish commented Oct 12, 2019

mattgarrish commented Oct 12, 2019

iherman commented Oct 12, 2019

iherman commented Oct 14, 2019

mattgarrish commented Oct 8, 2019 •

edited by pr-preview bot

mattgarrish commented Oct 9, 2019 •

edited

rdeltour commented Oct 9, 2019 •

edited

mattgarrish commented Oct 11, 2019 •

edited