Use w3c xml.xsd from Darwin Core repository #134

mdoering · 2016-11-18T08:35:44Z

The DwC text XML schema references the W3C xml.xsd from a GBIF server. It would be better to reference a copy in the DwC repository instead, so that the schema does not rely on the external GBIF URL http://rs.gbif.org/schema/xml.xsd

tucotuco · 2020-09-08T22:41:45Z

This would affect docs/text/index.md and https://dwc.tdwg.org/text/tdwg_dwc_text.xsd. I don't see a copy of xml.xsd on rs.tdwg.org. We'd need that to resolve this issue.

tucotuco · 2020-09-08T23:12:54Z

See also issue #124

baskaufs · 2020-09-09T13:13:15Z

If this is a critical file to the functioning of Darwin Core, then it should be considered to be part of the standard itself. If that is true, then I believe the appropriate course of action would be to treat it in the same manner as other standards documents: assign it an IRI in the rs.tdwg.org subdomain following the IRI pattern scheme for documents.

Using the analogous patterns to those of the TAPIR XML schema and ABCD XML schemas listed in the file that defines document redirects, I would recommend:

http://rs.tdwg.org/dwc/doc/xmlschema/

with a browserRedirectUrl of

https://dwc.tdwg.org/text/tdwg_dwc_text.xsd

The behavior of dereferencing this "permanent" IRI can be seen by dereferencing the TAPIR schema "permanent" IRI http://rs.tdwg.org/tapir/doc/xmlschema/ using cURL with Accept header of application/xml:

Request: http://rs.tdwg.org/tapir/doc/xmlschema/. Response: 303 redirect to http://rs.tdwg.org/tapir/doc/xmlschema.htm.
Request: http://rs.tdwg.org/tapir/doc/xmlschema.htm. Response: 302 redirect to http://tdwg.github.io/tapir/schema/tapir.xsd.
Request: http://tdwg.github.io/tapir/schema/tapir.xsd. Response: 200 with document sent as body with Content-type header of application/xml.

The first response is the appropriate Linked Data behavior -- if an RDF Content-type like text/turtle were requested instead, then machine-readable document metadata would be returned. The second response is an idiosyncrasy of how the server is programmed. If any Content-type other than one of the RDF serializations were requested, the request would be processed as if the request were for HTML. The server can generate hundreds of different HTML documents from the database by script, so in the case of most IRIs, this second server call will produce the desired document. Only if the document IRI pattern http://rs.tdwg.org/x/doc/x/ is found does it look for a redirect URL and perform the 302 redirect to the appropriate external URL.
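The three-hop chain described above can be traced with a short script. This is only a sketch: the function names `fetch_once` and `trace_redirects` are made up for illustration, and it assumes the server still behaves as described in this comment. Redirects are followed by hand so that every intermediate status code is recorded.

```python
import http.client
from urllib.parse import urljoin, urlsplit

REDIRECT_CODES = (301, 302, 303, 307, 308)


def fetch_once(url, accept="application/xml"):
    """Issue one GET without following redirects.

    Returns (status, Location header or None, Content-Type header or None).
    """
    parts = urlsplit(url)
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(parts.netloc, timeout=10)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("GET", path, headers={"Accept": accept})
        resp = conn.getresponse()
        resp.read()  # drain the body so the connection closes cleanly
        return resp.status, resp.getheader("Location"), resp.getheader("Content-Type")
    finally:
        conn.close()


def trace_redirects(url, fetch=fetch_once, max_hops=5):
    """Follow redirects manually and return a list of (url, status) hops."""
    hops = []
    for _ in range(max_hops):
        status, location, _content_type = fetch(url)
        hops.append((url, status))
        if status in REDIRECT_CODES and location:
            url = urljoin(url, location)  # Location may be relative
        else:
            break
    return hops
```

Calling `trace_redirects("http://rs.tdwg.org/tapir/doc/xmlschema/")` would be expected to list the 303, 302, and 200 hops given above, provided the server configuration has not changed. The `fetch` parameter is injectable so the logic can be exercised without network access.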

Although this extra HTTP GET is a bit kludgy, it seems to work fine and in the end produces the correct document with the correct Content-type. If it is a problem, I could fix the server script to do content negotiation correctly when application/xml is requested. I haven't bothered to do that yet because only four XML standards documents are currently being served, but if it's important, I could fix it.

I think that this general approach (using a standard IRI pattern and redirecting from a permanent IRI to a redirect URL) is the right one and what we should do from this point forward. We need to stop having people use idiosyncratic URLs that break every time we change delivery system and start getting people to use actual stable IRIs to access resources.

The other thing is that every file that is a critical part of a standard needs to be included in the metadata for that standard so that at least in theory a machine or human can determine all of the parts of a standard. We are not quite there yet, but following systematic patterns for IRIs of standards components is a piece of that. It also avoids us having to have a long list of custom redirects every time a critical file moves to some other place.

I can easily and quickly set up the entry in the file that defines document redirects using the IRI http://rs.tdwg.org/dwc/doc/xmlschema/ if this is an acceptable solution. It would technically be an "addition" of the schema to Darwin Core, but I think the Maintenance Group could just give the go-ahead since it effectively is an oversight that it wasn't already in the standard. Just let me know.

tucotuco · 2020-09-09T14:26:42Z

I wholeheartedly support the proposed solution. Others? @mdoering @peterdesmet @MattBlissett ?

peterdesmet · 2020-09-09T18:17:39Z

No objections

mdoering · 2020-09-09T22:11:32Z

I support @baskaufs's proposal. But it does not seem to address the original issue about the xml.xsd file?

https://dwc.tdwg.org/text/tdwg_dwc_text.xsd still imports xml.xsd from a gbif URL:

<xs:import namespace="http://www.w3.org/XML/1998/namespace" schemaLocation="http://rs.gbif.org/schema/xml.xsd">
  <xs:annotation>
    <xs:documentation>Get access to the xml: attribute groups for xml:lang</xs:documentation>
  </xs:annotation>
</xs:import>

baskaufs · 2020-09-10T01:02:26Z

If a "permanent" IRI is implemented, can we just change the schemaLocation attribute in the xsd?
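As a rough illustration of what changing the schemaLocation attribute would involve, the sketch below rewrites it on the xs:import for the XML namespace using Python's standard library. The function name `repoint_xml_import` is hypothetical, and any new location passed in is a placeholder; no replacement URL has been agreed in this thread.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def repoint_xml_import(xsd_text, new_location):
    """Return the schema text with the xml.xsd import pointed at new_location.

    Only xs:import elements whose namespace is the XML namespace are touched.
    """
    # Keep the conventional "xs" prefix when the tree is serialized again.
    ET.register_namespace("xs", XSD_NS)
    root = ET.fromstring(xsd_text)
    for imp in root.iter(f"{{{XSD_NS}}}import"):
        if imp.get("namespace") == XML_NS:
            imp.set("schemaLocation", new_location)
    return ET.tostring(root, encoding="unicode")
```

Note that ElementTree's default parser discards XML comments and may change whitespace and attribute quoting, so for the real schema a one-line manual edit of the attribute is probably the safer route.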

mdoering · 2020-09-10T08:15:36Z

Technically, I would strongly advise against using redirects. The IRI should return an HTTP 200, not a 3xx, because many default implementations, including Java's, fail to follow redirects when resolving schemas.

See https://stackoverflow.com/questions/29696638/how-to-validate-xml-with-schema-urls-that-return-http-301 or https://planet.jboss.org/post/java_7_xml_entity_resolver_doesn_t_follow_redirects_makes_xsd_validation_fail

baskaufs · 2020-09-10T11:19:46Z

That's good to know, @mdoering. It sounds like the best solution for the XML schemas is to set up the content negotiation part of the script to serve requests for XML files without redirects. Do you know if most clients actually send a request header of application/xml or text/xml, or do they just depend on the server to look at the .xml file extension? I suppose it would be safest to handle both.

I think this could be implemented relatively easily by keeping a list on GitHub of the locations where the few XML schemas are located, having the server script load that list, then pull the file from wherever it is and serve it to the client with a 200. I do something like that here to determine whether particular terms should redirect to a web page or if the script should generate a page from data. That would require loading two files from GitHub the first time each XML file was requested, but I think that @MattBlissett has the server set up to cache requests for some period of time. So after the first request, clients should just get the file from the cache and that should be pretty efficient. That approach would allow changing the location of the file by just changing an entry in a table in GitHub and not actually making any changes to the server itself.
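The approach described above could look roughly like the following sketch. Everything here is an assumption made for illustration: the class name `SchemaProxy`, the in-process cache (the comment instead relies on the server's existing request cache), and the hard-coded table, which in the proposal would be loaded from a list kept on GitHub. The single table entry reuses the IRI path and target URL suggested earlier in this thread.

```python
import time

# Hypothetical lookup table: request path -> where the schema file actually lives.
SCHEMA_LOCATIONS = {
    "/dwc/doc/xmlschema/": "https://dwc.tdwg.org/text/tdwg_dwc_text.xsd",
}


class SchemaProxy:
    """Serve schema files with a 200 by fetching them from their real location."""

    def __init__(self, locations, fetch, ttl_seconds=3600, clock=time.monotonic):
        self.locations = locations
        self.fetch = fetch      # callable: upstream URL -> bytes
        self.ttl = ttl_seconds
        self.clock = clock
        self._cache = {}        # path -> (expiry time, body)

    def handle(self, path):
        """Return (status, headers, body) for a request path; never a 3xx."""
        upstream = self.locations.get(path)
        if upstream is None:
            return 404, {}, b""
        now = self.clock()
        cached = self._cache.get(path)
        if cached is None or cached[0] <= now:
            # Cache miss or expired entry: pull the file from upstream.
            body = self.fetch(upstream)
            self._cache[path] = (now + self.ttl, body)
        else:
            body = cached[1]
        return 200, {"Content-Type": "application/xml"}, body
```

Moving a schema would then mean editing one entry in the table, with no change to the server code itself. A real `fetch` would need error handling for upstream failures, which this sketch leaves out.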

I don't think I have the bandwidth to deal with this now, so maybe it can be put off until I have time to work on the script and test carefully.

mdoering · 2020-09-10T13:00:33Z

Looks like this is what javax.xml.validation requests:

  ReqMethod      GET
  ReqURL         /eml.xsd
  ReqProtocol    HTTP/1.1
  ReqHeader      User-Agent: Java/13.0.1
  ReqHeader      Accept: text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2

Rather unspecific...

baskaufs · 2020-09-10T13:26:07Z

Wow. Not that great.

I should have said .xsd, not .xml in my previous comment. It looks like handling the extension is the way to go.
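A minimal sketch of that decision, assuming a hypothetical helper `wants_xml`: the file extension is checked first, and the Accept header is used only as a fallback, since clients such as the Java validator above send nothing XML-specific.

```python
from urllib.parse import urlsplit

XML_MEDIA_TYPES = {"application/xml", "text/xml"}


def wants_xml(request_target, accept_header=""):
    """Return True if the request should be answered with the XML schema."""
    path = urlsplit(request_target).path
    if path.lower().endswith(".xsd"):
        return True  # the extension alone is enough
    # Fall back to an explicit XML media type in the Accept header.
    for item in accept_header.split(","):
        media_type = item.split(";", 1)[0].strip().lower()
        if media_type in XML_MEDIA_TYPES:
            return True
    return False
```

With the Accept header captured above, a request for `/eml.xsd` is still recognized through its extension, while a request for an extensionless IRI with the same header is not.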

baskaufs · 2020-09-10T13:27:47Z

@tucotuco Maybe we can back-burner this for the time being.

tucotuco · 2020-09-10T16:12:37Z

I agree. I think the highest priority is to move forward requests for new terms and term changes, then web site fixes, then the rest.


MattBlissett · 2020-09-11T14:54:24Z

The original issue:

The DwC text xml schema references the W3C xml.xsd from a GBIF server.

The rs.tdwg.org server is about a metre from the rs.gbif.org server, and significantly further from GitHub's servers. Communication between them is very reliable, and external availability is likely to be the same.

Unless there's some other reason, I also agree this is very low priority.

tucotuco · 2021-08-08T00:12:03Z

Keeping at low priority and removing from current milestone in the interest of releasing the new terms.

tucotuco added the Format - XML label Sep 30, 2017

tucotuco added Docs - XML Guide task labels Sep 8, 2020

tucotuco mentioned this issue Sep 8, 2020

TDWG DwC Text schema references git head #124

Open

baskaufs self-assigned this Sep 10, 2020

tucotuco added this to the Post ratification Documentation Updates milestone Apr 30, 2021

tucotuco removed this from the Post ratification Documentation Updates milestone Aug 8, 2021
