Use w3c xml.xsd from Darwin Core repository #134

Open
mdoering opened this issue Nov 18, 2016 · 15 comments

@mdoering
Contributor

The DwC text XML schema references the W3C xml.xsd from a GBIF server. It would be better to reference a copy in the DwC repository, so the schema does not rely on the external GBIF URL http://rs.gbif.org/schema/xml.xsd

@tucotuco
Member

tucotuco commented Sep 8, 2020

This would affect docs/text/index.md and https://dwc.tdwg.org/text/tdwg_dwc_text.xsd. I don't see a copy of xml.xsd on rs.tdwg.org. We'd need that to resolve this issue.

@tucotuco
Member

tucotuco commented Sep 8, 2020

See also issue #124

@baskaufs

baskaufs commented Sep 9, 2020

If this is a critical file to the functioning of Darwin Core, then it should be considered to be part of the standard itself. If that is true, then I believe the appropriate course of action would be to treat it in the same manner as other standards documents: assign it an IRI in the rs.tdwg.org subdomain following the IRI pattern scheme for documents.

Using the analogous patterns to those of the TAPIR XML schema and ABCD XML schemas listed in the file that defines document redirects, I would recommend:

http://rs.tdwg.org/dwc/doc/xmlschema/

with a browserRedirectUrl of

https://dwc.tdwg.org/text/tdwg_dwc_text.xsd

The behavior of dereferencing this "permanent" IRI can be seen by dereferencing the TAPIR schema "permanent" IRI http://rs.tdwg.org/tapir/doc/xmlschema/ using cURL with Accept header of application/xml:

  1. Request: http://rs.tdwg.org/tapir/doc/xmlschema/. Response: 303 redirect to http://rs.tdwg.org/tapir/doc/xmlschema.htm.
  2. Request: http://rs.tdwg.org/tapir/doc/xmlschema.htm. Response: 302 redirect to http://tdwg.github.io/tapir/schema/tapir.xsd.
  3. Request: http://tdwg.github.io/tapir/schema/tapir.xsd. Response: 200 with document sent as body with Content-type header of application/xml.
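
The three hops above can be reproduced with a small redirect tracer. This is only a sketch: `trace_redirects`, `canned_fetch` and the `CANNED` table are names made up for illustration, and the responses are copied from the list above rather than fetched live, so it runs without network access.

```python
from typing import Callable, List, Optional, Tuple
from urllib.parse import urljoin

# fetch(url, accept) -> (status code, Location header or None)
Fetch = Callable[[str, str], Tuple[int, Optional[str]]]

REDIRECT_CODES = (301, 302, 303, 307, 308)


def trace_redirects(url: str, accept: str, fetch: Fetch,
                    max_hops: int = 10) -> List[Tuple[str, int]]:
    """Follow 3xx responses by hand, recording (url, status) for every hop."""
    chain = []
    for _ in range(max_hops):
        status, location = fetch(url, accept)
        chain.append((url, status))
        if status not in REDIRECT_CODES or location is None:
            break
        url = urljoin(url, location)  # the Location header may be relative
    return chain


# Canned responses taken from the three steps listed above (assumption:
# the live server still behaves this way).
CANNED = {
    "http://rs.tdwg.org/tapir/doc/xmlschema/":
        (303, "http://rs.tdwg.org/tapir/doc/xmlschema.htm"),
    "http://rs.tdwg.org/tapir/doc/xmlschema.htm":
        (302, "http://tdwg.github.io/tapir/schema/tapir.xsd"),
    "http://tdwg.github.io/tapir/schema/tapir.xsd":
        (200, None),
}


def canned_fetch(url: str, accept: str) -> Tuple[int, Optional[str]]:
    """Stand-in for a real HTTP GET; ignores the Accept header."""
    return CANNED[url]


chain = trace_redirects("http://rs.tdwg.org/tapir/doc/xmlschema/",
                        "application/xml", canned_fetch)
for hop_url, status in chain:
    print(status, hop_url)
```

To test against the live server, `canned_fetch` could be swapped for a function built on `http.client`, which does not follow redirects on its own.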

The first response is the appropriate Linked Data behavior -- if an RDF Content-type like text/turtle were requested instead, then machine-readable document metadata would be returned. The second response is an idiosyncrasy of how the server is programmed. If any Content-type other than one of the RDF serializations were requested, the request would be processed as if the request were for HTML. The server can generate hundreds of different HTML documents from the database by script, so in the case of most IRIs, this second server call will produce the desired document. Only if the document IRI pattern http://rs.tdwg.org/x/doc/x/ is found does it look for a redirect URL and perform the 302 redirect to the appropriate external URL.

Although this extra HTTP GET is a bit kludgy, it seems to work fine and in the end produces the correct document with the correct Content-type. If it is a problem, I could fix the server script to do content negotiation correctly when application/xml is requested. I haven't bothered to do that yet because only four XML standards documents are currently being served, but if it's important, I could fix it.

I think that this general approach (using a standard IRI pattern and redirecting from a permanent IRI to a redirect URL) is the right one and what we should do from this point forward. We need to stop having people use idiosyncratic URLs that break every time we change delivery system and start getting people to use actual stable IRIs to access resources.

The other thing is that every file that is a critical part of a standard needs to be included in the metadata for that standard so that at least in theory a machine or human can determine all of the parts of a standard. We are not quite there yet, but following systematic patterns for IRIs of standards components is a piece of that. It also avoids us having to have a long list of custom redirects every time a critical file moves to some other place.

I can easily and quickly set up the entry in the file that defines document redirects using the IRI http://rs.tdwg.org/dwc/doc/xmlschema/ if this is an acceptable solution. It would technically be an "addition" of the schema to Darwin Core, but I think the Maintenance Group could just give the go-ahead since it effectively is an oversight that it wasn't already in the standard. Just let me know.

@tucotuco
Member

tucotuco commented Sep 9, 2020

I wholeheartedly support the proposed solution. Others? @mdoering @peterdesmet @MattBlissett ?

@peterdesmet
Member

No objections

@mdoering
Contributor Author

mdoering commented Sep 9, 2020

I support @baskaufs's proposal, but it does not seem to address the original issue about the xml.xsd file.

https://dwc.tdwg.org/text/tdwg_dwc_text.xsd still imports xml.xsd from a gbif URL:

<xs:import namespace="http://www.w3.org/XML/1998/namespace" schemaLocation="http://rs.gbif.org/schema/xml.xsd">
  <xs:annotation>
    <xs:documentation>Get access to the xml: attribute groups for xml:lang</xs:documentation>
  </xs:annotation>
</xs:import>

@baskaufs

If a "permanent" IRI is implemented, can we just change the schemaLocation attribute in the xsd?
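
As a rough illustration of what changing that attribute would involve, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The function name `repoint_xml_import`, the cut-down `SAMPLE` schema and the `NEW_LOCATION` URL are all hypothetical; no replacement location for xml.xsd has been agreed on in this thread.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def repoint_xml_import(xsd_text: str, new_location: str) -> str:
    """Return the schema with the xml: namespace import pointed at new_location."""
    ET.register_namespace("xs", XS)  # keep the xs: prefix when serializing
    root = ET.fromstring(xsd_text)
    for imp in root.iter(f"{{{XS}}}import"):
        if imp.get("namespace") == XML_NS:
            imp.set("schemaLocation", new_location)
    return ET.tostring(root, encoding="unicode")


# Cut-down stand-in for tdwg_dwc_text.xsd, containing only the import at issue.
SAMPLE = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:import namespace="http://www.w3.org/XML/1998/namespace"
             schemaLocation="http://rs.gbif.org/schema/xml.xsd"/>
</xs:schema>"""

# Hypothetical placeholder, not a real or proposed IRI.
NEW_LOCATION = "https://example.org/dwc/xml.xsd"

updated = repoint_xml_import(SAMPLE, NEW_LOCATION)
print(updated)
```

ElementTree drops comments and may alter whitespace when it re-serializes, so for the real schema a one-line manual edit is probably the safer route. The sketch only shows which attribute is affected.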

@mdoering
Contributor Author

Technically, I would strongly avoid using redirects. The IRI should return an HTTP 200, not a 3xx, as many default implementations, including Java's, fail to follow redirects.

See https://stackoverflow.com/questions/29696638/how-to-validate-xml-with-schema-urls-that-return-http-301 or https://planet.jboss.org/post/java_7_xml_entity_resolver_doesn_t_follow_redirects_makes_xsd_validation_fail

@baskaufs

That's good to know, @mdoering. It sounds like the best solution for the XML schemas is to set up the content negotiation part of the script to serve requests for XML files without redirects. Do you know if most clients actually send a request header of application/xml or text/xml, or do they just depend on the server to look at the .xml file extension? I suppose it would be safest to handle both.

I think this could be implemented relatively easily by keeping a list on GitHub of the locations where the few XML schemas are located, having the server script load that list, then pull the file from wherever it is and serve it to the client with a 200. I do something like that here to determine whether particular terms should redirect to a web page or if the script should generate a page from data. That would require loading two files from GitHub the first time each XML file was requested, but I think that @MattBlissett has the server set up to cache requests for some period of time. So after the first request, clients should just get the file from the cache and that should be pretty efficient. That approach would allow changing the location of the file by just changing an entry in a table in GitHub and not actually making any changes to the server itself.
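
A minimal sketch of that idea, under several assumptions: the class `SchemaProxy` and the callables `load_table` and `fetch` are invented for illustration, the cache is folded into the script here although the real caching would presumably sit in the server setup, and the table entry reuses the IRI and URL proposed earlier in this thread.

```python
import time
from typing import Callable, Dict, Tuple

Response = Tuple[int, Dict[str, str], bytes]


class SchemaProxy:
    """Serve schemas with a 200 by fetching them from a table-driven location."""

    def __init__(self,
                 load_table: Callable[[], Dict[str, str]],
                 fetch: Callable[[str], bytes],
                 ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._load_table = load_table  # e.g. reads the IRI -> URL list from GitHub
        self._fetch = fetch            # e.g. downloads the schema file itself
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache: Dict[str, Tuple[float, bytes]] = {}  # iri -> (expiry, body)

    def get(self, iri: str) -> Response:
        now = self._clock()
        cached = self._cache.get(iri)
        if cached is not None and cached[0] > now:
            body = cached[1]                 # cache hit: no upstream requests
        else:
            table = self._load_table()       # first upstream load: the table
            location = table.get(iri)
            if location is None:
                return 404, {}, b""
            body = self._fetch(location)     # second upstream load: the schema
            self._cache[iri] = (now + self._ttl, body)
        return 200, {"Content-Type": "application/xml"}, body


# Fake upstream calls so the sketch runs without network access.
calls = []
DWC_IRI = "http://rs.tdwg.org/dwc/doc/xmlschema/"
DWC_URL = "https://dwc.tdwg.org/text/tdwg_dwc_text.xsd"


def fake_table() -> Dict[str, str]:
    calls.append("table")
    return {DWC_IRI: DWC_URL}


def fake_fetch(location: str) -> bytes:
    calls.append(location)
    return b"<xs:schema/>"


proxy = SchemaProxy(fake_table, fake_fetch)
first = proxy.get(DWC_IRI)
second = proxy.get(DWC_IRI)  # answered from the cache
print(first[0], second[0], calls)
```

Moving a schema would then only mean editing the table entry. The client always sees a direct 200 and never a 3xx.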

I don't think I have the bandwidth to deal with this now, so maybe it can be put off until I have time to work on the script and test carefully.

@mdoering
Contributor Author

Looks like this is what javax.xml.validation requests:

  ReqMethod      GET
  ReqURL         /eml.xsd
  ReqProtocol    HTTP/1.1
  ReqHeader      User-Agent: Java/13.0.1
  ReqHeader      Accept: text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2

Rather unspecific...

@baskaufs

Wow. Not that great.

I should have said .xsd, not .xml in my previous comment. It looks like handling the extension is the way to go.
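
For what it's worth, a check that handles both the extension and an explicit Accept header could look roughly like this. The function name `wants_xml` is hypothetical, and q-values are ignored for simplicity.

```python
from urllib.parse import urlparse

XML_TYPES = {"application/xml", "text/xml"}


def wants_xml(path: str, accept: str = "") -> bool:
    """True if the request should be answered with the XML schema itself."""
    # An .xsd extension decides it, since clients like Java send a vague Accept.
    if urlparse(path).path.lower().endswith(".xsd"):
        return True
    # Otherwise look for an explicit XML media type (q-values are ignored).
    for item in accept.split(","):
        media_type = item.split(";")[0].strip().lower()
        if media_type in XML_TYPES:
            return True
    return False


# Accept header as sent by javax.xml.validation, quoted earlier in this thread.
JAVA_ACCEPT = "text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2"

print(wants_xml("/eml.xsd", JAVA_ACCEPT))                     # True
print(wants_xml("/tapir/doc/xmlschema/", JAVA_ACCEPT))        # False
print(wants_xml("/tapir/doc/xmlschema/", "application/xml"))  # True
```

The middle case is the awkward one: a Java validator pointed at the extension-less "permanent" IRI gives the server nothing to go on.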

baskaufs self-assigned this Sep 10, 2020

@baskaufs

@tucotuco Maybe we can back-burner this for the time being.

@tucotuco
Member

tucotuco commented Sep 10, 2020 via email

@MattBlissett
Member

The original issue:

The DwC text xml schema references the W3C xml.xsd from a GBIF server.

The rs.tdwg.org server is about a metre from the rs.gbif.org server, and significantly further from GitHub's servers. Communication between them is very reliable, and external availability is likely to be the same.

Unless there's some other reason, I also agree this is very low priority.

@tucotuco
Member

tucotuco commented Aug 8, 2021

Keeping at low priority and removing from current milestone in the interest of releasing the new terms.
