New term

tucotuco · 2014-11-13T15:47:42Z

This proposal has had extensive commentary and has been updated by @timrobertson100 to accommodate all comments up to Dec 8th 2022. Previous versions of this proposal may be viewed by clicking the "edited" link above, and were the subject of the earlier comments below.

Submitter: Tommy McElrath @tmcelrath, Debbie Paul @debpaul, Tim Robertson @timrobertson100, Christian Bölling @cboelling
Efficacy Justification (why is this term necessary?): To provide a digital representation derived from and as close as possible in content to what is on the original label(s), in order to provide quality control and comparison to any and all parsed data from a label. Other use cases are outlined here: https://doi.org/10.1093/database/baz129
Demand Justification (name at least two organizations that independently need this term): Survey of digitizing collections conducted by @tmcelrath (see comments below), DataShot (MCZ), TaxonWorks, GBIF
Stability Justification (what concerns are there that this might affect existing implementations?): New term, does not adversely affect any existing terms or implementations.
Implications for dwciri: namespace (does this change affect a dwciri term version)?: As a "verbatim" term, dwc:verbatimLabel is not expected to have a dwciri: analog, so there are no implications in that namespace.

Proposed attributes of the new term:

Term name (in lowerCamelCase for properties, UpperCamelCase for classes): verbatimLabel
Organized in Class (e.g., Occurrence, Event, Location, Taxon): MaterialSample
Definition of the term (normative): A serialized encoding intended to represent the literal, i.e., character by character, textual content of a label affixed on, near, or explicitly associated with a material entity, free from interpretation, translation, or transliteration.
Usage comments (recommendations regarding content, etc., not normative): The content of this term should include no embellishments, prefixes, headers or other additions made to the text. Abbreviations must not be expanded and supposed misspellings must not be corrected. Lines or breakpoints between blocks of text that could be verified by seeing the original labels or images of them may be used. Examples of material entities include preserved specimens, fossil specimens, and material samples. Best practice is to use UTF-8 for all characters. Best practice is to add comment “verbatimLabel derived from human transcription” in occurrenceRemarks.
Examples (not normative):
1. For a label affixed to a pinned insect specimen, the verbatimLabel would contain:
  
  ILL: Union Co.
  Wolf Lake by Powder Plant
  Bridge. 1 March 1975
  Coll. S. Ketzler, S. Herbert
  
  Monotoma
  longicollis 4 ♂
  Det TC McElrath 2018
  
  INHS
  Insect Collection
  456782
  
  With comment "verbatimLabel derived from human transcription" added in occurrenceRemarks.
2. When using Optical Character Recognition (OCR) techniques against an herbarium sheet, the verbatimLabel would contain:
  
  0 1 2 3 4 5 6 7 8 9 10
  cm copyright reserved
  The New York
  Botanical Garden
  
  NEW YORK
  BOTANICAL
  GARDEN
  
  NEW YORK BOTANICAL GARDEN
  ACADEMY OF NATURAL SCIENCES OF PHILADELPHIA
  EXPLORATION OF BERMUDA
  NO. 355
  Cymbalaria Cymbalaria (L.) Wettst
  Roadside wall, The Crawl.
  STEWARDSON BROWN
  }COLLECTORS AUG. 31-SEPT. 20, 1905
  N.L. BRITTON
  
  NEW YORK BOTANICAL GARDEN
  00499439
  
  With comment “verbatimLabel derived from unadulterated OCR output” added in occurrenceRemarks.
Refines (identifier of the broader term this term refines; normative): None
Replaces (identifier of the existing term that would be deprecated and replaced by this term; normative): None. Does not replace any current DWC “verbatim” terms. Other “verbatim” terms have already been “parsed” to a certain data class and have their own uses
ABCD 2.06 (XPATH of the equivalent term in ABCD or EFG; not normative): /Marks/Mark/MarkText
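
As a rough illustration of the usage comments and example 1 above, the sketch below places the pinned-insect label text in a single verbatimLabel value, with line breaks kept and a blank line between labels, and writes it next to the recommended occurrenceRemarks comment. The tab-separated layout and the helper names `write_record` and `read_records` are assumptions made for illustration only; the proposal does not prescribe a serialization.

```python
import csv
import io

# Label text copied from example 1; blank lines separate the three labels.
VERBATIM_LABEL = "\n".join([
    "ILL: Union Co.",
    "Wolf Lake by Powder Plant",
    "Bridge. 1 March 1975",
    "Coll. S. Ketzler, S. Herbert",
    "",
    "Monotoma",
    "longicollis 4 ♂",
    "Det TC McElrath 2018",
    "",
    "INHS",
    "Insect Collection",
    "456782",
])


def write_record(verbatim_label: str, remarks: str) -> str:
    """Serialize one record as tab-separated text (hypothetical layout)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["verbatimLabel", "occurrenceRemarks"])
    # The csv module quotes the field because it contains line breaks.
    writer.writerow([verbatim_label, remarks])
    return buffer.getvalue()


def read_records(text: str) -> list:
    """Parse the tab-separated text back into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text, newline=""), delimiter="\t"))


serialized = write_record(
    VERBATIM_LABEL, "verbatimLabel derived from human transcription"
)
round_tripped = read_records(serialized)[0]["verbatimLabel"]
```

The point of the round trip is that nothing is added to or removed from the label text: line breaks, abbreviations and the ♂ character come back unchanged.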


tucotuco · 2020-09-09T19:29:48Z

This proposal still needs evidence of demand.

My question is, "Is it not sufficient/preferable to capture the label images? That is one level less of interpretation already."

tmcelrath · 2020-09-23T15:39:20Z

We use this field in TaxonWorks. We split it into three fields: "Buffered Determination Label", "Buffered Collecting Event Label" and "Buffered Other Labels". Just having an image is not enough, and sometimes we do not have an image.

Basically, I, and many other collections using TaxonWorks, want this DWC field.

matdillen · 2020-09-23T16:10:55Z

Does this encompass both "gold standard" verbatim transcriptions of specimen labels and outputs of automated OCR processes (e.g. Tesseract)? How to encode the different approaches and their metadata (methodology)?

How to differentiate between labels and their relative location? I don't think $ and are reliable enough, in particular if OCR outputs are in scope.

chicoreus · 2020-09-23T16:50:00Z

We use a field for verbatim transcription of a label in the DataShot object-to-image-to-data workflow software. This captures the verbatim transcription of text from a region of interest representing a single label identified in an image of a set of labels. Subsequent workflow steps add interpretation of this verbatim text into structured data. In a less formal manner, there is a Twitter feed https://twitter.com/EntoTranslator and a Facebook group https://www.facebook.com/groups/232785306782255/ where images of difficult-to-interpret labels are posted for members of the community to provide either transcriptions of difficult-to-read handwriting or interpretations of words, phrases, abbreviations, and such on the labels. There are clear upstream needs in digitization workflows for representing verbatim label text in structured form.

tucotuco · 2021-04-19T02:48:44Z

Closing for lack of demand.

tmcelrath · 2021-04-19T13:57:50Z

"Lack of demand?" Four different people have requested this be a DWC field and expected something to happen. I don't see lack of demand here. What do we need to provide to evidence "demand?"

tucotuco · 2021-04-19T16:22:06Z

TDWG members discussing a good idea does not constitute demand. The demand requirement needs independent organizations with a mission-driven need to share these data.


tmcelrath · 2021-04-19T16:44:30Z

@tucotuco What, specifically, do you want us to provide then? Would a survey of members of different natural history collections, with documented support of their need for this field, suffice?

tucotuco · 2021-04-19T17:06:44Z

@tmcelrath TaxonWorks suffices to represent that class of proponent. That is the equivalent of one proponent. What other organization or project needs it? If you can come up with that, the next step is to submit a templated New term request. I can do that, adding it to the beginning of the first comment to keep all the discussion in one place, but I need that evidence of demand.

chicoreus · 2021-04-19T17:20:26Z

As noted above, we've got a field for this in the DataShot system at the MCZ, associated with a region of interest in an image that contains multiple labels, but we haven't been able to go very far with this in the absence of a means of sharing with the community.

edwbaker · 2021-04-20T09:21:37Z

This initially seems like a straightforward enough proposal, but how does it interplay with the existing (and numerous) verbatim fields within DarwinCore? It seems to risk becoming a dumping ground for data that could/should go into existing fields, and perhaps discouraging their use because it's easier to just put it all, unstructured, into verbatimLabel.

I think my main reservation is the following: are there many examples where the existing verbatim fields are inadequate, and could these be better covered by additional verbatim field(s) rather than such a loosely defined single field?

tmcelrath · 2021-04-20T14:09:28Z

@edwbaker The issue is actually slightly different. "Parsing" text into many verbatim fields automatically introduces interpretation by its very nature. For example: What is a "verbatimLocality"? Should all locality info go in it? Or just the most specific locality? We've had differences of opinion just within our own group on just this one field.

To answer your question, DWC absolutely does not have enough verbatim fields. There are no verbatim identification fields, or verbatim curation labels fields (e.g. accession numbers, comments about preparation, etc ...). We use the ones that DWC has in addition to the verbatim one we are providing. Users do not have to use these fields, and yes, it introduces duplication of text, but that actually adds more power in terms of text-breakdown. We will never stop misreading labels and having poor quality control, but having this field allows for comparisons to the original verbatim label and will allow for corrections to be made.

The idea of this field is in part, quality control. I have found having this field INVALUABLE more times than I can count when looking back at the original text, comparing incorrect GPS coords, poorly interpreted localities, or people misreading labels.
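
One way to picture the quality-control use described above is a check that each parsed value can still be found, character for character, in the verbatim text. The sketch below is a minimal illustration under that assumption; the function names `normalize` and `unverified_fields` and the sample parsed record are hypothetical and are not taken from TaxonWorks or any other system.

```python
import re


def normalize(text: str) -> str:
    """Collapse runs of whitespace (including line breaks) to single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def unverified_fields(verbatim_label: str, parsed: dict) -> list:
    """Return names of parsed fields whose value does not occur literally in the label."""
    haystack = normalize(verbatim_label)
    return [
        name
        for name, value in parsed.items()
        if normalize(str(value)) not in haystack
    ]


label = (
    "ILL: Union Co.\n"
    "Wolf Lake by Powder Plant\n"
    "Bridge. 1 March 1975\n"
    "Coll. S. Ketzler, S. Herbert"
)

parsed = {
    "county": "Union Co.",
    "locality": "Wolf Lake by Powder Plant Bridge",
    "eventDate": "1 March 1957",  # deliberate slip, for illustration
}

flagged = unverified_fields(label, parsed)
```

A check like this only catches literal mismatches. Interpreted values, such as an ISO-formatted date or an expanded abbreviation, would be flagged as well, so the result is a list of fields to review by eye rather than a list of errors.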

tmcelrath · 2021-04-20T14:10:41Z

To anyone following this thread, I have a poll out right now: https://forms.gle/fgxbQUmQLQC4a1NY6 collecting people's thoughts about this proposed DWC field. Please help me gather responses there. I am looking to get as many diverse stakeholders as possible.

tucotuco · 2021-04-20T16:50:04Z

Reopened to accommodate renewed vigor in the proposal.

albenson-usgs · 2021-04-20T17:06:20Z

What I'm wondering about this proposal is if we are conflating data management with implementing a standard. In my work for OBIS-USA I rarely receive data already in Darwin Core and I have to do a crosswalk. When I do that work there is always a chance that I performed that work incorrectly in some way and so I do my best to preserve the original data in a data repository and a link to that in the IPT so that future users of the data can get back to the original data to check the translation if they need to. For me it would not make sense to have all of that information stored in verbatim fields. When and how is the best place to separate out the standardization of the data from management of the data? Apologies if my comment doesn't make sense in this context since this is primarily considering museum collection data and I'm thinking of sampling event data.

edwbaker · 2021-04-20T17:43:01Z

@albenson-usgs I think the only way of going back to the original data here is to include a label image. Having a label field is one potential source of error, then any further processing from that is another potential source of error.

There are a number of potential solutions to "the verbatim problem" in this thread (using either SKOS or a separate dwc namespace).

tmcelrath · 2021-04-20T18:03:05Z

So far in poll, all respondents want to see this term implemented in some form:

tmcelrath · 2021-04-20T18:03:53Z

Respondents are from a variety of different Collection Management Systems/databases:

tmcelrath · 2021-04-20T18:10:30Z

About half of respondents already use this field in their CMS:

matdillen · 2021-04-20T18:13:33Z

There are various different use cases for verbatim data. We described quite a few of them in a paper we wrote a while ago, more specifically in this table.

Darwin Core terms currently hardly support these use cases, with many verbatim concepts unaccounted for and no unambiguous term for the uninterpreted text dump as Tommy described.

While the content of this term will be messy and not very practical for machine training purposes, which seems like it could be a nice use case, it would support improved findability, validation efforts and linguistic aspects.

edwbaker · 2021-04-20T18:16:29Z

The issue I see with adding verbatimLabel or an equivalent (in name it doesn't cover other data sources, such as occurrences from a notebook) is that if we have that, why do we need all the verbatim fields in dwc? The current process seems to be we put the label data in verbatimX and cleaner data in X. If we follow this precedent, then we should look at what verbatim label data is missed at present, and how we address that (two possible solutions in my above comment). If we don't follow this precedent then (in my mind) we have a much larger discussion.

I think the point raised above by @albenson-usgs between data management (which I take in this instance to broadly be within an institution) and data standards (broadly between institutions) is highly relevant. From what I can see (glancing over dwc) this would be the first break from relatively atomic data to a definition that might include multiple data types. This alone I think is worthy of some serious discussion.

I wonder if a better solution to this might actually be within AudubonCore as a term like 'transcription of data' which would cover not only the textual transcription of a photograph of a label, but also the equivalent spoken data in audio recordings of species, etc. In this way we could potentially cover occurrence as well as specimen data using the same methodology - each time having a resource (label image, sound recording, etc) to verify against.

edwbaker · 2021-04-21T01:31:15Z

Having had a more thorough search it looks like GBIF have already minted a verbatimLabel term, and that it is used in the DwC-A format already by Plazi - http://plazi.org/api-tools/api/.

tucotuco · 2021-04-21T03:04:34Z

Given that GBIF has minted a term, are there any stability issues with Darwin Core making one? Does the term have a definition? If so, is it semantically the same as proposed here?


debpaul · 2022-12-09T20:36:21Z

And (please excuse this as a tangent, but it is quite relevant), here's one major reason why we want verbatimLabel. Just now, using chat.openai.com, I put in the above NYBG OCR output and asked the chat to find certain elements for me.
Can you see my happy dance?

debpaul · 2022-12-09T20:49:24Z

And because I couldn't resist a more complex query, see the entomology entry from above run through chat.openai.com

debpaul · 2022-12-09T20:51:12Z

And then @dimus asked me if I could get it to output in JSON or XML just for the asking and I got this!!!
{
  "state": "ILL",
  "county": "Union Co.",
  "locality": "Wolf Lake by Powder Plant Bridge",
  "collecting_date": "1 March 1975",
  "collectors": ["S. Ketzler", "S. Herbert"],
  "taxon_name": "Monotoma longicollis",
  "specimen_count": 4,
  "sex": "♂",
  "determined_by": "TC McElrath",
  "determined_date": 2018,
  "institution": "INHS",
  "collection_name": "Insect Collection",
  "barcode": 456782
}
From this

debpaul · 2022-12-09T21:02:20Z

One more, because, well, Darwin Core. So I modified the query to ask for Darwin Core mapped output. And Voila!

in text

{
  "stateProvince": "ILL",
  "county": "Union Co.",
  "locality": "Wolf Lake by Powder Plant Bridge",
  "eventDate": "1 March 1975",
  "recordedBy": ["S. Ketzler", "S. Herbert"],
  "scientificName": "Monotoma longicollis",
  "individualCount": 4,
  "sex": "♂",
  "identifiedBy": "TC McElrath",
  "dateIdentified": 2018,
  "institutionCode": "INHS",
  "collectionCode": "Insect Collection",
  "catalogNumber": 456782
}
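
The difference between the two outputs above is essentially a renaming of keys. As a sketch, the mapping below is read off by comparing the two JSON objects in this thread; it is not an official crosswalk, and the function name `to_darwin_core` is hypothetical.

```python
import json

# Generic key -> Darwin Core term, taken from the two outputs in this thread.
KEY_TO_DWC = {
    "state": "stateProvince",
    "county": "county",
    "locality": "locality",
    "collecting_date": "eventDate",
    "collectors": "recordedBy",
    "taxon_name": "scientificName",
    "specimen_count": "individualCount",
    "sex": "sex",
    "determined_by": "identifiedBy",
    "determined_date": "dateIdentified",
    "institution": "institutionCode",
    "collection_name": "collectionCode",
    "barcode": "catalogNumber",
}


def to_darwin_core(record: dict) -> dict:
    """Rename known keys; keep unknown keys untouched so nothing is silently lost."""
    return {KEY_TO_DWC.get(key, key): value for key, value in record.items()}


generic = json.loads(
    '{"state": "ILL", "collecting_date": "1 March 1975", "barcode": 456782}'
)
mapped = to_darwin_core(generic)
```

Only the keys change here. The values stay exactly as they were extracted from the label (for example "ILL" and "1 March 1975"), so any standardization of the content would be a separate, later step.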

Jegelewicz · 2023-02-20T18:50:21Z

Could this also be applied to a MachineObservation when it is supported by a physical photograph?

debpaul · 2023-02-28T15:11:57Z

Could this also be applied to a MachineObservation when it is supported by a physical photograph?

@Jegelewicz I'm not sure I understand your use case. What would the text be in the verbatimLabel in your use case?

Also note. (See above comment, somewhere in the long thread). In the distant future, one can imagine something to do with the AC Extension where any media with associated verbatim text can both be shared. For now, having dwc:verbatimLabel moves us closer to getting a lot more out of our skeletal records when it comes to searching in the aggregate.

Jegelewicz · 2023-02-28T16:24:06Z

What would the text be in the verbatimLabel in your use case

Photos in an archive generally have information either written on the back or on an associated label.

timrobertson100 · 2023-02-28T16:31:25Z

My reading of the definition would make this an ideal field to capture that @Jegelewicz

debpaul · 2023-02-28T17:07:24Z

My reading of the definition would make this an ideal field to capture that @Jegelewicz

@timrobertson100 so, for any media shared (a la using AC extension), associated text from that media could go in dwc:verbatimLabel?

tmcelrath · 2023-02-28T17:32:15Z

@timrobertson100 & @Jegelewicz as this is written "textual content of a label affixed on, near, or explicitly associated with a material entity" that can easily be incorporated. E.g. a machine observation (e.g. OCR) from a photo explicitly associated with a specimen.

Jegelewicz · 2023-02-28T18:01:34Z

Yes - I think the change to MaterialEntity from the other set of terms fixes my concern that MachineObservation was left out.

tucotuco · 2023-04-30T17:22:14Z

Is there anyone who would be willing to prepare a markdown document that describes, at a minimum, the recommendations for the usage comments for this term? The idea would be to point to this new document with a link in the usage comments for verbatimLabel, which can't support the complexity required for the existing commentary. We are still in the process of figuring out the best way to do this (see issue #444), but will need the content regardless before we can move on to ratification.

debpaul · 2023-05-23T21:39:57Z

Is there anyone who would be willing to prepare a markdown document that describes, at a minimum, the recommendations for the usage comments for this term? The idea would be to point to this new document with a link in the usage comments for verbatimLabel, which can't support the complexity required for the existing commentary. We are still in the process of figuring out the best way to do this (see issue #444), but will need the content regardless before we can move on to ratification.

@tucotuco I've never done such before (by myself anyway). I could ask @timrobertson100 and @tmcelrath and see if we can manage it. What's our deadline? Many of us headed to SPNHC.

debpaul · 2023-05-23T21:58:52Z

Also regarding usage comments, what then needs to be added / edited / removed from this original version then?

Usage comments (recommendations regarding content, etc., not normative): The content of this term should include no embellishments, prefixes, headers or other additions made to the text. Abbreviations must not be expanded and supposed misspellings must not be corrected. Lines or breakpoints between blocks of text that could be verified by seeing the original labels or images of them may be used. Examples of material entities include preserved specimens, fossil specimens, and material samples. Best practice is to use UTF-8 for all characters. Best practice is to add comment “verbatimLabel derived from human transcription” in occurrenceRemarks.

Wouldn't this be
verbatimLabel derived from human transcription and / or optical character recognition (OCR), in occurrenceRemarks?

timrobertson100 · 2023-05-24T13:00:37Z

@tucotuco, @debpaul how about this as a start, which is a verbatim extract from the opening comment?

The raw markdown which you can see in this preview.

(being non-normative, we can evolve these examples at any time without further ratification)

tucotuco · 2023-05-24T16:18:11Z

@timrobertson100 This is the simple approach I was hoping for. I would include the document in dwc repository (sensible location to be determined) and flag the entire document as non-normative paralleling what it would be if its content were captured directly in comments or examples.

Examples and template for #32 and #444

timrobertson100 · 2023-06-15T12:27:11Z

The examples now exist on https://dwc.tdwg.org/examples/verbatimLabel
If anyone on this thread would like their name added to that, please let me know on trobertson@gbif.org

tucotuco · 2023-07-10T15:56:18Z

Huzzah!

tucotuco added Term - add Class - MaterialSample labels Nov 13, 2014

tucotuco added the Process - need evidence for demand label Sep 9, 2020

tucotuco closed this as completed Apr 19, 2021

tucotuco added the Process - dismissed label Apr 19, 2021

tucotuco reopened this Apr 20, 2021

tucotuco removed the Process - dismissed label Apr 20, 2021

tucotuco added Process - prepare for Executive review and removed Process - ready for public comment labels Mar 28, 2023

tucotuco mentioned this issue Apr 29, 2023

Template for non-normative usage documents #444

Closed

Jegelewicz mentioned this issue Jun 8, 2023

Change term - verbatimLabel #458

Closed

timrobertson100 added a commit to timrobertson100/dwc that referenced this issue Jun 14, 2023

Examples and template for tdwg#32 and tdwg#444

3dd9bac

timrobertson100 added a commit to timrobertson100/dwc that referenced this issue Jun 14, 2023

Examples and template for tdwg#32 and tdwg#444

a917caa

tucotuco pushed a commit that referenced this issue Jun 14, 2023

Merge pull request #481 from timrobertson100/examples

7341372

Examples and template for #32 and #444

tucotuco added Process - in Executive review and removed Process - prepare for Executive review labels Jun 18, 2023

tucotuco added Process - complete and removed Process - in Executive review labels Jul 10, 2023

tucotuco closed this as completed Jul 10, 2023

New term - verbatimLabel #32
