New term - verbatimLabel #32

Closed
tucotuco opened this issue Nov 13, 2014 · 123 comments

Comments

@tucotuco
Member

tucotuco commented Nov 13, 2014

This proposal has had extensive commentary and has been updated by @timrobertson100 to accommodate all comments up to Dec 8th 2022. Previous versions of this proposal may be viewed by clicking the "edited" link above, and were the subject of the earlier comments below.

New term

  • Submitter: Tommy McElrath @tmcelrath, Debbie Paul @debpaul, Tim Robertson @timrobertson100, Christian Bölling @cboelling
  • Efficacy Justification (why is this term necessary?): To provide a digital representation derived from and as close as possible in content to what is on the original label(s), in order to provide quality control and comparison to any and all parsed data from a label. Other use cases are outlined here: https://doi.org/10.1093/database/baz129
  • Demand Justification (name at least two organizations that independently need this term): Survey of digitizing collections conducted by @tmcelrath (see comments below), DataShot (MCZ), TaxonWorks, GBIF
  • Stability Justification (what concerns are there that this might affect existing implementations?): New term, does not adversely affect any existing terms or implementations.
  • Implications for dwciri: namespace (does this change affect a dwciri term version)?: As a "verbatim" term, dwc:verbatimLabel is not expected to have a dwciri: analog, so there are no implications in that namespace.

Proposed attributes of the new term:

  • Term name (in lowerCamelCase for properties, UpperCamelCase for classes): verbatimLabel

  • Organized in Class (e.g., Occurrence, Event, Location, Taxon): MaterialSample

  • Definition of the term (normative): A serialized encoding intended to represent the literal, i.e., character by character, textual content of a label affixed on, near, or explicitly associated with a material entity, free from interpretation, translation, or transliteration.

  • Usage comments (recommendations regarding content, etc., not normative): The content of this term should include no embellishments, prefixes, headers or other additions made to the text. Abbreviations must not be expanded and supposed misspellings must not be corrected. Lines or breakpoints between blocks of text that could be verified by seeing the original labels or images of them may be used. Examples of material entities include preserved specimens, fossil specimens, and material samples. Best practice is to use UTF-8 for all characters. Best practice is to add comment “verbatimLabel derived from human transcription” in occurrenceRemarks.

  • Examples (not normative):

    1. For a label affixed to a pinned insect specimen, the verbatimLabel would contain:

      ILL: Union Co.
      Wolf Lake by Powder Plant
      Bridge. 1 March 1975
      Coll. S. Ketzler, S. Herbert

      Monotoma
      longicollis 4 ♂
      Det TC McElrath 2018

      INHS
      Insect Collection
      456782

      With comment "verbatimLabel derived from human transcription" added in occurrenceRemarks.

    2. When using Optical Character Recognition (OCR) techniques against an herbarium sheet, the verbatimLabel would contain:

      0 1 2 3 4 5 6 7 8 9 10
      cm copyright reserved
      The New York
      Botanical Garden

      NEW YORK
      BOTANICAL
      GARDEN

      NEW YORK BOTANICAL GARDEN
      ACADEMY OF NATURAL SCIENCES OF PHILADELPHIA
      EXPLORATION OF BERMUDA
      NO. 355
      Cymbalaria Cymbalaria (L.) Wettst
      Roadside wall, The Crawl.
      STEWARDSON BROWN
      }COLLECTORS AUG. 31-SEPT. 20, 1905
      N.L. BRITTON

      NEW YORK BOTANICAL GARDEN
      00499439

      With comment “verbatimLabel derived from unadulterated OCR output” added in occurrenceRemarks.

  • Refines (identifier of the broader term this term refines; normative): None

  • Replaces (identifier of the existing term that would be deprecated and replaced by this term; normative): None. Does not replace any current DWC “verbatim” terms. Other “verbatim” terms have already been “parsed” to a certain data class and have their own uses

  • ABCD 2.06 (XPATH of the equivalent term in ABCD or EFG; not normative): /Marks/Mark/MarkText
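As a rough illustration of the usage comments above, the following Python sketch stores the first example label in a flat record keyed by Darwin Core term names and checks that its line breaks and the ♂ character survive a CSV round trip. The helper names `build_record` and `to_csv`, and the choice of CSV as the exchange format, are assumptions made for illustration and are not part of the proposal.

```python
import csv
import io

# Label text copied from example 1 above; blank lines separate the three labels.
VERBATIM_LABEL = (
    "ILL: Union Co.\n"
    "Wolf Lake by Powder Plant\n"
    "Bridge. 1 March 1975\n"
    "Coll. S. Ketzler, S. Herbert\n"
    "\n"
    "Monotoma\n"
    "longicollis 4 ♂\n"
    "Det TC McElrath 2018\n"
    "\n"
    "INHS\n"
    "Insect Collection\n"
    "456782"
)

FIELDS = ["catalogNumber", "verbatimLabel", "occurrenceRemarks"]


def build_record(catalog_number, label_text, source):
    """Return a flat dict keyed by Darwin Core term names (hypothetical helper)."""
    return {
        "catalogNumber": catalog_number,
        # The label text is stored untouched: no prefixes, no expansion, no corrections.
        "verbatimLabel": label_text,
        "occurrenceRemarks": f"verbatimLabel derived from {source}",
    }


def to_csv(records):
    """Serialize records to CSV text; fields with line breaks are quoted by csv."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


record = build_record("456782", VERBATIM_LABEL, "human transcription")
csv_text = to_csv([record])

# Read the row back to confirm the label text is unchanged.
round_tripped = next(csv.DictReader(io.StringIO(csv_text, newline="")))
```

When the text is written to a file rather than held in memory, opening the file with `encoding="utf-8"` and `newline=""` keeps the same behavior.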

@tucotuco
Member Author

tucotuco commented Sep 9, 2020

This proposal still needs evidence of demand.

My question is, "Is it not sufficient/preferable to capture the label images? That is one level less of interpretation already."

@tmcelrath

tmcelrath commented Sep 23, 2020

We use this field in TaxonWorks. We split it into three fields: "Buffered Determination Label", "Buffered Collecting Event Label", and "Buffered Other Labels". Just having an image is not enough, and sometimes we do not have an image.

Basically, I, and many other collections using TaxonWorks, want this DWC field.

@matdillen

Does this encompass both "gold standard" verbatim transcriptions of specimen labels and outputs of automated OCR processes (e.g. Tesseract)? How to encode the different approaches and their metadata (methodology)?

How to differentiate between labels and their relative location? I don't think $ and are reliable enough, in particular if OCR outputs are in scope.
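One possible convention, and only an assumption here, is the one the examples in the proposal appear to follow: a blank line between labels, with line breaks kept inside each label. A minimal Python sketch of splitting on that convention follows. `split_labels` is a hypothetical helper, and it would not be reliable for OCR output that contains stray blank lines.

```python
import re


def split_labels(verbatim_label):
    """Split label text on one or more blank lines, keeping line breaks inside each label."""
    blocks = re.split(r"\n\s*\n", verbatim_label.strip())
    return [block for block in blocks if block]


text = (
    "ILL: Union Co.\n"
    "Wolf Lake by Powder Plant\n"
    "Bridge. 1 March 1975\n"
    "Coll. S. Ketzler, S. Herbert\n"
    "\n"
    "Monotoma\n"
    "longicollis 4 ♂\n"
    "Det TC McElrath 2018\n"
    "\n"
    "INHS\n"
    "Insect Collection\n"
    "456782"
)

labels = split_labels(text)
```

This recovers the order of the labels but nothing about their physical position relative to each other.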

@chicoreus

We use a field for verbatim transcription of a label in the DataShot object-to-image-to-data workflow software. This captures the verbatim transcription of text from a region of interest representing a single label identified in an image of a set of labels. Subsequent workflow steps add interpretation of this verbatim text into structured data. In a less formal manner, there is a Twitter feed https://twitter.com/EntoTranslator and a Facebook group https://www.facebook.com/groups/232785306782255/ where images of difficult-to-interpret labels are posted so that members of the community can provide either transcriptions of hard-to-read handwriting or interpretations of words, phrases, abbreviations, and the like on the labels. There are clear upstream needs in digitization workflows for representing verbatim label text in structured form.

@tucotuco
Member Author

Closing for lack of demand.

@tmcelrath

"Lack of demand?" Four different people have requested this be a DWC field and expected something to happen. I don't see lack of demand here. What do we need to provide to evidence "demand?"

@tucotuco
Member Author

tucotuco commented Apr 19, 2021 via email

@tmcelrath

@tucotuco What, specifically, do you want us to provide then? Would a survey of members of different natural history collections, with documented support for their need of this field, suffice?

@tucotuco
Member Author

@tmcelrath TaxonWorks suffices to represent that class of proponent. That is the equivalent of one proponent. What other organization or project needs it? If you can come up with that, the next step is to submit a templated New term request. I can do that, adding it to the beginning of the first comment to keep all the discussion in one place, but I need that evidence of demand.

@chicoreus

As noted above, we've got a field for this in the DataShot system at the MCZ, associated with a region of interest in an image that contains multiple labels, but we haven't been able to go very far with it in the absence of a means of sharing with the community.

@edwbaker
Member

This initially seems like a straightforward enough proposal, but how does it interplay with the existing (and numerous) verbatim fields within DarwinCore? It seems to risk becoming a dumping ground for data that could/should go into existing fields, and perhaps discouraging their use because it's easier to just put it all, unstructured, into verbatimLabel.

I think my main reservation is the following: are there many examples where the existing verbatim fields are inadequate, and could these be better covered by additional verbatim field(s) rather than such a loosely defined single field?

@tmcelrath

@edwbaker The issue is actually slightly different. "Parsing" text into many verbatim fields automatically introduces interpretation by its very nature. For example: What is a "verbatimLocality"? Should all locality info go in it? Or just the most specific locality? We've had differences of opinion just within our own group on just this one field.

To answer your question, DWC absolutely does not have enough verbatim fields. There are no verbatim identification fields, or verbatim curation labels fields (e.g. accession numbers, comments about preparation, etc ...). We use the ones that DWC has in addition to the verbatim one we are providing. Users do not have to use these fields, and yes, it introduces duplication of text, but that actually adds more power in terms of text-breakdown. We will never stop misreading labels and having poor quality control, but having this field allows for comparisons to the original verbatim label and will allow for corrections to be made.

The idea of this field is in part, quality control. I have found having this field INVALUABLE more times than I can count when looking back at the original text, comparing incorrect GPS coords, poorly interpreted localities, or people misreading labels.
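The quality-control comparison described here could look something like the following Python sketch, which flags parsed field values that cannot be found in the verbatim label text. `fields_not_on_label` and `normalize` are hypothetical helpers, and the substring test is an assumption chosen for brevity. It only catches values that are meant to be literal copies of label text, not interpreted ones.

```python
import re


def normalize(text):
    """Collapse runs of whitespace (including line breaks) into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def fields_not_on_label(parsed, verbatim_label):
    """Return the names of parsed fields whose value does not occur in the label text."""
    label = normalize(verbatim_label).casefold()
    return sorted(
        name
        for name, value in parsed.items()
        if normalize(str(value)).casefold() not in label
    )


label = (
    "ILL: Union Co.\n"
    "Wolf Lake by Powder Plant\n"
    "Bridge. 1 March 1975\n"
    "Coll. S. Ketzler, S. Herbert"
)

parsed = {
    "county": "Union Co.",
    "locality": "Wolf Lake by Powder Plant Bridge",
    # Deliberate transcription slip for the demonstration: the label says 1975.
    "verbatimEventDate": "1 March 1957",
}

flagged = fields_not_on_label(parsed, label)
```

Here only `verbatimEventDate` is flagged, because the locality still matches once the line break on the label is treated as a space.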

@tmcelrath

To anyone following this thread, I have a poll out right now: https://forms.gle/fgxbQUmQLQC4a1NY6 collecting people's thoughts about this proposed DWC field. Please help me gather responses there. I am looking to get as many diverse stakeholders as possible.

@tucotuco tucotuco reopened this Apr 20, 2021
@tucotuco
Member Author

Reopened to accommodate renewed vigor in the proposal.

@albenson-usgs

What I'm wondering about this proposal is if we are conflating data management with implementing a standard. In my work for OBIS-USA I rarely receive data already in Darwin Core and I have to do a crosswalk. When I do that work there is always a chance that I performed that work incorrectly in some way and so I do my best to preserve the original data in a data repository and a link to that in the IPT so that future users of the data can get back to the original data to check the translation if they need to. For me it would not make sense to have all of that information stored in verbatim fields. When and how is the best place to separate out the standardization of the data from management of the data? Apologies if my comment doesn't make sense in this context since this is primarily considering museum collection data and I'm thinking of sampling event data.

@edwbaker
Member

@albenson-usgs I think the only way of going back to the original data here is to include a label image. Having a label field is one potential source of error, then any further processing from that is another potential source of error.

There are a number of potential solutions to "the verbatim problem" in this thread (using either SKOS or a separate dwc namespace).

@tmcelrath

So far in the poll, all respondents want to see this term implemented in some form:
image

@tmcelrath

Respondents are from a variety of different Collection Management Systems/databases:
image

@tmcelrath

About half of respondents already use this field in their CMS:
image

@matdillen

There are various use cases for verbatim data. We described quite a few of them in a paper we wrote a while ago, more specifically in this table.

Darwin Core terms currently hardly support these use cases, with many verbatim concepts unaccounted for and no unambiguous term for the uninterpreted text dump as Tommy described.

While the content of this term will be messy and not very practical for machine training purposes, which seems like it could be a nice use case, it would support improved findability, validation efforts and linguistic aspects.

@edwbaker
Member

The issue I see with adding verbatimLabel or an equivalent (in name it doesn't cover other data sources, such as occurrences from a notebook) is that if we have that, why do we need all the verbatim fields in dwc? The current process seems to be we put the label data in verbatimX and cleaner data in X. If we follow this precedent, then we should look at what verbatim label data is missed at present, and how we address that (two possible solutions in my above comment). If we don't follow this precedent then (in my mind) we have a much larger discussion.

I think the point raised above by @albenson-usgs between data management (which I take in this instance to broadly be within an institution) and data standards (broadly between institutions) is highly relevant. From what I can see (glancing over dwc) this would be the first break from relatively atomic data to a definition that might include multiple data types. This alone I think is worthy of some serious discussion.

I wonder if a better solution to this might actually be within AudubonCore as a term like 'transcription of data' which would cover not only the textual transcription of a photograph of a label, but also the equivalent spoken data in audio recordings of species, etc. In this way we could potentially cover occurrence as well as specimen data using the same methodology - each time having a resource (label image, sound recording, etc) to verify against.

@edwbaker
Member

Having had a more thorough search it looks like GBIF have already minted a verbatimLabel term, and that it is used in the DwC-A format already by Plazi - http://plazi.org/api-tools/api/.

@tucotuco
Member Author

tucotuco commented Apr 21, 2021 via email

@debpaul

debpaul commented Dec 9, 2022

And (please excuse the tangent, but it is quite relevant) here's one major reason why we want verbatimLabel. Just now, using chat.openai.com, I entered the above NYBG OCR output and asked the chat to find certain elements for me.
image Can you see my happy dance?

@debpaul

debpaul commented Dec 9, 2022

And because I couldn't resist a more complex query, see the entomology entry from above run through chat.openai.com
image

@debpaul

debpaul commented Dec 9, 2022

And then @dimus asked me if I could get it to output in JSON or XML just for the asking and I got this!!!
{
"state": "ILL",
"county": "Union Co.",
"locality": "Wolf Lake by Powder Plant Bridge",
"collecting_date": "1 March 1975",
"collectors": ["S. Ketzler", "S. Herbert"],
"taxon_name": "Monotoma longicollis",
"specimen_count": 4,
"sex": "♂",
"determined_by": "TC McElrath",
"determined_date": 2018,
"institution": "INHS",
"collection_name": "Insect Collection",
"barcode": 456782
}
From this
image

@debpaul

debpaul commented Dec 9, 2022

One more, because, well, Darwin Core. So I modified the query to ask for Darwin Core mapped output. And Voila!
image

in text

{
"stateProvince": "ILL",
"county": "Union Co.",
"locality": "Wolf Lake by Powder Plant Bridge",
"eventDate": "1 March 1975",
"recordedBy": ["S. Ketzler", "S. Herbert"],
"scientificName": "Monotoma longicollis",
"individualCount": 4,
"sex": "♂",
"identifiedBy": "TC McElrath",
"dateIdentified": 2018,
"institutionCode": "INHS",
"collectionCode": "Insect Collection",
"catalogNumber": 456782
}
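Output like this would still need post-processing before it could be shared. The Python sketch below takes a subset of the mapped output above and converts eventDate to ISO 8601, keeps the original date string in verbatimEventDate, joins the collectors with " | ", and treats the catalog number as text. These conventions are assumptions based on the general Darwin Core recommendations as I understand them. `to_iso_date` is a hypothetical helper that only handles the "day MonthName year" pattern seen on this label.

```python
import json
from datetime import date

MONTHS = {
    name: number
    for number, name in enumerate(
        ["January", "February", "March", "April", "May", "June", "July",
         "August", "September", "October", "November", "December"],
        start=1,
    )
}


def to_iso_date(text):
    """Convert a 'day MonthName year' string such as '1 March 1975' to ISO 8601."""
    day, month, year = text.split()
    return date(int(year), MONTHS[month], int(day)).isoformat()


# Subset of the chat output shown above, copied unchanged.
chat_output = json.loads("""
{
  "eventDate": "1 March 1975",
  "recordedBy": ["S. Ketzler", "S. Herbert"],
  "dateIdentified": 2018,
  "catalogNumber": 456782
}
""")

record = {
    # Keep the string as it was read so the conversion can be checked later.
    "verbatimEventDate": chat_output["eventDate"],
    "eventDate": to_iso_date(chat_output["eventDate"]),
    "recordedBy": " | ".join(chat_output["recordedBy"]),
    # Identifiers and year-only dates are kept as strings, not numbers.
    "dateIdentified": str(chat_output["dateIdentified"]),
    "catalogNumber": str(chat_output["catalogNumber"]),
}
```

The values can then be compared against the verbatimLabel text, which is the quality-control use case raised earlier in the thread.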

@Jegelewicz

Could this also be applied to a MachineObservation when it is supported by a physical photograph?

@debpaul

debpaul commented Feb 28, 2023

Could this also be applied to a MachineObservation when it is supported by a physical photograph?

@Jegelewicz I'm not sure I understand your use case. What would the text be in the verbatimLabel in your use case?

Also note (see the comment above, somewhere in this long thread): in the distant future, one can imagine something involving the AC Extension in which any media and its associated verbatim text can both be shared. For now, having dwc:verbatimLabel moves us closer to getting a lot more out of our skeletal records when it comes to searching in the aggregate.

@Jegelewicz

Jegelewicz commented Feb 28, 2023

What would the text be in the verbatimLabel in your use case

Photos in an archive generally have information either written on the back or on an associated label.

@timrobertson100
Member

timrobertson100 commented Feb 28, 2023

My reading of the definition would make this an ideal field to capture that @Jegelewicz

@debpaul

debpaul commented Feb 28, 2023

My reading of the definition would make this an ideal field to capture that @Jegelewicz

@timrobertson100 so, for any media shared (a la using AC extension), associated text from that media could go in dwc:verbatimLabel?

@tmcelrath

@timrobertson100 & @Jegelewicz as this is written "textual content of a label affixed on, near, or explicitly associated with a material entity" that can easily be incorporated. E.g. a machine observation (e.g. OCR) from a photo explicitly associated with a specimen.

@Jegelewicz

Yes - I think the change to MaterialEntity from the other set of terms fixes my concern that MachineObservation was left out.

@tucotuco
Member Author

Is there anyone who would be willing to prepare a markdown document that describes, at a minimum, the recommendations for the usage comments for this term? The idea would be to point to this new document with a link in the usage comments for verbatimLabel, which can't support the complexity required for the existing commentary. We are still in the process of figuring out the best way to do this (see issue #444), but will need the content regardless before we can move on to ratification.

@debpaul

debpaul commented May 23, 2023

Is there anyone who would be willing to prepare a markdown document that describes, at a minimum, the recommendations for the usage comments for this term? The idea would be to point to this new document with a link in the usage comments for verbatimLabel, which can't support the complexity required for the existing commentary. We are still in the process of figuring out the best way to do this (see issue #444), but will need the content regardless before we can move on to ratification.

@tucotuco I've never done this before (by myself, anyway). I could ask @timrobertson100 and @tmcelrath and see if we can manage it. What's our deadline? Many of us are headed to SPNHC.

@debpaul

debpaul commented May 23, 2023

Also, regarding usage comments, what needs to be added, edited, or removed from this original version?

Usage comments (recommendations regarding content, etc., not normative): The content of this term should include no embellishments, prefixes, headers or other additions made to the text. Abbreviations must not be expanded and supposed misspellings must not be corrected. Lines or breakpoints between blocks of text that could be verified by seeing the original labels or images of them may be used. Examples of material entities include preserved specimens, fossil specimens, and material samples. Best practice is to use UTF-8 for all characters. Best practice is to add comment “verbatimLabel derived from human transcription” in occurrenceRemarks.

Wouldn't this be
verbatimLabel derived from human transcription and / or optical character recognition (OCR), in occurrenceRemarks?

@timrobertson100
Member

@tucotuco, @debpaul how about this as a start, which is a verbatim extract from the opening comment?

Here is the raw markdown, which you can see in this preview.

(being non-normative, we can evolve these examples at any time without further ratification)

@tucotuco
Member Author

@timrobertson100 This is the simple approach I was hoping for. I would include the document in dwc repository (sensible location to be determined) and flag the entire document as non-normative paralleling what it would be if its content were captured directly in comments or examples.

timrobertson100 added a commit to timrobertson100/dwc that referenced this issue Jun 14, 2023
timrobertson100 added a commit to timrobertson100/dwc that referenced this issue Jun 14, 2023
tucotuco added a commit that referenced this issue Jun 14, 2023
@timrobertson100
Member

timrobertson100 commented Jun 15, 2023

The examples now exist on https://dwc.tdwg.org/examples/verbatimLabel
If anyone on this thread would like their name added to that, please let me know at trobertson@gbif.org

@tucotuco
Member Author

Huzzah!
