Abstract
The DLF/Aquifer Implementation Guidelines for Shareable MODS Records were written to facilitate the creation of rich, shareable metadata for use in aggregated digital humanities collections. While the guideline creators recognize that most data providers do not meet the criteria set forth in that document, this study attempts to quantify current levels of conformance to the base requirements of the DLF/Aquifer MODS guidelines. By analyzing collections for which MODS records are currently made available to OAI-PMH service providers, predictions can be made about the nature and extent of the normalization that service providers will need to perform, and about the training and education required by data providers wishing to expose MODS records that are useful in a variety of contexts.
Documents
Supporting documents
- DLF/Aquifer Implementation Guidelines for Shareable MODS Records (.pdf - Adobe Portable Document - 564K)
- Metadata Object Description Schema (MODS)
Note: See final paper for full bibliography
Resulting documents
Project commentary
Goals
- Measure conformance to nine requirements in the DLF/Aquifer Implementation Guidelines for Shareable MODS Records
- Recommend conformance techniques:
- For service providers
- For data providers
- What are the costs of conformance?
Background
Open Archives Initiative - Protocol for Metadata Harvesting
- Low-barrier mechanism for repository interoperability
- Data providers expose metadata available for harvest
- Service providers harvest metadata for aggregation and creation of value-added services
- OAI-PMH records encapsulate descriptive metadata:
- Unqualified Dublin Core is required descriptive metadata in OAI-PMH record
- Other metadata formats are allowed, but harvesters generally harvest DC
- Dublin Core may have limits on quality and shareability
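As a minimal sketch of the harvesting step described above, the following builds an OAI-PMH ListRecords request URL. The base URL is hypothetical, and the metadataPrefix value "mods" is an assumption: each repository advertises the prefixes it actually supports through the ListMetadataFormats verb.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="mods", set_spec=None):
    """Build an OAI-PMH ListRecords request URL for one metadata format."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        # Optional selective harvesting by set
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


# Hypothetical repository base URL, used only for illustration.
print(list_records_url("https://repository.example.org/oai"))
```

Requesting "oai_dc" instead of "mods" would return the unqualified Dublin Core records that every OAI-PMH repository is required to provide.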
Quality / Shareable Metadata
- Quality: "Low quality metadata can render a library almost unusable..." (Stvilia et al.)
- Helps users find, identify, select, and acquire entities (Bruce & Hillman)
- Bruce & Hillman: Seven characteristics
- Completeness
- Provenance
- Accuracy
- Conformance to expectations
- Logical consistency and coherence
- Timeliness
- Accessibility
- Gasser & Stvilia: Intrinsic, Relational & Reputational
- Intrinsic: Within the item
- Relational: Represents external conditions accurately
- Reputational: Believability of representation
- Shreeves et al.:
- Relational Completeness: Description sufficiently supports goals of metadata
- Intrinsic Consistency: Consistent structure and content
- Shareable: "Interoperable" and "meaningful when combined with metadata from other sources." (DLF/NSDL)
- Quality metadata and:
- Consistent across collections
- Coherent outside of local context
- Use of standard vocabularies
- XML (machine interoperability)
- Dublin Core limitations reduce quality and shareability:
- All elements are optional
- All elements are repeatable
- Cannot declare controlled vocabularies
- Cannot define completeness or consistency outside of local context
DLF/Aquifer Project & MODS
DLF/Aquifer Project
"Promote effective use of distributed digital library content for teaching, learning, and research in the area of American culture and life."
DLF/Aquifer Implementation Guidelines for Shareable MODS Records
- "Rich, shareable metadata that is coherent and consistent, and, thus, useful to aggregators and end-users."
- Born-digital or digitized items; Cultural heritage or humanities-based materials; Specifically meant to be shared
- Based on MODS
- American Social History Online
Metadata Object Description Schema 3.2 (MODS)
- XML-based metadata schema; "Subset of MARC fields and uses language-based tags rather than numeric ones"; "Used for a variety of purposes, and particularly for library applications"
- More explicit and granular than Dublin Core
- <title> and <subTitle>
- "keyDate"
- Can declare controlled vocabularies
- DLF/Aquifer Implementation Guidelines for Shareable MODS Records state requirements and recommendations for encoding MODS
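To illustrate the granularity mentioned above, here is a small sketch that reads <title>, <subTitle>, and the keyDate attribute from a MODS fragment. The record content is invented for illustration and does not come from the test set.

```python
import xml.etree.ElementTree as ET

MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}

# Invented record fragment for illustration; not taken from the test set.
SAMPLE = """
<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo>
    <title>Letters from the field</title>
    <subTitle>a soldier's correspondence</subTitle>
  </titleInfo>
  <originInfo>
    <dateCreated encoding="w3cdtf" keyDate="yes">1863</dateCreated>
    <dateIssued encoding="w3cdtf">2006</dateIssued>
  </originInfo>
</mods>
"""

root = ET.fromstring(SAMPLE)

# Title and subtitle are separate subelements rather than one string.
title = root.findtext("mods:titleInfo/mods:title", namespaces=MODS_NS)
sub_title = root.findtext("mods:titleInfo/mods:subTitle", namespaces=MODS_NS)

# keyDate="yes" marks which of several dates an aggregator should index on.
key_dates = [
    el.text
    for el in root.findall("mods:originInfo/*[@keyDate='yes']", MODS_NS)
]

print(title, "|", sub_title, "|", key_dates)
```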
Goals of Study
- Guidelines state "the requirements and recommendations set forth here are not currently met by most current and potential Aquifer participants."
- Measure current record conformance to nine DLF/Aquifer MODS guideline requirements (and "required if applicable")
- Make recommendations for remediation:
- "Service providers will generally expect to do at least a minimal amount of processing of harvested metadata to achieve goals." (DLF/NSDL)
- "Enhances usefulness, while avoiding misuse or misunderstanding" (Foulonneau & Cole)
- Cleaning, normalization, augmentation done automatically; Done "safely" to maintain accuracy
- What processing can more accurately (and easily) be done by data providers?
Methodology
Test set
- Harvested nine DLF/MODS Portal Data Contributors plus Digital Books from UIUC and the Open Content Alliance (343,529 records)
- Processed to populate a relational database with repository, record, and element tables
- Collections of fewer than 1,000 records -> all records extracted; collections of 1,000+ records -> enough records processed for a 95% confidence level with a confidence interval of 4
| Repository Name (Base URL) | Records Harvested | Records Extracted |
|---|---|---|
| A Celebration of Women Writers (Base URL) | 304 | 304 |
| OCLC Research Publications (Base URL) | 852 | 852 |
| University of Tennessee Libraries, Digital Library Center (Base URL) | 859 | 859 |
| Southern Spaces (Base URL) | 62 | 62 |
| Digital Books from UIUC and the Open Content Alliance (Base URL) | 767 | 766 |
| University of Chicago Library Metadata Repository (Base URL) | 372 | 372 |
| Indiana University Digital Library Program (Base URL) | 14,425 | 721 |
| Deep Blue at the University of Michigan (Base URL) | 24,299 | 967 |
| Library of Congress Digitized Historical Collections (Base URL) | 292,000 | 747 |
| The University of Michigan, University Library, Digital Library Production Service Collections (Base URL) | 9,589 | 639 |
| Total | 343,529 | 6,289 |
| (DLF/MODS Portal Data Contributors) | | |
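For readers unfamiliar with the sampling terms, the following is a sketch of the textbook sample-size calculation for a proportion with a finite population correction, using a 95% confidence level (z = 1.96) and a confidence interval of 4 (margin 0.04). The function name and defaults are assumptions, and the sketch is not claimed to reproduce the per-repository extraction counts in the table, which may reflect sampling decisions not described here.

```python
import math


def sample_size(population, z=1.96, margin=0.04, p=0.5):
    """Sample size for estimating a proportion, with finite population correction."""
    if population < 1000:
        # Small collections: every record is extracted.
        return population
    # Sample size for an effectively infinite population.
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    # Shrink the sample for a finite population of known size.
    return math.ceil(n0 / (1 + (n0 - 1) / population))


for size in (500, 14_425, 292_000):
    print(size, sample_size(size))
```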
Tests
SQL queries to test full specification (including required subelements, attribute/value pairs, and any element content requirements) of the nine requirements (and required if applicable) specified by DLF/Aquifer MODS guidelines.
General result categories:
- Required element (and required subelements and attribute/value pairs) present and not empty
- Compile element/attribute values and subelements
- Group/count values and subelements
Further heuristic testing may give deeper insight into quality issues.
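The kind of query involved can be sketched as follows, using an in-memory SQLite database. The table and column names are hypothetical (the study names only repository, record, and element tables), and the sample rows are invented. The query counts records that have a <titleInfo> element with a non-empty <title> subelement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE repository (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE record (
    id INTEGER PRIMARY KEY,
    repository_id INTEGER REFERENCES repository(id)
);
CREATE TABLE element (
    id INTEGER PRIMARY KEY,
    record_id INTEGER REFERENCES record(id),
    parent_id INTEGER REFERENCES element(id),  -- NULL for top-level elements
    name TEXT,
    value TEXT
);
INSERT INTO repository VALUES (1, 'Example repository');
INSERT INTO record VALUES (1, 1), (2, 1), (3, 1);
INSERT INTO element VALUES
    (1, 1, NULL, 'titleInfo', NULL), (2, 1, 1, 'title', 'First title'),
    (3, 2, NULL, 'titleInfo', NULL), (4, 2, 3, 'title', ''),
    (5, 3, NULL, 'typeOfResource', 'text');
""")

# Records with a <titleInfo> that contains a non-empty <title>.
CONFORMING_TITLES = """
SELECT COUNT(DISTINCT parent.record_id)
FROM element AS parent
JOIN element AS child ON child.parent_id = parent.id
WHERE parent.name = 'titleInfo'
  AND child.name = 'title'
  AND TRIM(COALESCE(child.value, '')) <> ''
"""

conforming = conn.execute(CONFORMING_TITLES).fetchone()[0]
total = conn.execute("SELECT COUNT(*) FROM record").fetchone()[0]
print(f"{conforming} of {total} records conform ({100 * conforming / total:.2f}%)")
```

In this invented data, record 2 has a <title> but it is empty, and record 3 has no <titleInfo> at all, so only record 1 counts as conforming.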
Summary of Results
| Required Element | Percentage of Records with Element |
|---|---|
| <titleInfo> | 99.89% |
| <typeOfResource> | 84.57% |
| <originInfo> | 30.56% |
| <language> | 5.82% |
| <physicalDescription> | 27.51% |
| <subject> | 83.54% |
| <location> | 17.97% |
| <accessCondition> | 37.89% |
| <recordInfo> | 18.56% |
| (All required subelements, attribute/value pairs, and valid content must be present to satisfy requirement.) | |
Metadata Deficiencies
- Missing data - Not there
- Incorrect data - Invalid content
- Confusing data - Ambiguous content
- Insufficient data - No controlled vocabularies declared
Resolution
- What is needed?
- Who should do it (data provider, service provider)?
Results & Discussion
<titleInfo><title>
Requirements
- At least one <titleInfo> element that includes one <title> subelement
- Recommend use of <title> subelements
- Example
- 99.89% conformance (1.55 titles per record)
Conformance Issues
- <title> repeated within records (Example)
- <title> values repeated in collections (399 repeated values in test set)
- Non-use of <nonSort>
Conformance Solutions
- Service providers normalize <nonSort>
- Service providers analyze use of "type" attribute and subelements
- Data providers differentiate titles if needed
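One way a service provider might normalize <nonSort> is sketched below. The article list is a short English-only assumption, keeping the trailing space inside <nonSort> is a choice made here, and the MODS namespace is omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Assumption: a short English-only article list; real data would need
# language-aware handling.
ARTICLES = ("The ", "An ", "A ")


def add_non_sort(title_info):
    """Move a leading article from <title> into a new <nonSort> subelement."""
    title = title_info.find("title")
    # Leave the record alone if there is no title or nonSort is already used.
    if title is None or title_info.find("nonSort") is not None:
        return title_info
    text = title.text or ""
    for article in ARTICLES:
        if text.startswith(article) and len(text) > len(article):
            non_sort = ET.Element("nonSort")
            non_sort.text = article  # keeps the trailing space
            title_info.insert(0, non_sort)
            title.text = text[len(article):]
            break
    return title_info


title_info = ET.fromstring(
    "<titleInfo><title>The Civil War diary</title></titleInfo>"
)
print(ET.tostring(add_non_sort(title_info), encoding="unicode"))
```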
<typeOfResource>
Requirements
- At least one <typeOfResource> element
- Content must be one of eleven valid <typeOfResource> values
- Example
- 84.57% conformance
Conformance Issues
- Elements missing (only from single collection)
Conformance Solutions
- Difficult to generate algorithmically
- Only missing from a single collection; may be best for data provider to remediate
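Although missing values are hard to generate, checking existing ones is straightforward. The sketch below classifies a record's <typeOfResource> values against the eleven values enumerated in MODS 3.2 as the editor understands them; verify the list against the schema before relying on it. The function name is hypothetical.

```python
# The eleven <typeOfResource> values in MODS 3.2 (verify against the schema).
VALID_TYPES = frozenset({
    "text",
    "cartographic",
    "notated music",
    "sound recording",
    "sound recording-musical",
    "sound recording-nonmusical",
    "still image",
    "moving image",
    "three dimensional object",
    "software, multimedia",
    "mixed material",
})


def check_type_of_resource(values):
    """Classify a record's <typeOfResource> values as ok, missing, or invalid."""
    if not values:
        return "missing"
    if all(value.strip() in VALID_TYPES for value in values):
        return "ok"
    return "invalid"


print(check_type_of_resource(["still image"]))
print(check_type_of_resource([]))
print(check_type_of_resource(["Photograph"]))
```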
<originInfo>
Requirements
- At least one <originInfo> element
- One <originInfo> element must include a recommended date-related subelement carrying a keyDate="yes" attribute/value pair
- Recommended date-related subelements include <dateIssued>, <dateCreated>, <copyrightDate>, and <dateOther>
- Example
- 30.56% conformance
Conformance Issues
- Use of required date-related subelements high (99.35%)
- keyDate="yes" attribute/value pairs missing or repeated
- Encoding declared only 66% of the time (sometimes with a non-allowed scheme)
- Invalid date content
- Single dates ("06/27/63", "Fall 1959", "184u", "[c1914]")
- Date ranges ("[between 1861 and 1865]", "1907-1908-1908")
| Schema | Occurrences |
|---|---|
| iso8601 | 970 |
| marc | 2288 |
| w3cdtf | 3296 |
| (MARC not recommended) | |
Conformance Solutions
- Service providers can attempt to normalize date values (see: CDL Date Normalization Utility)
- Service providers can add keyDate when only one date present
- Data providers must identify key dates when multiple or no dates are present
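The "safe" case for service providers, adding keyDate when only one date is present, might be sketched as follows. The function name is hypothetical and the MODS namespace is omitted for brevity. Records with several dates, or none, are deliberately left untouched for the data provider to resolve.

```python
import xml.etree.ElementTree as ET

# Date-related subelements recommended by the guidelines.
DATE_ELEMENTS = ("dateIssued", "dateCreated", "copyrightDate", "dateOther")


def add_key_date(origin_info):
    """Set keyDate="yes" when exactly one recommended date subelement exists.

    Returns True if the element was changed, False otherwise.
    """
    dates = [el for el in origin_info if el.tag in DATE_ELEMENTS]
    if len(dates) != 1 or dates[0].get("keyDate") == "yes":
        return False
    dates[0].set("keyDate", "yes")
    return True


single = ET.fromstring(
    '<originInfo><dateIssued encoding="w3cdtf">1914</dateIssued></originInfo>'
)
multiple = ET.fromstring(
    "<originInfo><dateCreated>1861</dateCreated>"
    "<dateIssued>2005</dateIssued></originInfo>"
)

changed_single = add_key_date(single)      # one date: attribute is added
changed_multiple = add_key_date(multiple)  # ambiguous: left for the data provider
print(ET.tostring(single, encoding="unicode"))
```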
<language>
Requirements
- At least one <language> element
- One <language> element must include a pair of <languageTerm> elements "representing the primary language of the text"
- One <languageTerm type="text"> — Content is from the MARC Code List for Languages
- One <languageTerm type="code" authority="iso639-2b"> — Content is a valid ISO 639-2 code
- Example
- 5.82% conformance
Conformance Issues
- Missing <languageTerm type="text"> element/content
Conformance Solutions
- Service providers can map <languageTerm type="code" authority="iso639-2b"> to <languageTerm type="text">
- Service providers can map from other schema (RFC 3066)
- Data providers must carefully evaluate if "language is primary to understanding the resource"
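Mapping a code to its text form can be sketched with a lookup table. Only four entries from the MARC Code List for Languages are included here for illustration, the function name is hypothetical, and the MODS namespace is omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Tiny illustrative subset of the MARC Code List for Languages.
CODE_TO_TEXT = {
    "eng": "English",
    "fre": "French",
    "ger": "German",
    "spa": "Spanish",
}


def add_language_text(language):
    """Add <languageTerm type="text"> derived from a single iso639-2b code.

    Returns True if a text term was added, False otherwise.
    """
    terms = language.findall("languageTerm")
    if any(term.get("type") == "text" for term in terms):
        return False  # text form already present
    codes = [
        term for term in terms
        if term.get("type") == "code" and term.get("authority") == "iso639-2b"
    ]
    if len(codes) != 1:
        return False  # nothing to map from, or ambiguous
    name = CODE_TO_TEXT.get((codes[0].text or "").strip())
    if name is None:
        return False  # code not in the lookup table
    text_term = ET.SubElement(language, "languageTerm", {"type": "text"})
    text_term.text = name
    return True


language = ET.fromstring(
    '<language><languageTerm type="code" authority="iso639-2b">fre'
    "</languageTerm></language>"
)
changed = add_language_text(language)
print(ET.tostring(language, encoding="unicode"))
```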
<physicalDescription>
Requirements
- One and only one <physicalDescription> element
- One and only one <physicalDescription><digitalOrigin> subelement — Content is either "born digital", "reformatted digital", "digitized microfilm" or "digitized other analog"
- At least one <physicalDescription><internetMediaType> subelement — Content is valid MIME Media Type
- Example
- 27.51% conformance
Conformance Issues
- <digitalOrigin> missing from more than 70% of records (either all or none in a collection)
- <internetMediaType> missing from almost half of records
- Content is generally valid controlled vocabulary terms
- Repeated <physicalDescription> (Example)
- Repeated <internetMediaType>
Conformance Solutions
- Difficult for service providers to generate content
- Can normalize repeated elements
- Data providers should consistently provide <physicalDescription> data
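Normalizing repeated elements could look like the sketch below, which folds repeated <physicalDescription> elements into the first one and drops exact duplicate subelements. The function name is hypothetical and the MODS namespace is omitted for brevity. Conflicting values, such as two different <digitalOrigin> values, are kept and would still need review by the data provider.

```python
import xml.etree.ElementTree as ET


def merge_physical_descriptions(mods):
    """Fold repeated <physicalDescription> elements into the first one."""
    descriptions = mods.findall("physicalDescription")
    if len(descriptions) < 2:
        return mods
    first = descriptions[0]
    # Track (tag, text) pairs already present so exact duplicates are dropped.
    seen = {(child.tag, (child.text or "").strip()) for child in first}
    for extra in descriptions[1:]:
        for child in list(extra):
            key = (child.tag, (child.text or "").strip())
            if key not in seen:
                seen.add(key)
                first.append(child)
        mods.remove(extra)
    return mods


mods = ET.fromstring(
    "<mods>"
    "<physicalDescription>"
    "<internetMediaType>image/jpeg</internetMediaType>"
    "</physicalDescription>"
    "<physicalDescription>"
    "<internetMediaType>image/jpeg</internetMediaType>"
    "<digitalOrigin>reformatted digital</digitalOrigin>"
    "</physicalDescription>"
    "</mods>"
)
merge_physical_descriptions(mods)
print(ET.tostring(mods, encoding="unicode"))
```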
<subject>
Requirements
- At least one <subject> element
- Recommended to declare controlled vocabularies using "authority" attribute
- Subject content should be valid term from controlled vocabulary
- Use of <subject> subelements recommended
- Example
- 83.54% conformance (3.89 subjects per record)
Conformance Issues
- 16% of records missing <subject>
- Authorities declared for 67% of <subject> elements
- Variation in use of subject subelements (1.75 - 17.26 subelements per record)
| Authority | Occurrences |
|---|---|
| lcsh | 6392 |
| local | 5272 |
| lctgm | 240 |
| | 74 |
| rvm | 73 |
| mesh | 45 |
| lcshac | 1 |
| GNIS | 1 |
| Total: | 12098 |
| Subelement | Occurrences |
|---|---|
| topic | 16979 |
| geographic | 2717 |
| name | 1607 |
| temporal | 1210 |
| hierarchicalGeographic | 941 |
| geographicCode | 359 |
| genre | 335 |
| cartographics | 43 |
| titleInfo | 42 |
| Total: | 24233 |
Conformance Solutions
- Is subject content not applicable in records missing <subject>?
- Further evaluate use of subelements
- Significant use of "local" vocabularies (25% of all subject elements)
<location>
Requirements
- At least one <location> element with <url> subelement
- One and only one <location><url> with usage="primary display" attribute/value pair
- Content should follow URI specifications in DLF/NSDL Best Practices for Shareable Metadata
- Example
- 17.97% conformance
Conformance Issues
- Missing usage="primary display" attribute/value pairs
Conformance Solutions
- Service providers can generate missing attribute/value when there is a single <location><url> (records average 1.14 <location><url> elements per record)
- Data providers must disambiguate and generate content when there are multiple or no <location><url> elements
- Are primary display locations "object in context"?
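Generating the missing attribute in the single-URL case might be sketched as follows. The function name and the example URLs are hypothetical, and the MODS namespace is omitted for brevity. Records with several <location><url> elements, or none, are left unchanged because only the data provider can say which URL is the primary display.

```python
import xml.etree.ElementTree as ET


def add_primary_display(mods):
    """Add usage="primary display" when a record has exactly one <location><url>.

    Returns True if the attribute was added, False otherwise.
    """
    urls = mods.findall("location/url")
    if len(urls) != 1 or urls[0].get("usage") is not None:
        return False
    urls[0].set("usage", "primary display")
    return True


# Hypothetical URLs, used only for illustration.
one_url = ET.fromstring(
    "<mods><location><url>https://repository.example.org/item/1</url>"
    "</location></mods>"
)
two_urls = ET.fromstring(
    "<mods><location>"
    "<url>https://repository.example.org/item/2</url>"
    "<url>https://repository.example.org/item/2/thumbnail</url>"
    "</location></mods>"
)

changed_one = add_primary_display(one_url)
changed_two = add_primary_display(two_urls)
print(ET.tostring(one_url, encoding="unicode"))
```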
<accessCondition>
Requirements
- Use of at least one <accessCondition> element with type="useAndReproduction" attribute/value pair
- Free of technical jargon
- Explicitly state lack of copyright restrictions when applicable
- Contact information
- URL (and state purpose of URL)
- Consider using standard licenses (CC, ODRL)
- Example
- 37.89% conformance
Conformance Issues
- Missing element
- Custom statement in all cases (no use of standard licenses)
Conformance Solutions
- Single statements apply to whole collections or sets -> Service providers work with data providers to remediate
- Data providers may want to evaluate conformance to recommendations
<recordInfo>
Requirements
- Use of one and only one <recordInfo> element with one and only one <languageOfCataloging> subelement
- <languageOfCataloging> requires valid <languageTerm> element pair (see <language> requirement)
- Example
- 18.56% conformance
Conformance Issues
- Missing content
Conformance Solutions
- Difficult for service providers to accurately generate
- Single statements apply to whole collections or sets -> Service providers work with data providers to remediate
Conclusion
- Conformance issues often solvable by manageable changes in service provider or data provider practices or both.
- Further evaluation of conformance to recommendations and level of quality (<title>, <subject>)
- Evaluate how requirements mesh with MODS Guidelines Levels of Adoption (Meeting MODS Guidelines requirements only meets Level 2 requirements - <name>)