
Research Data Management

What is Metadata?

Metadata is structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource. Metadata is often called data about data or information about information (NISO, Understanding Metadata, 2004, p. 1).

Metadata records information about data (e.g. bibliographic or scientific) that has been collected. It is essential for enabling the use and reuse of data and for ensuring that resources remain accessible and usable in the future.

You must have metadata in order to:

  • Find data from other researchers to support your research
  • Use the data that you do find
  • Help other professionals to find and use data from your research
  • Use your own data in the future when you may have forgotten details of the research

Types of Metadata

Metadata is commonly broken down into three main types: descriptive, structural, and administrative.

  • Descriptive metadata describes the object or data and gives the basic facts: who created it (i.e. authorship), title, keywords, and abstract.
  • Structural metadata describes the structure of an object including its components and how they are related.  It also describes the format, process, and inter-relatedness of objects. It can be used to facilitate navigation, or define the format or sequence of complex objects.
  • Administrative metadata describes the management of the object and may include preservation and rights management information, creation date, copyright permissions, required software, provenance (history), and file integrity checks.

Metadata supports discoverability, accessibility, attribution of ownership, and reuse, and documents an object's structure, by providing the necessary information about that object. This information is attached to the object, follows it throughout its lifecycle, and facilitates its use. The amount of metadata for any object varies with the metadata scheme used and with how much is known about the object. Accessibility and discoverability depend on high-quality metadata: the more metadata you have, and the better organized it is, the easier the object is to find. Users query databases for information and objects based on the metadata that exists for them. Searching by author, title, format, or a phrase in the description works only if the metadata record contains a value for each of those fields.
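The dependence of searching on populated fields can be illustrated with a minimal Python sketch. The records, field names, and the search function below are hypothetical and stand in for whatever database or repository you actually use.

```python
# Hypothetical metadata records; the third one has no "creator" value.
records = [
    {"title": "Stream temperature survey", "creator": "Lee, A.", "format": "text/csv"},
    {"title": "Soil core images", "creator": "Ortiz, M.", "format": "image/tiff"},
    {"title": "Untitled upload", "format": "text/csv"},
]


def search(records, field, value):
    """Return records whose `field` exists and contains `value` (case-insensitive)."""
    needle = value.lower()
    return [r for r in records if needle in r.get(field, "").lower()]


by_creator = search(records, "creator", "lee")
by_format = search(records, "format", "text/csv")
print(len(by_creator), len(by_format))
```

The record without a creator can still be found by format, but no search on the creator field will ever return it.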

Sample Metadata Standards

Adhering to metadata standards is crucial to successful data management and for future publishing and funding. Metadata standards guide the collection and structure of metadata so that data is collected, described, structured, and referred to consistently.

A sampling of these standards is provided below as an example:

Biology
Darwin Core
A body of standards, including a glossary of terms (in other contexts these might be called properties, elements, fields, columns, attributes, or concepts) intended to facilitate the sharing of information about biological diversity by providing reference definitions, examples, and commentaries.

Ecology
EML - Ecological Metadata Language
Ecological Metadata Language (EML) is a metadata specification developed specifically for the ecology discipline.

Agriculture
AgMES - Agricultural Metadata Element Set
A semantic standard for description, resource discovery, interoperability and data exchange for different types of agricultural information resources.

Climatology
CF (Climate and Forecast) Metadata Conventions
A standard for climate and forecast “use metadata” that aims both to distinguish quantities (such as physical description, units, or prior processing) and to locate the data in space–time.

Physical Science
CIF - Crystallographic Information Framework
An extensible standard file format and set of protocols for the exchange of crystallographic and related structured data.

Social Sciences & Humanities
DDI - Data Documentation Initiative
An international standard for describing data from the social, behavioral, and economic sciences. Expressed in XML, the DDI metadata specification supports the entire research data lifecycle.

General Research Data
DataCite Metadata Schema
A domain-agnostic list of core metadata properties chosen for the accurate and consistent identification of data for citation and retrieval purposes.

General
Dublin Core
A basic, domain-agnostic standard which can be easily understood and implemented, and as such is one of the best known and most widely used metadata standards.
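As a rough illustration of what a record in one of these standards can look like, the sketch below serializes a few Dublin Core elements to XML with Python's standard library. The namespace URI is the Dublin Core Element Set namespace; the wrapper element name "metadata", the function name, and the sample values are assumptions made for this example, not part of any standard.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields):
    """Serialize a dict of Dublin Core element names and values to an XML string."""
    root = ET.Element("metadata")  # hypothetical wrapper element
    for name, value in fields.items():
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


# Illustrative values only.
xml_record = dublin_core_record({
    "title": "Stream temperature survey",
    "creator": "Lee, A.",
    "date": "2021-04-03",
    "format": "text/csv",
})
print(xml_record)
```

Repositories usually define their own wrapper and required elements, so check the target system's documentation before adopting a layout like this one.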

Controlled Vocabularies

Controlled vocabularies are lists of predefined terms that ensure consistency of use and help to disambiguate similar concepts. It is usually a good idea to use the controlled vocabulary that best matches the type of research you are describing. For example, subject terms used in research about biometric sensing may be taken from a controlled vocabulary such as the Medical Subject Headings (MeSH). Other examples of controlled vocabularies include the ERIC Thesaurus for education terms, the IET Inspec Thesaurus of scientific and technical terms, and the CAB Thesaurus from CABI (the Centre for Agriculture and Bioscience International).

Controlled vocabularies are important because they solve the problems of natural language ambiguity such as homographs and synonyms.

They help take the guesswork out of choosing:

  • A preferred spelling
  • A scientific or a popular term
  • Which synonym to use

In short, controlled vocabularies ensure consistency and clarity.
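The idea can be sketched in a few lines of Python. The mini-vocabulary below is hypothetical and invented for illustration; a real project would load terms from a published vocabulary such as MeSH rather than hard-code them.

```python
# Hypothetical mini-vocabulary: lowercase variant -> preferred term.
PREFERRED_TERMS = {
    "gray wolf": "Canis lupus",
    "grey wolf": "Canis lupus",
    "canis lupus": "Canis lupus",
    "color": "Color",
    "colour": "Color",
}


def normalize_subject(term):
    """Return the preferred term for `term`, or None if it is not in the vocabulary."""
    key = " ".join(term.lower().split())  # ignore case and extra whitespace
    return PREFERRED_TERMS.get(key)


print(normalize_subject("Grey Wolf"))
print(normalize_subject("unicorn"))
```

Returning None for unknown terms, instead of passing them through, makes it easy to flag entries that need review by a person.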

Technical Standards

Technical standards ensure that values such as dates, times, and file formats are entered consistently by different researchers.

Dates and times are particularly troublesome to enter consistently because notations differ. Consequently, you may choose to use the World Wide Web Consortium Date and Time Formats (W3C-DTF), a profile of ISO 8601 that provides strict encoding rules for date information. This is important because different metadata standards may need different levels of granularity in the date and time, and because different communities express dates differently. By formatting your date elements according to this standard, you ensure that both machines and international colleagues can read them.
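A minimal Python sketch of this kind of normalization follows. The function name is hypothetical, and the sketch covers only two of the W3C-DTF granularities (complete date, and complete date with time to the second); it assumes you already know which notation the source value uses.

```python
from datetime import date, datetime, timezone


def to_w3cdtf(value):
    """Format a date or a time-zone-aware datetime as a W3C-DTF string."""
    if isinstance(value, datetime):
        if value.tzinfo is None:
            # W3C-DTF requires a time zone designator whenever a time is given.
            raise ValueError("datetime values must carry a time zone")
        return value.isoformat(timespec="seconds")
    return value.isoformat()


# "03/04/2021" is ambiguous: 3 April or March 4? Here it is known to be day/month/year.
ambiguous = "03/04/2021"
parsed = datetime.strptime(ambiguous, "%d/%m/%Y").date()

print(to_w3cdtf(parsed))  # 2021-04-03
print(to_w3cdtf(datetime(2021, 4, 3, 14, 30, tzinfo=timezone.utc)))  # 2021-04-03T14:30:00+00:00
```

The ambiguity has to be resolved by a person or by documentation of the source; once the value is stored in the standard form, no further guessing is needed.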

Media types can be problematic as well. The MIME media type registry helps you classify a file under one of the following top-level types: application, audio, example, image, message, model, multipart, text, video.
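Python's standard mimetypes module can suggest a media type from a file name, as in the sketch below. The helper name is hypothetical, and the guess relies on the file extension only, so treat the result as a starting point and not as proof of the file's actual contents.

```python
import mimetypes


def top_level_type(filename):
    """Return the top-level media type guessed from the file extension, or None."""
    media_type, _encoding = mimetypes.guess_type(filename)
    if media_type is None:
        return None
    return media_type.split("/", 1)[0]


for name in ("photo.jpg", "notes.txt", "index.html"):
    print(name, mimetypes.guess_type(name)[0], top_level_type(name))
```

Because the mapping can be extended by system configuration, results for less common extensions may differ between machines.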

Typically, the metadata standard you use will recommend which controlled vocabularies and technical standards to apply. There are standards and controlled vocabularies for almost every element you may wish to describe. The idea is that as long as you structure your data according to the defined standards, it will be consistent, discoverable, and reusable by other researchers. Where the recommendation is unclear or undefined, it may help to talk to a metadata specialist, who can advise and help with your documentation.

Metadata Elements

At this point, the number of metadata standards, controlled vocabularies, and technical standards available to you may seem daunting. It is important to remember that metadata standards are frequently designed for a specific purpose, which should dovetail with the controlled vocabularies and technical standards that best describe your data. Over time you will become more proficient in recognizing the metadata standard for your research community.

Nonetheless, some common elements are necessary to ensure that your data can be found and used by other researchers. The following list is taken from MIT's best practices for managing your data. These elements are necessary regardless of your discipline, and can be used as a general crib sheet if you are not using an established metadata standard.

Title: Name of the dataset or research project that produced it
Creator: Names and addresses of the organization or people who created the data
Identifier: Number used to identify the data, even if it is just an internal project reference number. This should always be a unique number.
Subject: Best practice is to use a controlled vocabulary to establish the appropriate keywords or phrases describing the subject or content of the data
Funders: Organizations or agencies who funded the research
Rights: Any known intellectual property rights held for the data
Access information: Where and how your data can be accessed by other researchers
Language: Best practice is to use a technical standard to indicate the language(s) of the intellectual content of the resource, when applicable
Dates: Best practice is to use a technical standard to indicate key dates associated with the data, including: project start and end date; release date; time period covered by the data; and other dates associated with the data lifespan, e.g., maintenance cycle, update schedule
Location: Where the data relates to a physical location, record information about its spatial coverage
Methodology: How the data was generated, including equipment or software used, experimental protocol, other things one might include in a lab notebook
Data processing: Along the way, record any information on how the data has been altered or processed
Sources: Citations to material for data derived from other sources, including details of where the source data is held and how it was accessed
List of file names: List of all data files associated with the project, with their names and file extensions (e.g. 'NWPalaceTR.WRL', 'stone.mov'). Best practice is to establish a file naming convention to ensure ease of discoverability
File formats: Format(s) of the data, e.g. FITS, SPSS, HTML, JPEG, and any software required to read the data
File structure: Organization of the data file(s) and the layout of the variables, when applicable
Variable list: List of variables in the data files, when applicable
Code lists: Explanation of codes or abbreviations used in either the file names or the variables in the data files (e.g. '999 indicates a missing value in the data')
Versions: Date/time stamp for each file, and use a separate ID for each version
Checksums: To test if your file has changed over time
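The checksum element can be produced with Python's standard hashlib module, as in the sketch below. The choice of SHA-256 and the function name are assumptions made for this example; use whichever algorithm your repository or metadata standard recommends.

```python
import hashlib


def sha256_checksum(path, chunk_size=65536):
    """Return the SHA-256 hex digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large data files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Record the digest alongside the file when you deposit the data, then recompute it later: if the two values differ, the file has changed.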

Creating Metadata

Metadata is created by manual entry, automatic extraction, or a combination of both. Manual creation occurs when you enter information about your resource into a template, a table, a spreadsheet, or some other data entry interface. Manually created metadata is typically descriptive in nature. Automatic creation occurs when information about a resource is extracted from the resource itself. This type of metadata is generally technical in nature. Decisions about who will produce the metadata and which methods will be used should be part of your overall project plan. What follows are some general considerations to help you decide how to manage metadata creation (adapted from UW Core Metadata Companion).

Here are some best practices as you prepare to create your own metadata to describe your content.

  1. Consistent data entry is important.  Review your metadata for typos, extraneous punctuation, and any inconsistencies in fielded entry, such as putting an author into a title field.
  2. Avoid extraneous punctuation as it can create retrieval issues.
  3. Avoid most abbreviations. It is fine to use common or accepted abbreviations (such as "cm" for "centimeters") as long as you document the convention and apply it consistently.
  4. In general, capitalize the first word (of a title, for example) and proper names (place, personal and corporate names) and subject terms only. Capitalize content in the description field according to normal rules of writing. Do not enter content in all caps except in the case of acronyms.
  5. Use templates and macros when possible.  It may be that certain data elements will always be the same.  In those cases try to automate the entry as it cuts down on errors.
  6. Extract pre-existing metadata from your sources whenever possible. Information about pictures and Word documents can be embedded within the resource itself and extracted for quick population of templates.
  7. Keep a data dictionary of the elements, technical standards, and controlled vocabularies you use in your project.
  8. Always use an established metadata standard. Your discipline probably already has a best practices metadata standard specific to your research needs.
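Automatic extraction of technical metadata can be as simple as reading what the file system already knows about a file. The sketch below uses only Python's standard library; the function name and the keys of the returned dictionary are hypothetical and would need to be mapped onto the elements of your chosen metadata standard.

```python
import mimetypes
from datetime import datetime, timezone
from pathlib import Path


def extract_technical_metadata(path):
    """Collect basic technical metadata for one file from the file system."""
    p = Path(path)
    info = p.stat()
    media_type, _encoding = mimetypes.guess_type(p.name)
    modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
    return {
        "file_name": p.name,
        "size_bytes": info.st_size,
        "modified": modified.isoformat(timespec="seconds"),  # UTC timestamp
        "media_type": media_type,  # None if the extension is unknown
    }
```

Descriptive elements such as title, creator, and subject still need manual entry; a function like this only fills in the fields a machine can determine reliably.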
