Research Data Management

What is Research Data?

Data is often thought of in quantitative terms. Much research data is indeed quantitative, but 'number' and 'data' are not synonymous. Data is typically considered in two broad categories: quantitative and qualitative.

Quantitative data can include experimental measurements, e.g. lab instrument data, sensor readings, survey results, and test/simulation models. Qualitative data can include text, audio, images, and video. Some definitions of data are quite broad, and include objects such as laboratory specimens.

Types of Data

Some types of research data and data files are common across disciplines:

  • Images
  • Video
  • Mapping/GIS data
  • Numerical measurements

Research data in Social Sciences can include:

  • Survey responses
  • Focus group and individual interviews
  • Economic indicators
  • Demographics
  • Opinion polling

Research data in Hard Sciences can include:

  • Measurements generated by sensors/laboratory instruments
  • Computer modeling
  • Simulations
  • Observations and/or field studies
  • Specimens

Research Data Life Cycle

Stages of Data Related to Research Data Life Cycle

  • Raw Data: What is being measured or observed?  The data being generated during the research project.
  • Processed Data: Raw data that has been made useful/manipulable
  • Analyzed Data: Manipulated/interpreted data. What does the data tell us?  Is it significant?  How so?
  • Finalized/Published Data: How do the data support your research question?
  • Existing Data across Different Sources: e.g. GIS data

Data Formats

These file format characteristics ensure the best chances for long-term access:

  • Non-proprietary
    • Non-proprietary or open formats are readable by more than just the equipment and/or program that generated them.  Sometimes proprietary file formats are unavoidable.  However, proprietary formats can often be converted to open formats.  Please see the Format Conversion section below for more on this topic.
  • Unencrypted / uncompressed
    • Unencrypted and uncompressed files offer the best prospects for long-term access.  If files are encrypted and/or compressed, the method of encryption/compression used will need to be both discoverable and usable for file access in the future. 

Here are some examples of preferred formats for various data types (from https://lib.stanford.edu/data-management-services/file-formats):

  • Containers: TAR, GZIP, ZIP
  • Databases: XML, CSV
  • Geospatial: SHP, DBF, GeoTIFF, NetCDF
  • Moving Images: MOV, MPEG, AVI, MXF
  • Audio: WAVE, AIFF, MP3, MXF
  • Numbers/statistics: ASCII, DTA, POR, SAS, SAV
  • Images: TIFF, JPEG 2000, PDF, PNG, GIF, BMP
  • Text: PDF/A, HTML, ASCII, XML, UTF-8
  • Web Archive: WARC

For a list of common file formats and evaluations of format quality and long-term sustainability see http://www.digitalpreservation.gov/formats/fdd/browse_list.shtml

Some archives specify the optimal data formats they use for long-term preservation of data.

Format Conversion

Proprietary systems and file formats can resist attempts at data integration, reuse, and sharing. These barriers can often be addressed by converting proprietary formats to open formats. The protocols and solutions for doing so can be discipline-specific (e.g. https://docs.openmicroscopy.org/bio-formats/6.9.0/supported-formats.html), but some general guidelines apply.

Information can be lost when converting file formats. When data is converted from one format to another - through export or by using data translation software - certain changes may occur to the data:

  • For data held in statistical packages, spreadsheets or databases, some data or internal metadata such as missing value definitions, decimal numbers, formulae or variable labels may be lost during conversion to another format, or data may be truncated
  • For textual data, editing such as highlighting, bold text or headers/footers may be lost

After conversion, data should be checked for errors or changes.

To mitigate the risk of lost information:

  • Note conversion steps taken
  • If possible, keep the original file as well as the converted one

Data Documentation

Data documentation explains the who/what/where/when/why of data:

  • Who collected this data?  Who/what were the subjects under study?
  • What was collected, and for what purpose?  What is the content/structure of the data?
  • Where was this data collected?  What were the experimental conditions?
  • When was this data collected?  Is it part of a series, or ongoing experiment?
  • Why was this experiment performed?

Good data documentation helps you, the researcher. Clear documentation makes it easier to interpret your findings later, helps facilitate collaboration, sharing, and reuse, and can also help ensure successful long-term preservation of your research findings.

Data documentation practices vary by discipline. Common methods include lab notebooks, data dictionaries and codebooks in the social sciences, and well-documented/commented code in computer science (or in any project that uses code and/or scripting).

While formats and methods for documentation differ, the general idea is always to describe:

  • What the data is
  • When the data was collected
  • Where it was collected
  • How it was collected
  • Notes about the data characteristics (including file formats/potential format conversions), and
  • Any pertinent notes about experimental conditions.

Note that for collaborative research projects, it’s important for the members of the project team to agree on documentation practices so that everyone documents data consistently.
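
One way to keep documentation consistent is to generate it from a fixed set of fields. The sketch below is a hedged Python illustration, not a standard: the field names in REQUIRED_FIELDS, the build_readme function, and the example record are all assumptions that mirror the list above. It refuses to produce a README if any item is left blank.

```python
# Illustrative field names based on the list above; not a standard schema.
REQUIRED_FIELDS = [
    "what", "when", "where", "how", "characteristics", "conditions",
]


def build_readme(record):
    """Return plain-text documentation, or raise if a field is missing."""
    missing = [field for field in REQUIRED_FIELDS if not record.get(field)]
    if missing:
        raise ValueError("documentation is missing: " + ", ".join(missing))
    lines = [f"{field.upper()}: {record[field]}" for field in REQUIRED_FIELDS]
    return "\n".join(lines) + "\n"


# A made-up example record.
example = {
    "what": "Hourly soil temperature readings",
    "when": "2024-03-01 to 2024-03-31",
    "where": "Field site 3",
    "how": "Buried temperature probe, logged automatically",
    "characteristics": "CSV, UTF-8, converted from the logger's export format",
    "conditions": "Probe replaced on 2024-03-14 after a fault",
}

readme_text = build_readme(example)
print(readme_text)
```

Saving the resulting text as a README file next to the data keeps the description and the data together.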

Naming Conventions

No matter what, you need to have:

  • File naming conventions
  • Version control

Why Use File Naming Conventions?

Naming conventions make life easier.

  • Help you find your data    
  • Help others find your data
  • Help track which version of a file is most current

What File Naming Convention Should I Use?

Has your research group established a convention?

If not, general guidelines include:

  • Meaningful file names that aren’t too long
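  • Avoid spaces and special characters (e.g. / \ : * ? " < > |), which can cause problems on some operating systems and in some software
  • Dates can help with sorting and version control, especially when written year-first (e.g. YYYY-MM-DD)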
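
As an illustration of these guidelines, here is a minimal Python sketch of one possible convention: date, project, description, and version, joined by underscores. The pattern, the make_file_name function, and the example values are assumptions made for this sketch. If your research group already has a convention, use that instead.

```python
import re
from datetime import date


def make_file_name(project, description, when, version, extension):
    """Build a name like 2024-03-05_project_description_v02.csv."""

    def clean(part):
        # Lowercase, then collapse anything that is not a letter or
        # digit into a single hyphen.
        part = re.sub(r"[^a-z0-9]+", "-", part.strip().lower())
        return part.strip("-")

    stamp = when.isoformat()  # year-first dates sort chronologically
    suffix = extension.lstrip(".").lower()
    return (
        f"{stamp}_{clean(project)}_{clean(description)}"
        f"_v{version:02d}.{suffix}"
    )


older = make_file_name(
    "Soil Survey", "pH readings: site #3", date(2023, 11, 20), 1, "csv"
)
newer = make_file_name(
    "Soil Survey", "pH readings: site #3", date(2024, 3, 5), 2, ".CSV"
)

print(older)
print(newer)
# Sorting the names alphabetically also sorts them by date.
print(sorted([newer, older]))
```

Because the date comes first and is written year-first, an ordinary alphabetical sort in a file browser lists the files in chronological order.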
