LibGuides: Research Data Management: Types of Data

What is Research Data?

Data is often thought of in quantitative terms. Much research data is indeed quantitative, but 'number' and 'data' are not synonymous. Data is typically considered in two broad categories: quantitative and qualitative.

Quantitative data can include experimental measurements, e.g. lab instrument data, sensor readings, survey results, and test/simulation models. Qualitative data can include text, audio, images, and video. Some definitions of data are quite broad, and include objects such as laboratory specimens.

Types of Data

Some types of research data and data files are fairly ubiquitous and can be found distributed across disciplines:

Images
Video
Mapping/GIS data
Numerical measurements

Research data in Social Sciences can include:

Survey responses
Focus group and individual interviews
Economic indicators
Demographics
Opinion polling

Research data in Hard Sciences can include:

Measurements generated by sensors/laboratory instruments
Computer modeling
Simulations
Observations and/or field studies
Specimens

Research Data Life Cycle

Stages of Data Related to Research Data Life Cycle

Raw Data: What is being measured or observed? The data being generated during the research project.
Processed Data: Making the raw data useful/manipulable
Analyzed Data: Manipulated/interpreted data. What does the data tell us? Is it significant? How so?
Finalized/Published Data: How do the data support your research question?
Existing Data across Different Sources: e.g. GIS data

Data Formats

These file format characteristics offer the best chances for long-term access:

Non-proprietary
- Non-proprietary or open formats are readable by more than just the equipment and/or program that generated them. Sometimes proprietary file formats are unavoidable. However, proprietary formats can often be converted to open formats. See the Format Conversion section below for more on this topic.
Unencrypted / uncompressed
- Unencrypted and uncompressed files offer the best prospects for long-term access. If files are encrypted and/or compressed, the method of encryption/compression used will need to be both discoverable and usable for file access in the future.

Here are some examples of preferred formats for various data types (from https://lib.stanford.edu/data-management-services/file-formats):

Containers: TAR, GZIP, ZIP
Databases: XML, CSV
Geospatial: SHP, DBF, GeoTIFF, NetCDF
Moving Images: MOV, MPEG, AVI, MXF
Audio: WAVE, AIFF, MP3, MXF
Numbers/statistics: ASCII, DTA, POR, SAS, SAV
Images: TIFF, JPEG 2000, PDF, PNG, GIF, BMP
Text: PDF/A, HTML, ASCII, XML, UTF-8
Web Archive: WARC
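
As a rough illustration of how a list like the one above might be applied, the following Python sketch flags files whose extensions fall outside a preferred set. The extension mapping in PREFERRED_EXTENSIONS is an assumption (the list above names formats, not file extensions) and covers only part of the list; the function name flag_non_preferred is hypothetical.

```python
from pathlib import PurePath

# Assumed mapping from file extensions to some of the preferred formats above.
# The guide lists format names only; adjust this set for your own discipline.
PREFERRED_EXTENSIONS = {
    ".tar", ".gz", ".zip",                    # containers
    ".xml", ".csv",                           # databases
    ".shp", ".dbf", ".tif", ".tiff", ".nc",   # geospatial
    ".mov", ".mpg", ".mpeg", ".avi", ".mxf",  # moving images
    ".wav", ".aif", ".aiff", ".mp3",          # audio
    ".png", ".gif", ".bmp", ".jp2", ".pdf",   # images
    ".html", ".txt",                          # text
    ".warc",                                  # web archive
}


def flag_non_preferred(filenames):
    """Return the file names whose extension is not in the preferred set."""
    flagged = []
    for name in filenames:
        # Compare case-insensitively so "map.TIF" matches ".tif".
        suffix = PurePath(name).suffix.lower()
        if suffix not in PREFERRED_EXTENSIONS:
            flagged.append(name)
    return flagged
```

A flagged file is not necessarily a problem; it is only a prompt to consider whether an open equivalent exists.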

For a list of common file formats and evaluations of format quality and long-term sustainability see http://www.digitalpreservation.gov/formats/fdd/browse_list.shtml

Some archives specify the optimal data formats they use for long-term preservation of data.

Format Conversion

Proprietary systems and file formats can resist attempts at data integration, reuse, and sharing. These barriers can often be addressed by converting proprietary formats to open formats. The protocols and solutions for doing so can be discipline-specific (e.g. https://docs.openmicroscopy.org/bio-formats/6.9.0/supported-formats.html), but some general guidelines apply.

Information can be lost when converting file formats. When data is converted from one format to another - through export or by using data translation software - certain changes may occur to the data:

For data held in statistical packages, spreadsheets or databases, some data or internal metadata such as missing value definitions, decimal numbers, formulae or variable labels may be lost during conversion to another format, or data may be truncated
For textual data, editing such as highlighting, bold text or headers/footers may be lost

After conversion, check the data for errors or changes.
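
One way to carry out such a check on tabular data is sketched below in Python. It assumes both the original and the converted data have already been loaded as lists of row dictionaries (for example with csv.DictReader); the function name compare_tables and the sample values are hypothetical.

```python
def compare_tables(original_rows, converted_rows):
    """Compare two lists of row dictionaries and describe any differences."""
    problems = []
    if len(original_rows) != len(converted_rows):
        problems.append(
            f"row count changed: {len(original_rows)} -> {len(converted_rows)}"
        )
    # Compare rows pairwise, up to the length of the shorter table.
    for index, (before, after) in enumerate(zip(original_rows, converted_rows)):
        for column in before:
            if column not in after:
                problems.append(f"row {index}: column '{column}' is missing")
            elif before[column] != after[column]:
                problems.append(
                    f"row {index}: '{column}' changed from "
                    f"{before[column]!r} to {after[column]!r}"
                )
    return problems


# Hypothetical example: a decimal place and a text value were truncated.
original = [{"id": "1", "ph": "6.85", "site": "North field"}]
converted = [{"id": "1", "ph": "6.8", "site": "North fiel"}]
for problem in compare_tables(original, converted):
    print(problem)
```

An empty result means only that the values compared are identical; internal metadata such as variable labels would need a separate check.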

To mitigate the risk of lost information:

Note conversion steps taken
If possible, keep the original file as well as the converted one
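
The two steps above could be supported by a small conversion log. The Python sketch below is one possible approach, not a prescribed method: it leaves the original file untouched and records checksums and the steps taken in a JSON file. The function names and the default log file name conversion_log.json are assumptions.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 checksum of a file as a hex string."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def log_conversion(original, converted, steps, log_path="conversion_log.json"):
    """Append a record of one conversion to a JSON log file."""
    log_file = Path(log_path)
    # Load earlier entries if the log already exists.
    entries = json.loads(log_file.read_text()) if log_file.exists() else []
    entries.append({
        "original_file": str(original),
        "original_sha256": sha256_of(original),
        "converted_file": str(converted),
        "converted_sha256": sha256_of(converted),
        "steps": list(steps),
        "logged_at": datetime.now(timezone.utc).isoformat(),
    })
    log_file.write_text(json.dumps(entries, indent=2))
    return entries[-1]
```

The checksums make it possible to confirm later that neither the original nor the converted file has changed since the conversion was logged.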

Data Documentation

Data documentation explains the who/what/where/when/why of data:

Who collected this data? Who/what were the subjects under study?
What was collected, and for what purpose? What is the content/structure of the data?
Where was this data collected? What were the experimental conditions?
When was this data collected? Is it part of a series, or ongoing experiment?
Why was this experiment performed?

Good data documentation helps you, the researcher. Clear documentation makes it easier to interpret your findings later, helps facilitate collaboration, sharing, and reuse, and can also help ensure successful long-term preservation of your research findings.

Data documentation practices vary by discipline. Methods include lab notebooks, data dictionaries and codebooks in the social sciences, and well-documented/commented code for computer science (or really for any project that uses code and/or scripting).

While formats and methods for documentation differ, the general idea is always to describe:

What the data is
When the data was collected
Where it was collected
How it was collected
Notes about the data characteristics (including file formats/potential format conversions), and
Any pertinent notes about experimental conditions.
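
As a minimal sketch of how these points might be captured consistently, the Python function below assembles a plain-text documentation file from the six items in the list. The function name build_readme, the field labels, and the NOT RECORDED placeholder are illustrative assumptions, not a standard.

```python
def build_readme(what, when, where, how, characteristics, conditions):
    """Return plain-text documentation covering the six points listed above."""
    fields = [
        ("What the data is", what),
        ("When it was collected", when),
        ("Where it was collected", where),
        ("How it was collected", how),
        ("Data characteristics and formats", characteristics),
        ("Experimental conditions", conditions),
    ]
    lines = []
    for label, value in fields:
        # Flag gaps explicitly instead of silently omitting them.
        lines.append(f"{label}: {value.strip() or 'NOT RECORDED'}")
    return "\n".join(lines) + "\n"
```

Requiring every field, and marking empty ones, makes missing documentation visible while the details can still be recovered.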

Note that for collaborative research projects, it is important for the project team to agree on documentation practices so that all members document data consistently.

Naming Conventions

No matter what, you need to have:

File naming conventions
Version control

Why Use File Naming Conventions?

Naming conventions make life easier. They:

Help you find your data
Help others find your data
Help track which version of a file is most current

What File Naming Convention Should I Use?

Has your research group established a convention?

If not, general guidelines include:

Meaningful file names that aren’t too long
Avoid spaces and special characters such as / \ : * ? " < > |, which some operating systems and programs cannot handle
Dates can help with sorting and version control
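
The guidelines above can be sketched as a small Python helper. The particular pattern used here (a leading YYYYMMDD date, hyphenated lowercase words, and a zero-padded version suffix) is only one possible convention, chosen for illustration; the function name make_filename is hypothetical.

```python
import re
from datetime import date


def make_filename(project, description, collected, version, extension):
    """Build a name like 20240305_soil-survey_ph-readings_v02.csv."""
    def clean(text):
        # Replace runs of anything other than letters and digits with a hyphen.
        return re.sub(r"[^A-Za-z0-9]+", "-", text).strip("-").lower()

    parts = [
        collected.strftime("%Y%m%d"),  # a leading date sorts chronologically
        clean(project),
        clean(description),
        f"v{version:02d}",             # zero-padded so v02 sorts before v10
    ]
    return "_".join(parts) + "." + extension.lstrip(".").lower()


print(make_filename("Soil Survey", "pH readings!", date(2024, 3, 5), 2, ".CSV"))
```

Generating names with a shared helper, rather than typing them by hand, is one way for a research group to keep to whatever convention it has agreed on.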

License and Attributions

Much of this guide was adapted from the New England Collaborative Data Management Curriculum.

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.