Research Data Management

Over the course of a research project, scholars collect data to analyze, write about, and discuss. This guide provides an overview of concepts to consider when managing that research data, and when preserving and sharing it for the long term, to enhance scholarship and fulfill funder requirements.

The content of this guide draws significantly from the New England Collaborative Data Management Curriculum.

What is Research Data?

There are a number of definitions for ‘research data’.  Here are two examples of commonly cited definitions.

“Research data, unlike other types of information, is collected, observed, or created, for purposes of analysis to produce original research results” (University of Edinburgh).

“The recorded factual material commonly accepted in the research community as necessary to validate research findings” (Excerpted from OMB Circular A-110 36.d.2.i).

Types of Research Data

  • Observational
    • Data captured in real time, usually irreplaceable
    • Examples include sensor readings, telescope images, sample data
  • Experimental
    • Data from lab equipment, often reproducible but can be expensive
    • Examples include gene sequences, chromatograms
  • Simulation
    • Data generated from test models where models and metadata are more important than output data
    • Examples include climate models
  • Derived or Compiled Data
    • Data is reproducible but expensive
    • Examples include data mining, compiled databases, 3D models

Data covers a broad range of types of information:

  • Documents, spreadsheets
  • Laboratory notebooks, field notebooks, diaries
  • Questionnaires, transcripts, codebooks
  • Survey responses
  • Health indicators such as blood cell counts, vital signs
  • Audio and video recordings
  • Images, films
  • Protein or genetic sequences
  • Spectra
  • Test responses
  • Slides, artifacts, specimens, samples
  • Database contents (video, audio, text, images)
  • Models, algorithms, scripts
  • Software code, software for simulation
  • Methodologies and workflows
  • Standard operating procedures and protocols
  • Digital data can be structured and stored in a variety of file formats

Why Manage Research Data?

  • Transparency & Integrity
    • You may be required by a funder or publisher to maintain the data that underlies your published works and findings.
  • Compliance
    • Managing data is part of complying with the University's Institutional Review Board and with your funders' data sharing and data management policies. Funders like the NIH reserve the right to audit your lab notebooks and pre-publication data. Since 2011 the NSF has required a data management plan, and the federal government is working to make publicly funded research data available to the public. Starting in January 2023, the NIH requires all grants to have data management plans and, eventually, to share their data in an open access format.
  • Personal & Professional Benefits
    • Managing data saves you time and effort and avoids duplicated work: "good research data management = good research". You can easily find the data you need and make it available should you be asked. In recent years the NSF has wanted researchers to account for all scholarly products, including data, resulting from their funding. In addition, publishing your data can increase your citation impact and the discoverability of your research, and help with promotion and tenure.

Since 2013 the United States government has required the results of federally funded research, including data, to be made freely available to the public. Many non-government grant-funding agencies and a number of publishers, particularly open access publications, now have similar requirements.

This video by the NYU Health Science Library provides a humorous overview of some of the concerns and needs for proper data management:

Data Management Issues

There are serious issues surrounding data management.  Some of these challenges include managing the workflows of team science, getting everyone on the team to follow a plan, and making data management a priority.  Others stem from students and post-docs frequently rotating in and out of labs, from data being stored in multiple places, and, in some cases, from research team members and data being spread across the globe. Drs. Stephen Erickson and Karen M.T. Muskavitch (2013) list some examples of serious data management issues that they noted for improvement:

  • Technical data not recorded properly.  This occurs in research programs when the data are not recorded in accordance with the accepted standards of the particular academic field.  This is a very serious matter.  Should another researcher wish to replicate the research, improper recording of the original research would make any attempt to replicate the work questionable at best.  Also, should an allegation of misconduct arise concerning the research, having the data improperly recorded will greatly increase the likelihood that a finding of misconduct will be substantiated.
  • Technical data management not supervised by PI.  In this situation the principal investigator might inappropriately delegate his/her oversight responsibilities to someone in his/her lab that is insufficiently trained.  Another situation might arise if the principal investigator simply does not dedicate the appropriate time and effort to fulfill responsibilities related to proper data management.
  • Data not maintained at the institution.  This situation could occur in a collaboration in which all data is maintained by one collaborator.  It would be particularly problematical if each collaborator is working under a sponsored project in which their institutions are responsible for data management.  In other cases, researchers might maintain data in their homes, and this can also present problems of access.
  • Financial or administrative data not maintained properly.  This basically means that the information is not maintained in sufficient detail, is inaccurately recorded, or not maintained in identifiable files.  External auditors or reviewers would find these matters to be a serious breach of exercising appropriate responsibility regarding the proper stewardship of funds.
  • Data not stored properly.  This could occur with research, financial, and administrative data.  Careless storage of the data that could permit its being destroyed or made unusable is a significant matter.  In such case, the institution and/or researcher have acted negligently, have not fulfilled their stewardship duties, and have violated sponsor policies as well as the terms of the sponsored agreement.
  • Data not held in accordance with retention requirements.  As noted previously, it is absolutely essential that those involved with sponsored projects know how long different kinds of data must be retained to satisfy all compliance requirements as well as to offer appropriate support in the event of lawsuits or disputes over intellectual property.
  • Data not retained by the institution.  This is a major problem that would occur if a researcher leaves the institution and takes the original research data and does not leave a copy at the institution. In the event access is needed, it places the institution in an untenable position since it has not fulfilled its fiduciary responsibility to the sponsor.

 

Lack of Responsibility

Issues that come from the lack of responsibility for research data:

  • Challenges of Team Science
    • One of the greatest challenges in managing data is the distributed nature of modern research.  With so many responsibilities, it is easy to not prioritize data management.  By assigning data management tasks, you will increase the efficiency of your research.
  • Challenges Managing Laboratory Notebooks
    • Laboratory notebooks, paper and electronic, may be audited by the funder, such as NIH.  Managing and preserving these notebooks requires a plan.
  • Challenges with Rotating Lab Personnel
    • In many labs personnel are changing constantly.  There must be a plan to bridge the data management knowledge of new and outgoing students, post-docs, and staff.

Best Practices

Here are some best practices for outlining roles for managing data and laboratory notebooks.  Unless the distribution of responsibility is clear, misunderstandings can result and compliance can be jeopardized.

  • Define roles and assign responsibilities for data management
  • For each task identified in your data management plan, identify the skills needed to perform the task
  • Match skills needed to available staff and identify gaps
  • Develop training plans for continuity
  • Assign responsible parties and monitor results

Lack of a Data Management Plan

Many research funders require that you have a plan to manage and/or share your data.

These are some questions that are commonly addressed in a data management plan:

  • What types of data will be created?
  • Who will own, have access to, and be responsible for managing these data?
  • What equipment and methods will be used to capture and process data?
  • What metadata will make these data make sense to others?
  • Where will data be stored during and after the project?

Here is a simplified example of a data management plan:

  1. Types of data
    1. What types of data will you be creating or capturing (experimental measures, observational or qualitative, model simulation, existing)?
    2. How will you capture, create, and/or process the data? (Identify instruments, software, imaging used, etc.)
  2. Contextual Details (Metadata) Needed to Make Data Meaningful to others
    1. What file formats and naming conventions will you be using?
  3. Storage, Backup and Security
    1. Where and on what media will you store the data?
    2. What is your backup plan for the data?
    3. How will you manage data security?
  4. Provisions for Protection/Privacy
    1. How are you addressing any ethical or privacy issues (IRB, anonymization of data)?
    2. Who will own any copyright or intellectual property rights to the data?
  5. Policies for re-use
    1. What restrictions need to be placed on re-use of your data?
  6. Policies for access and sharing
    1. What is the process for gaining access to your data?
  7. Plan for archiving and preservation of access
    1. What is your long-term plan for preservation and maintenance of the data?

Poor Records Management

Some of the major issues with managing data are related to locating and making sense of data. Practical lessons from the field of records management apply in these situations.

Common records management failures include data files that are:

  • Inconsistently labeled
  • Saved in unmarked versions
  • Stored inside poorly structured folders
  • Stored on multiple media
  • Stored in multiple locations
  • Stored in various file formats

Best Practices

These are some best practices for creating file names. Poorly constructed file names can cause issues when transferring files from one format to another, or to another operating system.

  • Avoid special characters in a file name.
  • Use capitals or underscores instead of periods or spaces.
  • Use 25 or fewer characters.
  • Use documented & standardized descriptive information about the project/experiment.
  • Use the ISO 8601 date format: YYYYMMDD.
  • Include a version number.
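
The naming practices above can be sketched as a small helper. This is only an illustration, not a prescribed tool: the function name `build_file_name`, the `Project_Description_YYYYMMDD_vNN` pattern, and the choice to apply the 25-character limit to the name without its extension are all assumptions.

```python
import re
from datetime import date

MAX_LENGTH = 25  # limit suggested in the list above (applied here without the extension)


def build_file_name(project: str, description: str, day: date,
                    version: int, extension: str) -> str:
    """Build a name following a hypothetical Project_Description_YYYYMMDD_vNN pattern."""
    # Replace spaces, periods, and other special characters with underscores.
    parts = [re.sub(r"[^A-Za-z0-9]+", "_", p).strip("_") for p in (project, description)]
    stem = "_".join(p for p in parts if p)
    # Append the ISO 8601 basic date (YYYYMMDD) and a zero-padded version number.
    stem = f"{stem}_{day:%Y%m%d}_v{version:02d}"
    if len(stem) > MAX_LENGTH:
        raise ValueError(f"'{stem}' has {len(stem)} characters; limit is {MAX_LENGTH}")
    return f"{stem}.{extension.lstrip('.')}"


print(build_file_name("Soil", "pH", date(2023, 1, 15), 2, "csv"))  # Soil_pH_20230115_v02.csv
```

Raising an error for over-long names, rather than silently truncating them, keeps the descriptive parts of the name under the researcher's control.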

Lack of Metadata

Often described as 'data about data', metadata contextualizes information.  It can help you to answer several important questions:

  • How will I label, document, describe and contextualize my data during my project so I know what I am collecting?
  • How will someone else make sense of my data during and after the project (e.g. field names, terminology, values, parameters, etc.)?
  • How can I describe my data to make it discoverable by others?

There are several types of metadata that can help make sense of your data. Metadata can be descriptive, structural (to help navigate the files), administrative, or technical. Each type may also give someone a better chance of finding the information while searching within a collection or database. The more of these details are available, the more options the searcher has to locate and make sense of the data.

Best Practices

  • Describe the contents of data files
  • Define the parameters and the units of each parameter
  • Explain the formats for dates, time, geographic coordinates, and other parameters
  • Define any coded values
  • Describe quality flags or qualifying values
  • Define missing values

Here is a list of common metadata fields associated with a data set.

  • Title    
  • Creator    
  • Identifier    
  • Subject    
  • Funders    
  • Rights    
  • Access information    
  • Language    
  • Dates    
  • Location
  • Methodology    
  • Data processing    
  • Sources    
  • List of file names    
  • File Formats    
  • File structure    
  • Variable list    
  • Code lists    
  • Versions    
  • Checksums
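
As a rough sketch, some of the fields above could be recorded in a machine-readable file stored next to the data. The field names come from the list, but the choice of which fields are required, the helper `missing_fields`, and every value in the record are hypothetical placeholders.

```python
import json

# Field names are taken from the list above; treating these as mandatory is an assumption.
REQUIRED_FIELDS = ["title", "creator", "identifier", "dates", "rights", "file_formats"]


def missing_fields(record: dict) -> list:
    """Return the required fields that are absent or empty in a metadata record."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


# Hypothetical record; every value is a placeholder for illustration only.
record = {
    "title": "Example stream temperature readings",
    "creator": "Example Lab",
    "identifier": "",  # left empty on purpose to show the check
    "dates": "2023-01-15/2023-06-30",
    "rights": "CC BY 4.0",
    "file_formats": ["csv"],
    "variable_list": {"temp_c": "water temperature in degrees Celsius"},
    "code_lists": {"-999": "missing value"},
}

print(missing_fields(record))        # ['identifier']
print(json.dumps(record, indent=2))  # text that could be saved alongside the data files
```

The `variable_list` and `code_lists` entries show where units, coded values, and missing-value markers from the best practices above might be documented.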

Lack of Back Up & Data Security

Properly storing, backing up, and securing data are important responsibilities.  Your institution and sponsor want you to take these responsibilities seriously to ensure the integrity of your data.

Here are some guiding questions for this exploration:

  • How often should I be backing up my data?
  • How many copies of my data should/can I have?
  • Where can I store my data at my institution?
  • How much server space can I get at my institution?
  • Am I allowed to use personal hard drives, portable storage like USBs or cloud storage?

Best Practices

  • Make 3 copies (original + external/local + external/remote)
  • Have them geographically distributed (local vs. remote)
  • Use a hard drive or tape backup system
  • Use cloud storage
  • Store data unencrypted where possible, because it will be most easily read by you and others in the future. If you need to encrypt your data, for example because it involves human subjects, keep passwords and keys on paper (2 copies) and in a PGP (Pretty Good Privacy) encrypted digital file
  • Store data uncompressed where possible. If you need compression to conserve space, limit it to your 3rd backup copy

DataONE also has a primer on how to avoid accidental loss of data:

  • Back up your data at regular intervals
    • When you complete your data collection activity
    • After you make edits to your data
  • Streaming data should be backed up at regularly scheduled points in the collection process
    • High-value data should be backed up daily or more often
    • Automation simplifies frequent backups
  • Use a reliable device when making backups
    • External USB drive (avoid the use of “light-weight” devices e.g., floppy disks, USB stick-drive; avoid network drives that are intermittently accessible)
    • Managed network drive
    • Managed cloud file-server
  • Ensure backup copies are identical to the original copy
    • Perform differential checks
    • Perform a “checksum” check
  • Document all procedures to ensure a successful recovery from a backup copy
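
The checksum check mentioned above can be illustrated with a minimal sketch using Python's standard library. The function names `sha256_of` and `backup_and_verify` are hypothetical, and SHA-256 is just one reasonable choice of checksum; this is not DataONE's own tooling.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 checksum of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_and_verify(source: Path, backup_dir: Path) -> Path:
    """Copy source into backup_dir and confirm the copy is identical to the original."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / source.name
    shutil.copy2(source, target)  # copy2 also preserves timestamps where possible
    if sha256_of(source) != sha256_of(target):
        raise IOError(f"Checksum mismatch for {target}")
    return target
```

Recording the checksum in the data set's metadata lets the same comparison be repeated later, for example when restoring from a backup copy.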

Undetermined Ownership & Retention

When it comes to data ownership and data retention, there are a lot of overlapping policies. University Intellectual Property policies can cover the ownership and retention of data related to patents. The Institutional Review Board wants to ensure that documentation of human subjects' data is retained and/or destroyed appropriately. Funders and publishers want you to retain data to defend the integrity of your findings. On top of these, there are federal guidelines like HIPAA.

Data retention: how long should I keep my data?

The easy answer to this question is: it depends. There can be a lot of overlapping regulations depending on the type of research you’re conducting, the nature of the data you have, and the sponsors of your research. Here are some examples of overlapping data retention requirements:

  • IRB OHRP Requirements: 45 CFR 46 requires research records to be retained for at least 3 years after the completion of the research.
  • HIPAA Requirements: Any research that involved collecting identifiable health information is subject to HIPAA requirements. As a result records must be retained for a minimum of 6 years after each subject signed an authorization.
  • FDA Requirements (21 CFR 312.62.c): Any research that involved drugs, devices, or biologics being tested in humans must have records retained for a period of 2 years following the date a marketing application is approved for the drug for the indication for which it is being investigated; or, if no application is to be filed or if the application is not approved for such indication, until 2 years after the investigation is discontinued and FDA is notified.
  • VA Requirements: At present records for any research that involves the VA must be retained indefinitely per VA federal regulatory requirements.
  • Intellectual Property Requirements: Any research data used to support a patent must be retained for the life of the patent in accordance with Intellectual Property Policy.
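
When several requirements apply, the data must be kept until the latest of the resulting dates. The sketch below illustrates that arithmetic for only two of the rules above (the 3-year OHRP period and the 6-year HIPAA period). The function names are hypothetical, using the most recent authorization date as the HIPAA starting point is an assumption, and the FDA, VA, patent, funder, and publisher rules are not modeled.

```python
from datetime import date
from typing import Optional

OHRP_YEARS = 3   # 45 CFR 46: at least 3 years after completion of the research
HIPAA_YEARS = 6  # at least 6 years after a subject signed an authorization


def add_years(day: date, years: int) -> date:
    """Shift a date by whole years, moving 29 February to 28 February if needed."""
    try:
        return day.replace(year=day.year + years)
    except ValueError:
        return day.replace(year=day.year + years, day=28)


def earliest_disposal_date(completed: date,
                           last_authorization: Optional[date] = None) -> date:
    """Return the latest retention date among the requirements that apply."""
    candidates = [add_years(completed, OHRP_YEARS)]
    if last_authorization is not None:
        # Assumption: the most recently signed authorization sets the HIPAA date.
        candidates.append(add_years(last_authorization, HIPAA_YEARS))
    return max(candidates)


print(earliest_disposal_date(date(2023, 6, 30), date(2021, 3, 1)))  # 2027-03-01
```

In this example the HIPAA date (2027-03-01) is later than the OHRP date (2026-06-30), so it determines how long the records are kept.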

Best Practices

Check your funder and publisher requirements.

Questions of data validity: If there are questions or allegations about the validity of the data or appropriate conduct of the research, you must retain all of the original research data until such questions or allegations have been completely resolved.

Lack of Long-Term Planning

After a project you may want to consider appraising, and publishing or depositing your data in a repository. There are a variety of factors that impact your ability to share data with outside parties.

  • What will happen to my data after my project ends?
  • How can I appraise the value of my data?
  • What are my options for archiving and preserving my data?
  • What are my options for publishing and sharing data?

Best Practices

  • Is the file format open or closed (i.e. proprietary)?
  • Is a particular software package required to read and work with the data file?  If so, the software, version, and operating system should be cited in the metadata.
  • Do multiple files comprise the data file structure? If so, that should be specified in the metadata.
  • When choosing a file format, select a consistent format that can be read well into the future and is independent of changes in applications.
  • Prefer non-proprietary formats: files in open, documented standards that are unencrypted, uncompressed, and ASCII-formatted are the most likely to remain readable in the future.
