LibGuides: Research Data Management: Archiving & Preservation

Data Life Cycle

https://www.dcc.ac.uk/guidance/curation-lifecycle-model

The Digital Curation Centre's Data Life Cycle model outlines the curation and preservation activities related to data from the beginning to the end of a research project. One of the most important activities is adding descriptive information to data. Metadata describes who, what, when, where, why, and how. Metadata can answer questions about the content of a dataset and its format.

Data Repositories

The venue most suited for researchers seeking to share and maintain access to their data is the digital data repository. There are three major types of digital data repositories.

Institutional repository ("IR"): connected to the researcher’s institution.
Disciplinary repository ("DR"): discipline specific and often operated by a professional organization, a consortium of researchers, or some similar group.
Open Repository ("OR"): allows researchers from different disciplines to deposit and make their data available.

Some useful registries of repositories are OpenDOAR (https://v2.sherpa.ac.uk/opendoar/) and re3data (re3data.org).

Selecting a Repository

Cost

Costs for using data repositories vary. Long-term storage of data inherently has a cost. Some organizations currently shoulder that cost, while others charge the researcher a fee to help defray the costs of archiving their data. Check the policies of any repository you are considering so that you know what costs you may need to pay before deciding.

Discovery & Access

Access-related features are critical considerations in deciding on a home for data. Discovery hinges on proper indexing by search engines, and a common way that happens is through use of OAI-PMH (the Open Archives Initiative Protocol for Metadata Harvesting). When a repository implements this protocol, search engines and other harvesters can index the website, metadata XML documents, and resource identifiers. Some repositories have integrated with additional search systems, such as the Center for Open Science's SHARE, that allow multiple repositories to be searched at once.
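As a rough illustration of how a harvester talks to a repository over OAI-PMH, the sketch below builds a ListRecords request and reads identifiers and titles out of a response. The base URL https://repository.example.edu/oai and the sample XML are hypothetical placeholders. The verb name, the oai_dc metadata prefix, and the XML namespaces come from the protocol itself.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespaces used in OAI-PMH responses that carry simple Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build a ListRecords request URL for an OAI-PMH endpoint."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"

def extract_records(xml_text):
    """Return (identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(xml_text)
    results = []
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext(
            "oai:header/oai:identifier", default="", namespaces=NS
        )
        title = record.findtext(".//dc:title", default="", namespaces=NS)
        results.append((identifier, title))
    return results

# Hypothetical response, trimmed to the parts the parser reads.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:repository.example.edu:1234</identifier>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example survey data set</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""

request_url = list_records_url("https://repository.example.edu/oai")
records = extract_records(SAMPLE_RESPONSE)
```

A real harvester would also need to follow resumption tokens for large result sets and handle error responses, both of which are left out here.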

Permanent Identifiers

Dead links and "404 Not Found" errors are a serious threat to the advancement of research. To counter this, preservation-minded web architects have devised the concept of persistent identifiers, which serve as reliable location listings for information objects. These identifiers remain valid even when objects are relocated across folders or servers. A common example is the Digital Object Identifier (DOI).
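To show how a DOI acts as a stable pointer, here is a minimal sketch that turns a DOI written in several common forms into a link through the https://doi.org/ resolver. The pattern is a heuristic for the common modern DOI shape, not a full validator, and the DOI in the example is a made-up placeholder. The function assumes its input is not already percent-encoded.

```python
import re
from urllib.parse import quote

# Heuristic for the common DOI shape: "10.", a registrant code, "/", a suffix.
# Some older DOIs will not match this pattern.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

# Prefixes people commonly put in front of a bare DOI.
KNOWN_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)

def doi_to_url(doi):
    """Turn a bare DOI, a 'doi:' form, or a resolver URL into a doi.org link."""
    cleaned = doi.strip()
    for prefix in KNOWN_PREFIXES:
        if cleaned.lower().startswith(prefix):
            cleaned = cleaned[len(prefix):]
            break
    if not DOI_PATTERN.match(cleaned):
        raise ValueError(f"does not look like a DOI: {doi!r}")
    # Percent-encode characters that are unsafe in a URL path, keeping "/".
    return "https://doi.org/" + quote(cleaned, safe="/")

# Hypothetical DOI used only for illustration.
link = doi_to_url("doi:10.1234/example.5678")
```

Because the resolver, not the repository's own web address, is what gets cited, the link keeps working if the repository later moves the data set and updates the DOI record.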

Policies & Licensing

One of the fundamental tenets of managing federally funded research data is that data be openly accessible and free of restrictions. However, there may be valid reasons for limiting access in some way or for some period of time. You are responsible for understanding if there are restraints stemming from national security, intellectual property, or human subjects' privacy policies. Data user agreements and licensing allow owners to state explicitly up front what uses they would be willing to allow. The most popular system for communicating these licenses is the Creative Commons (CC). Levels of CC licenses range from the most liberal "CC0," which effectively renders material as public domain, to the most restrictive "CC BY-NC-ND," which requires attribution, and disallows any changes as well as any commercial use. Some disciplinary repositories require a uniform level of licensing from their depositors. You can work with your library and intellectual property office to choose an appropriate access level and license for your scholarship.
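The range of Creative Commons options described above can be summarized as a small lookup table. This is an illustrative sketch only: the field names and the helper function are hypothetical, and the flags simplify the legal text (for example, "ND" licenses restrict sharing adapted versions rather than making them privately). Always read the license deed itself before choosing.

```python
from typing import NamedTuple

class LicenseTerms(NamedTuple):
    attribution_required: bool
    commercial_use_allowed: bool
    adaptations_shareable: bool
    share_alike_required: bool

# CC0 plus the six standard Creative Commons licenses.
CC_LICENSES = {
    "CC0":         LicenseTerms(False, True,  True,  False),
    "CC BY":       LicenseTerms(True,  True,  True,  False),
    "CC BY-SA":    LicenseTerms(True,  True,  True,  True),
    "CC BY-NC":    LicenseTerms(True,  False, True,  False),
    "CC BY-NC-SA": LicenseTerms(True,  False, True,  True),
    "CC BY-ND":    LicenseTerms(True,  True,  False, False),
    "CC BY-NC-ND": LicenseTerms(True,  False, False, False),
}

def licenses_permitting(commercial_use=False, adaptations=False):
    """List the licenses that allow the requested kinds of reuse."""
    return [
        name
        for name, terms in CC_LICENSES.items()
        if (terms.commercial_use_allowed or not commercial_use)
        and (terms.adaptations_shareable or not adaptations)
    ]

# Which licenses let others both sell and adapt the data?
most_open = licenses_permitting(commercial_use=True, adaptations=True)
```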

Professional Metrics

Researchers should consider the extent to which a repository will give them useful metrics for validating and expressing the quality, quantity, and impact of the work they have made available. Permanent identifiers, such as DOIs, that can be assigned to data sets in a repository are also very useful for tracking citation and impact. Piwowar (2007) found that papers with publicly available online data sets were cited more often than those without. Thus, these identifiers can be fruitful for researchers' promotion and tenure, as they can be used to collect metrics on the impact of their publicly available data sets and other scholarly output.

Data Retention

Understanding Data Retention

Maintaining comprehensive and accurate research records and data is important and may be an obligation long after a project has concluded. Data retention requirements are put in place by funding agencies and sponsoring institutions for a number of reasons. Retention requirements depend on a variety of factors, including the type of data, the purpose of data collection, and the policies of the institutions.

In order to comply with the terms of a grant, it is important to understand the retention requirements of those funding the research. A parent organization may also have retention requirements for research data, including permanently keeping some records as a part of its institutional history or intellectual property. Different retention requirements might apply to records related to data, such as the administrative or financial records of the project. Not all funding agencies require the retention of these records, though they may be governed by records management policies at a parent institution. Researchers are responsible for understanding in advance the data retention expectations of their sponsors/funders and institutions so that they may plan their budget, future storage needs, and ongoing oversight of all records in their custody accordingly.

How Long to Retain Data

One of the most challenging aspects of data management is understanding how long data needs to be maintained. Retention periods help researchers understand how long they are required to keep their data in order to comply with the terms of their grant. Data retention requirements may be complex and ambiguous, so it is important to understand the retention requirements of your project's sponsors and the policies of your parent institution. It is not a one-size-fits-all situation, and often there may be several different guidelines within one policy. It is also not unusual for several data retention policies to apply to one set of data, in which case the longest recommended retention period is usually the one applied. Publishers also influence the retention of data: journals have retracted articles after the standard six years because data that were called into question could not be found.
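The "longest period wins" rule can be sketched in a few lines. The policy names and year counts below are invented for illustration, and the sketch assumes the retention clock starts at the project end date. Real policies may start the clock at a different event, such as submission of the final report, so confirm the trigger with your funder and institution.

```python
from datetime import date

def governing_retention(policies):
    """Return (source, years) for the longest retention period that applies."""
    if not policies:
        raise ValueError("at least one retention policy is required")
    source = max(policies, key=policies.get)
    return source, policies[source]

def earliest_disposal_date(project_end, years):
    """Date the retention period ends, counted from the project end date."""
    try:
        return project_end.replace(year=project_end.year + years)
    except ValueError:
        # project_end is 29 February and the target year is not a leap year.
        return project_end.replace(year=project_end.year + years, day=28)

# Hypothetical retention periods, in years, that apply to one data set.
policies = {"funder": 3, "institution": 6, "publisher": 5}

source, years = governing_retention(policies)
disposal_date = earliest_disposal_date(date(2024, 6, 30), years)
```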

Long-Term versus Permanent Retention of Data

Retention policies also support an institution in identifying data and records that might be maintained permanently as a part of the historical record or as intellectual property. Records eligible for permanent retention may be those that document a breakthrough, were generated by a lab or individual who had great impact on the field, or are highly reusable in a particular area of research. Permanent retention is a significant investment for an institution. This is not the same as ensuring long-term storage or preservation of research data. Long-term preservation seeks to ensure that research data will be available to those who seek it in a persistent and accessible format for the specific period of time outlined by your funder and parent institution. These retention periods allow a measured period of time to pass so that a better assessment of the long-term impact of a research project can be made.

Data Disposal

Most research data retention policies set minimum requirements for keeping records, but you are not required to keep your data for longer than the retention period. Often the cost of long-term storage is prohibitive for researchers, and thus they may not be interested in storing data for any longer than necessary. Should your research data not meet the criteria for permanent retention, you will want to take steps to safely and completely dispose of your data once the specified retention period has passed. Disposal of your research data might include shredding, deleting, disk-wiping, destroying, or otherwise disassembling materials holding your data in a way that ensures that data cannot be reconstructed or extracted. Extra steps may be required to maintain safety, biologic or otherwise, and the privacy and confidentiality of your research subjects.

Always check the policies of your institution or funder to make sure destruction actions are in line with their research data policies. Sometimes a policy may apply to research records but not to data, so always confirm the policy before taking steps to destroy anything. Institutions may recommend that you document the records you destroy; if so, maintain this record along with any final project outcomes. The benefit of documenting the disposal of research data is the responsible management of the full life cycle of your data, as well as avoiding future confusion about missing or abandoned data.
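One simple way to document disposal is a small log kept with the final project records. The sketch below writes such a log as CSV. The column names and the sample entry are hypothetical, so adapt them to whatever your institution's records management policy asks for.

```python
import csv
import io
from datetime import date

# Hypothetical columns for a disposal log.
FIELDS = ["record", "disposal_date", "method", "policy", "authorized_by"]

def write_disposal_log(entries, stream):
    """Write disposal entries (dicts keyed by FIELDS) as CSV to a text stream."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    writer.writeheader()
    for entry in entries:
        writer.writerow(entry)

# Invented example entry.
entries = [
    {
        "record": "interview_audio_2018/",
        "disposal_date": date(2025, 1, 15),
        "method": "disk wipe",
        "policy": "institutional retention schedule",
        "authorized_by": "principal investigator",
    }
]

# An in-memory buffer stands in for a real file opened with newline="".
buffer = io.StringIO()
write_disposal_log(entries, buffer)
log_lines = buffer.getvalue().splitlines()
```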

Data Appraisal

In order to identify research data that will be permanently maintained by an institution, many organizations have an appraisal process. This often involves an inventory of the records as well as an interview about the project. Common questions to anticipate include:

What are the essential records required to understand this research data and project?
What was the impact of this research on its discipline?
What has been the impact of the researcher in their field?
Are the data replicable?
Is there an index to the data? How would future researchers understand the research?
Has this research been published? Where?
Has the data been kept in a research repository?
Are there additional records related to the data?
Are there security or access issues?
Does someone else own the data?

Related Records

Some appraisal questions highlight records related to the research data that provide context to the data or project. These records and their connection to the data are a key element in considering a collection of research data for permanent retention. They help someone unfamiliar with the details of the research make sense of the overall project's mission, progress, and findings. These records may have different retention periods than the data, so it is important to recognize these records as separate yet closely related to your data and requiring management and oversight. In some cases institutional policies may specify that these records are required to be archived even if the data is not. Examples of related records include:

Human or animal subject protection records including proposals, protocols, informed consent forms, laboratory care documentation, and related correspondence.
Administrative records such as proposals, working papers, meeting minutes and notes, narrative reports, internal status reports, personnel records, deliverables such as books and manuals, project review summaries and reports, and related correspondence.
Financial records such as accounts payable, invoices, and budget monitoring or audit records.

What Data Should be Archived?

Research data and records in an archive are often assessed to be of long-term, enduring value to a scientific discipline, the public interest, or institutional legacy. Data sets are often considered a priority for inclusion in an archive when:

The data is not available anywhere else, or is not likely to be available elsewhere in the future
The research is in line with the collecting policies of an institution
Related records are well maintained, comprehensive and available for archiving
Ownership is clear
Standards for privacy and confidentiality of subjects studied are clear
The technical documentation is comprehensive
The data is in a format that facilitates ease of use and preservation

Long-Term Data Management

Researchers play a critical role in planning for the ongoing management and preservation of research data. Long-term data management and long-term preservation have the same objectives:

Ongoing, consistent, and citable access to data and associated contextual records after a project is complete and in such a way as to permit the long-term review, re-use, interpretation, and re-creation of the products of research
Ensuring that protected data stays protected through repository-governed access controls
Ensuring the integrity of the data itself beyond mandated retention periods through the use of automated repository functions, such as the replication of data to geographically dispersed locations

Long-term data management speaks to the intellectual responsibilities and actions of the researcher, while long-term preservation addresses the technical requirements necessary to ensure access that is "permanent and persistent." It is the management of the data that enables knowledge to be discovered, shared, and further developed. Good long-term data management includes the selection of a repository to ensure that baseline descriptive information about research is captured along with the data set and that certain technical processes are performed routinely and reliably to maintain data integrity. The table below outlines the distinctions between the work required of a principal investigator/researcher and the related functions of a preservation repository.

Responsibilities of the Principal Investigator/Researcher	Functionalities of the Preservation Repository
Appraises research data and contextual documentation for deposit, adhering to requirements specified by his or her institution and granting agency; works with project researchers to ensure all records are captured; and consults with an archivist or records manager regarding research records that may be useful for historians across disciplines and over time	Offers instruction on repository scope, requirements, and infrastructure
Collocates all data, data sets, and contextual records on removable storage devices, hard drives, cloud storage, and network servers for retention	Centralizes a point of upload and management of files
Consults with appropriate institutional offices about intellectual property and distribution rights, as well as applicable data security rules, prior to depositing data and associated records	Enables the creation of intellectual property and rights metadata to clarify the rules governing access to, and use of, deposited files
Re-organizes and re-names files as necessary to correct deviations from project-established filing structure and file naming conventions	Enables uploads of folders and specifies maximum character length for file names
Saves files in open formats whenever possible	Accepts broad variety of file formats, which may include SPSS, Stata, R Data, FITS data, Social Network Data, and Data Visualizations
Performs virus checks and other scans to prevent the deposit of spyware or other malware	Performs routine virus checking and restores files to replace corrupted versions; ensures overall integrity of data deposited
De-identifies data sets for public access as necessary to maintain privacy and confidentiality; deposits both the original and de-identified copies in the repository	Enables users to create accounts and/or set permissions for individual files (access controls)
Creates metadata for deposited records; adheres to the data entry conventions of selected metadata schema; utilizes controlled vocabularies to populate elements (or fields) required by the repository and specific to a discipline	Provides a data entry template for the depositor that adheres to an accepted metadata schema (such as Dublin Core) and stipulates a minimum amount of metadata be created prior to deposit
Creates metadata that specifies who created the data, the institution that hosted the research, and the granting agencies that paid for the research to enable citation	Requires a DOI (Digital Object Identifier) or other identifier scheme for persistent citation
Deposits additional files and revised versions of already-deposited files on an ongoing basis as necessary post-project or to reflect post-project work in a particular area	Offers version control and does not overwrite files with the same name
Keeps physical records generated as a product of research safe through deposit to an institutional archives or special collections	Maintains the integrity of physical and digital records. Performs bit-level checking, generates checksums, and logs fixity checks, ensuring deposited files remain uncorrupted and usable; makes bit-level copies and saves them to additional servers in different geographic locations in case of natural disaster
Plans for the cost of depositing and storing data and associated records in a preservation repository	Clearly communicates explicit terms of use and costs/schedule of fees associated with deposit and use of the repository
Appoints a data custodian at the institution where the research was conducted, in case he or she leaves the host institution and data and/or associated records must be withdrawn or transferred to a new repository	Enables the transfer of administrative privileges to manage the deposit
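The checksum and fixity checking listed in the table can be approximated on a researcher's own machine before deposit. The following is a minimal sketch, not how any particular repository implements it: it records a SHA-256 checksum for every file in a folder and later reports any file whose checksum has changed or that has gone missing. The file name in the demonstration is hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path, chunk_size=1024 * 1024):
    """Compute a file's SHA-256 checksum, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(directory):
    """Map each file's path (relative to directory) to its checksum."""
    root = Path(directory)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }

def verify_manifest(directory, manifest):
    """Return relative paths that are missing or whose checksum has changed."""
    root = Path(directory)
    problems = []
    for relative_path, expected in manifest.items():
        target = root / relative_path
        if not target.is_file() or sha256_of(target) != expected:
            problems.append(relative_path)
    return problems

# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as workdir:
    data_file = Path(workdir) / "survey_results.csv"
    data_file.write_text("id,score\n1,42\n")
    manifest = build_manifest(workdir)
    clean_check = verify_manifest(workdir, manifest)    # nothing changed yet
    data_file.write_text("id,score\n1,43\n")            # simulate corruption
    damaged_check = verify_manifest(workdir, manifest)  # change is detected
```

Keeping the manifest alongside the deposit lets you confirm that what the repository holds matches what you sent.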

Different repositories will have different levels of service, will support different file formats for deposit, and offer different levels of administrative and user support.

While the pre-deposit obligations of the researcher might seem overwhelming, much of the planning can be expedited by the creation of a good data management plan. Creating a data management plan requires researchers to understand their obligations to their institutions and sponsors, designate data collectors, and to consult with librarians, metadata specialists, and archivists on whether or not their institution has the resources and infrastructure to support long-term management and preservation requirements. Whether you are required to have a plan or not, investing in the process before start-up promotes the cost-effective creation of high quality, useable, and preservable data.

Planning for Repository Deposits

Librarians and archivists can help researchers determine and scope the records they must deposit with data sets, which varies by discipline. Principal investigators should consider informing team members of where the data is deposited, particularly when researchers work in multiple locations. Clearly articulating in a data management plan who owns the data, for how long de-identified data and other project records will be accessible to project partners, and where the data will be deposited can improve communications and collegiality, ensure the preservation of the data, and prevent disputes at the close of a project.

Metadata for Long-Term Management

Creating good metadata promotes the long-term discovery of research assets at the point of ingest (the process of adding objects to a preservation repository). Dickmann et al. (2012) write of a direct correlation between the 'benefit of metadata plus research data' and levels of understanding over time:

From the literature review as well as discussions we derived the following: the level of understanding of data produced in a particular research experiment reduces as time progresses. For each researcher, this process is individual. Thus no specific time frame may be defined. In addition, no researcher can share his complete understanding/perception to other researchers. The perception difference increases with organizational distance: working group, department, institution, etc.

Particular attention should be paid to elements common across metadata schemas. Access can further be enhanced through the use of multiple controlled vocabularies, which offer standardized forms of personal names, institutions, research subjects, methodologies, techniques, and equipment used that make it easier for researchers to identify records of potential use across collections and repositories. Use of terms from two different vocabularies to express the same concept further opens the possibility of the data being located by researchers outside of a specific discipline and broader terms can also be included to promote access.
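As an illustration of a record that uses common elements and two subject terms for the same concept, the sketch below builds a simple Dublin Core description as XML. The element names and namespace are those of the Dublin Core element set. The wrapping "metadata" element, the function name, and all of the values are hypothetical, and a real repository will supply its own template.

```python
import xml.etree.ElementTree as ET

# Namespace of the 15-element Dublin Core element set.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

def dublin_core_record(fields):
    """Build an XML string from a dict of Dublin Core element -> value(s)."""
    root = ET.Element("metadata")  # hypothetical wrapper element
    for element, values in fields.items():
        if isinstance(values, str):
            values = [values]
        for value in values:
            child = ET.SubElement(root, f"{{{DC_NS}}}{element}")
            child.text = value
    return ET.tostring(root, encoding="unicode")

# Invented values; the two subject terms express one concept in the
# wording of two different vocabularies.
record_xml = dublin_core_record({
    "title": "Example survey data set",
    "creator": "Researcher, Example",
    "publisher": "Example University",
    "date": "2024-06-30",
    "subject": ["Myocardial infarction", "Heart attack"],
    "rights": "CC BY 4.0",
})
```

Repeating the subject element, once per vocabulary, is what lets a searcher from another discipline find the record under the term they already use.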

Qualities of a Preservation Repository

While a complete understanding of the technical requirements for certifying a repository as a "preservation repository" is not required, detailed information about those requirements can be found on the following websites:

These requirements include, but are not limited to:

Demonstrating commitment to the organizational infrastructure of the repository
Policies relevant to digital object management
Policies relevant to technologies and technical infrastructure employed and data security

Repositories for the preservation of data can be free, fee-based, or institutionally hosted, and either open or closed to public access. While depositing to an open access repository is a necessity for government-sponsored research, data can be preserved in closed access storage services as long as the host institution is provided unfettered access to accounts.

The benefits of commercial repositories may include:

A higher level of system support
Tools that make deposits easier than open source tools
Higher levels of data security
Insurance against data loss

Disadvantages may include:

Closed to reuse by the academic enterprise
Not citable or discoverable online
Lack of access or download metrics
Ongoing cost commitments on the part of the institution
Opening accounts that may be known to only a few researchers and are at risk for abandonment
Time needed for vendor review/approval by your institution's Information Security staff

Fees associated with institutional repositories are generally storage-based, not service-based. While an institution's staff can help with the deposit of research data to an institution's repository for free, a designated department usually pays the annual data storage fee.

License and Attributions

Much of this guide was adapted from the New England Collaborative Data Management Curriculum.

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.