The Digital Curation Centre's Data Life Cycle model outlines the curation and preservation activities related to data from the beginning to the end of a research project. One of the most important activities is adding descriptive information, or metadata, to data. Metadata describes the who, what, when, where, why, and how of a dataset. It can answer questions about the content of a dataset and its format.
The venue most suited for researchers seeking to share and maintain access to their data is the digital data repository. There are three major types of digital data repositories.
Some useful registries of repositories are OpenDOAR (https://v2.sherpa.ac.uk/opendoar/) and re3data (re3data.org).
Costs for using data repositories vary. Long-term storage of data inherently involves a cost. Some organizations currently shoulder that cost, while others charge researchers a fee to help defray the cost of archiving their data. Check the policies of any repository you are considering so that you know what costs you may need to pay before deciding.
Access-related features are critical considerations in deciding on a home for data. Discovery hinges on proper indexing by search engines, and a common way that happens is through OAI-PMH (the Open Archives Initiative Protocol for Metadata Harvesting). When a repository implements this protocol, search engines and other harvesters can index the website, its metadata XML documents, and its resource identifiers. Some repositories have also integrated with additional search systems, such as the Center for Open Science's SHARE, which allow multiple repositories to be searched at once.
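As a rough illustration of what harvesting through OAI-PMH involves, the sketch below builds a `ListRecords` request and pulls identifiers and titles out of a response. The endpoint `https://repository.example.edu/oai` and the sample response are hypothetical; the `verb` and `metadataPrefix` parameters and the XML namespaces come from the OAI-PMH 2.0 and Dublin Core specifications. A real harvester would also need to fetch the URL and handle resumption tokens, which this sketch leaves out.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespaces defined by the OAI-PMH 2.0 and Dublin Core specifications.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build a ListRecords request URL for an OAI-PMH endpoint."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def extract_records(xml_text):
    """Return (identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(xml_text)
    pairs = []
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
        title = record.findtext(".//dc:title", namespaces=NS)
        pairs.append((identifier, title))
    return pairs


# Hypothetical, heavily abbreviated response used only for illustration.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repository.example.edu:1234</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Stream temperature readings</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

print(list_records_url("https://repository.example.edu/oai"))
print(extract_records(SAMPLE_RESPONSE))
```

The point of the example is that the metadata a depositor enters is what harvesters actually see, so a record with a clear title and identifier is easier to discover.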
Dead links and "404 Not Found" errors are a persistent threat to the advancement of research. To counter this, preservation-minded web architects devised persistent identifiers, which serve as reliable location listings for information objects. These identifiers remain valid even when objects are relocated across folders or servers. A common example is the Digital Object Identifier (DOI).
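To make the idea concrete, the following sketch turns a DOI written in any of several common forms into a link through the `https://doi.org/` resolver, which redirects to wherever the object currently lives. The DOI `10.1234/example.5678` is made up, and the validation pattern is a loose assumption that covers the usual `10.<registrant>/<suffix>` shape rather than a complete definition of DOI syntax.

```python
import re
from urllib.parse import quote

# Loose pattern for the common "10.<registrant>/<suffix>" shape (an assumption,
# not a full statement of the DOI syntax rules).
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

# Prefixes that people commonly paste along with a DOI.
PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw):
    """Strip whitespace and any resolver or 'doi:' prefix, then validate."""
    doi = raw.strip()
    lowered = doi.lower()
    for prefix in PREFIXES:
        if lowered.startswith(prefix):
            doi = doi[len(prefix):]
            break
    if not DOI_PATTERN.match(doi):
        raise ValueError(f"does not look like a DOI: {raw!r}")
    return doi


def doi_to_url(raw):
    """Return a resolver link that stays valid when the object is moved."""
    return "https://doi.org/" + quote(normalize_doi(raw), safe="/")


print(doi_to_url("doi:10.1234/example.5678"))
```

Citing the resolver link rather than the repository's own page address is what keeps a citation working after a migration.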
One of the fundamental tenets of managing federally funded research data is that data be openly accessible and free of restrictions. However, there may be valid reasons for limiting access in some way or for some period of time. You are responsible for understanding whether any restrictions stem from national security, intellectual property, or human subjects' privacy policies. Data use agreements and licensing allow owners to state explicitly up front what uses they are willing to allow. The most popular system for communicating these licenses is Creative Commons (CC). CC licenses range from the most liberal, "CC0," which effectively places material in the public domain, to the most restrictive, "CC BY-NC-ND," which requires attribution and disallows any changes as well as any commercial use. Some disciplinary repositories require a uniform level of licensing from their depositors. You can work with your library and intellectual property office to choose an appropriate access level and license for your scholarship.
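The range of CC licenses can be laid out as a small lookup table. The sketch below is an illustrative summary only: the field names are invented for this example, "share_adaptations" refers to distributing modified versions, and the license deeds themselves remain the authoritative statement of terms.

```python
from typing import NamedTuple


class Terms(NamedTuple):
    attribution_required: bool
    commercial_use: bool       # may reusers use the material commercially?
    share_adaptations: bool    # may reusers distribute modified versions?
    share_alike: bool          # must adaptations carry the same license?


# Ordered roughly from most liberal to most restrictive.
CC_LICENSES = {
    "CC0":         Terms(False, True,  True,  False),
    "CC BY":       Terms(True,  True,  True,  False),
    "CC BY-SA":    Terms(True,  True,  True,  True),
    "CC BY-NC":    Terms(True,  False, True,  False),
    "CC BY-NC-SA": Terms(True,  False, True,  True),
    "CC BY-ND":    Terms(True,  True,  False, False),
    "CC BY-NC-ND": Terms(True,  False, False, False),
}


def licenses_allowing(commercial_use, share_adaptations):
    """List licenses that permit every kind of reuse the caller asks for."""
    return [
        name
        for name, terms in CC_LICENSES.items()
        if (terms.commercial_use or not commercial_use)
        and (terms.share_adaptations or not share_adaptations)
    ]


# Which licenses let others both adapt the data and use it commercially?
print(licenses_allowing(commercial_use=True, share_adaptations=True))
```

A table like this can help frame the conversation with a library or intellectual property office, but it does not replace that conversation.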
Researchers should consider the extent to which they will be able to get back useful metrics for validating and expressing the quality, quantity, and impact of the work they have made available in a repository. Persistent identifiers, such as DOIs, that can be assigned to data sets in a repository are also very useful for tracking citation and impact. Piwowar (2007) found that the most-cited papers were those with publicly available, online data sets. These identifiers can therefore be valuable for researchers' promotion and tenure, as they can be used to collect metrics on the impact of publicly available data sets and other scholarly output.
Maintaining comprehensive and accurate research records and data is important and may be an obligation long after a project has concluded. Data retention requirements are put in place by funding agencies and sponsoring institutions for a number of reasons. Retention requirements depend on a variety of factors, including the type of data, the purpose of data collection, and the policies of the institutions.
In order to comply with the terms of a grant, it is important to understand the retention requirements of those funding the research. A parent organization may also have retention requirements for research data, including permanently keeping some records as a part of its institutional history or intellectual property. Different retention requirements might apply to records related to the data, such as the administrative or financial records of the project. Not all funding agencies require the retention of these records, though they may be governed by records management policies at a parent institution. Researchers are responsible for understanding in advance the data retention expectations of their sponsors/funders and institutions so that they can plan their budget, future storage needs, and ongoing oversight of all records in their custody accordingly.
One of the most challenging aspects of data management is understanding how long data needs to be maintained. Retention periods tell researchers how long they are required to keep their data in order to comply with the terms of their grant. Data retention requirements may be complex and ambiguous, so it is important to understand the retention requirements of your project's sponsors and the policies of your parent institution. It is not a one-size-fits-all situation, and there are often several different guidelines within one policy. It is also not unusual for several data retention policies to apply to one set of data, in which case the longest recommended retention period is usually the one applied. Publishers also influence the retention of data: journals have retracted articles after the standard six years had passed because the underlying data were called into question and could not be found.
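When several policies apply to the same data, working out the governing period is simple arithmetic. The sketch below picks the longest period and computes the earliest date on which disposal could be considered. The policy names and year counts are hypothetical, and it assumes every policy counts from the same start date, which real policies may not.

```python
from datetime import date


def add_years(start, years):
    """Add whole years to a date, mapping Feb 29 to Feb 28 when needed."""
    try:
        return start.replace(year=start.year + years)
    except ValueError:
        # start is Feb 29 and the target year is not a leap year.
        return start.replace(year=start.year + years, month=2, day=28)


def governing_retention(policies, period_start):
    """Return (source, years, earliest_disposal_date) for the longest policy."""
    source, years = max(policies.items(), key=lambda item: item[1])
    return source, years, add_years(period_start, years)


# Hypothetical retention periods, in years, for one data set.
policies = {"funder": 3, "institution": 6, "journal": 5}

source, years, keep_until = governing_retention(policies, date(2020, 6, 30))
print(f"Keep until {keep_until} ({years} years, per {source} policy)")
```

Recording which policy produced the date, not just the date itself, makes it easier to revisit the decision if a policy changes.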
Retention policies also support an institution in identifying data and records that might be maintained permanently as a part of the historical record or as intellectual property. Records eligible for permanent retention may be those that document a breakthrough, were generated by a lab or individual who had great impact on the field, or are highly reusable in a particular area of research. Permanent retention is a significant investment for an institution, and it is not the same as ensuring long-term storage or preservation of research data. Long-term preservation seeks to ensure that research data will be available to those who seek it in a persistent and accessible format for the specific period of time outlined by your funder and parent institution. These retention periods allow a measured period of time to pass so that a better assessment of the long-term impact of a research project can be made.
Most research data retention policies set minimum requirements for keeping records, but you are not required to keep your data for longer than the retention period. The cost of long-term storage is often prohibitive for researchers, so they may not be interested in storing data for any longer than necessary. If your research data does not meet the criteria for permanent retention, you will want to take steps to safely and completely dispose of your data once the specified retention period has passed. Disposal of your research data might include shredding, deleting, disk-wiping, destroying, or otherwise disassembling materials holding your data in a way that ensures the data cannot be reconstructed or extracted. Extra steps may be required to maintain safety, biological or otherwise, and the privacy and confidentiality of your research subjects.
Always check the policies of your institution or funder to make sure destruction actions are in line with their research data policies. Sometimes different policies apply to research records than to data, so always confirm the policy before taking steps to destroy anything. Institutions may recommend that you document the records you destroy; if so, maintain this record along with any final project outcomes. Documenting the disposal of research data supports responsible management of the full life cycle of your data and avoids future confusion about missing or abandoned data.
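A disposal record does not need to be elaborate. As a minimal sketch, the code below appends entries to a CSV log. The column names and the sample entry are hypothetical, and your institution may prescribe its own form, which should take precedence.

```python
import csv
import io
from datetime import date

# Hypothetical columns; substitute whatever your institution requires.
FIELDS = ["record_title", "disposal_date", "method", "policy_cited", "authorized_by"]


def append_disposal_entry(log_file, **entry):
    """Append one entry to an open log, writing the header if the log is empty."""
    writer = csv.DictWriter(log_file, fieldnames=FIELDS)
    if log_file.tell() == 0:
        writer.writeheader()
    writer.writerow(entry)


# An in-memory log is used here; a real log would be a file kept with the
# final project outcomes, opened with open(path, "a", newline="").
log = io.StringIO()
append_disposal_entry(
    log,
    record_title="Raw survey responses, wave 1",
    disposal_date=date(2031, 1, 15),
    method="disk wipe",
    policy_cited="institutional retention schedule",
    authorized_by="principal investigator",
)
print(log.getvalue())
```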
To identify research data that will be permanently maintained by an institution, many organizations have an appraisal process. This often involves an inventory of the records as well as an interview about the project. Common questions to anticipate include:
Some appraisal questions highlight records related to the research data that provide context to the data or project. These records and their connection to the data are a key element in considering a collection of research data for permanent retention. They help someone unfamiliar with the details of the research make sense of the overall project's mission, progress, and findings. These records may have different retention periods than the data, so it is important to recognize these records as separate yet closely related to your data and requiring management and oversight. In some cases institutional policies may specify that these records are required to be archived even if the data is not. Examples of related records include:
Research data and records in an archive are often assessed to be of long-term, enduring value to a scientific discipline, the public interest, or institutional legacy. Data sets are often considered a priority for ------- in an archive when:
Researchers play a critical role in planning for the ongoing management and preservation of research data. Long-term data management and long-term preservation have the same objectives:
Long-term data management speaks to the intellectual responsibilities and actions of the researcher, while long-term preservation addresses the technical requirements necessary to ensure access that is "permanent and persistent." It is the management of the data that enables knowledge to be discovered, shared, and further developed. Good long-term data management includes the selection of a repository to ensure that baseline descriptive information about the research is captured along with the data set and that certain technical processes are performed routinely and reliably to maintain data integrity. The table below outlines the distinctions between the work required of a principal investigator/researcher and the related functions of a preservation repository.
Responsibilities of the Principal Investigator/Researcher | Functionalities of the Preservation Repository |
---|---|
Appraises research data and contextual documentation for deposit, adhering to requirements specified by his/her institution and granting agency; works with project researchers to ensure all records are captured; and consults with an archivist or records manager regarding research records that may be useful for historians across disciplines and over time | Offers instruction on repository scope, requirements, and infrastructure |
Collocates all data, data sets, and contextual records on removable storage devices, hard drives, cloud storage, and network servers for retention | Centralizes a point of upload and management of files |
Consults with appropriate institutional offices about intellectual property and distribution rights, as well as applicable data security rules, prior to depositing data and associated records | Enables the creation of intellectual property and rights metadata to clarify the rules governing access to, and use of, deposited files |
Re-organizes and re-names files as necessary to correct deviations from project-established filing structure and file naming conventions | Enables uploads of folders and specifies maximum character length for file names |
Saves files in open formats whenever possible | Accepts broad variety of file formats, which may include SPSS, Stata, R Data, FITS data, Social Network Data, and Data Visualizations |
Performs virus checks and other scans to prevent the deposit of spyware or malware | Performs routine virus checking and restores files to replace corrupted versions; ensures overall integrity of data deposited |
De-identifies data sets for public access as necessary to maintain privacy and confidentiality; deposits both the original and de-identified copies in the repository | Enables users to create accounts and/or set permissions for individual files (access controls) |
Creates metadata for deposited records; adheres to the data entry conventions of selected metadata schema; utilizes controlled vocabularies to populate elements (or fields) required by the repository and specific to a discipline | Provides a data entry template for the depositor that adheres to an accepted metadata schema (such as Dublin Core) and stipulates a minimum amount of metadata be created prior to deposit |
Creates metadata that specifies who created the data, the institution that hosted the research, and the granting agencies that paid for the research to enable citation | Requires a DOI (Digital Object Identifier) or other identifier scheme for persistent citation |
Deposits additional files and revised versions of already-deposited files on an ongoing basis as necessary post-project or to reflect post-project work in a particular area | Offers version control and does not overwrite files with the same name |
Keeps physical records generated as a product of research safe through deposit to an institutional archive or special collections | Maintains the integrity of physical and digital records. Performs bit-level checking, generates checksums, and logs fixity checks, ensuring deposited files remain uncorrupted and usable; makes bit-level copies and saves them to additional servers in different geographic locations in case of natural disaster |
Plans for the cost of depositing and storing data and associated records in a preservation repository | Clearly communicates explicit terms of use and costs/schedule of fees associated with deposit and use of the repository |
When leaving the host institution, appoints a data custodian at the institution where the research was conducted in case data and/or associated records must be withdrawn or transferred to a new repository | Enables the transfer of administrative privileges to manage the deposit |
Different repositories offer different levels of service, support different file formats for deposit, and provide different levels of administrative and user support.
While the pre-deposit obligations of the researcher might seem overwhelming, much of the planning can be expedited by the creation of a good data management plan. Creating a data management plan requires researchers to understand their obligations to their institutions and sponsors, designate data collectors, and consult with librarians, metadata specialists, and archivists on whether their institution has the resources and infrastructure to support long-term management and preservation requirements. Whether you are required to have a plan or not, investing in the process before start-up promotes the cost-effective creation of high-quality, usable, and preservable data.
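The bit-level checking listed in the table above usually means recording a checksum for each file at deposit and recomputing it later to confirm nothing has changed. The sketch below shows that idea using SHA-256 from Python's standard library. The function names are invented for this example, and an actual repository's fixity tooling will differ.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Compute a file's SHA-256 checksum without loading it all into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder):
    """Map each file's path (relative to folder) to its checksum."""
    root = Path(folder)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def fixity_failures(folder, manifest):
    """Return files from the manifest that are now missing or altered."""
    current = build_manifest(folder)
    return sorted(
        name for name, checksum in manifest.items()
        if current.get(name) != checksum
    )
```

A researcher can run `build_manifest` on a folder before deposit and keep the result with the project records; running `fixity_failures` against the same folder later returns an empty list if every file is still present and unchanged.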
Librarians and archivists can help researchers determine and scope the records they must deposit with data sets, which varies by discipline. Principal investigators should consider informing team members of where the data is deposited, particularly when researchers work in multiple locations. Clearly articulating in a data management plan who owns the data, for how long de-identified data and other project records will be accessible to project partners, and where the data will be deposited can improve communications and collegiality, ensure the preservation of the data, and prevent disputes at the close of a project.
Creating good metadata at the point of ingest (the process of adding objects to a preservation repository) promotes the long-term discovery of research assets. Dickmann et al. (2012) write of a direct correlation between the 'benefit of metadata plus research data' and levels of understanding over time:
From the literature review as well as discussions we derived the following: the level of understanding of data produced in a particular research experiment reduces as time progresses. For each researcher, this process is individual. Thus no specific time frame may be defined. In addition, no researcher can share his complete understanding/perception to other researchers. The perception difference increases with organizational distance: working group, department, institution, etc.
Particular attention should be paid to elements common across metadata schemas. Access can be further enhanced through the use of multiple controlled vocabularies, which offer standardized forms of personal names, institutions, research subjects, methodologies, techniques, and equipment used, and so make it easier for researchers to identify records of potential use across collections and repositories. Using terms from two different vocabularies to express the same concept further opens the possibility of the data being located by researchers outside of a specific discipline, and broader terms can also be included to promote access.
While a complete understanding of the technical requirements for certifying a repository as a "preservation repository" is not required, detailed information about those requirements can be found on the following websites:
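As an illustration of elements shared across schemas, the sketch below builds a small record from the fifteen elements of the Dublin Core element set and repeats the `subject` element to carry terms from two vocabularies. The record values are placeholders rather than verified vocabulary entries, and the bare `metadata` wrapper element is an assumption made to keep the example short.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# The fifteen elements of the Dublin Core Metadata Element Set.
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


def build_dc_record(fields):
    """Build an XML record from a dict of element name -> list of values."""
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")  # hypothetical wrapper element
    for name, values in fields.items():
        if name not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {name}")
        for value in values:
            ET.SubElement(root, f"{{{DC_NS}}}{name}").text = value
    return root


record = build_dc_record({
    "title": ["Stream temperature readings"],
    "creator": ["Example, Researcher"],
    # Placeholder terms standing in for entries from two different vocabularies.
    "subject": ["Water temperature", "Stream ecology"],
    "rights": ["CC BY"],
})
print(ET.tostring(record, encoding="unicode"))
```

Because elements such as `subject` can repeat, a depositor can add a discipline-specific term and a broader one side by side without choosing between them.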
These requirements include, but are not limited to:
Repositories for the preservation of data can be free, fee-based, or institutionally hosted, and either open or closed to public access. While depositing to an open access repository is a necessity for government-sponsored research, data can be preserved in closed access storage services as long as the host institution is provided unfettered access to accounts.
The benefits of commercial repositories may include:
Disadvantages may include:
Fees associated with institutional repositories are generally storage-based, not service-based. While an institution's staff can help with the deposit of research data to the institution's repository for free, a designated department usually pays the annual data storage fee.
Much of this guide was adapted from the New England Collaborative Data Management Curriculum.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.