Researchers are eager to store their data safely | Renate Mattiszik
Where and how do researchers best store their research data during the research? How can they best deal with backups and version control? How can they exchange research data with others? How can they protect research data from accidental loss and from unauthorized manipulation? In this section we give a global overview of the possibilities.
The challenges of data storage
The two infographics The evolution of data storage (GoCanvas, 2014) and The history of digital storage (Mashable, 2011) show how short-lived storage media, the carriers of information, can be. A researcher may once have thought that backing up research data on a USB stick was a sound approach, but how long will such sticks be around? Will you still be able to retrieve the data on such a stick? Not all laptops have a USB port, for example. And even if the data can be retrieved from the stick, can it still be read by the software in use? And how do you prevent your data from falling into the wrong hands if the stick is lost? There are plenty of data horror stories (Pinboard, n.d.) that make the risk of data loss all too visible.
Research data can become unreadable in roughly two ways:
The loss of bits
The information carrier deteriorates to the point that bits, the sequence of zeros and ones, change spontaneously. Informally, this is called bit rot. Bits can also be lost through a virus, fire, accidental deletion of files, or loss of the carrier itself.
To ensure that the sequence of zeros and ones remains intact, you can take the following measures (Digital Heritage Network, n.d.):
- Maintaining on-site and off-site backups;
- Regularly performing a virus check;
- Copying files to new storage media;
- Regularly checking the data integrity with a checksum (TRACKS, n.d.).
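The last measure, checking data integrity with a checksum, could be sketched roughly as follows in Python. This is a minimal illustration, not a prescribed tool: the function names `sha256_of`, `build_manifest` and `changed_files` are hypothetical, and the choice of SHA-256 is an assumption, since the text does not name a specific checksum algorithm.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder):
    """Map the relative path of every file under `folder` to its checksum."""
    root = Path(folder)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(folder, manifest):
    """List files from a stored manifest that are now missing or altered."""
    current = build_manifest(folder)
    return sorted(
        name for name, checksum in manifest.items()
        if current.get(name) != checksum
    )
```

A possible routine would be to store the manifest next to a backup when the backup is made, and to call `changed_files` on the backup folder at regular intervals. An empty list suggests that the files are still bit-for-bit identical to the copies that were originally stored.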
The loss of the display capability
Research data can no longer be displayed if the required combination of operating system, hardware and application no longer exists, can no longer be used, or cannot be imitated. To reduce the likelihood of losing display capability, the following measures can be taken:
- Storing data in open data formats;
- Storing the software and documentation used or developed;
- Mimicking outdated software and hardware environments so that old files can still be used. This last strategy is called emulation and is a lot more complicated and expensive than the previous two.
Ricardo Seguel is one of the researchers (4TU.Centre for Research Data, n.d.) who, after completing his research, archived his prototype software tool at 4TU.Centre for Research Data in addition to his data. In this way, he not only keeps his data readable, but also enables other researchers to repeat his experiments.
Storage strategy
If you want to keep data readable and usable during the research, it is important to think carefully about a storage strategy. The following questions are important:
- How big is the dataset?
- Is it ‘active’ data?
- For what period should the dataset be stored?
- Does the software also need to be saved?
- Is it privacy-sensitive or confidential data?
- Who needs access when? Are these datasets that multiple researchers from multiple institutions should be able to work on?
- How often should the data be backed up?
- What precautions are needed to protect the data from loss?
- Does the data need to be encrypted?
CESSDA (n.d.a.) has compiled a comprehensive overview of the advantages and disadvantages of the different types of solutions.
Options for data storage during the research in NL
For storage of individual data and backup during the research, solutions are available on local (network) disks within most institutions. However, researchers often also want to share the data and/or want to collaborate on the data with others from outside their own institution. The illustration below shows a number of solutions that are used in the Netherlands, subdivided according to the goal that researchers have with the data.
- Save data
SURFdrive (SURF, n.d.a.) is used by many researchers in the Netherlands for personal storage.
- Collaborate on data
- Figshare for institutions
The University of Amsterdam (UvA) and the Amsterdam University of Applied Sciences (AUAS) offer their researchers Figshare (UvA, 2017). Researchers can safely store their research data in the tailor-made Figshare environment (Figshare, n.d.) during the study and share it with other researchers. Upon completion of their research, researchers can publish and archive their research data using the same system.
- Research Drive
In the next section you can read an interview about how Saxion has embedded Research Drive from SURF (n.d.b.) in its research chain. In Research Drive, a data steward or principal investigator manages and monitors the project environment: managing users, granting rights and permissions, handing out quotas, transferring data, and closing the project environment when a research project is completed. These options are available in Research Drive but not in SURFdrive.
- DataverseNL
DataverseNL (DANS, n.d.) is used by Avans University of Applied Sciences and several universities in the Netherlands. In a case study on the website of the Vrije Universiteit Amsterdam (2019), Assistant Professor Sander Groffen of the Department of Functional Genome Analysis (VU, Science/VUmc) explains how he uses Dataverse to store, share and archive data.
- Send data
SURFfilesender (SURF, n.d.c.) is used by many Dutch researchers for the secure transmission of data.
An advantage of the above solutions is that the data is stored in the Netherlands. The GDPR stipulates that personal data may only be stored within the European Economic Area (European Union, 2016). A service such as Dropbox (n.d.), where the data is stored in the US, does not comply with this.
In addition to these ‘national solutions’, B2drop (EUdat, n.d.) also offers cloud storage at a European level.
The solutions for long-term storage are discussed in Chapter IV. You will see that some solutions apply both during and after the research.
In the spotlight
Course to teach researchers to store and share their software code
Module 5 of the Open Science MOOC teaches researchers to store and share their software code in three steps (Tennant, 2018).
Versioning tips
If the data is constantly being worked on, it makes sense to introduce a form of version control that lets you track the changes. The simplest form of version control is to add a version number to the end of the file name after each major change, for example experiment_021213_v2.doc.
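For those who script their file handling, the numbering convention above could be automated along these lines. This is only a sketch: the helper name `next_version_name` is hypothetical, it assumes the version is written as `_v` followed by a number just before the file extension, and it assumes that a file without such a suffix should receive `_v1`.

```python
import re
from pathlib import Path

# Matches a trailing "_v<number>" at the end of the file name (without extension).
VERSION_SUFFIX = re.compile(r"_v(\d+)$")


def next_version_name(filename):
    """Return the file name with its version number increased by one."""
    path = Path(filename)
    match = VERSION_SUFFIX.search(path.stem)
    if match:
        number = int(match.group(1)) + 1
        stem = path.stem[: match.start()]
    else:
        # Assumption: an unnumbered file gets "_v1" appended.
        number = 1
        stem = path.stem
    return str(path.with_name(f"{stem}_v{number}{path.suffix}"))


print(next_version_name("experiment_021213_v2.doc"))  # experiment_021213_v3.doc
```

Saving each major change under the name returned by such a helper, instead of overwriting the previous file, keeps earlier versions available for comparison.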
You can also apply a form of version control within one file. In the Data Documentation section, you can read a case in which a researcher includes version control in her data files by adding a ‘version control’ tab.
Some programs and virtual research environments have their own automatic form of version control. For example, when working with code/software, it makes sense to use a tool such as GitHub (n.d.), Git (n.d.) or SVN (Apache, n.d.). On the weblog Backlog there is a comparison between Git and SVN (Backlog, 2018).
Need more tips?
- See the section of the CESSDA Data Management Expert Guide (n.d.b.) on version control.
Tips to keep data safe