The researcher must store the research data and transfer and share it safely throughout the research lifecycle. Research data must always be processed in accordance with the protection and processing instructions of your own organisation (File saving and sharing in UEF, in UEF Intranet, requires UEF login).
Storage and preservation solutions are influenced by the
- level of protection of the data
- size of the data
- possible need to use the data in cooperation between different organisations.
The University of Eastern Finland's Digital Services (DiPa) produces a large part of the IT services and server resources used by researchers. Researchers also have access to a wide range of services provided by CSC - IT Center for Science Ltd. for the processing, storing, and opening of research data (see below, Other services).
Protection levels of research data and processing measures
The content of the research data affects the type of data protection needed. For example, public information can usually be processed, stored, and shared outside the university without special measures. Because the information is public, it cannot, in principle, end up in the wrong hands. In this case, consumer cloud services (e.g., Google Drive, Dropbox, iCloud) are also possible, although they are generally not recommended for work use. Information that is not public should not be handled in consumer cloud services.
Many kinds of data fall into the base protection level, such as anonymized research data, research plans, or personal data that is not sensitive. Such data can be stored and shared in many solutions offered by the university, with certain limitations. Sensitive personal or otherwise confidential data require special protection and are subject to high security requirements.
Detailed instructions on the protection levels and storage solutions to be followed in the UEF can be found in the data processing instructions of the University of Eastern Finland (in UEF Intranet, requires UEF login).
File sharing
Sharing files within UEF is relatively straightforward. Data protection must nevertheless be ensured so that only those with the appropriate access rights can access the shared information. Access rights can be defined when using the disk space of the research groups.
There are plenty of services for sharing research data. They may, for example, be typical for the field of research or depend on a partner. In this context, we refer, above all, to the general services supported by UEF.
Funet FileSender
Large files can be sent to partners outside the university using the information sharing service Funet FileSender. UEF users can access the Funet FileSender service through Haka login (i.e., UEF ID). A user without a UEF ID or other Haka login can also access the service by receiving a so-called Upload voucher invitation from a UEF user.
The service is browser-based and can be used to send files larger than 100 GB. Funet FileSender is not, as such, suitable for sending sensitive data, but a research data file sent using the service can be encrypted. To open an encrypted file, the recipient needs a password from the sender. The password is not stored on the server and is always sent to the recipient separately (for example, as a text message to the phone).
Other services
The IDA storage solution also enables the sharing and storage of research data with various partners. IDA is part of CSC's Fairdata services and is, as a rule, offered free of charge to researchers from Finnish higher education institutions or state research institutes and to other persons working in research. You can start using IDA by contacting the IDA contact person of your home organisation. At UEF, you can do this by contacting the IT services for research (servicedesk@uef.fi).
The pan-European EUDAT service catalogue enables the sharing and storage of research data. For example, EUDAT B2DROP allows users to synchronise their active data across different desktops and to share this data with others. EUDAT B2SHAREBasic is a service for storing, publishing, and sharing research data, and it also provides a persistent identifier (DOI or Handle). The EUDAT catalogue also includes many other services and functionalities, for example for searching for existing research data or for the long-term preservation of research data. The EUDAT catalogue is jointly maintained by numerous higher education institutions and research institutes.
The quality of research data refers to slightly different issues depending on the context. In research data management, quality refers to so-called technical or external factors. The suitability of the data content for the research question is not addressed in this context. It is part of the discussion concerning research methods and theory.
Integrity is another term that is used alongside the quality of the data. In general, integrity refers to the fact that the data are in the form they are designed for. The data have not, for example, accidentally changed, and are thus also useful in the research context.
Ensuring the quality and integrity of research data starts at the planning stage. It is important to consider what can happen during data processing that would weaken the suitability or justification of the research data in terms of the research question or, in the worst case, invalidate the research project.
The data types and the data processing methods naturally affect the quality assurance methods, i.e., what must be considered in, for example, data collection or conversion to another form. These may include calibration of measuring instruments, transcriptions of interview data, or data checksums that reveal deviations in values.
The risks affecting the quality of research data are also prevented by measures pertaining to nearly all research data, such as backups, version control, and data description and documentation (see the sections Backup and version control and Documentation, description and metadata below).
Backup and version control (versioning) are an important part of risk management during research and of the systematic implementation of research data quality management. These measures safeguard the preservation of files and support the comprehensibility of data.
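As a minimal sketch of the checksum idea mentioned above, the following Python snippet computes a checksum for a file and compares it with a value recorded earlier. The function names and the choice of SHA-256 are illustrative assumptions, not a UEF recommendation.

```python
import hashlib
from pathlib import Path


def file_checksum(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the SHA-256 checksum of a file as a hexadecimal string."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in chunks so that large data files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path: Path, recorded_checksum: str) -> bool:
    """Return True if the file still matches the checksum recorded earlier."""
    return file_checksum(path) == recorded_checksum
```

A checksum recorded when the data are collected can be stored alongside the documentation. If `is_unchanged` later returns False, the file content has changed since the checksum was recorded.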
It is a good idea to plan the measures in advance and ensure that all members of the research group are also aware of these measures and responsibilities. Such information should be included in the shared guidelines for the research project and in a place where it can be easily found.
Backup
Backing up protects research data from accidental alteration or destruction and from damage caused by hardware or software failures or by external factors (e.g., hackers, computer viruses, fires, water damage).
Backup measures should take into account, for example,
- routine and regularity
- decentralisation so that not all backups are in the same (physical) location
- the suitability of the backup device and its replacement at regular intervals
- file formats that work during and after the research for as long as necessary.
The storage location of files and data affects the backup process. Although the storage locations provided by the university are usually backed up automatically, it is worth remembering to distribute backups. If the research data are stored, for example, on the hard disk of your personal computer, you must perform the backup yourself.
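For cases where you perform the backup yourself, the following Python sketch copies one file to several backup locations and verifies each copy. The function name and directory layout are hypothetical. In practice, the backup directories should be on different physical devices or in different locations.

```python
import filecmp
import shutil
from pathlib import Path


def back_up(source: Path, backup_dirs: list[Path]) -> list[Path]:
    """Copy one file to every backup directory and verify each copy."""
    copies = []
    for backup_dir in backup_dirs:
        backup_dir.mkdir(parents=True, exist_ok=True)
        target = backup_dir / source.name
        # copy2 copies the content and, where possible, the file timestamps.
        shutil.copy2(source, target)
        # shallow=False compares file contents, not only size and timestamps.
        if not filecmp.cmp(source, target, shallow=False):
            raise OSError(f"Backup copy differs from the original: {target}")
        copies.append(target)
    return copies
```

Running such a script on a regular schedule supports the routine and regularity mentioned in the list above. It does not replace the backup instructions of your own organisation.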
You will find information about the backup of the storage solutions offered by the university in the UEF instructions on information processing (UEF Intranet, requires UEF login).
Version control and file naming
Version control keeps a record of the changes made to the research data. The way version control is implemented depends on the data type. For example, software version control utilises versioning systems, whereas for research data consisting of text files, for example, file naming is a key tool of version control.
Version control is particularly important when several people work with the same research data. Versioning systems typically enable simultaneous work. One example of a versioning system is Git, which is used, for example, on the Microsoft-owned GitHub platform.
It is a good idea to plan the organisation and naming of files so that it supports the monitoring of changes to the data. Such methods include dividing research data into file folders and systematic naming of files within the folders. The file name should include a date that is always marked in the same way (e.g., yyyy-mm-dd: 2022-07-22). The date is used to avoid vague "latest version" entries in file names. The folder structure and file naming description should be included as a separate file (e.g., in a *.txt file format).
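The naming practice described above can be sketched in Python as follows. The name pattern (date, description, version number) and the function name are assumptions chosen for illustration. Only the yyyy-mm-dd date format comes from the text above.

```python
import re
from datetime import date


def dated_file_name(description: str, version: int, extension: str, day: date) -> str:
    """Build a file name that starts with a date in yyyy-mm-dd format."""
    # Lower-case the description and replace every run of characters other
    # than a-z and 0-9 (including spaces and accented letters) with a hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    return f"{day.isoformat()}_{slug}_v{version:02d}.{extension.lstrip('.')}"


# Example: produces "2022-07-22_interview-transcripts_v02.txt"
name = dated_file_name("Interview transcripts", 2, ".txt", date(2022, 7, 22))
```

Because the date comes first and is always written in the same way, sorting the files by name also sorts them by date.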
There are numerous file formats for different purposes. File formats are also constantly being renewed: some go out of use and are replaced by new ones. The longer you work with the same research data, the more important it is to ensure that the files are usable and readable. Special attention must be paid to file formats, especially for long-term preservation and archiving.
As a general instruction, it is recommended that you make at least one copy of the file in a commonly used format. The Ministry of Education and Culture's Open Science and Digital Cultural Heritage entity maintains extensive guidelines on file formats suitable for preservation and transfer, which should be examined especially when planning the long-term preservation of research data.
Different file formats
The file format indicates the structure of the file and often how information is stored in digital format (e.g., PDF - Portable Document Format or TIFF - Tagged Image File Format). This facilitates file interoperability. Some file formats are linked to commercial software, such as Microsoft Office, while others are openly accessible to anyone without commercial links, such as OpenDocument.
Openly accessible file formats are recommended especially for opening research data and/or for preserving it after the research, so that the files can be read using different software without paying for software licences. The file format is indicated by a file extension separated by a dot at the end of the file name.
Common text file formats include
- DOC/DOCX (*.doc, *.docx), which contains text formatting and is familiar from Microsoft Word
- unformatted text stored as TXT (*.txt)
- open file format, OpenDocument Text, ODT (*.odt)
- Comma Separated Values, CSV (*.csv).
Statistical data are often stored in the formats of SPSS software (*.sav) or spreadsheet software (e.g., Excel, *.xls, *.xlsx).
The JPEG format (*.jpg, *.jpeg) is commonly used for image files as it does not take up much space. However, it also does not contain as much information as the TIFF format (*.tiff, *.tif), for example. Formats that record sound, or sound and image, are rather dependent on the systems and are therefore constantly changing. When you want to keep such files usable for a longer period of time, they are often converted to formats such as WAV (*.wav, *.wave) or MPEG (*.mpg).
Conversion and digitisation
Transferring files from one format to another is called converting. Conversion may be necessary if software other than the original is used, for example, because the hardware does not support the original data format. When converting files, data may be lost or corrupted. Conversion should therefore always be well planned so that the loss of data is minimised. Many software programs offer a Save As or Export function for saving a file in another format. There is also separate software for conversion.
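As a small illustration of checking a conversion for data loss, the following Python sketch rewrites a semicolon-separated text file as a comma-separated CSV file and then reads the result back to confirm that every cell is unchanged. The function names and the default delimiters are assumptions made for this example. Real conversions between more complex formats need format-specific checks.

```python
import csv
from pathlib import Path


def read_rows(path: Path, delimiter: str) -> list[list[str]]:
    """Read a delimited text file into a list of rows."""
    with path.open(newline="", encoding="utf-8") as handle:
        return list(csv.reader(handle, delimiter=delimiter))


def convert_delimiter(source: Path, target: Path, old: str = ";", new: str = ",") -> int:
    """Rewrite a delimited file with a new delimiter and verify the result."""
    rows = read_rows(source, old)
    with target.open("w", newline="", encoding="utf-8") as handle:
        csv.writer(handle, delimiter=new).writerows(rows)
    # Read the converted file back and compare it cell by cell with the original.
    if read_rows(target, new) != rows:
        raise ValueError("Converted file does not match the original data")
    return len(rows)
```

The function returns the number of rows converted and raises an error if the converted file does not contain the same values as the original.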
Research data in the form of papers can be converted into digital format by scanning. Even in this case, attention should be paid to quality, such as resolution, colour tones or darkness, so that all necessary information is transferred and can be read or viewed as well as possible. At the same time, however, it should be remembered that the higher the quality of the result, the larger the file, which affects the storage and usage requirements of the file.
Scanning is based on imaging the material, but a text file can also be produced using OCR (Optical Character Recognition) programs. PDF (Portable Document Format) is a widely used file format that maintains the layouts of scanned material well. The PDF/A file format is recommended for archiving. UEF provides a free and secure PDF/A converter, which allows you to convert DOC, DOCX, and PDF documents into the archive-friendly PDF/A format.
Analogue audio recordings and combined video and audio recordings can be converted to digital format using separate devices or devices directly connected to a computer.
For the research data to be findable, understandable, and useful for the researchers themselves and others, they must be enriched with additional information. In this context, we talk about metadata, description, and documentation, which should be planned from the start and implemented throughout the research. This makes it as easy as possible to publish and archive research data at the end of the research. Documentation and metadata are difficult, if not impossible, to create retrospectively. Note that even if you cannot fully open your data for some reason, it is highly recommended to publish the metadata whenever possible. This enhances the visibility of the data.
There are no strict definitions for the terms metadata, description and documentation on the practical action level, which may cause confusion. Documentation can generally refer to the diverse description of research data, and metadata to the different types of information needed to understand and use research data. Metadata can be summarized as information about, for example, the content, technical characteristics, context, structure, origin and conditions of use of research data in a concise form. This includes information such as the creator and owner of the data, variables, terminology, and file formats and size.
A metadata standard or schema refers to consistent and machine-readable metadata. Metadata standards promote the discoverability and usability of research data in many ways. At its simplest, using a metadata standard means filling out a form that follows the structure of a specific metadata standard. In this case, the metadata always takes the same form regardless of who fills out the metadata fields. In this way, metadata resembles the format familiar from publications, which states the title, author, ownership, etc. Thus, the use of a standard is not always conscious, but researchers can utilise metadata standards, for example, when entering information about their research data into a data repository or describing their data using the Qvain tool introduced below.
There are numerous standards. Some are so-called generic metadata standards, such as the commonly used Dublin Core (DC), and others are discipline-specific. Researchers are often directed to use the standards of their own research field. These can be found in lists maintained by the Digital Curation Centre or the Research Data Alliance.
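To make the idea of a machine-readable metadata standard more concrete, the following Python sketch writes a few Dublin Core elements as XML. The element names (title, creator, date, format) and the namespace belong to the Dublin Core element set. The wrapping metadata element, the function name, and all field values are hypothetical examples.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields: dict[str, str]) -> str:
    """Serialise Dublin Core fields as an XML string."""
    # "metadata" is only a hypothetical container element for this sketch.
    root = ET.Element("metadata")
    for name, value in fields.items():
        element = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        element.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical example values, not a real dataset.
record = dublin_core_record({
    "title": "Example interview dataset",
    "creator": "Researcher, Example",
    "date": "2022-07-22",
    "format": "text/csv",
})
```

Because every record uses the same element names, a repository or search service can read the title or creator of any dataset in the same way, regardless of who filled in the fields.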
Furthermore, glossaries, thesauri, and ontologies, which consist of structured and machine-readable concepts, are recommended for describing research data. These are typically utilized in the same way as metadata standards, i.e., the service used can guide the researcher in choosing terms for particular concepts. Because glossaries and ontologies are built on commonly agreed meanings and relationships between terms, they support the quality of the research data's metadata. The researcher can also describe the data with freely chosen keywords, which allows the most suitable and versatile description of the data from the researcher's perspective but does not necessarily promote the findability of the data as such.
Versatile, carefully designed and implemented, and, where possible, standardized metadata is one of the key means of implementing the FAIR principles, according to which data should be
- Findable
- Accessible
- Interoperable
- Re-usable.
You can read more about the FAIR principles on the UEF Data Support website in the section FAIR principles and data management.
Qvain is a web browser-based tool for describing research data. It is part of CSC's Fairdata services. Using Qvain requires the creation of a CSC user account (the instructions can be found here). After that, you can log in to the service by using, e.g., the UEF ID (Haka ID).
From the front page, you can either create a new dataset or edit an existing one. The Qvain User Guide helps you to fill in the required metadata step by step.
Once you have published the metadata, the dataset information can be found in the Etsin service and, through Etsin, in other services as well, such as the UEF eRepo and the national Finnish Research.fi service.