Research Data Management: Organize Your Data

Organize Your Data

Large research projects can generate a massive number of data files. Short, descriptive file names and a simple file hierarchy make these files easier to navigate and locate. Once you create, gather, or start manipulating data and files, they can quickly become disorganized. To save time and prevent errors later on, you and your colleagues should decide how you will name and structure files and folders. Including documentation (or 'metadata'), data logs, and other related material will add context to your data so that you and others can understand it in the short, medium, and long term. This work in turn helps research teams and collaborators work more effectively and efficiently throughout the entire research life cycle.

Using Metadata 

Metadata is structured information that describes, explains, locates, or otherwise represents something else (NISO, 2004). In the early stages of research, metadata can help keep facets of the research project organized and meaningful to the individual scholar or research team (e.g. agreed-upon required fields, meaningful field labels, etc.). When it comes time to share your research, metadata is used to describe data so that other researchers can find it and use it appropriately. Due to the diverse needs of researchers who work with data, there are many different metadata standards to choose from. The support services for your chosen data repository should be able to help you determine what metadata to record.

The most common metadata standards used for data management are Dublin Core and the DDI (Data Documentation Initiative). Other common standards include:

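  • CASRAI (Consortia Advancing Standards in Research Administration Information): describes research administration information
  • NAP (North American Profile of ISO 19115): describes geospatial data
  • EML (Ecological Metadata Language): describes ecological data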

The UK Digital Curation Centre (DCC) maintains a comprehensive list of metadata standards to help you find the most appropriate standard for your research data: http://www.dcc.ac.uk/resources/metadata-standards.
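As an illustration of what a metadata record can look like in practice, the following minimal sketch writes a handful of Dublin Core elements as XML using Python's standard library. The element names (title, creator, date, format, rights) and the namespace come from the Dublin Core element set; the wrapper element name `metadata`, the function name, and the sample values are assumptions made up for this example, and your repository may require a different layout.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set namespace


def dublin_core_record(fields: dict[str, str]) -> str:
    """Serialize a flat dict of Dublin Core elements to an XML string."""
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")  # wrapper element name is an assumption
    for name, value in fields.items():
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical dataset description, for illustration only.
record = dublin_core_record({
    "title": "Hip mobility questionnaire responses",
    "creator": "Example Research Team",
    "date": "2014-04-09",
    "format": "text/csv",
    "rights": "CC BY 4.0",
})
print(record)
```

Agreeing on the required fields early, as described above, means every file in the project can be described the same way when it is time to deposit the data.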

File Formats

A computer file format is a particular way of encoding information within a computer file so that it can be recognized by an application. File formats are indicated by the file name extension, usually a full stop followed by three or four letters (for example, .csv or .docx).

Open File Formats

Open File Formats can be used by anyone. Choose Open File Formats in order to:

  •  increase your ability to open and read your files in the future
  •  make your data accessible to more researchers immediately

Because the file specifications are publicly available, the open-source software community can ensure that data stored in these file formats remain accessible over the long term.

Proprietary File Formats (.docx, .raw, .DWG, .PSD)

Proprietary File Formats often work only with software provided by the vendor. File specifications are not freely available, so when the software is no longer supported, files in that format can become difficult or impossible to read.

Recommended File Formats (.TIFF, .PDF, .XML, .MP3)

  • XML, CSV
  • E-Books: EPUB
  • Images: JPG, PNG, PDF, TIFF, BMP
  • Sound: MP3, FLAC
  • Text: TXT, CSV, PDF/A, ASCII, UTF-8
  • Video: MPG, MOV, AVI
  • Spreadsheets: CSV
  • Medical Images: DICOM


Note: Some research disciplines and industries treat a specific proprietary file format as a de facto standard which you may wish to follow.

 

Source: UBC Library.
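One way to apply the recommended formats listed above is to scan a project for files that may need converting. The sketch below is illustrative only: the `OPEN_EXTENSIONS` set is an assumed, non-exhaustive mapping from the recommended formats to common extensions, and the function and file names are hypothetical. Adjust the set if your discipline treats a proprietary format as a de facto standard.

```python
from pathlib import Path

# Assumed mapping from the recommended formats to common extensions (not exhaustive).
OPEN_EXTENSIONS = {
    ".xml", ".csv", ".epub", ".jpg", ".png", ".pdf", ".tiff", ".bmp",
    ".mp3", ".flac", ".txt", ".mpg", ".mov", ".avi",
}


def needs_conversion(names):
    """Return the file names whose extension is not in OPEN_EXTENSIONS."""
    return [n for n in names if Path(n).suffix.lower() not in OPEN_EXTENSIONS]


# Hypothetical file names, for illustration only.
files = [
    "survey_20140409_v01.csv",
    "scan_20140409_v01.PSD",
    "notes_20140409_v01.txt",
]
print(needs_conversion(files))  # ['scan_20140409_v01.PSD']
```

The comparison lowercases the extension, so `.PSD` and `.psd` are treated alike.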

Logical File Naming

Consistent and thoughtful file naming will help you and your colleagues avoid frustration and work more efficiently. Establishing a naming convention provides consistency, which makes it easier to find and correctly identify your files and helps prevent version control problems when working on files collaboratively. It is wise to develop a logical structure in cooperation with your collaborators at the start of a project.

The following is an example of a standard approach to file naming:

Denote dates in YYYYMMDD format

DO: Use 20140403

DON’T: Use 04032014

BECAUSE: When file names are sorted alphanumerically, YYYYMMDD dates fall in chronological order.

Use a short unique identifier (e.g. Project Name or Grant #)

DO: CHHM

DON’T: Centre for Hip Health and Mobility

BECAUSE: Short filenames prevent the need for side scrolling and column adjustment.

Include a summary of content (e.g. Questionnaire or GrantProposal) as part of the file name

DO: FileNm_Guidelines_20140409_v01.docx

DON’T: FileNm_20140409.docx

BECAUSE: Files will be easier to find.

Use _ as delimiters. Avoid spaces and these special characters: & , * % # ( ) ! @ $ ^ ~ ‘ { } [ ] ? < > –

DO: FileNm_Guidelines_20140409_v01.docx

DON’T: FileNm Guidelines 2014 04 09 v01.docx

BECAUSE: Different computer systems handle spaces and special characters differently, which can affect filing order, among other things.

Keep track of document versions either sequentially (e.g. v01, v02) or with a unique date and time (e.g. 20140403_1800)

DO: FileNm_Guidelines_20140409_v01.docx

DON’T: FileNm_Guidelines_20140409_Review.docx AND FileNm_Guidelines_20140409_Investigation.docx

BECAUSE: Two years from now, you won’t remember what you meant.

Make folder hierarchies as simple as possible

DO: F:/Env/LIBR/DataMgmt_FileFormats_20140409_v01.docx

DON’T: F:/Environment/Library/Woodward/Data//Mat/Draft6/2014/-DataMgmt_FileFormats_20140409_v01.docx

BECAUSE: Complex folder hierarchies are harder to navigate and offer more opportunities for filing errors. System back-ups may take longer.

Source: UBC data management planning documentation
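The naming rules above can be sketched in a few lines of code. This is a minimal illustration rather than part of the source guidance: the field order (identifier, content summary, date, version) mirrors the DO examples, while the function names, the two-digit version padding, and the regular expression are assumptions. Note that `sanitize` uses an allow-list, so it also replaces hyphens and accented letters, which is stricter than the character list above.

```python
import re
from datetime import date

# Assumed layout: <ProjectID>_<Content>_<YYYYMMDD>_v<NN>.<ext>
NAME_PATTERN = re.compile(r"[A-Za-z0-9]+_[A-Za-z0-9]+_\d{8}_v\d{2}\.[A-Za-z0-9]+")


def sanitize(text: str) -> str:
    """Replace runs of anything other than letters, digits, '_' and '.' with one '_'."""
    return re.sub(r"[^A-Za-z0-9_.]+", "_", text).strip("_")


def build_name(project: str, content: str, day: date, version: int, ext: str) -> str:
    """Assemble a file name from its parts, zero-padding the version to two digits."""
    return f"{project}_{content}_{day:%Y%m%d}_v{version:02d}.{ext}"


def is_valid(name: str) -> bool:
    """Check the shape of a name; the date digits are not checked against a calendar."""
    return NAME_PATTERN.fullmatch(name) is not None


name = build_name("FileNm", "Guidelines", date(2014, 4, 9), 1, "docx")
print(name)                                       # FileNm_Guidelines_20140409_v01.docx
print(is_valid(name))                             # True
print(is_valid("FileNm_20140409.docx"))           # False: no content summary or version
print(sanitize("Grant #12 (draft) & notes.txt"))  # Grant_12_draft_notes.txt

# Plain alphabetical sorting puts YYYYMMDD names in chronological order.
print(sorted(["20150115_notes.txt", "20131201_notes.txt", "20140403_notes.txt"]))
```

A check like `is_valid` can be run over a shared folder from time to time to catch names that drift from the agreed convention.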

Version control is the way to track revisions of a data set or a process. If your research involves more than one person, it is essential. Aim to record every change to a file, no matter how small. Track those changes through your file naming convention, a log file, or version control software.

You can track versions manually by including a version indicator in the file name, such as v01, v02, or v1.4. The standard convention is to use whole numbers for major revisions and decimals for minor ones.
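The major/minor convention can be made concrete with a small helper. The sketch below assumes labels of the form v<major>.<minor> (such as v1.4); the function name `bump` is hypothetical, and zero-padded sequential labels such as v01 would need different handling.

```python
def bump(version: str, major: bool) -> str:
    """Return the next version label for a major or a minor revision."""
    whole, _, minor = version.lstrip("v").partition(".")
    if major:
        # A major revision moves to the next whole number and resets the minor part.
        return f"v{int(whole) + 1}.0"
    # A minor revision increments the part after the decimal point.
    return f"v{int(whole)}.{int(minor or 0) + 1}"


print(bump("v1.4", major=False))  # v1.5
print(bump("v1.4", major=True))   # v2.0
print(bump("v1.9", major=False))  # v1.10
```

The last line shows a common point of confusion: the part after the point is a counter, not a decimal fraction, so v1.10 comes after v1.9.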

Several software programs are designed for tracking versions, including Mercurial, TortoiseSVN, Apache Subversion, Git, and SmartSVN.

File sharing software can also be used to track versions; Google Docs, for example, records version changes.

As you think through how to manage this step, keep the following issues in mind:

  • record every change to a file, no matter how small
  • keep track of changes to files
  • use file naming conventions
  • consider how headers are used inside the file
  • understand how log files are used
  • use, or investigate the use of, version control software (Git, Subversion, Mercurial)
  • use, or investigate the use of, file sharing software (Google Docs)

Source: The University of Virginia Library
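For teams that keep a manual log file rather than using version control software, the following sketch shows one possible shape for a change log entry. Everything here is an assumption made for illustration: the tab-separated layout, the UTC timestamp format, the shortened SHA-256 checksum, and the function name `log_change`.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def log_change(data_file: Path, log_file: Path, author: str, note: str) -> str:
    """Append one tab-separated entry describing the current state of data_file."""
    # The checksum lets you confirm later which exact file contents an entry refers to.
    digest = hashlib.sha256(data_file.read_bytes()).hexdigest()
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M")
    entry = "\t".join([stamp, data_file.name, digest[:12], author, note])
    with log_file.open("a", encoding="utf-8") as fh:
        fh.write(entry + "\n")
    return entry
```

Calling `log_change(Path("CHHM_Survey_20140409_v01.csv"), Path("CHANGELOG.tsv"), "AB", "initial export")` after each edit appends one line to the log, so the history of a file can be read top to bottom. The file and author names in that call are hypothetical.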