Formats & Data Organization
- What should I focus on when organizing data?
- How should I approach naming my files?
- What are the issues around file formats?
- How do I keep track of changes?
You need to make some fundamental decisions when you start your research, and data organization should be among them. The choices you make will vary with the type of research you do, but everyone must address the same issues.
- File Version Control
- Directory Structure/File Naming Conventions
- File Naming Conventions for Specific Disciplines
- File Structure
- File Structure for Backups
- Be consistent.
- Have conventions for naming (1) Directory structure, (2) Folder names, (3) File names
- Always include the same information (e.g., date and time)
- Retain the order of information (e.g., YYYYMMDD, not MMDDYYYY)
- Document your file naming conventions so that other users will understand the structure of your file names and any abbreviations or codes you use.
- Be descriptive so others can understand your meaning. Include other relevant information such as:
  - Unique identifier (e.g., Project Name or Grant Number in folder name)
  - Project or research data name
  - Conditions (Lab instrument, Solvent, Temperature, etc.)
  - Run of experiment (sequential)
  - Date (in file properties too)
- Use the application-specific 3-letter file extension (e.g., MOV, TIF, WRL)
- Keep track of versions
  - Use a sequential numbered system: v1, v2, v3, etc.
  - Don't use confusing labels: revision, final, final2, etc.
  - Consider version control software, if applicable
- Record all changes -- no matter how small
- Discard obsolete versions (but never the raw copy)
- Use auto-backup instead of self-archiving, if possible
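The sequential numbering advice above can be sketched in a few lines of Python. This is only an illustration: the helper name `next_version_name` and the `_vN` suffix placed just before the extension are assumptions, not a convention prescribed by this guide.

```python
import re

# Assumed pattern: the name ends in "_v<number>" before the extension,
# e.g. "survey_v2.csv". Labels such as "final" do not match on purpose.
VERSION_RE = re.compile(r"^(?P<stem>.+)_v(?P<num>\d+)(?P<ext>\.[^.]+)$")


def next_version_name(existing_names, stem, ext):
    """Return the next sequential name for `stem`, given the names already in use."""
    highest = 0
    for name in existing_names:
        match = VERSION_RE.match(name)
        if match and match.group("stem") == stem and match.group("ext") == ext:
            highest = max(highest, int(match.group("num")))
    return f"{stem}_v{highest + 1}{ext}"


if __name__ == "__main__":
    names = ["survey_v1.csv", "survey_v2.csv", "survey_final.csv"]
    print(next_version_name(names, "survey", ".csv"))
```

Because the number is computed from the files that already exist, no one has to remember which version came last, and ambiguous labels like "final" are simply ignored.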
File Name Example
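As a hypothetical illustration of the conventions listed above, the sketch below assembles a file name from its components in a fixed order. The grant number `NSF1234`, the field order, and the helper `build_file_name` are all invented for this example; adapt them to whatever convention your project documents.

```python
from datetime import date


def build_file_name(project, description, condition, run, day, version, ext):
    """Join the name components in a fixed order, separated by underscores."""
    parts = [
        day.strftime("%Y%m%d"),   # YYYYMMDD sorts chronologically
        project,
        description,
        condition,
        f"run{run:02d}",          # zero-padded so run02 sorts before run10
        f"v{version}",
    ]
    # Replace spaces with hyphens; spaces in file names often trip up scripts.
    stem = "_".join(part.replace(" ", "-") for part in parts)
    return f"{stem}.{ext.lower()}"


if __name__ == "__main__":
    # Prints: 20150601_NSF1234_soil-pH_25C_run03_v2.csv
    print(build_file_name("NSF1234", "soil pH", "25C", 3, date(2015, 6, 1), 2, "CSV"))
```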
File Renaming Applications
If you have many files already named and need to revise your naming system, consider using a bulk file renaming application.
As of OS X Yosemite, Mac users can do bulk renaming of files from Finder without an external program.
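If you would rather script a bulk rename than use an application, a minimal Python sketch might look like the following. The function names and the prefix-swap rule are assumptions made for illustration. The rename is split into a planning step and an applying step so that you can review the proposed names before any file is touched.

```python
from pathlib import Path


def plan_renames(folder, old_prefix, new_prefix):
    """Return (old_path, new_path) pairs without changing any file."""
    plan = []
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.name.startswith(old_prefix):
            new_name = new_prefix + path.name[len(old_prefix):]
            plan.append((path, path.with_name(new_name)))
    return plan


def apply_renames(plan):
    """Carry out a plan, refusing to overwrite a file that already exists."""
    for old_path, new_path in plan:
        if new_path.exists():
            raise FileExistsError(f"{new_path} already exists")
        old_path.rename(new_path)
```

A typical session would call `plan_renames`, print the pairs for review, and only then pass the plan to `apply_renames`. Work on a copy of the folder first, and never rename the raw data in place.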
A popular saying holds that the best part about standards is that there are so many to choose from. This is certainly true of file formats, so it is important to think carefully about which format will best support long-term preservation of, and continued access to, your data.
Formats most likely to be accessible in the future are:
- Non-proprietary and not tied to a specific piece of software
- Open, documented standard
- Common, used by the research community
- Standard representation (ASCII, Unicode)
Here are some examples of preferred formats:
- PDF, not Word
- CSV, not Excel
- MPEG-4, not QuickTime
- TIFF or JPEG 2000, not GIF or JPG
- XML or RDF, not RDBMS
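To make the "CSV, not Excel" and "standard representation" points concrete, here is a small Python sketch that stores tabular data as UTF-8 encoded, comma-separated plain text using only the standard library. The function names and the sample column names in the usage note are assumptions for illustration.

```python
import csv


def write_csv(path, header, rows):
    """Write a header and rows as UTF-8 encoded, comma-separated plain text."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)


def read_csv(path):
    """Read the file back as a list of rows; every value comes back as text."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.reader(handle))
```

For example, `write_csv("readings.csv", ["sample", "temp_c"], [["A1", 21.5]])` produces a file that any text editor or statistics package can open. Note that CSV does not record data types or units, so document those separately.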
If your research involves more than one person, tracking changes is critical. As you decide how to manage this step, keep the following in mind.
- Record every change to a file, no matter how small
- Use file naming conventions (see above)
- Note changes in a header inside the file
- Note changes in a log file
- Availability of version control software (e.g., SVN)
- Availability of file sharing software (e.g., Google Docs or Amazon S3)
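Where version control software is not available, a plain-text change log can cover the "record every change" advice. The Python sketch below is one possible approach rather than a prescribed method: the tab-separated layout, the function names, and the use of a SHA-256 checksum to identify the exact file contents are all assumptions.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def file_checksum(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def log_change(log_path, data_path, author, note):
    """Append one tab-separated line describing a change to `data_path`."""
    timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    entry = "\t".join(
        [timestamp, author, Path(data_path).name, file_checksum(data_path), note]
    )
    # Mode "a" appends, so earlier entries are never overwritten.
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(entry + "\n")
    return entry
```

Calling `log_change` after each edit builds a running record of who changed which file, when, and why. The checksum lets a collaborator confirm that their copy matches the version described in the log.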