Data Storage & File Naming: Data Management Planning
- Data Sensitivity Framework (NCSU OIT)
- Long-term Data Storage
- What should I focus on when organizing data?
- How should I approach naming my files?
- What are the issues around file formats?
- How do I keep track of changes?
NCSU's Office of Information Technology (OIT) has developed a series of guides within the Data Sensitivity Framework to assist in the security and integrity of university data, including research data.
NCSU's Office of Information Technology (OIT) Shared Services group has developed text that you can adapt for your specific project to address long-term storage issues related to data management planning.
For more information about data storage options at NCSU contact:
The following text can be adapted for your project to address NCSU-specific resources for long-term storage:
"Long term data storage is available from NC State's Office of Information Technology Shared Services group. Data is stored on a highly scalable, resilient (no single point of failure) storage system. Data is backed up at a data center ~15 miles from the data center where the storage system is located.
Access to data is provided using web servers, ftp servers, or iRODS (Integrated Rule-Oriented Data System) data grid as appropriate for the data types being accessed. Data access servers are provided using virtual servers provisioned in NC State's Virtual Computing Lab (VCL) environment. Current storage system is built using IBM's SONAS (Scale Out Network Attached Storage) storage system. SONAS provides the capability to independently scale data input/output throughput and storage capacity. SONAS storage system can be upgraded/expanded without being shut down. All SONAS elements involved in data access are redundant with automatic failover. NC State's existing system has 360TB capacity and could be expanded to more than 14PB using currently available disks. The SONAS storage system is located on NC State's campus in a secure data center with battery-based uninterruptible power supply and standby diesel generator.
Data stored on the SONAS system is backed up to a tape library located in a data center at MCNC in Research Triangle Park (approximately fifteen miles from NC State's campus). MCNC operates the North Carolina Research and Education Network (NCREN) and has extensive fiber network across North Carolina including a multi-pair fiber ring connecting NC State, MCNC, UNC-Chapel Hill, and Duke University in the Research Triangle region. A dedicated connection on a dense wavelength division multiplexed lambda between NC State and MCNC is utilized for the backup traffic. Utilizing NC State's VCL, various methods of access to the data can be provided based on what is appropriate. Web servers - either centrally managed shared web server or research group managed dedicated web server - providing access using http, https, or ftp protocols or iRODS server providing data grid access are current options available for off campus data access."
There are some fundamental decisions that you need to make when you start your research, and how to organize your data should be one of them. The choices that you make will vary based on the type of research that you do, but everyone must address the same issues:
- File Version Control
- Directory Structure/File Naming Conventions
- File Naming Conventions for Specific Disciplines
- File Structure
- File Structure for Backups
- Be consistent.
- Have conventions for naming (1) the directory structure, (2) folders, and (3) files
- Always include the same information (e.g., date and time)
- Retain the order of information (e.g., YYYYMMDD, not MMDDYYYY)
- Document your file naming conventions so that other users will understand the structure of your file names and any abbreviations or codes you might use.
- Be descriptive so others can understand your meaning. Include other relevant information such as:
  - Unique identifier (e.g., project name or grant number in the folder name)
  - Project or research data name
  - Conditions (lab instrument, solvent, temperature, etc.)
  - Run of experiment (sequential)
  - Date (in file properties too)
- Use the application-specific three-letter file extension (e.g., MOV, TIF, WRL)
- Keep track of versions
  - Use a sequential numbering system: v1, v2, v3, etc.
  - Don't use confusing labels: revision, final, final2, etc.
  - Consider version control software, if applicable
- Record all changes, no matter how small
- Discard obsolete versions (but never the raw copy)
- Use auto-backup instead of self-archiving, if possible
File Name Example
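As a minimal sketch of how the conventions above might be combined, the Python function below assembles a name from a project identifier, a condition, a run number, a date in YYYYMMDD order, and a sequential version. The field order, the underscore separator, and the sample values (SoilNMR, 25C methanol) are assumptions made for illustration, not a convention prescribed by this guide.

```python
import re
from datetime import date


def build_file_name(project, condition, run, version, day, extension):
    """Assemble project_condition_runNN_YYYYMMDD_vN.ext (hypothetical layout)."""

    def clean(text):
        # Replace anything other than letters and digits with a hyphen,
        # so underscores stay reserved as field separators.
        return re.sub(r"[^A-Za-z0-9]+", "-", text).strip("-")

    parts = [
        clean(project),
        clean(condition),
        f"run{run:02d}",         # zero-padded so runs sort correctly
        day.strftime("%Y%m%d"),  # year first, so names sort by date
        f"v{version}",           # sequential version: v1, v2, v3, ...
    ]
    return "_".join(parts) + "." + extension.lstrip(".")


# Sample values are invented for this sketch.
print(build_file_name("SoilNMR", "25C methanol", 3, 2, date(2012, 4, 9), "TIF"))
```

Whatever layout you settle on, the point of a helper like this is that every file gets the same fields in the same order.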
File Renaming Applications
If you have many files already named and need to revise your naming system, you might consider using a file renaming application.
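Where a dedicated application is not at hand, a short script can serve the same purpose. The sketch below assumes, purely for illustration, that the old names contain dates written as MMDDYYYY and that the goal is to rewrite them as YYYYMMDD. It only builds a renaming plan and checks it for collisions; it does not touch any files.

```python
import re

# Assumed legacy pattern: an 8-digit MMDDYYYY date with a year in 19xx or 20xx.
# Dates already written as YYYYMMDD do not match, because "19" and "20"
# are not valid months.
LEGACY_DATE = re.compile(
    r"(?<!\d)(0[1-9]|1[0-2])(0[1-9]|[12]\d|3[01])((?:19|20)\d{2})(?!\d)"
)


def plan_renames(names):
    """Map each old name to a new name with its date rewritten as YYYYMMDD."""
    plan = {}
    for old in names:
        new = LEGACY_DATE.sub(lambda m: m.group(3) + m.group(1) + m.group(2), old)
        if new != old:
            plan[old] = new

    # Refuse to continue if a new name clashes with a file that is kept
    # as-is or with another new name.
    taken = set(names) - set(plan)
    for new in plan.values():
        if new in taken:
            raise ValueError(f"{new} would be used twice")
        taken.add(new)
    return plan


# File names are invented for this sketch.
print(plan_renames(["soil_04092012_v1.csv", "soil_20120410_v1.csv"]))
```

Once the plan has been reviewed, it could be applied with `os.rename(old, new)` for each pair, ideally on a copy of the data first.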
One favorite saying is that the best part about standards is that there are plenty to choose from. This holds true for file formats, so it is important to think carefully about which file format will be best for long-term preservation and continued access to your data.
Formats most likely to be accessible in the future are:
- Non-proprietary and not tied to a specific piece of software
- Open, documented standard
- Common, used by the research community
- Standard representation (ASCII, Unicode)
Here are some examples of preferred formats:
- PDF, not Word
- CSV, not Excel
- MPEG-4, not QuickTime
- TIFF or JPEG2000, not GIF or JPG
- XML or RDF, not RDBMS
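The preferences listed above can be turned into a quick check over a folder listing. The sketch below flags files whose extension suggests one of the less preferred formats and reports the suggested alternative. The mapping from file extensions to formats is an assumption made for this example; the guide itself names formats, not extensions.

```python
from pathlib import PurePath

# Suggested alternatives come from the list above; which extensions
# correspond to which format is an assumption of this sketch.
PREFERRED = {
    ".doc": "PDF",
    ".docx": "PDF",
    ".xls": "CSV",
    ".xlsx": "CSV",
    ".mov": "MPEG-4",
    ".gif": "TIFF or JPEG2000",
    ".jpg": "TIFF or JPEG2000",
    ".jpeg": "TIFF or JPEG2000",
}


def flag_formats(names):
    """Return {file name: suggested format} for files worth converting."""
    flagged = {}
    for name in names:
        suffix = PurePath(name).suffix.lower()  # compare case-insensitively
        if suffix in PREFERRED:
            flagged[name] = PREFERRED[suffix]
    return flagged


# File names are invented for this sketch.
print(flag_formats(["notes.DOCX", "runs.csv", "scan.jpg"]))
```

A check like this only looks at the name, so it is a prompt for review rather than proof of what a file actually contains.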
If your research involves more than one person, tracking changes is a critical element. As you think through how to manage this step, keep the following issues in mind.
- Record every change to a file, no matter how small
- Use file naming conventions (see above)
- Note changes in headers inside the file
- Note changes in log files
- Availability of version control software (e.g., SVN)
- Availability of file sharing software (e.g., Google Docs or Amazon S3)
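For projects without version control software, a plain-text change log is one simple way to record every change. The sketch below formats one tab-separated log line per change, with a timestamp, the file name, a short SHA-256 checksum of the file contents, the author, and a note. The field layout, the function names, and the sample values are assumptions made for illustration.

```python
import hashlib
from datetime import datetime, timezone


def log_entry(file_name, data, author, note, when=None):
    """Return one tab-separated change-log line (hypothetical layout)."""
    # Normalize to UTC so entries from different machines are comparable.
    when = (when or datetime.now(timezone.utc)).astimezone(timezone.utc)
    stamp = when.strftime("%Y-%m-%dT%H:%M:%SZ")
    # A short checksum ties the entry to the exact contents that were saved.
    digest = hashlib.sha256(data).hexdigest()[:12]
    return "\t".join([stamp, file_name, digest, author, note])


def append_entry(log_path, line):
    """Append a line to the log; earlier entries are never rewritten."""
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(line + "\n")


# Sample values are invented for this sketch.
print(log_entry("SoilNMR_run03_20120409_v2.csv", b"depth,ppm\n1,0.4\n",
                "jdoe", "corrected units in column 2"))
```

Keeping the log append-only means the history of a file can be read from top to bottom, even when the file names themselves carry only a version number.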