SOP - PI Project Based File Storage Guidelines --PUBLIC
Project Based File Storage Guidelines
Scope
A PREDICTABLE and TRACTABLE file storage plan enables both you and your collaborators to locate and make sense of your data, even after time has passed. The following guidelines and SOP provide "A single home for each file" and an intuitive structure and process so that the files are straightforward to find.
- Note - "A single home for each file" means there is a specific location where your file should be, and avoids problems where you use the wrong file because you are keeping multiple copies of a file in unorganized folders. Keeping multiple copies of a file in unorganized folders is not a substitute for a Data Backup Plan - but that is a topic for a different wiki article.
Guidelines
- Avoid using special characters in folder and file names
- Use underscores and hyphens instead of spaces (spaces can cause difficulties with some scripting/automation approaches)
- Follow the style of the lab for UPPERCASE or lowercase or camelCase naming conventions (helpful due to the case-sensitive nature of many programming languages which may be used for data analysis pipelines and automation)
- If folders are likely to hold similar data produced with different iterations of settings or software versions, append an iteration number to the folder name and place all files from the same iteration into that folder (this is expanded upon in the example below).
- recommend using an underscore and a two-digit iteration number, e.g. AnalyzedData_01, ..., AnalyzedData_42
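The naming guidelines above can be checked mechanically. The following is a minimal sketch, not a lab-mandated tool: the function names are hypothetical, and it assumes "special characters" means anything other than ASCII letters, digits, underscores, hyphens, and periods.

```python
import re

# Assumption: only ASCII letters, digits, underscores, hyphens, and periods
# (for file extensions) are allowed; spaces and everything else are rejected.
_ALLOWED_NAME = re.compile(r"[A-Za-z0-9_.-]+")


def is_valid_name(name: str) -> bool:
    """Return True if the name contains no spaces or special characters."""
    return _ALLOWED_NAME.fullmatch(name) is not None


def iteration_folder_name(base: str, iteration: int) -> str:
    """Append an underscore and a two-digit iteration number to base."""
    if not 1 <= iteration <= 99:
        raise ValueError("iteration must fit in two digits (1-99)")
    return f"{base}_{iteration:02d}"


print(is_valid_name("AnalyzedData_01"))           # True
print(is_valid_name("Analyzed Data (v2)"))        # False
print(iteration_folder_name("AnalyzedData", 42))  # AnalyzedData_42
```

Case conventions (UPPERCASE, lowercase, camelCase) are left to the lab's own style, so the sketch does not check them.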
Procedure
Create a Folder for each project
suggested format:
- [Project Management ID]_[Project Name]
Create appropriate Subfolders for the project
In general, you should have at least DATA and ANALYSIS subfolders
- suggested format:
- [Project Management ID]_[Project Name]_DATA
- [Project Management ID]_[Project Name]_ANALYSIS
Depending on your project, other subfolders may also be appropriate, for example:
- Figures
- References
- Templates
- ...
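The folder creation steps above could be scripted so that every project starts from the same layout. This is a sketch under stated assumptions: `create_project_skeleton` and its parameters are hypothetical names, and the demonstration builds the folders in a temporary directory rather than on a real share.

```python
import tempfile
from pathlib import Path


def create_project_skeleton(root, pm_id, project_name, extra_subfolders=()):
    """Create [PM ID]_[Project Name] with DATA and ANALYSIS subfolders.

    extra_subfolders is an optional list such as ["Figures", "References"].
    Returns the path of the project folder.
    """
    project = f"{pm_id}_{project_name}"
    project_dir = Path(root) / project
    # The two subfolders every project is expected to have.
    for suffix in ("DATA", "ANALYSIS"):
        (project_dir / f"{project}_{suffix}").mkdir(parents=True, exist_ok=True)
    # Optional project-specific subfolders.
    for name in extra_subfolders:
        (project_dir / name).mkdir(parents=True, exist_ok=True)
    return project_dir


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    project_dir = create_project_skeleton(
        tmp, "PM218", "PJ_WildtypeRigRuns", extra_subfolders=["Figures"]
    )
    for child in sorted(project_dir.iterdir()):
        print(child.name)
```

Because `exist_ok=True` is used, rerunning the function on an existing project leaves its current contents untouched.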
Populate your subfolders with additional subfolders (if needed) and a README document to help orient people to the folder's purpose
For some studies, the raw data that come from the instrument should go into the DATA subfolder. However, the extensions of the raw data files ('.txt', '.csv') may be identical to those of raw data files from other sources, so a nested subfolder named for the instrument that generated the files helps tell them apart.
Similarly, some projects involve multiple-step data analysis pipelines where intermediate analyses are generated before a final summary analysis is produced. Additionally, some projects may require iterations of analysis as analysis settings and parameters are adjusted or analysis methods and code are updated. Nested subfolders in an analysis folder should specify the tool used to produce a file, and an iteration number should be assigned to the folder if the analysis is likely to be repeated with different settings or updated code.
A README file (or better yet, a README for each analysis tool and subfolder cluster) that describes the inputs, outputs, code versions, and relevant notes should also be placed in the ANALYSIS subfolder. It is recommended that a README file use a simple plain text format (e.g. txt, md). Within the analysis subfolders, an additional subfolder level for code and settings files should be used as appropriate to track the versions and settings used to produce the outputs.
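Assigning the next iteration number by hand is easy to get wrong once several runs exist. The sketch below shows one way to automate it, assuming folders follow the `<tool>_NN` two-digit pattern from the guidelines; `next_iteration_dir` is a hypothetical helper, and the `code` and `settings` subfolder names are taken from the example that follows.

```python
import re
import tempfile
from pathlib import Path


def next_iteration_dir(analysis_dir, tool, subfolders=("code", "settings")):
    """Create and return the next <tool>_NN folder inside analysis_dir."""
    analysis_dir = Path(analysis_dir)
    pattern = re.compile(rf"{re.escape(tool)}_(\d{{2}})")
    # Collect the iteration numbers already used by this tool.
    used = [
        int(match.group(1))
        for child in analysis_dir.iterdir()
        if child.is_dir() and (match := pattern.fullmatch(child.name))
    ]
    iteration = max(used, default=0) + 1
    if iteration > 99:
        raise ValueError("two-digit iteration numbers are exhausted")
    new_dir = analysis_dir / f"{tool}_{iteration:02d}"
    new_dir.mkdir()
    # Subfolders that track the code and settings used for this iteration.
    for name in subfolders:
        (new_dir / name).mkdir()
    return new_dir


# Demonstration in a throwaway directory with two existing iterations.
with tempfile.TemporaryDirectory() as tmp:
    analysis = Path(tmp) / "PM218_PJ_WildtypeRigRuns_ANALYSIS"
    (analysis / "SASSI_01").mkdir(parents=True)
    (analysis / "SASSI_02").mkdir()
    created = next_iteration_dir(analysis, "SASSI")
    print(created.name)  # SASSI_03
```

The helper only creates folders. Copying the code and settings files into them, and updating the README, remain manual steps in this sketch.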
Example (neonate autoresuscitation study)
- PM218_PJ_WildtypeRigRuns
  - PM218_PJ_WildtypeRigRuns_DATA
    - rigfiles_README.md
      - (This readme markdown file describes that the rigfiles subfolder contains raw data generated from neonate autoresuscitation experiments produced using automated assay rig instruments)
    - rigfiles
      - PM218_R1234_Ply889.txt
      - ...
  - PM218_PJ_WildtypeRigRuns_ANALYSIS
    - PKLGZIP_README.md
      - (This readme markdown file describes that the PKLGZIP cluster of subfolders contains pkl.gzip format files which were generated from the raw data located in ../PM218_PJ_WildtypeRigRuns_DATA/rigfiles. It also indicates that PCC_extractor.py version 1.7 was used to generate that output. Including the date that the files were produced is also a good idea, especially if the project involves analysis of data in the middle of the data collection phase of the study)
    - PKLGZIP_01
      - code (contains the PCC_extractor.py code)
      - PM218_R1234_Ply889.pkl.gzip
      - ...
    - SASSI_README.md
      - (This readme markdown file describes that the SASSI cluster of subfolders contains outputs of the SASSI script. SASSI_01 contains the output produced using SASSI version 88 with the files in PKLGZIP_01 as input; similar information is provided for the other SASSI subfolders)
    - SASSI_01
      - code (contains the Breathe_Easy code, which includes SASSI, used for the SASSI_01 output)
      - settings (contains the settings used for the SASSI run)
      - PM218_R1234_Ply889_all_breathlist.csv
      - PM218_R1234_Ply889_beatlist.csv
      - ...
    - SASSI_02
    - SASSI_03
    - SHAM_README.md
      - (This readme markdown file describes that the SHAM cluster of subfolders contains outputs of the SHAM script. SHAM_01 contains the output produced using SHAM version 36 with the files in PKLGZIP_01 as input; similar information is provided for the other SHAM subfolders)
    - SHAM_01
      - code (contains the Breathe_Easy code, which includes SHAM, used for the SHAM_01 output)
      - settings (contains the settings used for the SHAM run)
    - SHAM_02
    - SHAM_03
    - SHAM_04
    - SavR_README.md
      - (This readme markdown file describes that the SavR cluster of subfolders contains outputs of the SavR script. SavR_01 contains the output produced using SavR version 42 with the files in SHAM_03 and SHAM_04 as inputs, with clarification in case of files with similar names in both of those folders, e.g. RUIDs R2234, R2567, ... were used from SHAM_04 due to the need for those settings to successfully process those files)
    - SavR_01
      - code
      - PM218_summary.pdf
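The README files in the example all record the same kinds of facts: inputs, outputs, code version, date produced, and notes. A small generator can keep them consistent. This is only an illustrative sketch: `render_readme`, the dictionary keys, and the markdown layout are assumptions rather than a required template, and the date is left as a placeholder to fill in.

```python
def render_readme(cluster, entries):
    """Render a markdown README for one cluster of iteration subfolders.

    entries is a list of dicts with the keys folder, inputs, outputs,
    code_version, and produced_on, plus an optional notes key.
    """
    lines = [f"# {cluster} README", ""]
    for entry in entries:
        lines += [
            f"## {entry['folder']}",
            f"- Inputs: {entry['inputs']}",
            f"- Outputs: {entry['outputs']}",
            f"- Code version: {entry['code_version']}",
            f"- Date produced: {entry['produced_on']}",
            f"- Notes: {entry.get('notes', 'none')}",
            "",
        ]
    return "\n".join(lines)


# Values taken from the SASSI_01 description in the example above;
# the date is a placeholder, not a real value.
text = render_readme(
    "SASSI",
    [
        {
            "folder": "SASSI_01",
            "inputs": "PKLGZIP_01",
            "outputs": "breathlist and beatlist .csv files",
            "code_version": "SASSI version 88",
            "produced_on": "YYYY-MM-DD",
        }
    ],
)
print(text)
```

The returned text can be saved as, for example, SASSI_README.md in the ANALYSIS subfolder, with one section added per iteration folder.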