Useful practices for data scientists

Data are cheap, time is expensive

  1. Data management
    1. Save the raw data.
    2. Ensure that raw data are backed up in more than one location.
    3. Create the data set you wish you had received.
      1. File formats: Convert data from closed, proprietary formats to open, nonproprietary formats that ensure machine readability across time and computing setups. Good options include CSV for tabular data; JSON, YAML, or XML for nontabular data such as graphs (the node-and-arc kind); and HDF5 for certain kinds of structured data.
      2. Variable names: Replace inscrutable variable names and artificial data codes with self-explaining alternatives.
      3. File names: Store especially useful metadata as part of the file name itself, while keeping the file name regular enough for easy pattern matching. For example, a file name like 2016-05-alaska-b.csv makes it easy for both people and programs to select by year or by location.
    4. Create tidy data (we will talk about that concept later).
    5. Record all the steps used to process data.
    6. Anticipate the need to use multiple tables, and use a unique identifier for every record (or, better, use an SQL database).
    7. Submit data to a reputable DOI-issuing repository so that others can access and cite it (Figshare, Dryad, and Zenodo).
    8. Provide metadata, at least a README file.
  2. Software
    1. Follow style guides (PEP 8 in the case of Python).
    2. Write comments in your programs.
    3. Decompose programs into clean functions.
    4. Reuse the functions to eliminate duplication.
    5. Give functions and variables meaningful names.
    6. Make dependencies and requirements explicit.
      pip freeze > requirements.txt
      pip install -r requirements.txt

    7. Do not comment and uncomment sections of code to control a program’s behavior.
    8. Provide a simple example or test data set.
    9. Submit code to a reputable DOI-issuing repository (Figshare, Zenodo).
  3. Collaboration
    1. Create an overview of your project (README).
    2. Create a shared "to-do" list.
    3. Decide on communication strategies.
    4. Make the license explicit. For data and text, use a Creative Commons license: either CC-0 (the "No Rights Reserved" license) or CC-BY (the "Attribution" license, which permits sharing and reuse but requires people to give appropriate credit to the creators). For software, we recommend a permissive open source license such as the MIT, BSD, or Apache license.
  4. Project organization
    1. Put each project in its own directory, which is named after the project.
    2. Put text documents associated with the project in the doc directory.
    3. Put raw data and metadata in a data directory and files generated during cleanup and analysis in a results directory.
    4. Put project source code in the src directory.
    5. Put compiled programs in the bin directory.
  5. Keeping track of changes
    1. Back up (almost) everything created by a human being as soon as it is created.
    2. Keep changes small.
    3. Share changes frequently.
    4. Store each project in a folder that is mirrored off the researcher’s working machine.
    5. Use a version control system.
  6. Manuscripts
    1. Either write manuscripts using online tools with rich formatting, change tracking, and reference management, such as Google Docs.
    2. Or write the manuscript in a plain text format that permits version control, such as LaTeX or Markdown, and then convert it to other formats, such as PDF, as needed using scriptable tools like Pandoc.
    3. Use a bibliography manager, such as Zotero.
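
As a minimal sketch of the file-naming advice in the data management section, the Python snippet below parses names like 2016-05-alaska-b.csv and selects files by year or location. The exact pattern (year, month, location, one-part suffix), the function names, and the sample file names other than 2016-05-alaska-b.csv are assumptions made for illustration, not a convention prescribed by the list.

```python
import re

# Assumed (hypothetical) convention: YYYY-MM-location-suffix.csv
FILENAME_PATTERN = re.compile(
    r"^(?P<year>\d{4})-(?P<month>\d{2})-(?P<location>[a-z]+)-(?P<suffix>[a-z0-9]+)\.csv$"
)


def parse_filename(name):
    """Return the metadata encoded in a file name, or None if it does not match."""
    match = FILENAME_PATTERN.match(name)
    return match.groupdict() if match else None


def select_files(names, **criteria):
    """Keep the names whose parsed metadata equals every given criterion."""
    selected = []
    for name in names:
        meta = parse_filename(name)
        if meta is not None and all(meta.get(k) == v for k, v in criteria.items()):
            selected.append(name)
    return selected


# Made-up file names, used only to exercise the functions.
files = [
    "2016-05-alaska-b.csv",
    "2016-06-yukon-a.csv",
    "2017-05-alaska-a.csv",
    "notes.txt",
]
alaska_2016 = select_files(files, year="2016", location="alaska")
print(alaska_2016)
```

Because the name is regular, the same pattern serves both people scanning a directory and programs filtering it. A name that does not follow the convention is simply skipped.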

https://drivendata.github.io/cookiecutter-data-science/
https://medium.com/outlier-bio-blog/a-quick-guide-to-organizing-data-science-projects-updated-for-2016-4cbb1e6dac71
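
The directory layout from the project organization list can be sketched in a few lines of Python. This is a hand-rolled illustration, not the Cookiecutter Data Science template linked above. The function name `create_project_skeleton` and the choice of `README.md` as the README file name are assumptions.

```python
from pathlib import Path

# Directory names taken from the project organization list.
SUBDIRECTORIES = ("doc", "data", "results", "src", "bin")


def create_project_skeleton(parent, project_name):
    """Create parent/project_name with the listed subdirectories and a stub README."""
    root = Path(parent) / project_name
    for sub in SUBDIRECTORIES:
        # exist_ok=True makes the call safe to repeat on an existing project.
        (root / sub).mkdir(parents=True, exist_ok=True)
    readme = root / "README.md"  # assumed file name for the project overview
    if not readme.exists():
        readme.write_text(f"# {project_name}\n", encoding="utf-8")
    return root
```

Calling `create_project_skeleton(".", "my-analysis")` would create a `my-analysis` directory with `doc`, `data`, `results`, `src`, and `bin` inside it. An existing README is left untouched.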
