# Data Processing Tools hints and tips
This is a short collection of hints and tips for tools that I use frequently for data processing.
## DVC

Two main uses in data processing pipelines:

1. Defining pipelines in code and executing them efficiently.
2. Remote dependencies with `dvc import` and `dvc import-url`.

---

Stages are defined in a `dvc.yaml` file and executed by `dvc repro`. Only stages whose dependencies have changed will run.

---

Stages can each use different runtimes.

```yaml
stages:
  one:
    cmd: Rscript process.R
  two:
    cmd: python process.py
```

You need to make sure that all dependencies are set up wherever the pipeline is executed.

---

```yaml
stages:
  one:
    deps:
      - input_file
    outs:
      - stage/one/output.csv
  two:
    deps:
      - stage/one/output.csv
```

* Stage `two` runs after stage `one`.
* If `input_file` doesn't change, stage `one` doesn't run.
* If `stage/one/output.csv` doesn't change, stage `two` doesn't run.

---

You can mark stages as `always_changed` to force them to run.

<pre><code data-trim data-line-numbers='4'>
stages:
  one:
    ...
    always_changed: true
    ...
</code></pre>

This is useful if a stage downloads a file from the web and you have no way of knowing whether it has changed.

---

## Good practices

* Small stages are better, as they can reduce unnecessary processing.

---

## Notes

* Make sure DVC is up to date (version 3 is current). An outdated version raises an error about the `dvc.lock` file format.

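
To picture how `dvc repro` can skip unchanged stages, here is a minimal Python sketch of hash-based change detection. It illustrates the general idea only and is not DVC's actual implementation; the function names `file_hash`, `snapshot`, and `stage_is_stale` are hypothetical, and the choice of SHA-256 is an assumption made for the sketch.

```python
import hashlib
from pathlib import Path


def file_hash(path: Path) -> str:
    """Return a hex digest of the file's contents (SHA-256 chosen for this sketch)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def snapshot(deps: list[Path]) -> dict[str, str]:
    """Record the current hash of every dependency, loosely like a lock file."""
    return {str(dep): file_hash(dep) for dep in deps}


def stage_is_stale(deps: list[Path], recorded: dict[str, str]) -> bool:
    """A stage needs to rerun if any dependency is unrecorded or its hash differs."""
    return any(recorded.get(str(dep)) != file_hash(dep) for dep in deps)
```

A stage whose dependencies all match the recorded snapshot can be skipped. Once a dependency's content changes, the stage is stale, and so is any later stage that depends on an output which changes as a result.
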
## R

* On Windows, you may need to add the location of `Rscript.exe` to the Windows `PATH`.

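
A quick way to check whether `Rscript` is reachable before running a pipeline is a small Python sketch like the one below. The helper name `find_rscript` and its `search_path` parameter are hypothetical; the lookup itself uses the standard library's `shutil.which`.

```python
import shutil
from typing import Optional


def find_rscript(search_path: Optional[str] = None) -> Optional[str]:
    """Return the full path to Rscript, or None if it cannot be found.

    With search_path=None the PATH environment variable is searched.
    On Windows, shutil.which also tries the PATHEXT extensions, so
    "Rscript" matches "Rscript.exe".
    """
    return shutil.which("Rscript", path=search_path)


if __name__ == "__main__":
    location = find_rscript()
    if location is None:
        print("Rscript not found; add R's bin directory to PATH.")
    else:
        print(f"Rscript found at {location}")
```

If this reports that `Rscript` is missing, a stage with `cmd: Rscript process.R` will most likely fail in the same environment.
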
## GitHub Actions

* Install the `gh` CLI to run actions locally
* You also need the [gh-act extension](https://nektosact.com/)
* ...which also needs Docker

## Docker
## Visual Studio Code

* Drag title bars to rearrange windows as tiles.
* You can register plugins as recommended for a workspace from the plugin page.
* Source Control
  * Select lines to stage only parts of files
  * Click the little icon to turn it into a tree view!