Avoiding data leaks on GitHub through Jupyter notebooks
Published:
A. Preventing a data leak
Clear all notebooks automatically on commit
Rationale: run a filter over certain files before they are added to git. This will leave the original file on disk as-is, but commit the “cleaned” version.
1- Create a .gitattributes file in your repo
.gitattributes:
*.ipynb filter=jupyternotebook
2- Create a .gitconfig file in your repo
.gitconfig:
[filter "jupyternotebook"]
clean = jupyter nbconvert --to=notebook --ClearOutputPreprocessor.enabled=True --stdout %f
required
smudge = cat
3- Add custom .gitconfig to local git config: git config --local include.path ../.gitconfig
- N.B.: this step has to be repeated every time the repo is cloned.
4- Verify that custom config was added to local git config
-> This filter should now run every time a notebook is staged (git add).
.git/config:
...
[include]
path = ../.gitconfig
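To make concrete what the clean filter does to a notebook before it is staged, here is a minimal Python sketch that works directly on the notebook JSON. It is an approximation of the effect of ClearOutputPreprocessor, not nbconvert's implementation: the function name clear_outputs and the sample notebook are hypothetical, and the nbformat 4 layout (a top-level "cells" list) is assumed.

```python
import json


def clear_outputs(notebook: dict) -> dict:
    """Return a copy of a notebook (nbformat 4 JSON assumed) without outputs."""
    cleaned = json.loads(json.dumps(notebook))  # cheap deep copy of plain JSON data
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []            # drop printed tables, plots, tracebacks
            cell["execution_count"] = None  # drop the In [n] counter
    return cleaned


# Hypothetical executed notebook whose output contains sensitive data.
executed = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Patients"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": 3,
            "source": ["df.head()"],
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["Jane Doe\n"]}
            ],
        },
    ],
}

committed = clear_outputs(executed)
print(committed["cells"][1]["outputs"])         # [] : this is what git would store
print(executed["cells"][1]["execution_count"])  # 3  : the file on disk is unchanged
```

The source of every cell survives, so the committed notebook can still be re-run by anyone who has access to the data.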
Use GitHub Actions to prevent committing executed notebooks
Rationale: Prevent accidental commits of executed notebooks by checking every push with GitHub Actions.
- Create a .github/workflows/main.yml file in your repo
.github/workflows/main.yml:
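name: GitHub Pre-Push actions
on: [push]
jobs:
  ensure_clean_jupyter_notebooks:
    runs-on: ubuntu-latest
    steps:
      # Check out the repository so the action has notebooks to inspect
      - uses: actions/checkout@v4
      - uses: ResearchSoftwareActions/EnsureCleanNotebooksAction@1.1
        with:
          # Do not fail on leftover execution counts
          disable-checks: execution_count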
This action is provided by ResearchSoftwareActions.
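The kind of check such a workflow performs can be sketched in a few lines of Python. This is not the action's own code: the function name dirty_cells and the sample cells are hypothetical, and the only behaviour assumed is that a notebook counts as "executed" when a code cell still has outputs or, optionally, an execution count (the part that disable-checks: execution_count switches off).

```python
from typing import List


def dirty_cells(notebook: dict, check_execution_count: bool = True) -> List[int]:
    """Return the indices of code cells that still carry execution results."""
    dirty = []
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        has_outputs = bool(cell.get("outputs"))
        has_count = check_execution_count and cell.get("execution_count") is not None
        if has_outputs or has_count:
            dirty.append(index)
    return dirty


# Hypothetical cells: a clean one, one with only a counter, one with real output.
clean_cell = {"cell_type": "code", "execution_count": None, "outputs": [], "source": []}
counted_cell = {"cell_type": "code", "execution_count": 7, "outputs": [], "source": []}
leaky_cell = {
    "cell_type": "code",
    "execution_count": 8,
    "outputs": [{"output_type": "stream", "name": "stdout", "text": ["secret\n"]}],
    "source": [],
}
notebook = {"cells": [clean_cell, counted_cell, leaky_cell]}

print(dirty_cells(notebook))                               # [1, 2]
print(dirty_cells(notebook, check_execution_count=False))  # [2]
```

With the execution-count check disabled, only cells that actually contain output are reported, which is the part that can leak data.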
B. In case of a data leak already on GitHub
Clear all notebooks
Whilst in the repo:
jupyter nbconvert --ClearOutputPreprocessor.enabled=True --inplace */*/*/*/*/*.ipynb
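(or, with nbconvert > 6.0: jupyter nbconvert --clear-output --inplace */*/*/*/*/*.ipynb)
Each glob pattern only matches notebooks at one directory depth, so repeat the command for every depth that contains notebooks, starting from *.ipynb and going as deep as your repo structure goes (*/*/*/*/*/*.ipynb in this case).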
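Instead of writing one glob pattern per directory depth, the notebooks can be collected recursively. The following is a minimal sketch under a few assumptions: find_notebooks and clear_in_place are hypothetical helper names, jupyter must be on the PATH for clear_in_place to work, and skipping .ipynb_checkpoints folders is a design choice made here rather than something the commands above do.

```python
import subprocess
import tempfile
from pathlib import Path


def find_notebooks(root):
    """Return every *.ipynb under root, at any depth, skipping checkpoint copies."""
    return sorted(
        path
        for path in Path(root).rglob("*.ipynb")
        if ".ipynb_checkpoints" not in path.parts
    )


def clear_in_place(paths):
    """Run the nbconvert command from the text on each notebook (needs jupyter)."""
    for path in paths:
        subprocess.run(
            ["jupyter", "nbconvert", "--ClearOutputPreprocessor.enabled=True",
             "--inplace", str(path)],
            check=True,
        )


# Demonstration on a throwaway directory tree (no jupyter required for this part).
with tempfile.TemporaryDirectory() as repo:
    for rel in ["top.ipynb", "a/b/c/deep.ipynb",
                "a/.ipynb_checkpoints/top-checkpoint.ipynb"]:
        target = Path(repo, rel)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text("{}")
    found = [p.relative_to(repo).as_posix() for p in find_notebooks(repo)]

print(found)  # ['a/b/c/deep.ipynb', 'top.ipynb']
```

Inside a real repo, clear_in_place(find_notebooks(".")) would then clear every notebook regardless of how deeply it is nested.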
Clean all Jupyter notebooks already on GitHub
Requirements: BFG (can be installed via brew)
Procedure as explained in this blog post.
- Move your old local repo folder aside (keep it temporarily as a backup):
mv example-repo example-repo-old
- Clone a mirror of the repo:
git clone --mirror git@github.com:example/example-repo.git
- Target all jupyter notebooks of repo:
bfg --delete-files "{*.ipynb}" example-repo.git/
- Rewrite history (expire the old references, garbage-collect the removed objects, then push):
cd example-repo.git && git reflog expire --expire=now --all && git gc --prune=now --aggressive
git push
- Delete the mirror repo (and, once you have verified the result, your temporary copy of the old version, example-repo-old):
cd ../ && rm -rf example-repo.git
- Re-download the clean version of the repo:
git clone git@github.com:example/example-repo.git
- Repeat steps 3 and 4 of part A (reinstate the custom filter after cloning).
This essentially removes all history of these files.
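One way to check the result is to list every object that is still reachable in the freshly cloned repo and look for notebook paths. The sketch below assumes git is installed; notebooks_in_history and reachable_notebooks are hypothetical helper names, and the hashes in the sample are placeholders. Notebooks that are still part of the latest commit are expected to show up, since BFG protects the current HEAD by default.

```python
import subprocess


def notebooks_in_history(rev_list_output: str):
    """Extract *.ipynb paths from the output of `git rev-list --all --objects`."""
    paths = set()
    for line in rev_list_output.splitlines():
        # Commit lines hold only a hash; tree and blob lines are "<hash> <path>".
        _, _, path = line.partition(" ")
        if path.endswith(".ipynb"):
            paths.add(path)
    return sorted(paths)


def reachable_notebooks(repo_dir="."):
    """Run git in repo_dir and return the notebook paths still reachable."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "rev-list", "--all", "--objects"],
        capture_output=True, text=True, check=True,
    )
    return notebooks_in_history(result.stdout)


# Placeholder output, shaped like what git prints (hashes are made up).
sample = "\n".join([
    "1111111111111111111111111111111111111111",
    "2222222222222222222222222222222222222222 ",
    "3333333333333333333333333333333333333333 README.md",
    "4444444444444444444444444444444444444444 analysis/patients.ipynb",
])
print(notebooks_in_history(sample))  # ['analysis/patients.ipynb']
```

Running reachable_notebooks("example-repo") after the fresh clone should list only notebooks that are present in the latest commit.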