A Collaboration Cycle (Part I: Set-Up)

Michael Paris
5 min read · Feb 28, 2022

Google Colab is an (almost) excellent tool for collaborating, with great and affordable resources. It is by far the best tool to get started and provides a nice GPU; however, it lacks the simultaneous code/text editing that is possible in Google Docs and Sheets.

As you collaborate, code amasses and bugs start to appear. At this stage the code becomes messy. Though it is technically possible to debug inline and group code into sections, it is not an efficient way to progress. Resolving these issues becomes tedious and mentally straining.

“What would be a good process for fixing these bugs, without losing the leverage of the Google resources (Colab/Drive)?”

And what becomes of the initial collaboration?

Most of the information here is not new, and you may find the articles referenced at the bottom helpful. Therefore, to spice things up, this process will incorporate Google Drive as a means to warehouse a medium-sized data volume (10–100 GB) and focus on collaboration, data, and deploy-key usage (i.e. Part I).

Prerequisites

For this set-up to work, the following three components are necessary:

  • GitHub account
  • Google Drive and Colab
  • IDE of your choosing (here PyCharm)

Set up

Git Repository

Create a new repository <project_name>_code on GitHub. You can do so by checking out this post by Alex Chin or the GitHub instructions.

Once this is done, create a directory <project_name> locally where you keep your projects and clone the repository <project_name>_code into it. Add a data/<dataset_name>_sample directory and __init__.py files for calling the modules, as these will be needed shortly.

mkdir <project_name>
cd <project_name>
git clone git@github.com:<your_username>/<project_name>_code
mkdir -p data/<dataset_name>_sample
touch __init__.py
touch <project_name>_code/__init__.py

To finalize the Git set-up, a deploy key needs to be set for this repository. From inside the <project_name> directory, create a private/public key pair, then head over to the repository <project_name>_code on GitHub. Attention: the user information some_user@<project_name>.com used during key creation is the user the collaboration will use to clone the <project_name>_code repository.

ssh-keygen -t rsa -b 4096 -C some_user@<project_name>.com
Generating public/private rsa key pair.
Enter file in which to save the key (/home/user/.ssh/id_rsa): <project_name>_deploy_key

In the repository Settings, find the Security section, open the Deploy keys tab, and add the content of <project_name>_deploy_key.pub as a new deploy key. The local project structure should now look something like this.

<project_name>
├── data
│ └── <dataset_name>_sample
├── <project_name>_code
│ ├── __init__.py
│ └── README.md
├── __init__.py
├── <project_name>_deploy_key
└── <project_name>_deploy_key.pub

Google Colab & Drive

The exact same directory structure (without <project_name>_code) has to be created in Google Drive, preferably in the home directory. Share the entire project folder with your collaborators and make sure that the path to this project folder (usually /content/drive/MyDrive/<project_name>) is the same for all collaborators. As this is the cloud and storage is less of a constraint, we create an additional folder <dataset_name> (where we store the medium-sized data) under data and are ready to start collaborating in a new notebook <project_name>.ipynb.

<project_name>
├── data
│ ├── <dataset_name>
│ └── <dataset_name>_sample
├── __init__.py
└── <project_name>.ipynb
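The Drive folder structure above can also be created programmatically from a Colab cell after mounting, so every collaborator ends up with an identical layout. A minimal sketch; the helper name create_project_tree and its arguments are hypothetical, not part of the original set-up:

```python
import os

def create_project_tree(base_dir, project_name, dataset_name):
    """Create the shared project layout under base_dir (e.g. /content/drive/MyDrive)."""
    project = os.path.join(base_dir, project_name)
    # exist_ok makes the call safe to re-run when a collaborator already created it
    os.makedirs(os.path.join(project, 'data', dataset_name), exist_ok=True)
    os.makedirs(os.path.join(project, 'data', dataset_name + '_sample'), exist_ok=True)
    # Empty __init__.py so the project directory is importable as a package
    open(os.path.join(project, '__init__.py'), 'a').close()
    return project

# In Colab, after drive.mount('/content/drive/'):
# create_project_tree('/content/drive/MyDrive', '<project_name>', '<dataset_name>')
```

Because of exist_ok=True, the cell is idempotent: whoever runs it first creates the folders, and everyone else is a no-op.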

Now, we need to mount the Drive and clone the Git repository. Adjust and use the following two functions for this: clone_git_repo() and append_module_path().

Mount Drive:

from google.colab import drive
drive.mount('/content/drive/')

Authenticate with the deploy key and clone the repository:

# Load a fresh copy of the repo via the deploy key
import os
import subprocess
import sys

from google.colab import files

project_name = '<project_name>'
# This path has to be the same for all collaborators
project_dir = '/content/drive/MyDrive/<project_name>/'

def clone_git_repo(project_dir, project_name, rm_ssh=True):
    repo = project_name + '_code'
    ### Remove the last state of the repo
    os.chdir(project_dir)
    if os.path.exists(repo):
        print("Path exists:", os.path.exists(repo))
        subprocess.run('rm -rf ' + repo, shell=True, check=True)
        print("Path exists:", os.path.exists(repo))
    process = subprocess.run('ls -la', shell=True, check=True,
                             stdout=subprocess.PIPE, universal_newlines=True)
    print(process.stdout)

    ### Upload the private deploy key and set up SSH
    if not os.path.exists('/root/.ssh/'):
        deploy_key = files.upload()
        subprocess.run('mkdir ~/.ssh', shell=True, check=True)
        subprocess.run('mv ' + project_name + '_deploy_key ~/.ssh/',
                       shell=True, check=True)
        ### known_hosts
        subprocess.run('touch /root/.ssh/known_hosts', shell=True, check=True)
        subprocess.run('ssh-keyscan github.com >> /root/.ssh/known_hosts',
                       shell=True, check=True)
        subprocess.run('chmod 644 /root/.ssh/known_hosts', shell=True, check=True)

    ### Git access and clone repo
    # Hardcode the user and git information in the following three commands
    subprocess.run('git config --global user.name "<some_user>"',
                   shell=True, check=True)
    subprocess.run('git config --global user.email "some_user@<project_name>.com"',
                   shell=True, check=True)
    subprocess.run("ssh-agent bash -c 'ssh-add ~/.ssh/" + project_name
                   + "_deploy_key; git clone git@github.com:<your_username>/"
                   + repo + ".git'", shell=True, check=True)

    ### Remove everything connected to the private deploy key
    if rm_ssh:
        subprocess.run('rm -rf ~/.ssh', shell=True, check=True)
        if not os.path.exists(os.path.expanduser('~/.ssh')):
            print("~/.ssh is gone.")

# Append the project path for module visibility
def append_module_path(project_dir):
    if project_dir not in sys.path:
        sys.path.append(project_dir)
    print(sys.path[-1])

clone_git_repo(project_dir, project_name, rm_ssh=False)
append_module_path(project_dir)
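Every shell step above uses check=True, so a failing command raises subprocess.CalledProcessError, but by default the command's stderr is discarded, which makes the error hard to diagnose in a notebook. A small wrapper that captures and prints stderr before re-raising can be dropped in wherever the calls above use subprocess.run; the helper name run_cmd is hypothetical:

```python
import subprocess

def run_cmd(cmd):
    """Run a shell command; on failure, print its stderr before re-raising."""
    try:
        result = subprocess.run(cmd, shell=True, check=True,
                                capture_output=True, text=True)
        return result.stdout
    except subprocess.CalledProcessError as err:
        print('Command failed:', cmd)
        print(err.stderr)
        raise
```

With this, a typo in the deploy-key path or a refused SSH connection shows up as a readable message in the cell output instead of a bare non-zero exit status.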

Finally, the project folder in Google Drive should now contain the cloned <project_name>_code, and the Drive content should resemble this structure.

<project_name>
├── data
│ ├── <dataset_name>
│ └── <dataset_name>_sample
├── __init__.py
├── <project_name>_code
└── <project_name>.ipynb

It should now be possible to import the module with import <project_name>_code.
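Whether the module is actually visible can be checked before importing, by combining the sys.path append with importlib. A minimal sketch; module_visible is a hypothetical helper, not part of the original set-up:

```python
import importlib.util
import sys

def module_visible(name, search_dir):
    """Return True if `name` can be resolved once search_dir is on sys.path."""
    if search_dir not in sys.path:
        sys.path.append(search_dir)
    return importlib.util.find_spec(name) is not None

# In Colab: module_visible('<project_name>_code', project_dir)
```

If this returns False, the usual culprits are a missing __init__.py or a project_dir path that differs between collaborators.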

IDE (e.g. PyCharm)

To really streamline the process of collaboration, the IDE should be configured so that newly created modules are visible from the <project_name> directory, meaning that this path needs to be added to the interpreter paths. A visual description is given here. It should now be possible to run the project within the IDE.

With this set-up, the practicality becomes clear. Code is developed and run on small sample data sets locally, then pushed to Git and cloned on Google Colab into Google Drive, where the entirety of the data set can be processed. If you are interested in managing your data as described above, there will be a follow-up post. As new aspects of the data are revealed during pair-programming (e.g. over a shared screen on Google Meet; see Part II: Collaborative Data Utilization and Investigation), they influence the development stage, and new features can be added locally and tested on the sample data sets. Full circle.

[How to use Google Colab with GitHub via Google Drive]

[Using Github Deploy Key]
