STAT 39000: Project 6 — Fall 2021

Motivation: Python is an incredible language that enables users to write code quickly and efficiently. When using Python to write code, you are typically making a tradeoff between developer speed (the time it takes to write a functioning program or script) and program speed (how fast your code runs). This is often the right tradeoff, depending on your staff and how much your software developers or data scientists earn. However, Python code does not have the advantage of being compiled to machine code for a particular target platform (x86_64, ARM, Darwin, etc.) and then easily shared. Instead, in Python you need to learn how to use virtual environments (and git) to share your code.

Context: This is the first in a series of 3 projects that explore how to set up and use virtual environments, as well as some git basics. This series is not intended to teach you everything you need to know, but rather to give you some exposure so the terminology and general ideas are not foreign to you.

Scope: Python, virtual environments, git

Learning Objectives

Explain what a virtual environment is and why it is important.
Create, update, and use a virtual environment to run somebody else’s Python code.
Use git to create a repository and commit changes to it.
Understand and utilize common git commands and workflows.

Make sure to read about and use the template found here, and read the important information about project submissions here.

Questions

While this project may look like it is a lot of work, it is probably one of the easiest projects you will get this semester. The question text is long, but it is mostly just instructional content and directions. If you just carefully read through it, it will probably take you well under 1 hour to complete!

Question 1

Once complete, type your GitHub username into a markdown cell.

Items to submit

Your GitHub username in a markdown cell.

Question 2

We’ve created a repository for this project at github.com/TheDataMine/f2021-stat39000-project6. You’ll quickly see that the code will be ultra familiar to you. The goal of this question is to clone the repository to your $HOME directory. Some of you may already be rushing off to your Jupyter Notebook to run the following.

%%bash

git clone https://github.com/TheDataMine/f2021-stat39000-project6

Don’t! Instead, we are going to take the time to set up authentication with GitHub using SSH keys. Don’t worry, it’s way easier than it sounds!

P.S. As usual, you should have a notebook called firstname-lastname-project06.ipynb (or something similar) in your $HOME directory, and you should be using bash cells to run and track your bash code.

The first step is to create a new SSH key pair on Brown, in your $HOME directory. To do that, simply run the following in a bash cell.

If you know what an SSH key pair is, and already have one setup on Brown, you can skip this step.

%%bash

ssh-keygen -a 100 -t ed25519 -f ~/.ssh/id_ed25519 -C "lastname_brown_key"

When prompted for a passphrase, just press enter twice without entering a passphrase. If it doesn’t prompt you, it probably already generated your keys! Congratulations! You have your new key pair!

So, what is a key pair, and what does it look like? A key pair is two files on your computer (or in this case, Brown). These files live inside the ~/.ssh directory. Take a look by running the following in a bash cell.

ls -la ~/.ssh

Output

...
id_ed25519
id_ed25519.pub
...

The first file, id_ed25519, is your private key. It is critical that you do not share this key with anybody, ever. Anybody in possession of this key can log in, as you, to any system that has the associated public key. As such, on a shared system (with lots of users, like Brown), it is critical to assign the correct permissions to this file. Run the following in a bash cell.

chmod 600 ~/.ssh/id_ed25519

This will ensure that you, as the owner of the file, have the ability to both read and write to this file. At the same time, this prevents any other user from being able to read, write, or execute this file (with the exception of a superuser). It is also important to get the permissions of files within ~/.ssh correct, as OpenSSH will not work properly otherwise (for safety).

Great! The other file, id_ed25519.pub, is your public key. This is the key that is shareable, and that allows a third party to verify that "the user trying to access resource X has the associated private key." First, let’s set the correct permissions by running the following in a bash cell.

chmod 644 ~/.ssh/id_ed25519.pub

This will ensure that you, as the owner of the file, have the ability to both read and write to this file. At the same time, everybody else on the system will be able to read the file, but not write to or execute it.

Last, but not least, run the following to correctly set the permissions of the ~/.ssh directory.

%%bash

chmod 700 ~/.ssh
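
To make the three modes used in this question (600, 644, and 700) less mysterious, here is a minimal sketch for a regular Python cell. It only translates the octal numbers into the letters that ls -l prints and does not touch any files. The helper name describe is made up for this illustration.

```python
import stat


def describe(octal_mode, is_dir=False):
    """Render an octal permission mode the way `ls -l` would display it."""
    kind = stat.S_IFDIR if is_dir else stat.S_IFREG
    return stat.filemode(kind | octal_mode)


# Private key: the owner can read and write, nobody else can do anything.
print(describe(0o600))               # -rw-------
# Public key: the owner can read and write, everybody else can only read.
print(describe(0o644))               # -rw-r--r--
# ~/.ssh directory: only the owner can list, modify, or enter it.
print(describe(0o700, is_dir=True))  # drwx------
```

Each octal digit covers one group of users (owner, group, others), and within a digit 4 means read, 2 means write, and 1 means execute.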

Now, take a look at the contents of your public key by running the following in a bash cell.

%%bash

cat ~/.ssh/id_ed25519.pub
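
The single line printed by cat is made of space-separated fields: the key type, the key itself (base64 text), and the comment you passed with -C. The sketch below, meant for a regular Python cell, splits such a line apart. The key data in the example is filler text and not a real key, and split_public_key is a name invented for this illustration.

```python
def split_public_key(line):
    """Split an OpenSSH public key line into (key_type, key_data, comment)."""
    key_type, key_data, *rest = line.strip().split(" ", 2)
    comment = rest[0] if rest else ""  # the comment is optional
    return key_type, key_data, comment


# Placeholder line: the middle field is NOT a real key.
example = "ssh-ed25519 AAAAnotARealKeyJustFillerText lastname_brown_key"

key_type, key_data, comment = split_public_key(example)
print("type:   ", key_type)
print("comment:", comment)
```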

Not a whole lot to it, right? Great. Copy the output (the contents of the file) to your clipboard. Now, navigate to github.com and log in if you haven’t already. Click on your profile in the upper-right-hand corner of the screen, and then click Settings.

If you haven’t already, this is a fine time to explore the various GitHub settings, set a profile picture, add a bio, etc.

In the left-hand menu, click on SSH and GPG keys.

In the next screen, click on the green button that says New SSH key. Fill in the "Title" field with anything memorable. I like to put a description that tells me where I generated the key (on what computer), for example, "brown.rcac.purdue.edu". That way, I can know if I can delete that key down the road when cleaning things out. In the "Key" field, paste your public key (the output from running the cat command in the previous code block). Finally, click the button that says Add SSH key.

Congratulations! You should now be able to easily authenticate with GitHub from Brown. How cool! To test the connection, run the following in a cell.

!ssh -o "StrictHostKeyChecking no" -T git@github.com

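If you instead run the following in a bash cell, the cell will report an error (the ssh command exits with a non-zero status, since GitHub does not provide shell access), but as long as the output says "Hi username! …" at the top, you are good to go!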

%%bash

ssh -T git@github.com

If you were successful, it should reply with something like:

Hi username! You've successfully authenticated, but GitHub does not provide shell access.

If it asks you something like "Are you sure you want to continue connecting (yes/no)?", type "yes" and press enter.

Okay, FINALLY, let’s get to the actual task! Clone the repository to your $HOME directory, using SSH rather than HTTPS.

If you navigate to the repository in the browser and click on the green "<> Code" button, you will get a dropdown menu that allows you to select "SSH", which will then present you with the string you can use in combination with the git clone command to clone the repository.
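
To see how the two kinds of address differ, here is a small sketch for a regular Python cell that rewrites an HTTPS repository address into the form GitHub uses for SSH. The repository example-owner/example-repo is hypothetical and https_to_ssh is a made-up helper; the string shown in the "SSH" tab of the dropdown is always the one to trust.

```python
from urllib.parse import urlparse


def https_to_ssh(url):
    """Rewrite https://host/owner/repo as git@host:owner/repo.git."""
    parts = urlparse(url)
    path = parts.path.strip("/")
    if not path.endswith(".git"):
        path += ".git"
    return f"git@{parts.netloc}:{path}"


# Hypothetical repository, used only to show the shape of the address.
print(https_to_ssh("https://github.com/example-owner/example-repo"))
# git@github.com:example-owner/example-repo.git
```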

Upon success, you should see a new folder in your $HOME directory, f2021-stat39000-project6.

Items to submit

Code used to solve this problem.
Output from running the code.

Question 3

Take a peek into your freshly cloned repository. You’ll notice a couple of files that you may not recognize. Focus on the pyproject.toml file, and cat it to see the contents.

The pyproject.toml file contains the build system requirements of a given Python project. It can be used with pip or some other package installer to download the exact versions of the exact packages (like pandas, for example) required in order to build and/or run the project!
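
Besides using cat, you can read a pyproject.toml from Python. The sketch below parses a small sample and lists its dependencies. The sample is made up for illustration and is not the file from the repository. It assumes Python 3.11 or newer for the standard-library tomllib module, or that the third-party tomli package is installed on older versions.

```python
try:
    import tomllib  # part of the standard library from Python 3.11
except ModuleNotFoundError:
    import tomli as tomllib  # assumption: the tomli backport is installed

# Hypothetical sample, NOT the contents of the project's real file.
SAMPLE = """
[tool.poetry.dependencies]
python = "^3.9"
pandas = "^1.3"

[build-system]
requires = ["poetry-core>=1.0.0"]
build-backend = "poetry.core.masonry.api"
"""

config = tomllib.loads(SAMPLE)
dependencies = config["tool"]["poetry"]["dependencies"]
for name, constraint in dependencies.items():
    print(name, constraint)
```

Entries such as "^1.3" are version constraints rather than single versions. Poetry records the exact versions it resolved in a separate poetry.lock file.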

Typically, when you are working on a project, and you’ve cloned the project, you want to build the exact environment that the developer had set up when developing the project. This way you ensure that you are using the exact same versions of the same packages, so you can expect things to function the same way. This is critical, as the last thing you want to have to deal with is figuring out why your code is not working when the developer’s or project maintainers’ code is.

There are a variety of popular tools that can be used for dependency management and/or virtual environment management in Python. The most popular are: conda, pipenv, and poetry.

What is a "virtual environment"? In a nutshell, a virtual environment is a Python installation such that the interpreter, libraries, and scripts that are available in the virtual environment are distinct and separate from those in other virtual environments or the system Python installation.

We will dig into this more.
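
As a first look, the following sketch for a regular Python cell reports which interpreter is running and whether it belongs to a virtual environment. It relies on the fact that environments created by venv or virtualenv make sys.prefix differ from sys.base_prefix. Other tools, such as conda, may not follow that convention, so treat the result as a hint. The function name in_virtual_environment is invented for this example.

```python
import sys


def in_virtual_environment():
    """True when the running interpreter belongs to a venv-style environment."""
    return sys.prefix != sys.base_prefix


print("interpreter:         ", sys.executable)
print("environment prefix:  ", sys.prefix)
print("base installation:   ", sys.base_prefix)
print("virtual environment? ", in_virtual_environment())
```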

There are pros and cons to each of these tools, and you are free to explore and use what you like. Having used each of these tools exclusively for at least 1 year or more, I have had the fewest issues with poetry.

When I say "issues" here, I mean unresolved bugs with open tickets on the project’s GitHub page. For that reason, we will be using poetry for this project.

Poetry was used to create the pyproject.toml file you see in the repository. Poetry is already installed on Brown. See where by running the following in a bash cell.

which poetry

By default, when creating a virtual environment using poetry, each virtual environment will be saved to $HOME/.cache/pypoetry. While this is not particularly bad, there is a configuration option we can set that will instead store the virtual environment in a project’s own directory. This is a nice feature if you are working on a shared compute space, as it is explicitly clear where the environment is located and, theoretically, you will have access (as it is a shared space). Let’s set this up. Run the following commands.

%%bash

poetry config virtualenvs.in-project true
poetry config cache-dir "$HOME/.cache/pypoetry"
poetry config --list

This will create a config.toml file at $HOME/.config/pypoetry/config.toml, which is where your settings are saved.

Finally, let’s set up your own virtual environment to use with your cloned f2021-stat39000-project6 repository. Run the following commands.

module unload python/f2021-s2022-py3.9.6
cd $HOME/f2021-stat39000-project6
poetry install

This may take a minute or two to run.

Normally, you’d be able to skip the module unload part of the command. However, it is required here since we are already in a virtual environment (the f2021-s2022 kernel). Otherwise, poetry would not install the packages into the correct location.

This should install all of the dependencies and the virtual environment in $HOME/f2021-stat39000-project6/.venv. To check, run the following.

ls -la $HOME/f2021-stat39000-project6/

To actually use this virtual environment (rather than our kernel’s Python environment, or the system Python installation), preface python commands with poetry run. For example, let’s say we want to run a script in the package. Instead of running python script.py, we can run poetry run python script.py. Test it out!

Every bash cell that runs poetry commands must begin as follows:

%%bash

module unload python/f2021-s2022-py3.9.6

Otherwise, poetry will not use the correct Python environment. This is a side effect of the way our installation is set up. Normally, poetry will know to use the correct Python environment for the project.

We have a file called runme.py in the scripts directory ($HOME/f2021-stat39000-project6/scripts/runme.py). This script just quickly uses our package and prints some info — nothing special. Run the script using the virtual environment.

You may need to provide execute permissions to the runme files.

chmod 700 $HOME/f2021-stat39000-project6/scripts/runme.py
chmod 700 $HOME/f2021-stat39000-project6/scripts/runme2.py

%%bash

module unload python/f2021-s2022-py3.9.6
chmod 700 $HOME/f2021-stat39000-project6/scripts/runme.py
chmod 700 $HOME/f2021-stat39000-project6/scripts/runme2.py
cd $HOME/f2021-stat39000-project6
poetry run python scripts/runme.py

The script will print the location of the pandas package as well — if it starts with $HOME/f2021-stat39000-project6/.venv/ then you are correctly running the script using our environment! Otherwise, you are not and need to remember to use poetry.
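
The same check can be written in a few lines of Python. This is only a sketch: it uses the standard-library json module as a stand-in for pandas so that it runs anywhere, and the helper is_inside is made up for this illustration. Inside the project environment you would look at pandas.__file__ instead.

```python
import json  # stand-in for pandas so the sketch runs without extra packages
import os
from pathlib import Path


def is_inside(path, root):
    """True when path is located somewhere under the directory root."""
    path, root = Path(path).resolve(), Path(root).resolve()
    return root == path or root in path.parents


# Location where poetry was configured to create the project environment.
venv = Path(os.path.expanduser("~/f2021-stat39000-project6/.venv"))

print("module location:", json.__file__)
print("inside the project environment:", is_inside(json.__file__, venv))
```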

Items to submit

Code used to solve this problem.
Output from running the code.

Question 4

Now, try to run the following script using our virtual environment: $HOME/f2021-stat39000-project6/scripts/runme2.py. What happens?

Make sure to run the script from the project folder and not from the $HOME directory. poetry looks for a pyproject.toml file in the current directory, and if it doesn’t find one, it will throw an error, but that error will not show you which package is missing. So, to be clear, don’t do:

%%bash

module unload python/f2021-s2022-py3.9.6
poetry run python $HOME/f2021-stat39000-project6/scripts/runme2.py

But do run:

%%bash

module unload python/f2021-s2022-py3.9.6
cd $HOME/f2021-stat39000-project6
poetry run python scripts/runme2.py
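
If you are ever unsure whether you are in the project folder, a quick check for a pyproject.toml file can help. The sketch below is for a regular Python cell and uses a made-up helper, has_pyproject. It demonstrates the check on a temporary directory and then on the current directory. It only looks for the file and says nothing about how poetry itself searches.

```python
import tempfile
from pathlib import Path


def has_pyproject(directory):
    """True when directory directly contains a pyproject.toml file."""
    return (Path(directory) / "pyproject.toml").is_file()


# Demonstrate on a throwaway directory, before and after creating the file.
with tempfile.TemporaryDirectory() as tmp:
    before = has_pyproject(tmp)
    (Path(tmp) / "pyproject.toml").write_text("")
    after = has_pyproject(tmp)

print("empty directory:", before)
print("after creating pyproject.toml:", after)
print("current directory:", has_pyproject(Path.cwd()))
```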

It looks like a package wasn’t found, and should be added to our environment (and therefore our pyproject.toml file). Run the following command to install the package to your virtual environment.

module unload python/f2021-s2022-py3.9.6
cd $HOME/f2021-stat39000-project6
poetry add packagename # where packagename is the name of the package/module you want to install (that was found to be missing)

Does the pyproject.toml reflect this change? Now try and run the script again — voila!
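
A missing package shows up as a ModuleNotFoundError when the script reaches its import statement. As a rough sketch, you can also ask Python whether a top-level module can be found without importing it. The helper name is_installed and the second module name below are made up for illustration. To test the project environment rather than the kernel, the code would need to be run with poetry run python.

```python
import importlib.util


def is_installed(module_name):
    """True when a top-level module can be found by the current interpreter."""
    return importlib.util.find_spec(module_name) is not None


# "json" ships with Python; the second name is deliberately nonexistent.
for name in ["json", "a_module_that_does_not_exist"]:
    status = "available" if is_installed(name) else "missing"
    print(name, "->", status)
```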

Items to submit

Code used to solve this problem.
Output from running the code.

Question 5

Read about at least 1 of the 2 git workflows listed here (if you have to choose 1, I prefer the "GitHub flow" style). Describe in words the process you would use to add a function or method to our repo, step by step, in as much detail as you can. I will start for you, with the "GitHub flow" style.

Add the function or method to the watch_data.py module in $HOME/f2021-stat39000-project6/.
…
Deploy the branch (this could be a website, or package being used somewhere) for final testing, before merging into the main branch where code should be pristine and able to be immediately deployed at any time and function as intended.
…

The goal of this question is to try as hard as you can to understand, at a high level, what a workflow like this enables and the steps involved, and to think about it from the perspective of working with 100 other data scientists and/or software engineers. Any details, logic, or explanation you want to provide in the steps would be excellent!

You do not need to specify actual git commands if you do not feel comfortable doing so, however, it may come in handy in the next project (hint hint).

Items to submit

Code used to solve this problem.
Output from running the code.

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure that what you think you submitted is what you actually submitted.

Sharing Python code: Virtual environments & git part I