class: center, middle, inverse, title-slide

.title[
# Programming Tools in Data Science
]
.subtitle[
## Lecture #3: GitHub
]
.author[
### Samuel Orso
]
.date[
### 26 September 2024
]

---

# GitHub

<img src="images/github.png" width="950" height="450" style="display: block; margin: auto;" />

---

# Motivation

* When working on a project, several people usually work on the same files and folders
* You want to avoid sending each modification by email
* You could use Dropbox, Google Drive and the like, but it is good practice to keep track of modifications and to have a platform to plan and discuss changes

---

# Motivation

GitHub allows you to:

- record the entire history of a file;
- revert to a specific version of the file;
- collaborate on the same platform with other people;
- make changes without modifying the main file and add them once you feel comfortable with them.

---

# Motivation

GitHub will be used to:

- work in groups on projects and homework;
- submit projects and homework;
- develop R packages and websites;
- ...

---

# Ready?

<center>
<iframe src="https://giphy.com/embed/h4TdHo3RExSbHd9bOe" width="480" height="425" frameBorder="0" class="giphy-embed" allowFullScreen></iframe><p><a href="https://giphy.com/gifs/cbc-schitts-creek-h4TdHo3RExSbHd9bOe">via GIPHY</a></p>
</center>

---

# In fact, what is Git?

<img src="images/git.png" style="width:150px; position:absolute; top:9%; left:40%" />

Git is a **distributed version control system**.

* **distributed**: whenever you instruct Git to share files, Git does not share only the latest version of each file; it distributes **every version** it has recorded for that project.
* **version control system**: many people are used to having *their own version control system*, e.g. by keeping different versions of the same file (`file_v1.R`, `file_v2.R`, ...). This approach is error-prone and ineffective when working on a team project. A version control system instead keeps track of the changes made to your project.
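---

# In fact, what is Git?

To make "distributed" concrete, here is a minimal command-line sketch, assuming Git is installed and a Unix-like shell is available. It records two versions of a file, then clones the repository. The folder names `origin-repo` and `copy-repo`, the file `analysis.R` and the demo identity are made up for this illustration.

```shell
set -e                                   # stop at the first error
cd "$(mktemp -d)"                        # work in a throwaway folder

# Create a repository and record two versions of the same file
git init -q origin-repo
cd origin-repo
git config user.name "Demo User"         # hypothetical identity, local to this repo
git config user.email "demo@example.com"
echo "x <- 1" > analysis.R
git add analysis.R
git commit -q -m "Add analysis script"
echo "y <- 2" >> analysis.R
git commit -q -am "Add second variable"
cd ..

# Sharing the project (here: cloning it) copies the whole history
git clone -q origin-repo copy-repo
count=$(git -C copy-repo rev-list --count HEAD)
echo "commits in the copy: $count"
```

The copy reports two commits: it holds both recorded versions, not just the latest one.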
---

# Types of VCS

There are three types of version control systems (VCS):

* local
* centralized
* distributed

---

# Types of VCS

## Local

.pull-left[
<img src="images/local_vcs.jpg" width="451" height="300" style="display: block; margin: auto;" />
]

.pull-right[
* One of the simplest and most commonly used types of VCS
* It keeps patch sets (the modifications made to a file) locally (on your computer)
* It can recreate the file at any point in time by adding up the patches
]

---

# Types of VCS

## Centralized

.pull-left[
<img src="images/centralized_vcs.png" width="490" height="317" style="display: block; margin: auto;" />
]

.pull-right[
* A single server contains all the versioned files
* Risk of failure: the server is a single point of failure
* Risk of database corruption
]

---

# Types of VCS

## Distributed

.pull-left[
<img src="images/dst_vcs.jpg" width="460" height="416" style="display: block; margin: auto;" />
]

.pull-right[
* Store the entire history of files locally
* Sync local changes back to the server
* Allow multiple users and minimize the risks of a centralized VCS
]

---

# Benefits of VCS

* Allow multiple users to collaborate and communicate while working on a project.
* Keep track of the change history of the files (risk mitigation), with the possibility to roll back to a previous version.
* Support different workflows such as branching and merging (not discussed)

<img src="images/branching.jpeg" width="410" height="250" style="display: block; margin: auto;" />

---

# So Git and GitHub are the same thing?

<center>
<iframe src="https://giphy.com/embed/3o6YglDndxKdCNw7q8" width="480" height="478" frameBorder="0" class="giphy-embed" allowFullScreen></iframe><p><a href="https://giphy.com/gifs/nba-basketball-chicago-bulls-3o6YglDndxKdCNw7q8">via GIPHY</a></p>
</center>

---

# Git vs GitHub

Git is a distributed VCS, so what is GitHub exactly?

* Git is software...
* ...and GitHub is a web-based platform for software development and version control that uses Git.
* GitHub hosts and shares Git repositories.
* GitHub is not the only service provider

---

# Bitbucket

<img src="images/bitbucket.png" width="2525" style="display: block; margin: auto;" />

---

# GitLab

<img src="images/gitlab.png" width="2476" style="display: block; margin: auto;" />

---

# SourceForge

<img src="images/sourceforge.png" width="2511" style="display: block; margin: auto;" />

---

# Okay to continue?

.center[
<iframe src="https://giphy.com/embed/sG4PBWRjI4GSVCDXEq" width="480" height="480" frameBorder="0" class="giphy-embed" allowFullScreen></iframe><p><a href="https://giphy.com/gifs/nickelodeon-drama-club-sG4PBWRjI4GSVCDXEq">via GIPHY</a></p>
]

---

# File states in Git

A file can be in one of four states: **untracked**, **modified**, **staged** or **committed**

* **untracked**: a new file that is not tracked by Git (yet);
* **modified**: a tracked file that has been changed but whose changes are not recorded (not committed yet);
* **staged**: a tracked file that has been changed and selected to be saved (committed) into the repository in the next commit snapshot;
* **committed**: a file that is successfully recorded into the (local) repository.

---

# File states in Git

<img src="images/git-basic-workflow-codesweetly.png" width="2560" style="display: block; margin: auto;" />

---

# You can also `.gitignore`

* Some files or folders of your project can be excluded from version control by listing them in a `.gitignore` file
* These files or folders will not be shared with other users

<img src="images/gitignore.png" width="523" style="display: block; margin: auto;" />

---

# GitHub

## Basic workflow

The basic workflow is as follows...

1. Open the RStudio Project connected to your Git(Hub) Repo
2. Work on your computer just like always
3. **Save** your work often just like always
4. When you want to preserve a **snapshot** of your project, you make a "commit"
5. When you have a few commits and want to archive them, you "push" them to the GitHub remote server
6. If you decide to work from a different computer, or want to pick up where a collaborator left off, you can "pull" the most up-to-date version of the files from the GitHub remote to your local computer and go back to step 2.

---

class: sydney-blue, center, middle

# Demo on RStudio

---

# GitHub

## New habits

* When you want to preserve a **snapshot** of your project, you make a "commit."
* When you have a few commits and want to archive them, you "push" them to the GitHub remote server.
* If you decide to work from a different computer, or want to pick up where a collaborator left off, you can "pull" the most up-to-date version of the files from the GitHub remote to your local computer.

---

# GitHub

## Commits

Make your commit message as informative and concise as possible.

<img src="images/git_commit.png" width="439" height="250" style="display: block; margin: auto;" />

---

# GitHub

## "pull" before you "push"

Make sure you have the up-to-date version of your project before working on it. Try to avoid the headaches of a "merge conflict".

.center[
<iframe src="https://giphy.com/embed/cFkiFMDg3iFoI" width="480" height="269" frameBorder="0" class="giphy-embed" allowFullScreen></iframe><p><a href="https://giphy.com/gifs/git-merge-cFkiFMDg3iFoI">via GIPHY</a></p>
]

---

# GitHub

## Common mistakes (and how to solve them)

* **Commits in the wrong Repo**. Nothing seems to work? It's a common mistake. Solution: make sure you work in the correct RStudio project and that it is correctly linked to GitHub.
* **Large files error**. GitHub blocks files larger than 100 MB. Solution: store large files elsewhere (Dropbox, ...)
* **Conflict (not merge)**. Conflicts may happen when two collaborators change the same program at the same time but on different lines of code. One of them pushes the modification to the remote. The second one to push will have a conflict because their version of the project is "outdated". Solution: `git pull --rebase`
* **Merge conflict**.
It happens when two collaborators work on the same lines of code at the same time. It is often a sign of miscommunication and a lack of organization within the group. Solution: edit the conflicting files directly to resolve the conflict, and discuss the changes within the group before pushing.

---

# GitHub

> git gets easier once you get the basic idea that branches are homeomorphic endofunctors mapping submanifolds of a Hilbert space.
> <cite> Isaac Wolkerstorfer </cite>

---

class: sydney-blue, center, middle

# Questions?

.pull-down[

<a href="https://ptds.samorso.ch/">
.white[<svg viewBox="0 0 384 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M369.9 97.9L286 14C277 5 264.8-.1 252.1-.1H48C21.5 0 0 21.5 0 48v416c0 26.5 21.5 48 48 48h288c26.5 0 48-21.5 48-48V131.9c0-12.7-5.1-25-14.1-34zM332.1 128H256V51.9l76.1 76.1zM48 464V48h160v104c0 13.3 10.7 24 24 24h104v288H48z"></path></svg> website]
</a>

<a href="https://github.com/ptds2024/">
.white[<svg viewBox="0 0 496 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3
75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"></path></svg> GitHub] </a> ] <!-- --- --> <!-- # Exercises --> <!-- 1. Create a GitHub repo for the RMarkdown file (.Rmd) you created in the last class. --> <!-- 1. Edit the README.md file, push the .Rmd. --> <!-- 1. By two. Invite (person A) someone else (person B) to work on your repo and try: --> <!-- - Repo is up-to-date. Person B modifies .Rmd and pushes the changes, person A pulls the changes. --> <!-- - Repo is up-to-date. Person A modifies 1st section of .Rmd, person B modifies 2nd section (no conflict) of .Rmd. No push, no pull in between. Now person A commits and pushes. Then person B tries to commit and push. Try to solve until repo is up-to-date. --> <!-- - Same as last point, but person B modifies 1st section of .Rmd (conflict). --> <!-- 1. (optional) Complete the exercise "The Basics of Github". -->
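---

# GitHub

## Appendix: "pull" before you "push" on the command line

The commit, push and pull steps of the basic workflow, and the "Conflict (not merge)" case from the common mistakes, can be reproduced outside RStudio. This is a minimal sketch, assuming Git is installed and a Unix-like shell is available. A local bare repository `remote.git` stands in for the GitHub remote; the users Alice and Bob and all file names are made up for this illustration.

```shell
set -e                                      # stop at the first unexpected error
cd "$(mktemp -d)"                           # work in a throwaway folder

git init -q --bare remote.git               # stands in for the GitHub remote
git clone -q remote.git alice 2>/dev/null   # hide the "empty repository" warning
git -C alice config user.name "Alice"
git -C alice config user.email "alice@example.com"

# Alice: save, commit, push
echo "x <- 1" > alice/analysis.R
git -C alice add analysis.R
git -C alice commit -q -m "Add analysis script"
git -C alice push -q origin HEAD

# Bob starts from the up-to-date project
git clone -q remote.git bob
git -C bob config user.name "Bob"
git -C bob config user.email "bob@example.com"

# Both work at the same time, on different files
echo "y <- 2" >> alice/analysis.R
git -C alice commit -q -am "Add second variable"
git -C alice push -q origin HEAD
echo "Some notes" > bob/notes.md
git -C bob add notes.md
git -C bob commit -q -m "Add notes"

# Bob's version is outdated, so his push is rejected...
if ! git -C bob push -q 2>/dev/null; then
  echo "push rejected: pull first"
fi

# ...until he replays his commit on top of Alice's work
git -C bob pull -q --rebase
git -C bob push -q
count=$(git -C remote.git rev-list --count --all)
echo "commits on the remote: $count"
```

Because Alice and Bob changed different files, the rebase needs no manual editing and the remote ends up with all three commits. Had they edited the same lines, Git would stop and ask for the merge conflict to be resolved by hand.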