class: center, middle, inverse, title-slide

.title[
# Programming Tools in Data Science
]
.subtitle[
## Lecture #1: Introduction
]
.author[
### Samuel Orso
]
.date[
### 26 September 2024
]

---

# Motivation

* A search for "Data Science" returns more than 900 job offers on jobup.ch.
* More than 3500 job offers for Switzerland on LinkedIn.
* "_Job applicants with computer skills are highly sought-after due to the increase of technology in the workplace._", [Indeed](https://www.indeed.com/career-advice/resumes-cover-letters/computer-skills), August 2023.

<img width="49%" src="images/linkedin_swiss_1.png"/>
<img width="49%" src="images/linkedin_swiss_2.png"/>

---

# Motivation

* Computer skills help in solving problems.

<center><iframe width="640" height="480" src="https://www.youtube.com/embed/Tzin1DgexlE"> </iframe></center>

---

# General goals

* Introduce tools and workflows for reproducible research (R/RStudio, Git/GitHub, etc.);
* Introduce principles of tidy data and tools for data wrangling;
* Exploit data structures to appropriately manage data, computer memory and computations;
* Manipulate data through control structures, instructions, and tailored functions;
* Develop new software tools including functions, Shiny applications, and packages;
* Manage the software development process including version control, documentation (with embedded code), and dissemination to other users.

---

# General goals

<img src="images/diagram.png" width="593" height="459" style="display: block; margin: auto;" />

---

# Why Choose R for Data Science and Business Analytics?

## 1. **Tailor-Made for Data Science**

- **R was built with data in mind**: Designed specifically for statistical computing and data analysis.
- **Comprehensive packages**: Packages like `ggplot2` and `dplyr` (both part of the `tidyverse`) make data manipulation and visualization a breeze.
- **Cutting-edge techniques**: Stay ahead with powerful tools for machine learning, statistical modeling, and data mining.

---

## 2.
**Ideal for Business Analytics**

- **Advanced Statistical Methods**: R has a wide range of packages to handle complex business data.
- **Visualization Powerhouse**: Create stunning, customized visualizations that help tell compelling business stories.
- **Report Generation**: With `RMarkdown` and `Shiny`, you can create automated reports and interactive dashboards for decision-making.

---

## 3. **Supported by a Strong Community**

- **Open-source and free**: R is continually evolving with contributions from data scientists worldwide.
- **Active Community**: Get access to countless tutorials, forums, and resources to grow your skills.

---

## 4. **Integration Capabilities**

- Seamlessly integrate R with other languages and tools (e.g., Python, SQL, Excel, Tableau).
- R fits well into a larger business workflow, making it perfect for **end-to-end analytics solutions**.

---

# Brief histo`R`y

* `R` is a language and environment for statistical computing and graphics;
* It was developed around 1995 by Ross Ihaka and Robert Gentleman at the University of Auckland, as an alternative implementation of the basic `S` language developed by John Chambers and colleagues;
* The oldest available release is version 0.49 (1997-04-23).

<img src="images/extending_r.png" width="267" style="display: block; margin: auto;" />

---

# Main featu`R`es

* Much code written for `S` runs unaltered under `R`;
* `R` provides a variety of statistical and graphical techniques;
* `R` is popular for data science (in competition with Python);
* `R` is open-source;
* `R` is an interpreted language;
* `R` is highly extensible via packages (more than 20,000 available on CRAN);
* `R` can be interfaced with other languages (C++, Fortran, ...).
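
A few of these features can be seen in a handful of lines. The following is a minimal sketch, assuming only a standard `R` installation with its bundled `stats`, `datasets` and `compiler` packages; the object names (`expr`, `fit`, `square`, `square_bc`) are illustrative and not part of the course material.

```r
# R is interpreted: text can be parsed and evaluated at run time
expr <- parse(text = "1 + 2 * 3")
result <- eval(expr)

# Built-in statistical techniques: a simple linear regression on the
# `cars` data set that ships with R (stopping distance against speed)
fit <- lm(dist ~ speed, data = cars)
slope <- unname(coef(fit)["speed"])

# A function can be translated to bytecode with the bundled compiler package;
# the compiled version behaves like the original
square <- function(x) x^2
square_bc <- compiler::cmpfun(square)
```

Running `summary(fit)` or `plot(cars)` afterwards gives a first look at the statistical and graphical side of the language.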
---

# `R` is an interpreted language

<img src="images/cartoon_programming.png" width="551" style="display: block; margin: auto;" />

---

# Compiled program

- The program is translated into native machine instructions (compilation).

<img src="images/compiled.png" width="750" height="111" style="display: block; margin: auto;" />
<img src="images/compiled_pl.png" width="750" height="117" style="display: block; margin: auto;" />

---

# Interpreted program

- The program is translated into an intermediate code (bytecode). An interpreter then performs the required actions.

<img src="images/interpreted.png" width="750" height="111" style="display: block; margin: auto;" />
<img src="images/interpreted_pl.png" width="750" height="117" style="display: block; margin: auto;" />

---

# `R` interfaces to other languages

`R` is essentially written in `C` and `Fortran`. Available interfaces to other languages include:

- `C` via the `.Call()` function
- `Fortran` via the `.Fortran()` function
- `C++` via the `Rcpp` package
- `Python` via `reticulate`, `rPython`, `rJython` or `XRPython`
- `Julia` via `XRJulia`
- `JavaScript` via `V8`
- `Excel`, `JSON`, `SQL`, `Perl`, ...

See Chambers (2017) for a comprehensive discussion on interfacing `R`.

---

# What are LLMs?

- **Definition**: Large Language Models (LLMs) are AI models trained on vast amounts of text data to understand and generate human language.
- **Examples**:
  - GPT-3, GPT-4 (OpenAI)
  - Gemini (Google)
  - LLaMA (Meta)
  - Claude (Anthropic)
- **Capabilities**:
  - Natural language understanding and generation
  - Text completion, summarization, translation
  - Assistance in various domains, including **programming**.

---

**Why are they important for programming?**

- LLMs can treat code as a special type of language.
- They offer assistance in code generation, debugging, and improving programming productivity.

---

# LLM for Programming

**Key Features**:

1. **Code Suggestions**:
   - Automates repetitive coding tasks.
   - Helps in writing boilerplate code.
   - See, for instance, GitHub Copilot.
2. **Error Debugging**:
   - Identifies and resolves bugs in code snippets.
   - Suggests alternative solutions or optimizations.
3. **Code Explanation**:
   - Breaks down complex code into simple explanations.
   - Helps in learning new programming concepts.

---

# Benefits of Using LLMs like ChatGPT in Programming

### **1. Increased Productivity**

- Automates repetitive and boilerplate tasks.
- Helps explore new coding approaches faster.

### **2. Learning and Discovery**

- Explains code, libraries, and new languages in an intuitive manner.
- Great for beginners and advanced users alike.

---

# Challenges and Considerations

### **1. Not Always Correct**

- LLMs can suggest incorrect code, so human oversight is required.

### **2. Context Limitations**

- LLMs often lack the full project context, so they might not understand the specific requirements.

### **3. Ethical Concerns**

- Intellectual property, security, and data privacy must be considered when using AI for programming.

---

class: sydney-blue, center, middle

# Course logistics and expectations

---

# Course logistics and expectations

## Location and time

.pull-left[
.scroll-box-5[
]]

.pull-right[
.scroll-box-5[
* Anthropole 3032
* Every Thursday morning from 9 to 12, either class or practical.
* Verify the schedule on the course website.
]]

---

## Ideal schedule (every Thursday 9 to 12)

| Week | Date | Topic | Instructor |
|---|---|---|---|
| 2 | 26 Sept | [Introduction](https://ptds2024.github.io/class/lecture01), [RMarkdown](https://ptds2024.github.io/class/lecture02_markdown), [GitHub](https://ptds2024.github.io/class/lecture03_github), [Project](https://ptds2024.github.io/class/lecture13_project) | Samuel |
| 3 | 3 Oct | [Data structures](https://ptds2024.github.io/class/lecture04_datastructure), [Control structures](https://ptds2024.github.io/class/lecture05_controlstructure), [Function](https://ptds2024.github.io/class/lecture06_function) | Samuel |
| 4 | 10 Oct | Exercise and Homework 1 | Timofei |
| 5 | 17 Oct | [Object-oriented programming](https://ptds2024.github.io/class/lecture07_OOP), [Webscraping](https://ptds2024.github.io/class/lecture08_webscrap), [Shiny App](https://ptds2024.github.io/class/lecture09_shiny) | Samuel |
| 6 | 24 Oct | Exercise and Homework 2 | Timofei |
| 7 | 31 Oct | [Functional programming](https://ptds2024.github.io/class/lecture10_functional), [Shiny App](https://ptds2024.github.io/class/lecture09_shiny), [Package creation](https://ptds2024.github.io/class/lecture11_pkg) | Samuel |
| 8 | 7 Nov | Exercise and Homework 3 | Timofei |
| 9 | 14 Nov | Data science with R on Google Cloud | Samuel |
| 10 | 21 Nov | TBA | Samuel |
| 11 | 28 Nov | Group project | Timofei |
| 12 | 5 Dec | Group project | Timofei |
| 13 | 12 Dec | Group project | Timofei |
| 14 | 19 Dec | Project presentations | Samuel and Timofei |

---

# Course logistics and expectations

## Requirements

* No IT background is assumed, but students need a strong will to learn useful and practical programming skills (Data Science in Business Analytics)
* Be willing to work and collaborate in groups (4~6 groups)
* Be ready to struggle with your computer!
<center><img src="https://media.giphy.com/media/bPCwGUF2sKjyE/giphy.gif" alt="gif"/></center>

---

## Grading

* Learning outcomes will be assessed based on performance in each of the following categories:

Type | Points
:-- | :--
Semester project | 30
Homework | 30

* 3 individual homework assignments of 10 points each (**penalty for late submission**).
* 1 group project of 30 points.
* No final examination for this class.
* Final presentation of the project on the last day of class (19 Dec).

---

# Course logistics and expectations

## Project

The group project comprises:

- **Presentation**
- **Screencast**
- **Shiny app**
- **R package**
- **GitHub repository**
- **Website**

---

# Course logistics and expectations

## Communication

* We use <img src="images/slack.png" width="200px"/> to communicate and much more
* We use the **NEIN rule**! (No Email, only If Necessary)
* More info at [https://ptds.samorso.ch/](https://ptds.samorso.ch/)
* To access Slack, you need to register and wait for your invitation.

---

## Take 3 minutes to complete the form

<iframe src="https://docs.google.com/forms/d/e/1FAIpQLScQEYxeMdRYxHnvFHbRcJhtSRZeviKehI0vKjDO0WjhxEuW1Q/viewform" width="100%" height="400px" data-external="1"></iframe>

---

class: sydney-blue, center, middle

# Questions?
.pull-down[ <a href="https://ptds.samorso.ch/"> .white[<svg viewBox="0 0 384 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M369.9 97.9L286 14C277 5 264.8-.1 252.1-.1H48C21.5 0 0 21.5 0 48v416c0 26.5 21.5 48 48 48h288c26.5 0 48-21.5 48-48V131.9c0-12.7-5.1-25-14.1-34zM332.1 128H256V51.9l76.1 76.1zM48 464V48h160v104c0 13.3 10.7 24 24 24h104v288H48z"></path></svg> website] </a> <a href="https://github.com/ptds2024/"> .white[<svg viewBox="0 0 496 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"></path></svg> GitHub] </a> ] --- # Everything's done? 
Follow your first tutorial (10 minutes)

## Make sure you have `R` and `RStudio` installed and follow the "R/RStudio installation and setup" tutorial
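
A quick check from the `R` console can confirm that the installation works. The sketch below assumes only base `R`; the package name `rmarkdown` is an illustrative choice, not a stated requirement of the tutorial.

```r
# Report the version of R that is currently running
r_version <- R.version.string
print(r_version)

# Check whether a package is installed without attaching it
# ("rmarkdown" is only an example package name)
has_rmarkdown <- requireNamespace("rmarkdown", quietly = TRUE)
if (!has_rmarkdown) {
  message("rmarkdown is not installed; install.packages(\"rmarkdown\") would add it")
}
```

If the first command prints a version string, `R` itself is set up; the second part reports whether the example package is available.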