Objectives For Today
- Introductions!
- Set expectations for the bootcamp!
Snapshot Instructor Introduction
You can read more of my introduction on my website: https://gefyra.co/about-jonathan-moo/. Here are some short points on my career:
TLDR - If you prefer video form
If you prefer words
- Worked at Paramount Global (formerly CBS Interactive) for 11 years:
- Spent my first 5.5 years as a software/data engineer on analytics.
- Adobe Analytics
- Google Analytics
- Baidu
- Google Adsense
- Embarked on several data projects that are too numerous to list.
- Flew to Fort Lauderdale, Florida to continue my career in sports predictions and AI initiatives.
- Delivered a Microsoft Golf Forecaster for USD 1.5 million, where I was the only engineer working with a statistician.
- Delivered a Google Assistant voice app for NFL Fantasy recommendations for USD 1.75 million, where I was the lead engineer working on the voice recognition aspects of the project.
- Became a data engineering manager focusing on data science at CBS Sports, where I led a team to build sports prediction products and services.
- Felt that I had spent enough time at CBS Sports after 11 years, so in 2022 I made the jump to PlayStation. Currently working at PlayStation as a data engineering manager for Partners.
- Centralizing Partners data into Snowflake for democratization.
Why these short notes?
A boot camp focuses on learning by doing. However, past students have mentioned that they complete the work and assignments without fully understanding them.
The objective is to give you enough context to understand the material, so that what you learn sticks with you as you progress in your career.
Disclaimer
These short notes are not part of the official curriculum. I'm writing them for the class as part of my teaching style.
- Notes might contain errors.
This site is hosted for free on GitHub, and if it exceeds a certain amount of bandwidth, the site might go down: https://docs.github.com/en/repositories/working-with-files/managing-large-files/about-storage-and-bandwidth-usage
If that happens, I might host it somewhere else. I'm currently building a site for it.
Introduction To Data
Why is data analytics such a hot skill these days?
What does the term 'data science' mean?
“Data science involves spreadsheets and formulas.”
- In my line of work, we try to use as few spreadsheets as possible and rely on databases instead.
- That’s because our work presents itself as a live service rather than pure analysis.
- However, spreadsheets can be useful for exploratory data analysis (EDA).
- We really want to do an EDA to understand the data, in order to:
- Look for errors
- Identify trends and patterns within the data that would meet our business objectives and outcomes.
- The outcome is to mitigate risk before building because engineering efforts are expensive.
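To make the "look for errors" step concrete, here is a minimal sketch of an EDA pass in Python. The records, the `page_views` field, and the helper names `find_errors` and `summarize` are all made up for illustration; a real data set will need its own definition of what counts as an error.

```python
import statistics

# Hypothetical daily page-view records; None marks a missing value.
records = [
    {"day": "Mon", "page_views": 1200},
    {"day": "Tue", "page_views": 1350},
    {"day": "Wed", "page_views": None},
    {"day": "Thu", "page_views": 1280},
    {"day": "Fri", "page_views": -5},
    {"day": "Sat", "page_views": 1500},
]


def is_valid(value):
    """A value is valid here if it is present and not negative (an assumption)."""
    return value is not None and value >= 0


def find_errors(rows, field):
    """Return the rows whose value is missing or negative."""
    return [row for row in rows if not is_valid(row[field])]


def summarize(rows, field):
    """Return basic descriptive statistics over the valid values only."""
    values = [row[field] for row in rows if is_valid(row[field])]
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
    }


errors = find_errors(records, "page_views")
summary = summarize(records, "page_views")
print("Rows to investigate:", errors)
print("Summary of valid rows:", summary)
```

Finding the two bad rows here costs a few lines of code. Finding them after a product has been built on top of the data costs a lot more.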
What is the difference between storytelling and truth telling?
Personally, I believe that all storytelling should be truth telling. However, storytelling is about relating descriptive statistics and data to the desired business outcome.
Truth telling is more objective, where you’re gathering insights for the story that you have to build.
- Especially in unsupervised learning. Suppose you’re tasked to uncover the areas of a city with the highest probability of a dengue fever outbreak.
- Dengue fever is a disease that is transmitted by Aedes mosquitoes. It is very common in tropical areas.
- We don’t start off by knowing when a dengue fever outbreak will happen. That would make us prophets instead. Data scientists use data to validate their assumptions and observations.
- However, if a person reports sick with dengue fever, we know that mosquitoes generally travel within a radius of about 320-330 feet. The probability of another person getting sick with the same disease is much higher if he or she is within that radius, after allowing for some margin of error.
- With more data coming in, we can increase our confidence about a certain spot or area of the city, and that’s where we can start to prescribe solutions to the outbreak.
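The radius check in the dengue example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the coordinates are made up, the 50-foot margin of error is a placeholder I picked, and the distance uses the standard haversine formula with a mean Earth radius of about 6,371 km.

```python
import math

EARTH_RADIUS_FEET = 20_902_231  # mean Earth radius (about 6,371 km) in feet
MOSQUITO_RANGE_FEET = 330       # upper end of the 320-330 feet range


def distance_feet(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance between two points, in feet."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_FEET * math.asin(math.sqrt(a))


def is_at_risk(case, home, margin_feet=50):
    """True if `home` is within mosquito range of `case`, plus a margin.

    `case` and `home` are (latitude, longitude) tuples.
    The default margin is a hypothetical value, not a medical figure.
    """
    return distance_feet(*case, *home) <= MOSQUITO_RANGE_FEET + margin_feet


# Hypothetical coordinates: one reported case and two nearby homes.
reported_case = (1.3000, 103.8000)
home_nearby = (1.3005, 103.8000)
home_far = (1.3050, 103.8000)

print("Nearby home at risk:", is_at_risk(reported_case, home_nearby))
print("Far home at risk:", is_at_risk(reported_case, home_far))
```

Each new reported case adds another circle on the map. Areas where many circles overlap are where our confidence goes up.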
Course Overview
You're going to have 4 projects in this course:
- Project 1
- Using Git and an introduction to version control
- I can’t emphasize enough how important this subject is in my line of work, and within the tech world. I will explain more when we get there.
- EDA
- Installing PostgreSQL and pgAdmin
- Project 2
- Extract-Transform-Load (ETL) - Data Collection
- Before analyzing data, we have to collect the data. The quality and volume of data severely impact your analysis and applications, and thus we’ve got to get the fundamentals right here.
- There are many other aspects of how to collect data, but you will be able to extrapolate your learnings to other applications as ETL is where we all started.
- Project 3
- Data Visualization
- Data Ethics
- If you work for a multinational corporation (MNC), you will need to know enough to navigate this area so that your solution doesn’t end up becoming a lawsuit.
- Particularly in Europe (EU), this topic is highly sensitive. If the business has EU operations, this topic is not something you can skip over.
- Project 4
- Machine Learning (ML) Integration
- Find a problem you want to solve, analyze and visualize.
- Use ML to solve.
Past Learnings
Having taught at Arizona State University (ASU) and University of Miami (UM) before, let me share some learnings that could help you out.
I wrote a Medium post on it, and I will highlight the main points while I leave it to you to read the rest: https://medium.com/@jonathan.moo/9-tips-for-succeeding-in-a-data-analytics-boot-camp-ac72b124535d
Before the Break
We really want folks to know each other better, and this can be a prelude to project work.
If you need to confirm your software installations, use the break to do so.
Break for 15 minutes starting from 8:05pm
Group Activity: The Great Debate
Which food do Americans prefer? Italian or Mexican food?
Not the greatest question ever because it lacks an obvious business outcome, but it’s an exaggeration to prove a point.
Many times, binary classification is very important, and I will give you some examples:
- Is this image family-safe, or not?
- Was the sports prediction accurate after the fact, or not?
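For the sports prediction question, checking a binary classifier "after the fact" comes down to comparing predictions with actual results. Here is a minimal sketch in Python. The game outcomes are made up, and `confusion_counts` and `accuracy` are hypothetical helper names.

```python
# Hypothetical data: True means "home team wins".
predicted = [True, True, False, True, False, False, True, False]
actual = [True, False, False, True, False, True, True, False]


def confusion_counts(predicted, actual):
    """Count true positives, true negatives, false positives, false negatives."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    tn = sum(1 for p, a in pairs if not p and not a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    return tp, tn, fp, fn


def accuracy(predicted, actual):
    """Share of predictions that matched the actual result."""
    tp, tn, fp, fn = confusion_counts(predicted, actual)
    return (tp + tn) / (tp + tn + fp + fn)


tp, tn, fp, fn = confusion_counts(predicted, actual)
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"Accuracy: {accuracy(predicted, actual):.2f}")
```

Accuracy is only one way to score a yes/no prediction. For a question like the family-safe image one, the cost of a false positive and a false negative can be very different, so the four counts are often more useful than the single number.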
Analytics Paradigm
The team and business context can shift how this paradigm is going to work for you.
Often, we use Agile principles and methodologies to run teams so that we produce regular and repeatable outputs, in order to move the work towards our business outcomes. It is by no means a waterfall formula.
- “Waterfall” means it flows from top to bottom, and it is usually inflexible.
- In agile, this process is going to be iterative and flexible.
- We use milestones as maturity indicators for a project, and the analytics paradigm fits into our milestones.
- Agile is out of scope in terms of learning, but do remember to be flexible in how you use the analytics paradigm.
In many companies, if you’re a data analyst, data engineer, or a data scientist, you will work with a scrum master who will work with you to apply this analytics paradigm with agile principles. Thus, you don’t have to be overly concerned with agile principles.
Group Activity: Predicting Gentrification
What is `Gentrification`?
It is the process of changing the character of a neighborhood through the influx of more affluent residents and businesses.
Homework Discussion
I’m not the person grading your assignments. There’s a team that does it.
- This promotes objectivity.
What is the goal of the next homework?
- To ensure you can manipulate data within Excel at will, and perform an EDA over the data set.
- There are many ways or tools to use, but Excel is one of the “lowest barrier to entry” tools available in the market.
- Excel has a hard limit on the volume of data it can take: 1,048,576 rows per worksheet. As a data set approaches a million records, you can expect sluggish behavior from your local machine.
- That’s because your local machine has limited memory and processing power. In data science, volume of data matters in terms of accuracy, and thus we use cloud computing regularly to analyze data.
- However, running your EDA on a sample that is large enough to be statistically meaningful is very important for proving your logic, because engineering and computing are expensive. Excel can be very useful in this regard.
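The idea of working on a sample instead of the full data set can be sketched in Python. Everything here is an assumption for illustration: the "full" data set is randomly generated, the sample size of 10,000 is arbitrary, and the standard error of the mean is used as a rough indicator of how far the sample mean might be from the full mean.

```python
import math
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical "full" data set: more rows than a single Excel worksheet can hold.
population = [random.gauss(50, 10) for _ in range(1_200_000)]

# Simple random sample that fits comfortably in a spreadsheet.
sample = random.sample(population, 10_000)

population_mean = statistics.fmean(population)
sample_mean = statistics.fmean(sample)

# Standard error of the mean, estimated from the sample alone.
standard_error = statistics.stdev(sample) / math.sqrt(len(sample))

print(f"Full data mean: {population_mean:.2f}")
print(f"Sample mean:    {sample_mean:.2f}")
print(f"Standard error: {standard_error:.3f}")
```

In practice you usually cannot compute the full data mean cheaply, which is why the standard error from the sample matters. If it is small relative to the effect you care about, the sample is probably large enough to prove your logic before you pay for the full run.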