STAT 196K - Analyzing and Processing Big Data
This is the course webpage for STAT 196K, Analyzing and Processing Big Data, at CSU Sacramento. The course notes are available here. Announcements, assignment submissions, grades, discussions, lecture recordings, and anything that may identify students are available on Canvas.
-
Introduction SQL
- Write SQL queries
-
Configuring SQL
- Configure SQL environment (AWS Athena)
-
Covid Data Lake
- Use a GUI dashboard to explore and analyze data
- Define terms: dashboard, data warehouse, data lake
-
Homework Clustering
- Apply and interpret k-means clustering
- Apply and interpret principal components analysis
-
Introduction Statistical Learning
- Define standard terms in statistical learning: feature, observation, n x p training data, objective function
- Calculate one iteration of the k-means clustering algorithm
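As a rough illustration of what one k-means iteration involves, the sketch below assigns each point to its nearest center and then moves each center to the mean of its assigned points. The function name `kmeans_step` and the sample points are hypothetical and are not taken from the course materials.

```python
def kmeans_step(points, centers):
    """Run one k-means iteration and return (labels, new_centers)."""

    def sq_dist(a, b):
        # Squared Euclidean distance between two points.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Assignment step: label each point with the index of its nearest center.
    labels = [
        min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
        for p in points
    ]

    # Update step: move each center to the mean of the points assigned to it.
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if members:
            new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
        else:
            # Assumption: an empty cluster keeps its previous center.
            new_centers.append(tuple(old))
    return labels, new_centers


# Made-up data: n = 4 observations, p = 2 features, k = 2 centers.
points = [(1.0, 1.0), (1.0, 2.0), (5.0, 5.0), (6.0, 5.0)]
centers = [(0.0, 0.0), (4.0, 4.0)]
labels, new_centers = kmeans_step(points, centers)
print(labels)       # [0, 0, 1, 1]
print(new_centers)  # [(1.0, 1.5), (5.5, 5.0)]
```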
-
Midterm Review
- Prepare for midterm
-
Synchronizing Code With Version Control
- Describe the idea and use cases for version control
- Use version control to synchronize files between a laptop and a server
-
Homework XML To Matrix
- Apply natural language processing (NLP) techniques to convert unstructured text to a numeric matrix
- Extract interesting data from XML documents
-
Introduction To XML
- Describe the XML data model
- Extract data from XML documents using XPath
-
Introduction Natural Language Processing
- Understand key terms in natural language processing
-
Framing Statistical Questions
- Given data, come up with interesting questions
-
Improving Code Stream Homework
- Critique and improve code by applying general engineering principles
-
Chi Squared Test
- Apply the chi-squared test to determine whether data comes from a particular distribution
- Verify calculations in statistical software
-
Graphical Methods For Comparing Distributions
- Describe column storage
- Create and interpret QQ plots
- Sample from probability distributions to test statistical methods
-
Basic Julia Syntax
- Use basic syntax in the Julia programming language
-
Homework Sampling Stream
- Implement a streaming algorithm
- Create a custom step for a shell pipeline
- Test for a distribution
-
Custom Programs In Shell Pipelines
- Write custom scripts that work in shell pipelines
-
Shell Data Processing Pipelines
- Describe how a filter program processes standard input (stdin) into standard output (stdout)
- Interactively develop data processing pipelines using standard shell commands
- Create minimal working examples to check programs
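To make the idea of a filter program concrete, here is a minimal sketch in Python: it reads lines from an input stream one at a time and writes the lines it keeps to an output stream. The function name `keep_long_lines` and the length rule are hypothetical, chosen only for illustration. The demo uses in-memory streams as a minimal working example.

```python
import io
import sys


def keep_long_lines(instream, outstream, min_length=5):
    """Copy lines from instream to outstream, dropping lines shorter than min_length."""
    for line in instream:
        # Process one line at a time, so the whole input never sits in memory.
        if len(line.rstrip("\n")) >= min_length:
            outstream.write(line)


# Minimal working example with in-memory streams standing in for stdin and stdout.
demo_in = io.StringIO("a\nhello\nhi\nworld!\n")
demo_out = io.StringIO()
keep_long_lines(demo_in, demo_out)
sys.stdout.write(demo_out.getvalue())  # prints "hello" and "world!" on separate lines

# In a shell pipeline, the same function would be wired to the real streams:
# keep_long_lines(sys.stdin, sys.stdout)
```

Writing the filter against generic stream arguments keeps it easy to check on a tiny input before running it on real data.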
-
Testing Distributions
- Apply and interpret chi-squared tests
-
Homework Streaming Large Text File
- Calculate summary statistics from a data stream
- Use pipelines to process a file larger than memory
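One way to calculate summary statistics from a stream is to update running totals as each value arrives, so the data never has to fit in memory. The sketch below uses Welford's online algorithm for the mean and sample variance. This is an illustrative choice, not necessarily the method used in the homework, and the name `streaming_summary` is hypothetical.

```python
def streaming_summary(values):
    """Return (count, mean, sample variance) from a single pass over an iterable."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    # The sample variance is undefined for fewer than two values.
    variance = m2 / (n - 1) if n > 1 else float("nan")
    return n, mean, variance


# The input can be any iterator, for example numbers parsed line by line from stdin.
stream = iter([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n, mean, variance = streaming_summary(stream)
print(n, mean, variance)
```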
-
Cloud Object Storage
- Describe data locality as the key to speed when processing big data
- Perform experiments to estimate time required to download data
- Download objects from Amazon’s Simple Storage Service (S3)
-
Cloud Computing Basics
- Describe the high level concepts of cloud computing and virtualization
- Log in to a remote machine through SSH
- Start and terminate an Amazon Elastic Compute Cloud (EC2) instance
-
Shell First Steps
- Explain reasons to use the shell
- Do basic operations on files and directories, such as listing, creating, deleting, moving, and copying
-
First Day Expectations
- Set expectations for this class
- Break the ice with an activity so you talk to each other
-
Communication For Current Students
Asking questions during live class is the fastest and most efficient way to communicate. Outside of class, the best way to communicate is through Discord and Canvas. For private matters, you can come to office hours or email me at fitzgerald@csus.edu.
-
How To Record Videos In Canvas
We will share our videos with each other through Canvas discussions. This post describes several ways to upload videos and make them available to a Canvas discussion.
-
STAT 196K Schedule
Here is a tentative outline of topics for STAT 196K. The goal of the course is to achieve the following learning outcomes.
-
STAT 196K Syllabus
Course Description: Statistical analysis of large, complex data sets. Topics include memory efficient data processing, the split-apply-combine strategy, rewriting programs for scalability, handling complex data formats, and applications such as statistical learning, dimension reduction, and efficient data representation. Students will access data and run code on remote servers. 3.0 Units. Letter Graded.