Machine Learning Foundations Part 1: Understanding Data
Data: The Foundation of Machine Learning
This is the first post in a 4-part series that takes you behind the scenes of how machines learn. We’ll explore the fundamental building blocks of machine learning: the data, the models, the algorithms, and the hardware that powers today’s technologies. By the end of this series, you’ll have a deeper understanding of how these pieces come together to shape the world around us.
What You’ll Learn in This Series
This is Part 1 of our 4-part journey into machine learning fundamentals. By the end of these four posts, you’ll understand:
Part 1 (Today): Data – The Foundation
Part 2: Models – The Brain (How machines find patterns in data)
Part 3: Algorithms – The Learning Process (How machines learn from data)
Part 4: Hardware – The Workhorse (The technology that runs all of this)
You’ll be able to explain how everyday Machine Learning (ML) and Artificial Intelligence (AI) applications work, from Amazon recommendations to face recognition on your phone. Whether you’re curious about ML and AI or planning to work in this field, this series will give you the foundational knowledge you need.
Like many of you, I was once new to the world of machine learning. When I first started, everyone talked about data, algorithms, and models, but nobody explained the basics. That’s why I’m creating this series – to start from the true beginning.
The Role of Data in Machine Learning
Imagine you’re teaching a small child to recognize cats:
You show them a wide variety of pictures: big cats, small cats, cats of different colors, cats in different poses, and even animals that aren’t cats (like dogs and foxes) so they can learn the differences. This is the data you’re providing. It’s critical because it forms the foundation upon which learning happens. It’s what allows the child (or the machine) to discern what makes a cat a cat, and what makes something else different. Without data, there would be no learning; the child wouldn’t have anything to compare to, and the machine wouldn’t be able to recognize patterns.
As the child learns, they create a mental rulebook: “If it has pointed ears, whiskers, a long tail, and fur, and it meows and moves gracefully, it’s probably a cat!” (That’s the model being created.)
The child follows a systematic learning process: See picture → Check for cat features → Make a guess → Get corrected if wrong → Remember what was wrong → Do better next time. This process of learning from mistakes, referred to as “training”, is similar to how algorithms work.
They also use their brain as the physical system: eyes capture images like a camera, neural pathways process information like a computer chip, and memories are stored like data on a hard drive. That’s the hardware in action!
Machine learning works similarly, but with computers instead of children, and at an almost unimaginable scale and speed. While a child might learn from hundreds of cat pictures over several months, a computer can process millions of images in a fraction of that time. What takes humans years to learn, machines can potentially master in days—or even hours—not because they’re smarter, but because they can process information at speeds far beyond human capability.
However, just like children, computers still need good examples and proper guidance to learn correctly. And that all begins with data.
What Exactly is Data?
Data is essentially information that has been organized in a way that computers can process and understand. Let’s dive into the details (but don’t worry, we’ll keep it clear with plenty of examples).
Structured vs. Unstructured Data
Structured Data
Structured data is highly organized and fits neatly into predefined categories, like filling out a form with specific boxes for each piece of information. For example:
- Your name (must be text)
- Your age (must be a number)
- Your birth date (must be a date)
- Your gender (must be selected from given options)
In the world of computers, we use a “schema”—a set of rules that tells the system what kind of data it can expect and how to organize it. Think of it like the instructions for a job application, where each field has specific requirements (e.g., phone numbers must be numbers, email addresses must have an “@” symbol).
Technical Detail:
Structured data is stored in tables within databases, following a schema. A schema provides a blueprint for how data should be stored and formatted. For example, a customer database might have a table like this:
Customer Database Schema Example:
| customer_id (number) | name (text) | age (number) | signup_date (date) | subscription_type (text) |
|---|---|---|---|---|
| 1 | John Doe | 34 | 2024-01-15 | Premium |
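To make the idea of a schema a little more concrete, here is a minimal Python sketch of the customer table above. The `Customer` class and the `validate` helper are illustrative names chosen for this example, not part of any particular database, and the second row is made up to show what a schema violation looks like.

```python
from dataclasses import dataclass, fields
from datetime import date


@dataclass
class Customer:
    """One row of the customer table, with one typed field per column."""
    customer_id: int
    name: str
    age: int
    signup_date: date
    subscription_type: str


def validate(row: Customer) -> list[str]:
    """Return a list of schema violations (empty if every column has the right type)."""
    problems = []
    for field in fields(row):
        value = getattr(row, field.name)
        if not isinstance(value, field.type):
            problems.append(
                f"{field.name}: expected {field.type.__name__}, got {type(value).__name__}"
            )
    return problems


# The row from the table above.
good_row = Customer(1, "John Doe", 34, date(2024, 1, 15), "Premium")
# A made-up row where the age was entered as text by mistake.
bad_row = Customer(2, "Jane Roe", "thirty-four", date(2024, 2, 1), "Basic")

print(validate(good_row))  # []
print(validate(bad_row))   # ['age: expected int, got str']
```

A real database enforces rules like these for you; the sketch only shows the kind of check a schema implies.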
A real-world example of structured data is your Spotify listening history:
- Song ID: 12345
- Timestamp: 2024-01-15 14:30:00
- Duration Played: 180 seconds
- Skip Status: false
Unstructured Data
Unstructured data is much messier and doesn’t fit neatly into tables. It’s the kind of data humans generate every day: text messages, social media posts, photos, videos, and more. Think about a social media post that contains a photo, a caption with hashtags, mentions of other users, and reactions. There’s no easy way to organize all that into neat rows and columns.
Technical Detail:
Unstructured data often requires special techniques for processing before a computer can analyze it. Preprocessing is like preparing ingredients before cooking: you organize everything so it’s easier to work with. For instance, when you upload a photo, the computer converts the image into numbers, where each pixel is stored as a set of color values. Similarly, voice recordings are translated into numbers that represent sound waves, and text is converted into a numerical code for each letter or symbol.
Example of Image Data:
Pixel: (255, 128, 0)
This represents a single pixel in an image with Red, Green, and Blue (RGB) color values.
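Here is a small Python sketch of that single pixel. It assumes the common 8-bit convention in which each channel is a whole number from 0 to 255; the `to_hex` helper is a name invented for this example and simply shows that the same three numbers can be written as a web-style color code.

```python
# One pixel as (red, green, blue); assumes 8-bit channels (0 to 255).
pixel = (255, 128, 0)
red, green, blue = pixel


def to_hex(rgb):
    """Format an RGB triple as a web-style hex color string."""
    r, g, b = rgb
    return f"#{r:02x}{g:02x}{b:02x}"


print(red, green, blue)  # 255 128 0
print(to_hex(pixel))     # #ff8000
```

Full red, half green, and no blue mix to an orange color.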
Structured and Unstructured Data in Machine Learning
Structured data is highly organized and easy to analyze. It fits neatly into tables, which makes it simple to process with traditional data analysis methods. Examples include customer data, product details, and transaction history. It’s perfect for machine learning models that rely on clear and well-organized information.
Unstructured data, on the other hand, is more complex and requires advanced methods to analyze. Examples include emails, videos, social media posts, and voice recordings. Even though it’s harder to work with, it holds valuable insights. Techniques like Natural Language Processing (NLP) for text or Computer Vision for images help machines understand unstructured data.
Big Data: Scaling Structured and Unstructured Data
Big data refers to enormous datasets that are too large for traditional data management tools. It can consist of both structured and unstructured data, often coming in real-time. Think of big data like managing a massive ocean of constantly shifting information. The key challenge is figuring out how to handle and make sense of it, but when done correctly, big data leads to powerful insights.
In machine learning, big data serves as the fuel that powers algorithms, helping them recognize patterns and make predictions. With more data, machine learning models can become more accurate and predictive. However, it’s not just about the quantity of data—it’s the quality that matters most. High-quality, clean, and relevant data leads to better outcomes in machine learning.
Artificial Intelligence (AI) relies heavily on data too. It uses machine learning and deep learning techniques to simulate human intelligence and decision-making. The more data AI can analyze, the better it can “learn” and improve, making it invaluable in fields like healthcare, customer service, and beyond.
Types of Data (both structured and unstructured)
Understanding the types of data is a key concept in machine learning. The type of data influences how it’s stored, processed, and used to train models (teach them to recognize patterns and make predictions). Here’s a breakdown of common data types, their technical aspects, and real-world examples:
1. Numerical Data
Numerical data represents numbers and can be discrete or continuous:
Discrete Numbers
Technical Detail: Stored as integers (whole numbers) in computers.
Examples:
- Spotify track play count: 157 plays
- Instagram followers: 1,024
- Items in shopping cart: 5
- Email count: 1,337
Use in ML: Discrete numbers are often used for counting or classification tasks, such as customer segmentation based on purchase frequency.
Continuous Numbers
Technical Detail: Stored as floating-point numbers in computers. Floating-point numbers are just numbers with a decimal point, like 3.14 or 0.001.
Examples:
- GPU temperature: 72.5°C
- Download speed: 15.7 Mbps
- Battery level: 85.3%
- Location coordinates: (40.7128°N, 74.0060°W)
Use in ML: Continuous numbers are critical in regression tasks. Regression tasks are used to predict continuous values, like how much something will cost or the temperature tomorrow.
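As a quick illustration of the two kinds of numbers, here is a short Python sketch using values from the examples above. The variable names are made up for this example. It also shows one practical quirk worth knowing: floating-point numbers are approximations, so comparing them for exact equality can surprise you.

```python
import math

play_count = 157    # discrete: a whole number, stored as an int
gpu_temp_c = 72.5   # continuous: stored as a float

print(type(play_count).__name__)  # int
print(type(gpu_temp_c).__name__)  # float

# Floats are stored approximately, so exact comparisons can fail.
print(0.1 + 0.2 == 0.3)              # False
print(math.isclose(0.1 + 0.2, 0.3))  # True
```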
2. Categorical Data
Categorical data represents labels or categories and can be nominal or ordinal:
Nominal (No Order)
Technical Detail: Often stored as “enums” (enumerated types) or string codes. An enum is a way to represent a fixed set of options in programming, like a list of colors: red, green, and blue.
Examples:
- Email categories: PRIMARY, SOCIAL, PROMOTIONS
- Weather conditions: SUNNY, RAINY, CLOUDY
- Payment methods: CREDIT, DEBIT, PAYPAL
- Movie genres: ACTION, COMEDY, DRAMA
Use in ML: Nominal data is used in classification models, such as email spam filters or customer preference analysis.
Ordinal (With Order)
Technical Detail: Stored with both a label and a rank value.
Examples:
- Hotel stars: ⭐ to ⭐⭐⭐⭐⭐
- Education levels: HIGH SCHOOL (1), BACHELORS (2), MASTERS (3), PHD (4)
- Pain scale: 1 to 10
- Performance ratings: NEEDS IMPROVEMENT (1), MEETS EXPECTATIONS (2), EXCEEDS EXPECTATIONS (3)
Use in ML: Ordinal data is used in models where rank matters, like customer satisfaction surveys or risk assessments.
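The difference between nominal and ordinal categories can be sketched with Python’s built-in `enum` module. The class names and the `one_hot` helper are illustrative choices for this example. One-hot encoding is a common way to turn unordered categories into numbers; it is not covered in this post and is shown only as one possible approach.

```python
from enum import Enum, IntEnum


class PaymentMethod(Enum):
    """Nominal: a fixed set of labels with no meaningful order."""
    CREDIT = "credit"
    DEBIT = "debit"
    PAYPAL = "paypal"


class EducationLevel(IntEnum):
    """Ordinal: each label carries a rank, so comparisons make sense."""
    HIGH_SCHOOL = 1
    BACHELORS = 2
    MASTERS = 3
    PHD = 4


def one_hot(method: PaymentMethod) -> list[int]:
    """Encode a nominal value as a list with a single 1 in its own slot."""
    return [1 if member is method else 0 for member in PaymentMethod]


print(one_hot(PaymentMethod.DEBIT))                       # [0, 1, 0]
print(EducationLevel.MASTERS > EducationLevel.BACHELORS)  # True
print(max(EducationLevel.PHD, EducationLevel.HIGH_SCHOOL).name)  # PHD
```

Nominal values get one slot each because no category is "bigger" than another, while ordinal values can keep their rank numbers because the order itself carries information.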
3. Text Data
Text data consists of strings, often requiring conversion to numbers for machine learning:
Technical Detail: Computers convert text into numbers using encoding systems like ASCII or Unicode. For example, the word “Hello” is encoded as 72 101 108 108 111.
Examples:
- Tweet: “Just had the best coffee! #MorningVibes”
- Product review: “Great battery life, fast shipping!”
- Email subject: “Your order has been shipped”
- Chat message: “See you at 6pm”
Use in ML: Text data is the backbone of Natural Language Processing (NLP) tasks, such as sentiment analysis, chatbots, or translation services. NLP is a field in AI that helps computers understand and work with human languages, like converting speech to text or translating languages.
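You can check the “Hello” encoding yourself with a few lines of Python. This sketch uses only built-in functions (`ord`, `chr`, and `str.encode`); the variable names are just for illustration. It also shows that a character outside the basic ASCII set can take more than one byte in UTF-8, a widely used Unicode encoding.

```python
word = "Hello"

# Each character maps to a number (its Unicode code point).
code_points = [ord(ch) for ch in word]
print(code_points)  # [72, 101, 108, 108, 111]

# For plain English letters, the UTF-8 bytes are the same numbers.
print(list(word.encode("utf-8")))  # [72, 101, 108, 108, 111]

# The mapping works in both directions.
print("".join(chr(n) for n in code_points))  # Hello

# An accented e is one code point (233) but two bytes in UTF-8.
e_acute = "\u00e9"
print(ord(e_acute))                   # 233
print(list(e_acute.encode("utf-8")))  # [195, 169]
```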
4. Binary Data
Binary data is represented as 0s and 1s to indicate two states, like true/false or on/off:
Technical Detail: Stored as binary values (0 or 1).
Examples:
- Email opened: YES/NO
- Ad clicked: TRUE/FALSE
- Feature enabled: ON/OFF
- Payment successful: SUCCESS/FAIL
Use in ML: Binary data is widely used in classification tasks, such as detecting spam or predicting churn.
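Labels like YES/NO or ON/OFF usually need to be turned into 0s and 1s before a model can use them. Here is a minimal sketch of that conversion; the `to_bit` helper and the two label sets are assumptions made for this example, built from the labels listed above.

```python
# Labels from the examples above, grouped by the state they represent.
TRUTHY = {"YES", "TRUE", "ON", "SUCCESS"}
FALSY = {"NO", "FALSE", "OFF", "FAIL"}


def to_bit(label: str) -> int:
    """Map a two-state label to 1 or 0, rejecting anything unrecognized."""
    key = label.strip().upper()
    if key in TRUTHY:
        return 1
    if key in FALSY:
        return 0
    raise ValueError(f"not a recognized binary label: {label!r}")


print(to_bit("YES"))      # 1
print(to_bit("off"))      # 0
print(to_bit("SUCCESS"))  # 1
```

Raising an error on unknown labels, rather than guessing, keeps a typo like "MAYBE" from silently becoming a 0.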
5. Unstructured Data
Unstructured data doesn’t fit into traditional rows and columns like structured data does. Instead, it’s much more varied, and it can be complex, messy, or in a format that’s not immediately understandable to machines without specialized processing techniques. In essence, unstructured data represents the rich, multifaceted nature of human activity, including text, images, video, sound, and more. It’s the type of data generated naturally in most real-world applications, but it requires extra steps to be processed and analyzed effectively by machine learning systems.
Technical Detail:
Unstructured data is typically stored in formats that vary widely, which makes it harder to process directly. These formats can include raw pixel data for images, audio waveforms for sound, text documents, and other free-form formats. Computers must break down this data into components that can be analyzed, such as by converting images into numbers, transforming spoken words into text, or extracting features from video and audio streams.
To make sense of unstructured data, techniques like Natural Language Processing (NLP) for text, Convolutional Neural Networks (CNNs) for images, and audio analysis methods for sound are often used. These methods “transform” unstructured data into more usable forms or insights, such as understanding sentiments in social media posts, detecting objects in photos, or transcribing spoken conversations into text.
Examples of Unstructured Data
Surveillance Video
Surveillance video is a prime example of unstructured data. A video file contains frames of images recorded over time, and each frame is filled with pixel data. Additionally, videos may include sound (audio), text-based metadata (like time stamps), and even motion data. Computers can analyze videos to detect certain events (like movement or specific objects) using specialized algorithms, but this requires converting each frame into numerical data and identifying patterns across those frames. This type of data is often analyzed for security purposes, event detection, or anomaly tracking.
Technical Detail: A single frame of video might consist of millions of pixels, each of which has a value for its color (RGB values for Red, Green, Blue). Videos can also contain other information such as timestamps or metadata, adding another layer of complexity. Machine learning algorithms analyze frames individually or across multiple frames to detect objects or behaviors.
Real-World Example: Surveillance systems in retail stores can detect suspicious activity, like shoplifting, by analyzing changes in patterns of movement or identifying people who linger in certain areas for extended periods.
Customer Service Calls
Phone calls to customer service are another form of unstructured data. These calls may contain valuable information, such as customer complaints, questions, or requests, but the data is in the form of audio, making it difficult to process directly without conversion. Techniques such as speech-to-text transcription are used to convert audio into text, allowing sentiment analysis and topic extraction to be performed on the resulting transcript.
Technical Detail: Customer service calls are typically recorded as audio files (wav, mp3, etc.). The speech-to-text process transcribes the audio into written form, making it possible to apply NLP techniques to understand the meaning and sentiment behind the conversations.
Real-World Example: Analyzing customer service calls using NLP allows companies to identify common complaints and feedback trends, improve their services, and even detect signs of frustration or anger in the customer’s voice to prioritize urgent issues.
Social Media Photos
Photos posted on social media platforms are a common example of unstructured data. These images can be anything from selfies to product photos, event pictures, or memes. While humans can easily interpret the content of these images, machines need to analyze the pixel data to identify objects, faces, emotions, or even context.
Technical Detail: Images are represented as grids of pixels, where each pixel is defined by color values (typically RGB, representing red, green, and blue channels). Deep learning techniques, such as CNNs, are used to process and analyze these images, looking for specific objects, faces, or patterns.
Real-World Example: Social media platforms like Instagram or Facebook use image recognition to tag people in photos, suggest relevant hashtags, or filter inappropriate content, based on the analysis of the image data.
Medical Scans
Medical scans, like X-rays, MRIs, and CT scans, are another form of unstructured data. These scans are often in the form of high-resolution images that can contain critical health information but need to be processed to extract useful details, such as identifying signs of disease or abnormalities. For instance, doctors use scans to detect cancer, fractures, and other conditions, but algorithms can assist by automatically analyzing the scans to flag potential issues, speeding up diagnosis.
Technical Detail: Medical images often contain large amounts of pixel data that represent tissue densities, textures, and structures within the body. Algorithms that work with medical imaging data usually involve convolutional neural networks (CNNs) to process and detect patterns in these images, sometimes comparing them to vast libraries of other medical scans for comparative analysis.
Real-World Example: Radiologists use AI tools to assist with interpreting X-ray or MRI images, allowing for faster, more accurate diagnoses of diseases such as pneumonia or cancer. For example, AI systems can detect lung nodules in X-rays, which might be missed by human eyes.
Why This Matters for Machine Learning
Unstructured data is everywhere in the modern world, and being able to process it opens up enormous potential for innovation. Whether it’s extracting insights from social media to gauge public sentiment, analyzing surveillance footage to identify criminal activity, or providing life-saving medical diagnoses, unstructured data is a goldmine for machine learning applications. However, it also presents a significant challenge.
Processing this data often requires powerful machine learning models and large amounts of computational resources, especially for tasks like image recognition, speech-to-text conversion, and sentiment analysis. But as technology advances, these techniques are becoming more accessible and powerful, enabling machines to learn from and make sense of the complex, rich data we encounter every day.
How Computers Handle Unstructured Data
Under the hood, computers store unstructured data as numbers. For example, an image might be stored as a matrix of pixel values. A matrix is like a digital version of a grid or a chessboard, where each square contains numbers that represent colors in the image.
Use in ML: Unstructured data is vital for advanced tasks like object recognition in images, voice-to-text conversion, or diagnosing diseases through medical imaging. Object recognition is how a computer identifies and labels objects in a photo, like recognizing a cat or a tree in an image.
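The grid idea described above can be sketched in plain Python with nested lists. The tiny 2-by-3 image and the helper names are invented for this example, and the grayscale step uses a simple average of the three channels, which is only one of several ways to do it. Real projects typically use a library for this, but the underlying structure is the same.

```python
# A made-up 2x3 image: each row is a list of (red, green, blue) pixels.
image = [
    [(255, 0, 0), (0, 255, 0), (0, 0, 255)],
    [(255, 255, 255), (0, 0, 0), (255, 128, 0)],
]

height = len(image)    # number of rows
width = len(image[0])  # number of pixels per row


def to_grayscale(img):
    """Collapse each pixel to one brightness value (simple channel average)."""
    return [[sum(pixel) // 3 for pixel in row] for row in img]


def flatten(img):
    """Unroll the grid into one long list of numbers, channel by channel."""
    return [value for row in img for pixel in row for value in pixel]


print(height, width)        # 2 3
print(to_grayscale(image))  # [[85, 85, 85], [255, 0, 127]]
print(len(flatten(image)))  # 18
```

The flattened list has 18 numbers because the image has 2 x 3 pixels with 3 color values each. A typical phone photo has millions of pixels, which is why image tasks need so much computing power.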
By recognizing these types of data and their unique challenges, you can better prepare for their use in machine learning models. Each type has its own role in building smarter, more effective systems.
Why This Matters: A Peek at What’s Coming
Now that we’ve explored the foundation of data, let’s uncover how machines work their magic with it. In the next parts of this series, you’ll see how machines transform raw information into intelligent actions:
Part 2: Models
- Discover how Netflix analyzes your watching history to craft spot-on recommendations.
- See how self-driving cars detect stop signs and navigate busy roads.
- Uncover how spam filters defend your inbox from junk mail.
These models are the brains behind machine learning – and they’re smarter than you think!
Part 3: Algorithms
- Learn how machines train on millions of examples to get better every day.
- Find out why some learning methods excel in certain tasks while others fall short.
- See how computers master chess – beating even the world’s greatest players.
Algorithms are the secret sauce, helping machines learn, adapt, and improve.
Part 4: Hardware
- Explore why certain AI tasks require powerful supercomputers.
- Understand how your smartphone runs advanced AI models seamlessly.
- Learn what makes AI training so energy-intensive and how we’re improving it.
Without the right hardware, even the smartest algorithms would hit a wall!
Quick Exercise: Be a Data Detective
Before we dive into these concepts, try this fun exercise:
- Open your favorite app.
- Identify one example of each data type we covered.
- Bonus: Imagine how the app might use this data to improve your experience.
For example, on Instagram:
- Numerical: Like count.
- Categorical: Post type (Photo/Video/Reel).
- Text: Captions.
- Binary: Following status.
- Unstructured: Photos and videos.
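If you like to think in code, the Instagram example above can be written as a small Python sketch that sorts each field into one of the data types from this post. Everything here is hypothetical: the field names, the sample values, and the `classify` helper are made up for illustration and do not reflect how Instagram actually stores its data.

```python
from enum import Enum


class PostType(Enum):
    PHOTO = "photo"
    VIDEO = "video"
    REEL = "reel"


def classify(value) -> str:
    """Guess which data type from this post a Python value belongs to."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "Binary"
    if isinstance(value, (int, float)):
        return "Numerical"
    if isinstance(value, Enum):
        return "Categorical"
    if isinstance(value, str):
        return "Text"
    if isinstance(value, (bytes, bytearray)):
        return "Unstructured"
    return "Unknown"


# A made-up post record with one field per data type.
post = {
    "like_count": 1024,
    "post_type": PostType.REEL,
    "caption": "Just had the best coffee! #MorningVibes",
    "is_following": True,
    "media": b"\x89PNG...",  # placeholder for raw image bytes
}

for field_name, value in post.items():
    print(f"{field_name}: {classify(value)}")
# like_count: Numerical
# post_type: Categorical
# caption: Text
# is_following: Binary
# media: Unstructured
```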
Coming Up Next
In Part 2, we’ll explore how machine learning models turn data into actionable intelligence. You’ll learn how Netflix’s recommendation model predicts your next binge-worthy show, how Spotify’s algorithm creates your perfect playlist, and how ChatGPT’s language model generates human-like conversations.
Every AI innovation – from self-driving cars to virtual assistants – starts with models that are trained on vast amounts of data. These models analyze patterns, make predictions, and continuously improve their accuracy over time.
Mastering the fundamentals of machine learning models is key to unlocking the technology behind these innovations. In the next part, we’ll dive into how these models process raw data and transform it into the intelligent decisions we interact with every day.
See you in Part 2, where we’ll start building the foundation for understanding how models work. And as always, drop any questions or insights in the comments – we’re learning together.