BigQuery Basics: Query and Analyze Data

You need an All Access Membership on GCP Study Hub to access labs. Click here to purchase if you don't have one.

Lab Objective

In this lab, you'll learn how to use BigQuery, Google Cloud's serverless data warehouse, to analyze data at scale. You'll create a dataset, load sample e-commerce sales data from a CSV file, write SQL queries to analyze it, and save query results. BigQuery can process billions of rows in seconds, making it well suited for data analysis and business intelligence.

Requirements to Pass This Lab

The lab checker verifies the resources you create during the lab:

  • A dataset named lab_dataset located in us-central1
  • A sales_data table in lab_dataset with all 2,000 rows loaded
  • A query_results table in lab_dataset containing saved query results

Before You Begin

Step 0: Authentication

How this works: You need a Google account that uses the same email address as your GCP Study Hub account. When you sign into that Google account, the system will be able to grant you temporary access to a GCP project when you start the lab.

Click the button below to start your lab:

Start Lab

Instructions

Step 1: Open Google Cloud Console

Once your lab starts, use the "Open GCP Console" button from the lab portal to open the Google Cloud Console with your temporary lab project already selected.

Step 2: Navigate to BigQuery

  1. Click the hamburger menu (three horizontal lines) in the top-left corner to open the navigation menu
  2. Find and click on BigQuery in the menu
  3. If prompted, click Done to dismiss the welcome message

Alternatively, search for "BigQuery" in the search bar at the top of the console.

Step 3: Create a Dataset

A dataset is a container for tables in BigQuery, similar to a database schema. Some teams organize datasets by department (sales, marketing), data source (app_logs, external_api), or environment (prod, staging), though the optimal organization strategy depends on your team structure and access control requirements.

  1. In the BigQuery Explorer panel on the left, find your project ID
  2. Click the three dots (⋮) next to your project ID
  3. Select Create dataset
  4. Configure the dataset:
    • Dataset ID: lab_dataset
    • Location: Select us-central1 (Iowa). Location matters for compliance (data sovereignty) and query performance. Data and queries should be in the same region when possible. This lab project has an organization policy that restricts resources to us-central1.
  5. If you expand Advanced options, you'll see several configuration settings (you don't need to change these, but it's helpful to know they exist):
    • Encryption: By default, Google-managed encryption keys are used. Some organizations use customer-managed keys for additional control.
    • Enable table expiration: Not checked by default. When enabled, tables are auto-deleted after a specified number of days. This can help manage storage costs for temporary analysis tables.
    • Default collation: Controls how text sorting and comparison work. The default is typically sufficient for most use cases.
    • Case insensitivity: Controls whether column name matching in queries is case-sensitive. Some teams enable this for compatibility with legacy systems.
  6. Click CREATE DATASET

You should now see lab_dataset appear under your project in the Explorer panel.
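
As an aside, a dataset can also be created with SQL DDL from the query editor. Here's a minimal sketch mirroring the settings above (the expiration option is included purely to illustrate the Advanced options discussed above and isn't needed for this lab):

-- Sketch: DDL equivalent of the console steps above.
CREATE SCHEMA `lab_dataset`
OPTIONS (
  location = 'us-central1',
  default_table_expiration_days = 7  -- optional: auto-delete tables after 7 days
);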

Step 4: Download Sample Data

First, download the sample e-commerce sales data CSV file to your computer.

  1. Click the button below to download the CSV file (94 KB, 2,000 rows):
Download Sample Data CSV

The file will be saved to your Downloads folder. You'll use this in the next step.

Step 5: Create a Table and Load Data

Now you'll create a table in BigQuery and upload the CSV data. BigQuery supports loading data from multiple sources: Cloud Storage, local files, streaming inserts, or other BigQuery tables. For larger datasets, Cloud Storage is generally preferred over direct file uploads due to better reliability and the ability to parallelize the load process.
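
For larger loads from Cloud Storage, the same operation can be expressed in SQL with the LOAD DATA statement. This is only a sketch; the bucket path is a placeholder, and this lab uses the Upload option instead:

-- Sketch: load a CSV from a hypothetical Cloud Storage bucket.
LOAD DATA INTO `lab_dataset.sales_data`
FROM FILES (
  format = 'CSV',
  skip_leading_rows = 1,  -- first row holds column names
  uris = ['gs://your-bucket/sample_sales_data.csv']  -- placeholder path
);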

  1. In BigQuery, click the three dots (⋮) next to lab_dataset
  2. Select Create table
  3. Configure the table creation:
    • Create table from: Upload
    • Select file: Click Browse and choose the sample_sales_data.csv file you just downloaded
    • File format: CSV. BigQuery also supports JSON, Avro, Parquet, and ORC. Columnar formats like Parquet can be faster to import due to better compression and structure, though once the data is loaded into BigQuery it's stored in BigQuery's native columnar format regardless of the source file type.
    • Dataset: lab_dataset (should be pre-filled)
    • Table: sales_data
    • Schema: Check Auto detect. BigQuery will scan your data and infer column types (STRING, INTEGER, FLOAT, DATE, etc.). This is convenient, but you can also define schemas manually for more control (a manual-schema sketch follows at the end of this step).
    • Advanced options: Expand and set Header rows to skip: 1. This tells BigQuery the first row contains column names, not data.
  4. Click CREATE TABLE
  5. Wait for the upload and job to complete (should take 10-15 seconds)

Once complete, you'll see the sales_data table under lab_dataset with 2,000 rows loaded.
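
If you prefer manual schemas over auto-detect, the equivalent DDL is short. A sketch based on the columns described in Step 6 (order_id's type is an assumption; check the file if it matters):

-- Sketch: manually defined schema; types mirror what auto-detect infers.
CREATE TABLE `lab_dataset.sales_data_manual` (
  order_id STRING,          -- assumed STRING; verify against the actual file
  product_category STRING,
  quantity INT64,           -- displayed as INTEGER in the console
  revenue FLOAT64,          -- displayed as FLOAT in the console
  order_date STRING,        -- auto-detect keeps this as STRING (see Step 6)
  region STRING
);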

Step 6: Explore the Data

Before running queries, it's good practice to preview your data to understand its structure and spot any data quality issues.

  1. Click on the sales_data table in the Explorer
  2. Click the SCHEMA tab first to see the detected column types. Notice BigQuery identified quantity as INTEGER, revenue as FLOAT, order_date as STRING (it could be a DATE with proper formatting; a conversion sketch follows below), and text fields as STRING.
  3. Click the PREVIEW tab to see sample rows. Table previews don't scan the full table and aren't charged, which makes them useful for data exploration before writing queries.
  4. Notice the columns:
    • order_id - Unique order identifier
    • product_category - Type of product sold (5 categories: Electronics, Clothing, Home & Garden, Books, Sports & Outdoors)
    • quantity - Number of items per order (1-5)
    • revenue - Total revenue for the order in dollars
    • order_date - When the order was placed (2024 data)
    • region - Geographic region (North, South, East, West)

This dataset contains a year of e-commerce transactions across different regions and product categories, which provides a reasonable basis for practicing basic aggregation queries.
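
Because order_date was detected as STRING, you'd convert it before doing date arithmetic. A sketch, assuming the values are formatted as YYYY-MM-DD (verify in the PREVIEW tab first):

-- Sketch: parse the string column into a DATE (assumes YYYY-MM-DD values).
SELECT
  order_id,
  PARSE_DATE('%Y-%m-%d', order_date) AS order_date_parsed
FROM `lab_dataset.sales_data`
LIMIT 10;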

Step 7: Run Your First Query

This query calculates revenue by region using standard SQL aggregation functions.

  1. Click COMPOSE NEW QUERY (or the + icon in the query editor)
  2. Enter this SQL query:
SELECT
  region,
  COUNT(*) as order_count,
  SUM(revenue) as total_revenue,
  ROUND(AVG(revenue), 2) as avg_revenue
FROM `lab_dataset.sales_data`
GROUP BY region
ORDER BY total_revenue DESC
  3. Before clicking RUN, note the query validator in the top right of the editor. It shows approximately how much data will be processed; in this case it should be in the kilobytes range, so relatively small. BigQuery's pricing is based on data scanned, so this estimate is useful for cost planning.
  4. Click RUN
  5. View the results showing revenue breakdown by region.

About columnar storage: BigQuery uses columnar storage rather than row-based storage. When querying specific columns (like region and revenue), only those columns are read from disk. This design can significantly reduce I/O for analytical queries, particularly on wide tables where you're selecting a small subset of columns. The performance benefit scales with dataset size, though actual query times depend on many factors including query complexity, data distribution, and available compute resources.
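
You can see the columnar effect directly in the validator estimate: a query that references fewer columns scans fewer bytes. Both of these are cheap on a table this small, but the estimates differ:

-- Scans every column in the table:
SELECT * FROM `lab_dataset.sales_data`;

-- Scans only the two columns it references:
SELECT region, revenue FROM `lab_dataset.sales_data`;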

Step 8: Save Query Results

Sometimes you want to save query results as a permanent table instead of just viewing them. This is useful for creating summary tables, building data pipelines, or caching expensive query results to avoid re-scanning large datasets. Let's create a table showing product category performance.

  1. Write this query to find top-selling product categories:
SELECT
  product_category,
  COUNT(*) as total_orders,
  SUM(quantity) as total_units_sold,
  ROUND(SUM(revenue), 2) as total_revenue
FROM `lab_dataset.sales_data`
GROUP BY product_category
ORDER BY total_revenue DESC
  2. Instead of clicking RUN immediately, click MORE > Query settings
  3. Under Destination:
    • Check Set a destination table for query results
    • Dataset: lab_dataset
    • Table: query_results
    • Write preference: Write if empty. Other options: "Overwrite table" (replace existing data) or "Append to table" (add to existing data).
  4. Click SAVE
  5. Click RUN
  6. After the query completes, verify the query_results table appears under lab_dataset. This summary table can be queried directly without re-aggregating the raw data.
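
The destination-table setting you just used can also be expressed directly in SQL with CREATE TABLE ... AS SELECT. A sketch of the same aggregation (the table name here is hypothetical so it doesn't collide with query_results):

-- Sketch: CTAS alternative to setting a destination table in the UI.
CREATE TABLE `lab_dataset.query_results_ctas` AS
SELECT
  product_category,
  ROUND(SUM(revenue), 2) AS total_revenue
FROM `lab_dataset.sales_data`
GROUP BY product_category;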

Note on materialized aggregations: Some data teams maintain pre-aggregated summary tables to reduce query costs and latency for frequently accessed metrics. This approach trades storage costs for reduced compute costs, and works best when the aggregation pattern is well understood and relatively stable. The trade-off depends on your specific query patterns and data refresh requirements.
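
For example, a report that only needs category revenue can read the small summary table instead of re-scanning the 2,000-row source:

-- Reads the pre-aggregated summary rather than the raw sales_data table.
SELECT product_category, total_revenue
FROM `lab_dataset.query_results`
ORDER BY total_revenue DESC;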

Validate Your Work

Get Your Completion Token

Click the button below to open the lab checker in a new window. Sign in with Google to verify your work and get your completion token.

Open Lab Checker

Submit Your Token

Paste your token below and click Validate. If valid, a button will appear to mark the lesson complete.

What You Learned

In this lab you:

  • Created a BigQuery dataset (lab_dataset) and chose its location
  • Loaded CSV data into a table using schema auto-detection
  • Ran aggregation queries with GROUP BY and read the bytes-processed estimate
  • Saved query results to a destination table (query_results)

Troubleshooting

Validation Fails

If your validation fails, check that:

  • The dataset is named exactly lab_dataset and is located in us-central1
  • The sales_data table exists and contains all 2,000 rows
  • The query_results table was created in lab_dataset (Step 8)
  • You did the work in the temporary lab project, not a personal project

CSV Upload Failed

If the CSV file won't upload:

  • Confirm you selected the sample_sales_data.csv file downloaded in Step 4
  • Check that File format is set to CSV
  • Make sure Header rows to skip is set to 1 under Advanced options
  • Re-download the file and try again if the upload keeps failing