3. Ingest data

Before we create a data pipeline, we'll need data to work with. This section explores data assets, the asset lifecycle, and how to set up our first source assets.

Ingestion

Y42 follows the ELT paradigm for processing data: we extract data from a source, then load it, and finally transform it. Extracting data from its source and loading it into a data warehouse is what we call ingestion.
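
The ELT pattern is easy to see in miniature. Below is a minimal, self-contained sketch using Python's built-in sqlite3 module as a stand-in warehouse; the table names and rows are invented for illustration and are not part of Y42 or this guide's dataset.

```python
import sqlite3

# Extract: pull raw records from a source (here, a hard-coded list).
raw_orders = [
    {"id": "o1", "status": 2},
    {"id": "o2", "status": 0},
]

# Load: write the raw data into the "warehouse" unchanged.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (id TEXT, status INTEGER)")
conn.executemany("INSERT INTO raw_orders VALUES (:id, :status)", raw_orders)

# Transform: derive a cleaned model inside the warehouse with SQL.
conn.execute("""
    CREATE TABLE stg_orders AS
    SELECT id, status = 2 AS is_accepted FROM raw_orders
""")
print(conn.execute("SELECT id, is_accepted FROM stg_orders ORDER BY id").fetchall())
```

The key point is that the transform step happens after loading, inside the warehouse, rather than in transit.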

Within Y42, there are over 500 connectors we can use for data ingestion, allowing us to pick and choose the most useful ones. All data ingestion starts with creating a source asset.

About assets

Assets are the building blocks of data pipelines. They are definitions in code that when executed materialize tables in our data warehouse. We call this process a data build. For example, building a source asset results in a source table, and building a staging model results in a staging table (more on those in section 4).

Asset lifecycle

It helps to understand the four lifecycle stages that every asset goes through.

  1. Definition: This is the initial phase where we define the asset either directly as code or through the Y42 interface.

  2. Version Control and CI/CD: After defining the asset, we save and track changes by committing and pushing the changes to Git. This ensures that the asset is properly stored and that we can fix any breaking changes by reverting to a previous version.

  3. Building: When the changes have been successfully saved to the remote Git repository, we can build the asset with the y42 build command. At this point, Y42 creates the table in our data warehouse and runs any data tests we have defined.

  4. Publication: Once tested and verified, we can publish a specific version of an asset on different branches, such as main or develop. This is when the version of an asset becomes the production version that most people in our team will see. We can set read, write, and discover permissions on the organization, space, and asset-type level to restrict access to published assets as necessary.

Create source assets

Now that we have some background knowledge on assets, let's start with defining source assets. The first source we'll ingest is the Space Fuel Shop dataset. This dataset is hosted in a Postgres database and details orders from intergalactic gas stations. We'll use Airbyte's Postgres connector to ingest the data.

Create a working branch

As you may have gathered from the asset lifecycle, Y42 follows the GitOps paradigm. We'll dive further into this in section 6, but let's implement a best practice immediately. We should create a new branch whenever we want to make changes to our pipelines — such as adding a data source. This gives us a safe environment to work in without disrupting live pipelines.

To create a new branch, click on main [default] at the top center of the screen and type in a name for the new branch (e.g., getting-started-guide). Now hit enter; that's all there is to it.

Create a new source asset

To ingest the Space Fuel Shop dataset, we'll use Airbyte's Postgres connector:

  1. In the left panel, select the List view under Asset Editor
  2. Click + in the file navigator
  3. Select Source
  4. Select Ingest
  5. Select Postgres (Airbyte) as the Ingest type
  6. Name the source: raw_space_fuel_shop_data

Now, we'll configure the source asset to point to the right Postgres database and pull in all of the data.

Add a connection secret

Source assets (generally) need a secret to authorize their connection to the data source.

  1. In the Authorization panel, click on the search box Search a secret. When no secrets exist yet, you'll be prompted to add a new secret. If secrets do exist, you can select one of them or click on Add new.

  2. Name the secret postgres_space_fuel_shop_data_{YOUR_NAME}, replacing YOUR_NAME with your name.

  3. Insert the following type-specific configuration variables:

    Variable        Value
    Host            35.198.165.208
    Port            5432 (default)
    Database Name   y42-demo-data
    Username        public_reader
    Password        dontpanic
  4. Set Replication Method to Standard

  5. Set SSH Tunnel Method to No Tunnel

  6. Leave JDBC URL Parameters (Advanced) empty

  7. Set SSL Modes to allow

When you have entered all authorization settings, click Next in the Settings panel, and wait for the schema to load in the Schema selection panel.
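
If you're curious how these settings fit together, they correspond to a standard libpq-style Postgres connection URL. The sketch below merely assembles that URL from the values above for illustration; in practice, Y42 stores them as a secret and manages the connection for you.

```python
def postgres_dsn(host, port, database, user, password, sslmode="allow"):
    """Assemble a libpq-style connection URL from its parts."""
    return (
        f"postgresql://{user}:{password}@{host}:{port}"
        f"/{database}?sslmode={sslmode}"
    )

dsn = postgres_dsn(
    host="35.198.165.208",
    port=5432,
    database="y42-demo-data",
    user="public_reader",
    password="dontpanic",
)
print(dsn)
```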

Select the tables and columns you would like to import

  1. Open the Schema panel.
  2. Wait for the schema to load. For larger datasets this may take a few minutes.
  3. Select all three tables: public/payments, public/orders, and public/customers
  4. You can open the dropdown of a single table to see the individual columns and types. For now, just import all columns.

Commit the new source assets

Our first source is now fully configured. Before we can build the tables and ingest the data, we'll need to commit and push our changes.

  1. Click on the Commit & Push button at the top.
  2. Enter a commit message: ingest: add Space Fuel Shop data source.
  3. Press Commit.
  4. Wait for Y42's pre-configured checks to run.

Y42 will automatically push the changes to the Git remote.

Build source assets

With our source successfully set up, we can build the three tables and load them into the data warehouse. This is where the actual ingestion occurs: we take the raw data from the Postgres source and move it to a place where Y42 can work with it.

Build: materialize the table

  1. Open the drawer at the bottom of your screen
  2. Select the Build tab
  3. Insert the build command to build and import all three tables: y42 build -s source:raw_space_fuel_shop_data.*
  4. Click on Visualize DAG to see the assets this command will build
  5. Click on Build now. You'll see a new row added to the build window. As the blue quarter bar indicates, the build is in queueing mode.
  6. Click on the build row to see more details, where you'll see the pending jobs. You can also navigate to Logs to view build logs. If you click on an individual build job, you'll see that job's details.
  7. After a couple of minutes, the build should be Ready.
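
The selector's trailing .* is a wildcard that matches every table under the source, which is why one command builds all three tables. Y42's actual selector syntax is its own; as a rough analogy, the matching behaves like glob-style patterns, sketched here with Python's fnmatch:

```python
from fnmatch import fnmatch

# Hypothetical fully qualified names for the three tables this guide ingests.
tables = [
    "raw_space_fuel_shop_data.payments",
    "raw_space_fuel_shop_data.orders",
    "raw_space_fuel_shop_data.customers",
]

# "raw_space_fuel_shop_data.*" matches each table under the source.
selected = [t for t in tables if fnmatch(t, "raw_space_fuel_shop_data.*")]
print(selected)
```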

Publication: add metadata

We'll add metadata to the assets to get the most out of our data. Y42 has extensive metadata features, which are especially helpful when working with many data assets and collaborators. For now, we'll focus on just a few: column types, column tests, and column descriptions.

  1. Click on raw_orders in the left file selector

  2. Click Sync columns

  3. Eight columns should be added. If they do not have the correct Type, update the type by double-clicking on the Type and entering the appropriate one:

    Column              Type
    ID                  string
    CUSTOMER_ID         string
    UPDATED_AT          timestamp_ntz
    FULFILLMENT_METHOD  string
    SERVICE_STATION     string
    LINE_ITEMS          string
    STATUS              int
    _Y42_EXTRACTED_AT   timestamp_ntz
  4. Add the following column tests:

    Column              Test             Values
    ID                  Unique
    CUSTOMER_ID         Not null
    FULFILLMENT_METHOD  Accepted Values  SERVICED, SELF_SERVICE
  5. Add the following descriptions:

    Column      Description
    UPDATED_AT  Time at which the order was published to the database in CET.
    STATUS      0 for canceled, 1 for declined, 2 for accepted
  6. Commit and push the changes with an appropriate commit title (e.g., meta: add raw_orders metadata)

You can add more extensive tests and metadata, of course! As a general practice, it is a good idea to have at least a description per column and the appropriate data types selected.
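
The three column tests are easy to state as plain assertions. The sketch below checks them against a couple of made-up rows; Y42 runs the equivalent checks inside the warehouse during a build, so this is only a conceptual illustration.

```python
# Hypothetical sample rows; the real data lives in the warehouse.
sample_orders = [
    {"ID": "o1", "CUSTOMER_ID": "c1", "FULFILLMENT_METHOD": "SERVICED", "STATUS": 2},
    {"ID": "o2", "CUSTOMER_ID": "c2", "FULFILLMENT_METHOD": "SELF_SERVICE", "STATUS": 0},
]

# Unique: no two rows share an ID.
ids = [row["ID"] for row in sample_orders]
assert len(ids) == len(set(ids))

# Not null: every row has a CUSTOMER_ID.
assert all(row["CUSTOMER_ID"] is not None for row in sample_orders)

# Accepted values: FULFILLMENT_METHOD is one of the two known methods.
allowed = {"SERVICED", "SELF_SERVICE"}
assert all(row["FULFILLMENT_METHOD"] in allowed for row in sample_orders)
print("all column tests passed")
```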

View the data

We can now take a sneak peek at the data to see how it's shaping up:

  1. Open the bottom drawer
  2. Navigate to Data, where you will see a preview of the row count
  3. Wait for the data to load and verify that it shows up

Up next

Now that we've created source tables in the data warehouse, we can start building a pipeline that combines the data and transforms it into new models!