When I first tried Snowflake's Document AI feature, I was genuinely excited about how easy it was to extract information buried in unstructured documents, such as invoices or contracts, directly into Snowflake tables using only SQL.
The process looks like this:
- You assign the specific Snowflake Document AI roles, DOCUMENT_INTELLIGENCE_CREATOR and ML.DOCUMENT_INTELLIGENCE, to the role used to build the model (see the example grants right after this list).
- You build your model on some training documents you upload via the user interface, and once you are comfortable with the results, you publish the model.
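A minimal sketch of the grants, assuming a custom role named doc_ai_role and a schema named doc_ai_db.doc_ai_schema (both are placeholder names; double-check the exact role and privilege names against the current Snowflake Document AI documentation, since the feature is still evolving):

-- Database role that allows working with Document AI
GRANT DATABASE ROLE SNOWFLAKE.DOCUMENT_INTELLIGENCE_CREATOR TO ROLE doc_ai_role;

-- Privilege to create Document AI (ML.DOCUMENT_INTELLIGENCE) model objects in a schema
GRANT CREATE SNOWFLAKE.ML.DOCUMENT_INTELLIGENCE
  ON SCHEMA doc_ai_db.doc_ai_schema TO ROLE doc_ai_role;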
And that's it! You can now query the model using SQL and extract the same fields defined in the training process for new documents stored in a stage.
How it works
You might be wondering: I understand granting roles and referencing models using SQL, but how do I actually build the model? Do I need a separate external service in AWS, Azure, or GCP to call? How do I deal with cost? Do I need to code it using Snowpark? How do I manage the infrastructure resources associated with it? Well, it turns out it's much, much simpler than that. Snowflake abstracts away all the messy aspects of model building; you extract information from documents and validate the results through simple questions and answers:
- First, upload several sample documents to train the model.
- Then, ask the questions whose answers you want to extract from the documents, such as “What are the payment terms?” or “What is the effective date of the agreement?”, and declare a variable to store the result of each question.
- Snowflake will analyze the questions for each uploaded training document and provide the answers along with a confidence score for each answer.
- You can mark the answer as correct, override the response, or delete the response if the necessary information is unavailable in the uploaded document.
And that's the model-building part in a nutshell. There is no coding involved and no specific Snowflake code snippets to copy and adjust for the model you build. Just plain English questions and answers that you can validate until you are comfortable with what the model has learned from the training documents.
You can now reference the model using <model_name>!PREDICT, which outputs a JSON object containing the overall OCR score for the document (the confidence score for the optical character recognition process), as well as a score-and-value pair for each field (answer) defined during training, for every new document scanned from the stage.
SELECT <model_name>!PREDICT(
GET_PRESIGNED_URL(@<stage_name>, RELATIVE_PATH), 1)
FROM DIRECTORY(@<stage_name>);
Here is a sample query that also prints the file name, size, last modified timestamp, file URL, and the JSON object:
SELECT
relative_path as file_name,
size as file_size,
last_modified,
file_url as snowflake_file_url,
DOC_AI_QS_CO_BRANDING!PREDICT(
GET_PRESIGNED_URL('@doc_ai_stage', RELATIVE_PATH ), 1) as json
FROM DIRECTORY(@doc_ai_stage);
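For reference, the json column returned for each document is shaped roughly like this (the field names match the questions defined during training; the values below are purely illustrative):

{
  "__documentMetadata": { "ocrScore": 0.92 },
  "effective_date": [ { "score": 0.97, "value": "2023-01-01" } ],
  "duration": [ { "score": 0.88, "value": "24 months" } ]
}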
From here, you can use JSON functions to extract individual field values and store them in a regular Snowflake table.
CREATE OR REPLACE TABLE output_table
AS
WITH temp AS (
SELECT
relative_path as file_name,
size as file_size,
last_modified,
file_url as snowflake_file_url,
DOC_AI_QS_CO_BRANDING!PREDICT(
GET_PRESIGNED_URL('@doc_ai_stage', RELATIVE_PATH ), 1) as json
FROM DIRECTORY(@doc_ai_stage)
)
SELECT
file_name,
file_size,
last_modified,
snowflake_file_url,
json:__documentMetadata.ocrScore::FLOAT AS ocrScore,
json:effective_date[0]:score::FLOAT AS effective_date_score,
json:effective_date[0]:value::STRING AS effective_date_value,
json:duration[0]:score::FLOAT AS agreement_duration_score,
json:duration[0]:value::STRING AS agreement_duration_value,
..
FROM temp;
Furthermore, you can use the score to decide whether some of the extracted values need manual adjustments (for example, if the score is less than 0.3, double-check the value's correctness for the specific document).
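As a quick sketch, assuming the output_table created above, you could flag documents whose effective date was extracted with low confidence for manual review (the 0.3 threshold is just an illustration):

SELECT
    file_name,
    effective_date_value,
    effective_date_score
FROM output_table
WHERE effective_date_score < 0.3    -- arbitrary example threshold
ORDER BY effective_date_score;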
And that's it; you don't need any other tools to incorporate your private data when building RAGs (Retrieval-Augmented Generation) to enhance LLMs (Large Language Models).
Fine-tune a model
You can also fine-tune a model or adjust some of the questions, and then publish a new version of the model from the same screen where you first built it. Snowflake increments the version number each time you hit publish. You can reference a specific model build version when using the <model_name>!PREDICT function:
SELECT
<model_name>!PREDICT(<presigned_url>, [ <model_build_version> ])
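For example, assuming a second version of the DOC_AI_QS_CO_BRANDING model has been published, you could pin the earlier query to that specific build:

SELECT DOC_AI_QS_CO_BRANDING!PREDICT(
    GET_PRESIGNED_URL('@doc_ai_stage', RELATIVE_PATH), 2) AS json
FROM DIRECTORY(@doc_ai_stage);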
Build Continuous Pipelines
From here, you can either use Snowflake's streams and tasks to continuously process new documents (see the sketch below), or use a declarative orchestration tool built for the cloud, like Y42, which lets you pull data from various sources through a SQL interface that gives you full flexibility instead of UI-driven interfaces, preprocess the raw data using dbt, and publish data assets to downstream consumers in a stateful manner.
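Here is a minimal sketch of the streams-and-tasks route, assuming a landing table named doc_ai_raw (with the same columns as the query above plus a VARIANT column for the JSON), a warehouse named compute_wh, and a stage with a directory table enabled; all of these names are placeholders:

-- Track newly added files in the stage's directory table
CREATE OR REPLACE STREAM doc_ai_stream ON STAGE doc_ai_stage;

-- Periodically run the model on new files only
CREATE OR REPLACE TASK process_new_documents
    WAREHOUSE = compute_wh          -- assumed warehouse name
    SCHEDULE = '5 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('doc_ai_stream')
AS
    INSERT INTO doc_ai_raw (file_name, file_size, last_modified, snowflake_file_url, json)
    SELECT
        relative_path,
        size,
        last_modified,
        file_url,
        DOC_AI_QS_CO_BRANDING!PREDICT(
            GET_PRESIGNED_URL('@doc_ai_stage', relative_path), 1)
    FROM doc_ai_stream
    WHERE METADATA$ACTION = 'INSERT';

ALTER TASK process_new_documents RESUME;

A downstream view or dbt model can then flatten the json column into typed fields, exactly like the CREATE TABLE ... AS SELECT shown earlier.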
If you want to go through a hands-on tutorial on setting up and building a model using Document AI on public data, Snowflake provides this guide.
Would I change anything about the process?
As someone who loves SQL, I would have expected Snowflake to provide a SQL alternative to the UI model-building process. However, this is an iterative process—you ask a question, validate the result, and then move on to the next question. Then, you start over with a new document until you have high confidence in what the model can extract from the training documents. So, I don't feel a SQL interface would simplify the process. It's not like you'd want to train dozens of such models each day without testing.
As I said before, you can re-train a model and reference each version of the model when invoking the PREDICT function.
However, you cannot currently name versions or see what changed from one version to another (e.g., whether new training documents were added, accuracy was improved, or questions were changed) or who made the changes.
I guess all this metadata will soon be exposed to customers, though.
Furthermore, I don't see how you can undrop a model once it's deleted, how you can store the object definition in Git for version control, or how you can restore the model from its definition without going through the UI and redoing the process of uploading documents, asking the same questions, and validating the answers.
Without these capabilities, there will always be a fear that someone who has been granted the Snowflake Document AI roles might accidentally change or delete a model. However, in all fairness, this feature is not yet Generally Available (GA), and based on how I've seen Snowflake develop and release features over the years, I believe complete management of Document AI models will be in place before it is promoted to GA.
Conclusion
I found the process of setting up, building, and using the Snowflake Document AI model to be straightforward. I really appreciated the balance between using the UI for training and building the model versus using SQL to extract data from new documents. After all, you ideally train the model once, and then need a method to invoke the model repeatedly – and what better way than with SQL? Kudos to the engineering and UX teams at Snowflake for seamlessly integrating this feature with the rest of their stack.
IMO, this is a game-changing feature for every organization on Snowflake that wants to reliably use the power of LLMs to extract information from their documents. You don't need to move data elsewhere to train the model; the data remains protected within Snowflake's boundaries, allowing the same governance and security policies to be applied as with other structured and semi-structured data pipelines and assets in your account. Additionally, the underlying model is Snowflake's own proprietary Arctic-TILT, so you don't have to worry about another vendor holding your data; it's all Snowflake. The fact that it's so easy to train, and that you can use SQL to invoke the model afterwards to process documents, is just the cherry on top.
If you are looking for a stateful, declarative way to build reliable data pipelines, Y42 can extract data using SQL from various sources (yes, SQL), process it with dbt, and then publish data assets downstream.