Data nodes

Taipy is a Python library for building data-driven web applications. Among its various features, it offers a high-level interface for designing and executing data pipelines as an execution graph. In Taipy Core, there are two main objects:

  • Data nodes
  • Tasks

In this article, we mainly focus on data nodes. Data nodes provide a unified way to access data from various sources. This article defines data nodes, their role, and how they are used in Taipy pipelines.

What are data nodes?

Data nodes represent any data: variables, parameters, models, and so on. A data node does not store the data itself; it knows how to retrieve it from its source. Put simply, you can think of a data node as a read and a write function backed by robust features.

Let’s start by defining some vocabulary. Data nodes fall into two groups:

  • Input data nodes;
  • Output data nodes.

(Note that some data nodes can be both.)

Taipy has a set of predefined data nodes ready to be used when configuring your pipelines; the main ones are described below.

Pickle data node:

The Pickle data node is the default data node. It manages Python objects such as strings, integers, lists, dictionaries, models (Machine learning or else), and data frames. The code below uses two Pickle data nodes: one as an input data node and one as an output data node.

  • model is an input Pickle data node pointing to an existing Pickle file (model.p);
  • predictions is an output data node that doesn’t point to anything yet.
from taipy.config import Config
import taipy as tp

# 'predict' is the user-defined function executed by the task:
# it takes the model as input and returns the predictions.
model_cfg = Config.configure_data_node("model",
                                       default_path="model.p")
predictions_cfg = Config.configure_data_node("predictions")
task_cfg = Config.configure_task("task",
                                 predict,
                                 model_cfg,
                                 predictions_cfg)
scenario_cfg = Config.configure_scenario_from_tasks("scenario", [task_cfg])
After configuring this minimalist graph, let’s create a scenario from it and submit it for execution:
scenario = tp.create_scenario(scenario_cfg)
tp.submit(scenario)
When submitting the scenario (for execution), Taipy:

  • retrieves and reads the model,
  • executes the ‘predict’ function,
  • and writes the results in a Pickle file.
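
Once the run completes, the result can be read back directly from the scenario’s data node. Below is a minimal sketch assuming the usual attribute-style access to data nodes:

predictions = scenario.predictions.read()
print(predictions)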

Where does it write the output Pickle file?

Taipy automatically creates and manages paths for Pickle files if none have been defined, which simplifies the configuration. When you create multiple scenarios, the output data nodes of each scenario point to different Pickle files.

scenario_2 = tp.create_scenario(scenario_cfg)
tp.submit(scenario_2)

In this example, we create a second scenario and, as a consequence, a new set of data nodes (model and predictions). model points to the same Pickle file because its path was predefined (i.e., set by the developer), while the new predictions data node points to a different Pickle file created by Taipy at runtime.
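
To verify this, you can compare the files each data node resolves to. This is a minimal sketch assuming file-based data nodes expose their file location through a path attribute:

# 'model' resolves to model.p in both scenarios, while each 'predictions'
# data node resolves to its own Pickle file managed by Taipy.
print(scenario.predictions.path)
print(scenario_2.predictions.path)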

Tabular data nodes:

Tabular data nodes are a group of data nodes representing tabular data. By default, the data they point to is exposed to the user/developer as a Pandas DataFrame. The list of predefined Tabular data nodes is:

  • SQL
  • CSV
  • Excel
  • Parquet
from taipy.config import Config

# 'some_preprocessing' is the user-defined function executed by the task:
# it takes the initial data and returns the preprocessed data.
ini_cfg = Config.configure_csv_data_node("initial_data",
                                         default_path="data.csv")
preprocessed_data_cfg = Config.configure_parquet_data_node("preprocessed_data",
                                                           default_path="preprocessed_data.parquet")
task_cfg = Config.configure_task("preprocessing",
                                 some_preprocessing,
                                 ini_cfg,
                                 preprocessed_data_cfg)
scenario_cfg = Config.configure_scenario_from_tasks("scenario", [task_cfg])

To use Tabular data nodes, you just need to declare them in the configuration and specify a few parameters, such as a default path for CSV or Parquet. Note that you can always change this path at run-time (see the sketch after the submission walkthrough below). For example, when you create a new scenario, you may want its results written to a different file or directory to avoid overwriting.

scenario = tp.create_scenario(scenario_cfg)
tp.submit(scenario)

When submitting the scenario above for execution, Taipy:

  • reads the CSV file (data.csv) as it is the input data node;
  • passes it to the some_preprocessing function with the chosen exposed type (by default, a Pandas DataFrame);
  • writes or overwrites the result in the Parquet file located at ‘preprocessed_data.parquet’.
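
As mentioned earlier, the output path can also be changed at run-time, for instance when creating a second scenario. The snippet below is a minimal sketch assuming file-based data nodes expose a writable path attribute:

scenario_2 = tp.create_scenario(scenario_cfg)
# Write this scenario's results to a separate Parquet file to avoid overwriting
scenario_2.preprocessed_data.path = "preprocessed_data_2.parquet"
tp.submit(scenario_2)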

You can also change this exposed type to other types, such as Numpy arrays.
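
As an illustration, here is a minimal sketch assuming the exposed_type parameter of the CSV data node configuration:

from taipy.config import Config

# Expose the CSV data as Numpy arrays instead of a Pandas DataFrame
ini_cfg = Config.configure_csv_data_node("initial_data",
                                         default_path="data.csv",
                                         exposed_type="numpy")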

Generic data nodes:

The Generic data node can be customized by users to include their own read and write functions. This feature is handy when dealing with data sources that do not have a predefined Taipy Data node. A generic data node can be tailored to access data in specific formats.
Check the documentation for more information.
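
As an illustration, here is a minimal sketch of a Generic data node configuration; the read and write functions below are hypothetical placeholders for your own logic:

from taipy.config import Config

# Hypothetical custom reader/writer for an in-house text format
def read_entries():
    with open("entries.txt") as f:
        return f.read().splitlines()

def write_entries(entries):
    with open("entries.txt", "w") as f:
        f.write("\n".join(entries))

entries_cfg = Config.configure_generic_data_node("entries",
                                                 read_fct=read_entries,
                                                 write_fct=write_entries)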

Document-based data nodes:

Taipy integrates two other predefined storage types to work with documents: JSON and MongoDB collections. Check the documentation for more details.
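
For instance, a JSON data node is configured much like the other file-based data nodes (a minimal sketch, assuming Config.configure_json_data_node):

from taipy.config import Config

# A JSON data node pointing to an existing JSON file
settings_cfg = Config.configure_json_data_node("settings",
                                               default_path="settings.json")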

Conclusion

As mentioned earlier, data nodes act as pointers to data sources. They abstract away the details of how data is stored and retrieved, which makes working with data in a pipeline or a complete web application easier.

In addition, Taipy’s ability to model data nodes makes it possible to skip redundant tasks. Taipy can detect when inputs haven’t changed between two runs and will produce the same outputs, making it unnecessary to execute the task again. This “skippability” feature increases data processing efficiency, saving users time and resources.
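
This behavior is enabled per task. Revisiting the first example, here is a minimal sketch assuming the skippable flag of the task configuration:

# If 'model' has not changed since the last run, the task is skipped
# and the existing 'predictions' are reused.
task_cfg = Config.configure_task("task",
                                 predict,
                                 model_cfg,
                                 predictions_cfg,
                                 skippable=True)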

Florian Jacta

Taipy Customer Success Engineer