Hi, I have trained a link prediction model on a large original graph using the GS CLI (graphstorm.gconstruct.construct_graph & graphstorm.run.gs_link_prediction).
Now I am trying to switch to the GS Python API to perform the following steps:
- Load the trained model.
- Prepare new graph data (more than 1M small graphs).
- Prepare subgraph data (more than 1M subgraphs, each consisting of a subgraph from the original graph plus some new nodes and edges).
- Compute node embeddings for both the new graph data and the subgraph data.
Here are my questions for each step:
Q1. Model loading & feature dimensions
I load the trained model with the following code. However, the model only contains gnn and embed parameters, which take 128-dim features as input. Since my original node/edge data only has 5-dim features, how should I transform my new data so that it matches the model input?
model = gs.create_builtin_lp_gnn_model(train_data.g, config, train_task=False)
model.restore_dense_model('model_path/epoch-2')
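To make the dimension question concrete, here is a minimal plain-PyTorch sketch of what I see when inspecting parameter shapes. The `embed`/`gnn` module names are illustrative stand-ins for the checkpoint's parameter groups, not GraphStorm's actual classes:

```python
import torch
import torch.nn as nn

# Illustrative stand-in for the restored model: an input projection
# ("embed") from raw 5-dim features into a 128-dim hidden space,
# followed by a layer standing in for the GNN encoder.
model = nn.Sequential()
model.add_module("embed", nn.Linear(5, 128))
model.add_module("gnn", nn.Linear(128, 128))

# Print every parameter name with its shape, as one would do on the
# checkpoint to see where the 5-dim input and the 128-dim hidden
# size each appear.
for name, param in model.state_dict().items():
    print(name, tuple(param.shape))

# In this toy model, the "embed" weight maps 5 -> 128, i.e. the
# projection to 128 dims happens inside the model itself.
assert model.state_dict()["embed.weight"].shape == (128, 5)
```

My question is whether the restored GraphStorm model handles the 5-to-128 projection the same way, or whether I must transform the raw features myself before inference.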
Q2. Graph construction performance
I need to process a large number of new graphs. Using graphstorm.gconstruct.construct_graph is extremely slow (processing a graph with 5 nodes takes roughly the same time as a graph with 1M nodes: 20–30 minutes).
Is there a way to accelerate this process?
Alternatively, can I directly use a DGLGraph (instead of DistGraph or GSgnnData) for inference on small graphs?
Q3. Handling subgraphs with new nodes/edges
I need to run inference on subgraphs that consist partly of the original graph and partly of new nodes/edges (each node has a unique ID).
How should I represent this data so that the model can distinguish between the “original” subgraph part and the “new” part?
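One representation I am considering (an illustrative plain-PyTorch sketch, not something GraphStorm prescribes) is a boolean per-node flag marking whether each node comes from the original graph or is newly added:

```python
import torch

# IDs 0..3 come from the original graph; IDs 4..5 are newly added.
num_original, num_new = 4, 2
num_nodes = num_original + num_new

# Boolean mask: True for original nodes, False for new ones.
is_original = torch.zeros(num_nodes, dtype=torch.bool)
is_original[:num_original] = True

# The mask can be stored as a node feature alongside the 5-dim raw
# features, so downstream code can split the two populations anywhere.
feats = torch.randn(num_nodes, 5)
original_feats = feats[is_original]   # shape (4, 5)
new_feats = feats[~is_original]       # shape (2, 5)
print(original_feats.shape, new_feats.shape)
```

Is a mask feature like this the right approach, or does GraphStorm expect the distinction to be encoded some other way (e.g. separate node types)?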
Q4. Extracting node embeddings
Finally, what’s the recommended way to obtain node embeddings for subgraphs (i.e., a subgraph plus newly added nodes and edges)?
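For concreteness, the output I am after per subgraph looks like the result of this toy one-layer mean-aggregation pass in plain PyTorch. It only illustrates the expected shapes (raw 5-dim features in, one 128-dim embedding per node out) and is not GraphStorm's actual encoder:

```python
import torch

# Toy subgraph: 4 nodes with 5-dim raw features, edges as (src, dst).
feats = torch.randn(4, 5)
src = torch.tensor([0, 1, 2, 3])
dst = torch.tensor([1, 2, 3, 0])

# Project raw features into the 128-dim hidden space, then do one
# round of mean aggregation over incoming edges (a stand-in for a
# single GNN layer).
proj = torch.nn.Linear(5, 128)
h = proj(feats)
agg = torch.zeros_like(h)
agg.index_add_(0, dst, h[src])
deg = torch.zeros(4).index_add_(0, dst, torch.ones(4)).clamp(min=1)
emb = agg / deg.unsqueeze(1)

print(emb.shape)  # one 128-dim embedding per node: (4, 128)
```

What I would like to know is which GS Python API call produces this kind of per-node embedding tensor from a restored model.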
It would be very helpful if you could also provide some sample code for these steps.
Thanks in advance for your help!