
WIP add caching to .skb.apply #2017

Draft

jeromedockes wants to merge 2 commits into skrub-data:main from jeromedockes:dataops-caching

Conversation

@jeromedockes
Member

This is very much a POC / draft. The calls to estimators used in `.skb.apply()` are cached with joblib.
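For reference, the kind of on-disk memoization joblib provides looks roughly like the sketch below. This is not the PR's actual implementation, just an illustration of `joblib.Memory`: the second call with identical arguments is served from the cache instead of re-running the function (the `expensive_fit` function and `call_count` counter are made up for the demo).

```python
import tempfile

import joblib

call_count = {"n": 0}

def expensive_fit(n):
    # Stand-in for an expensive estimator fit; counts real executions.
    call_count["n"] += 1
    return n * 2

# Cache results on disk in a temporary directory.
cache_dir = tempfile.mkdtemp()
memory = joblib.Memory(location=cache_dir, verbose=0)
cached_fit = memory.cache(expensive_fit)

a = cached_fit(21)  # executes expensive_fit and stores the result
b = cached_fit(21)  # loaded from the on-disk cache; expensive_fit is not re-run
```

With this pattern, repeated `fit`/`predict` calls on unchanged inputs can skip the underlying estimator work, which is presumably what produces the timing difference shown in the example below.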

@jeromedockes jeromedockes added the data_ops Something related to the skrub DataOps label Apr 2, 2026
@jeromedockes
Member Author

Example script:

```python
import time

import skrub
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier

X_a, y_a = make_classification(n_samples=100_000, n_features=200)
pred = skrub.X(X_a).skb.apply(HistGradientBoostingClassifier(), y=skrub.y(y_a))
split = pred.skb.train_test_split()
learner = pred.skb.make_learner()

tic = time.perf_counter()
learner.fit(split["train"])
toc = time.perf_counter()
print('first fit', toc - tic)

tic = time.perf_counter()
learner.fit(split["train"])
toc = time.perf_counter()
print('second fit', toc - tic)

tic = time.perf_counter()
result = learner.predict_proba(split["test"])
toc = time.perf_counter()
print('first predict', toc - tic)

tic = time.perf_counter()
result = learner.predict_proba(split["test"])
toc = time.perf_counter()
print('second predict', toc - tic)
```

Output:

```
first fit 3.222974759002682
second fit 0.22381325099559035
first predict 0.13882737699896097
second predict 0.06695482600480318
```

@adrinjalali adrinjalali added this to Labs Apr 13, 2026
@adrinjalali adrinjalali moved this to In progress in Labs Apr 13, 2026
