Commit 58d3eca

add readme and update names
1 parent 82d323f commit 58d3eca

7 files changed

Lines changed: 148 additions & 41 deletions

.env.example

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
+# CKAN source
+SOURCE_CKAN_URL=https://ckan.com
+SOURCE_CKAN_API_KEY=
+SOURCE_CKAN_ORG_ID=
+
+# PortalJS Cloud target
+PORTALJS_CKAN_URL=https://my-org.portaljs.com
+PORTALJS_CKAN_API_KEY=xyz
+PORTALJS_ORG_ID=my-org
+
+# Harvest behavior
+CONCURRENCY=4
+RATE_LIMIT_RPS=2
+RETRY_MAX_ATTEMPTS=2
+RETRY_BASE_MS=500
+
+# Incremental window
+# If set, harvest only datasets with metadata_modified >= SINCE_ISO
+SINCE_ISO=2025-02-01T00:00:00Z
+# Alternatively, roll-forward state (persisted between runs)
+STATE_FILE=.harvest_state.json
+
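
Note on the settings in `.env.example`: the sketch below shows one way these variables could be parsed with defaults and how the incremental cutoff could be resolved. It is a hypothetical, dependency-free helper, not the repository's `config.ts` (which the README in this commit describes as using `dotenv` and Zod). The names `parseHarvestConfig`, `intOrDefault`, and `resolveSince` are assumptions for illustration. The defaults 4, 2, 2, and 500 are the ones the README lists; the `STATE_FILE` fallback is assumed from the example value.

```typescript
// Hypothetical sketch: parse harvest settings from an env-like record.
type Env = Record<string, string | undefined>;

interface HarvestConfig {
  concurrency: number;
  rateLimitRps: number;
  retryMaxAttempts: number;
  retryBaseMs: number;
  sinceIso: string | undefined;
  stateFile: string;
}

// Read an integer variable; unset or empty values fall back to the default.
function intOrDefault(env: Env, key: string, fallback: number, min: number): number {
  const raw = env[key];
  if (raw === undefined || raw.trim() === "") return fallback;
  const n = Number(raw);
  if (!Number.isInteger(n) || n < min) {
    throw new Error(`${key} must be an integer >= ${min}, got "${raw}"`);
  }
  return n;
}

function parseHarvestConfig(env: Env): HarvestConfig {
  // An empty SINCE_ISO is treated the same as an unset one.
  const since = env.SINCE_ISO?.trim() || undefined;
  if (since !== undefined && Number.isNaN(Date.parse(since))) {
    throw new Error(`SINCE_ISO is not a valid date: "${since}"`);
  }
  return {
    concurrency: intOrDefault(env, "CONCURRENCY", 4, 1),
    rateLimitRps: intOrDefault(env, "RATE_LIMIT_RPS", 2, 1),
    retryMaxAttempts: intOrDefault(env, "RETRY_MAX_ATTEMPTS", 2, 0),
    retryBaseMs: intOrDefault(env, "RETRY_BASE_MS", 500, 0),
    sinceIso: since,
    // Assumed fallback, taken from the example value in .env.example.
    stateFile: env.STATE_FILE?.trim() || ".harvest_state.json",
  };
}

// SINCE_ISO overrides the persisted lastRunISO; undefined means no cutoff.
function resolveSince(cfg: HarvestConfig, lastRunISO?: string): string | undefined {
  return cfg.sinceIso ?? lastRunISO;
}
```

With `SINCE_ISO` unset and no saved `lastRunISO`, `resolveSince` returns `undefined`, which would correspond to the full harvest mode the README mentions.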

README.md

Lines changed: 117 additions & 35 deletions
@@ -1,52 +1,134 @@
-Open-source framework and scripts for harvesting datasets into [PortalJS](https://portaljs.com).
-This repo is designed as a **template** — fork or clone it to quickly set up your own dataset harvesting pipelines.
+# PortalJS CKAN Harvester
 
-It includes:
+A template harvester that pulls datasets from a **CKAN source** and upserts them into a **PortalJS CKAN target**.
 
-* Reusable scripts for extracting datasets from common sources (APIs, CSVs, spreadsheets, etc.)
-* A plug-and-play **ETL framework** for transforming and publishing datasets
-* GitHub Actions workflow for automated harvesting
-* Config-driven setup — no need to hard-wire pipelines
+**fetch → map → upsert**
 
-## 🚀 Quickstart
+---
 
-1. **Use this template**
-   Click **“Use this template”** on GitHub to bootstrap your own repo.
+## Quick Start
 
-2. **Configure harvesters**
-   Edit `config.yml` to define dataset sources and pipelines:
+```bash
+npm install
+cp .env.example .env # or edit the existing .env
+npm start # run the harvester
+```
 
-   ```yaml
-   sources:
-     - name: world-bank
-       type: api
-       url: https://api.worldbank.org/v2/
-       format: json
-   ```
+---
 
-3. **Run**
+## Environment Variables (.env)
 
-   TODO
+Use these exact names. Example values are placeholders:
 
-4. **Automate with GitHub Actions**
-   Push your repo — harvesting will run on schedule using the included workflow (`.github/workflows/harvest.yml`).
+```env
+# CKAN source
+SOURCE_CKAN_URL=<https://source-ckan.example.org>
+SOURCE_CKAN_API_KEY=<source-api-key-or-empty>
+SOURCE_CKAN_ORG_ID=<org-slug-or-empty>
 
-## 🛠 Features
+# PortalJS Cloud target
+PORTALJS_CKAN_URL=<http://localhost:5000>
+PORTALJS_CKAN_API_KEY=<target-api-key>
+PORTALJS_ORG_ID=<target-org-id>
 
-* **Modular scripts** – add your own connectors or reuse provided ones
-* **Config-driven** – no need to edit code for new datasets
-* **CI/CD ready** – run pipelines directly in GitHub Actions
-* **Extensible** – works with PortalJS or standalone
+# Harvest behavior
+CONCURRENCY=4
+RATE_LIMIT_RPS=2
+RETRY_MAX_ATTEMPTS=2
+RETRY_BASE_MS=500
 
-## 📦 Repo Structure
+# Incremental window
+SINCE_ISO=2025-02-01T00:00:00Z
+STATE_FILE=.harvest_state.json
 
-TODO
+```
 
-## 🤝 Contributing
+* **`SOURCE_CKAN_URL`** – source CKAN base URL
 
-PRs and new connectors welcome!
-Please open an issue if you’d like to propose a new feature or source integration.
+* **`SOURCE_CKAN_API_KEY`** – source API key (optional)
 
-## 📄 License
+* **`SOURCE_CKAN_ORG_ID`** – restrict harvest to one org (optional, empty = harvest all)
 
-MIT License. See [LICENSE](./LICENSE) for details.
+* **`PORTALJS_CKAN_URL`** – target CKAN base URL
+
+* **`PORTALJS_CKAN_API_KEY`** – target API key (**required**)
+
+* **`PORTALJS_ORG_ID`** – target org where datasets will be created (must exist first)
+
+* **`CONCURRENCY`** – how many datasets to process in parallel (optional, default 4)
+
+* **`RATE_LIMIT_RPS`** – max HTTP requests per second (optional, default 2)
+
+* **`RETRY_MAX_ATTEMPTS`** – number of retry attempts on failure (optional, default 2)
+
+* **`RETRY_BASE_MS`** – base delay (ms) for exponential backoff (optional, default 500)
+
+* **`SINCE_ISO`** – harvest only datasets modified after this date (overrides state file) (optional)
+
+* **`STATE_FILE`** – JSON file used to track last run. Stores `lastRunISO`. Lets the harvester run incrementally instead of fetching everything every time.
+
+---
+
+## How It Works
+
+1. **Discover** datasets from source CKAN (`package_search`), filtered by org and/or date.
+2. **Map** each dataset from source schema → target schema.
+3. **Upsert** into target CKAN (update if exists, create if not).
+4. **Persist state** in `STATE_FILE` for the next incremental run.
+
+---
+
+## Project Structure
+
+
+
+* **`index.ts`** – main entry. Loads env + state, chooses full vs incremental run, loops datasets, maps, upserts, logs results, updates state.
+* **`config.ts`** – loads `.env` with `dotenv` and validates using **Zod**.
+* **`gen-schema.ts`** – generates `schemas/target-schema.d.ts` from target CKAN scheming API.
+* **`.github/workflows/run-index.yml`** – GitHub Action to run on schedule or manual trigger.
+
+* **`schemas/`**
+
+  * **`source-schema.d.ts`** – interface for source datasets.
+  * **`target-schema.d.ts`** – auto-generated interface for target datasets.
+
+* **`src/`**
+
+  * **`source.ts`** – source CKAN client.
+
+    * `iterSourcePackages()` async generator over `package_search`.
+    * Supports org filter and incremental filtering (`metadata_modified >= …`).
+
+  * **`target.ts`** – target CKAN helpers.
+
+    * Preloads dataset list with `package_list`.
+    * `upsertPortalDataset()` creates or updates dataset with API key.
+
+  * **`map.ts`** – mapping logic.
+
+    * Sets `owner_org` to `PORTALJS_ORG_ID`.
+    * Prefixes dataset `name` with `<owner_org>--` (unique, PortalJS-friendly).
+    * Maps `title`, `notes`, resources, and ensures defaults (language = EN, description fallback, etc.).
+
+  * **`state.ts`** – reads/writes the `STATE_FILE` JSON.
+
+  * **`utils.ts`** – small helpers (`withRetry()`, `sleep()`, etc.).
+
+---
+
+## Running
+
+1. Edit `.env`.
+2. Run `npm start`.
+3. Logs will show:
+
+   * “Full harvest mode” or “Incremental mode since <ISO>”
+   * Final summary: `total=… upserts=… failures=…`
+
+---
+
+## Extending
+
+* **Mapping** – extend `src/map.ts` to add fields (tags, extras, licenses, etc.).
+* **Filters** – extend `iterSourcePackages()` to filter by groups, tags, etc.
+* **Retries** – tweak retry/backoff logic in `utils.ts`.
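
Note on the retry settings described in the README: the commit does not show `utils.ts`, so the following is only a minimal sketch of retry with exponential backoff, assuming `RETRY_MAX_ATTEMPTS` counts retries after the first try and `RETRY_BASE_MS` is the delay before the first retry. The signature of `withRetry` here (including the injectable `sleep` parameter) and the helper `backoffDelayMs` are hypothetical and may differ from the repository's own.

```typescript
// Delay before retry number `retry` (1-based): baseMs, 2*baseMs, 4*baseMs, ...
function backoffDelayMs(retry: number, baseMs: number): number {
  return baseMs * Math.pow(2, retry - 1);
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical signature: run `fn`, retrying up to `maxRetries` times on failure.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxRetries: number = 2,
  baseMs: number = 500,
  sleep: (ms: number) => Promise<void> = defaultSleep,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Wait only if another attempt will follow.
      if (attempt < maxRetries) {
        await sleep(backoffDelayMs(attempt + 1, baseMs));
      }
    }
  }
  // All attempts failed: surface the last error to the caller.
  throw lastError;
}
```

Passing `sleep` as a parameter is a design choice made for this sketch so that the backoff schedule can be checked without real waiting. Adding random jitter to the delay is a common refinement that is left out here.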

index.ts

Lines changed: 2 additions & 2 deletions
@@ -1,8 +1,8 @@
 import Bottleneck from "bottleneck";
 import { env } from "./config";
-import { iterSourcePackages } from "./src/ckan";
+import { iterSourcePackages } from "./src/source";
 import { mapCkanToPortalJS } from "./src/map";
-import { upsertPortalDataset } from "./src/cloud";
+import { upsertPortalDataset } from "./src/target";
 import { readState, writeState } from "./src/state";
 import { withRetry } from "./src/utils";
 
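
Note on the flow these imports serve: the body of `index.ts` is not part of this diff, so the sketch below only illustrates the fetch → map → upsert loop as the README describes it. Every name in it (`mapDataset`, `chooseAction`, `iterDatasets`, `harvest`, and the two interfaces) is hypothetical and stands in for the real `mapCkanToPortalJS`, `upsertPortalDataset`, and `iterSourcePackages`, whose signatures are not shown. The title and description fallbacks are assumed values. The loop is sequential for brevity; the real entry point imports Bottleneck, presumably for concurrency and rate limiting.

```typescript
interface SourceDataset { name: string; title?: string; notes?: string; }
interface TargetDataset {
  name: string; title: string; notes: string; owner_org: string; language: string;
}

// Mapping rules taken from the README: owner_org is the target org, the name
// is prefixed with "<owner_org>--", and missing fields get defaults.
function mapDataset(src: SourceDataset, ownerOrg: string): TargetDataset {
  return {
    name: `${ownerOrg}--${src.name}`,
    title: src.title ?? src.name, // assumed fallback
    notes: src.notes?.trim() || "No description provided.", // assumed fallback text
    owner_org: ownerOrg,
    language: "EN",
  };
}

type UpsertAction = "package_update" | "package_create";

// Update if the name is already in the preloaded target list, else create.
function chooseAction(existing: Set<string>, ds: TargetDataset): UpsertAction {
  return existing.has(ds.name) ? "package_update" : "package_create";
}

// In-memory stand-in for an async generator over source datasets.
async function* iterDatasets(items: SourceDataset[]): AsyncGenerator<SourceDataset> {
  for (const item of items) yield item;
}

interface Summary { total: number; upserts: number; failures: number; }

async function harvest(
  source: AsyncIterable<SourceDataset>,
  existing: Set<string>,
  ownerOrg: string,
  send: (action: UpsertAction, ds: TargetDataset) => Promise<void>,
): Promise<Summary> {
  const summary: Summary = { total: 0, upserts: 0, failures: 0 };
  for await (const src of source) {
    summary.total++;
    try {
      const ds = mapDataset(src, ownerOrg);
      await send(chooseAction(existing, ds), ds);
      existing.add(ds.name); // a later duplicate would become an update
      summary.upserts++;
    } catch {
      summary.failures++; // one failed dataset does not stop the run
    }
  }
  return summary;
}
```

The returned counters mirror the `total=… upserts=… failures=…` summary line the README says the logs will show.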

package-lock.json

Lines changed: 5 additions & 3 deletions
Some generated files are not rendered by default.

package.json

Lines changed: 2 additions & 1 deletion
@@ -2,10 +2,11 @@
   "name": "portaljs-ckan-harverster",
   "version": "1.0.0",
   "scripts": {
-    "start": "ts-node index.ts",
+    "start": "ts-node gen-schema && ts-node index.ts",
     "build": "tsc"
   },
   "devDependencies": {
+    "@types/node": "^24.3.0",
     "ts-node": "^10.9.2",
     "typescript": "^5.9.2"
   },
File renamed without changes.
File renamed without changes.
