|
1 | | -Open-source framework and scripts for harvesting datasets into [PortalJS](https://portaljs.com). |
2 | | -This repo is designed as a **template** — fork or clone it to quickly set up your own dataset harvesting pipelines. |
| 1 | +# PortalJS CKAN Harvester |
3 | 2 |
|
4 | | -It includes: |
| 3 | +A template harvester that pulls datasets from a **CKAN source** and upserts them into a **PortalJS CKAN target**. |
5 | 4 |
|
6 | | -* Reusable scripts for extracting datasets from common sources (APIs, CSVs, spreadsheets, etc.) |
7 | | -* A plug-and-play **ETL framework** for transforming and publishing datasets |
8 | | -* GitHub Actions workflow for automated harvesting |
9 | | -* Config-driven setup — no need to hard-wire pipelines |
| 5 | +**fetch → map → upsert** |
10 | 6 |
|
11 | | -## 🚀 Quickstart |
| 7 | +--- |
12 | 8 |
|
13 | | -1. **Use this template** |
14 | | - Click **“Use this template”** on GitHub to bootstrap your own repo. |
| 9 | +## Quick Start |
15 | 10 |
|
16 | | -2. **Configure harvesters** |
17 | | - Edit `config.yml` to define dataset sources and pipelines: |
| 11 | +```bash |
| 12 | +npm install |
| 13 | +cp .env.example .env # or edit the existing .env |
| 14 | +npm start # run the harvester |
| 15 | +``` |
18 | 16 |
|
19 | | - ```yaml |
20 | | - sources: |
21 | | - - name: world-bank |
22 | | - type: api |
23 | | - url: https://api.worldbank.org/v2/ |
24 | | - format: json |
25 | | - ``` |
| 17 | +--- |
26 | 18 |
|
27 | | -3. **Run** |
| 19 | +## Environment Variables (.env) |
28 | 20 |
|
29 | | -TODO |
| 21 | +Use these exact variable names. Values in angle brackets are placeholders; replace them (brackets included) with your own: |
30 | 22 |
|
31 | | -4. **Automate with GitHub Actions** |
32 | | - Push your repo — harvesting will run on schedule using the included workflow (`.github/workflows/harvest.yml`). |
| 23 | +```env |
| 24 | +# CKAN source |
| 25 | +SOURCE_CKAN_URL=<https://source-ckan.example.org> |
| 26 | +SOURCE_CKAN_API_KEY=<source-api-key-or-empty> |
| 27 | +SOURCE_CKAN_ORG_ID=<org-slug-or-empty> |
33 | 28 |
|
34 | | -## 🛠 Features |
| 29 | +# PortalJS Cloud target |
| 30 | +PORTALJS_CKAN_URL=<http://localhost:5000> |
| 31 | +PORTALJS_CKAN_API_KEY=<target-api-key> |
| 32 | +PORTALJS_ORG_ID=<target-org-id> |
35 | 33 |
|
36 | | -* **Modular scripts** – add your own connectors or reuse provided ones |
37 | | -* **Config-driven** – no need to edit code for new datasets |
38 | | -* **CI/CD ready** – run pipelines directly in GitHub Actions |
39 | | -* **Extensible** – works with PortalJS or standalone |
| 34 | +# Harvest behavior |
| 35 | +CONCURRENCY=4 |
| 36 | +RATE_LIMIT_RPS=2 |
| 37 | +RETRY_MAX_ATTEMPTS=2 |
| 38 | +RETRY_BASE_MS=500 |
40 | 39 |
|
41 | | -## 📦 Repo Structure |
| 40 | +# Incremental window |
| 41 | +SINCE_ISO=2025-02-01T00:00:00Z |
| 42 | +STATE_FILE=.harvest_state.json |
42 | 43 |
|
43 | | -TODO |
| 44 | +``` |
44 | 45 |
|
45 | | -## 🤝 Contributing |
| 46 | +* **`SOURCE_CKAN_URL`** – source CKAN base URL |
46 | 47 |
|
47 | | -PRs and new connectors welcome! |
48 | | -Please open an issue if you’d like to propose a new feature or source integration. |
| 48 | +* **`SOURCE_CKAN_API_KEY`** – source API key (optional) |
49 | 49 |
|
50 | | -## 📄 License |
| 50 | +* **`SOURCE_CKAN_ORG_ID`** – restrict harvest to one org (optional, empty = harvest all) |
51 | 51 |
|
52 | | -MIT License. See [LICENSE](./LICENSE) for details. |
| 52 | +* **`PORTALJS_CKAN_URL`** – target CKAN base URL |
| 53 | + |
| 54 | +* **`PORTALJS_CKAN_API_KEY`** – target API key (**required**) |
| 55 | + |
| 56 | +* **`PORTALJS_ORG_ID`** – target org where datasets will be created (must exist first) |
| 57 | + |
| 58 | +* **`CONCURRENCY`** – how many datasets to process in parallel (optional, default 4) |
| 59 | + |
| 60 | +* **`RATE_LIMIT_RPS`** – max HTTP requests per second (optional, default 2) |
| 61 | + |
| 62 | +* **`RETRY_MAX_ATTEMPTS`** – number of retry attempts on failure (optional, default 2) |
| 63 | + |
| 64 | +* **`RETRY_BASE_MS`** – base delay (ms) for exponential backoff (optional, default 500) |
| 65 | + |
| 66 | +* **`SINCE_ISO`** – harvest only datasets modified after this ISO 8601 timestamp (optional; overrides the state file) |
| 67 | + |
| 68 | +* **`STATE_FILE`** – JSON file that tracks the last run by storing `lastRunISO`, so the harvester can run incrementally instead of fetching everything every time. |
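The variables above can be checked up front so a bad `.env` fails fast. The sketch below is a dependency-free stand-in for the Zod validation described for `config.ts`, not the repository's actual code. The names `parseConfig`, `HarvestConfig`, and the helper functions are hypothetical, and two choices are assumptions: the two URLs and `PORTALJS_ORG_ID` are treated as required (the list only marks `PORTALJS_CKAN_API_KEY` as required explicitly), and `STATE_FILE` falls back to the example value `.harvest_state.json`.

```typescript
// Hypothetical stand-in for the Zod-based validation in config.ts.
type Env = Record<string, string | undefined>;

interface HarvestConfig {
  sourceUrl: string;
  sourceApiKey?: string;
  sourceOrgId?: string;
  targetUrl: string;
  targetApiKey: string;
  targetOrgId: string;
  concurrency: number;
  rateLimitRps: number;
  retryMaxAttempts: number;
  retryBaseMs: number;
  sinceIso?: string;
  stateFile: string;
}

// Empty or whitespace-only values count as "not set".
function optional(env: Env, key: string): string | undefined {
  const value = env[key]?.trim();
  return value ? value : undefined;
}

function required(env: Env, key: string): string {
  const value = optional(env, key);
  if (value === undefined) {
    throw new Error(`Missing required environment variable: ${key}`);
  }
  return value;
}

// Parses an integer setting, applying the documented default when unset.
function intAtLeast(env: Env, key: string, fallback: number, min: number): number {
  const raw = optional(env, key);
  if (raw === undefined) return fallback;
  const parsed = Number(raw);
  if (!Number.isInteger(parsed) || parsed < min) {
    throw new Error(`${key} must be an integer >= ${min}, got "${raw}"`);
  }
  return parsed;
}

function parseConfig(env: Env): HarvestConfig {
  const sinceIso = optional(env, "SINCE_ISO");
  if (sinceIso !== undefined && Number.isNaN(Date.parse(sinceIso))) {
    throw new Error(`SINCE_ISO is not a valid date: "${sinceIso}"`);
  }
  return {
    sourceUrl: required(env, "SOURCE_CKAN_URL"), // assumed required
    sourceApiKey: optional(env, "SOURCE_CKAN_API_KEY"),
    sourceOrgId: optional(env, "SOURCE_CKAN_ORG_ID"),
    targetUrl: required(env, "PORTALJS_CKAN_URL"), // assumed required
    targetApiKey: required(env, "PORTALJS_CKAN_API_KEY"),
    targetOrgId: required(env, "PORTALJS_ORG_ID"), // assumed required
    concurrency: intAtLeast(env, "CONCURRENCY", 4, 1),
    rateLimitRps: intAtLeast(env, "RATE_LIMIT_RPS", 2, 1),
    retryMaxAttempts: intAtLeast(env, "RETRY_MAX_ATTEMPTS", 2, 0),
    retryBaseMs: intAtLeast(env, "RETRY_BASE_MS", 500, 0),
    sinceIso,
    stateFile: optional(env, "STATE_FILE") ?? ".harvest_state.json", // assumed default
  };
}
```

In a Node.js process this would typically be called as `parseConfig(process.env)` after the `.env` file has been loaded.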
| 69 | + |
| 70 | +--- |
| 71 | + |
| 72 | +## How It Works |
| 73 | + |
| 74 | +1. **Discover** datasets from source CKAN (`package_search`), filtered by org and/or date. |
| 75 | +2. **Map** each dataset from source schema → target schema. |
| 76 | +3. **Upsert** into target CKAN (update if exists, create if not). |
| 77 | +4. **Persist state** in `STATE_FILE` for the next incremental run. |
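The four steps above can be sketched as a single loop. This is a simplified illustration under stated assumptions, not the code in `index.ts`: `runHarvest`, `HarvestDeps`, and the dataset interfaces are hypothetical names, datasets are processed one at a time (the real harvester is described as processing `CONCURRENCY` datasets in parallel), and saving state even when some datasets failed is an assumption.

```typescript
// Hypothetical, minimal shapes; the real interfaces live in schemas/.
interface SourceDataset {
  name: string;
  title: string;
  metadata_modified: string;
}

interface TargetDataset {
  name: string;
  title: string;
  owner_org: string;
}

// Each step is injected so the loop can be exercised without a CKAN instance.
interface HarvestDeps {
  discover: (since?: string) => AsyncIterable<SourceDataset>;
  map: (pkg: SourceDataset) => TargetDataset;
  upsert: (dataset: TargetDataset) => Promise<"created" | "updated">;
  saveState: (lastRunISO: string) => Promise<void>;
}

interface HarvestSummary {
  total: number;
  upserts: number;
  failures: number;
}

async function runHarvest(deps: HarvestDeps, since?: string): Promise<HarvestSummary> {
  // Capture the start time first so datasets modified during the run
  // are picked up again by the next incremental run.
  const startedAt = new Date().toISOString();
  const summary: HarvestSummary = { total: 0, upserts: 0, failures: 0 };

  for await (const pkg of deps.discover(since)) {
    summary.total += 1;
    try {
      await deps.upsert(deps.map(pkg)); // map, then update-or-create
      summary.upserts += 1;
    } catch {
      summary.failures += 1; // one bad dataset does not stop the run
    }
  }

  await deps.saveState(startedAt); // assumption: state is saved even with failures
  return summary;
}
```

Injecting the steps as functions keeps the loop testable with in-memory fakes; the returned counters correspond to the `total=… upserts=… failures=…` summary line.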
| 78 | + |
| 79 | +--- |
| 80 | + |
| 81 | +## Project Structure |
| 82 | + |
| 83 | + |
| 84 | + |
| 85 | +* **`index.ts`** – main entry point. Loads env and state, chooses a full or incremental run, loops over datasets, maps and upserts each one, logs results, and updates state. |
| 86 | +* **`config.ts`** – loads `.env` with `dotenv` and validates using **Zod**. |
| 87 | +* **`gen-schema.ts`** – generates `schemas/target-schema.d.ts` from target CKAN scheming API. |
| 88 | +* **`.github/workflows/run-index.yml`** – GitHub Action to run on schedule or manual trigger. |
| 89 | + |
| 90 | +* **`schemas/`** |
| 91 | + |
| 92 | + * **`source-schema.d.ts`** – interface for source datasets. |
| 93 | + * **`target-schema.d.ts`** – auto-generated interface for target datasets. |
| 94 | + |
| 95 | +* **`src/`** |
| 96 | + |
| 97 | + * **`source.ts`** – source CKAN client. |
| 98 | + |
| 99 | + * `iterSourcePackages()` – async generator over `package_search`. |
| 100 | + * Supports org filter and incremental filtering (`metadata_modified >= …`). |
| 101 | + |
| 102 | + * **`target.ts`** – target CKAN helpers. |
| 103 | + |
| 104 | + * Preloads dataset list with `package_list`. |
| 105 | + * `upsertPortalDataset()` creates or updates dataset with API key. |
| 106 | + |
| 107 | + * **`map.ts`** – mapping logic. |
| 108 | + |
| 109 | + * Sets `owner_org` to `PORTALJS_ORG_ID`. |
| 110 | + * Prefixes dataset `name` with `<owner_org>--` (unique, PortalJS-friendly). |
| 111 | + * Maps `title`, `notes`, resources, and ensures defaults (language = EN, description fallback, etc.). |
| 112 | + |
| 113 | + * **`state.ts`** – reads/writes the `STATE_FILE` JSON. |
| 114 | + |
| 115 | + * **`utils.ts`** – small helpers (`withRetry()`, `sleep()`, etc.). |
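As an illustration of the mapping rules listed for `map.ts`, here is a minimal sketch. The function name `mapToTarget`, both interfaces, the title fallback, and the fallback description text are assumptions made for this example; only the `owner_org` assignment, the `<owner_org>--` name prefix, and the `EN` language default come from the description above.

```typescript
// Hypothetical, trimmed-down shapes; see schemas/ for the real interfaces.
interface SourceResource {
  url: string;
  name?: string;
  format?: string;
}

interface SourcePackage {
  name: string;
  title?: string;
  notes?: string;
  language?: string;
  resources?: SourceResource[];
}

interface TargetPackage {
  owner_org: string;
  name: string;
  title: string;
  notes: string;
  language: string;
  resources: Required<SourceResource>[];
}

function mapToTarget(src: SourcePackage, ownerOrg: string): TargetPackage {
  return {
    owner_org: ownerOrg,
    // Prefixing with the org keeps names unique across harvested sources.
    name: `${ownerOrg}--${src.name}`,
    title: src.title ?? src.name, // assumed fallback
    notes: src.notes?.trim() || "No description provided.", // assumed fallback text
    language: src.language ?? "EN",
    resources: (src.resources ?? []).map((r) => ({
      url: r.url,
      name: r.name ?? r.url,
      format: r.format ?? "",
    })),
  };
}
```

For example, `mapToTarget({ name: "air-quality" }, "my-org")` yields a dataset named `my-org--air-quality` with `language` set to `EN` and an empty resource list.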
| 116 | + |
| 117 | +--- |
| 118 | + |
| 119 | +## Running |
| 120 | + |
| 121 | +1. Edit `.env`. |
| 122 | +2. Run `npm start`. |
| 123 | +3. Logs will show: |
| 124 | + |
| 125 | + * “Full harvest mode” or “Incremental mode since <ISO>” |
| 126 | + * Final summary: `total=… upserts=… failures=…` |
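Which of the two modes gets logged depends on `SINCE_ISO` and the state file. The following sketch shows one way that decision could look, assuming `SINCE_ISO` takes precedence over `lastRunISO` as described in the environment variable list. `resolveSince`, `describeMode`, and `HarvestState` are hypothetical names.

```typescript
// Shape of the STATE_FILE contents, per the README (only lastRunISO is documented).
interface HarvestState {
  lastRunISO?: string;
}

// SINCE_ISO wins over the stored state; with neither set, the run is a full harvest.
function resolveSince(sinceIso: string | undefined, state: HarvestState): string | undefined {
  return sinceIso || state.lastRunISO || undefined;
}

// Builds the mode line that appears at the start of the logs.
function describeMode(since: string | undefined): string {
  return since ? `Incremental mode since ${since}` : "Full harvest mode";
}
```

On a first run with no `SINCE_ISO` and no state file, `resolveSince(undefined, {})` returns `undefined`, so the harvester would fetch everything.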
| 127 | + |
| 128 | +--- |
| 129 | + |
| 130 | +## Extending |
| 131 | + |
| 132 | +* **Mapping** – extend `src/map.ts` to add fields (tags, extras, licenses, etc.). |
| 133 | +* **Filters** – extend `iterSourcePackages()` to filter by groups, tags, etc. |
| 134 | +* **Retries** – tweak retry/backoff logic in `utils.ts`. |
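When tuning retries, it helps to see the shape of an exponential backoff helper. The sketch below is an assumption about how `withRetry()` and `sleep()` could be written, not the code in `utils.ts`: the signature is hypothetical, `maxAttempts` is read as the number of retries after the first try (matching the `RETRY_MAX_ATTEMPTS` description), and the delay doubles from `baseMs` on each retry without jitter.

```typescript
// Resolves after the given number of milliseconds.
const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Runs fn once, then retries up to maxAttempts times with exponential backoff:
// baseMs, 2 * baseMs, 4 * baseMs, ...
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 2,
  baseMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Only wait if another attempt is coming.
      if (attempt < maxAttempts) {
        await sleep(baseMs * 2 ** attempt);
      }
    }
  }
  throw lastError;
}
```

With the defaults (`RETRY_MAX_ATTEMPTS=2`, `RETRY_BASE_MS=500`), a call that keeps failing would be tried three times in total, waiting 500 ms and then 1000 ms between tries, before the last error is rethrown.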