Commit 1f5ebd4

Update README.md
1 parent 4833f01 commit 1f5ebd4

1 file changed

Lines changed: 71 additions & 77 deletions

README.md

@@ -37,92 +37,86 @@ We're enabling **the first publicly available and transparent research for acade
 The project is organized into a modular structure to promote maintainability, reusability, and clear separation of concerns. This is the current folder layout but can change over time:
 
 ```
-WebKnoGraph/ (Project root)
-├── assets/ # Project assets (images, etc.)
-│ ├── 01_crawler.png
-│ ├── 02_embeddings.png
+WebKnoGraph/ (Project Root)
+├── .github/
+│ └── workflows/
+│ ├── lint_and_format.yaml
+│ └── python_tests.yaml
+├── assets/
 │ ├── 03_link_graph.png
 │ ├── 04_graphsage_01.png
 │ ├── 04_graphsage_02.png
-│ ├── 06_HITS_PageRank_Sorted_URLs.png
-│ ├── WL_logo.png
+│ ├── bmc-brand-logo.png
+│ ├── crawler_ui.png
+│ ├── embeddings_ui.png
 │ ├── fcse_logo.png
-│ └── kalicube.com.png
-├── data/ # (This directory should typically be empty in the repo, used for runtime output)
-│ ├── link_graph_edges.csv # Example of existing data files
+│ ├── internal-linking-seo-roi-cropped.png
+│ ├── kalicube.com.png
+│ ├── pagerank_ui.png
+│ ├── product_roadmap.png
+│ ├── test_completed_1.png
+│ ├── test_completed_2.png
+│ ├── WebKnoGraph.png
+│ └── WL_logo.png
+├── data/
+│ ├── crawled_data_parquet/
+│ │ └── crawl_date=2025-06-28/
 │ ├── prediction_model/
-│ │ └── model_metadata.json # Example of existing data files
-│ └── url_analysis_results.csv # Example of existing data files
-├── notebooks/ # Jupyter notebooks, each acting as a UI entry point
-│ ├── crawler_ui.ipynb # UI for Content Crawler
-│ ├── embeddings_ui.ipynb # UI for Embeddings Pipeline
-│ ├── link_crawler_ui.ipynb # UI for Link Graph Extractor
-│ ├── link_prediction_ui.ipynb # UI for GNN Link Prediction & Recommendation
-│ └── pagerank_ui.ipynb # UI for PageRank & HITS Analysis (Newly added)
-├── src/ # Core source code for the application
-│ ├── backend/ # Backend logic for various functionalities
-│ │ ├── __init__.py
-│ │ ├── config/ # Configuration settings for each module
-│ │ │ ├── __init__.py
-│ │ │ ├── crawler_config.py
-│ │ │ ├── embeddings_config.py
-│ │ │ ├── link_crawler_config.py
-│ │ │ ├── link_prediction_config.py
-│ │ │ └── pagerank_config.py
-│ │ ├── data/ # Data loading, saving, and state management components
-│ │ │ ├── __init__.py
-│ │ │ ├── repositories.py # For Content Crawler state (SQLite)
-│ │ │ ├── embeddings_loader.py
-│ │ │ ├── embeddings_saver.py
-│ │ │ ├── embedding_state_manager.py
-│ │ │ ├── graph_dataloader.py # For Link Prediction data loading
-│ │ │ ├── graph_processor.py # For Link Prediction data processing
-│ │ │ └── link_graph_repository.py # For Link Graph Extractor state (SQLite) & CSV saving
-│ │ ├── graph/ # Graph-specific algorithms and analysis
-│ │ │ ├── __init__.py
-│ │ │ └── analyzer.py
-│ │ ├── models/ # Machine learning model definitions
-│ │ │ ├── __init__.py
-│ │ │ └── graph_models.py # For GNN Link Prediction (GraphSAGE)
-│ │ ├── services/ # Orchestrators and core business logic for each module
-│ │ │ ├── __init__.py
-│ │ │ ├── crawler_service.py
-│ │ │ ├── embeddings_service.py
-│ │ │ ├── graph_training_service.py
-│ │ │ ├── link_crawler_service.py
-│ │ │ ├── pagerank_service.py
-│ │ │ └── recommendation_engine.py
-│ │ └── utils/ # General utility functions
-│ │ ├── __init__.py
-│ │ ├── http.py # HTTP client utilities (reusable)
-│ │ ├── url.py # URL filtering/extraction for Content Crawler
-│ │ ├── link_url.py # URL filtering/extraction for Link Graph Extractor
-│ │ ├── strategies.py # Crawling strategies (BFS/DFS), generalized for both crawlers
-│ │ ├── text_processing.py # Text extraction from HTML
-│ │ ├── embedding_generation.py # Embedding model loading & generation
-│ │ └── url_processing.py # URL path processing (e.g., folder depth)
-│ └── shared/ # Components shared across frontend and backend
+│ │ ├── edge_index.pt
+│ │ ├── final_node_embeddings.pt
+│ │ ├── graphsage_link_predictor.pth
+│ │ └── model_metadata.json
+│ ├── url_embeddings/
+├── notebooks/
+│ ├── automatic_link_recommendation_ui.ipynb
+│ ├── crawler_ui.ipynb
+│ ├── embeddings_ui.ipynb
+│ ├── link_crawler_ui.ipynb
+│ ├── link_prediction_ui.ipynb
+│ └── pagerank_ui.ipynb
+├── results/
+│ ├── automatic_led/
+│ │ ├── folder_batches/
+│ │ ├── high_batches/
+│ │ ├── high_boosters/
+│ │ ├── low_batches/
+│ │ ├── mixed_batches/
+│ │ └── random_batches/
+│ ├── base_file_types/
+│ ├── expert_led/
+│ │ ├── folder_batches/
+│ │ ├── high_batches/
+│ │ ├── low_batches/
+│ │ ├── mixed_batches/
+│ │ └── random_batches/
+├── src/
+│ ├── backend/
+│ │ ├── config/
+│ │ ├── data/
+│ │ ├── graph/
+│ │ ├── models/
+│ │ ├── services/
+│ │ ├── utils/
+│ │ └── __init__.py
+│ └── shared/
 │ ├── __init__.py
-│ ├── interfaces.py # Abstract interfaces (e.g., ILogger)
-│ └── logging_config.py # Standardized logging setup
-├── tests/ # Top-level directory for all unit tests
-│ ├── backend/ # Mirrors src/backend
-│ │ ├── services/ # Mirrors src/backend/services
-│ │ │ ├── test_crawler_service.py # Unit tests for crawler_service
-│ │ │ ├── test_embeddings_service.py # Unit tests for embeddings_service
-│ │ │ ├── test_link_crawler_service.py # Unit tests for link_crawler_service
-│ │ │ ├── test_graph_training_service.py # Unit tests for graph_training_service
-│ │ │ └── test_pagerank_service.py # Unit tests for pagerank_service (Newly added)
-│ │ └── __init__.py # Makes 'services' a Python package
-│ └── __init__.py # Makes 'backend' a Python package
-├── .github/
-│ └── workflows/
-│ └── python_tests.yaml # GitHub Actions workflow for automated tests
+│ ├── interfaces.py
+│ └── logging_config.py
+├── tests/
+│ ├── backend/
+│ │ ├── services/
+│ │ └── __init__.py
+│ └── __init__.py
+├── .gitignore
+├── .pre-commit-config.yaml
+├── CHANGELOG.md
+├── CITATION.cff
+├── generate_structure_insightful.py
+├── HOW-IT-WORKS.md
 ├── LICENSE
 ├── README.md
 ├── requirements.txt
-├── technical_report/ # Placeholder for documentation
-└── WebKnoGraph_Technical_Report.pdf
+└── trim_ws.py
 ```
 
 ## Starting a Fresh Crawl
