Add search_people_with_past_company tool for advanced people filtering#205
guykwan wants to merge 580 commits into stickerdaniel:main from
Conversation
…iel#91)

> [!NOTE]
> Automates Docker Hub page updates during releases.
>
> - Adds `Update Docker Hub description` step in `release.yml` using `peter-evans/dockerhub-description@v5` with repo credentials and `readme-filepath` pointing to `docs/docker-hub.md`
> - Introduces `docs/docker-hub.md` containing a concise image description, features, and quick-start instructions (cookie auth and uvx session mount)
>
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit ee39269.</sup>
…kerdaniel#92)

> [!NOTE]
> Improves documentation for authentication, session handling, and Docker usage across `README.md` and `docs/docker-hub.md`.
>
> - **Security**: Adds warning that `~/.linkedin-mcp/session.json` contains sensitive auth data
> - **Auth/session flow**: Promotes `--get-session` for browser login, clarifies captcha/2FA handling, and points users to uvx to resolve challenges
> - **Docker guidance**: Clearly states `--get-session`/`--no-headless` aren't available in Docker; provides two auth options (mount session or pass `li_at` cookie) with examples and notes
> - **DXT and local setup**: Updates steps to create session first, then run; simplifies notes and troubleshooting; separates login vs scraping issues
> - **Copy/consistency**: Tightens wording, aligns CLI options and examples, fixes/updates links and formatting
> [!NOTE]
> Minor release version bump.
>
> - Updates project version in `pyproject.toml` from `2.1.1` to `2.1.2`
> - Syncs `uv.lock` to reflect the new package version
chore(deps): pin dependencies
Updated instructions to use an incognito tab for obtaining the 'li_at' cookie.
chore(deps): update oven-sh/setup-bun digest to db6bcf6
chore(deps): update oven-sh/setup-bun digest to 3d26778
chore(deps): update anthropics/claude-code-action digest to a017b83
Move semantic validation (ranges, positive values) from loaders to schema classes. Add BrowserConfig.validate() for viewport, timeout, and slow_mo validation. Call validate() at end of load_config().

- Add new env vars: TIMEOUT, USER_AGENT, HOST, PORT, HTTP_PATH, SLOW_MO, VIEWPORT
- Add --linkedin-cookie CLI argument
- Fix --viewport default to None (was overwriting env vars)
- Change viewport CLI error from warning to ConfigurationError

> [!NOTE]
> Shifts semantic validation from loaders into `BrowserConfig.validate()` and `AppConfig.validate()`, with a final `config.validate()` call in `load_config()`.
>
> - Adds env vars: `TIMEOUT`, `USER_AGENT`, `HOST`, `PORT`, `HTTP_PATH`, `SLOW_MO`, `VIEWPORT`; removes `DEFAULT_TIMEOUT`
> - Adds CLI: `--linkedin-cookie`; sets `--viewport` default to `None` and raises `ConfigurationError` on bad format
> - Validates and parses integers for `TIMEOUT`, `PORT`, `SLOW_MO`; rejects invalid `TRANSPORT`
> - Keeps loaders focused on reading values; schema enforces ranges/format (viewport, timeout, slow_mo, port, path)
This PR contains the following updates:

| Package | Type | Update | Change |
|---|---|---|---|
| ghcr.io/astral-sh/uv | final | pinDigest | → `9a23023` |
| python | stage | pinDigest | → `4a3ceab` |

### Configuration

📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

👻 **Immortal**: This PR will be recreated if closed unmerged. Get [config help](https://redirect.github.com/renovatebot/renovate/discussions) if that's undesired.

- [ ] If you want to rebase/retry this PR, check this box

This PR was generated by [Mend Renovate](https://mend.io/renovate/). View the [repository job log](https://developer.mend.io/github/stickerdaniel/linkedin-mcp-server).
Use direct .get() lookup for date_posted and sort_by (single-select filters). Remove unreachable _RATE_LIMITED_MSG check after early break. Query _get_total_search_pages only once per search to avoid repeated evaluate() calls when the element is absent.
Apply quote_plus to date_posted and sort_by passthrough values to prevent malformed URLs from unexpected input. Use consistent 1-indexed page numbers in all debug log messages.
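The `quote_plus` hardening described here can be sketched in isolation. The URL shape and the `f_TPR`/`sortBy` parameter names below are illustrative assumptions, not necessarily the project's actual query format:

```python
from urllib.parse import quote_plus

# Hypothetical passthrough values, including hostile input with URL
# metacharacters that would otherwise inject an extra query parameter.
date_posted = "past week&extra=1"
sort_by = "most recent"

# quote_plus turns spaces into '+' and percent-encodes '&' and '=',
# so each value stays confined to its own query parameter.
url = (
    "https://www.linkedin.com/jobs/search/"
    f"?f_TPR={quote_plus(date_posted)}&sortBy={quote_plus(sort_by)}"
)
print(url)
# → https://www.linkedin.com/jobs/search/?f_TPR=past+week%26extra%3D1&sortBy=most+recent
```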
Warn when search page rate-limit retry also fails. Add console.debug in scroll_job_sidebar when no scrollable container is found.
Skip sidebar scrolling when <main> is absent to avoid 5s timeout on edge-case pages. Fix off-by-one in total_pages log message. Add page count assertion to test_deduplication_across_pages.
Append text to page_texts before breaking on no new IDs so the LLM can read LinkedIn's feedback (e.g. "No jobs found") instead of receiving empty sections.
Add await_count == 2 assertion to test_page_texts_joined_with_separator matching the pattern already used in test_deduplication_across_pages.
Switch from innerText to textContent in _get_total_search_pages so the "Page X of Y" text is readable regardless of CSS visibility.
- Replace console.debug in scroll_job_sidebar JS with sentinel return so the message is logged via Python logger instead - Wrap _get_total_search_pages in its own try/except to prevent an exception from discarding already-fetched page text and job IDs - Inline offset calculation into URL ternary for clarity
- Add debug log when sidebar container is found but no new content loads (scrolled == 0) - Add debug log when <main> is absent and body fallback is used on search pages
- Use -2 sentinel for "job card link vanished" vs -1 for "no scrollable container" vs 0 for "no new content loaded"
- Return {source, text} from search page JS evaluate so the body fallback log fires based on actual DOM state, not the pre-evaluate wait_for_selector flag
- Add URL sanity check before _extract_job_ids to prevent extracting IDs from a stale page after a swallowed navigation failure - Add test_no_ids_on_first_page_captures_text to pin the behavior where non-empty text with zero job IDs is returned in sections - Change total_pages mock to None in test_pagination_uses_fixed_page_size since max_pages=2 caps the loop before total_pages is relevant
- Move _NOISE_MARKERS comment to directly precede the list it describes
- Log when <main> appears after wait_for_selector timeout but before evaluate (sidebar scroll skipped on late-appearing element)
- Add test_url_redirect_skips_id_extraction to exercise the URL sanity guard that prevents extracting IDs from a stale/redirect page
Capture _get_total_search_pages mock in test_stops_at_total_pages and verify await_count == 1 to pin the query-once optimization.
feat(tools): add job IDs, sidebar scrolling, and pagination to search_jobs
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…l commands (stickerdaniel#202)

### Greptile Summary

This PR adds a "Verifying Bug Reports" section to `AGENTS.md` with step-by-step `curl` commands for testing the MCP server end-to-end via HTTP transport. The `SESSION_ID` extraction via `grep`/`awk`/`tr -d '\r'` is correct and properly handles Windows-style line endings in curl header output. However, the **server startup command blocks the terminal** — without `&` or an explicit note to use a separate shell, developers or agents following the script linearly will never reach the `curl` commands.

### Confidence Score: 4/5

- Safe to merge once the server startup command is backgrounded or explicit terminal-switching instructions are added.
- The change is documentation-only and does not affect runtime code. The session-ID extraction logic is correct. The primary issue is a usability blocker: the server startup command blocks the terminal, preventing the documented workflow from executing end-to-end in a single shell. This is straightforward to fix with `&` or an explicit note.
- AGENTS.md — specifically the server startup command (line 138) needs to either background the process or include explicit instructions to use a separate terminal.

<sub>Last reviewed commit: e8e8eb9</sub>

> Greptile also left **1 inline comment** on this PR.
Activity feed pages lazy-load post content after tab headers render. Add wait_for_function check and slower scroll params for /recent-activity/ URLs so posts section returns actual content instead of just tab headers. Resolves: stickerdaniel#201
fix(scraping): Wait for activity feed content before extracting
File placed in wrong directory — tool never registered
This file is added to `tools/person.py` at the repository root, but the MCP server imports from `linkedin_mcp_server.tools.person` (see `linkedin_mcp_server/server.py` line 20):

```python
from linkedin_mcp_server.tools.person import register_person_tools
```

The actual module that is loaded and registered is at `linkedin_mcp_server/tools/person.py`. This new file at `tools/person.py` is never imported by anything, so `search_people_with_past_company` will never be registered as an MCP tool and is completely dead code. The new tool and helper functions need to be added to `linkedin_mcp_server/tools/person.py` instead.
```python
await ctx.report_progress(
    progress=30 + int((idx / len(profile_urls)) * 60),
    total=100,
    message=f"Checking profile {idx + 1}/{len(profile_urls[:max_results * 3])}: {username}"
)
```
**Wrong keyword argument name causes `TypeError` at runtime**

`scrape_person` is defined with the parameter name `requested` (see `linkedin_mcp_server/scraping/extractor.py` line 254):

```python
async def scrape_person(self, username: str, requested: set[str]) -> dict[str, Any]:
```

Calling it with the keyword argument `requested_sections` will raise a `TypeError: scrape_person() got an unexpected keyword argument 'requested_sections'` at runtime, causing every profile check to fail.
```suggestion
profile_result = await extractor.scrape_person(
    username, requested={"experience"}
)
```
Context: `# Extract profile URLs from search results`
**URL extraction from `innerText` will always return an empty list**

`extractor.search_people()` calls `extract_page()`, which returns `main.innerText` — plain text with no HTML markup. LinkedIn profile URLs (e.g. `https://www.linkedin.com/in/username`) are rendered as hyperlinks in the DOM, not printed as visible text. They will never appear in the `innerText` string, so `_extract_profile_urls` will always return `[]`, meaning the second-step filtering never runs and the function always returns zero matches.

To reliably extract profile URLs, the extractor would need to read `href` attributes directly from the DOM (similar to how `_extract_job_ids` does it via `page.evaluate`) rather than parsing plain text.
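A minimal sketch of that direction: collect `href` attributes with `page.evaluate` and post-process them in Python. The Playwright call and the `a[href*="/in/"]` selector are assumptions about the page structure; only the pure `normalize_profile_urls` helper is exercised here:

```python
import re


async def collect_profile_hrefs(page) -> list[str]:
    """Hypothetical helper: read hrefs from the DOM rather than visible
    text (assumes a Playwright Page; the selector is an assumption)."""
    return await page.evaluate(
        "() => Array.from(document.querySelectorAll('a[href*=\"/in/\"]'))"
        ".map(a => a.href)"
    )


def normalize_profile_urls(hrefs: list[str]) -> list[str]:
    """Keep /in/ profile links, strip query strings and trailing slashes,
    and deduplicate while preserving first-seen (relevance) order."""
    cleaned = []
    for href in hrefs:
        m = re.match(r"https://(?:www\.)?linkedin\.com/in/([^/?#]+)", href)
        if m:
            cleaned.append(f"https://www.linkedin.com/in/{m.group(1)}")
    return list(dict.fromkeys(cleaned))
```

Note that `normalize_profile_urls` deduplicates with `dict.fromkeys`, which keeps the order in which profiles appeared on the results page.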
```python
def _extract_profile_urls(search_text: str) -> list[str]:
    """Extract LinkedIn profile URLs from search results text."""
    import re
```
**`profile_result.get("username")` always returns `None`**

`scrape_person` returns `{"url": ..., "sections": ...}` — there is no `"username"` key in its return dict. This means every profile in `matching_profiles` and `partial_matches` will have `"username": None`, making it impossible for callers to look up or identify the matching profiles.
```suggestion
"username": url.split("/in/")[-1].rstrip("/") if url else None,
```
Or more cleanly, pass the username variable (already extracted on line 210) into _parse_profile_for_filters.
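As a standalone sanity check on that expression (the URL is made up):

```python
def username_from_url(url):
    # Mirrors the suggested expression: take everything after "/in/"
    # and drop a trailing slash; None stays None.
    return url.split("/in/")[-1].rstrip("/") if url else None


print(username_from_url("https://www.linkedin.com/in/jane-doe/"))  # → jane-doe
print(username_from_url(None))  # → None
```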
```python
) -> dict[str, Any]:
    """Parse profile result and check if it matches filters."""
    sections = profile_result.get("sections", {})
    experience_text = sections.get("experience", "")
    main_text = sections.get("main", "")
```
**Non-deterministic URL ordering from `set()` deduplication**

`_extract_profile_urls` returns `[f"https://linkedin.com/in/{username}" for username in set(matches)]`. The `set` conversion removes duplicates but destroys the original ordering from the search results page (where LinkedIn orders results by relevance). Each call may iterate profiles in a different order, producing inconsistent results. Use `dict.fromkeys` to preserve insertion order while deduplicating:
```suggestion
seen = dict.fromkeys(matches)
return [f"https://linkedin.com/in/{username}" for username in seen]
```
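The ordering difference is easy to demonstrate in isolation (the sample usernames are invented):

```python
matches = ["alice", "bob", "alice", "carol", "bob"]

# set() removes duplicates, but its iteration order is not tied to the
# order the usernames were found in (and can vary across runs because
# of string hash randomization).
unordered = [f"https://linkedin.com/in/{u}" for u in set(matches)]

# dict.fromkeys removes duplicates while keeping first-seen order,
# i.e. LinkedIn's relevance ranking.
ordered = [f"https://linkedin.com/in/{u}" for u in dict.fromkeys(matches)]
print(ordered)
# → ['https://linkedin.com/in/alice', 'https://linkedin.com/in/bob', 'https://linkedin.com/in/carol']
```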
```python
except Exception as e:
    raise_tool_error(e, "search_people")  # NoReturn
```
Non-English inline comment
The comment `# 更长超时,因为需要获取多个档案` ("longer timeout, because multiple profiles need to be fetched") is in Chinese. The rest of the codebase uses English exclusively for comments and documentation. Please translate this to English to keep the codebase consistent:
```suggestion
timeout=TOOL_TIMEOUT_SECONDS * 3,  # Longer timeout because multiple profiles need to be fetched
```
```python
    profile_result: dict[str, Any],
    past_company_list: list[str],
    current_title: str | None,
) -> dict[str, Any]:
    """Parse profile result and check if it matches filters."""
    sections = profile_result.get("sections", {})
    experience_text = sections.get("experience", "")
    main_text = sections.get("main", "")
```
**`import re` inside function body**

`re` is imported inside both `_extract_profile_urls` (line 288) and `_extract_username_from_url` (line 298). While Python caches module imports, the convention in this codebase (and generally) is to place all imports at the top of the module. Move `import re` to the module-level imports alongside `import asyncio` and `import logging`.
### Summary

This PR introduces a new tool, `search_people_with_past_company`, that enables advanced people search with filtering by past companies and current job titles.

**New Feature:** `search_people_with_past_company`

### Use Cases

### Parameters

- `keywords` (required): Search keywords (e.g., "founder", "CEO")
- `location` (optional): Location filter (e.g., "Beijing", "San Francisco")
- `past_companies` (optional): Comma-separated company names (e.g., "Alibaba,ByteDance,Tencent")
- `current_title` (optional): Current job title filter (e.g., "founder", "CEO")
- `max_results` (optional): Maximum results (default: 10)

### Example

```shell
mcporter call linkedin.search_people_with_past_company \
  keywords="founder" \
  location="Beijing" \
  past_companies="Alibaba,ByteDance" \
  current_title="founder"
```

### Implementation

### Changes

- Add `search_people_with_past_company()` tool
- Add `asyncio` import

### Testing

### Related

Useful for talent acquisition, investment research, and competitive intelligence.
### Greptile Summary

This PR introduces a `search_people_with_past_company` tool that performs a two-step search: first fetching LinkedIn people search results, then iterating each profile to filter by past company and current title. Unfortunately, the implementation has several blocking issues that prevent it from functioning at all.

**Key issues found:**

- The file is added to `tools/person.py` (repository root) instead of `linkedin_mcp_server/tools/person.py` (the actual module). The server only imports from `linkedin_mcp_server.tools.person`, so the new tool is never registered.
- `extractor.scrape_person(username, requested_sections={"experience"})` uses a non-existent parameter name — the actual parameter is `requested`. This raises a `TypeError` on every profile fetch.
- `_extract_profile_urls` searches for full `https://linkedin.com/in/...` URLs inside `innerText`, but `extract_page` returns plain text (no HTML). Profile URLs are only in `href` attributes and are never printed as visible text, so this function always returns an empty list.
- `username` field always `None`: `scrape_person` returns `{"url": ..., "sections": ...}` — no `"username"` key — so every matched profile's `username` field will be `None`.
- Non-deterministic ordering: `set()` in `_extract_profile_urls` loses LinkedIn's relevance-ranked ordering.
- Minor `import re` style issues.

### Confidence Score: 1/5

- `tools/person.py` — all changes are in this single file, which needs to be moved to `linkedin_mcp_server/tools/person.py` and the logic bugs fixed before any of the new functionality can work.

**Important Files Changed**

- `tools/person.py`: File placed in the wrong directory (`tools/` instead of `linkedin_mcp_server/tools/`), making the new tool completely unreachable. Contains multiple critical bugs: wrong keyword argument name on `scrape_person`, URL extraction from `innerText` that will always return empty, and `username` always being `None` in output.

**Sequence Diagram**

```mermaid
sequenceDiagram
    participant Client
    participant MCP as MCP Server
    participant Tool as search_people_with_past_company
    participant Extractor as LinkedInExtractor
    Client->>MCP: call search_people_with_past_company(keywords, location, past_companies, current_title)
    MCP->>Tool: invoke
    Tool->>Extractor: search_people(keywords, location)
    Extractor-->>Tool: {url, sections: {search_results: innerText}}
    Note over Tool: _extract_profile_urls(innerText)<br/>⚠️ Always returns [] — URLs not in plain text
    loop For each profile URL (up to max_results × 3)
        Tool->>Extractor: scrape_person(username, requested_sections={"experience"})<br/>⚠️ TypeError: wrong kwarg name (should be 'requested')
        Extractor-->>Tool: {url, sections: {experience: text}}
        Note over Tool: _parse_profile_for_filters()<br/>profile_result.get("username") → None always
        alt matches_all
            Tool->>Tool: append to matching_profiles
        else matches_partial
            Tool->>Tool: append to partial_matches
        end
        Note over Tool: asyncio.sleep(1.5)
    end
    Tool-->>Client: {search_url, total_checked, filters, matching_profiles, partial_matches}
```

Last reviewed commit: 0192ab1