as a reminder: https://github.com/erc-dharma/project-documentation/issues/7
regarding fuzzy settings for Old Javanese, the following gives a useful list: https://chromewebstore.google.com/detail/sealang-plus/elnlcjojbmjmhheahimkikghiajeloha though it implies a slightly different transliteration system and so the list would need to be adapted
@michaelnmmeyer wrote:
The new search system is at: https://dharmalekha.info/search. I will eventually move it to https://dharmalekha.info/texts when done with the interface.
might I suggest the reverse, and possibly the use of "find" rather than "search", i.e. make both texts as such and strings in texts findable via https://dharmalekha.info/find or https://dharmalekha.info/search? It might be desirable from the start to separate searching metadata about texts (i.e., searching in teiHeader) from searching within texts, by presenting two different search boxes.
@michaelnmmeyer wrote:
I invite you to tell me what you want the matching behavior to be like. My plan is to define one or more matching modes. For instance, there could be a "default" mode which is case-insensitive and ignores hyphens, a "Tamil" mode that does the same and also treats 'k' and 'g' as equivalent, a "Sanskrit" mode that ignores spaces, etc.
MATCHING BEHAVIOR
- I approve of the idea of modes, though I would like us to try not to call them by the names of specific languages (as one of the fundamental ambitions of DHARMA was and is to make terminological affinities findable across language boundaries)
- we could have two basic modes, 'precise' and 'loose'
- there could be a generic (default) version of 'loose', with features such as
-- 1. being case-insensitive,
-- 2. ignoring hyphens,
-- 3. ignoring any milestone elements and any <g> elements
-- 4. ignoring any transliterated virāmas ·
-- 5. ignoring differences ē/e and ō/o
- and then there could be custom settings for 'loose'
-- 1. ignoring difference between voiced/unvoiced,
-- 2. ignoring difference between aspirated/unaspirated,
-- 3. ignoring difference between dental/retroflex plosives,
-- 4. ignoring difference between sibilants
-- 5. ignoring difference between vowel characters with or without macron (i.e., between vowels transliterated as long or short)
-- 6. ignoring difference between characters with and without diacritics (like searching in google docs)
-- 7. ignoring difference between sequences CC and CəC
- I would like us to offer a selection of regular expressions
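To make the proposal above more concrete, here is a minimal sketch in Python of how some of the 'loose' settings could work as a key-folding function: two strings match loosely if their folded keys are equal. Everything in it is an assumption for illustration only: the function name `fold`, the option names, and the choice of U+00B7 for the transliterated virāma are hypothetical and do not reflect how the search system is actually implemented. Generic feature 3 (milestone and `<g>` elements) would have to be handled at the XML level before this step, and custom settings 2 and 7 are left out for brevity.

```python
import re
import unicodedata

MACRON = "\u0304"      # combining macron (ā, ē, ...)
ACUTE = "\u0301"       # combining acute (ś)
DOT_BELOW = "\u0323"   # combining dot below (ṭ, ḍ, ṇ, ṣ)


def fold(text, *, macrons=False, sibilants=False, retroflex=False,
         voicing=False, diacritics=False):
    """Return a comparison key for 'loose' matching (hypothetical helper)."""
    # Generic 1: ignore case; then decompose so diacritics become separate marks.
    s = unicodedata.normalize("NFD", text.casefold())
    # Generic 2 and 4: drop hyphens and the virāma sign (assumed to be U+00B7).
    s = s.replace("-", "").replace("\u00b7", "")
    # Generic 5: ignore the macron on e and o only.
    s = re.sub(f"(?<=[eo]){MACRON}", "", s)
    if macrons:       # custom 5: long vs short vowels
        s = s.replace(MACRON, "")
    if sibilants:     # custom 4: ś and ṣ fold to s
        s = re.sub(f"s[{ACUTE}{DOT_BELOW}]", "s", s)
    if retroflex:     # custom 3: ṭ/t and ḍ/d
        s = re.sub(f"(?<=[td]){DOT_BELOW}", "", s)
    if voicing:       # custom 1: g/k, j/c, d/t, b/p
        s = s.translate(str.maketrans("gjdb", "kctp"))
    if diacritics:    # custom 6: strip every remaining combining mark
        s = "".join(ch for ch in s if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", s)


print(fold("Dē-va") == fold("deva"))                 # True
print(fold("[nama]ś cāmuṇḍyai", diacritics=True))    # [nama]s camundyai
```

With such a function the custom settings are just keyword arguments that the interface could expose as checkboxes, and the index would store one folded key per combination of settings that we decide to support.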
PRESENTATION OF SEARCH RESULTS
I would like this to be more space-efficient and suggest that we might display only title and complete file name. E.g., compared to this
displaying only
Camundi
tfc-nusantara-epigraphy/DHARMA_INSIDENKCamundi
⟨01⟩ (0) [nama]ś cāmuṇḍyai
might be sufficient