
Public Domain Search/Searching Gutenberg.org

From Appropedia

Gutenberg will be temporarily removed from Appropedia's Public Domain Search (APDS) until a separate filtered search can be built as one component of the Public Domain Search.

The reasons in more detail:

  • It doesn't have much public domain content of relevance to Appropedia, and
  • It is more difficult than expected to find public domain content on Gutenberg. When I've looked through search results, many of them were marked as copyrighted. Yet the copyright page says most content is public domain. In any case, a special filtered search will be needed. Any help is appreciated (edit this page, add information/suggestions here or on the talk page, or contact Chriswaterguy on his talk page).

The filtered search will first be attempted using regular Google search with the term site:gutenberg.org. Once we've worked out the right filter, we'll find a way to integrate it into APDS.

First attempts at a filtered search

Note: this attempt failed because it filtered on terms found on the link page, not in the document itself.

  • Mixed results on filtered search: Googling for site:gutenberg.org "is a public domain work" gives 2,100 hits. (The large numbers only appear after clicking "repeat the search with the omitted results included" at the bottom, but this shouldn't be a problem in the final search engine, as other text will be added to each search.) Searching for site:gutenberg.org "Not copyrighted in the United States" gives 11,000. Searching for either phrase, site:gutenberg.org "Not copyrighted in the United States" OR "is a public domain work", gives 11,500. The numbers seem to vary from day to day. Does this indicate a problem with how thorough the indexing is? How comprehensive are these searches, and which is the most comprehensive?
  • Tried using the search engine keywords setting (Google Custom Search Engine control panel -> Basics). Using -(site:gutenberg.org "Check the license") didn't work: searching for "several riotous sailors" (a phrase from a PD text) gave no hits. Adding the keywords appeared to break the search, with no results found even for a basic search like water or filter, but when they were added again the search seemed to work. This was strange, but either way it didn't give the desired results, so the keywords were removed and the box left empty.

Problem: the reason these attempts failed has been identified. Googling for a phrase found in a public domain document, e.g. site:gutenberg.org "several riotous sailors", works (as long as the phrase is not too long, for some reason). However, it doesn't work once the filter is added, i.e. site:gutenberg.org "is a public domain work" "several riotous sailors", because the filter phrase appears on the link page, not in the document itself, so no single page contains both phrases. Searching in Yahoo has the same problem. This realization takes us to the next stage...
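The query strings tried above can be assembled programmatically, which makes it easier to compare variants. This is only a minimal sketch: the helper names build_query and search_url are made up for illustration, and it assumes Google's ordinary web search URL with a q parameter. It only builds the strings; it does not run any search or report hit counts.

```python
from urllib.parse import urlencode


def build_query(site, filter_phrases=(), content_phrases=()):
    """Build a query: site restriction, OR-ed filter phrases, quoted content phrases."""
    parts = [f"site:{site}"]
    if filter_phrases:
        # Any one of the filter phrases may match, so join them with OR.
        parts.append(" OR ".join(f'"{p}"' for p in filter_phrases))
    # Every content phrase must match, so they are simply listed.
    parts.extend(f'"{p}"' for p in content_phrases)
    return " ".join(parts)


def search_url(query):
    """Return a Google search URL for the query (hypothetical helper)."""
    return "https://www.google.com/search?" + urlencode({"q": query})


if __name__ == "__main__":
    # The combined filter query from the first attempt.
    print(build_query(
        "gutenberg.org",
        ["Not copyrighted in the United States", "is a public domain work"],
    ))
    # The failing case: filter phrase plus a phrase from the document text.
    failing = build_query(
        "gutenberg.org",
        ["is a public domain work"],
        ["several riotous sailors"],
    )
    print(failing)
    print(search_url(failing))
```

The second query is the one that returns nothing, because the filter phrase and the content phrase live on different pages.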

Next attempts: searching the document texts

We want a phrase from the actual text of the public domain documents, and this phrase might be buried in the verbose statements in the text files. Do those files just have a boilerplate license? What specific phrase is used in Gutenberg's public domain documents, and only in the public domain documents?
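One way to look for such a phrase is to save a few document texts by hand, some known to be public domain and some marked as copyrighted, and test candidate phrases against them. The sketch below assumes the texts are already available as strings. The function names are made up for illustration, and the sample texts are placeholders, not real Gutenberg wording.

```python
def find_phrases(text, phrases):
    """Return the candidate phrases found in text, ignoring case and line wrapping."""
    normalized = " ".join(text.split()).lower()
    return [p for p in phrases if " ".join(p.split()).lower() in normalized]


def distinguishing_phrases(pd_texts, other_texts, phrases):
    """Return phrases found in every public domain text and in no other text."""
    result = []
    for phrase in phrases:
        in_all_pd = all(find_phrases(t, [phrase]) for t in pd_texts)
        in_any_other = any(find_phrases(t, [phrase]) for t in other_texts)
        if in_all_pd and not in_any_other:
            result.append(phrase)
    return result


if __name__ == "__main__":
    # Placeholder samples; real document texts would be pasted or loaded here.
    pd_samples = ["Header\nThis is a public domain\nwork. Body text."]
    other_samples = ["Header. Check the license. Body."]
    candidates = ["is a public domain work", "Check the license", "Header"]
    print(distinguishing_phrases(pd_samples, other_samples, candidates))
```

A phrase that survives this check on a reasonable number of samples would be a candidate for the filter, though it would still need testing in the search engine itself.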

This is where progress has stalled - I don't have time to work on this aspect of the search in the next few months, as it's not a priority for the kind of content Appropedia needs. --Chriswaterguy · talk 22:29, 10 February 2008 (PST)

See also