Revision as of 16:26, 27 November 2007

This is a work in progress. -- Chriswaterguy · talk 04:39, 18 November 2007 (PST)

The Public Domain Beta Search is up and running. The results won't be 100% public domain, so check carefully, but it should help you find a lot of public domain stuff.

This is a beginning stage in creating a comprehensive search engine for public domain content, i.e. free for unrestricted use, on the internet. This is particularly useful for Appropedia as we can use it to find useful information to port here.

Help out! I'd like some help (otherwise I'll have to do it myself and that means leaving it till late December). If you can, help with any of the following:

  • Please add public domain sites and indexes to #Add to included list. I didn't add many to start as I was mainly refining the US government search, but now it's ready to expand. (I'll do further refining elsewhere.)
  • Add any "false positive" sites i.e. non public domain, to #Add to exclusion list. This is lower priority than working out an efficient way to do this - see next point.
  • To help my refining of the US govt search:
    • Find a single long page with all (or as many as possible) federal department, agency etc websites. Put the link here. (http://www.washlaw.edu/doclaw/executive5m.html - may not be very complete...)
    • ...or make one yourself (doesn't have to be valid html, as long as it has the needed urls). E.g. use http://www.usa.gov/Agencies/Federal/All_Agencies/index.shtml - it's easiest to use Opera to open and then save multiple files. Then concatenate i.e. make into a single file. (E.g. Run on Linux command line: cat * >catfile.txt ).
    • Then process the file, to make a tab delimited file for all US federal government sites. Maybe best if I do this, as I've had practice and it's quick.
    • Help me find an OpenOffice function that returns true or false for whether a cell contains certain text (specifically, ".gov"). I couldn't get that working at first, and it's essential for efficient sorting of google outputs, to pull out the false positives and put them into the exclusion file. Solved: =FIND(".gov"; B6), where B6 is the location of the string, gives a number when the text is found (as opposed to #VALUE i.e. not found). Does the job - thanks Curt!
    • Make suggestions for terms to eliminate state and local websites - including state agencies that have their own domains. I'm using or going to use: county, counties, mayor, local, city-hall, (some kind of "department of" term... state-department doesn't so much yield hits related to the states, in the USA), city-council, council-chambers... what else?

Thanks!

Warning - not all pages on federal sites are PD

Sites and pages must be checked individually. If the work is created by a US federal employee and is not based on copyright work, then it is public domain.

An example of non-PD content: This page for example is on a federal site, but appears to have been prepared by someone from Orange County - not federal government, and thus not automatically public domain.

Building the custom search engine. Step 1 - US Federal government sites

This is a task that's likely to be repeated in future so I thought I'd document how it was done. The first major task was to get the search to cover all federal US government sites but no local or state sites - there isn't one available, surprisingly (at least based on a fairly extensive Google search).

Included sites to search

This part was easy - major public domain sites were chosen, the most important being the US federal government sites:

  • .gutenberg.org
  • .fed.us
  • .gov
  • .mil

Add to included list

Surely there are a lot more public domain sites? If you can find an index of such sites, please add it here. Individual public domain sites are also welcome. We don't want open access content, like ibiblio.org (that would be a different project - which you're most welcome to start of course!) - we just want public domain.

Excluded sites

The current list of excluded sites, tab delimited for easy upload (via advanced tab of the custom search control panel) is: File:CSE file - US state and local government sites.doc - a text file, but had to be called .doc to be allowed to upload.

Federal sites

Federal sites with significant content which is not public domain:

  • pubmedcentral.nih.gov[1]

Local and state sites

Making this is the hard part of searching US federal sites. The .gov domain includes many state and local sites, which are not public domain. States were not too hard, but finding a single list of all local government sites was impossible.

The approach taken was to make an excluded sites list as follows: extract urls for all local and state government sites, using the pages at http://www.statelocalgov.net/ - I saved the pages (using Opera - easier for repeated open-in-new-tab functions and saving multiple files), zipped them and sent them to Amit Pathik, a friend who is a data warehouser. He extracted the urls, and removed everything before the second last period (so http://www.dhs.alabama.gov became alabama.gov, for example). He sent them back in a spreadsheet (which I opened in OpenOffice), and gave some tips for the next steps.
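The url extraction and reduction described above (www.dhs.alabama.gov becoming alabama.gov) could also be scripted. The following is only a minimal sketch, not the method Amit actually used: the function names are made up for illustration, and it assumes the saved pages carry their links in ordinary href="..." attributes.

```python
import re
from urllib.parse import urlparse

# Matches the value of href="..." attributes. A deliberately simple pattern,
# not a full HTML parser (assumption: links are plain quoted href values).
HREF_RE = re.compile(r'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)


def registered_domain(url):
    """Return the last two dot-separated labels of the url's host, or None."""
    try:
        host = urlparse(url).hostname
    except ValueError:
        return None  # malformed url
    if not host:
        return None  # relative link, mailto:, etc.
    labels = host.lower().split(".")
    if len(labels) < 2:
        return None
    return ".".join(labels[-2:])


def domains_in_html(html):
    """Collect the reduced domains of all absolute links in one saved page."""
    found = set()
    for url in HREF_RE.findall(html):
        domain = registered_domain(url)
        if domain:
            found.add(domain)
    return sorted(found)
```

Note that keeping only the last two labels is blunt: it suits state domains such as alabama.gov, but it would also turn a federal subdomain into its parent, so the output still needs a manual scan.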

  • Some states use more than one domain, based on the state's name or postal abbreviation. To guard against this as well as I could:
    • I got the list of US states and the lists of dependencies, along with their postal abbreviations, from Wikipedia:United States postal abbreviations, and put them together, one after another, in a spreadsheet. Armed Forces codes weren't used - if they did have websites, they'd presumably be public domain, being federal institutions.
    • I made them all lower case (OpenOffice text editor), then removed all spaces from the names column.
    • Turned into urls: Wildcard search used to replace .. with &.gov in the first column (.. finding the two letters of each code). The second column was harder - the text had to be taken out of the column (Edit -> Paste special -> Unformatted) then replaced $ with &.gov\n. (I think the \n page break was needed because of a bug in OpenOffice - it shouldn't have been necessary.)
    • These two columns of generated domain names for possible US sites were pasted into Amit's spreadsheet, below the other urls. Those with U.S. in the name had the periods removed manually. Explanatory note was removed from one entry. Some of the sites are too long and seem unlikely, but I just left everything as it turned out, for now.
  • To remove non .gov sites, I searched for .com (then .us and .net). After scanning to check for false positives, I hit Edit -> Delete -> Delete entire rows.
  • I searched for spaces in urls and deleted them (search and replace space with nothing). This was to ensure the EXACT function would work later. (Realized this late in the piece and had to redo it.)
  • I then did a basic sort to put them in alphabetical order (Data->Sort)
  • The urls were now all in column A. In the next column, cell two, I used the formula =EXACT(A1;A2) to test for a duplicate url. This was copied down to the bottom of the column of urls. This column of TRUE and FALSE readings was then copied to a text file, and copied back (now as simple text) on top of the formulas. One option was to sort by true and false; instead I just searched for TRUE and deleted those rows (after checking for false positives). The first approach was probably a more certain way of avoiding errors though, in hindsight.
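For anyone repeating the spreadsheet steps above, here is a rough scripted equivalent. It is a sketch under assumptions, not a record of what was done: the function names are hypothetical, the state list is supplied by the caller as (name, postal code) pairs, and instead of deleting .com, .us and .net rows one by one it simply keeps entries ending in .gov.

```python
def candidate_state_domains(states):
    """Map (name, postal_code) pairs to guessed .gov domains."""
    guesses = []
    for name, code in states:
        # Lower-case, then drop spaces and periods ("U.S. Virgin Islands").
        squeezed = name.lower().replace(" ", "").replace(".", "")
        guesses.append(squeezed + ".gov")
        guesses.append(code.lower() + ".gov")
    return guesses


def clean_exclusion_list(urls):
    """Remove whitespace, keep only .gov entries, drop duplicates, and sort."""
    cleaned = set()
    for url in urls:
        url = "".join(url.split()).lower()  # strips spaces anywhere in the url
        if url.endswith(".gov"):
            cleaned.add(url)  # a set makes the EXACT() duplicate test unnecessary
    return sorted(cleaned)
```

Usage would be along the lines of clean_exclusion_list(extracted_urls + candidate_state_domains(states)), where extracted_urls is the list taken from the directory pages.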

At this point there was a list of sites, which should cover all local and state websites in the .gov domain. At the time of writing I am trying to upload the list to the search. In the Google custom search, this can be done on the "Sites" tab of the CSE control panel. Beneath the list of included sites, we click "Excluded sites" and add them. In this case we want to exclude local and state governments, as these sites are generally not public domain. So we add them in the form alabama.gov etc. The site is not responsive (via a slow connection in Jakarta) so I'll get someone else to add them, or I will try later.

Second pass

When the excluded list was uploaded (tab delimited file in advanced tab of CSE settings) it was found that many county pages came up - the excluded list was incomplete. Searching for "county" is one way to show how many there are - and the answer is, a lot. More Googling followed.

Another directory site seemed to have more links - http://www.newsdirectory.com, with city and county pages. http://www.oultwood.com had links, but didn't seem to have any more than the first site used.

Just a few notes follow, in case I or someone else has to repeat this process.

All pages saved and put in their own folder. Run on Linux command line:

 cat * >catfile.txt

Edit catfile.txt, search for .gov", replace with .gov_BINGO_GOV_SITE - this marker will help sift out the gov sites; using this rather than directly testing for ".gov" as it avoids the possibility of a ".gov" in a position other than final. (37 found - not many, actually...). Then I realized this doesn't allow for subpages, but I checked the discarded lines later and didn't miss anything not covered already (under .wa.gov). Replace beginning of those lines up to www with *, then delete all semicolons (to keep things in one column in the spreadsheet), paste in spreadsheet and sort. Delete all that don't start with * (or functions will die, over 14,000 rows).

Final tidy up: replaced _BINGO_GOV_SITE.* with /* _cse_exclude_p04xreki1j4 (with the regular expressions option ticked).
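The marker-and-replace routine above might be done in one pass with a script. This is a hedged sketch only: the function names are invented, the pattern assumes links appear as double-quoted http or https urls, and the output shape (pattern, tab, label) is inferred from the tab-delimited file described on this page rather than from any official specification. The label is the one quoted above and will differ for another search engine.

```python
import re

# Label as quoted in the notes above; specific to that one search engine.
EXCLUDE_LABEL = "_cse_exclude_p04xreki1j4"

# A double-quoted link whose host ends in .gov, with or without a path after it.
# Hosts that merely contain ".gov" in the middle (www.gov.example.com) do not match.
GOV_LINK_RE = re.compile(r'"https?://([a-z0-9.-]+\.gov)(?:/[^"]*)?"', re.IGNORECASE)


def gov_hosts(text):
    """Return the sorted, de-duplicated .gov hosts linked from the text."""
    hosts = set()
    for host in GOV_LINK_RE.findall(text):
        host = host.lower()
        if host.startswith("www."):
            host = host[4:]
        hosts.add(host)
    return sorted(hosts)


def exclusion_lines(hosts, label=EXCLUDE_LABEL):
    """Format hosts as 'host/*<TAB>label' lines (assumed file layout)."""
    return [f"{host}/*\t{label}" for host in hosts]
```

Because the pattern allows a path after the host, links to subpages are picked up as well, which the ".gov followed by a quote mark" search misses.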


Tried

=VLOOKUP(Search criterion;array;index;Sort order)

but it hung when run on the raw file of 14,000 rows. Used the search function instead, manually marking the next column.

Got 26 results; deleted those on a (state).gov subdomain, leaving 18. 11 of those were already on the list, so 7 new. A lot of fussing for little benefit.
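The lookup that hung in the spreadsheet is cheap with set membership. A minimal sketch, assuming the candidate urls, the existing exclusion list and the state domains are already available as plain lists of lower-case domains (the function name and parameters are hypothetical):

```python
def new_domains(candidates, existing, state_domains):
    """Return candidates that are neither listed already nor under a state domain."""
    existing = set(existing)
    state_domains = set(state_domains)
    fresh = []
    for domain in sorted(set(candidates)):
        if domain in existing:
            continue
        # Skip a state domain itself or anything beneath it,
        # e.g. dol.wa.gov is already covered by wa.gov.
        if any(domain == s or domain.endswith("." + s) for s in state_domains):
            continue
        fresh.append(domain)
    return fresh
```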

After this one, consider converting all suffixes to .gov...? e.g. nashville.org is actually nashville.gov and shows up in searches.

Third pass

11:04, 21 November 2007 (PST): Resort to brute force "purification" method - search for terms like mayor and county, do analysis on results, continually upgrading exclusion list. Manually scan to avoid fed sites.

Found some cases where the "_cse..." label was " cse..." - watch the lack of underscore. Also missing tabs, just "/*_cse"
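Label slips like these are easy to miss by eye. Below is a small checking sketch; it assumes each line of the file should be a url pattern ending in /*, then a single tab, then a label starting with _cse_exclude_, which is how the entries on this page appear to be built. The function name is made up.

```python
import re

# Expected shape (an assumption based on the entries described on this page):
#   <pattern ending in /*> <TAB> _cse_exclude_<id>
LINE_RE = re.compile(r"^\S+/\*\t_cse_exclude_\w+$")


def malformed_lines(lines):
    """Return (line_number, line) pairs that do not match the expected shape."""
    bad = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if line and not LINE_RE.match(line):
            bad.append((number, line))
    return bad
```

Running it over the file before upload would flag both problems noted above: a space where the leading underscore should be, and a missing tab.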

Add to exclusion list

If you find local or state government sites that still turn up in the search, please add them here. Only the domain matters - extra text, especially at the end, will be deleted in the processing. Repetitions are also removed automatically - they don't cause a problem.

  • www.azgovernor.gov/
  • westcoastoceans.gov/
  • www.okcommerce.gov
  • mich.gov
  • nyhealth.gov
  • azarts.gov
  • ilga.gov/
  • jacksoncounty-il.gov
  • www.nyhealth.gov
  • www.bellevuewa.gov/directions_to_city_hall.htm - 19k - Cached
  • councilbluffs-ia.gov/documents/NOISE-VARIANCE-REQUEST.pdf
  • www.pubmedcentral.nih.gov
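The clean-up applied to entries like those above can be sketched as follows. This is an illustration rather than the exact processing used: the function names are hypothetical, and it only strips a leading www. and anything after the host, so that an entry such as www.pubmedcentral.nih.gov is not collapsed to the whole of nih.gov.

```python
def normalise_entry(entry):
    """Reduce a pasted search-result line to a bare lower-case host name."""
    parts = entry.strip().split()
    if not parts:
        return ""
    token = parts[0]  # drops trailing text such as "- 19k - Cached"
    if "://" in token:
        token = token.split("://", 1)[1]  # drop a scheme if one was pasted
    host = token.split("/", 1)[0].lower()  # drop any path
    if host.startswith("www."):
        host = host[4:]
    return host


def normalise_entries(entries):
    """Normalise every entry, discard blanks and duplicates, and sort."""
    return sorted({host for host in map(normalise_entry, entries) if host})
```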

Improving exclusion list further

To help find these sites that are not yet excluded, search for county OR counties OR mayor OR local OR city-hall and variations on that (dropping some terms perhaps). The term city-hall yields a lot, and still needs to be run again, and department of (or equivalent term) should be worth a try, if federal sites are excluded from search. Also try: City-Council, council-chambers... what else?

Make a list of federal sites

Make an exclusion list of known Federal sites - add this to a test search along with local and state sites. This will make it easy to find the "rogue sites" i.e. those that haven't been excluded yet.

-site:ed.gov -site:epa.gov etc. e.g.

-site:ed.gov -site:epa.gov -site:.mil -site:energystar.gov/ -site:nsf.gov/ -site:cdc.gov -site:nih.gov -site:state.gov -site:weather.gov -site:neh.gov -site:hhs.gov -site:ssa.gov -site:loc.gov -site:whitehouse.gov

-site:energy.gov -site:noaa.gov -site:usdoj.gov -site:access-board.gov -site:senate.gov -site:house.gov

I thought a complete list would be excessive - hundreds of sites - so I'm building up from frequent hits. But if added as a separate exclusion list, that would work... so that's the next step, for when I return to Australia mid Dec 07, if no one beats me to it.
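Until a separate exclusion list exists, the test query can be assembled from two lists instead of typed by hand. A small sketch with a hypothetical function name; the search terms and sites passed in are whatever the tester chooses, for example the ones listed above.

```python
def build_test_query(terms, federal_sites):
    """Join search terms with OR and append a -site: filter per federal site."""
    term_part = " OR ".join(terms)
    site_part = " ".join("-site:" + site for site in federal_sites)
    return (term_part + " " + site_part).strip()
```

For example, build_test_query(["county", "mayor"], ["ed.gov", "epa.gov"]) gives county OR mayor -site:ed.gov -site:epa.gov. Search engines may limit how long a query can be, so a long site list may need to be split across several queries, as in the two example lines above.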

--

Okay I got impatient. I went to http://www.usa.gov/Agencies/Federal/All_Agencies/index.shtml and compiled a list. Quite a painful process - got an enormous file that was hard to open as source in Ubuntu. Finally achieved with Opera, and "view -> source".

The usa.gov site is inconsistent. It suggests (e.g. at http://www.usa.gov/Agencies/federal.shtml, and in the url) that these are federal agencies, but states are included. So the list is useless for a fed search, but will still serve the purpose here, in refining the state and local exclusion list (assuming that the local govt and state govt dept sites which haven't been excluded yet are unlikely to be on this list).

Here is the list, in tab-delimited form: File:CSE PDTS federal sites and more.doc

Questions

What about US territories (?) - these are listed as federal sites on http://www.usa.gov/Agencies/Federal/All_Agencies/index.shtml - but I think they're also all included in the state and local government exclusion list. Should they be removed? E.g.

  • americansamoa.gov/
  • .arctic.gov/
  • usmarshals.gov