This search interface uses Solr as its backend, so it accepts any search query that Solr accepts.
In these instructions two terms will be used frequently:
hit - A single instance of a passage being repeated in the dataset. A single hit is a passage from a page of an issue in the dataset. A single page may contain multiple hits, though they are generally part of different clusters.
cluster - A group of hits that all share the same (or similar enough) passage.
To search for a word, type it into the search bar and the search engine will find hits with that word in their passage.
When doing simple searches for a single word you can use the word alone, but you can also specify a field (which is also necessary for some more advanced searches).
E.g. typing text:word into the search field will find hits that contain the word 'word' in the text field.
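As a rough illustration of the field:term form, the Python sketch below builds such a clause and backslash-escapes characters that Solr's query parser treats as syntax. The helper names escape_term and field_query are hypothetical and not part of this interface.

```python
import re

# Characters (and the pairs && and ||) that have a special meaning
# in Solr's standard (lucene) query parser.
_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/]|&&|\|\|)')


def escape_term(term: str) -> str:
    """Backslash-escape query syntax so the term is matched literally."""
    return _SPECIAL.sub(r"\\\1", term)


def field_query(field: str, term: str) -> str:
    """Build a field-qualified clause such as text:word."""
    return f"{field}:{escape_term(term)}"


print(field_query("text", "word"))   # text:word
print(field_query("title", "Who?"))  # title:Who\?
```

Escaping is only wanted when a character should be matched literally; leave it out when you intend to use wildcards or other operators.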
The available fields depend on whether you're searching for hits or clusters. To search for clusters, you must select that option in the advanced search.
Below is a list of all available fields for searching, followed by examples that show how some different search terms can be combined for more advanced queries.
Available fields when searching for hits:
cluster_id - Specifies this hit's cluster.
country - Country of the issue.
date - Full date of the issue.
doc_id - The exact ID of the page.
length - The length of the hit.
location - The city of the issue.
text - The text of the hit.
title - The title of the issue.
year - Year of the issue.
Available fields when searching for clusters:
all_countries - All countries the cluster spread to.
all_locations - All cities the cluster spread to.
average_length - The average length of all hits in the cluster.
cluster_id - The ID of the cluster.
count - The count of unique hits in the cluster.
crossed - true/false. True if the cluster spanned two or more countries.
ending_country - The country of the last hit in the cluster.
ending_date - The date of the last hit in the cluster.
ending_location - The city of the last hit in the cluster.
ending_year - The year of the last hit in the cluster.
first_text - The text of the first hit in the cluster.
gap - The biggest gap in the cluster, i.e. the maximum difference in publishing date of two subsequent hits.
locations - The number of unique locations in the cluster.
starting_country - The country of the first hit in the cluster.
starting_date - The date of the first hit in the cluster.
starting_location - The city of the first hit in the cluster.
starting_year - The year of the first hit in the cluster.
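in_city - The incoming city of this cluster, i.e. the city where the text first appeared in a new country.
in_country - The incoming country of this cluster, i.e. the new country the text first appeared in.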
in_date - The date a text in a cluster appeared in a new country.
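out_city - The outgoing (port) city of this cluster, i.e. the city from which the text left its original country of printing.
out_country - The outgoing (port) country of this cluster, i.e. the original country of printing that the text left.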
out_date - The date a text in a cluster left its original country of printing.
timespan - The number of days between the first and last hit in the cluster.
titles - The number of unique titles in the cluster.
virality_score - The virality score of the cluster.
multiple_starting_locations - true/false. False: the cluster has one starting location. True: the cluster includes texts that were first printed in multiple locations on the same day.
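Since a field only works in the search type that lists it, it can help to check a field name before building a query. The sketch below merely mirrors the two lists above; the function search_types_for is a hypothetical helper, and the live index may accept fields in ways these lists do not show.

```python
# Field names copied from the two lists above.
HIT_FIELDS = {
    "cluster_id", "country", "date", "doc_id", "length",
    "location", "text", "title", "year",
}
CLUSTER_FIELDS = {
    "all_countries", "all_locations", "average_length", "cluster_id",
    "count", "crossed", "ending_country", "ending_date",
    "ending_location", "ending_year", "first_text", "gap", "locations",
    "starting_country", "starting_date", "starting_location",
    "starting_year", "in_city", "in_country", "in_date", "out_city",
    "out_country", "out_date", "timespan", "titles", "virality_score",
    "multiple_starting_locations",
}


def search_types_for(field: str) -> list[str]:
    """Return the search types ("hits", "clusters") that list this field."""
    types = []
    if field in HIT_FIELDS:
        types.append("hits")
    if field in CLUSTER_FIELDS:
        types.append("clusters")
    return types


print(search_types_for("cluster_id"))  # ['hits', 'clusters']
print(search_types_for("count"))       # ['clusters']
```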
Term instructions:
This engine uses Solr's default query parser, lucene.
Below are some of the most common terms that can be used.
More information can be found in Solr's documentation.
Boolean
+text:word -- The word 'word' must appear in the text field of the hit.
-text:word -- The word 'word' must not appear in the text field of the hit.
These two can also be combined:
+text:word -text:cat -- The word 'word' must appear in the text field and the word 'cat' must not.
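The + and - prefixes can also be assembled programmatically. The following is a minimal sketch, assuming every word targets the same field; boolean_query is a hypothetical helper name.

```python
def boolean_query(must=(), must_not=(), field="text"):
    """Prefix required words with + and excluded words with -."""
    clauses = [f"+{field}:{word}" for word in must]
    clauses += [f"-{field}:{word}" for word in must_not]
    # Solr separates clauses by whitespace.
    return " ".join(clauses)


print(boolean_query(must=["word"], must_not=["cat"]))
# +text:word -text:cat
```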
Phrase search
Solr separates query terms by whitespace, so to search for a multi-word phrase you must wrap it in quotation marks:
text:"this is a word" - The phrase "this is a word" must appear in the passage.
Fuzzy matching
Solr can perform fuzzy matching, where words that are very similar to the query word are accepted.
text:dog~ - Words that are similar to dog are accepted. E.g. dag.
This can be useful for finding hits in the database, since the OCR process may have degraded the text quality so that exact matches aren't sufficient anymore.
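Fuzzy matching in Solr is based on edit distance, and an optional number after the tilde (for example text:dog~1) limits how many edits are allowed. To give a feel for what counts as "similar", the sketch below computes the plain Levenshtein distance between two words. This is a simplification for illustration: Solr uses the Damerau-Levenshtein variant, which also counts a swap of two adjacent letters as a single edit.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = cur
    return prev[-1]


print(edit_distance("dog", "dag"))   # 1
print(edit_distance("dog", "dogs"))  # 1
print(edit_distance("dog", "cat"))   # 3
```

With this measure, dag and dogs are each one edit away from dog, while cat is three edits away.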
Range queries:
count:[50 TO *] - Shows clusters that have 50 or more hits.
locations:[4 TO 5] - Shows clusters that have spread to 4 or 5 different unique locations.
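Square brackets make both ends of a range inclusive, and * leaves an end open. Solr also accepts a curly brace to exclude an endpoint, so count:{50 TO *] means strictly more than 50. A minimal sketch of a builder follows; range_query and its parameters are hypothetical names.

```python
def range_query(field, low=None, high=None, exclusive_low=False):
    """Build field:[low TO high]; None becomes the open bound *."""
    low_text = "*" if low is None else str(low)
    high_text = "*" if high is None else str(high)
    opener = "{" if exclusive_low else "["
    return f"{field}:{opener}{low_text} TO {high_text}]"


print(range_query("count", low=50))                      # count:[50 TO *]
print(range_query("locations", 4, 5))                    # locations:[4 TO 5]
print(range_query("count", low=50, exclusive_low=True))  # count:{50 TO *]
```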
Wildcards:
word* - Search for words that start with word, followed by any possible ending (including none).
word? - Search for words that start with word, followed by exactly one extra character.
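The difference between * and ? can be tried out locally with Python's fnmatch module, whose patterns use the same two symbols with the same meaning. This only imitates the matching on a made-up word list; it is not how Solr evaluates wildcards internally.

```python
from fnmatch import fnmatchcase

# A made-up word list for illustration.
words = ["word", "words", "wordy", "wording", "sword"]

# * stands for zero or more characters.
star = [w for w in words if fnmatchcase(w, "word*")]
# ? stands for exactly one character.
question = [w for w in words if fnmatchcase(w, "word?")]

print(star)      # ['word', 'words', 'wordy', 'wording']
print(question)  # ['words', 'wordy']
```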
Real example queries:
If you want to find and see all the hits and/or clusters, type: *:*
brand* AND Åbo
Finds hits or clusters in which different forms of the word brand (for example, branden) and the word Åbo occur in the same text.
brand* AND Åbo NOT Brandenburg
Finds hits or clusters with different forms of the word brand (for example, branden) and the word Åbo, but excludes those containing the word Brandenburg.
location:Mal*
Finds all hits from locations starting with Mal-.
title:"Vårt Land"
Finds all hits from the paper Vårt Land, and also all clusters that include hits from Vårt Land.
timespan:[* TO 10]
Finds clusters with a timespan from 0 to 10 days. This covers reprints within the same day (0) up to those within ten days.
cluster_id:cluster_13180197
Finds a particular cluster.
count:[* TO 50]
Finds clusters with a count from 2 to 50. (The minimum count is 2, basically a text with one reprint.)
crossed:true AND count:[100 TO *]
Finds clusters that spanned two or more countries and have a count of 100 or more.
locations:[10 TO *]
Finds clusters which contain 10 or more printing locations.
all_locations:(Umeå AND Oulu)
Finds all clusters which have both Umeå and Oulu in their printing locations.
all_locations:(Umeå OR Oulu)
Finds all clusters which have either Umeå or Oulu in their printing locations.
all_countries:(Finland AND Norway)
Finds all clusters which have been printed in both Finland and Norway.
all_countries:(Finland OR "United States")
Finds all clusters which have been printed in either Finland or in the United States.
starting_date:"1904-10-01T00:00:00Z"
Finds clusters that started on 1 October 1904.
starting_date:[1804-10-01T00:00:00Z TO 1904-10-01T00:00:00Z]
Finds clusters that have started between 1 October 1804 and 1 October 1904.
virality_score:[90 TO 100]
Finds the clusters with the highest virality scores. The virality score is a number from 0 to 100.
gap:[100 TO *]
Finds clusters whose largest time distance (gap) between two subsequent reprints is 100 years or more.
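The starting_date examples above use Solr's date format: an ISO 8601 timestamp in UTC ending in Z. The sketch below shows one way to produce it from a calendar date, assuming midnight as the time of day as in those examples; solr_date and date_range are hypothetical helper names.

```python
from datetime import date


def solr_date(day: date) -> str:
    """Format a calendar date as a Solr timestamp at midnight UTC."""
    return f"{day.isoformat()}T00:00:00Z"


def date_range(field: str, start: date, end: date) -> str:
    """Build an inclusive date range clause for the given field."""
    return f"{field}:[{solr_date(start)} TO {solr_date(end)}]"


print(solr_date(date(1904, 10, 1)))
# 1904-10-01T00:00:00Z
print(date_range("starting_date", date(1804, 10, 1), date(1904, 10, 1)))
# starting_date:[1804-10-01T00:00:00Z TO 1904-10-01T00:00:00Z]
```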
Combinations:
in_city:Turku AND all_locations:(Oulu AND Vaasa)
Finds all clusters where the first printing in Finland occurred in Turku (incoming city) and that have Oulu and Vaasa among their printing locations.
multiple_starting_locations:true AND ending_location:Kokkola
Finds clusters with texts that were first printed in more than one location on the same day and whose last printing was by a newspaper in Kokkola.
starting_country:Finland AND in_city:Umeå
Finds clusters whose text was first printed in Finland and moved to Sweden via Umeå.
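For readers who want to script such combinations, the sketch below joins clauses with AND and URL-encodes the result as the q parameter of a Solr select request. The base URL and the core name clusters are placeholders (assumptions), not the address of this service; only the /select handler and the q and rows parameters are standard Solr.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def and_query(*clauses: str) -> str:
    """Join clauses so that all of them must match."""
    return " AND ".join(clauses)


def select_url(base_url: str, query: str, rows: int = 10) -> str:
    """Build a select URL; urlencode handles spaces, colons and non-ASCII."""
    return f"{base_url}/select?" + urlencode({"q": query, "rows": rows})


query = and_query("in_city:Turku", "all_locations:(Oulu AND Vaasa)")
print(query)
# in_city:Turku AND all_locations:(Oulu AND Vaasa)

# Hypothetical host and core name, for illustration only.
url = select_url("http://localhost:8983/solr/clusters", query)

# Decoding the URL again gives back the original query string.
print(parse_qs(urlsplit(url).query)["q"][0] == query)  # True
```

Encoding matters here because queries contain spaces, colons, brackets and letters such as å, none of which can be placed in a URL unescaped.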