[Please remove <h1>]
Finding all documents that match a query is a central requirement for search engines, and our goal is to do so efficiently. First, we have developed improved indexing techniques that are more compact and/or provide faster retrieval for conjunctive queries. More specifically, we proposed a new structure, partial bitvectors, and showed that it combines the benefits of bitvectors for dense data with the benefits of skips for sparse data [9]. By suitably choosing document identifiers for a collection, postings lists can be skewed to magnify density or sparsity and to broaden the spans of dense and sparse regions, thus making partial bitvectors effective storage structures. Similar benefits arising from skew can also be obtained by distributing documents according to their size [8]. Performance improvements are also achieved by distributing small postings lists (those for terms that appear in few documents) using term distribution and the remaining lists using document distribution [10]. Finally, we have also begun to investigate how to improve the efficiency of ranking systems without deteriorating their effectiveness [11].
In a second line of research, we have designed and built specialized search engines for matching queries expressed as mathematical formulae [3]. We designed a compact representation and associated algorithms for indexing mathematical formulae based on consolidating all shared subexpressions, and we have shown this approach to be both effective [5] and efficient [6]. We have also designed a database-like query interface that permits users to use wildcards and constraints to specify precise information needs, and we have shown it to be effective [4]. More recently, we improved the Tangent Math Retrieval System to provide more effective retrieval by designing and implementing a core engine that uses significantly less space for the index and significantly less time for retrieval [14, 1, 2].
Building on our earlier work to computerize the Oxford English Dictionary [12], we have also designed search engines for several humanities projects. In one, we developed a search engine for matching passages that are shared by pairs of documents [7]. For this purpose we adapted the winnowing technique from plagiarism detection software to index normalized Latin text and then augmented it to be able to identify maximal overlaps between two texts. In another, we adapted Boolean search to accommodate variants in Latin orthography in the absence of postings lists and created Dragoman, a display engine based on simple-to-use style sheets [13].
In a third application area, we wish to provide researchers with a reliable framework to search electronic health records. Health researchers would like to access all the data available in a clinical database to conduct accurate studies, however at the same time patients wish to protect private data contained in their personal records. In our current study, we are investigating how to apply active learning to build an analytics-aware search engine that can support cohort identification. In addition, when dealing with records for which a patient has not given consent, protecting their privacy is an essential problem that must be addressed. For this latter situation, we are investigating how to apply differential privacy to documents.
References
- K. Davila, R. Zanibbi, A. R. Kane, and F. W. Tompa, "Tangent-3 at the NTCIR-12 MathIR Task," NTCIR 12 Conference on Evaluation of Information Access Technologies, June 2016, 8 pp.
- D. Fraser, A. R. Kane, and F. W. Tompa, "Choosing Math Features for BM25 Ranking with Tangent-L," 18th ACM Symposium on Document Engineering (DocEng 2018), 10 pp. "SIGWEB DocEng Best Paper Award".
- S. Kamali and F. W. Tompa, "Improving Mathematics Retrieval," 2nd Workshop Towards a Digital Mathematics Library (DML 2009), Grand Bend, Canada, July 8-9, 2009, 37-48.
- S. Kamali and F. W. Tompa, "A New Mathematics Retrieval System," ACM 19th Conference on Information and Knowledge Management (CIKM 2010), Toronto, October 26-30, 2010, 1413-1416.
- S. Kamali and F. W. Tompa, "Structural Similarity Search for Mathematics Retrieval," Conference on Intelligent Computer Mathematics (CICM 2013), Springer LNAI 7961, Bath, July 8-12, 2013, 246-262.
- S. Kamali and F. W. Tompa, "Retrieving Documents with Mathematical Content," 36th Annual ACM SIGIR Conference (SIGIR 2013), Dublin, July 29-31, 2013, 353-362.
- A. R. Kane and F. W. Tompa, "Janus: the Intertextuality Search Engine for the Electronic Manipulus florum Project," Literary and Linguistic Computing 26,4 (December 2011) 407-415.
- A. R. Kane and F. W. Tompa, "Distribution by Document Size," 11th Int. Workshop on Large-Scale and Distributed Systems for Information Retrieval (LSDS-IR'14), New York, February 28, 2014, 6 pp.
- A. R. Kane and F. W. Tompa, "Skewed Partial Bitvectors for List Intersection," 37th Annual ACM SIGIR Conference (SIGIR 2014), Broadbeach (Queensland), July 6-11, 2014, 263-272.
- A. R. Kane and F. W. Tompa, "Small-Term Distribution for Disk-Based Search," 17th ACM Symposium on Document Engineering (DocEng 2017), 49-58.
- A. R. Kane and F. W. Tompa, "Split-Lists and Initial Thresholds for WAND-based Search," 41st Annual ACM SIGIR Conference (SIGIR 2018), 877-880.
- F. W. Tompa, "Not Just Another Database Project: Developments at UW," Reflections on the Future of Text, Proc. 10th Conf. of Univ. of Waterloo Centre for the New OED (October 20-21, 1994), pp. 82-89.
- F. W. Tompa, "Fashioning a Search Engine to Support Humanities Research," 18th ACM Symposium on Document Engineering (DocEng 2018), 10 pp.
- R. Zanibbi, K. Davila, A. R. Kane, and F. W. Tompa, "Multi-Stage Math Formula Search: Using Appearance-Based Similarity Metrics at Scale," 39th Annual ACM SIGIR Conference (SIGIR 2016), Pisa, July 17-21, 2016, 145-154.