A Framework For Aggregating And Retrieving Relevant Information Using TF-IDF And Term Proximity In Support Of Maize Production
Philemon Kasyoka, Waweru Mwangi, Michael Kimwele
Index Terms: Inverse Document Frequency, Information Retrieval, RSS, Term Frequency, Term Proximity
Abstract: This paper presents a framework for aggregating and retrieving relevant maize information using Term Frequency-Inverse Document Frequency (TF-IDF) and term proximity. The framework aggregates information from agricultural websites and blogs using RSS technology. TF-IDF can retrieve relevant documents from the aggregated RSS feeds; however, the presence of a query term within a retrieved document does not necessarily imply relevance, and documents with the same similarity score do not necessarily have the same level of relevance. To mitigate this problem, we implement a term proximity scoring approach that improves relevance among the top-k documents returned by TF-IDF. The term proximity score uses both the span-based method and the pair-based method to ensure effective proximity scoring. The user preference profile is based on keywords that form the user query, while text documents are composed of the RSS description content and RSS title tag content. Stemming is applied to query and document terms for better precision. The framework is intended to help maize farmers get the most relevant information from online sources.
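The retrieval pipeline summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the smoothed IDF formula, the way the span-based and pair-based components are combined, the re-ranking rule, and all function names (`tfidf_scores`, `min_span`, `proximity_score`, `rerank`) are assumptions chosen for illustration. Stemming and RSS parsing are omitted, and each document is assumed to be the already-extracted RSS title and description text.

```python
import math
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase and split into alphanumeric tokens (no stemming in this sketch)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_scores(query_terms, docs):
    """Sum of TF-IDF weights of the query terms per document.

    Assumption: length-normalized TF and a smoothed IDF, log((1+N)/(1+df)) + 1.
    """
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    scores = []
    for toks in tokenized:
        if not toks:
            scores.append(0.0)
            continue
        tf = Counter(toks)
        score = 0.0
        for term in set(query_terms):
            if tf[term]:
                idf = math.log((1 + n) / (1 + df[term])) + 1
                score += (tf[term] / len(toks)) * idf
        scores.append(score)
    return scores, tokenized


def min_span(positions):
    """Width of the smallest token window containing every term in `positions`."""
    events = sorted((p, t) for t, ps in positions.items() for p in ps)
    need, have, left, best = len(positions), 0, 0, None
    counts = Counter()
    for p_right, t_right in events:
        counts[t_right] += 1
        if counts[t_right] == 1:
            have += 1
        while have == need:  # shrink the window from the left while it still covers all terms
            p_left, t_left = events[left]
            width = p_right - p_left + 1
            if best is None or width < best:
                best = width
            counts[t_left] -= 1
            if counts[t_left] == 0:
                have -= 1
            left += 1
    return best


def proximity_score(query_terms, tokens):
    """Span-based part plus pair-based part (the combination is an assumption)."""
    positions = {}
    for i, tok in enumerate(tokens):
        if tok in query_terms:
            positions.setdefault(tok, []).append(i)
    if len(positions) < 2:
        return 0.0  # proximity is undefined for fewer than two matched terms
    span_part = len(positions) / min_span(positions)
    pair_part = 0.0
    for a, b in combinations(sorted(positions), 2):
        d = min(abs(x - y) for x in positions[a] for y in positions[b])
        pair_part += 1.0 / (d * d)
    return span_part + pair_part


def rerank(query, docs, k=10):
    """Take the top-k documents by TF-IDF, then order them by proximity score."""
    q_terms = tokenize(query)
    scores, tokenized = tfidf_scores(q_terms, docs)
    top = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(
        top,
        key=lambda i: (proximity_score(q_terms, tokenized[i]), scores[i]),
        reverse=True,
    )
```

In this sketch the TF-IDF score is used only as a tie-breaker within the top-k set, so two documents with the same similarity score are separated by how closely the query terms occur together. A weighted sum of the two scores would be an equally plausible design.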