Ferret is a native Ruby port of Apache Lucene, a powerful search engine. You can hook Ferret into Ruby on Rails with the acts_as_ferret plug-in and start playing with full-text search almost immediately. Here are some modifications to get the most out of it.
Unicode search add-on
This is a refactored version of the code from Albert's excellent post.
require 'ferret'
ACCENTUATED_CHARS = 'ÅÄÀAÂåäàâaÖÔôöÉÈÊËéèêëÜüùç'
REPLACEMENT_CHARS = 'aaaaaaaaaaooooeeeeeeeeuuuc'
# Replace accented characters with their ASCII equivalents.
class ToASCIIFilter < Ferret::Analysis::TokenFilter
  def next
    token = @input.next
    unless token.nil?
      # Lowercase first, then map each accented character to its ASCII counterpart.
      token.text = token.text.downcase.tr(ACCENTUATED_CHARS, REPLACEMENT_CHARS)
    end
    token
  end
end
# This regexp keeps accented characters inside words instead of splitting on them.
class EuropeanTokenizer < Ferret::Analysis::RegExpTokenizer
  P = /[_\/.,-]/
  HASDIGIT = /\w*\d\w*/
  def token_re
    %r([[:alpha:]#{ACCENTUATED_CHARS}]+(('[[:alpha:]#{ACCENTUATED_CHARS}]+)+
    |\.([[:alpha:]#{ACCENTUATED_CHARS}]\.)+
    |(@|\&)\w+([-.]\w+)*
    )
    |\w+(([\-._]\w+)*\@\w+([-.]\w+)+
    |#{P}#{HASDIGIT}(#{P}\w+#{P}#{HASDIGIT})*(#{P}\w+)?
    |(\.\w+)+
    |
    )
    )x
  end
end
# This analyzer splits the text into tokens with EuropeanTokenizer, then
# replaces every accented character with its 7-bit ASCII equivalent.
class EuropeanAnalyzer < Ferret::Analysis::Analyzer
  def token_stream(field, string)
    ToASCIIFilter.new(EuropeanTokenizer.new(string))
  end
end
module FerretMixin
  module Acts #:nodoc:
    # Rebuild the index with the configured analyzer
    # (bug fix for the acts_as_ferret plug-in).
    module ARFerret #:nodoc:
      def rebuild_index
        index = Index::Index.new(ferret_configuration)
        self.find_all.each { |content| index << content.to_doc }
        logger.debug("Created Ferret index in: #{class_index_dir}")
        index.flush
        index.optimize
        index.close
      end
    end
  end
end
Then declare the analyzer in your model with this line:
acts_as_ferret({ :fields => [:title, :text, :description] }, :analyzer => EuropeanAnalyzer.new)
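To check what the character folding in ToASCIIFilter does without building an index, here is a minimal sketch in plain Ruby that needs neither Ferret nor Rails. It reuses the two lookup strings from the listing; `fold_to_ascii` is a hypothetical helper name introduced only for this illustration, and the sample words are made up.

```ruby
# Standalone sketch: same lookup strings as the filter, no Ferret required.
ACCENTUATED_CHARS = 'ÅÄÀAÂåäàâaÖÔôöÉÈÊËéèêëÜüùç'
REPLACEMENT_CHARS = 'aaaaaaaaaaooooeeeeeeeeuuuc'

# Hypothetical helper (not part of Ferret or acts_as_ferret): applies the
# same downcase + tr step that ToASCIIFilter applies to each token's text.
def fold_to_ascii(text)
  text.downcase.tr(ACCENTUATED_CHARS, REPLACEMENT_CHARS)
end

indexed = fold_to_ascii('Élève')   # form that would end up in the index
queried = fold_to_ascii('eleve')   # form a user might type in a search box

puts indexed              # eleve
puts indexed == queried   # true

# Characters missing from the lookup string pass through unchanged.
puts fold_to_ascii('señor')   # señor
```

Because the same folding runs at index time and at query time, accented and unaccented spellings end up as the same term. Only the characters listed in ACCENTUATED_CHARS are folded, so extend both strings together, one character each, if you need more.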
Yeah, it is a good thing to get full-text search so easily.