« 25 Days of Ruby Gems - Ruby Advent Calendar 2020, December 1st - December 25th
Written by Matt Swanson
Contrarian-in-training. Building products. Karl Pilkington is my spirit animal. Hacking on Boring Rails.
If you’ve ever built an application that involved file uploads, inevitably you will receive a request to be able to search through those files.
While there are plenty of articles and tools for implementing full-text search with Ruby, nearly all of these examples are for searching your database records. But what if you need to search the contents of a PDF? Or a Microsoft Word document? Or even a PowerPoint presentation? Sounds like a nightmare.
The basic strategy for this problem is to extract as much textual content from the file as you can, break it into chunks (maybe by page or paragraph), and then index those chunks in a tool like ElasticSearch, Algolia, or PgSearch.
But how do you get the text out of these files? It’s not as simple as reading a .txt file.
The henkei gem is a small wrapper around the Apache Tika project.
You can extract the text of any supported file using a common interface:
require 'henkei'
data = File.read('TPS Report v2.docx')
text = Henkei.read(:text, data)
Supported formats include the ones mentioned above (PDF, Microsoft Word, and PowerPoint), among others. For the complete list, please visit the Apache Tika Supported Document Formats page.
In most cloud environments, files are stored on an external service. Henkei can also open a file from a URL:
henkei = Henkei.new 'http://my-bucket.s3.aws.com/uploads/2020-projections.pptx'
text = henkei.text
Now that you’ve got all the text from the file in a big Ruby String, you can use whatever methods you want to split the data into chunks and integrate it into full-text search tools.
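require 'henkei'
# String#squish and #blank? come from ActiveSupport (already loaded in Rails)
require 'active_support/core_ext/string/filters'
require 'active_support/core_ext/object/blank'

# Characters per chunk; an arbitrary example value, tune it for your search tool
MAX_CHUNK_SIZE = 1_000

def extract_text_chunks(s3_url)
  raw_text = Henkei.new(s3_url).text
  chunks = []
  chunk = ""
  # Split on runs of non-word characters and pack the words into chunks
  raw_text.split(/[^[[:word:]]]+/).each do |word|
    chunk += "#{word} "
    if chunk.size > MAX_CHUNK_SIZE
      chunks << chunk.squish
      chunk = ""
    end
  end
  # Keep the final, partially filled chunk
  chunks << chunk.squish
  chunks.reject(&:blank?)
end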
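Chunking by paragraph, as suggested earlier, is another option. Here is a minimal sketch using only plain Ruby. `paragraph_chunks` is a hypothetical helper (not part of henkei), and it assumes the extracted text separates paragraphs with blank lines, which may not hold for every document format.

```ruby
# Hypothetical helper, not part of henkei: split extracted text into
# paragraph chunks, assuming blank lines mark the paragraph boundaries.
def paragraph_chunks(raw_text)
  raw_text
    .split(/\n\s*\n/)                    # break on one or more blank lines
    .map { |para| para.split.join(' ') } # collapse line wraps and extra spaces
    .reject(&:empty?)                    # drop chunks that were only whitespace
end

sample = "First paragraph,\nwrapped over two lines.\n\n\nSecond paragraph.\n"
paragraph_chunks(sample).each { |chunk| puts chunk }
```

Paragraph chunks tend to keep related sentences together, at the cost of uneven chunk sizes; very long paragraphs could still be passed through a size-based splitter afterwards.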
One note: since this gem wraps the Apache Tika library, you will need a Java runtime in your environment to use it. It should not be a problem to add a Java runtime on most hosting providers, but be aware of this dependency.
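If you want to catch a missing Java runtime early (say, at boot or in a deploy check), something like the following sketch might help. `executable_on_path?` is a hypothetical helper, and it assumes a Unix-like system where the runtime is reachable as a `java` executable on `PATH`; your environment may locate Java differently.

```ruby
# Hypothetical helper: true if an executable file with the given name
# exists in any directory listed in the PATH string.
def executable_on_path?(name, path = ENV.fetch('PATH', ''))
  path.split(File::PATH_SEPARATOR).any? do |dir|
    candidate = File.join(dir, name)
    File.file?(candidate) && File.executable?(candidate)
  end
end

# Assumption: the Java runtime is exposed as `java` on PATH.
unless executable_on_path?('java')
  warn 'No `java` executable found on PATH; text extraction will likely fail.'
end
```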