« 25 Days of Ruby Gems - Ruby Advent Calendar 2020, December 1st - December 25th

Day 18 - henkei Gem - Read Text and Meta Data from Word, PowerPoint, and PDF Files

Written by Matt Swanson

Contrarian-in-training. Building products. Karl Pilkington is my spirit animal. Hacking on Boring Rails.

Searching within uploaded files

If you’ve ever built an application that involves file uploads, you will inevitably be asked to make those files searchable.

While there are plenty of articles and tools for implementing full-text search with Ruby, nearly all of these examples are for searching your database records. But what if you need to search the contents of a PDF? Or a Microsoft Word document? Or even a PowerPoint presentation? Sounds like a nightmare.

The basic strategy for this problem is to extract as much textual content from the file as you can, break it into chunks (maybe by page or paragraph), and then index those chunks in a tool like Elasticsearch, Algolia, or PgSearch.
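As a rough illustration of the "break it into chunks" step, here is a minimal sketch that splits already-extracted text into paragraphs on blank lines. The method name `paragraph_chunks` is hypothetical and not part of any gem; it assumes the extractor keeps blank lines between paragraphs, which may not hold for every file format.

```ruby
# Hypothetical helper: split extracted text into paragraph-sized chunks.
# Assumes paragraphs are separated by one or more blank lines.
def paragraph_chunks(text)
  text
    .split(/\n\s*\n/)                               # break on blank lines
    .map { |para| para.gsub(/\s+/, " ").strip }     # collapse internal whitespace
    .reject(&:empty?)                               # drop empty leftovers
end

sample = "First paragraph,\nwrapped over two lines.\n\n\nSecond paragraph.\n"
paragraph_chunks(sample)
# => ["First paragraph, wrapped over two lines.", "Second paragraph."]
```

Each returned string could then be stored as one searchable record in whichever search tool you use.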

But how do you get the text out of these files? It’s not as simple as reading a .txt file.

Enter henkei

The henkei gem is a small wrapper around the Apache Tika project.

You can extract the text of any supported file using a common interface:

require 'henkei'

data = File.read('TPS Report v2.docx')
text = Henkei.read(:text, data)

Here are some of the formats supported:

Microsoft Office OLE 2 and Office Open XML Formats (.doc, .docx, .xls, .xlsx, .ppt, .pptx)
OpenOffice.org OpenDocument Formats (.odt, .ods, .odp)
Apple iWorks Formats
Rich Text Format (.rtf)
Portable Document Format (.pdf)

For the complete list of supported formats, please visit the Apache Tika Supported Document Formats page.

How it works in practice

In most cloud environments, files are stored on an external service. Henkei can also open a file from a URL:

henkei = Henkei.new 'http://my-bucket.s3.aws.com/uploads/2020-projections.pptx'
text = henkei.text

Now that you’ve got all the text from the file in one big Ruby String, you can split it into chunks however you like and feed those chunks into a full-text search tool.

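# `squish` and `blank?` come from ActiveSupport, so this runs as-is inside Rails.
# MAX_CHUNK_SIZE is an example value (in characters); tune it for your search tool.
MAX_CHUNK_SIZE = 1_000

def extract_text_chunks(s3_url)
  raw_text = Henkei.new(s3_url).text
  chunks = []
  chunk = ""

  # Split on runs of non-word characters and pack the words into chunks.
  raw_text.split(/[^[[:word:]]]+/).each do |word|
    chunk += "#{word} "
    if chunk.size > MAX_CHUNK_SIZE
      chunks << chunk.squish
      chunk = ""
    end
  end

  # Keep the final, partially filled chunk instead of dropping it.
  chunks << chunk.squish
  chunks.reject(&:blank?)
end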
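To make the indexing step concrete, here is a toy in-memory index over chunks like the ones `extract_text_chunks` returns. It is only a stand-in for a real search tool: the `ChunkIndex` class is hypothetical, written in plain Ruby, and matches whole words only, with no ranking or stemming.

```ruby
# Hypothetical toy index: maps each lowercased word to the ids of the
# chunks that contain it. A real app would hand the chunks to a search tool.
class ChunkIndex
  def initialize
    @chunks = []
    @postings = Hash.new { |hash, word| hash[word] = [] }
  end

  # Stores a chunk and records which words occur in it.
  def add(chunk)
    id = @chunks.size
    @chunks << chunk
    chunk.downcase.scan(/[[:word:]]+/).uniq.each { |word| @postings[word] << id }
    id
  end

  # Returns the chunks that contain every word of the query.
  def search(query)
    words = query.downcase.scan(/[[:word:]]+/).uniq
    return [] if words.empty?

    ids = words.map { |word| @postings.fetch(word, []) }.reduce(:&)
    ids.map { |id| @chunks[id] }
  end
end

index = ChunkIndex.new
["Quarterly TPS report cover sheet", "Revenue projections for 2020"].each do |chunk|
  index.add(chunk)
end
index.search("tps report")
# => ["Quarterly TPS report cover sheet"]
```

Because each chunk keeps its own id, a result can be traced back to the uploaded file it came from if you store that association alongside the chunk.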

Installation Note

Since this gem wraps the Apache Tika library, you will need a Java runtime in your environment to use it. Adding a Java runtime should not be a problem on most hosting providers, but be aware of this dependency.
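One way to catch a missing runtime early is to look for a `java` executable on the `PATH` at boot. The sketch below is an assumption about how you might do that in plain Ruby; `java_available?` is a hypothetical helper, not something the gem provides, and it only checks that the executable exists, not which Java version it is.

```ruby
# Hypothetical helper: true when an executable named `java` is on the PATH.
def java_available?(path = ENV.fetch("PATH", ""))
  # On Windows, PATHEXT lists executable suffixes such as ".EXE".
  exts = [""] + ENV.fetch("PATHEXT", "").split(";")

  path.split(File::PATH_SEPARATOR).any? do |dir|
    exts.any? do |ext|
      candidate = File.join(dir, "java#{ext}")
      File.file?(candidate) && File.executable?(candidate)
    end
  end
end

warn "No Java runtime found on PATH; text extraction will not work." unless java_available?
```

Running a check like this in an initializer or a deploy script surfaces the problem before the first upload is processed.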
