Hack@Brown 2018: The Retextualizationizer

APIs used:

Flask

jQuery

Gensim

NLTK

HTML5 Speech Recognition

PyPDF2

Textract

There's a lot of information in the world. We built a tool that makes it easy to pull out the crucial points of any long piece of text. (We hope it will be especially useful to college students with heavy reading assignments.)

Description

We use Gensim and NLTK to package natural language processing and machine learning in an easy-to-use tool. Specify a target summary length (in words) and a keyword count, or use our default settings! Paste text directly, use the microphone for speech-to-text, or select a PDF. Note: Keywords may be stemmed.
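
To give a feel for what "target length" and "keyword count" control, here is a minimal sketch of extractive summarization in plain Python. It is not our Gensim/NLTK pipeline: it swaps in simple word-frequency scoring so that it runs with the standard library alone. The names `summarize`, `top_keywords`, `tokenize`, and `STOPWORDS` are hypothetical and exist only for this illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption); a real tool would use a fuller one.
STOPWORDS = {"a", "an", "and", "are", "in", "is", "it", "of", "the", "to"}


def tokenize(text):
    """Lowercase word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def top_keywords(text, count=5):
    """Return the `count` most frequent non-stopword tokens."""
    return [word for word, _ in Counter(tokenize(text)).most_common(count)]


def summarize(text, target_words=50):
    """Pick the highest-scoring sentences that fit within the word budget."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(tokenize(text))

    def score(sentence):
        # Average corpus frequency of the sentence's non-stopword tokens.
        tokens = tokenize(sentence)
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen, used = set(), 0
    for idx in ranked:
        length = len(sentences[idx].split())
        if used + length <= target_words:
            chosen.add(idx)
            used += length
    # Emit the selected sentences in their original order.
    return " ".join(sentences[i] for i in sorted(chosen))
```

Calling `summarize(text, target_words=100)` and `top_keywords(text, count=10)` mirrors the two settings described above. Frequency scoring is cruder than a graph-based ranker, but the interface (text in, word budget in, sentences out) is the same idea.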

How we built it

We used Flask to integrate our Python backend and processing with our HTML5/CSS/JS frontend. Our backend used the above APIs to provide summarization functionality through multiple methods.

Challenges

Most of us are first-time hackers! It was also our first experience with Flask, PDF processing, speech recognition software, and full-stack development in general.

Backlog/what's next:

  1. Handle PDF formatting better: cleanly formatted PDFs are handled well, but more complex formatting can be a problem. Also, in scientific papers the program sometimes picks up sentences from the references section, so in the future we would trim that section out.
  2. One idea we started with but did not implement was to link each sentence in the summary back to its place in the text, so that users can quickly learn more about that topic and easily pull quotes or pinpoint lines.
  3. For now, we return stemmed keywords. We need to stem them so that inflected/derived words are reduced to a common base form, but the resulting stems are sometimes not actual words. We would eventually handle this so that we return actual words.
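
None of the three backlog items is implemented yet, but each could start from something like the rough sketch below. Everything in it is an assumption made for illustration: the function names are hypothetical, the references heading is assumed to sit on its own line, summary sentences are assumed to appear verbatim in the source, and `toy_stem` is a crude stand-in for a real stemmer such as the one NLTK provides.

```python
import re
from collections import Counter, defaultdict


def trim_references(text):
    """Drop everything from the last 'References'/'Bibliography' heading onward."""
    headings = list(re.finditer(r"(?im)^\s*(references|bibliography)\s*$", text))
    return text[: headings[-1].start()].rstrip() if headings else text


def locate_sentences(summary_sentences, source_text):
    """Map each summary sentence to its character offset in the source (-1 if absent)."""
    return {s: source_text.find(s) for s in summary_sentences}


def toy_stem(word):
    """Crude suffix stripper, standing in for a real stemmer (assumption)."""
    for suffix in ("ations", "ation", "ing", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def restore_keywords(stemmed_keywords, source_text, stem=toy_stem):
    """Replace each stem with the most frequent source word that reduces to it."""
    surface_forms = defaultdict(Counter)
    for word in re.findall(r"[a-z]+", source_text.lower()):
        surface_forms[stem(word)][word] += 1
    # Stems never seen in the source are returned unchanged.
    return [
        surface_forms[k].most_common(1)[0][0] if k in surface_forms else k
        for k in stemmed_keywords
    ]
```

The offsets from `locate_sentences` are what a frontend would need to scroll to or highlight the original passage. `restore_keywords` only works if it is given the same stemmer that produced the keywords, which is why `stem` is a parameter.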
