Introduction
This program was originally meant to be written in R (a language widely used in statistics, alongside Python and Stata) but is now written in Java. It treats email validation as a classification problem: we want to decide whether a given email address is valid or invalid or, in programming terms, "1" (true) or "0" (false). We are given a training set of valid email addresses and feed it into a Java program, together with a natural language processing library called OpenNLP. Short of actually sending an email to the address in question, the next best approach is to use statistics on the structure of the address to uncover hidden or obscure patterns that identify a valid email address. Whilst we can't say for sure, we can say on the balance of probabilities whether a particular email address is valid or invalid, and that is sometimes good enough for practical purposes (as it is for this particular hackathon).
The program learns from the list of validated email addresses and then tries to work out which addresses in the unfiltered list are valid. On a more technical level, the technique used is a Naive Bayes classifier.
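To make the idea concrete, here is a minimal sketch of a Naive Bayes classifier over structural features of an address. It is not the project's actual code and it does not use OpenNLP: the class name EmailNaiveBayes, the feature names (atCount, domainHasDot and so on), and the tiny training list are all hypothetical, chosen only for illustration. It assumes the training data contains both valid and invalid examples, as listed under Requirements.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only: a hand-rolled multinomial Naive Bayes.
// All names and features here are assumptions, not the project's real code.
class EmailNaiveBayes {
    // Index 0 = invalid, index 1 = valid.
    private final int[] docCounts = new int[2];
    private final int[] tokenTotals = new int[2];
    private final List<Map<String, Integer>> tokenCounts = new ArrayList<>();
    private final Set<String> vocabulary = new HashSet<>();

    EmailNaiveBayes() {
        tokenCounts.add(new HashMap<>());
        tokenCounts.add(new HashMap<>());
    }

    // Turns an address into a handful of hypothetical structural features.
    static List<String> features(String email) {
        int atCount = 0;
        for (char c : email.toCharArray()) {
            if (c == '@') {
                atCount++;
            }
        }
        int at = email.lastIndexOf('@');
        String local = at >= 0 ? email.substring(0, at) : email;
        String domain = at >= 0 ? email.substring(at + 1) : "";

        List<String> f = new ArrayList<>();
        f.add("atCount=" + Math.min(atCount, 2));
        f.add("localEmpty=" + local.isEmpty());
        f.add("domainHasDot=" + (domain.indexOf('.') > 0));
        f.add("hasSpace=" + email.contains(" "));
        f.add("doubleDot=" + email.contains(".."));
        return f;
    }

    // Counts each feature of a labelled example under its class.
    void train(String email, boolean valid) {
        int c = valid ? 1 : 0;
        docCounts[c]++;
        for (String token : features(email)) {
            tokenCounts.get(c).merge(token, 1, Integer::sum);
            tokenTotals[c]++;
            vocabulary.add(token);
        }
    }

    // Log prior plus summed log likelihoods, with add-one smoothing.
    // The extra +1 in the denominator reserves mass for unseen features.
    double logScore(String email, int c) {
        int totalDocs = docCounts[0] + docCounts[1];
        double score = Math.log((docCounts[c] + 1.0) / (totalDocs + 2.0));
        for (String token : features(email)) {
            int count = tokenCounts.get(c).getOrDefault(token, 0);
            score += Math.log((count + 1.0) / (tokenTotals[c] + vocabulary.size() + 1.0));
        }
        return score;
    }

    boolean isLikelyValid(String email) {
        return logScore(email, 1) > logScore(email, 0);
    }

    // Builds a model from a made-up training list (illustration only).
    static EmailNaiveBayes demoModel() {
        EmailNaiveBayes nb = new EmailNaiveBayes();
        nb.train("alice@example.com", true);
        nb.train("bob.smith@mail.org", true);
        nb.train("carol@university.edu", true);
        nb.train("alice.example.com", false);
        nb.train("bob@@mail.org", false);
        nb.train("carol@university", false);
        nb.train("dave smith@mail.com", false);
        return nb;
    }

    public static void main(String[] args) {
        EmailNaiveBayes nb = demoModel();
        for (String e : new String[] {"erin@company.net", "no-at-sign.example.com"}) {
            System.out.println(e + " -> " + (nb.isLikelyValid(e) ? "1" : "0"));
        }
    }
}
```

The classifier compares two scores per address, one for each class, and picks the larger. Because it only multiplies per-feature probabilities, it treats the features as independent, which is the "naive" part of the name. A real run would replace the made-up list in demoModel with the training set and would likely need richer features.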
Requirements
In order to use this program, you will need the following:
- Windows/Mac/Linux desktop or laptop.
- Internet access (mainly to install some dependencies).
- An Integrated Development Environment (IDE) that can process Java files.
- For testing: The unfiltered list of email addresses, both valid and invalid.
- For testing: The training set of email addresses, both valid and invalid.
*For the items marked "For testing", you can substitute your own lists.
Instructions
Rationale
Future improvement
One possible future improvement is to build a custom model that uses logistic regression to determine whether a particular email address is valid. Logistic regression is a special kind of regression which models the probability of a binary outcome, here whether an address is valid or invalid.
