Haskell @ Club De Science - MailProcessor



> module MailProcessor (processEmail) where
> 
> import Data.Char
> import Data.Maybe
> import Data.List (isPrefixOf)
> 
> -- import NLP.Stemmer
> triads n = [(x, y, z) 
>     | x<-[1..], y<-[1..n], z<-[1..n], 
>       z^2+y^2==x^2]

processEmail is the main function that processes the email. It maps the full content of the email (a String) to the list of normalized words it contains, in three steps: it deletes punctuation characters (deleteChars), splits the text into words (words), and processes each word (process).

> processEmail :: String -> [String]
> processEmail = process . words . deleteChars 
>   where
>     deleteChars :: String -> String
>     deleteChars = undefined 
>     punc        = ['!', '?', '.', ',']
> 
>     process :: [String] -> [String]
>     process = undefined 

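One possible way to fill in the two stubs above is sketched below. It assumes that deleteChars simply drops every character listed in punc, and that process applies processWord to each word and discards the words it rejects. The names processEmailSketch and processWordSketch are hypothetical; processWordSketch is a stand-in that only lowercases, so that the block runs on its own.

```haskell
import Data.Char (toLower)
import Data.Maybe (mapMaybe)

-- Hypothetical stand-in for processWord: lowercases the word and
-- rejects empty input. The real processWord is defined further down.
processWordSketch :: String -> Maybe String
processWordSketch w
  | null w    = Nothing
  | otherwise = Just (map toLower w)

-- Sketch of processEmail under the assumptions stated above.
processEmailSketch :: String -> [String]
processEmailSketch = process . words . deleteChars
  where
    punc = ['!', '?', '.', ',']

    -- Assumption: punctuation is deleted, not replaced by a space.
    deleteChars :: String -> String
    deleteChars = filter (`notElem` punc)

    -- Keep only the words for which the word processor returns Just.
    process :: [String] -> [String]
    process = mapMaybe processWordSketch
```

Under these assumptions, processEmailSketch "Hello, World! Buy now." yields ["hello","world","buy","now"]. Note that dropping '.' this early also removes the dots inside URLs and email addresses before they are normalized.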
processWord is the function that actually processes each word. It proceeds in four steps:

- converts the word to lower case letters (fill in toLowerWord),
- normalizes the word,
- strips out the HTML code, and
- stems the word according to the NLP algorithm.

Stemming keeps only the root of each word. For example, it performs the following transformations, which are crucial for email classification:

am, are, is ->  be
car, cars, car's, cars' ->  car

> processWord :: String -> Maybe String 
> processWord = stripHTML . normalize . toLowerWord
> toLowerWord :: String -> String
> toLowerWord = undefined 

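As a minimal sketch of the first and last steps, the block below lowercases a word and applies a toy stemmer. The stemmer is an assumption made for illustration only: it is not the NLP.Stemmer algorithm, and it covers just the example transformations listed above, using a small lookup table for the forms of "be" and suffix stripping for plurals and possessives. The names toLowerWordSketch, toyStem, and stripSuffixes are hypothetical.

```haskell
import Data.Char (toLower)
import Data.List (isSuffixOf)
import Data.Maybe (fromMaybe)

-- Lowercase every character of the word.
toLowerWordSketch :: String -> String
toLowerWordSketch = map toLower

-- Toy stemmer (an assumption, NOT the NLP.Stemmer algorithm): look the
-- word up in a table of irregular forms, else strip a known suffix.
toyStem :: String -> String
toyStem w = fromMaybe (stripSuffixes w) (lookup w irregular)
  where
    irregular = [("am", "be"), ("are", "be"), ("is", "be")]

-- Strip the first matching suffix, but only if at least three
-- characters of the word remain afterwards.
stripSuffixes :: String -> String
stripSuffixes w =
  case filter (`isSuffixOf` w) ["'s", "s'", "s"] of
    (suf : _) | length w - length suf >= 3 -> take (length w - length suf) w
    _ -> w
```

With these definitions, map toyStem ["am", "cars", "car's", "cars'"] gives ["be","car","car","car"]. A real stemmer handles far more cases than this table and suffix list.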
Normalizing a word rewrites the URLs, email addresses, dollar signs, and numbers it contains:

> normalize :: String -> String
> normalize = normalizeURL . normalizeEmail . normalizeNumber . normalizeDollar

Next, you should fill in the definitions for the normalization functions.

Function normalizeDollar replaces the character ‘$’ with the word “dollar”:

normalizeDollar "$"   = "dollar"
normalizeDollar "foo" = "foo"

> normalizeDollar :: String -> String 
> normalizeDollar  = undefined

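A minimal sketch of one way to meet the two examples above, assuming that every '$' inside a word is expanded and not only a word that consists of '$' alone. The name normalizeDollarSketch is hypothetical.

```haskell
-- Replace each '$' character with the word "dollar"; all other
-- characters are kept unchanged.
normalizeDollarSketch :: String -> String
normalizeDollarSketch = concatMap expand
  where
    expand '$' = "dollar"
    expand c   = [c]
```

Under this assumption a mixed word such as "42$" becomes "42dollar".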
Function normalizeURL replaces URLs with the word “httpaddr”:

normalizeURL "http://google.com"  = "httpaddr"
normalizeURL "https://google.com" = "httpaddr"
normalizeURL "foo"                = "foo"

> normalizeURL :: String -> String 
> normalizeURL  = undefined

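The examples above can be met with a simple prefix test, sketched below. It assumes that a word counts as a URL exactly when it starts with "http://" or "https://"; other schemes and bare host names are not recognized. The name normalizeURLSketch is hypothetical; isPrefixOf is the Data.List function that the module already imports.

```haskell
import Data.List (isPrefixOf)

-- Replace a word that starts with an HTTP(S) scheme by "httpaddr".
normalizeURLSketch :: String -> String
normalizeURLSketch w
  | any (`isPrefixOf` w) ["http://", "https://"] = "httpaddr"
  | otherwise                                    = w
```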
Function normalizeEmail replaces email addresses with the word “email”:

normalizeEmail "nvazou@cs.ucsd.edu"   = "email"
normalizeEmail "foo"                  = "foo"

> normalizeEmail :: String -> String 
> normalizeEmail  = undefined

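A rough sketch for normalizeEmail, assuming that any word with a non-empty part on both sides of its first '@' is an email address. This is a deliberately loose test, not a validation of address syntax. The name normalizeEmailSketch is hypothetical.

```haskell
-- Replace a word of the shape user@domain by "email".
normalizeEmailSketch :: String -> String
normalizeEmailSketch w =
  case break (== '@') w of
    (user, '@' : domain)
      | not (null user) && not (null domain) -> "email"
    _ -> w
```

A lone "@" or a word such as "a@" is left unchanged, because one side of the '@' is empty.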
Finally, function normalizeNumber replaces numbers with the word “number”:

normalizeNumber "42"  = "number"
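normalizeNumber "42$" = "number$"

> normalizeNumber :: String -> String 
> normalizeNumber  = undefined

The last step, stripHTML, discards words that are HTML tags (or a stray ">") and keeps all other words:

> stripHTML :: String -> Maybe String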
> stripHTML x | isNonWord x = Nothing 
>             | otherwise   = Just x
> 
> isNonWord x = isHTML x || x == ">"
> isHTML x    = not (null x) && head x == '<' && last x == '>'
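
Returning to normalizeNumber: its second example ("42$" becomes "number$") suggests that digits are replaced inside a word and the rest of the word is kept. A minimal sketch under that reading replaces each maximal run of digits with "number". The name normalizeNumberSketch is hypothetical, and treating decimal points or signs as part of a number is left out.

```haskell
import Data.Char (isDigit)

-- Replace every maximal run of digits by the word "number" and keep
-- all other characters unchanged.
normalizeNumberSketch :: String -> String
normalizeNumberSketch [] = []
normalizeNumberSketch s@(c : cs)
  | isDigit c = "number" ++ normalizeNumberSketch (dropWhile isDigit s)
  | otherwise = c : normalizeNumberSketch cs
```

Since normalize applies normalizeDollar before normalizeNumber, the word "42$" would already read "42dollar" by the time it reaches this step in the full pipeline.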