Tuesday, February 16, 2016

Stop Words

Stop words usually refer to the most common words in a language.

Any natural language processing tool may want to get rid of the stop words to extract the desired keywords. There is no standard list of stop words that all the natural language processing tools can use.
 
In SQL server, system supplied stop words are loaded in the Resource database that can be using the below "Full-Text Search and Semantic Search Catalog Views"

  • SYS.FULLTEXT_SYSTEM_STOPWORDS  - Retrieves the list of stop words for specific languages with language_id reference
  • SYS.FULLTEXT_LANGUAGES - Contains Language name for the language_id reference
In reality, in any natural language, the stop words exist like
  1. Beginning of the sentence like Stop word-"Space"
  2. In between sentences like "Space"-Stop word-"Space"
  3. End of the sentence like "Space"-Stop word-"Full Stop"
Considering the most common above given scenarios, list of stop words useful for natural query processing can be created using the below query. However, only a smaller set of words are covered in this.
 
SELECT ' '+SSW.StopWord+' ' AS SW
FROM SYS.FULLTEXT_SYSTEM_STOPWORDS AS SSW
INNER JOIN SYS.FULLTEXT_LANGUAGES SLG ON SLG.lcid = SSW.language_id
WHERE SLG.lcid =1033 -- Where languageid 1033 refers to ENGLISH
UNION ALL
SELECT SSW.StopWord+' '
FROM SYS.FULLTEXT_SYSTEM_STOPWORDS AS SSW
INNER JOIN SYS.FULLTEXT_LANGUAGES SLG ON SLG.lcid = SSW.language_id
WHERE SLG.lcid =1033 -- Where languageid 1033 refers to ENGLISH
UNION ALL
SELECT ' '+SSW.StopWord+'.'
FROM SYS.FULLTEXT_SYSTEM_STOPWORDS AS SSW
INNER JOIN SYS.FULLTEXT_LANGUAGES SLG ON SLG.lcid = SSW.language_id
WHERE SLG.lcid =1033 -- Where languageid 1033 refers to ENGLISH
 

I have collected list of nearly 760 ENGLISH stop words which can be loaded into the database and used. The database script is given at the end of the post.


Once the  Stop words dump is loaded into the database, the natural language field value is taken into a variable and can be processed using the below logic to remove the stop words. This can be looped for all the field values.


DECLARE @StopWords TABLE (StopWord NVARCHAR(128))

DECLARE @Details1 VARCHAR(3000) -- The variable that holds the text value that has to be analyzed

INSERT INTO @StopWords

SELECT StopWord FROM dbo.[StopList]

SELECT @Details1 = (<<<<< query >>>>>)

-- Clean the data

-- Remove new line and tab characters --> Tab - char(9); Line feed - char(10); Carriage return - char(13)

SELECT @Details1 = REPLACE(REPLACE(REPLACE(@Details1, CHAR(13), ''), CHAR(10), ' '), CHAR(9), ' ')

-- Remove the Stopwords

SELECT @Details1 = REPLACE(REPLACE(REPLACE(REPLACE(@Details1,' – ','-'),' ',' '),' ',' '),' ',' ')

SELECT @Details1 = REPLACE(REPLACE(@Details1,' ',','),',,',',')

SELECT @Details1 = REPLACE(REPLACE(REPLACE(REPLACE(

REPLACE(REPLACE(REPLACE(@Details1,'[]'+S.StopWord+' ',',') -- stop word at the beginning

,' '+S.StopWord+' ',',')-- stop word in between

,' '+S.StopWord+',',',')-- stop word in between

,','+S.StopWord+' ',',')-- stop word in between

,','+S.StopWord+',',',')-- stop word in between

,' '+S.StopWord+'.',',')-- stop word at the end

,','+S.StopWord+'.',',')-- stop word at the end

FROM @StopWords AS S WHERE ' '+LTRIM(RTRIM(@Details1))+' ' LIKE '%'+LTRIM(RTRIM(S.StopWord))+'%'

ORDER BY LEN(S.StopWord) DESC

SELECT @Details1 = REPLACE(REPLACE(REPLACE(REPLACE(@Details1,' ',','),'.,',','),':',','),',,',',')


 
Stopwords are managed in databases using objects called stoplists (list of stopwords).
A stoplist, when associated with a full-text index, is applied to full-text queries on that index.
In SQL Server Full text search, custom Stop Lists can also be created and managed. 
 

Stop words dump script


SELECT * INTO dbo.[StopList]

FROM (

VALUES ('0'),('1'),('2'),('3'),('4'),('5'),('6'),('7'),('8'),('9'),('$'),('a'),('able'),('about'),('above'),('abst'),('accordance'),('according'),('accordingly'), ('across')

,('act'),('actually'),('added'),('adj'),('affected'),('affecting'),('affects'),('after'),('afterwards'),('again'),('against'),('ah'),('ain''t'),('all'),('allow'), ('allows')

,('almost'),('alone'),('along'),('already'),('also'),('although'),('always'),('am'),('among'),('amongst'),('an'),('and'),('announce'),('another'),('any'),('anybody')

, ('anyhow'),('anymore'),('anyone'),('anything'),('anyway'),('anyways'),('anywhere'),('apart'),('apparently'),('appear'),('appreciate'),('appropriate'),('approximately')

,('are'), ('aren'),('arent'),('aren''t'),('arise'),('around'),('as'),('a''s'),('aside'),('ask'),('asking'),('associated'),('at'),('auth'),('available'),('away'),('awfully')

,('b'),('back'), ('be'),('became'),('because'),('become'),('becomes'),('becoming'),('been'),('before'),('beforehand'),('begin'),('beginning'),('beginnings'),('begins'),

('behind'),('being'),('believe'), ('below'),('beside'),('besides'),('best'),('better'),('between'),('beyond'),('biol'),('both'),('brief'),('briefly'),('but'),('by'),('c')

,('ca'),('came'),('can'),('cannot'),('cant'), ('can''t'),('cause'),('causes'),('certain'),('certainly'),('changes'),('clearly'),('c''mon'),('co'),('com'),('come'),('comes')

,('concerning'),('consequently'),('consider'),('considering'), ('contain'),('containing'),('contains'),('corresponding'),('could'),('couldnt'),('couldn''t'),('course')

,('c''s'),('currently'),('d'),('date'),('definitely'),('described'),('despite'), ('did'),('didn''t'),('different'),('do'),('does'),('doesn''t'),('doing'),('done'),('don''t')

,('down'),('downwards'),('due'),('during'),('e'),('each'),('ed'),('edu'),('effect'),('eg'), ('eight'),('eighty'),('either'),('else'),('elsewhere'),('end'),('ending')

,('enough'),('entirely'),('especially'),('et'),('et-al'),('etc'),('even'),('ever'),('every'),('everybody'), ('everyone'),('everything'),('everywhere'),('ex'),('exactly')

,('example'),('except'),('f'),('far'),('few'),('ff'),('fifth'),('first'),('five'),('fix'),('followed'),('following'), ('follows'),('for'),('former'),('formerly'),('forth')

,('found'),('four'),('from'),('further'), ('furthermore'), ('g'), ('gave'), ('get'), ('gets'), ('getting'), ('give'), ('given'), ('gives'), ('giving'), ('go'), ('goes')

, ('going'), ('gone'), ('got'), ('gotten'), ('greetings'), ('h'), ('had'), ('hadn''t'), ('happens'), ('hardly'), ('has'), ('hasn''t'), ('have'), ('haven''t'), ('having')

, ('he'), ('hed'), ('hello'), ('help'), ('hence'), ('her'), ('here'), ('hereafter'), ('hereby'), ('herein'), ('heres'), ('here''s'), ('hereupon'), ('hers'), ('herself')

, ('hes'), ('he''s'), ('hi'), ('hid'), ('him'), ('himself'), ('his'), ('hither'), ('home'), ('hopefully'), ('how'), ('howbeit'), ('however'), ('hundred'), ('i'), ('id')

, ('i''d'), ('ie'), ('if'), ('ignored'), ('i''ll'), ('im'), ('i''m'), ('immediate'), ('immediately'), ('importance'), ('important'), ('in'), ('inasmuch'), ('inc')

, ('indeed'), ('index'), ('indicate'), ('indicated'), ('indicates'), ('information'), ('inner'), ('insofar'), ('instead'), ('into'), ('invention'), ('inward'), ('is')

, ('isn''t'), ('it'), ('itd'), ('it''d'), ('it''ll'), ('its'), ('it''s'), ('itself'), ('i''ve'), ('j'), ('just'), ('k'), ('keep'), ('keeps'), ('kept'), ('kg'), ('km')

, ('know'), ('known'), ('knows'), ('l'), ('largely'), ('last'), ('lately'), ('later'), ('latter'), ('latterly'), ('least'), ('less'), ('lest'), ('let'), ('lets'), ('let''s')

, ('like'), ('liked'), ('likely'), ('line'), ('little'), ('''ll'), ('look'), ('looking'), ('looks'), ('ltd'), ('m'), ('made'), ('mainly'), ('make'), ('makes'), ('many')

, ('may'), ('maybe'), ('me'), ('mean'), ('means'), ('meantime'), ('meanwhile'), ('merely'), ('mg'), ('might'), ('million'), ('miss'), ('ml'), ('more'), ('moreover')

, ('most'), ('mostly'), ('mr'), ('mrs'), ('much' ), ('mug'), ('must'), ('my'), ('myself'), ('n'), ('na'), ('name'), ('namely'), ('nay'), ('nd'), ('near'), ('nearly')

, ('necessarily'), ('necessary'), ('need'), ('needs'), ('neither'), ('never'), ('nevertheless'), ('new'), ('next'), ('nine'), ('ninety'), ('no'), ('nobody'), ('non')

, ('none'), ('nonetheless'), ('noone'), ('nor'), ('normally'), ('nos'), ('not'), ('noted'), ('nothing'), ('novel'), ('now'), ('nowhere'), ('o'), ('obtain'), ('obtained')

, ('obviously'), ('of'), ('off'), ('often'), ('oh'), ('ok'), ('okay'), ('old'), ('omitted'), ('on'), ('once'), ('one'), ('ones'), ('only'), ('onto'), ('or'), ('ord')

, ('other'), ('others'), ('otherwise'), ('ought'), ('our'), ('ours'), ('ourselves'), ('out'), ('outside'), ('over'), ('overall'), ('owing'), ('own'), ('p'), ('page')

, ('pages'), ('part'), ('particular'), ('particularly'), ('past'), ('per'), ('perhaps'), ('placed'), ('please'), ('plus'), ('poorly'), ('possible'), ('possibly')

, ('potentially'), ('pp'), ('predominantly'), ('present'), ('presumably'), ('previously'), ('primarily'), ('probably'), ('promptly'), ('proud'), ('provides'), ('put')

, ('q'), ('que'), ('quickly'), ('quite'), ('qv'), ('r'), ('ran'), ('rather'), ('rd'), ('re'), ('readily'), ('really'), ('reasonably'), ('recent'), ('recently'), ('ref')

, ('refs'), ('regarding'), ('regardless'), ('regards'), ('related'), ('relatively'), ('research'), ('respectively'), ('resulted'), ('resulting'), ('results'), ('right')

, ('run'), ('s'), ('said'), ('same'), ('saw'), ('say'), ('saying'), ('says'), ('sec'), ('second'), ('secondly'), ('section'), ('see'), ('seeing'), ('seem'), ('seemed')

, ('seeming'), ('seems'), ('seen'), ('self'), ('selves'), ('sensible'), ('sent'), ('serious'), ('seriously'), ('seven'), ('several'), ('shall'), ('she'), ('shed')

, ('she''ll'), ('shes'), ('should'), ('shouldn''t'), ('show'), ('showed'), ('shown'), ('showns'), ('shows'), ('significant'), ('significantly'), ('similar'), ('similarly')

, ('since'), ('six'), ('slightly'), ('so'), ('some'), ('somebody'), ('somehow'), ('someone'), ('somethan'), ('something'), ('sometime'), ('sometimes'), ('somewhat')

, ('somewhere'), ('soon'), ('sorry'), ('specifically'), ('specified'), ('specify'), ('specifying'), ('still'), ('stop'), ('strongly'), ('sub'), ('substantially')

, ('successfully'), ('such'), ('sufficiently'), ('suggest'), ('sup'), ('sure'), ('t'), ('take'), ('taken'), ('taking'), ('tell'), ('tends'), ('th'), ('than'), ('thank')

, ('thanks'), ('thanx'), ('that'), ('that''ll'), ('thats'), ('that''s'), ('that''ve'), ('the'), ('their'), ('theirs'), ('them'), ('themselves'), ('then'), ('thence')

, ('there'), ('thereafter'), ('thereby'), ('thered'), ('therefore'), ('therein'), ('there''ll'), ('thereof'), ('therere'), ('theres'), ('there''s'), ('thereto')

, ('thereupon'), ('there''ve'), ('these'), ('they'), ('theyd'), ('they''d'), ('they''ll'), ('theyre'), ('they''re'), ('they''ve'), ('think'), ('third'), ('this')

, ('thorough'), ('thoroughly'), ('those'), ('thou'), ('though'), ('thoughh'), ('thousand'), ('three'), ('throug'), ('through'), ('throughout'), ('thru'), ('thus')

, ('til'), ('tip'), ('to'), ('together'), ('too'), ('took'), ('toward'), ('towards'), ('tried'), ('tries'), ('truly'), ('try'), ('trying'), ('ts'), ('t''s'), ('twice')

, ('two'), ('u'), ('un'), ('under'), ('unfortunately'), ('unless'), ('unlike'), ('unlikely'), ('until'), ('unto'), ('up'), ('upon'), ('ups'), ('us'), ('use'), ('used')

, ('useful'), ('usefully'), ('usefulness'), ('uses'), ('using'), ('usually'), ('v'), ('value'), ('various'), ('''ve'), ('very'), ('via'), ('viz'), ('vol'), ('vols'), ('vs')

, ('w'), ('want'), ('wants'), ('was'), ('wasnt'), ('wasn''t'), ('way'), ('we'), ('wed'), ('we''d'), ('welcome'), ('well'), ('we''ll'), ('went'), ('were'), ('we''re')

, ('werent'), ('weren''t'), ('we''ve'), ('what'), ('whatever'), ('what''ll'), ('whats'), ('what''s'), ('when'), ('whence'), ('whenever'), ('where'), ('whereafter')

, ('whereas'), ('whereby'), ('wherein'), ('wheres'), ('where''s'), ('whereupon'), ('wherever'), ('whether'), ('which'), ('while'), ('whim'), ('whither'), ('who'), ('whod')

, ('whoever'), ('whole'), ('who''ll'), ('whom'), ('whomever'), ('whos'), ('who''s'), ('whose'), ('why'), ('widely'), ('will'), ('willing'), ('wish'), ('with'), ('within')

, ('without'), ('wonder'), ('wont'), ('won''t'), ('words'), ('world'), ('would'), ('wouldnt'), ('wouldn''t'), ('www'), ('x'), ('y'), ('yes'), ('yet'), ('you'), ('youd')

, ('you''d'), ('you''ll'), ('your'), ('youre'), ('you''re'), ('yours'), ('yourself'), ('yourselves'), ('you''ve'), ('z'), ('zero')) [StopList] (StopWord)

ORDER BY LEN(StopWord) DESC

No comments:

Post a Comment