Discussion:
Getting the Word[s] Out
Chuck Pelto
2007-01-18 13:47:31 UTC
Is anyone familiar with a way to get all the words out of a string as
succinct elements?

Regards,

Chuck
Terry Ford
2007-01-18 14:12:44 UTC
Post by Chuck Pelto
Is anyone familiar with a way to get all the words out of a string
as succinct elements?
You could use the Split Function.
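
A minimal sketch of that suggestion, assuming the words are separated by single spaces (the variable names and the sample text are made up for illustration):

```realbasic
Dim source As String
Dim words() As String
Dim i As Integer

source = "get all the words out"

// Split returns an array of the pieces found between the delimiters
words = Split(source, " ")

// the array is zero-based; UBound gives the index of the last element
For i = 0 To UBound(words)
  MsgBox words(i)
Next
```

Punctuation stays attached to the words with this approach, so it only suits text where spaces are the sole separators.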

-Terry
Chuck Pelto
2007-01-18 14:24:49 UTC
Thanks!

I was ignorant of its existence.

Several searches of the archive for how to parse words were a
total failure. Now, with this thread, maybe people will find SPLIT
more readily.

Regards,

Chuck
_______________________________________________
<http://www.realsoftware.com/support/listmanager/>
<http://support.realsoftware.com/listarchives/lists.html>
Phil Heycock
2007-01-18 17:28:18 UTC
How about RegEx? (I didn't test the following, but something like it ought
to work).

Dim reg As New RegEx
Dim regResults As RegExMatch
Dim nextWord As String

reg.SearchPattern = "\b(\w+)\b" // word boundary - word - word boundary

regResults = reg.Search(yourText)

Dim matchNum As Integer
For matchNum = 1 To regResults.SubExpressionCount
  nextWord = regResults.SubExpressionString(matchNum)
Next matchNum

P.
Chuck Pelto
2007-01-22 14:50:08 UTC
Post by Phil Heycock
How about RegEx? (I didn't test the following, but something like it ought
to work).
Dim reg As New RegEx
Dim regResults As RegExMatch
reg.SearchPattern = "\b(\w+)\b" // word boundary - word - word boundary
regResults = reg.Search(yourText)
Dim matchNum As Integer
For matchNum = 1 To regResults.SubExpressionCount
nextWord = regResults.SubExpressionString(matchNum)
Next matchNum
I'm teaching myself RegEx and developed something akin to this
methodology you suggested.

However, what I'm experiencing is 'odd'.

I've got a source text that has the word "test" in it several times.

But for some strange reason the regResults.SubExpressionCount is only
returning a value of 1. Not the number of instances of the word
"test" in the target string.

Here's my code....

// method to test aspects of Regular Expressions

dim rg as New RegEx
dim theMatch as RegExMatch

dim strInput as string
dim strQuots as string
dim srchPatt as string

dim iCount as integer
dim i as integer

strInput = "test this 'chuck pelto' test 'susan pelto' test another"

srchPatt = "test" // look for word "test"

rg.SearchPattern = srchPatt

theMatch = rg.Search(strInput)

iCount = theMatch.SubExpressionCount

for i = 1 to iCount

strQuots = theMatch.SubExpressionString(i)

next

Why is it only returning a value of 1 when there are three instances
of the word "test" in the target string?

Regards,

Chuck
Phil Heycock
2007-01-22 15:55:38 UTC
rg.Options.Greedy = False

Chuck Pelto
2007-01-22 17:50:25 UTC
Post by Phil Heycock
rg.Options.Greedy = False
No joy....

Same result. Only 1 is returned by theMatch.SubExpressionCount.

Regards,

Chuck
Charles Yeomans
2007-01-22 18:09:23 UTC
Post by Chuck Pelto
Post by Phil Heycock
rg.Options.Greedy = False
No joy....
Same result. Only 1 is returned by theMatch.SubExpressionCount.
This is how RegEx.Search works. If you want to find all matches,
you'll need to call Search more than once.

Charles Yeomans
Chuck Pelto
2007-01-22 18:29:43 UTC
Post by Charles Yeomans
Post by Chuck Pelto
Post by Phil Heycock
rg.Options.Greedy = False
No joy....
Same result. Only 1 is returned by theMatch.SubExpressionCount.
This is how RegEx.Search works. If you want to find all matches,
you'll need to call Search more than once.
So.....

....what is the functionality of SubExpressionCount?

What is meant by....

Number of SubExpressions that are available with the search just
performed.

Is it supposed to return the number of matched patterns? Or is it
supposed to return something else? If the latter, what must be done
to find the number of patterns that matched the search pattern?

I've tried using another example I found while rummaging around in
the archives. It seems to support the repeated-search approach you suggest.
But it doesn't indicate 3 uses of the word "test" either.

Here is the new code....

// method to test aspects of Regular Expressions

dim rg as New RegEx
dim theMatch as RegExMatch

dim strInput as string
dim strQuots as string
dim srchPatt as string

dim iCount as integer
dim i as integer

strInput = "test this 'chuck pelto' test 'susan pelto' test another"

srchPatt = "test" // just a test search for the word "test"

rg.SearchPattern = srchPatt

rg.Options.Greedy = False

theMatch = rg.Search(strInput)

while theMatch <> nil and theMatch.subExpressionCount >= 1
  strQuots = theMatch.subExpressionString(1)
  theMatch = rg.search()
wend

It seems that no matter what I do, SubExpressionCount always
returns 1. Is there a problem with this call?

Regards,

Chuck
Charles Yeomans
2007-01-22 18:43:08 UTC
Post by Chuck Pelto
Post by Charles Yeomans
Post by Chuck Pelto
Post by Phil Heycock
rg.Options.Greedy = False
No joy....
Same result. Only 1 is returned by theMatch.SubExpressionCount.
This is how RegEx.Search works. If you want to find all matches,
you'll need to call Search more than once.
So.....
....what is the functionality of SubExpressionCount?
In a regular expression you can define subexpressions using
parentheses. You then use the Subexpression methods of a RegExMatch
object to get the matches of the subexpressions.
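
To make the distinction concrete, here is a small sketch (the pattern and the sample text are invented for illustration). It assumes that index 0 refers to the whole match and that the parenthesized groups are numbered from 1:

```realbasic
Dim rg As New RegEx
Dim m As RegExMatch

// two parenthesized subexpressions: a first word and a second word
rg.SearchPattern = "(\w+) (\w+)"

m = rg.Search("chuck pelto and susan pelto")

If m <> Nil Then
  // this is ONE match, "chuck pelto", with two subexpressions:
  //   m.SubExpressionString(0) is the whole match, "chuck pelto"
  //   m.SubExpressionString(1) is the first group, "chuck"
  //   m.SubExpressionString(2) is the second group, "pelto"
  // SubExpressionCount describes the pieces of this one match,
  // not how many times the pattern occurs in the text
  MsgBox m.SubExpressionString(1) + " / " + m.SubExpressionString(2)
End If
```

The later occurrence, "susan pelto", is not part of this RegExMatch at all; reaching it takes another call to Search.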

Charles Yeomans
Chuck Pelto
2007-01-22 19:03:33 UTC
Post by Charles Yeomans
In a regular expression you can define subexpressions using
parentheses. You then use the Subexpression methods of a
RegExMatch object to get the matches of the subexpressions.
Okay....I think I follow, now. SubExpressionCount is NOT a count of
the number of items that match the search pattern. It's something
else that I don't quite understand just yet.

With that hurdle overcome, THIS code actually works.....

Sub TestRegEx()
  // method to test aspects of Regular Expressions

  dim rg as New RegEx
  dim theMatch as RegExMatch

  dim strInput as string
  dim strQuots as string
  dim srchPatt as string

  dim iCount as integer
  dim i as integer

  iCount = 0

  strInput = "test1 this 'chuck pelto' test2 'susan pelto' test3 another"

  srchPatt = "test"

  rg.SearchPattern = srchPatt

  rg.Options.Greedy = False

  theMatch = rg.Search(strInput)

  while theMatch <> nil

    // count the passes through this loop to see if it equals the number of patterns matched
    iCount = iCount + 1
    // 0 is the whole match; 1 doesn't work here because the pattern has no parentheses
    strQuots = theMatch.subExpressionString(0)
    // leave the parens empty to keep searching the previously searched target text
    theMatch = rg.search()

  wend
End Sub

Thanks for your patience and support.

Regards,

Chuck
Phil Heycock
2007-01-22 19:43:10 UTC
Your search expression must be enclosed in parentheses... at least the part
of it where you expect to find multiple hits. Why don't you just try the
example that I gave you in the first place?

Anyway... in your example, your search expression should be "(test)"... not
"test". That should yield "test" in subexpressions.

P.
Chuck Pelto
2007-01-22 19:52:29 UTC
Post by Phil Heycock
Anyway... in your example, your search expression should be
"(test)"... not
"test". That should yield "test" in subexpressions.
Thanks for the additional information.

Regards,

Chuck

Octave Julien
2007-01-18 14:40:32 UTC
Post by Chuck Pelto
Is anyone familiar with a way to get all the words out of a string as
succinct elements?
Regards,
Chuck
If by 'succinct elements' you mean words separated by spaces, the Split
function will do the work.
But if you're working on a text, the words will often be separated by
periods, commas, and other delimiters. You would need to run the Split
function many times over to separate sentences (delimited by periods)
into sub-sentences (delimited by commas) and then into words (delimited
by spaces). I couldn't find any plug-in to get the words out of a text
in a straightforward and easy way. (Personal note: I'm interested in
lexicography (statistics applied to texts); if anyone shares the same
interest, please let me know if you know any RB plug-ins or any
application (free and running on a Mac) for that kind of work.)
So if that's what you're looking for, check the code below:

' 1) s is the string with the text you want to extract the words from.
' The do...loop replaces double spaces with single spaces (it seems that
' when you paste a text into an EditField, some spaces or carriage
' returns are added). To do that, we need a temporary string, st.

dim st as String
do
  st = s
  s=ReplaceAll(s,"  "," ")
loop until st=s
s=LTrim(s)
s=RTrim(s)

'2) nbChar is the number of chars of the text. It will be useful later.

dim nbChar as Integer
nbChar=s.len()


'3) We extract each word and store it in the array aListeMots, which
' needs to be declared first (the French for aListWords, if you wonder).
' aListeMotsC1 and aListeMotsCd are two other arrays that store the
' position of the first and last char of each word stored in aListeMots.
' Maybe you don't need this information; in that case, some lines of
' code can be deleted. And if you do, you can probably use a single
' multidimensional array instead of three, but I wasn't sure how to do
' that.
' The string separateurs ('delimiters') is a list of chars that should
' be treated as blank spaces, inasmuch as they separate words. It
' includes rc, a variable that holds the carriage return.


dim separateurs as string
dim rc as string
rc=EndOfLine.Macintosh
separateurs=",.! ?'¡¿:;<>()"+rc
dim i,j,c1,cd,n,nlleLigne as integer
dim vChar,vChar2 as string

redim aListeMots (-1)
redim aListeMotsC1 (-1)
redim aListeMotsCd (-1)


i=1
c1=0
cd=0

' The following loop reads each char and checks whether it's a delimiter.
' If it is not, a word starts there: a second loop, nested in the first,
' looks for the delimiter that ends the word, and aListeMots, aListeMotsC1
' and aListeMotsCd are filled with the word and the positions of its
' first and last characters.

do until i>nbChar
  vChar = mid(s,i,1)
  if inStr(separateurs, vChar)=0 then
    c1=i
    j=i+1
    vChar2 = mid(s,j,1)
    do until j>nbChar or inStr(separateurs,vChar2)>0
      j=j+1
      vChar2 = mid(s,j,1)
    loop
    cd=j-1
    aListeMots.append mid(s,c1,(cd-c1+1))
    aListeMotsC1.append c1
    aListeMotsCd.append cd
    i=j
  else
    i=i+1
  end if
loop

'The work is done. A famous sentence ("The quick brown fox jumps over the lazy dog") would give three arrays:

aListeMots   aListeMotsC1   aListeMotsCd
The          1              3
quick        5              9
brown        11             15
fox          17             19
jumps        21             25
over         27             30
the          32             34
lazy         36             39
dog          41             43

Hope this helps.
Regards,

Octave
Norman Palardy
2007-01-18 15:23:22 UTC
Post by Octave Julien
Post by Chuck Pelto
Is anyone familiar with a way to get all the words out of a string
as succinct elements?
Regards,
Chuck
If by 'succinct elements' you mean words separated by spaces, the
split function will do the work.
But if you're working on a text, the words will often be separated
by dots, coma, and other delimiters. You would need to run the
split function too many times to separate sentences (delimited with
dots) into sub sentences (delimited by comas) and then into words
(delimited by spaces). I couldn't find any plug-in to get the words
I'm interrested in lexicography (statistics applied to texts), if
anyone shares the same interest, please let me know if you know any
RB plug-ins or any application (free and running on a Mac) for that
kind of work.)
Applescript can break text into words very nicely
Chuck Pelto
2007-01-18 19:32:38 UTC
Post by Norman Palardy
Applescript can break text into words very nicely
True. But....

[1] AppleScript (AS) doesn't run on Windows-based platforms.
[2] I need something to work inside of an RB app, without dropping
into AS.

It's part of an SQL search engine I'm working on. Something to deal
with multiple words being sought in records of a DB. I'm not quite in
the depth of Octave's situation. But I WOULD like to see something
done that would allow us the ease that AS affords for this sort of
thing.

Perhaps a form of SPLIT that would not take merely ONE delimiter, but
would take an array of delimiters that could be set by the programmer
or end-user.
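
A rough sketch of what such a Split could look like when written by hand. SplitOnAny is a hypothetical helper, not a built-in RB function, and it assumes every delimiter is a single character passed in one string:

```realbasic
Function SplitOnAny(source As String, delimiters As String) As String()
  // Hypothetical helper: treats every character found in
  // delimiters as a word separator.
  Dim words() As String
  Dim current As String
  Dim ch As String
  Dim i As Integer

  For i = 1 To Len(source)
    ch = Mid(source, i, 1)
    If InStr(delimiters, ch) > 0 Then
      // delimiter: close the word collected so far, if any
      If current <> "" Then
        words.Append current
        current = ""
      End If
    Else
      // ordinary character: add it to the current word
      current = current + ch
    End If
  Next

  // keep the last word when the text doesn't end with a delimiter
  If current <> "" Then
    words.Append current
  End If

  Return words
End Function
```

Called as SplitOnAny("test this, 'chuck pelto'", " ,'"), it should return the four words test, this, chuck and pelto. Runs of delimiters are skipped, so no empty elements are produced.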

Regards,

Chuck
Michael Rebar
2007-01-19 00:30:56 UTC
Post by Chuck Pelto
Post by Norman Palardy
Applescript can break text into words very nicely
True. But....
[1] AppleScript (AS) doesn't run on Windows-based platforms.
[2] I need something to work inside of an RB app, without dropping
into AS.
It's part of an SQL search engine I'm working on. Something to deal
with multiple words being sought in records of a DB. I'm not quite in
the depth of Octave's situation. But I WOULD like to see something
done that would allow us the ease that AS affords for this sort of
thing.
Perhaps a form of SPLIT that would not take merely ONE delimiter, but
would take an array of delimiters that could be set by the programmer
or end-user.
Regards,
Chuck
You could also use regular expressions. They can locate word boundaries with
reasonable accuracy.

Google 'regex word list word boundary' or 'regex word list'

This can also get quite complex. For example:

* periods aren't accurate word boundaries
* commas aren't accurate word boundaries
* hyphens might be accurate word boundaries

You might have to maintain a dictionary/corpus against which to compare for
validity.
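
As a starting point, a word-boundary search might be sketched like this. The pattern and the sample text are only illustrative, and the loop assumes that calling Search with no argument continues after the previous match in the same text:

```realbasic
Dim rg As New RegEx
Dim m As RegExMatch
Dim words() As String

// \w+ is a run of letters, digits or underscores;
// \b anchors it at a word boundary on each side
rg.SearchPattern = "\b\w+\b"

m = rg.Search("Hello, world. It's a well-known test!")

While m <> Nil
  // index 0 is the text of the whole match
  words.Append m.SubExpressionString(0)
  // no argument: keep searching the same text after the last match
  m = rg.Search
Wend
```

With this pattern "It's" would come out as "It" and "s", and "well-known" as "well" and "known", which is the kind of inaccuracy described above.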

Michael
Chuck Pelto
2007-01-19 22:13:20 UTC
Post by Michael Rebar
Post by Chuck Pelto
Post by Norman Palardy
Applescript can break text into words very nicely
True. But....
[1] AppleScript (AS) doesn't run on Windows-based platforms.
[2] I need something to work inside of an RB app, without dropping
into AS.
It's part of an SQL search engine I'm working on. Something to deal
with multiple words being sought in records of a DB. I'm not quite in
the depth of Octave's situation. But I WOULD like to see something
done that would allow us the ease that AS affords for this sort of
thing.
Perhaps a form of SPLIT that would not take merely ONE delimiter, but
would take an array of delimiters that could be set by the programmer
or end-user.
Regards,
Chuck
You could also use regular expressions. It can locate word
boundaries with
reasonable accuracy.
Google 'regex word list word boundary' or 'regex word list'
* periods aren't accurate word boundaries
* commas aren't accurate word boundaries
* hyphens might be accurate word boundaries
You might have to maintain a dictionary/corpus against which to compare for
validity.
Michael
Interesting report. I'll have to look into this more deeply, as I'm
not familiar with RegEx. But seeing that it is built into RB and
looks rather powerful, it might serve my purposes....if I can warp my
mind around it [transposition fully intended ;.-) ]
Chuck Pelto
2007-01-19 22:18:17 UTC
Post by Chuck Pelto
Post by Michael Rebar
You could also use regular expressions. It can locate word
boundaries with
reasonable accuracy.
Google 'regex word list word boundary' or 'regex word list'
* periods aren't accurate word boundaries
* commas aren't accurate word boundaries
* hyphens might be accurate word boundaries
You might have to maintain a dictionary/corpus against which to compare for
validity.
Michael
Interesting report. I'll have to look into this more deeply, as I'm
not familiar with RegEx. But seeing that it is built into RB and
looks rather powerful, it might serve my purposes....if I can warp
my mind around it [transposition fully intended ;.-) ]
Which brings up a question....

WHERE CAN I GET A GOOD REFERENCE, w/examples, TO TEACH MYSELF ABOUT
REGEX?
Charles Yeomans
2007-01-19 22:35:36 UTC
Post by Chuck Pelto
Post by Chuck Pelto
Post by Michael Rebar
You could also use regular expressions. It can locate word
boundaries with
reasonable accuracy.
Google 'regex word list word boundary' or 'regex word list'
* periods aren't accurate word boundaries
* commas aren't accurate word boundaries
* hyphens might be accurate word boundaries
You might have to maintain a dictionary/corpus against which to compare for
validity.
Michael
Interesting report. I'll have to look into this more deeply, as
I'm not familiar with RegEx. But seeing that it is built into RB
and looks rather powerful, it might serve my purposes....if I can
warp my mind around it [transposition fully intended ;.-) ]
Which brings up a question....
WHERE CAN I GET A GOOD REFERENCE, w/examples, TO TEACH MYSELF ABOUT
REGEX?
Get a copy of "Mastering Regular Expressions" by Jeffrey Friedl.

Charles Yeomans
j***@strout.net
2007-01-20 14:33:10 UTC
Post by Chuck Pelto
WHERE CAN I GET A GOOD REFERENCE, w/examples, TO TEACH MYSELF ABOUT
REGEX?
I use the online help of TextWrangler, which is quite good. It's also
handy to try out your regular expressions there, since I often find it
takes several tries to get them right.

Best,
- Joe
--
Joe Strout -- ***@strout.net
Verified Express, LLC "Making the Internet a Better Place"
http://www.verex.com/
Chuck Pelto
2007-01-22 13:29:16 UTC
Post by j***@strout.net
Post by Chuck Pelto
WHERE CAN I GET A GOOD REFERENCE, w/examples, TO TEACH MYSELF ABOUT
REGEX?
I use the online help of TextWrangler, which is quite good. It's also
handy to try out your regular expressions there, since I often find it
takes several tries to get them right.
Got TextWrangler. Did the tutorial. Much impressed with this tool.
Thanks for pointing its existence out to me.

Have, as usual, a question......

What is the syntax for using RegEx in RB in order to SEARCH/EXTRACT
text from a string?

I see, in the LangRef, that RegEx is oriented towards SEARCH/REPLACE
but I see no mention of how to apply a RegEx expression in RB in the
LangRef, nor up on the Archive, either.

So, what does a call look like? E.g.....

dim sourceText as string
dim capturedText as string
dim srchCode as string

srchCode = "!@#$((!@!!!!" // just something to stuff in there, as I don't know enough about RegEx to make a good expression


capturedText = RegEx(sourceText, srchCode) // Would this work?????????????

Regards,
Chuck Pelto
2007-01-22 13:42:44 UTC
Disregard previous message....

....I found the example in the LangRef.
CV
2007-01-19 22:52:12 UTC
Post by Chuck Pelto
Post by Norman Palardy
Applescript can break text into words very nicely
True. But....
<snip>
[2] I need something to work inside of an RB app, without dropping
into AS.
Just as an aside, a compiled AppleScript file can be dropped into an
Rb Project and run directly by name.

Best,

Jack