Overview
Examples
Screenshots
Comparisons
Applications
Download
Documentation
Tutorials
Bazaar
Status & Roadmap
FAQ
Authors & License
Forums
Funding Ultimate++
Search on this site
Search in forums












SourceForge.net Logo
Home » U++ Library support » U++ Core » String should implement the Boyer Moore algo
String should implement the Boyer Moore algo [message #42198] Thu, 27 February 2014 16:05 Go to next message
victorb is currently offline  victorb
Messages: 78
Registered: December 2005
Location: Nice, France
Member
Hi all,

I used to use and contribute to U++ but it has been a long time since I haven't even started theIDE.

It's even more impressive that it was back then... Mirek, you're a genius!

I'm more developping web apps today and I have been looking at the
Boyer-Moore algo to speed up string search. It seems that U++ uses
a brute force algo, it should be able to do much better.

Please find some reference implementation below:
https://github.com/facebook/folly/blob/master/folly/FBString .h#L1796

http://doxygen.postgresql.org/varlena_8c_source.html

Keep up the good waork guys !

Re: String should implement the Boyer Moore algo [message #42220 is a reply to message #42198] Fri, 28 February 2014 11:50 Go to previous messageGo to next message
mirek is currently offline  mirek
Messages: 13975
Registered: November 2005
Ultimate Member
Thing is, Find is meant for searching in relatively small Strings. IMO, in that case, the cost of precreating search tables in B-M algo outweights any possible gains. It also seams to me that for the same reason, such utility, which would definitely be useful in some cases, should be rather implemented as "search class" - because you can then "compile" the pattern once, then use for searching several times.

So for String::Find, let us consider the "brute force" a feature.

I would be nice to have BM as separate thing (class probably).

But another thing is that perhaps full PCRE has similar characteristics to BM but much more flexible use... (but I might be wrong about this one...)

Mirek
Re: String should implement the Boyer Moore algo [message #42221 is a reply to message #42220] Fri, 28 February 2014 12:36 Go to previous messageGo to next message
victorb is currently offline  victorb
Messages: 78
Registered: December 2005
Location: Nice, France
Member
Mirek,

The one implementation I've linked from FB is without pre-computations. It should be faster in any case - may be there should be a switch when looking one char with a tighter loop.

Folly (the lib from Facebook) is designed to be very fast, they claim a 30x speed increase for "casual cases" FWIW.

I'm really unsure if PCRE can perform as good as BM.

Victor

[Updated on: Fri, 28 February 2014 12:37]

Report message to a moderator

Re: String should implement the Boyer Moore algo [message #42223 is a reply to message #42221] Fri, 28 February 2014 13:54 Go to previous messageGo to next message
mirek is currently offline  mirek
Messages: 13975
Registered: November 2005
Ultimate Member
Well, but that code is definitely not Boyer-Moore, but " Boyer-Moore-like trick " (it is in the comments Wink

Albeit it looks like quite a clever trick, fixing the most profound cases where brute force fails.

Mirek
Re: String should implement the Boyer Moore algo [message #42246 is a reply to message #42223] Sat, 01 March 2014 17:05 Go to previous messageGo to next message
mirek is currently offline  mirek
Messages: 13975
Registered: November 2005
Ultimate Member
After further analysing folly code, I am quite confused by where that 30x times speedup claim comes from.

If I can see the code right, the about only situation where any speedup compared to brute force happens is when the last character of string is matched, which is perhaps important for some corner cases, but for general case it is virtually useless...

(In contrast, _proper_ Boyer-Moore skips almost always, but well, that requires prebuilding those skip tables...)

But perhaps I am missing something?

Mirek
Re: String should implement the Boyer Moore algo [message #42249 is a reply to message #42246] Sat, 01 March 2014 20:19 Go to previous messageGo to next message
mirek is currently offline  mirek
Messages: 13975
Registered: November 2005
Ultimate Member
Well, I could not sleep about this, so I have decided to put it to test Smile

I have created "proper" BM and BMH algorithm, then started benchamrking. Loaded 120MB XML file into String, appended "Hello world" (not present in the file otherwise) and started benchmarking.

At first, U++ brute force was 10 times slower than others. Analyzing it I have found that it is not particulary well optimized (as brute force), calling memcmp on each input byte. So I have added some microoptimizations and this is what I got then:

------------
Needle: Hello world
U++ Brute force: 127158988
  Time elapsed: 0.037
Folly: 127158988
  Time elapsed: 0.059
Boyer-Moore: 127158988
  Time elapsed: 0.082
Boyer–Moore–Horspool: 127158988
  Time elapsed: 0.049
------------
Needle: Hel
U++ Brute force: 127158988
  Time elapsed: 0.036
Folly: 127158988
  Time elapsed: 0.063
Boyer-Moore: 127158988
  Time elapsed: 0.270
Boyer–Moore–Horspool: 127158988
  Time elapsed: 0.154


First numbers are for the whole "Hello world" search, then i have searched only for "Hel", which accidentally is not in the file too.

I believe that the only real speed advantage the folly algorithm has lies in that simple loop

     while (i[nsize_1] != lastNeedle) {
        if (++i == iEnd) {
          // not found
          return -1;
        }
      }


implementing de-facto brutal force approach. Any "smart" skips are not frequent enough to change anything.

As for BM and BMH, it looks like the management of skips in reality and in modern CPU is too costly as compared to streamlined comparison loops.... and becomes really a burden when needle is short. Also interesting is that simpler, less clever variant BMH (BMH is in fact simplified BM) is faster... I guess complexity in the loop shows.

Well, at any case, the real result of this endeavour is optimized, albeit still brute-force, String::Find - IMO not bad at all Smile

Mirek

[Updated on: Sat, 01 March 2014 20:21]

Report message to a moderator

Re: String should implement the Boyer Moore algo [message #42254 is a reply to message #42249] Sun, 02 March 2014 15:36 Go to previous messageGo to next message
zsolt is currently offline  zsolt
Messages: 693
Registered: December 2005
Location: Budapest, Hungary
Contributor
Thanks Smile
Re: String should implement the Boyer Moore algo [message #42256 is a reply to message #42254] Sun, 02 March 2014 17:32 Go to previous messageGo to next message
victorb is currently offline  victorb
Messages: 78
Registered: December 2005
Location: Nice, France
Member
Mirek,

Great work !
Re: String should implement the Boyer Moore algo [message #42261 is a reply to message #42198] Sun, 02 March 2014 19:17 Go to previous messageGo to next message
nlneilson is currently offline  nlneilson
Messages: 644
Registered: January 2010
Location: U.S. California. Mojave &...
Contributor
Great!

It seemed to be fast enough before but should be even faster now.

Adding the option to not close the search window as often was great also.
Re: String should implement the Boyer Moore algo [message #42262 is a reply to message #42261] Sun, 02 March 2014 19:22 Go to previous message
mirek is currently offline  mirek
Messages: 13975
Registered: November 2005
Ultimate Member
Frankly, I am still wondering why "smart" Boyer-Moore is slower. It is possible that I have faulty implementation (I just adopted the code from wikipedia).

Anyone wishing to play with this is encouraged to try benchmarks/StringFind.

Mirek
Previous Topic: Bug: AString<B>::ReverseFind broken for wchar
Next Topic: AMap's FindAdd, FindPut API changed
Goto Forum:
  


Current Time: Fri Mar 29 09:19:05 CET 2024

Total time taken to generate the page: 0.01161 seconds