Full Version: NEW and bugfixed MT Community Harvester
Mike DeTuri
I spent some time this morning adding a new feature to the harvester software I had uploaded a few weeks ago.

I added a new field that lets you specify a maximum frequency value. Words that appear more times than the maximum frequency will not be included in your output list. This should help cut down on very common words such as "patient", "with", and "date".
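
For anyone curious how a maximum-frequency cutoff like this behaves, here is a minimal sketch in Python. It is not the harvester's actual code: the function name `harvest_words`, the word-splitting rule, and the sample text are all assumptions made for illustration.

```python
import re
from collections import Counter

def harvest_words(text, max_frequency):
    """Return the unique words in text that occur at most max_frequency times.

    Hypothetical helper: words are runs of letters and apostrophes, lowercased.
    The real harvester may tokenize differently.
    """
    words = re.findall(r"[A-Za-z']+", text.lower())
    counts = Counter(words)
    # Keep a word only if it does not exceed the cutoff.
    return sorted(word for word, n in counts.items() if n <= max_frequency)

sample = "Patient seen with cough. Patient treated with azithromycin."
print(harvest_words(sample, max_frequency=1))
# → ['azithromycin', 'cough', 'seen', 'treated']
```

With the cutoff at 1, "patient" and "with" (two occurrences each) drop out, while the words that appear once are kept. Raising the cutoff to 2 would let them back in.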

I also fixed a bug that was causing the program to only harvest words from one document, even if you used a wildcard (*) in the filename.
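
As an illustration of what harvesting across a wildcard means, the sketch below reads every file matching a pattern rather than stopping at the first one. This is an assumed Python equivalent, not the program's own code; `read_matching_files` is a hypothetical name, and it only handles plain-text files.

```python
import glob

def read_matching_files(pattern):
    """Return the contents of every file matching a wildcard pattern.

    Hypothetical helper for plain-text files only. Paths are sorted so
    the order of the results is predictable.
    """
    texts = []
    for path in sorted(glob.glob(pattern)):
        # errors="replace" keeps going if a file has odd characters in it.
        with open(path, encoding="utf-8", errors="replace") as handle:
            texts.append(handle.read())
    return texts

# Example call (the folder name is made up):
# all_text = "\n".join(read_matching_files(r"C:\reports\*.txt"))
```

Joining the results into one string lets a single word count cover the whole batch of documents.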

Lastly, I removed the 50,000-word limit on the word lists. It is now limited only by the amount of memory that you have in your system.

That's it. Here's the new version...
14tonks
Thanks, Mike! I really have to download this and check it out. What would you suggest setting maximum frequency at for best results, or is that going to be a trial and error kind of thing, depending on how many documents one is harvesting? I'm willing to pare some of the common words out of a list, but would want to be sure I got all the medical terms in a batch of documents for the same account.
Mike DeTuri
QUOTE (14tonks @ Sep 5 2005, 01:33 PM)
What would you suggest setting maximum frequency at for best results, or is that going to be a trial and error kind of thing, depending on how many documents one is harvesting? 


It really depends. My guess is that you won't need any word that is repeated more than once per document on average, but that will depend on what you scan and how much repetition there is in your documents. I set the initial value for max frequency at 50,000. That should be high enough that it won't filter out even the most frequent words, unless you scan in 10,000 files or something.
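
The once-per-document rule of thumb above can be turned into a starting value with a little arithmetic. The sketch below is only that: `suggested_max_frequency` is a hypothetical name, and the result is a first guess to adjust by trial and error, not a setting the program computes for you.

```python
def suggested_max_frequency(num_documents, avg_repeats_per_document=1.0):
    """Starting guess for the cutoff: allowed repeats per document times documents.

    Hypothetical helper based on the rule of thumb in this thread.
    Never returns less than 1, so at least single-use words survive.
    """
    return max(1, round(num_documents * avg_repeats_per_document))

print(suggested_max_frequency(200))  # → 200
```

For a batch of 200 documents, a cutoff of 200 would drop any word that averages more than one appearance per document. If medical terms start going missing from the list, raise the second argument and run it again.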
Harrie
Wow Mike, I cannot wait to check this out later today! Why, we've got a lot of "methods" here already for this project. I look forward to using this and helping with the project. I also look forward to reading all the new posts here. I've been sorely put behind but not for long!
shipaddict
I downloaded this program and tried to run it but it didn't produce anything. Does anyone have any secrets for it?
Harrie
Hey, shipaddict. It seems to be working from my end. What did you specify for the output file? For instance, if I want the results in a new text file on my Desktop, I could put C:\Documents and Settings\Harrie\Desktop\newtextfile.txt (which I did), and it did create the file with the harvested words from the input file I fed it. Does this help?
shipaddict
It always comes up as a blank file for me.
Harrie
Hmm, I don't know why that would be. I'll make sure Mike sees this post, shipaddict - which I'm sure he will anyway but just to be certain.
Mike DeTuri
QUOTE (shipaddict @ Jul 2 2007, 06:49 AM)
It always comes up as a blank file for me.


What type of files are you trying to harvest? I don't think it will work on a Word 2007 (docx) or an OpenOffice (odt) file. Those file types will need to be converted to Notepad (txt), Word Perfect (wpd), or Word 2003/XP/2000/97 (doc) format.
shipaddict
It is a Word 2007 document, but I save them in a Word 97-2003 compatible format. I will try to save them as txt and see if I have any better luck.
shipaddict
Nope, still nothing.
Harrie
Well, I'll take a WAG here. I see your profile says you are using Vista, so you could try right-clicking the .exe and choosing "Run as administrator" to see if that does anything. I don't have Vista and haven't studied it, but I've seen that suggestion in enough searches for other things that I'll throw it out there. Remember, I know not of what I speak, but what the heck, it can't hurt to try.
shipaddict
Well, that worked!! Now I'm trying to figure out how I can use this. I thought it was going to spit out phrases the doctor said quite a bit, not just single words.
Harrie
Good.

No, this one is meant to be a single-word harvester. As I'm sure you can tell from reading through this forum, it was written for a project that admittedly lies dormant now, perhaps forever. Even so, it can be very useful for seeing which frequently used words you might want to put in your expander, if they're not already there.

There are other programs you can use for harvesting phrases. Of course, Instant Text will do this for you in its own particular way (continuations). But if you use another expander and want to add frequent phrases to it manually, one such program that 14tonks really recommends is linked to in this thread, and I believe one or two similar programs may be listed in the Software Vault forum. I'm on a short no-work break now, or otherwise I'd get the links for you.