All times are UTC
 
 
Chinese on-page keywords processing with problem in WA.
 
Post subject: Chinese on-page keywords processing with problem in WA.
Posted: Thu Jul 15, 2010 2:17 pm

Tenderfoot


Posts: 5


I have downloaded and tried the WebSite Auditor software. It is a great tool, but when I was using it to analyze some Chinese pages, I found a problem with how it handles on-page Chinese keywords.

The software treats a whole Chinese sentence as one word (I think this is because Chinese words have no spaces between them). So I think the software's on-page calculations, such as keyword count, keyword density and keyword prominence, will all be wrong :? on Chinese pages (or pages in other Asian languages), and the on-page optimization advice for Chinese pages will also be wrong. :(
I think the same problem also exists in SEO SpyGlass, Rank Tracker and LinkAssistant if they analyze the on-page keywords of Chinese (or other Asian-language) pages.

Could you add a preprocessor that splits unspaced sentences into space-separated words (either automatically or using a predefined word dictionary) and then continue with the on-page keyword analysis? :D

P.S. I have also submitted a PowerSuite feature request, ticket JCN-821404.
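
For illustration, here is a minimal sketch of the dictionary-based preprocessor requested above. It uses greedy forward maximum matching, which is only one of several possible segmentation techniques and is not something WebSite Auditor is known to use. The function name `segment`, the `max_len` parameter and the tiny word list are all assumptions made up for this example.

```python
def segment(text, dictionary, max_len=4):
    """Split unspaced text into words by greedy forward maximum matching.

    At each position the longest dictionary entry (up to max_len characters)
    is taken; a character with no dictionary match becomes a one-character word.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink towards one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical mini dictionary, for illustration only.
dictionary = {"搜索", "引擎", "搜索引擎", "优化", "软件"}
print(" ".join(segment("搜索引擎优化软件", dictionary)))
```

The space-joined result can then be fed to an analysis that splits on spaces. Greedy matching can pick the wrong split when words overlap, so a real segmenter would need a larger dictionary or a statistical model.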


Post subject:
Posted: Fri Jul 23, 2010 11:58 am

Small God


Posts: 205

Location: Hong Kong


Hi Abus,

Yes, to a degree you are right. Some of the data cannot be calculated correctly (or, more accurately, the text is not broken down to show all the words which appear on a page). For example, under Page Elements -> Keyword, a word might not be listed if it never appears on the page as a separate phrase delimited by spaces.

However, for the keywords being targeted for reports (i.e. those you entered as, say, the 3 target keywords for a page), the software searches for a "word match" and those calculations will be correct.

I am not sure if there is a workaround for this for link-assistant (really, the software would have to be given a dictionary of words to match against in order to identify all the words on a page that have no space or comma between them).
As you probably know, this is actually a general problem with searching for words in Asian languages (or any language where spaces do not appear between words).

Perhaps link-assistant has some ideas on how to improve this for Asian languages???
LINKYYYYYYYY???

Cheers - Asiaplay


Post subject: Keyword Density - Japanese, Korean, Thai, Hindi, Chinese, Ar
Posted: Wed Aug 18, 2010 9:56 am

Small God


Posts: 205

Location: Hong Kong


Asiaplay wrote:
...
However, for the keywords being targeted for reports (i.e. those you entered for say, the targeted 3 kw for a page), the software searches for a "word match" and those calculations will be correct.


Linky, I should clarify this - it doesn't actually work as I suggested at the moment.
Basically the data output for the "webpage" tab is near useless for many languages.

Namely, there are problems for some languages, part of which I believe could be overcome easily enough with a change to web-auditor,
i.e. Japanese, Korean, Chinese, Thai, Hindi, Arabic, Hebrew and any other language where a "space" might not ALWAYS be used between words.

For example, suppose this sentence is the text being analysed (e.g. a page title):
"mylongsentancewithnospaces withaword whichshouldmatch"

AND these are the 3 target keywords:
"word"
"wordmatch"
"mylongsentancewithnospaceswithawordwhichshouldmatch"
- where "word" is the same character string (without quotes) in each of these cases

Unfortunately, the % density is calculated incorrectly in web-auditor (as the split into words / word phrases is always based on SPACES).
i.e. "word" is never found (unless it appears as " word ")
I would have thought that the software could be told to work out that "word" also appears when it occurs as a keyword within a long chain without any spaces
e.g. "mylongsentancewithnospaceswithawordwhichshouldmatch" and "word" should both show a % density for the target keyword terms (even though no space exists in the text being analysed).
Likewise "wordmatch" should show as ZERO % (as it does not exist as a phrase)

Unfortunately, at the moment "word" in the above example is totally ignored and gets ZERO % appearance...
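
The example above can be sketched in a few lines of Python. This is only an illustration of the behaviour being asked for, not web-auditor's actual algorithm: the function names are hypothetical, and "density" is assumed here to mean the share of the text's characters (spaces ignored) covered by the keyword, since the real formula is not stated in this thread.

```python
def space_split_hits(text, keyword):
    """Count the keyword only where it stands alone between spaces."""
    return text.split().count(keyword)


def substring_density(text, keyword):
    """Percentage of the text's characters (spaces ignored) covered by the keyword.

    Assumed definition for illustration; overlapping occurrences are not counted.
    """
    compact = "".join(text.split())
    needle = "".join(keyword.split())
    if not compact or not needle:
        return 0.0
    hits = compact.count(needle)
    return 100.0 * hits * len(needle) / len(compact)


title = "mylongsentancewithnospaces withaword whichshouldmatch"
keywords = [
    "word",
    "wordmatch",
    "mylongsentancewithnospaceswithawordwhichshouldmatch",
]
for kw in keywords:
    print(kw, space_split_hits(title, kw), round(substring_density(title, kw), 1))
```

With space-based splitting none of the three keywords is found, whereas substring matching on the space-stripped text finds "word" and the full chain but still reports zero for "wordmatch".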

Perhaps the following would help solve this problem:
A) Adding a tick box in settings - "Language does not use spaces"
For Asian / Middle Eastern etc. languages, all spaces are removed from the text being analysed and phrase matching based on word direction order is used, i.e. for density calculations a match of "word" should include "matchingwords" or "mywordphrase", based on a left-to-right word match and not on spaces only.

B) Adding a tick box in settings - "Language is read from Right to Left"
The "word" matching is based on right-to-left sentence order (this would be required for matching Arabic and Hebrew, for example).

C) Adding a BOX that allows someone to feed in a list of single "words" that web-auditor uses for matching
(i.e. essentially a mini dictionary of 800, 1000 or 2000 single words that can be set for a SINGLE WEBPAGE; web-auditor then uses this word list to match word "combinations" when working out density / webpage analysis).

The reason I suggest having these extra settings is that for languages which use spaces between words (e.g. Latin-based languages) this change / switch would not be wanted, but without it the density part of web-auditor is pretty useless for languages that do not ALWAYS use spaces between words.

Please let me know your thoughts, Linky???

Thanks and cheers, Asiaplay


Post subject:
Posted: Wed Aug 18, 2010 1:41 pm

Site Admin


Posts: 2720


Thanks for the suggestions Asiaplay. Yes, it does seem that it is time we introduced those changes. However we need to be 100% sure we do it completely right - and we could really use some help here, since we do not know any of those languages.

You are saying that words are "Not always" separated by spaces - are there cases when spaces are used?

In any case we are now starting to work on this seriously and hopefully we will add a proper parsing algorithm with the help of our users.

Post subject:
Posted: Wed Aug 18, 2010 1:48 pm

Tenderfoot


Posts: 5


Or add another setting for a service URL that accepts an unspaced sentence and returns a list of space-separated words.
WebSite Auditor could then call this URL to split unspaced sentences into words.
The service behind this URL could be built by others,
so the feature could be extended by third parties.
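
As a rough sketch of what such a service could look like, the snippet below uses only Python's standard library. Everything in it is an assumption made for illustration: the JSON shape (`{"text": ...}` in, `{"words": [...]}` out), the names `segment_text`, `handle_payload` and `serve`, and the placeholder segmenter that simply treats every character as a word. No such setting or protocol exists in WebSite Auditor as far as this thread shows.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def segment_text(text):
    # Placeholder segmenter (assumption): every non-space character is a word.
    # A real service would plug a proper word segmentation parser in here.
    return [ch for ch in text if not ch.isspace()]


def handle_payload(raw_body):
    """Turn a JSON request body into a JSON response body (both as bytes)."""
    request = json.loads(raw_body.decode("utf-8"))
    words = segment_text(request["text"])
    return json.dumps({"words": words}, ensure_ascii=False).encode("utf-8")


class SegmentHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the request body, segment it, and send the word list back.
        length = int(self.headers.get("Content-Length", 0))
        body = handle_payload(self.rfile.read(length))
        self.send_response(200)
        self.send_header("Content-Type", "application/json; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Run the service locally until interrupted."""
    HTTPServer(("127.0.0.1", port), SegmentHandler).serve_forever()
```

Calling `serve()` starts the service; a client would POST the unspaced sentence and join the returned words with spaces before running the usual space-based analysis.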


Post subject:
Posted: Thu Aug 19, 2010 6:30 pm

Small God


Posts: 205

Location: Hong Kong


LinkAssistant wrote:
In any case we are now starting to work on this seriously and hopefully we will add a proper parsing algorithm with the help of our users.


Yes, as Abus (I think) was saying and as you mentioned, using a word segmentation parser is the most sensible approach.
I did a bit of reading and will send you some links / discussion on word segmentation options.
(i.e. thinking again, forget about the ability I mentioned in my post above for people to specify words / use their own self-generated customized dictionary... use a recognized word segmentation parser).

I am happy to test any beta for you for Chinese and Korean (and perhaps can get others to help for Thai, Japanese and maybe Arabic).
However, in reality I think the quality of the data output within web-auditor will be directly correlated with the precision of the word segmentation parser used.

This by definition, I believe, will require you to have at least 3 language parsers (with the parser to use being chosen via a setting in Web-Auditor preferences... under languages perhaps???),
i.e. Japanese (+ Korean), Chinese and Arabic

As I understand it, once you have the focus text for these languages segmented into words (with spaces), I suspect your current data analysis will also work very well for the languages in question.
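
The pipeline described above (pick a parser by language setting, insert spaces, then reuse the space-based analysis) could be sketched roughly as follows. The names `SEGMENTERS`, `make_dictionary_segmenter` and `keyword_density` are hypothetical, the toy regex segmenter stands in for a real word segmentation parser, and density is assumed here to be the share of tokens equal to the keyword.

```python
import re


def make_dictionary_segmenter(words):
    """Build a toy segmenter from a word list (stand-in for a real parser)."""
    # Longest entries first so the alternation prefers longer words; any other
    # non-space character falls through as a one-character word.
    ordered = sorted(words, key=len, reverse=True)
    pattern = re.compile("|".join([re.escape(w) for w in ordered] + [r"\S"]))
    return pattern.findall


# One parser per language, selected by a (hypothetical) preferences setting.
SEGMENTERS = {
    "zh": make_dictionary_segmenter({"关键词", "密度", "分析"}),
    "en": str.split,
}


def keyword_density(text, keyword, language):
    """Segment first, then run a plain space-based density calculation."""
    segment = SEGMENTERS.get(language, str.split)
    spaced = " ".join(segment(text))  # text with spaces inserted between words
    tokens = spaced.split()           # stand-in for the existing space-based analysis
    if not tokens:
        return 0.0
    return 100.0 * tokens.count(keyword) / len(tokens)


print(keyword_density("关键词密度分析", "密度", "zh"))
```

Keeping the segmenter behind a per-language lookup means the existing analysis does not need to change; only the preprocessing step differs between languages.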

I will send you a PM now on this topic - cheers, Asiaplay

