A general search engine indexes the full text of a web page, whereas a vertical search engine indexes only part of the data. The spiders used by vertical search engines usually rely on regular expressions to locate and extract data, which gives poor performance and poor data accuracy. For vertical search, we think the best method is to structurize the web page: transform HTML to XML, then extract data with DOM, XPath or XSLT. Data analysis becomes easy and data accuracy is high. For example, use //title to get the page title, //meta[@name='keywords']/@content to get the page keywords, //a/@href|//frame/@src|//iframe/@src to get all links (anchor links as well as frame/iframe links), and //img/@src to get all image addresses.
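The XPath expressions above can be tried on any well-formed XML. The following is a minimal sketch using the JDK's javax.xml.xpath API. The class name XPathExtractSketch and the SAMPLE_XML document are hypothetical stand-ins, not actual Seastar output, which may differ in structure or use namespaces.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class XPathExtractSketch {
    // Hypothetical well-formed XML standing in for a structurized page.
    static final String SAMPLE_XML =
        "<html><head><title>Demo</title>"
        + "<meta name=\"keywords\" content=\"search,xml\"/></head>"
        + "<body><a href=\"/a.html\">A</a><iframe src=\"/f.html\"/>"
        + "<img src=\"/logo.png\"/></body></html>";

    // Parses an XML string into a DOM document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }

    // Returns the string value of the first node matched by the expression.
    static String first(Document doc, String expr) {
        try {
            return XPathFactory.newInstance().newXPath().evaluate(expr, doc);
        } catch (Exception e) {
            throw new IllegalArgumentException("bad XPath: " + expr, e);
        }
    }

    // Returns the values of all nodes matched by the expression, in document order.
    static List<String> all(Document doc, String expr) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expr, doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getNodeValue());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalArgumentException("bad XPath: " + expr, e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE_XML);
        System.out.println(first(doc, "//title"));
        System.out.println(first(doc, "//meta[@name='keywords']/@content"));
        System.out.println(all(doc, "//a/@href|//frame/@src|//iframe/@src"));
        System.out.println(all(doc, "//img/@src"));
    }
}
```

No regular expression is involved: each piece of data is addressed by its position in the document tree, so the same expressions keep working when the surrounding markup changes.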
The key point of this technology is transforming HTML to XML. A few open source projects can do part of the job, but they have all been inactive for years, their transform quality is poor, and their Chinese support is bad enough to drive you crazy. Now Seastar is here, and the magic happens.
What is Seastar? Seastar is a web page structurize server for vertical search, developed by a well-known professional Firefox browser development engineer who also designed Seaflower and Seaspider. Seastar transforms HTML to XML easily, so that data can be extracted accurately.
1. Based on the Firefox browser, providing the best transform quality.
Seastar uses Firefox to detect the charset, parse the web page, and output the XML result.
2. Native code, multi-threaded.
Seastar is written in C++ and runs in the background. It can serve parse requests from many users simultaneously.
3. Supports various operating systems.
Seastar runs on Linux and Windows.
4. Simple HTTP-like interface, making data easy to parse.
With the PARSE command, the user sends data to Seastar and gets the XML result back. This command executes very fast, and a vertical search spider should use it. You can also use the FETCH command to get the XML result for a specific URL. This command executes more slowly and is intended mainly for testing.
5. Browser access interface, making data easy to analyze.
You can point a browser at port 6373, on which Seastar listens, to analyze a web page. For example, visit http://localhost:6373/fetch?type=xml&url=http://www.sohu.com and the browser shows the XML result fetched by Seastar. Replace type=xml with type=html and the browser displays the web page as reconstructed by Seastar.
6. You can use the Firefox browser as your development environment.
Install the Firebug and XPather extensions. Use Firebug to locate the data nodes and collect related information such as ID values and class values. Then open the XPather window, enter an XPath expression, and verify the selected nodes. Once an expression is verified, you can use it in your spider. This is an advantage of Seastar: the Firefox browser serves as a strong development tool, in which you can both visit web pages and analyze their data.
Difference between Seastar and Seaflower: Seaflower acts as multiple browsers. It gives you all the data of a web page, including data generated by JavaScript and the contents of frames/iframes, and you can execute JavaScript in it. Seastar cannot execute JavaScript, so the XML it returns does not include dynamic content. Seastar only transforms HTML to XML, but it does so very fast, and I would bet that it is the best web structurize tool for vertical search.
On Windows, choose the Seastar items in the Start Menu, or go to Control Panel -- Administrative Tools -- Services and select Seastar, to start or stop the Seastar service.
On Linux, use the service command:
service seastar start
service seastar stop
service seastar status
usage: seastarctl command
where command is:
1) list                    list current settings
2) set [port|rcj] value    set config
3) help                    print this help info
Example 1: list the current settings
seastarctl list
Example 2: set Seastar to listen on port 4444
seastarctl set port 4444
Example 3: set the register code to 4444-5555-6666
seastarctl set rcj 4444-5555-6666
usage: struct [string | <url>]
Example 1: structurize a string from stdin
struct string <<EOF
hello,seastar!
EOF
Example 2: structurize file contents through a redirect
struct string < /root/a.html
Example 3: structurize a web page
struct http://www.zhuatang.com
Default listening port: 6373
Structurize string - the client sends the request below; the string must be encoded in UTF-8. (Each line ends with <LF>, which stands for the newline character. The request header ends with a blank line, and the contents follow it.)
PARSE <length><LF>
<LF>
<contents>
Note:
<length> - length of the string data (the sample Java client sends the number of UTF-8 bytes)
<contents> - the string data
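The framing of a PARSE request can be sketched as follows. This is a minimal illustration, not the reference implementation: the class name ParseRequestSketch and the method buildParseRequest are hypothetical. It assumes <length> is the number of UTF-8 bytes, as the sample Java client computes it, and it ends lines with a bare <LF> as described above (the sample Java client sends CR LF instead).

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class ParseRequestSketch {
    /**
     * Builds the raw bytes of a PARSE request:
     * "PARSE <length>" line, blank line, then the UTF-8 contents.
     */
    static byte[] buildParseRequest(String contents) {
        byte[] body = contents.getBytes(StandardCharsets.UTF_8);
        // Assumption: <length> counts bytes of the encoded contents, not characters.
        byte[] header = ("PARSE " + body.length + "\n\n")
            .getBytes(StandardCharsets.US_ASCII);
        ByteArrayOutputStream out =
            new ByteArrayOutputStream(header.length + body.length);
        out.write(header, 0, header.length);
        out.write(body, 0, body.length);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] request = buildParseRequest("good<span>test");
        // Show the request as text; the contents here are plain ASCII.
        System.out.print(new String(request, StandardCharsets.UTF_8));
    }
}
```

Counting bytes rather than characters matters for Chinese text, where one character usually occupies three bytes in UTF-8. A length based on String.length() would make the server stop reading too early.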
Structurize web page - the client sends the request below. (Each line ends with <LF>, which stands for the newline character. The request ends with a blank line.)
FETCH <url><LF>
<LF>
Note:
<url> - the URL to structurize
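The response format is not spelled out here. Judging only from how the sample Java client reads it, a reply seems to consist of a status line whose second space-separated token is a numeric status, optional header lines up to a blank line, and then the XML result until the connection closes. The sketch below parses a reply under exactly those assumptions. The class name SeastarResponseSketch is hypothetical, and the text passed in main is made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

final class SeastarResponseSketch {
    final int status;   // -1 when the status line cannot be parsed
    final String body;  // everything after the first blank line

    SeastarResponseSketch(int status, String body) {
        this.status = status;
        this.body = body;
    }

    /** Parses a reply, assuming: status line, header lines, blank line, body. */
    static SeastarResponseSketch parse(Reader in) {
        try (BufferedReader br = new BufferedReader(in)) {
            int status = -1;
            String line = br.readLine();
            if (line != null) {
                // Assumption: the second token of the first line is the status.
                String[] parts = line.trim().split("\\s+");
                if (parts.length >= 2) {
                    try {
                        status = Integer.parseInt(parts[1]);
                    } catch (NumberFormatException e) {
                        status = -1;
                    }
                }
            }
            // Skip any remaining header lines up to the blank line.
            while ((line = br.readLine()) != null && !line.isEmpty()) {
                // header contents are ignored in this sketch
            }
            // Read the rest of the stream as the body.
            StringBuilder body = new StringBuilder();
            char[] buf = new char[4096];
            int n;
            while ((n = br.read(buf)) != -1) {
                body.append(buf, 0, n);
            }
            return new SeastarResponseSketch(status, body.toString());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Made-up reply text; a real reply comes from the socket's input stream.
        SeastarResponseSketch r = parse(new StringReader("X 200\n\n<html/>"));
        System.out.println(r.status);
        System.out.println(r.body);
    }
}
```

With a live server, the Reader would be an InputStreamReader over the socket's input stream, decoded as UTF-8.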
package com.zhsoft88.commons;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.Socket;
import java.util.StringTokenizer;

import org.apache.commons.lang.math.NumberUtils;

/**
 * Client for the Seastar web page structurize server.
 * @author zhsoft88
 * @since 2008-03-28
 * @update 2008-08-03
 */
public class Seastar {

    /** Default Seastar port. */
    public static final int PORT = 6373;

    /**
     * Result of a Seastar request.
     * @author zhsoft88
     * @since 2008-3-28
     */
    public static class SeastarResult {
        private int status;
        private String contents;
        private long elapsedTime;

        public SeastarResult(int status, String contents, long elapsedTime) {
            this.status = status;
            this.contents = contents;
            this.elapsedTime = elapsedTime;
        }

        public int getStatus() {
            return status;
        }

        public String getContents() {
            return contents;
        }

        public long getElapsedTime() {
            return elapsedTime;
        }

        @Override
        public String toString() {
            return "[status=" + status + ",contents=" + contents
                    + ",elapsedTime=" + elapsedTime + "]";
        }
    }

    private String host;
    private int port;

    /** Connects to localhost on the default port. */
    public Seastar() {
        this("localhost");
    }

    /**
     * Connects to the given host on the default port.
     * @param host server host
     */
    public Seastar(String host) {
        this(host, PORT);
    }

    /**
     * Connects to the given host and port.
     * @param host server host
     * @param port server port
     */
    public Seastar(String host, int port) {
        this.host = host;
        this.port = port;
    }

    /**
     * Structurizes the web page at the given URL (FETCH command).
     * @param url the URL to structurize
     * @return status, XML result and elapsed time in milliseconds
     * @throws Exception on connection or I/O failure
     */
    public SeastarResult structURL(String url) throws Exception {
        long t1 = System.currentTimeMillis();
        Socket socket = new Socket(host, port);
        try {
            OutputStream out = socket.getOutputStream();
            out.write(("FETCH " + url + "\r\n\r\n").getBytes("utf-8"));
            out.flush();
            return readResult(socket, t1);
        } finally {
            socket.close();
        }
    }

    /**
     * Structurizes the given string (PARSE command).
     * @param str the HTML text to structurize
     * @return status, XML result and elapsed time in milliseconds
     * @throws Exception on connection or I/O failure
     */
    public SeastarResult structString(String str) throws Exception {
        long t1 = System.currentTimeMillis();
        Socket socket = new Socket(host, port);
        try {
            OutputStream out = socket.getOutputStream();
            byte[] ba = str.getBytes("utf-8");
            // The length sent is the number of UTF-8 bytes, not characters.
            out.write(("PARSE " + ba.length + "\r\n\r\n").getBytes("utf-8"));
            out.write(ba);
            out.flush();
            return readResult(socket, t1);
        } finally {
            socket.close();
        }
    }

    /** Reads the status line, skips the headers and collects the XML result. */
    private SeastarResult readResult(Socket socket, long startTime) throws Exception {
        BufferedReader br = new BufferedReader(
                new InputStreamReader(socket.getInputStream(), "utf-8"));
        // Status line: the second space-separated token is the numeric status.
        String line = br.readLine();
        int status = -1;
        if (line != null) {
            StringTokenizer st = new StringTokenizer(line, " ");
            if (st.countTokens() >= 2) {
                st.nextToken();
                status = NumberUtils.toInt(st.nextToken());
            }
        }
        // Skip the remaining header lines up to the blank line.
        while ((line = br.readLine()) != null) {
            if (line.length() == 0) break;
        }
        // Everything after the blank line is the result.
        StringBuilder sb = new StringBuilder(100);
        int c;
        while ((c = br.read()) != -1) {
            sb.append((char) c);
        }
        long t2 = System.currentTimeMillis();
        return new SeastarResult(status, sb.toString(), t2 - startTime);
    }
}
package com.zhsoft88.commons.tests;

import com.zhsoft88.commons.Seastar;
import com.zhsoft88.commons.Seastar.SeastarResult;

/**
 * Test of Seastar
 * @author zhsoft88
 * @since 2008-08-03
 */
public class TestSeastar {

    /**
     * @param args
     */
    public static void main(String[] args) throws Exception {
        Seastar ss = new Seastar();
        {
            SeastarResult result = ss.structString("good<span>抓糖好<div>goodtest");
            System.out.println(result);
        }
        {
            SeastarResult result = ss.structURL("http://localhost:8080/docs/");
            System.out.println(result);
        }
    }
}
package com.zhsoft88.commons.tests;

import java.net.URL;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.commons.io.IOUtils;
import org.dom4j.Attribute;
import org.dom4j.Document;
import org.dom4j.DocumentHelper;

import com.zhsoft88.commons.Seastar;
import com.zhsoft88.commons.Seastar.SeastarResult;

/**
 * Test of Seastar: download a page, structurize it, and list its links.
 * @author zhsoft88
 * @since 2008-08-03
 */
public class TestSeastar2 {

    /**
     * @param args
     */
    public static void main(String[] args) throws Exception {
        String url = "http://www.sohu.com";
        URL base = new URL(url);

        // Download the page with HttpClient and decode it as GBK.
        long t1 = System.currentTimeMillis();
        HttpClient client = new HttpClient();
        GetMethod get = new GetMethod(url);
        client.executeMethod(get);
        String origContent = IOUtils.toString(get.getResponseBodyAsStream(), "gbk");
        long t2 = System.currentTimeMillis();
        System.out.println("httpclient: " + (t2 - t1) + " ms");

        // Transform the HTML to XML with the PARSE command.
        Seastar ss = new Seastar();
        SeastarResult result = ss.structString(origContent);
        System.out.println("seastar: " + result.getElapsedTime() + " ms");

        // Select all anchor, frame and iframe links with XPath.
        Document doc = DocumentHelper.parseText(result.getContents());
        List<Attribute> list = doc.selectNodes("//a/@href|//frame/@src|//iframe/@src");

        // Resolve each link against the page URL, skipping non-page links.
        Set<String> set = new HashSet<String>();
        for (Attribute a : list) {
            String v = a.getValue();
            if (v.startsWith("javascript:") || v.startsWith("mailto:") || v.startsWith("#"))
                continue;
            set.add(new URL(base, v).toExternalForm());
        }

        System.out.println("size=" + set.size());
        for (String s : set) {
            System.out.println(s);
        }
    }
}