
About Seastar

General-purpose search engines index the full text of web pages, whereas vertical search engines index only selected data. The spiders used by vertical search engines usually locate and extract data with regular expressions, which performs poorly and yields inaccurate data. For vertical search, we believe the best approach is to structurize the web page: transform the HTML into XML, then extract data with DOM, XPath or XSLT. This makes data analysis easy and extraction accurate. For example, //title selects the page title, //meta[@name='keywords']/@content selects the page keywords, //a/@href|//frame/@src|//iframe/@src selects all links (anchor links as well as frame and iframe sources), and //img/@src selects all image addresses.
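The XPath expressions above can be tried with nothing but the JDK. The following is a minimal sketch, assuming the structurized output is plain XML without namespaces; the class name XPathExtractDemo and the SAMPLE document are made up for illustration and are not real Seastar output.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathExtractDemo {

    // Hypothetical, hand-written XML standing in for a structurized page.
    public static final String SAMPLE =
        "<html><head><title>Demo page</title>"
        + "<meta name=\"keywords\" content=\"seastar,xml\"/></head>"
        + "<body><a href=\"/a.html\">A</a><iframe src=\"/f.html\"/>"
        + "<img src=\"/i.png\"/></body></html>";

    /** Evaluate an XPath expression and return the text of every selected node. */
    public static List<String> select(String xml, String expr) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
            expr, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
        List<String> values = new ArrayList<String>();
        for (int i = 0; i < nodes.getLength(); i++) {
            // getTextContent() yields element text or, for attributes, the value.
            values.add(nodes.item(i).getTextContent());
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(select(SAMPLE, "//title"));
        System.out.println(select(SAMPLE, "//meta[@name='keywords']/@content"));
        System.out.println(select(SAMPLE, "//a/@href|//frame/@src|//iframe/@src"));
        System.out.println(select(SAMPLE, "//img/@src"));
    }
}
```

If the real XML carries an XHTML namespace, the expressions would need a namespace context; that case is not covered by this sketch.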

The key to this technique is the HTML-to-XML transformation. A few open source projects can do part of the job, but they have been inactive for years, their conversion quality is poor, and their Chinese support is weak, which makes them frustrating to use. Seastar was built to fill this gap.

What is Seastar? Seastar is a web page structurizing server for vertical search, developed by a well-known professional Firefox browser development engineer who also designed Seaflower and Seaspider. Seastar transforms HTML into XML easily, so that data can be extracted accurately.

Features

1. Based on the Firefox browser, for the best conversion quality

Seastar uses Firefox to detect the character set, parse the web page, and output the XML result.

2. Native code, multi-threaded

Seastar is written in C++ and runs in the background. It can serve parse requests from many users simultaneously.

3. Supports multiple operating systems

Seastar runs on Linux and Windows.

4. Simple HTTP-like interface for easy parsing

With the PARSE command, a client sends data to Seastar and gets the XML result back. This command executes very quickly, and vertical search spiders should use it. The FETCH command returns the XML result for a given URL. It executes more slowly and is intended mainly for testing.
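As a rough illustration of the two commands, the sketch below builds the raw request bytes in the same shape the example client on this page sends them: a command line, an empty line, and for PARSE the body. The class and method names are hypothetical, and the Seastar Protocol document remains the authoritative description.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class SeastarRequests {

    /** FETCH request: the command line followed by an empty line. */
    public static byte[] fetchRequest(String url) {
        return ("FETCH " + url + "\r\n\r\n").getBytes(StandardCharsets.UTF_8);
    }

    /** PARSE request: body length on the command line, an empty line, then the body. */
    public static byte[] parseRequest(String html) throws IOException {
        byte[] body = html.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // The length counts UTF-8 bytes, not characters, as in the example client.
        out.write(("PARSE " + body.length + "\r\n\r\n").getBytes(StandardCharsets.US_ASCII));
        out.write(body);
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        System.out.print(new String(fetchRequest("http://www.sohu.com"), StandardCharsets.UTF_8));
        System.out.println(new String(parseRequest("good<span>test"), StandardCharsets.UTF_8));
    }
}
```

Counting bytes rather than characters matters for Chinese text, where one character usually takes three bytes in UTF-8.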

5. Browser access interface for easy data analysis

Seastar listens on port 6373, and you can point a browser at that port to analyze web pages. For example, visit http://localhost:6373/fetch?type=xml&url=http://www.sohu.com and the browser shows the XML result that Seastar fetched. Replace type=xml with type=html and the browser displays the page rendered from the source generated by Seastar.
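The same URL can be requested from code. This is a small sketch, assuming the page URL is appended unescaped exactly as in the browser example, that xml and html are the only types (no others are mentioned here), and that the response is UTF-8; SeastarBrowserUrl and its methods are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.URL;

public class SeastarBrowserUrl {

    public static final int DEFAULT_PORT = 6373;

    /** Build a /fetch URL for the browser interface. */
    public static String fetchUrl(String host, int port, String type, String pageUrl) {
        if (!"xml".equals(type) && !"html".equals(type)) {
            throw new IllegalArgumentException("type must be xml or html: " + type);
        }
        // The page URL is appended as-is, mirroring the example in the text.
        return "http://" + host + ":" + port + "/fetch?type=" + type + "&url=" + pageUrl;
    }

    /** Download the response body; decoding as UTF-8 is an assumption. */
    public static String download(String fetchUrl) throws Exception {
        InputStream in = new URL(fetchUrl).openStream();
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            return buf.toString("UTF-8");
        } finally {
            in.close();
        }
    }

    public static void main(String[] args) throws Exception {
        // Requires a running Seastar server on localhost.
        String url = fetchUrl("localhost", DEFAULT_PORT, "xml", "http://www.sohu.com");
        System.out.println(download(url));
    }
}
```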

6. Use the Firefox browser as your development environment

Install the Firebug and XPather extensions. Use Firebug to locate the data nodes and collect related information such as ID and class values. Then open the XPather window, enter an XPath expression, and check which nodes it selects. Once an expression is verified, you can use it in your spider. This is an advantage of Seastar: Firefox serves as a powerful development tool in which you can both browse the web page and analyze its data.
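Once an ID or class value has been found, the matching XPath can also be checked from code before it goes into a spider. The sketch below is one possible way to do that with the JDK, assuming namespace-free XML; the helper names and the sample markup are hypothetical. The class test matches a whole class token, so "price" does not match "pricey".

```java
import java.io.StringReader;

import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathSelectorCheck {

    // Hypothetical sample markup, not real Seastar output.
    public static final String SAMPLE =
        "<html><body><div id=\"main\">"
        + "<p class=\"price big\">1</p><p class=\"pricey\">2</p><p class=\"price\">3</p>"
        + "</div></body></html>";

    /** XPath selecting the element with the given ID value. */
    public static String byId(String id) {
        return "//*[@id='" + id + "']";
    }

    /** XPath selecting elements whose class attribute contains the given token. */
    public static String byClass(String cls) {
        return "//*[contains(concat(' ', normalize-space(@class), ' '), ' " + cls + " ')]";
    }

    /** Number of nodes the expression selects in the given XML. */
    public static int count(String xml, String expr) throws Exception {
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath().evaluate(
            expr, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
        return nodes.getLength();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(byId("main") + " selects " + count(SAMPLE, byId("main")) + " node(s)");
        System.out.println(byClass("price") + " selects " + count(SAMPLE, byClass("price")) + " node(s)");
    }
}
```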

Difference between Seastar and Seaflower: Seaflower works like multiple browsers. It gives you all the data of a web page, including data generated by JavaScript and the contents of frames and iframes, and it can execute JavaScript. Seastar cannot execute JavaScript, so the XML it returns does not include dynamic content. Seastar only transforms HTML into XML, but it does so very quickly, and we believe it is the best web page structurizing tool for vertical search.

Download

seastar-2.2-installer.exe (For Windows)
seastar-2.2-1.en_US.fc9.i386.rpm (For Fedora Core 9 Linux)
seastar-2.2-1.en_US.el5.i386.rpm (For RedHat EL 5/CentOS Linux)

Articles

Install

Windows: run the Seastar installer.
Linux: rpm -ivh seastar*.rpm

Seastar server management

Windows:

Use the Seastar items in the Start Menu, or go to Control Panel -- Administrative Tools -- Services and select Seastar to start or stop the Seastar service.

Linux:

Configuration management tool - seastarctl

usage: seastarctl command
where command is: 
1)  list
list current settings
2) set [port|rcj] value
set config
3) help
print this help info
Example 1: seastarctl list
List current settings.
Example 2: seastarctl set port 4444
Set Seastar to listen on port 4444.
Example 3: seastarctl set rcj 4444-5555-6666
Set the registration code to 4444-5555-6666.

Command-line structurizing tool - struct

usage: struct [string | <url>]
Example 1: structurize a string from stdin
struct string <<EOF
hello,seastar!
EOF

Example 2: structurize file contents via input redirection
struct string < /root/a.html

Example 3: structurize a web page
struct http://www.zhuatang.com

Register

Seastar is shareware with a 30-day free trial. To keep using it after the trial, please register in time.
To register, contact zhsoft88@gmail.com (Email/MSN). Price: RMB 3000.00.

Seastar Protocol

Example code

Seastar.java Download

package com.zhsoft88.commons;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.net.Socket;
import java.util.StringTokenizer;

import org.apache.commons.lang.math.NumberUtils;

/**
 * seastar web page structurize
 * @author zhsoft88
 * @since 2008-03-28
 * @update 2008-08-03
 */
public class Seastar {
	
	public static final int PORT = 6373;

	/**
	 * seastar result
	 * @author zhsoft88
	 * @since 2008-3-28
	 */
	public static class SeastarResult {
		private int status;
		private String contents;
		private long elapsedTime;
		
		public SeastarResult(int status,String contents,long elapsedTime) {
			this.status = status;
			this.contents = contents;
			this.elapsedTime = elapsedTime;
		}
		public int getStatus() {
			return status;
		}
		public String getContents() {
			return contents;
		}
		public long getElapsedTime() {
			return elapsedTime;
		}
		@Override
		public String toString() {
			return "[status="+status+",contents="+contents+",elapsedTime="+elapsedTime+"]";
		}
	}

	private String host;
	private int port;

	/**
	 * constructor: localhost
	 */
	public Seastar() {
		this("localhost");
	}
	
	/**
	 * constructor: host
	 * @param host
	 */
	public Seastar(String host) {
		this(host,PORT);
	}
	
	/**
	 * constructor for host,port
	 * @param host
	 * @param port
	 */
	public Seastar(String host,int port) {
		this.host = host;
		this.port = port;
	}
	
	/**
	 * Structurize the page at the given URL (FETCH command).
	 * @param url URL of the page to fetch and convert
	 * @return status, XML contents and elapsed time
	 * @throws Exception on connection or protocol errors
	 */
	public SeastarResult structURL(String url) throws Exception {
		long t1 = System.currentTimeMillis();
		Socket socket = new Socket(host,port);
		try {
			// Send the command line followed by an empty line.
			BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(socket.getOutputStream(),"utf-8"));
			bw.write("FETCH "+url+"\r\n");
			bw.write("\r\n");
			bw.flush();
			return readResult(socket,t1);
		} finally {
			// Close the socket even if the request or response fails.
			socket.close();
		}
	}

	/**
	 * Structurize an HTML string (PARSE command).
	 * @param str HTML source to convert
	 * @return status, XML contents and elapsed time
	 * @throws Exception on connection or protocol errors
	 */
	public SeastarResult structString(String str) throws Exception {
		long t1 = System.currentTimeMillis();
		Socket socket = new Socket(host,port);
		try {
			OutputStream out = socket.getOutputStream();
			byte[] ba = str.getBytes("utf-8");
			// The command line carries the body length in bytes, not characters.
			out.write(("PARSE "+ba.length+"\r\n\r\n").getBytes("utf-8"));
			out.write(ba);
			out.flush();
			return readResult(socket,t1);
		} finally {
			socket.close();
		}
	}

	/**
	 * Read the status line, skip the header lines and collect the body.
	 * @param socket connected socket with a request already sent
	 * @param t1 request start time in milliseconds
	 */
	private SeastarResult readResult(Socket socket,long t1) throws Exception {
		BufferedReader br = new BufferedReader(new InputStreamReader(socket.getInputStream(),"utf-8"));
		String line = br.readLine();
		int status = -1;
		if (line!=null) {
			// The second token of the first line is parsed as the status code.
			StringTokenizer st = new StringTokenizer(line," ");
			if (st.countTokens()>=2) {
				st.nextToken();
				status = NumberUtils.toInt(st.nextToken());
			}
		}
		// Skip the remaining header lines up to the empty line.
		while ((line=br.readLine())!=null) {
			if (line.length()==0) break;
		}
		// Everything after the empty line is the result body.
		StringBuilder sb = new StringBuilder(100);
		int c;
		while ((c=br.read())!=-1) {
			sb.append((char)c);
		}
		long t2 = System.currentTimeMillis();
		return new SeastarResult(status,sb.toString(),t2-t1);
	}
	
	
}

TestSeastar.java Download

package com.zhsoft88.commons.tests;

import com.zhsoft88.commons.Seastar;
import com.zhsoft88.commons.Seastar.SeastarResult;

/**
 * Test of Seastar
 * @author zhsoft88
 * @since 2008-08-03
 */
public class TestSeastar {

	/**
	 * @param args
	 */
	public static void main(String[] args) throws Exception {
		Seastar ss = new Seastar();
		{
			SeastarResult result = ss.structString("good<span>抓糖好<div>goodtest");
			System.out.println(result);
		}
		{
			SeastarResult result = ss.structURL("http://localhost:8080/docs/");
			System.out.println(result);
		}
	}

}

TestSeastar2.java Download (extracts all links from the Sohu homepage)

package com.zhsoft88.commons.tests;

import java.net.URL;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.commons.io.IOUtils;
import org.dom4j.Attribute;
import org.dom4j.Document;
import org.dom4j.DocumentHelper;

import com.zhsoft88.commons.Seastar;
import com.zhsoft88.commons.Seastar.SeastarResult;

/**
 * Test of Seastar
 * @author zhsoft88
 * @since 2008-08-03
 */
public class TestSeastar2 {

	/**
	 * @param args
	 */
	public static void main(String[] args) throws Exception {
		String url = "http://www.sohu.com";
		URL base = new URL(url);
		long t1 = System.currentTimeMillis();
		HttpClient client = new HttpClient();
		GetMethod get = new GetMethod(url);
		client.executeMethod(get);
		String origContent = IOUtils.toString(get.getResponseBodyAsStream(),"gbk");
		long t2 = System.currentTimeMillis();
		System.out.println("httpclient: "+(t2-t1)+" ms");
		Seastar ss = new Seastar();
		SeastarResult result = ss.structString(origContent);
		System.out.println("seastar: "+result.getElapsedTime()+" ms");
		Document doc = DocumentHelper.parseText(result.getContents());
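		// selectNodes returns a raw List in dom4j 1.x and List<Node> in 2.x, so cast each node.
		List<?> list = doc.selectNodes("//a/@href|//frame/@src|//iframe/@src");
		Set<String> set = new HashSet<String>();
		for (Object o : list) {
			String v = ((Attribute)o).getValue();
			// Skip script, mail and in-page links; resolve the rest against the page URL.
			if (v.startsWith("javascript:")||v.startsWith("mailto:")||v.startsWith("#")) continue;
			set.add(new URL(base,v).toExternalForm());
		}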
		System.out.println("size="+set.size());
		for (String s : set) {
			System.out.println(s);
		}
	}

}
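The link clean-up step in TestSeastar2 can be isolated and tried without a Seastar server or dom4j. The following sketch is a variation rather than the original method: it resolves links with java.net.URI instead of java.net.URL and silently skips values that are not valid URIs. LinkNormalizer is a hypothetical name.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class LinkNormalizer {

    /** Resolve hrefs against the page URL, dropping script, mail and in-page links. */
    public static Set<String> normalize(String baseUrl, List<String> hrefs) {
        URI base = URI.create(baseUrl);
        // LinkedHashSet removes duplicates and keeps first-seen order.
        Set<String> result = new LinkedHashSet<String>();
        for (String href : hrefs) {
            String v = href.trim();
            if (v.isEmpty() || v.startsWith("#")
                    || v.startsWith("javascript:") || v.startsWith("mailto:")) {
                continue;
            }
            try {
                result.add(base.resolve(v).toString());
            } catch (IllegalArgumentException e) {
                // Not a valid URI (for example, it contains a space): skip it.
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> hrefs = Arrays.asList(
            "/a.html", "b.html", "http://example.com/x", "#top",
            "javascript:void(0)", "mailto:a@b.c", "bad link", "/a.html");
        for (String s : normalize("http://www.sohu.com/news/index.html", hrefs)) {
            System.out.println(s);
        }
    }
}
```

Skipping invalid values keeps one malformed href from aborting the whole run, at the cost of losing links that a browser might still tolerate.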

(C) 2024 ZHUATANG.COM, All rights reserved

update: 2015-06-26