Dubbo fails with "zookeeper not connected" when registering with Zookeeper at startup: an analysis of the cause

created at 10-27-2021

problem

I ran into a very odd problem. I started several Dubbo projects with the same zookeeper configuration, and all but one started normally. That one project threw the following exception while Dubbo was registering with zookeeper during startup:

Caused by: java.lang.IllegalStateException: zookeeper not connected
    at org.apache.dubbo.remoting.zookeeper.curator.CuratorZookeeperClient.<init>(CuratorZookeeperClient.java:80)
    ... 79 common frames omitted

I was stunned for a moment. My first thought was that the zookeeper cluster was down, but I checked and everything was normal. Stranger still, the other systems connected without trouble. Why did only this one fail?

Once I followed the exception message into the source code, I understood why the exception occurred.

Nothing stays hidden once you read the source code.

Let's first look at the CuratorZookeeperClient constructor, where the exception is thrown. Its job is to establish the client connection to zookeeper. This is similar to HTTP communication, where a TCP three-way handshake must complete before any request is sent. Likewise, before the client can create any nodes in zookeeper, it must first establish a connection to the server:

    public CuratorZookeeperClient(URL url) {
        super(url);
        try {
            int timeout = url.getParameter(TIMEOUT_KEY, DEFAULT_CONNECTION_TIMEOUT_MS);
            int sessionExpireMs = url.getParameter(ZK_SESSION_EXPIRE_KEY, DEFAULT_SESSION_TIMEOUT_MS);
            CuratorFrameworkFactory.Builder builder = CuratorFrameworkFactory.builder()
                    .connectString(url.getBackupAddress())
                    .retryPolicy(new RetryNTimes(1, 1000))
                    .connectionTimeoutMs(timeout)
                    .sessionTimeoutMs(sessionExpireMs);
            String authority = url.getAuthority();
            if (authority != null && authority.length() > 0) {
                builder = builder.authorization("digest", authority.getBytes());
            }
            client = builder.build();
            client.getConnectionStateListenable().addListener(new CuratorConnectionStateListener(url));
            client.start();
            boolean connected = client.blockUntilConnected(timeout, TimeUnit.MILLISECONDS);
            if (!connected) {
                throw new IllegalStateException("zookeeper not connected");
            }
        } catch (Exception e) {
            throw new IllegalStateException(e.getMessage(), e);
        }
    }

Within the CuratorZookeeperClient constructor, the zookeeper not connected exception comes from this piece of code:

if (!connected) {
    throw new IllegalStateException("zookeeper not connected");
}

connected represents the connection state. When its value is false, the exception is thrown. So what causes its value to be false?

Next, let us set a breakpoint and analyze this code step by step.

First, the dubbo and zookeeper configuration used for testing is as follows:

dubbo:
  application:
    name: testervice
  registry:
    address: zookeeper://120.77.217.245
#    timeout: 20000
  protocol:
    name: dubbo
    port: 20880

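After the configuration is parsed, start debugging and stop at the breakpoint. The url parameter of the CuratorZookeeperClient constructor mainly contains the following information:

[screenshot: contents of the url parameter in the debugger]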

The first step is to get the timeout parameter from the url:

int timeout = url.getParameter(TIMEOUT_KEY, DEFAULT_CONNECTION_TIMEOUT_MS);

The logic here is simple. If the registry section of the yaml configuration sets timeout, the configured value is returned. If it is not configured, the default is used, which is the constant DEFAULT_CONNECTION_TIMEOUT_MS. Its value is 5 * 1000 milliseconds, which is 5 seconds. This parameter is the core of this article.
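
The lookup described above can be sketched as a plain map lookup with a fallback. This is only a minimal illustration, not Dubbo's actual URL class: TimeoutLookup and its getParameter helper are hypothetical stand-ins, and the key name "timeout" is an assumption about what TIMEOUT_KEY resolves to.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the URL parameter lookup; not Dubbo's real URL class.
class TimeoutLookup {
    // Mirrors the default described in the article: 5 * 1000 ms = 5 seconds.
    static final int DEFAULT_CONNECTION_TIMEOUT_MS = 5 * 1000;
    // Assumed value of TIMEOUT_KEY.
    static final String TIMEOUT_KEY = "timeout";

    // Returns the configured value if present and non-empty, otherwise the default.
    static int getParameter(Map<String, String> params, String key, int defaultValue) {
        String value = params.get(key);
        if (value == null || value.isEmpty()) {
            return defaultValue;
        }
        return Integer.parseInt(value);
    }

    public static void main(String[] args) {
        Map<String, String> params = new HashMap<>();
        // Nothing configured: falls back to the 5000 ms default.
        System.out.println(getParameter(params, TIMEOUT_KEY, DEFAULT_CONNECTION_TIMEOUT_MS));
        // "timeout: 20000" configured: the configured value wins.
        params.put(TIMEOUT_KEY, "20000");
        System.out.println(getParameter(params, TIMEOUT_KEY, DEFAULT_CONNECTION_TIMEOUT_MS));
    }
}
```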

To configure this parameter yourself, add timeout under registry, for example timeout: 20000:

dubbo:
  application:
    name: testervice
  registry:
    address: zookeeper://120.77.217.245
    timeout: 20000

The second step is to get the client session expiration time:

int sessionExpireMs = url.getParameter(ZK_SESSION_EXPIRE_KEY, DEFAULT_SESSION_TIMEOUT_MS);

Similarly, if there is no custom configuration, the default value DEFAULT_SESSION_TIMEOUT_MS = 60 * 1000 milliseconds is used, which is 60 seconds (1 minute).

The third step is to build the CuratorFramework client instance with a session timeout of 60 seconds, a connection timeout of 5 seconds, a retry policy of a single retry after a 1-second sleep, and the server address returned by url.getBackupAddress() (note: the value I got here is 120.77.217.245:9090, the configured zookeeper connection address):

CuratorFrameworkFactory.Builder builder = CuratorFrameworkFactory.builder()
          .connectString(url.getBackupAddress())
          .retryPolicy(new RetryNTimes(1, 1000))
          .connectionTimeoutMs(timeout)
          .sessionTimeoutMs(sessionExpireMs);
client = builder.build();
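
To make the retry setting concrete: new RetryNTimes(1, 1000) allows one retry, with a 1000 ms sleep before it. The sketch below is a simplified stand-in for that behavior and not Curator's implementation. The class SimpleRetryNTimes and its methods are hypothetical, and the actual sleeping is left out so the example runs instantly.

```java
// Simplified, hypothetical model of a "retry N times" policy; not Curator's RetryNTimes.
class SimpleRetryNTimes {
    private final int maxRetries;
    private final int sleepMsBetweenRetries;

    SimpleRetryNTimes(int maxRetries, int sleepMsBetweenRetries) {
        this.maxRetries = maxRetries;
        this.sleepMsBetweenRetries = sleepMsBetweenRetries;
    }

    // retryCount is the number of retries already performed (0 after the first failure).
    boolean allowRetry(int retryCount) {
        return retryCount < maxRetries;
    }

    // How long a caller should sleep before the next retry.
    int sleepMs() {
        return sleepMsBetweenRetries;
    }

    public static void main(String[] args) {
        SimpleRetryNTimes policy = new SimpleRetryNTimes(1, 1000);
        System.out.println(policy.allowRetry(0)); // first failure: one retry is still allowed
        System.out.println(policy.allowRetry(1)); // the single retry has been used up
        System.out.println(policy.sleepMs());     // sleep between attempts, in milliseconds
    }
}
```

With only one retry allowed, the retry policy does little to help a slow first connection; the connection timeout is what decides the outcome.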

The fourth step is to register a listener that monitors changes in the connection state:

client.getConnectionStateListenable().addListener(new CuratorConnectionStateListener(url));

The fifth step is to start the client:

client.start();

The last step is to wait for the client to connect. If the connection succeeds, the client has been created successfully; otherwise, creation fails. So when zookeeper not connected appears, the problem is that the client failed to connect. Why it fails is explained by the client.blockUntilConnected(timeout, TimeUnit.MILLISECONDS) call.

boolean connected = client.blockUntilConnected(timeout, TimeUnit.MILLISECONDS);
if (!connected) {
    throw new IllegalStateException("zookeeper not connected");
}

Step into the source code of client.blockUntilConnected(timeout, TimeUnit.MILLISECONDS), where maxWaitTime is the timeout from earlier, 5 seconds by default. Let's walk through the code below:

public synchronized boolean blockUntilConnected(int maxWaitTime, TimeUnit units) throws InterruptedException
{
    // Record the start time
    long startTime = System.currentTimeMillis();
    // true here, because a non-null TimeUnit was passed in
    boolean hasMaxWait = (units != null);
    // With the default timeout, maxWaitTimeMs is 5000 milliseconds (5 seconds)
    long maxWaitTimeMs = hasMaxWait ? TimeUnit.MILLISECONDS.convert(maxWaitTime, units) : 0;

    while (!isConnected())
    {
        // hasMaxWait is true
        if (hasMaxWait)
        {
            // Time remaining out of the 5 seconds
            long waitTime = maxWaitTimeMs - (System.currentTimeMillis() - startTime);
            // Once the 5 seconds are used up, return the current isConnected() value
            if (waitTime <= 0)
            {
                return isConnected();
            }
            // Otherwise block for the remaining time. For example, with 3 seconds left,
            // Object.wait(long timeout) blocks the thread for up to 3 seconds, then the loop checks again
            wait(waitTime);
        }
        else
        {
            wait();
        }
    }
    return isConnected();
}

The core of this method is that it waits for at most maxWaitTime. After the client starts connecting, the while loop waits until either the connection is established or the specified timeout (5 seconds by default) runs out. Once the 5 seconds have passed, it returns the value of isConnected(), which reports whether the connection succeeded.

That leaves the last question: what does isConnected() do?
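
The effect of the timeout can be reproduced with a small self-contained model of the same wait loop. This is a sketch under stated assumptions, not Curator's code: ConnectionGate, markConnected and tryConnect are hypothetical names, and the times are scaled down (a "connection" that takes 300 ms, waited on for 100 ms or 2000 ms) so that it runs quickly.

```java
// Hypothetical model of a timed "block until connected" wait; not Curator's code.
class ConnectionGate {
    private boolean connected = false;

    // Called by the "connecting" thread once the connection is ready.
    synchronized void markConnected() {
        connected = true;
        notifyAll(); // wake any thread blocked in blockUntilConnected
    }

    // Waits until connected or until maxWaitMs has elapsed, whichever comes first.
    synchronized boolean blockUntilConnected(long maxWaitMs) throws InterruptedException {
        long start = System.currentTimeMillis();
        while (!connected) {
            long remaining = maxWaitMs - (System.currentTimeMillis() - start);
            if (remaining <= 0) {
                return connected; // deadline passed, report the current state
            }
            wait(remaining);
        }
        return true;
    }

    // Simulates a connection that needs connectDelayMs to come up and reports
    // whether a caller willing to wait timeoutMs sees it succeed.
    static boolean tryConnect(long connectDelayMs, long timeoutMs) {
        ConnectionGate gate = new ConnectionGate();
        Thread connector = new Thread(() -> {
            try {
                Thread.sleep(connectDelayMs);
                gate.markConnected();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        connector.setDaemon(true);
        connector.start();
        try {
            return gate.blockUntilConnected(timeoutMs);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(tryConnect(300, 100));  // timeout shorter than the connection time
        System.out.println(tryConnect(300, 2000)); // timeout longer than the connection time
    }
}
```

The first call gives up before the connection is ready, which corresponds to the zookeeper not connected case; the second call returns as soon as the connection comes up, without waiting out the full timeout.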

public synchronized boolean isConnected(){
     return (currentConnectionState != null) && currentConnectionState.isConnected();
}

This checks the client's connection state. After client.start() is called, a state is recorded; if the connection is established successfully, currentConnectionState.isConnected() returns true. It works much like an observer: it watches whether the connection succeeds within the specified connection timeout.

While debugging, I found that when the connection has not succeeded, currentConnectionState is null, so isConnected() returns false. After raising the connection timeout from the default 5 seconds to timeout: 20000 and waiting, the connection succeeded, and the value of currentConnectionState was RECONNECTED.

So the earlier zookeeper not connected exception occurred because the connection timeout was set too short!

currentConnectionState is an enumeration value, and each constant defines its own isConnected(). For RECONNECTED it returns true:

    CONNECTED {
        public boolean isConnected() {
            return true;
        }
    },
    SUSPENDED {
        public boolean isConnected() {
            return false;
        }
    },
    RECONNECTED {
        public boolean isConnected() {
            return true;
        }
    },
    LOST {
        public boolean isConnected() {
            return false;
        }
    },
    READ_ONLY {
        public boolean isConnected() {
            return true;
        }
    };

When it returns true, !connected is false, and the following exception is not thrown:

if (!connected) {
       throw new IllegalStateException("zookeeper not connected");
}
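
Putting the null check and the enum together, the decision can be sketched as follows. LinkState and check are hypothetical names chosen for illustration; this is not Curator's ConnectionState, only the same pattern of an enum whose constants each supply their own isConnected().

```java
// Hypothetical enum illustrating constant-specific method bodies.
enum LinkState {
    CONNECTED   { @Override boolean isConnected() { return true; } },
    SUSPENDED   { @Override boolean isConnected() { return false; } },
    RECONNECTED { @Override boolean isConnected() { return true; } },
    LOST        { @Override boolean isConnected() { return false; } };

    abstract boolean isConnected();

    // Null-safe check: no state recorded yet (null) counts as "not connected".
    static boolean check(LinkState current) {
        return current != null && current.isConnected();
    }

    public static void main(String[] args) {
        System.out.println(check(null));        // no state yet, so the exception would be thrown
        System.out.println(check(RECONNECTED)); // connected, so startup continues
    }
}
```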

From the analysis above, the zookeeper not connected exception during Dubbo startup occurs because no connection timeout is set in the configuration, so the default of 5 seconds is used. If the connection is not established within those 5 seconds, the client is treated as not connected and the exception is thrown. After increasing the timeout, the connection succeeded normally. It also shows that, this time, connecting from my local machine to the zookeeper cluster took more than five seconds.

edited at: 10-27-2021