  SnappyData / SNAP-1911

JVM crash while executing a select query in smart connector mode using the NW schema


    Details

    • Type: Bug
    • Status: Done
    • Priority: Highest
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: 1.0.1
    • Component/s: Query
    • Labels:

      Description

      The test starts 3 locators, 3 servers, 2 leads, 1 master and 3 workers (Spark cluster). This is a column table test, in which NW schema tables with the default data size are created and loaded using a snappy job. Queries are run and validated in smart connector mode. Queries are also run in the snappy shell using a SQL script.
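
      Below is a minimal sketch of how a query is issued against the cluster in smart connector mode. The host names, ports, master URL and application name are illustrative placeholders, not the values used by the test harness:

      import org.apache.spark.sql.{SnappySession, SparkSession}

      // Smart connector mode: a plain Spark cluster talks to the SnappyData store
      // through the locator configured via snappydata.connection.
      val spark = SparkSession.builder()
        .appName("NWSmartConnectorCheck")                       // placeholder app name
        .master("spark://spark-master:7077")                    // placeholder Spark master
        .config("snappydata.connection", "locator-host:1527")   // placeholder locator (default client port)
        .getOrCreate()
      val snappy = new SnappySession(spark.sparkContext)

      // Sanity check that a loaded NW column table (e.g. APP.EMPLOYEES from the plan below) is visible.
      println(snappy.table("APP.EMPLOYEES").count())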

      Execution of Q6 from the NW schema failed in smart connector mode with the exception below (from the Spark driver log):

      Query Q6
      == Parsed Logical Plan ==
      'Sort [1 ASC NULLS FIRST, 3 ASC NULLS FIRST], true
      +- 'Project [unresolvedalias('TITLE, None), unresolvedalias('FIRSTNAME, None), unresolvedalias('LASTNAME, None)]
         +- 'UnresolvedRelation `EMPLOYEES`
      
      == Analyzed Logical Plan ==
      TITLE: string, FIRSTNAME: string, LASTNAME: string
      Sort [TITLE#535 ASC NULLS FIRST, LASTNAME#533 ASC NULLS FIRST], true
      +- Project [TITLE#535, FIRSTNAME#534, LASTNAME#533]
         +- SubqueryAlias EMPLOYEES
            +- Relation[EMPLOYEEID#532,LASTNAME#533,FIRSTNAME#534,TITLE#535,TITLEOFCOURTESY#536,BIRTHDATE#537,HIREDATE#538,ADDRESS#539,CITY#540,REGION#541,POSTALCODE#542,COUNTRY#543,HOMEPHONE#544,EXTENSION#545,PHOTO#546,NOTES#547,REPORTSTO#548,PHOTOPATH#549] ColumnFormatRelation[APP.EMPLOYEES]
      
      == Optimized Logical Plan ==
      Sort [TITLE#535 ASC NULLS FIRST, LASTNAME#533 ASC NULLS FIRST], true
      +- Project [TITLE#535, FIRSTNAME#534, LASTNAME#533]
         +- Relation[EMPLOYEEID#532,LASTNAME#533,FIRSTNAME#534,TITLE#535,TITLEOFCOURTESY#536,BIRTHDATE#537,HIREDATE#538,ADDRESS#539,CITY#540,REGION#541,POSTALCODE#542,COUNTRY#543,HOMEPHONE#544,EXTENSION#545,PHOTO#546,NOTES#547,REPORTSTO#548,PHOTOPATH#549] ColumnFormatRelation[APP.EMPLOYEES]
      
      == Physical Plan ==
      *CachedPlanHelper
      +- *Sort [TITLE#535 ASC NULLS FIRST, LASTNAME#533 ASC NULLS FIRST], true, 0
         +- Exchange rangepartitioning(TITLE#535 ASC NULLS FIRST, LASTNAME#533 ASC NULLS FIRST, 200)
            +- *CachedPlanHelper
               +- *Partitioned Scan ColumnFormatRelation[APP.EMPLOYEES] , Requested Columns = [TITLE#535,FIRSTNAME#534,LASTNAME#533] partitionColumns = [City#540,Country#543] numBuckets= 113 numPartitions= 113
      Job started with 1 stages: SparkListenerJobStart(20,1502287422760,WrappedArray(org.apache.spark.scheduler.StageInfo@755a20b7),{spark.sql.execution.id=21, spark.scheduler.pool=default, spark.rdd.scope={"id":"260","name":"Exchange"}, spark.rdd.scope.noOverride=true})
      Stage 23 completed with  113 tasks.
      Job completed with Result :  JobSucceeded
      Job started with 2 stages: SparkListenerJobStart(21,1502287423523,WrappedArray(org.apache.spark.scheduler.StageInfo@3238118b, org.apache.spark.scheduler.StageInfo@79d462b7),{spark.sql.execution.id=21, spark.scheduler.pool=default})
      17/08/09 07:04:35.931 PDT task-result-getter-3<tid=0x> WARN TaskSetManager: Lost task 1.0 in stage 24.0 (TID 1768, 10.80.141.108, executor 2): java.sql.SQLException: (SQLState=X0Z01 Severity=20000) (dev8.telx.snappydata.io[21967]) Node '[dev8.telx.snappydata.io[21967],connId=52] {caused by: Channel closed. [SOCKET END OF TRANSMISSION]}' went down or data no longer available while iterating the results (method 'executeStatement'). Please retry the operation.
              at io.snappydata.thrift.common.ThriftExceptionUtil.newSnappyException(ThriftExceptionUtil.java:164)
              at io.snappydata.thrift.internal.ClientService.newExceptionForNodeFailure(ClientService.java:745)
              at io.snappydata.thrift.internal.ClientService.newSnappyExceptionForNodeFailure(ClientService.java:787)
              at io.snappydata.thrift.internal.ClientService.handleException(ClientService.java:678)
              at io.snappydata.thrift.internal.ClientService.execute(ClientService.java:909)
              at io.snappydata.thrift.internal.ClientStatement.execute(ClientStatement.java:720)
              at io.snappydata.thrift.internal.ClientStatement.execute(ClientStatement.java:371)
              at io.snappydata.impl.SparkShellRDDHelper.executeQuery(SparkShellRDDHelper.scala:111)
              at org.apache.spark.sql.execution.columnar.impl.SmartConnectorColumnRDD.compute(JDBCSourceAsColumnarStore.scala:702)
              at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
              at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
              at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:113)
              at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
              at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
              at org.apache.spark.sql.execution.WholeStageCodegenRDD.compute(WholeStageCodegenExec.scala:506)
              at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
              at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
              at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
              at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
              at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
              at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:103)
              at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:58)
              at org.apache.spark.scheduler.Task.run(Task.scala:126)
              at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:324)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
              at java.lang.Thread.run(Thread.java:748)
      Caused by: org.apache.thrift.transport.TTransportException: Channel closed.
              at io.snappydata.thrift.common.SnappyTSocket.read(SnappyTSocket.java:373)
              at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
              at org.apache.thrift.protocol.TCompactProtocol.readByte(TCompactProtocol.java:634)
              at org.apache.thrift.protocol.TCompactProtocol.readMessageBegin(TCompactProtocol.java:501)
              at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77)
              at io.snappydata.thrift.SnappyDataService$Client.recv_execute(SnappyDataService.java:256)
              at io.snappydata.thrift.SnappyDataService$Client.execute(SnappyDataService.java:239)
              at io.snappydata.thrift.internal.ClientService.execute(ClientService.java:889)
              ... 22 more
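
      For reference, Q6 as reconstructed from the analyzed plan above (the sort ordinals 1 and 3 resolve to TITLE and LASTNAME) amounts to the following; the exact text in the NW query script may differ slightly:

      snappy.sql("SELECT TITLE, FIRSTNAME, LASTNAME FROM EMPLOYEES ORDER BY 1, 3").show()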
      

      The server at dev8.telx.snappydata.io[21967] (vm_7_) had actually crashed. Below are the contents of hs_err_pid2456.log:

      #
      # A fatal error has been detected by the Java Runtime Environment:
      #
      #  Internal Error (sharedRuntime.cpp:834), pid=2456, tid=0x00007fc60c989700
      #  fatal error: exception happened outside interpreter, nmethods and vtable stubs at pc 0x00007fca4106c7e2
      #
      # JRE version: Java(TM) SE Runtime Environment (8.0_131-b11) (build 1.8.0_131-b11)
      # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.131-b11 mixed mode linux-amd64 compressed oops)
      # Core dump written. Default location: /export/dev8a/users/sagarwal/snappyHydraLogs/northWind/2017-08-09-06-15-54/nwColumnTablesTest-0809-065119/vm_7_snappyStore3_dev8_32542/core or core.2456
      #
      # If you would like to submit a bug report, please visit:
      #   http://bugreport.java.com/bugreport/crash.jsp
      #
      
      ---------------  T H R E A D  ---------------
      
      Current thread (0x00007fc95052b000):  JavaThread "pool-3-thread-38" [_thread_in_Java, id=9732, stack(0x00007fc60c889000,0x00007fc60c98a000)]
      
      Stack: [0x00007fc60c889000,0x00007fc60c98a000],  sp=0x00007fc60c986d50,  free space=1015k
      Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
      V  [libjvm.so+0xac826a]  VMError::report_and_die()+0x2ba
      V  [libjvm.so+0x4fd089]  report_fatal(char const*, int, char const*)+0x59
      V  [libjvm.so+0x9c391a]  SharedRuntime::continuation_for_implicit_exception(JavaThread*, unsigned char*, SharedRuntime::ImplicitExceptionKind)+0x33a
      V  [libjvm.so+0x92bbfa]  JVM_handle_linux_signal+0x48a
      V  [libjvm.so+0x921e13]  signalHandler(int, siginfo*, void*)+0x43
      C  [libpthread.so.0+0xf100]
      

      Checked with Hemant and searched the internet, and found that it can be either a kernel issue or a JDK issue. Following are links to a few related articles:
      https://stackoverflow.com/questions/44625844/what-does-this-jvm-crash-log-mean
      https://blogs.oracle.com/poonam/analysis-of-strange-hotspot-crashes
      http://bugs.java.com/view_bug.do?bug_id=6730810

There is also a suggested workaround: using -XX:-UseOnStackReplacement (though it was suggested for Solaris). I could not find any clear documentation for it, so its implications need further investigation before we plan to use it.
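
If we do try it, one way to pass the flag to the store JVMs would presumably be through conf/servers, assuming the usual -J pass-through for JVM options (the host name below is just the affected server from this run):

    dev8.telx.snappydata.io -J-XX:-UseOnStackReplacement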

Assigning it to myself for now. If this is observed in regressions again, I will update the ticket and it will need to be reassigned.


              People

               • Assignee: sagarwal Sonal Agarwal
               • Reporter: sagarwal Sonal Agarwal
               • Votes: 0
               • Watchers: 3
