runtime: defer/recover support in race detector #26813

Open

adilbaig opened this issue Aug 5, 2018 · 33 comments
Labels: NeedsInvestigation, RaceDetector
Milestone: Unplanned

Comments

@adilbaig commented Aug 5, 2018

Hi Go Devs,

I've noticed a strange memory leak when running my Go program with the race detector. The program is an ad-hit tracker: it receives data from RabbitMQ and writes it to a Postgres database. When the DB calls return errors, the program leaks memory (RES memory usage grows unchecked until the machine runs out of RAM).

Run the program (below) with and without -race to see the difference.

What version of Go are you using (go version)?

go version go1.10.3 linux/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOBIN=""
GOCACHE="/home/adil/.cache/go-build"
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/adil/go"
GORACE=""
GOROOT="/snap/go/2130"
GOTMPDIR=""
GOTOOLDIR="/snap/go/2130/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build049513405=/tmp/go-build -gno-record-gcc-switches"

What did you do?

Run this program without the race detector:

> go build -o race experimental/rabbitmqmemoryleak.go && ./race

Observe the memory usage in top; it should flatline. Then build and run it with -race:

> go build -race -o race experimental/rabbitmqmemoryleak.go && ./race

The memory usage grows unchecked.

package main

import (
	"database/sql"
	"log"

	"github.com/go-errors/errors"
	_ "github.com/lib/pq"
	"github.com/streadway/amqp"
)

// This program demonstrates a memory leak in Go's built-in race detector.
// You will need RabbitMQ and Postgres installed on your machine.

// First, run the program like this:
// > 	go build -o race experimental/rabbitmqmemoryleak.go && ./race
// Observe the 'RES' memory in `top`. It should level after a bit
//
// Then, run the program like this:
// > 	go build -race -o race experimental/rabbitmqmemoryleak.go && ./race
// Watch as RAM usage grows unchecked

// Strangely running `go run -race experimental/rabbitmqmemoryleak.go` does not leak memory.

func main() {

	var err error

	conn, err := amqp.Dial("amqp://guest:guest@localhost:5672")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	channel, err := conn.Channel()
	if err != nil {
		log.Fatal(err)
	}
	defer channel.Close()

	queue := "testing-memory-leak"

	// 1. Create the queue
	var args amqp.Table
	_, err = channel.QueueDeclare(queue, false, true, false, false, args)
	defer channel.QueueDelete(queue, false, false, true)

	// 2. Fill in messages
	cnt := 100000
	log.Printf("Inserting %d messages", cnt)
	for index := 0; index < cnt; index++ {
		msg := `{"date":"2018-07-17 08:44:30.202358+02","campaignID":18275,"affiliateSiteID":70002,"materialID":969206,"referenceID":0,"referenceHash":null,"referenceName":null,"isUnique":true,"merchantCommission":"0","affiliateCommission":"0","IP":"37.74.164.6","postImpressionHash":"0ebdd4720ef7eba5f09a84b8c68367c6","universalFingerprintHash":"c47e486da2ff1a4695b1499052b09807","pushToDynamoDB":true,"timestamp":"1531809870","handlerType":"impression","host":"snipped","pid":7962,"userAgentID":null,"userAgentHash":"08ac6aeb466ccd8f05ad37a144afe0f8","userAgentName":"Mozilla\/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/58.0.3029.110 Safari\/537.36 Edge\/16.16299","requestHash":"29c2363878fee77174d0b1e2491ea93b"}`
		err := channel.Publish("", queue, true, false, amqp.Publishing{ContentType: "application/json", Body: []byte(msg), DeliveryMode: 2})
		if err != nil {
			log.Panic(err)
		}
	}
	log.Printf("Done")

	pgb, err := sql.Open("postgres", "host=localhost user=user password=pass dbname=db sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer pgb.Close()

	// 3. Listen for messages
	forever := make(chan bool)
	for i := 0; i < 10; i++ {
		go handler(queue, pgb)
	}
	<-forever
}

func handler(queue string, pgb *sql.DB) {
	conn, err := amqp.Dial("amqp://guest:guest@localhost:5672")
	if err != nil {
		log.Fatal(err)
	}

	channel, err := conn.Channel()
	if err != nil {
		log.Fatal(err)
	}

	for {
		message, ok, err := channel.Get(queue, false)
		if err != nil {
			log.Fatal(err)
		}
		if !ok {
			continue
		}

		log.Printf("%+v", message)

		err = someErroneousSQLTransaction(pgb)
		if err != nil {
			log.Printf("ERROR: %s", err)
			message.Nack(false, true)
		} else {
			message.Ack(false)
		}
	}
}

func someErroneousSQLTransaction(pgb *sql.DB) error {
	refTx, err := pgb.Begin()
	if err != nil {
		return errors.Wrap(err, 0)
	}
	defer func() {
		if err != nil {
			refTx.Rollback()
		} else {
			refTx.Commit()
		}
	}()
	_, err = refTx.Exec(`set transaction isolation level repeatable read`)
	if err != nil {
		return err
	}

	var referenceID int64
	switch err = pgb.QueryRow(`SELECT "ID" FROM public.blah LIMIT 1`).Scan(&referenceID); err {
	case sql.ErrNoRows:
		break
	case nil:
		break
	default:
		return err
	}

	return nil
}

What did you expect to see?

Flat memory usage, as measured with top, when running with -race. I expect the race detector to use more memory to maintain its internal data structures, but usage should flatline at some point.

What did you see instead?

Constantly growing RES mem usage in top.

Strangely, running go run -race experimental/rabbitmqmemoryleak.go does not leak memory.
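For reference, a minimal sketch (not part of the original report) of one way to check where the memory is going in the reproducer above: log Go heap statistics alongside the observation in top. If RES keeps growing while HeapAlloc stays flat, the growth is happening outside the Go heap (for example, inside the race detector's own runtime). The placeholder main here stands in for the real workload.

package main

import (
	"log"
	"runtime"
	"time"
)

// logMemStats periodically logs Go heap statistics so they can be compared
// against the RES column in top.
func logMemStats() {
	var m runtime.MemStats
	for range time.Tick(5 * time.Second) {
		runtime.ReadMemStats(&m)
		log.Printf("HeapAlloc=%d MiB Sys=%d MiB NumGC=%d",
			m.HeapAlloc>>20, m.Sys>>20, m.NumGC)
	}
}

func main() {
	go logMemStats()
	select {} // placeholder: replace with the real workload from the reproducer above
}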

meirf added the NeedsInvestigation label Aug 5, 2018
ianlancetaylor changed the title from "Go's race detector is leaking memory on RabbitMQ and Postgres" to "runtime: Go's race detector is leaking memory on RabbitMQ and Postgres" Aug 5, 2018
ianlancetaylor added this to the Go1.12 milestone Aug 5, 2018
@ianlancetaylor (Contributor)

CC @dvyukov

@dvyukov (Member) commented Aug 6, 2018

The difference between go build and go run looks strange. Do we have any known precedent of such a difference?

@dvyukov (Member) commented Aug 6, 2018

@adilbaig can you provide step-by-step repro instructions?

@agnivade (Contributor) commented Aug 6, 2018

It's in the code comments, I believe:

// This program demonstrates a memory leak in Go's built-in race detector.
// You will need RabbitMQ and Postgres installed on your machine.

// First, run the program like this:
// > 	go build -o race experimental/rabbitmqmemoryleak.go && ./race
// Observe the 'RES' memory in `top`. It should level after a bit
//
// Then, run the program like this:
// > 	go build -race -o race experimental/rabbitmqmemoryleak.go && ./race
// Watch as RAM usage grows unchecked

Did you want anything specific?

@dvyukov (Member) commented Aug 6, 2018

Something that I can copy-paste into my console.

@adilbaig (Author) commented Aug 6, 2018

@dvyukov How about this:

# Have RabbitMQ and Postgres setup (tested with version 9.6), on Ubuntu
sudo apt install rabbitmq-server postgresql-9.6
go get "github.com/streadway/amqp"
go build -o race rabbitmqmemoryleak.go && ./race
# Open `top` in another shell and see that memory usage (RES) is flat for this process. CTRL+C ./race.
go build -race -o race rabbitmqmemoryleak.go && ./race
# Now, open `top` in another shell and see the memory leak

You don't actually need to create any Postgres tables. I see the leak whenever the SQL errors out (for example, due to bad login credentials).

@dvyukov (Member) commented Aug 6, 2018

I can reproduce it now.
The problem is that the program grows an infinite stack via infinite recursion:

...
  database/sql.(*DB).Begin()
      database/sql/sql.go:1637 +0x6d
  main.someErroneousSQLTransaction()
      /tmp/rabbitmqmemoryleak.go:107 +0x5b
  main.handler()
      /tmp/rabbitmqmemoryleak.go:96 +0x22b
  database/sql.(*DB).Begin()
      database/sql/sql.go:1637 +0x6d
  main.someErroneousSQLTransaction()
      /tmp/rabbitmqmemoryleak.go:107 +0x5b
  main.handler()
      /tmp/rabbitmqmemoryleak.go:96 +0x22b
  database/sql.(*DB).Begin()
      database/sql/sql.go:1637 +0x6d
  main.someErroneousSQLTransaction()
      /tmp/rabbitmqmemoryleak.go:107 +0x5b
  main.handler()
      /tmp/rabbitmqmemoryleak.go:96 +0x22b
  database/sql.(*DB).Begin()
      database/sql/sql.go:1637 +0x6d
  main.someErroneousSQLTransaction()
      /tmp/rabbitmqmemoryleak.go:107 +0x5b
  main.handler()
      /tmp/rabbitmqmemoryleak.go:96 +0x22b
  database/sql.(*DB).Begin()
      database/sql/sql.go:1637 +0x6d
  main.someErroneousSQLTransaction()
      /tmp/rabbitmqmemoryleak.go:107 +0x5b
  main.handler()
      /tmp/rabbitmqmemoryleak.go:96 +0x22b
  database/sql.(*DB).Begin()
      database/sql/sql.go:1637 +0x6d
  main.someErroneousSQLTransaction()
      /tmp/rabbitmqmemoryleak.go:107 +0x5b
  main.handler()
      /tmp/rabbitmqmemoryleak.go:96 +0x22b
  database/sql.(*DB).Begin()
      database/sql/sql.go:1637 +0x6d
  main.someErroneousSQLTransaction()
      /tmp/rabbitmqmemoryleak.go:107 +0x5b
  main.handler()
      /tmp/rabbitmqmemoryleak.go:96 +0x22b
  database/sql.(*DB).Begin()
      database/sql/sql.go:1637 +0x6d
...

So it will crash sooner or later without the race detector too; it also makes GC slow, crash messages unreadable, etc.
It's just that the race detector is more sensitive to thousands of unique stacks, because it memorizes lots of contextual information about the program as it runs. In this case it memorizes several variations of the stack per recursion frame, which gives quadratic memory consumption.

@adilbaig (Author) commented Aug 6, 2018

@dvyukov

So it will crash without race detector sooner or later too

I haven't seen that happen, as I ran it without the race detector over the weekend. Can you confirm that it's stable without -race?

I'm a novice at Go, but, given your explanation, am I right to think this will be an issue for any evented system that errors out on events?

@dvyukov (Member) commented Aug 6, 2018

Can you confirm that it's stable without -race?

No.

I'm a novice at Go, but, given your explanation, am I right to think this will be an issue for any evented system that errors out on events?

I don't know these libraries, but what first comes to mind is that there is a way to use them without infinite recursion. You have both an infinite loop in handler and a recursive call to it; I would expect that you need only one of these, not both. Have you looked at other usage examples of these libraries? Do they all use them in the same way?

@adilbaig (Author) commented Aug 7, 2018

I don't know about these libraries, but what first comes to my mind is that there is a way to use them without infinite recursion. You both have an infinite loop in handler and call it recursively. I would expect that you need to do only 1 of these things, but not both.

I guess it only looks recursive in the dump, but it isn't; it's just running in a loop, one message after another, erroring out on every message. So I'm guessing that with the race detector on, a reference to something in the sql package is still being bandied around. I tried replacing the sql error with a plain fmt.Errorf and memory usage remained flat, so it isn't just any error that causes the leak but something specific to the sql library. That's my guess.

	for {
		message, ok, err := channel.Get(queue, false)
		if err != nil {
			log.Fatal(err)
		}
		if !ok {
			continue
		}

		log.Printf("%+v", message)

		// err = someErroneousSQLTransaction(pgb)
		err = fmt.Errorf("FAKE ERROR!") // Didn't leak
		if err != nil {
			log.Printf("ERROR: %s", err)
			message.Nack(false, true)
		} else {
			message.Ack(false)
		}
	}
...

Have you looked at other usage examples of these libraries? Do all of them use it in the same way?

Yep, it's largely just fetching and processing a message in a loop. Here's a canonical example from the site: https://www.rabbitmq.com/tutorials/tutorial-one-go.html (Code is at https://github.com/rabbitmq/rabbitmq-tutorials/blob/master/go/receive.go)

@dvyukov (Member) commented Aug 8, 2018

I guess it only looks recursive in the dump, but it isn't, it's just running in a loop one message after another erroring out at every message.

No, it calls someErroneousSQLTransaction which calls sql.(*DB).Begin which calls handler again.

@agnivade (Contributor)

Maybe I am missing something, but I fail to see how a sql.(*DB).Begin() call can call into user-level code.

Also, @adilbaig, you are creating a transaction but not using the tx to make your queries; instead, you are using the main db handle itself for the QueryRow call. Any reason for that? If you want your queries to go through the transaction, you should make them on the tx handle.

Also, we already have options for transaction isolation levels: https://golang.org/pkg/database/sql/#IsolationLevel. You can set them directly while creating a tx object.

My suspicion is that creating multiple transactions from multiple goroutines is somehow causing this behavior in the race detector. Can you try pgb.BeginTx() with opts set to repeatable read, use the tx to make the query, and see if it makes any difference?

@adilbaig (Author) commented Aug 12, 2018

@agnivade That was a mistake. Thanks for pointing it out!

Also, we already have options to transaction types ..

Nice! I will use that in my code.

The change doesn't make a difference, though. I also think the bug resides somewhere in sql creating connections (not necessarily transactions) inside goroutines. Here's the updated code:

func someErroneousSQLTransaction(pgb *sql.DB) error {
	ctx := context.Background()
	refTx, err := pgb.BeginTx(ctx, &sql.TxOptions{Isolation: sql.LevelRepeatableRead})
	if err != nil {
		return err
	}
	defer func() {
		if err != nil {
			refTx.Rollback()
		} else {
			refTx.Commit()
		}
	}()

	var referenceID int64
	switch err = refTx.QueryRow(`SELECT "ID" FROM public.blah LIMIT 1`).Scan(&referenceID); err {
	case sql.ErrNoRows:
		break
	case nil:
		break
	default:
		return err
	}

	return nil
}

@agnivade (Contributor)

If that is the case, then I think we can take RabbitMQ out of the equation. Perhaps you can shorten the test to just spawning 10 goroutines which share a *sql.DB handle and perform the same query, then run that with the race detector and see what happens.

Also, we are still unclear why the behavior shows up only with go build -race and not with go run -race. @dvyukov, do you have any clues on that?
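A rough sketch of that stripped-down test (the DSN and erroneous query mirror the original reproducer and are placeholders; adjust for a local Postgres):

package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	// Placeholder DSN, same as the original reproducer; bad credentials are
	// enough, since the point is to error out on every query.
	db, err := sql.Open("postgres", "host=localhost user=user password=pass dbname=db sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// 10 goroutines sharing one *sql.DB, issuing the same failing query in a loop.
	for i := 0; i < 10; i++ {
		go func() {
			for {
				var id int64
				err := db.QueryRow(`SELECT "ID" FROM public.blah LIMIT 1`).Scan(&id)
				if err != nil && err != sql.ErrNoRows {
					log.Printf("ERROR: %s", err)
				}
			}
		}()
	}
	select {} // block forever; watch RES in top
}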

@adilbaig (Author)

@agnivade - Well no, it only leaks when you're processing RMQ messages. Just plain channels throwing sql errors doesn't leak. The combination of the two libraries is playing a part.

@dvyukov (Member) commented Aug 13, 2018

Right. It seems that the trace is indeed broken.
The first thing that comes to mind is broken panic/defer/recover support. On recover we should unwind the race stack accordingly, calling racefuncexit the necessary number of times, but I don't see this code at all. Did we ever do this? Does this program recover from panics? Defer should work out of the box, at least that was the intention, but I wonder if it somehow leads to unmatched racefuncenter/racefuncexit.
I am travelling for the next 3 weeks and most likely won't be able to look closer.

dvyukov changed the title from "runtime: Go's race detector is leaking memory on RabbitMQ and Postgres" to "runtime: defer/recover support in race detector" Sep 2, 2018
@dvyukov (Member) commented Sep 2, 2018

Here is a minimized reproducer:

package main

var x uint64

func main() {
	go func() {
		x++
	}()
	for i := 0; i < 10; i++ {
		baz()
	}
	x++
}

func baz() {
	defer func() {
		recover()
	}()
	foo()
}

func foo() {
	panic(0)
}

$ go version
go version devel +c9cc20bd3a Sat Sep 1 15:43:42 2018 +0000 linux/amd64

$ go run -race -gcflags=-l /tmp/test.go
==================
WARNING: DATA RACE
Read at 0x0000005253e8 by main goroutine:
  main.main()
      /tmp/test.go:12 +0x72
  main.main()
      /tmp/test.go:10 +0x54
  main.main()
      /tmp/test.go:10 +0x54
  main.main()
      /tmp/test.go:10 +0x54
  main.main()
      /tmp/test.go:10 +0x54
  main.main()
      /tmp/test.go:10 +0x54
  main.main()
      /tmp/test.go:10 +0x54
  main.main()
      /tmp/test.go:10 +0x54
  main.main()
      /tmp/test.go:10 +0x54
  main.main()
      /tmp/test.go:10 +0x54
  main.main()
      /tmp/test.go:10 +0x54

Previous write at 0x0000005253e8 by goroutine 5:
  main.main.func1()
      /tmp/test.go:7 +0x56

Goroutine 5 (finished) created at:
  main.main()
      /tmp/test.go:6 +0x46
==================

The repeated frames must not be present in the trace.
The root cause is that we don't unwind the tsan stack on panic, i.e. we don't call racefuncexit for each unwound frame. The bug is more visible with recover, but strictly speaking it affects plain defer too: consider a race that happens in a deferred func, or a main application loop that runs in a deferred function.
This is not a regression; panics were never properly supported.

We could either defer racefuncexit instead of calling it, or manually call racefuncexit the necessary number of times before calling deferred functions (can we figure out how many frames we skip?). In both cases we need to be careful with runtime functions, because the runtime is not instrumented, so we could end up calling racefuncexit too many times, or relying on racefuncexit being deferred when it actually is not.
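As a user-level analogy of the unmatched enter/exit bookkeeping described above (illustrative only; the real shadow stack lives inside the race runtime and its racefuncenter/racefuncexit calls, not in visible Go code):

package main

import "fmt"

// shadow mimics the per-goroutine shadow stack kept by the race runtime:
// the compiler emits racefuncenter at function entry and racefuncexit on a
// normal return, but a panic unwinds frames without the matching exits.
var shadow []string

func enter(name string) { shadow = append(shadow, name) }
func exit()             { shadow = shadow[:len(shadow)-1] }

func foo() {
	enter("foo")
	panic(0)
	// exit() would run here on a normal return, but the panic skips it.
}

func baz() {
	enter("baz")
	defer func() { recover() }()
	foo()
	exit() // never reached: after the recover, baz returns without resuming here
}

func main() {
	for i := 0; i < 10; i++ {
		baz()
	}
	// 20 entries remain ("baz" and "foo" per iteration): nothing popped them,
	// mirroring the repeated frames in the race report above.
	fmt.Println(len(shadow), shadow)
}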

dvyukov removed the NeedsInvestigation label Sep 2, 2018
andybons modified the milestones: Go1.12 → Go1.13 Feb 12, 2019
@odeke-em (Member) commented Jun 5, 2019

Thank you for filing this issue @adilbaig and thanks @agnivade and @dvyukov for chasing down the bug and for the minimal repro!

@dvyukov do you think we should punt this feature request to Go1.14, given how late it is in the Go1.13 cycle?

@dvyukov (Member) commented Jun 5, 2019

I think this is unstaffed, so we can move this right to Unplanned.

@odeke-em (Member) commented Jun 5, 2019

Roger that @dvyukov! I might be interested in this, so I'll page you sometime for CL reviews on it. I'll move it over.

odeke-em self-assigned this Jun 5, 2019
odeke-em removed this from the Go1.13 milestone Jun 5, 2019
odeke-em added this to the Unplanned milestone Jun 5, 2019
@kevinburke1

Hi @dvyukov @odeke-em, can you describe the current behavior of these types of programs in a bit more detail?

I have a program which uses lib/pq to connect to a Postgres database. lib/pq uses panics extensively, as a way of short-circuiting the call stack. I was using it with the race detector on. I also observe large and growing memory usage when I keep the program running, on both Mac and Linux, with Go 1.13 and tip.

Is it possible that panics are somehow not being garbage collected properly when the race detector is on?

@dvyukov (Member) commented Feb 15, 2020

I would expect the memory leak to be minimal (8 bytes per frame). That memory is tied to the goroutine, so if the goroutine exits, it should be released.

@kevinburke1

That actually fits the profile perfectly. I had a long-running program that was reading rows out of a database in a for loop, then sending them to a downstream server, all in one blocking thread. The database library (github.com/lib/pq) uses panic and recover while sending and reading data, and I was using defer cancel() with context.WithTimeout throughout the program.

It also explains why I didn't see this on HTTP servers that are also long-running: those also use defer and recover(), but inside the goroutine spawned for each incoming connection, so the recovered frames get cleaned up when that goroutine exits.

See #37233 for more information.
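A sketch of that pattern, based on dvyukov's note above that the leaked state is tied to the goroutine and released when it exits: run each recover-prone unit of work in its own short-lived goroutine instead of one long-lived loop. processOne here is a hypothetical stand-in for the database work that panics and recovers internally.

package main

import "log"

// processOne is a hypothetical stand-in for one unit of work that panics and
// recovers internally (as lib/pq does on driver errors).
func processOne() {
	defer func() {
		if r := recover(); r != nil {
			log.Printf("recovered: %v", r)
		}
	}()
	panic("driver error")
}

func main() {
	for i := 0; i < 1000; i++ {
		done := make(chan struct{})
		// Run the recover-prone work in a fresh goroutine. Whatever
		// per-goroutine state the race runtime accumulated for the skipped
		// racefuncexit calls is dropped when this goroutine exits.
		go func() {
			defer close(done)
			processOne()
		}()
		<-done
	}
}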

@gopherbot

Change https://golang.org/cl/220586 mentions this issue: doc/articles: mention memory leak in race detector docs

gopherbot pushed a commit that referenced this issue Feb 25, 2020
As far as I can tell, there is no public documentation on this topic,
which cost me several days of debugging.

I am possibly unusual in that I run binaries in production with the
race detector turned on, but I think that others who do the same may
want to be aware of the risk.

Updates #26813.
Updates #37233.

Change-Id: I1f8111bd01d0000596e6057b7cb5ed017d5dc655
Reviewed-on: https://go-review.googlesource.com/c/go/+/220586
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
@gopherbot

Change https://golang.org/cl/221019 mentions this issue: [release-branch.go1.14] doc/articles/race_detector: mention memory leak potential

gopherbot pushed a commit that referenced this issue Feb 25, 2020
[release-branch.go1.14] doc/articles/race_detector: mention memory leak potential

As far as I can tell, there is no public documentation on this topic,
which cost me several days of debugging.

I am possibly unusual in that I run binaries in production with the
race detector turned on, but I think that others who do the same may
want to be aware of the risk.

Updates #26813.
Updates #37233.

Change-Id: I1f8111bd01d0000596e6057b7cb5ed017d5dc655
Reviewed-on: https://go-review.googlesource.com/c/go/+/220586
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
(cherry picked from commit ba093c4)
Reviewed-on: https://go-review.googlesource.com/c/go/+/221019
Run-TryBot: Dmitri Shuralyov <dmitshur@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
dmitshur added the NeedsInvestigation label Feb 26, 2020
@edaniels (Contributor)

We just realized we have been hitting this for many months in our dev and QA environments, where we run a server compiled with the race detector enabled. Ultimately, disabling it kept resident memory stable instead of leaking to the point of being OOM-killed over the course of the day. However, we were not seeing 8 bytes of leakage at a time, but many MB. When I core-dumped the process and ran strings on it, I saw many duplicates of a record we pull from the database, even though we cache that value in a Go map internally.

I eventually ran strace -k, filtering for mmap calls, since malloc can call mmap if the requested size is over a threshold. Doing this showed a lot of calls into tsan/sanitizer library shadow methods. I'm wondering if the shadow state being maintained ends up making wholesale copies of certain values or regions of memory?

Either way, the detector is too useful and we will keep it on, but getting it to not leak memory would excise a lot of noise from our development and testing processes.

@dvyukov (Member) commented Jul 15, 2020

I’m wondering if the shadow variables maintained end up making wholesale copies of certain values or regions of memory?

No, the additional regions allocated by the race runtime should only hold internal metadata and never user data, or copies of user data.

@edaniels (Contributor) commented Jul 15, 2020

I see. So around 7/15 02:00 is when we turned off the detector, and as soon as we turned it back on today we started leaking again.

[graph: resident memory over time]

We've had that type of trend for at least 3 months now, and not running it for a 12-hour period showed no abnormal memory growth. So maybe something else is going on that is causing a ton of 8-byte leaks?

[screenshot: memory usage, 2020-07-15]

Keep in mind that this is mostly HTTP traffic, with load that stays consistent day over day.

@anacrolix (Contributor)

I've had users report this when using https://github.com/anacrolix/stm with the race detector enabled. Per other comments above, there is a lot of defer/recover, and very long call stacks that don't look correct.

@gopherbot

Change https://go.dev/cl/458218 mentions this issue: runtime: implement traceback iterator

@aclements (Member)

Once this is fixed, we should unwind the documentation change in CL 221019.

gopherbot pushed a commit that referenced this issue Mar 10, 2023
Currently, all stack walking logic is in one venerable, large, and
very, very complicated function: runtime.gentraceback. This function
has three distinct operating modes: printing, populating a PC buffer,
or invoking a callback. And it has three different modes of unwinding:
physical Go frames, inlined Go frames, and cgo frames. It also has
several flags. All of this logic is very interwoven.

This CL reimplements the monolithic gentraceback function as an
"unwinder" type with an iterator API. It moves all of the logic for
stack walking into this new type, and gentraceback is now a
much-simplified wrapper around the new unwinder type that still
implements printing, populating a PC buffer, and invoking a callback.
Follow-up CLs will replace uses of gentraceback with direct uses of
unwinder.

Exposing traceback functionality as an iterator API will enable a lot
of follow-up work such as simplifying the open-coded defer
implementation (which should in turn help with #26813 and #37233),
printing the bottom of deep stacks (#7181), and eliminating the small
limit on CPU stacks in profiles (#56029).

Fixes #54466.

Change-Id: I36e046dc423c9429c4f286d47162af61aff49a0d
Reviewed-on: https://go-review.googlesource.com/c/go/+/458218
Reviewed-by: Michael Pratt <mpratt@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Austin Clements <austin@google.com>
@mdempsky (Member) commented Aug 8, 2023

FYI, @dvyukov's test case no longer reproduces for me.

I also tried forcing baz to use stack-allocated defers (by adding a named result parameter) and heap-allocated defers (by adding a for { ...; break } wrapper around the defer statement).

I also tried adding a runtime.Gosched() call after the go func() { x++ }() statement, to ensure the "Read at" and "Previous write at" match the same goroutines as in @dvyukov's output.

@anacrolix (Contributor)

Could something have changed in recent versions of Go (I presume 1.21, @mdempsky?) that might have fixed this issue?
